Chapter 3 R syntax and data structures
3.1 Data Types
Variables can contain values of specific types within R. The data types that R uses include:
"numeric" for any numerical value, including whole numbers and decimals.
"character" for text values, denoted by quotes ("") around the value.
"integer" for whole numbers (e.g., 2L; the L indicates to R that it’s an integer). It behaves similarly to the numeric data type for most tasks or functions.
"logical" for TRUE and FALSE, in all capital letters (the Boolean data type). The logical data type can also be specified using T for TRUE and F for FALSE. However, T and F are not recommended, as they are ordinary variables that can be reassigned and confused with other objects.
The table below provides examples of each of the commonly used data types:
Data Type | Examples |
---|---|
Numeric: | 1, 1.5, 20, pi |
Character: | "anytext", "5", "TRUE" |
Integer: | 2L, 500L, -17L |
Logical: | TRUE, FALSE, T, F |
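As a quick check of the table above, class() reports the data type of a value. This is a minimal sketch in base R; the variable names are illustrative and do not come from the lesson.

```r
# Illustrative values (hypothetical names chosen for this sketch)
num_value  <- 1.5
char_value <- "anytext"
int_value  <- 2L
lgl_value  <- TRUE

class(num_value)   # "numeric"
class(char_value)  # "character"
class(int_value)   # "integer"
class(lgl_value)   # "logical"

# Anything in quotes is character, even if it looks like a number or a logical
class("5")         # "character"
class("TRUE")      # "character"
```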
3.2 Data Structures
So far we have seen variables with a single value. Variables can store more than a single value; they can hold a variety of data structures. These include, but are not limited to, vectors (c), factors (factor), matrices (matrix) and data frames (data.frame).
3.2.1 Vectors
A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It’s basically just a collection of values, mainly numbers, characters, or logical values.
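For illustration, one vector of each kind might look like the following. The values and variable names are made up for this sketch.

```r
# A vector of numbers
num_vec <- c(2, 4, 6)

# A vector of characters
chr_vec <- c("a", "b", "c")

# A vector of logical values
lgl_vec <- c(TRUE, FALSE, TRUE)

class(num_vec)  # "numeric"
class(chr_vec)  # "character"
class(lgl_vec)  # "logical"
```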
Note that all values in a vector must be of the same data type. If you try to create a vector with more than a single data type, R will try to coerce it into a single data type.
For example, if you try to create a vector that mixes numbers and text, R will coerce the numbers into characters so that every element has the same type.
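A minimal sketch of coercion, using a hypothetical vector named mixed (not necessarily the example from the original lesson):

```r
# Mixing a number, text, and a logical value in one vector
mixed <- c(1, "two", TRUE)

mixed         # "1" "two" "TRUE"
class(mixed)  # "character"

# With no text present, logical values are coerced to numbers instead
c(1, TRUE, FALSE)  # 1 1 0
```

The quotes around every element in the printed output are the sign that R has converted everything to character.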
The values in a vector are called elements.
Each element contains a single value, and there is no limit to how many elements you can have. A vector is assigned to a single variable, because regardless of how many elements it contains, in the end it is still a vector.
Let’s create a vector of genome lengths and assign it to a variable called glengths.
Each element of this vector contains a single numeric value, and the three values are combined into a vector using c() (the combine function). All of the values are put within the parentheses and separated by commas.
# Create a numeric vector and store the vector as a variable called 'glengths'
glengths <- c(4.6, 3000, 50000)
glengths
Note that your Environment shows the glengths variable is numeric (num) and that the vector starts at element 1 and ends at element 3 (i.e. your vector contains 3 values), as denoted by [1:3].
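The same information can be retrieved from the console. The sketch below re-creates glengths so that it runs on its own, and uses only base R functions.

```r
glengths <- c(4.6, 3000, 50000)

class(glengths)   # "numeric"
length(glengths)  # 3

# str() prints a compact summary similar to the Environment entry
str(glengths)
```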
A vector can also contain characters. Create another vector called species with three elements, where each element names the species whose genome size (in Mb) sits at the same position in the glengths vector.
# Create a character vector and store the vector as a variable called 'species'
species <- c("ecoli", "human", "corn")
species
Exercise
Try to create a vector of numeric and character values by combining the two vectors that we just created (glengths and species). Assign this combined vector to a new variable called combined. Hint: you will need to use the combine c() function to do this.
Print the combined vector in the console. What looks different compared to the original vectors? What do you notice about the output of the combined vector?
3.2.2 Factors
A factor is a special type of vector that is used to store categorical data. Each unique category is referred to as a factor level (i.e. category = level).
For instance, if we have four animals and the first animal is female, the second and third are male, and the fourth is female, we could create a factor that appears like a vector, but has integer values stored under-the-hood. The integer value assigned is a one for females and a two for males. The numbers are assigned in alphabetical order, so because the f- in females comes before the m- in males in the alphabet, females get assigned a one and males a two. In later lessons we will show you how you could change these assignments.
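The four-animal example can be sketched as follows. The variable name sex is hypothetical; levels() and as.integer() show the categories and the integers stored under the hood.

```r
# Hypothetical factor for the four animals described above
sex <- factor(c("female", "male", "male", "female"))

levels(sex)      # "female" "male"
as.integer(sex)  # 1 2 2 1
```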
Let’s create a factor vector and explore a bit more. We’ll start by creating a character vector describing three different levels of expression. Perhaps the first value represents expression in mouse1, the second value represents expression in mouse2, and so on and so forth:
# Create a character vector and store the vector as a variable called 'expression'
expression <- c("low", "high", "medium", "high", "low", "medium", "high")
Now we can convert this character vector into a factor using the factor() function:
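A minimal sketch of the conversion, assuming the factor is stored back into the same variable (the lesson's original command is not shown here, so that choice is an assumption):

```r
expression <- c("low", "high", "medium", "high", "low", "medium", "high")

# Assumption: overwrite the character vector with its factor version
expression <- factor(expression)
expression

# Levels are assigned alphabetically, and each value is stored as an integer
levels(expression)      # "high" "low" "medium"
as.integer(expression)  # 2 1 3 1 2 3 1
```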
So, what exactly happened when we applied the factor() function?
The expression vector is categorical, in that all the values in the vector belong to a set of categories; in this case, the categories are low, medium, and high.
So now that we have an idea of what factors are, when would you ever want to use them?
Factors are extremely valuable for many operations often performed in R and are necessary for many statistical methods, as you’ll see. As an example, if you want to color your plots by treatment type, then you would need the treatment variable to be a factor.
Exercises
Let’s say that in our experimental analyses, we are working with three different sets of cells: normal cells, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each cell type.
Create a vector named samplegroup with nine elements: 3 control ("CTL") values, 3 knock-out ("KO") values, and 3 over-expressing ("OE") values.
Turn samplegroup into a factor data structure.
3.2.3 Matrix
A matrix in R is a collection of vectors of the same length and identical data type. Vectors can be combined as columns or as rows to create a 2-dimensional structure. Matrices are usually of numeric data type.
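As a rough sketch, two numeric vectors can be bound as columns with cbind() or as rows with rbind(), and matrix() builds a matrix directly from a single vector. The vector names a and b are made up for this example.

```r
# Two illustrative numeric vectors of the same length
a <- c(1, 2, 3)
b <- c(4, 5, 6)

by_col <- cbind(a, b)  # each vector becomes a column
by_row <- rbind(a, b)  # each vector becomes a row

dim(by_col)  # 3 2
dim(by_row)  # 2 3

# matrix() fills column by column by default
m <- matrix(1:6, nrow = 2)
m[1, ]       # 1 3 5
```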
3.2.4 Data Frame
A data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. A data.frame is similar to a matrix in that it’s a collection of vectors of the same length and each vector represents a column. However, in a data frame each vector can be of a different data type (e.g., characters, integers, factors). In the data frame pictured below, the first column is character, the second column is numeric, the third is character, and the fourth is logical.
A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.
We can create a data frame by bringing vectors together to form the columns. We do this using the data.frame() function, giving it the different vectors we would like to bind together. The vectors should all be of the same length; data.frame() raises an error if their lengths cannot be matched.
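A minimal sketch, assuming the data frame is built from the species and glengths vectors created earlier (re-created here so the block runs on its own):

```r
species  <- c("ecoli", "human", "corn")
glengths <- c(4.6, 3000, 50000)

# Each vector becomes one column, named after the variable
df <- data.frame(species, glengths)
df
```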
We can see that a new variable called df has been created in our Environment within a new section called Data. In the Environment, it specifies that df has 3 observations of 2 variables. What does that mean? In R, rows always come first, so it means that df has 3 rows and 2 columns.
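The rows-first convention can be checked from the console. The sketch below rebuilds df from the same two vectors (an assumption about how df was created) and uses base R functions only.

```r
df <- data.frame(
  species  = c("ecoli", "human", "corn"),
  glengths = c(4.6, 3000, 50000)
)

dim(df)   # 3 2  (rows first, then columns)
nrow(df)  # 3
ncol(df)  # 2
```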
Exercise
Create a data frame called favorite_books with the following vectors as columns: