Chapter 3 R syntax and data structures

3.1 Data Types

Variables can contain values of specific types within R. The data types that R uses include:

  • "numeric" for any numerical value, including whole numbers and decimals.

  • "character" for text values, denoted by using quotes (““) around value.

  • "integer" for whole numbers (e.g., 2L, the L indicates to R that it’s an integer). It behaves similar to the numeric data type for most tasks or functions.

  • "logical" datatypes are TRUE and FALSE in all capital letters (the Boolean data type). The logical data type can also be specified using T for TRUE in all capital letters, and FforFALSE. T and F are not recommended for use in R, as they can be confused with other functions or variables.

The table below provides examples of each of the commonly used data types:

Data Type Examples
Numeric: 1, 1.5, 20, pi
Character: “anytext”, “5”, “TRUE”
Integer: 2L, 500L, -17L
Logical: TRUE, FALSE, T, F

3.2 Data Structures

So far we have seen variables with a single value. Variables can store more than just a single value, they can store a multitude of different data structures. These include, but are not limited to, vectors (c), factors (factor), matrices (matrix) and data frames (data.frame).

3.2.1 Vectors

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It’s basically just a collection of values, mainly either numbers:

numeric vector

or characters:

character vector

or logical values:

logical vector

Note that all values in a vector must be of the same data type. If you try to create a vector with more than a single data type, R will try to coerce it into a single data type.

For example, if you were to try to create the following vector:

R will coerce it into:

Alt text

The values in a vector are called elements.

Each element contains a single value, and there is no limit to how many elements you can have. A vector is assigned to a single variable, because regardless of how many elements it contains, in the end it is still a vector.

Let’s create a vector of genome lengths and assign it to a variable called glengths.

Each element of this vector contains a single numeric value, and three values will be combined together into a vector using c() (the combine function). All of the values are put within the parentheses and separated with a comma.

# Create a numeric vector and store the vector as a variable called 'glengths'
glengths <- c(4.6, 3000, 50000)
glengths

Note your environment shows the glengths variable is numeric (num) and tells you the glengths vector starts at element 1 and ends at element 3 (i.e. your vector contains 3 values) as denoted by the [1:3].

A vector can also contain characters. Create another vector called species with three elements, where each element corresponds with the genome sizes vector (in Mb).

# Create a character vector and store the vector as a variable called 'species'
species <- c("ecoli", "human", "corn")
species

Exercise

Try to create a vector of numeric and character values by combining the two vectors that we just created (glengths and species). Assign this combined vector to a new variable called combined. Hint: you will need to use the combine c() function to do this. Print the combined vector in the console, what looks different compared to the original vectors? What do you think notice about the output of the combined vector?


3.2.2 Factors

A factor is a special type of vector that is used to store categorical data. Each unique category is referred to as a factor level (i.e. category = level).

For instance, if we have four animals and the first animal is female, the second and third are male, and the fourth is female, we could create a factor that appears like a vector, but has integer values stored under-the-hood. The integer value assigned is a one for females and a two for males. The numbers are assigned in alphabetical order, so because the f- in females comes before the m- in males in the alphabet, females get assigned a one and males a two. In later lessons we will show you how you could change these assignments.

factors

Let’s create a factor vector and explore a bit more. We’ll start by creating a character vector describing three different levels of expression. Perhaps the first value represents expression in mouse1, the second value represents expression in mouse2, and so on and so forth:

# Create a character vector and store the vector as a variable called 'expression'
expression <- c("low", "high", "medium", "high", "low", "medium", "high")

Now we can convert this character vector into a factor using the factor() function:

# Turn 'expression' vector into a factor
expression <- factor(expression)

So, what exactly happened when we applied the factor() function?

factor_new

The expression vector is categorical, in that all the values in the vector belong to a set of categories; in this case, the categories are low, medium, and high.

So now that we have an idea of what factors are, when would you ever want to use them?

Factors are extremely valuable for many operations often performed in R and are necessary for many statistical methods, as you’ll see. As an example, if you want to color your plots by treatment type, then you would need the treatment variable to be a factor.


Exercises

Let’s say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each celltype.

  1. Create a vector named samplegroup with nine elements: 3 control (“CTL”) values, 3 knock-out (“KO”) values, and 3 over-expressing (“OE”) values.

  2. Turn samplegroup into a factor data structure.


3.2.3 Matrix

A matrix in R is a collection of vectors of same length and identical datatype. Vectors can be combined as columns in the matrix or by row, to create a 2-dimensional structure and are usually of numeric datatype.

matrix

3.2.4 Data Frame

A data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. A data.frame is similar to a matrix in that it’s a collection of vectors of the same length and each vector represents a column. However, in a dataframe each vector can be of a different data type (e.g., characters, integers, factors). In the data frame pictured below, the first column is character, the second column is numeric, the third is character, and the fourth is logical.

dataframe

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.

We can create a dataframe by bringing vectors together to form the columns. We do this using the data.frame() function, and giving the function the different vectors we would like to bind together. This function will only work for vectors of the same length.

# Create a data frame and store it as a variable called 'df'
df <- data.frame(species, glengths)

We can see that a new variable called df has been created in our Environment within a new section called Data. In the Environment, it specifies that df has 3 observations of 2 variables. What does that mean? In R, rows always come first, so it means that df has 3 rows and 2 columns.


Exercise

Create a data frame called favorite_books with the following vectors as columns:

titles <- c("Catch-22", "Pride and Prejudice", "Nineteen Eighty Four")
pages <- c(453, 432, 328)