# Chapter 3 R syntax and data structures

## 3.1 Data Types

Variables can contain values of specific types within R. R has six basic **data types**; the four most commonly used are:

- `"numeric"` for any numerical value, including whole numbers and decimals.
- `"character"` for text values, denoted by quotes (`""`) around the value.
- `"integer"` for whole numbers (e.g., `2L`; the `L` tells R that the value is an integer). It behaves similarly to the `numeric` data type for most tasks or functions.
- `"logical"` for `TRUE` and `FALSE`, written in all capital letters (the Boolean data type). The `logical` data type can also be specified using `T` for `TRUE` and `F` for `FALSE`, also in capital letters.

The table below provides examples of each of the commonly used data types:

Data Type | Examples |
---|---|
Numeric: | 1, 1.5, 20, pi |
Character: | "anytext", "5", "TRUE" |
Integer: | 2L, 500L, -17L |
Logical: | TRUE, FALSE, T, F |
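As a quick check, the `class()` function reports the data type of a value. The sketch below applies it to values taken from the table; the variable names are illustrative only and are not used elsewhere in this chapter.

```r
# Illustrative variable names (assumptions, not part of the lesson's data)
num_value <- 1.5
chr_value <- "5"    # the quotes make this text, not a number
int_value <- 2L     # the L suffix requests an integer
lgl_value <- TRUE

class(num_value)   # "numeric"
class(chr_value)   # "character"
class(int_value)   # "integer"
class(lgl_value)   # "logical"
```

Note that `"5"` and `"TRUE"` are character values because of the quotes, even though they look like a number and a logical value.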

## 3.2 Data Structures

So far we have seen variables with a single value. **Variables can store more than a single value; they can hold a variety of data structures.** These include, but are not limited to, vectors (`c`), factors (`factor`), matrices (`matrix`), and data frames (`data.frame`).

### 3.2.1 Vectors

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It is a collection of values, usually numbers, characters, or logical values.
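As a minimal sketch, each kind of vector could be created with the `c()` function described later in this section. The values and variable names here are made up for illustration.

```r
# Hypothetical example vectors, one per kind of value
numbers    <- c(1, 2.5, 10)
characters <- c("a", "b", "c")
logicals   <- c(TRUE, FALSE, TRUE)

class(numbers)      # "numeric"
class(characters)   # "character"
class(logicals)     # "logical"
```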

**Note that all values in a vector must be of the same data type.** If you try to create a vector with more than a single data type, R will try to coerce it into a single data type.

For example, if you try to create a vector that contains both numbers and character values, R will coerce the numbers into character values.
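A small sketch of this coercion, using made-up values and a hypothetical variable name `mixed`:

```r
# Mixing numbers and text: every element becomes a character value
mixed <- c(1, "two", 3)
mixed          # "1"   "two" "3"
class(mixed)   # "character"

# Mixing logical values and numbers: TRUE becomes 1
class(c(TRUE, 2))   # "numeric"
```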

The values in a vector are called *elements*.

Each **element** contains a single value, and there is no limit to how many elements you can have. A vector is assigned to a single variable, because regardless of how many elements it contains, in the end it is still a vector.

Let's create a vector of genome lengths and assign it to a variable called `glengths`.

Each element of this vector contains a single numeric value, and the three values are combined into a vector using `c()` (the combine function). All of the values are placed within the parentheses and separated by commas.

```
# Create a numeric vector and store the vector as a variable called 'glengths'
glengths <- c(4.6, 3000, 50000)
glengths
```

*Note that your environment shows the `glengths` variable is numeric (`num`) and that the vector starts at element 1 and ends at element 3 (i.e., it contains 3 values), as denoted by `[1:3]`.*
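The same information can also be checked from the console. A short sketch, repeating the `glengths` definition so that it runs on its own:

```r
glengths <- c(4.6, 3000, 50000)

class(glengths)    # "numeric"
length(glengths)   # 3
str(glengths)      # compact summary: type, index range, and values
```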

A vector can also contain characters. Create another vector called `species` with three elements, where each element corresponds to a value in the genome sizes vector (in Mb).

```
# Create a character vector and store the vector as a variable called 'species'
species <- c("ecoli", "human", "corn")
species
```

**Exercise**

1. Try to create a vector of numeric and character values by *combining* the two vectors that we just created (`glengths` and `species`). Assign this combined vector to a new variable called `combined`. *Hint: you will need to use the combine `c()` function to do this*.
2. Print the `combined` vector in the console. What looks different compared to the original vectors?

### 3.2.2 Factors

A **factor** is a special type of vector that is used to **store categorical data**. Each unique category is referred to as a **factor level** (i.e. category = level).

For instance, if we have four animals and the first animal is female, the second and third are male, and the fourth is female, we could create a factor that appears like a vector, but has integer values stored under-the-hood. The integer value assigned is a one for females and a two for males. The numbers are assigned in alphabetical order, so because the f- in females comes before the m- in males in the alphabet, females get assigned a one and males a two. In later lessons we will show you how you could change these assignments.
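The animal example might look like this in code. This is only a sketch: the variable name `sex` is hypothetical, and `factor()` is the function introduced just below.

```r
# Four animals: female, male, male, female
sex <- factor(c("female", "male", "male", "female"))

levels(sex)       # "female" "male"
as.integer(sex)   # 1 2 2 1
```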

Let's create a factor vector and explore a bit more. We'll start by creating a character vector describing three different levels of expression. Perhaps the first value represents expression in mouse1, the second value represents expression in mouse2, and so on and so forth:

```
# Create a character vector and store the vector as a variable called 'expression'
expression <- c("low", "high", "medium", "high", "low", "medium", "high")
```

Now we can convert this character vector into a *factor* using the `factor()` function:

```
# Turn 'expression' vector into a factor
expression <- factor(expression)
```

So, what exactly happened when we applied the `factor()` function?

The expression vector is categorical, in that all the values in the vector belong to a set of categories; in this case, the categories are `low`, `medium`, and `high`.
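One way to see what `factor()` did is to inspect the levels and the underlying integer codes. A sketch, repeating the definitions above so that it runs on its own:

```r
expression <- c("low", "high", "medium", "high", "low", "medium", "high")
expression <- factor(expression)

# Levels are assigned in alphabetical order
levels(expression)       # "high"   "low"    "medium"

# Each element is stored as the integer position of its level
as.integer(expression)   # 2 1 3 1 2 3 1
```

Because the default order is alphabetical, `high` is level 1 and `medium` is level 3, which does not match the natural low-to-high ordering.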

So now that we have an idea of what factors are, when would you ever want to use them?

Factors are extremely valuable for many operations often performed in R and are necessary for many statistical methods, as you'll see. As an example, if you want to color your plots by treatment type, then you would need the treatment variable to be a factor.

**Exercises**

Let's say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each cell type.

1. Create a vector named `samplegroup` with nine elements: 3 control ("CTL") values, 3 knock-out ("KO") values, and 3 over-expressing ("OE") values.
2. Turn `samplegroup` into a factor data structure.

### 3.2.3 Matrix

A `matrix` in R is a collection of vectors of the **same length and identical data type**. Vectors can be combined as columns or as rows of the matrix to create a 2-dimensional structure. Matrices are usually of numeric data type.
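A minimal sketch of building matrices, assuming two made-up numeric vectors named `first` and `second`. `cbind()` combines vectors as columns, `rbind()` combines them as rows, and `matrix()` reshapes a single vector.

```r
# Hypothetical vectors of the same length and data type
first  <- c(1, 2, 3)
second <- c(4, 5, 6)

by_column <- cbind(first, second)   # each vector becomes a column
by_row    <- rbind(first, second)   # each vector becomes a row

dim(by_column)   # 3 2
dim(by_row)      # 2 3

# matrix() fills column by column by default
m <- matrix(1:6, nrow = 2)
m[2, 3]          # 6
```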

### 3.2.4 Data Frame

A `data.frame` is the *de facto* data structure for most tabular data and what we use for statistics and plotting. A `data.frame` is similar to a matrix in that it's a collection of vectors of the **same length** and each vector represents a column. However, in a data frame **each vector can be of a different data type** (e.g., characters, integers, factors). In the data frame pictured below, the first column is character, the second column is numeric, the third is character, and the fourth is logical.

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.

We can create a data frame by bringing **vectors** together to **form the columns**. We do this using the `data.frame()` function, and giving the function the different vectors we would like to bind together. *This function will only work for vectors of the same length.*

```
# Create a data frame and store it as a variable called 'df'
df <- data.frame(species, glengths)
```

We can see that a new variable called `df` has been created in our `Environment` within a new section called `Data`. In the `Environment`, it specifies that `df` has 3 observations of 2 variables. What does that mean? In R, rows always come first, so it means that `df` has 3 rows and 2 columns.
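The row and column counts can also be checked from the console. A short sketch, repeating the earlier definitions so that it runs on its own:

```r
species  <- c("ecoli", "human", "corn")
glengths <- c(4.6, 3000, 50000)
df <- data.frame(species, glengths)

dim(df)     # 3 2  (rows first, then columns)
nrow(df)    # 3
ncol(df)    # 2
names(df)   # "species"  "glengths"
```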

**Exercise**

Create a data frame called `favorite_books` with the following vectors as columns:

```
titles <- c("Catch-22", "Pride and Prejudice", "Nineteen Eighty Four")
pages <- c(453, 432, 328)
```