Chapter 5 Data Frames

In this section, you are going to achieve two things. Firstly, you are going to create a data frame, and secondly, you are going to load in some data.

Previously, we created two variables called group and sex. We also have the test scores and ages of these individuals, so let's record those:

age <- c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28)
score <- c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)

We now have 4 variables in our environment: age, sex, group, and score. Each of them is a vector with 12 elements, and the entries line up across the vectors, so the first entry for age (i.e. age[1]) belongs to the same person as sex[1]. All 4 variables belong to the same data set, but R doesn’t know this yet, so we need to tell it!

To do this, we need to create a data frame.

mydata <- data.frame(age, sex, group, score)
mydata
##    age    sex   group score
## 1   20   male     low    70
## 2   22   male     low    89
## 3   49   male     low    56
## 4   41   male     low    60
## 5   35   male    high    68
## 6   47 female    high    62
## 7   18 female    high    93
## 8   33 female    high    63
## 9   21 female control    71
## 10  24 female control    65
## 11  22 female control    54
## 12  28 female control    67

Note that mydata is now completely self-contained: if you were to change, say, your original age vector, the age column stored in your data frame would not change.
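As a small sketch of this independence, the code below builds a two-column data frame from the age and score vectors, then overwrites the first entry of the standalone age vector. The name age_score is made up for this illustration and is not part of the example data set.

```r
# Recreate two of the vectors used in this chapter
age <- c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28)
score <- c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)

# age_score is a hypothetical name used only for this illustration
age_score <- data.frame(age, score)

# Change the first entry of the standalone vector
age[1] <- 99

age[1]            # the vector has changed
## [1] 99
age_score$age[1]  # the copy inside the data frame has not
## [1] 20
```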

When you have large data frames, you might want to check what variables you have stored in there. To do this, you can ask R to return the names of each of the variables using the names() function.

names(mydata)
## [1] "age"   "sex"   "group" "score"
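If you only want to know whether one particular variable is present, one possible approach is to combine names() with the %in% operator. The sketch below uses a small stand-in data frame, vars_demo (a name invented here), built from just the age and score vectors.

```r
age <- c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28)
score <- c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)

# vars_demo is a hypothetical stand-in for a larger data frame
vars_demo <- data.frame(age, score)

# Is there a variable called "score"?
"score" %in% names(vars_demo)
## [1] TRUE

# "height" was never recorded, so it is not found
"height" %in% names(vars_demo)
## [1] FALSE

# Count how many variables (columns) are stored
ncol(vars_demo)
## [1] 2
```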

You can also compactly display the internal structure of an R object using the str() function.

str(mydata)
## 'data.frame':    12 obs. of  4 variables:
##  $ age  : num  20 22 49 41 35 47 18 33 21 24 ...
##  $ sex  : Factor w/ 2 levels "male","female": 1 1 1 1 1 2 2 2 2 2 ...
##  $ group: Factor w/ 3 levels "low","high","control": 1 1 1 1 2 2 2 2 3 3 ...
##  $ score: num  70 89 56 60 68 62 93 63 71 65 ...

This gives you only a basic overview of your data, but it is a very helpful way of seeing what is contained in an object.

You might want to get some specific data out of your data frame, as opposed to the full 4 columns. You need to tell R exactly which information to return. The simplest way is to use the $ operator to extract the column you want. For example, let's say you want to extract the scores.

mydata$score
##  [1] 70 89 56 60 68 62 93 63 71 65 54 67
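Because $ returns an ordinary vector, the result can be passed straight to other functions. The sketch below is one way to illustrate this, using a stand-in data frame (score_demo, a name made up here) rather than the full mydata. It also shows the double-bracket form, which extracts the same column by name.

```r
age <- c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28)
score <- c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)

# score_demo is a hypothetical stand-in for mydata
score_demo <- data.frame(age, score)

# The extracted column behaves like any numeric vector
mean(score_demo$score)
## [1] 68.16667

# [[ ]] with the column name returns the same vector as $
identical(score_demo$score, score_demo[["score"]])
## [1] TRUE
```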