Chapter 5 Data Frames
In this section, we are going to achieve two things: first, you will create a data frame, and second, you will load in some data.
Previously, we created two variables called group and sex. We also have the test scores and ages of these individuals, so let's record those:
age <- c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28)
score <- c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)
We now have 4 variables of the same size in our environment: age, sex, group, and score. Each of them is a vector with 12 elements, and entries in the same position describe the same person (e.g. age[1] and sex[1] both belong to the first individual). All 4 variables belong to the same data set, but R doesn’t know this yet, so we need to tell it!
To do this, we need to create a data frame.
mydata <- data.frame(age, sex, group, score)
mydata
##    age    sex   group score
## 1   20   male     low    70
## 2   22   male     low    89
## 3   49   male     low    56
## 4   41   male     low    60
## 5   35   male    high    68
## 6   47 female    high    62
## 7   18 female    high    93
## 8   33 female    high    63
## 9   21 female control    71
## 10  24 female control    65
## 11  22 female control    54
## 12  28 female control    67
Note that the data frame is now completely self-contained: if you were to change your original age vector, the age column stored in your data frame would not change.
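A quick way to convince yourself of this is to change the vector and then compare it with the column. The sketch below is only an illustration: it builds a smaller data frame from age and score alone (sex and group are left out so that it runs on its own), and it uses the $ operator, which is introduced later in this section.

```r
# Rebuild the two numeric vectors from this section
age <- c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28)
score <- c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)

# Illustration only: a reduced data frame with just two columns
mydata <- data.frame(age, score)

# Change the first entry of the standalone vector
age[1] <- 99

age[1]         # the vector now holds 99
mydata$age[1]  # the data frame column still holds 20
```

The data frame took a copy of the values when it was created, so later edits to the vector do not reach it.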
When you have large data frames, you might want to check what variables you have stored in there. To do this, you can ask R to return the names of each of the variables using the names() function.
names(mydata)
## [1] "age" "sex" "group" "score"
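Because names() returns an ordinary character vector, you can also use it to check whether a particular variable is present. The sketch below assumes that sex and group were created as factors earlier; their values and level order are reconstructed here from the printed output above, so the exact code from the earlier section may differ. The variable name "height" is hypothetical and only serves as an example of a missing column.

```r
age <- c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28)
score <- c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)

# Assumption: reconstructed from the printed data frame, not the original code
sex <- factor(rep(c("male", "female"), times = c(5, 7)),
              levels = c("male", "female"))
group <- factor(rep(c("low", "high", "control"), each = 4),
                levels = c("low", "high", "control"))

mydata <- data.frame(age, sex, group, score)

# Is a given variable stored in the data frame?
"score" %in% names(mydata)   # TRUE
"height" %in% names(mydata)  # FALSE, there is no such column

# How many variables are there?
length(names(mydata))        # 4
```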
You can also compactly display the internal structure of an R object using the str() function.
str(mydata)
## 'data.frame': 12 obs. of 4 variables:
## $ age : num 20 22 49 41 35 47 18 33 21 24 ...
## $ sex : Factor w/ 2 levels "male","female": 1 1 1 1 1 2 2 2 2 2 ...
## $ group: Factor w/ 3 levels "low","high","control": 1 1 1 1 2 2 2 2 3 3 ...
## $ score: num 70 89 56 60 68 62 93 63 71 65 ...
This gives you only a basic overview of your data, but it is a very helpful way to see what is contained in an object.
You might want to get some specific data out of your data frame, as opposed to the full 4 columns. You need to tell R exactly which part you want. The simplest way is to use the $ operator to extract the desired information. For example, let's say you want to extract the scores.
mydata$score
## [1] 70 89 56 60 68 62 93 63 71 65 54 67
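What $ gives back is a plain vector, so it can be indexed or passed to other functions just like the vectors you created earlier. The sketch below is a small illustration of that; to keep it runnable on its own, it rebuilds a reduced data frame with only age and score.

```r
age <- c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28)
score <- c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)

# Illustration only: a reduced data frame with just two columns
mydata <- data.frame(age, score)

# The extracted column behaves like any other numeric vector
mydata$score[1:3]    # first three scores: 70 89 56
length(mydata$score) # 12, one score per person
mean(mydata$score)   # average score, about 68.17
```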