14.7 Data frames and data management
14.7.1 The basics
- Data frames are the usual format for data sets
- …are
"lists"
of vectors of the same length under the hood - …look like matrices, but columns can contain objects of different class
- …are
- Functions
data.frame()
: Create a data frameas.data.frame()
: Convert into a data framesummary()
undstr()
: Get oversight of a data frame’s contenthead()
andtail()
: Inspect the data- ``View()
: Show data frame (formerly
fix()) + **Beware**.. you can't continue to work when the data window of
fix()``` is open! names()
: Display variablenames (and use to rename)na.omit()
: Delete missings “listwise”, i.e. rows that contain at least one missingis.na()
: Generate logical vector indicating the missings
- More
object$var1
: Access variablevar1
in data frameobject
as.numeric(object$var1)
: Convert class of variable into numericNA
: What was that?!?!
14.7.2 The attach()-function
attach()
: When you attach a data frame you can use variable names directly without referring to the dataframe (R understands that)- Problem: It can cause all sorts of errors since you might loose oversight (especially in more complex scripts)!
- Avoid it and…
- …work with ankers
$
to access variables - …specify data frame in the function where possible, e.g.
lm(...., data=yourdataframe)
- …use other ways.
- …work with ankers
14.7.3 Example: The basics
getwd() # Get working directory
library(foreign) # Load package "foreign"
ls("package:foreign") # Check package content
?swiss # Check out the object
# What is this about?
# Visit: https://opr.princeton.edu/archive/pefp/switz.aspx
swiss2 <- swiss # Load data set
# attach() function
View(swiss2)
attach(swiss2)
names(swiss2)
Education
detach(swiss2)
Education
swiss2$Education
# Get info on data set
str(swiss2)
summary(swiss2)
head(swiss2)
#fix(swiss2)
tail(swiss2)
# Create a data frame
data <- data.frame(id=1:3, # !
weight=c(20,27,24),
size=c("small", "large", "medium"))
data
14.7.4 Logic of accessing subsets of data frames
- Same logic as for vectors but two dimensions
dataframe[rows,columns]
- Replace rows/columns by vector indicating position (numerical, logical, character)
- Logic similar for other object classes such as lists (remember vectors/lists)
# Q: What does the following code do?
swiss[2:4, c(1,2,4)] # indices, c() necessary when numbers are not connected
Fertility | Agriculture | Education | |
---|---|---|---|
Delemont | 83.1 | 45.1 | 9 |
Franches-Mnt | 92.5 | 39.7 | 5 |
Moutier | 85.8 | 36.5 | 7 |
swiss[swiss$Fertility > 75 & swiss$Agriculture > 75, c(1:3)]
Fertility | Agriculture | Examination | |
---|---|---|---|
Conthey | 75.5 | 85.9 | 3 |
Herens | 77.3 | 89.7 | 5 |
Sierre | 92.2 | 84.6 | 3 |
subset(swiss, Fertility > 75 & Agriculture > 75)[, c(1:3)]
Fertility | Agriculture | Examination | |
---|---|---|---|
Conthey | 75.5 | 85.9 | 3 |
Herens | 77.3 | 89.7 | 5 |
Sierre | 92.2 | 84.6 | 3 |
swiss[, c("Fertility", "Agriculture")]
Fertility | Agriculture | |
---|---|---|
Courtelary | 80.2 | 17.0 |
Delemont | 83.1 | 45.1 |
Franches-Mnt | 92.5 | 39.7 |
Moutier | 85.8 | 36.5 |
Neuveville | 76.9 | 43.5 |
Porrentruy | 76.1 | 35.3 |
Broye | 83.8 | 70.2 |
Glane | 92.4 | 67.8 |
Gruyere | 82.4 | 53.3 |
Sarine | 82.9 | 45.2 |
Veveyse | 87.1 | 64.5 |
Aigle | 64.1 | 62.0 |
Aubonne | 66.9 | 67.5 |
Avenches | 68.9 | 60.7 |
Cossonay | 61.7 | 69.3 |
Echallens | 68.3 | 72.6 |
Grandson | 71.7 | 34.0 |
Lausanne | 55.7 | 19.4 |
La Vallee | 54.3 | 15.2 |
Lavaux | 65.1 | 73.0 |
Morges | 65.5 | 59.8 |
Moudon | 65.0 | 55.1 |
Nyone | 56.6 | 50.9 |
Orbe | 57.4 | 54.1 |
Oron | 72.5 | 71.2 |
Payerne | 74.2 | 58.1 |
Paysd’enhaut | 72.0 | 63.5 |
Rolle | 60.5 | 60.8 |
Vevey | 58.3 | 26.8 |
Yverdon | 65.4 | 49.5 |
Conthey | 75.5 | 85.9 |
Entremont | 69.3 | 84.9 |
Herens | 77.3 | 89.7 |
Martigwy | 70.5 | 78.2 |
Monthey | 79.4 | 64.9 |
St Maurice | 65.0 | 75.9 |
Sierre | 92.2 | 84.6 |
Sion | 79.3 | 63.1 |
Boudry | 70.4 | 38.4 |
La Chauxdfnd | 65.7 | 7.7 |
Le Locle | 72.7 | 16.7 |
Neuchatel | 64.4 | 17.6 |
Val de Ruz | 77.6 | 37.6 |
ValdeTravers | 67.6 | 18.7 |
V. De Geneve | 35.0 | 1.2 |
Rive Droite | 44.7 | 46.6 |
Rive Gauche | 42.8 | 27.7 |
# We'll learn a more convenient function later on!
14.7.5 Recoding variables
- Either do it manually (see below) or…
- …using the
plyr
packagemapvalues()
: Recode a categorical vectorcut()
: Recode a continuous variable into a categorical one
- Always check wether recoding worked (very common error!)
table(variable1, variablevar2)
: Contingency table for the two variablesstr()
andsummary()
: Check whether variables in the data set have expected distributions and beware of missings!
14.7.5.1 Example: Recoding variables
# MANUEL CLASSIC WAY
swiss2 <- swiss # Make a copy of the data set
names(swiss) # Display variables
## [1] "Fertility" "Agriculture" "Examination"
## [4] "Education" "Catholic" "Infant.Mortality"
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
summary(swiss)
Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality | |
---|---|---|---|---|---|---|
Min. :35.00 | Min. : 1.20 | Min. : 3.00 | Min. : 1.00 | Min. : 2.150 | Min. :10.80 | |
1st Qu.:64.70 | 1st Qu.:35.90 | 1st Qu.:12.00 | 1st Qu.: 6.00 | 1st Qu.: 5.195 | 1st Qu.:18.15 | |
Median :70.40 | Median :54.10 | Median :16.00 | Median : 8.00 | Median : 15.140 | Median :20.00 | |
Mean :70.14 | Mean :50.66 | Mean :16.49 | Mean :10.98 | Mean : 41.144 | Mean :19.94 | |
3rd Qu.:78.45 | 3rd Qu.:67.65 | 3rd Qu.:22.00 | 3rd Qu.:12.00 | 3rd Qu.: 93.125 | 3rd Qu.:21.70 | |
Max. :92.50 | Max. :89.70 | Max. :37.00 | Max. :53.00 | Max. :100.000 | Max. :26.60 |
swiss2$d.catholic <- NA # generate new variable in dataset
View(swiss2)
swiss2$d.catholic[swiss2$Catholic <= 50] <- 0 # replace values conditional on Catholic
swiss2$d.catholic[swiss2$Catholic > 50] <- 1 # replace values conditional on Catholic
table(swiss2$d.catholic, swiss2$Catholic) # check recoding
/ | 2.15 | 2.27 | 2.4 | 2.56 | 2.82 | 2.84 | 3.3 | 4.2 | 4.43 | 4.52 | 4.97 | 5.16 | 5.23 | 5.62 | 6.1 | 7.72 | 8.52 | 8.65 | 9.96 | 11.22 | 12.11 | 13.79 | 15.14 | 16.92 | 18.46 | 24.2 | 33.77 | 42.34 | 50.43 | 58.33 | 84.84 | 90.57 | 91.38 | 92.85 | 93.4 | 96.83 | 97.16 | 97.67 | 98.22 | 98.61 | 98.96 | 99.06 | 99.46 | 99.68 | 99.71 | 100 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
names(swiss2) # show variable names
## [1] "Fertility" "Agriculture" "Examination"
## [4] "Education" "Catholic" "Infant.Mortality"
## [7] "d.catholic"
names(swiss2)[7] <- "dummy.catholic"
# PLYR: "NEW" WAY
# For recoding character variables simply refer to text with ""
library(plyr)
# mapvalues()
swiss2$Examination2 <- mapvalues(swiss2$Examination, from = c(3, 37), to = c(NA, NA))
# cut()
swiss2$Examination2 <- cut(swiss2$Examination2,
breaks=c(-Inf, 12, 22, Inf),
labels=c("low","medium","high")) # greater than or equal to
14.7.5.2 Exercise: Recoding variables
- Save the data set swiss in a new object called
swiss2
. - Recode the variable
Infant.Mortality
in your new data setswiss2
so thatvalues <= 18
are coded as0
,18 < values <= 20
as1
,20 < values <= 21
as2
and21 < values <= 27
as3
. Do this using both the classic way and thecut()
function and name the respective variablesinf.mort.cla
andinf.mort.cut
. - Check if your coding worked and check the class of the two new variables/objects.