14.7 Data frames and data management

14.7.1 The basics

  • Data frames are the usual format for data sets
    • …are "lists" of vectors of the same length under the hood
    • …look like matrices, but columns can contain objects of different class
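
Both points can be checked directly. The following is a minimal sketch that uses a made-up data frame `df`; the object and its column names are illustrative and not part of the course data.

```r
# Hypothetical toy data frame; all names are illustrative only
df <- data.frame(id = 1:3,
                 weight = c(20, 27, 24),
                 size = c("small", "large", "medium"))

is.list(df)         # TRUE: a data frame is a list of columns
length(df)          # 3: one list element per column
sapply(df, class)   # each column keeps its own class
sapply(df, length)  # all columns have the same length (3 rows)

# A matrix, by contrast, forces all cells into a single class
m <- as.matrix(df)
class(m[, "weight"])  # "character": the numbers were converted to text
```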


  • Functions
    • data.frame(): Create a data frame
    • as.data.frame(): Convert into a data frame
    • summary() and str(): Get an overview of a data frame’s content
    • head() and tail(): Inspect the data
    • View(): Show data frame (formerly fix())
      • Beware: you can't continue to work while the data window of fix() is open!
    • names(): Display variable names (and use to rename)
    • na.omit(): Delete missing values “listwise”, i.e. drop all rows that contain at least one missing value
    • is.na(): Generate a logical vector indicating the missing values


  • More
    • object$var1: Access variable var1 in data frame object
    • as.numeric(object$var1): Convert class of variable into numeric
    • NA: What was that again? R's marker for a missing value ("not available")
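
The functions for missing values and class conversion listed above can be tried on a small example. This is only a sketch: the data frame `df` and its columns `income` and `age` are invented for illustration.

```r
# Hypothetical toy data with missing values; names are illustrative only
df <- data.frame(id = 1:4,
                 income = c("1200", "950", NA, "2100"),  # numbers stored as text
                 age = c(34, NA, 51, 28))

is.na(df$age)        # logical vector: TRUE where age is missing
colSums(is.na(df))   # number of missing values per column

df$income <- as.numeric(df$income)  # convert character to numeric
class(df$income)                    # "numeric"

mean(df$age)                # NA: missing values propagate
mean(df$age, na.rm = TRUE)  # ignore missing values explicitly

df_complete <- na.omit(df)  # listwise deletion: only rows 1 and 4 remain
nrow(df_complete)           # 2

names(df)[names(df) == "income"] <- "inc"  # rename a variable via names()
```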


14.7.2 The attach() function

  • attach(): When you attach a data frame, you can use variable names directly without referring to the data frame (R looks them up in the attached data frame)
  • Problem: It can cause all sorts of errors since you can easily lose track of which object a name refers to (especially in more complex scripts)!


  • Avoid it and…
    • …work with the $ operator to access variables
    • …specify the data frame in the function where possible, e.g. lm(...., data=yourdataframe)
    • …use other ways, e.g. with().
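
A short sketch of these alternatives, using the built-in swiss data. The model formula is arbitrary and only serves as an example, and with() is one possible reading of "other ways", not necessarily the one intended in these notes.

```r
# 1) The $ operator: explicit reference to the data frame
mean(swiss$Education)

# 2) The data argument: variable names are looked up in the data frame
fit <- lm(Fertility ~ Education, data = swiss)  # arbitrary example model
coef(fit)

# 3) with(): evaluate an expression inside the data frame without attaching it
with(swiss, mean(Education))
with(swiss, cor(Fertility, Education))
```

None of these leave anything on the search path, so there is nothing to detach afterwards.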


14.7.3 Example: The basics

getwd() # Get working directory
library(foreign) # Load package "foreign"
ls("package:foreign") # Check package content


?swiss # Check out the object
  # What is this about?
  # Visit: https://opr.princeton.edu/archive/pefp/switz.aspx
swiss2 <- swiss # Load data set


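# attach() function
View(swiss2)       # Open the data viewer
attach(swiss2)     # Make the columns of swiss2 directly accessible
names(swiss2)
Education          # Works: found via the attached data frame
detach(swiss2)     # Undo attach()
Education          # Error: object 'Education' not found
swiss2$Education   # Works: explicit reference with $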

# Get info on data set
str(swiss2)
summary(swiss2)
head(swiss2)
#fix(swiss2)
tail(swiss2)

# Create a data frame
data <- data.frame(id=1:3, # !
                    weight=c(20,27,24),
                    size=c("small", "large", "medium"))
data


14.7.4 Logic of accessing subsets of data frames

  • Same logic as for vectors, but in two dimensions
    • dataframe[rows,columns]
    • Replace rows/columns with a vector that selects elements by position (numeric), by condition (logical) or by name (character)
  • The logic is similar for other object classes such as lists (remember vectors/lists)
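
The three kinds of index vectors can be compared on a small example before turning to the swiss data. This is a minimal sketch; the data frame `df` and the list `lst` are made up for illustration.

```r
# Hypothetical toy objects; names are illustrative only
df <- data.frame(id = 1:3,
                 weight = c(20, 27, 24),
                 size = c("small", "large", "medium"))

df[c(1, 3), ]            # numeric: rows 1 and 3, all columns
df[df$weight > 21, ]     # logical: rows for which the condition is TRUE
df[, c("id", "size")]    # character: columns selected by name
df[c(TRUE, FALSE, TRUE), "weight"]  # mixed: logical rows, character column

# Lists follow the same idea in one dimension
lst <- list(a = 1:3, b = "text", c = TRUE)
lst[c(1, 3)]   # numeric positions, returns a list
lst["b"]       # by name, returns a list of length 1
lst[["b"]]     # double brackets extract the element itself
```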


# Q: What does the following code do?

swiss[2:4, c(1,2,4)] # indices, c() necessary when numbers are not consecutive
##              Fertility Agriculture Education
## Delemont          83.1        45.1         9
## Franches-Mnt      92.5        39.7         5
## Moutier           85.8        36.5         7
swiss[swiss$Fertility > 75 & swiss$Agriculture > 75, c(1:3)]
##         Fertility Agriculture Examination
## Conthey      75.5        85.9           3
## Herens       77.3        89.7           5
## Sierre       92.2        84.6           3
subset(swiss, Fertility > 75 & Agriculture > 75)[, c(1:3)]
##         Fertility Agriculture Examination
## Conthey      75.5        85.9           3
## Herens       77.3        89.7           5
## Sierre       92.2        84.6           3
swiss[, c("Fertility", "Agriculture")]
##              Fertility Agriculture
## Courtelary        80.2        17.0
## Delemont          83.1        45.1
## Franches-Mnt      92.5        39.7
## Moutier           85.8        36.5
## Neuveville        76.9        43.5
## Porrentruy        76.1        35.3
## Broye             83.8        70.2
## Glane             92.4        67.8
## Gruyere           82.4        53.3
## Sarine            82.9        45.2
## Veveyse           87.1        64.5
## Aigle             64.1        62.0
## Aubonne           66.9        67.5
## Avenches          68.9        60.7
## Cossonay          61.7        69.3
## Echallens         68.3        72.6
## Grandson          71.7        34.0
## Lausanne          55.7        19.4
## La Vallee         54.3        15.2
## Lavaux            65.1        73.0
## Morges            65.5        59.8
## Moudon            65.0        55.1
## Nyone             56.6        50.9
## Orbe              57.4        54.1
## Oron              72.5        71.2
## Payerne           74.2        58.1
## Paysd'enhaut      72.0        63.5
## Rolle             60.5        60.8
## Vevey             58.3        26.8
## Yverdon           65.4        49.5
## Conthey           75.5        85.9
## Entremont         69.3        84.9
## Herens            77.3        89.7
## Martigwy          70.5        78.2
## Monthey           79.4        64.9
## St Maurice        65.0        75.9
## Sierre            92.2        84.6
## Sion              79.3        63.1
## Boudry            70.4        38.4
## La Chauxdfnd      65.7         7.7
## Le Locle          72.7        16.7
## Neuchatel         64.4        17.6
## Val de Ruz        77.6        37.6
## ValdeTravers      67.6        18.7
## V. De Geneve      35.0         1.2
## Rive Droite       44.7        46.6
## Rive Gauche       42.8        27.7
# We'll learn a more convenient function later on!


14.7.5 Recoding variables

  • Either do it manually (see below) or…
  • …use helper functions
    • mapvalues() (plyr package): Recode a categorical vector
    • cut() (base R): Recode a continuous variable into a categorical one


  • Always check whether the recoding worked (very common error!)
    • table(variable1, variable2): Contingency table for the two variables
    • str() and summary(): Check whether variables in the data set have the expected distributions and beware of missing values!
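
By default, table() silently drops missing values, so a recoding check can look fine even though cases were lost. The sketch below uses made-up vectors `x` (original) and `d` (recoded) and the useNA argument of table() to make missing values visible.

```r
# Hypothetical original variable with one missing value
x <- c(10, 55, NA, 80, 50)

# Manual recoding into a dummy: 0 if x <= 50, 1 if x > 50
d <- rep(NA, length(x))
d[x <= 50] <- 0
d[x > 50] <- 1

table(x, d, useNA = "ifany")  # cross-tabulation including missing values
table(d, useNA = "ifany")     # distribution of the new variable

# The recoded variable should have no more missing values than the original
sum(is.na(d)) == sum(is.na(x))
```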



14.7.5.1 Example: Recoding variables

# MANUAL, CLASSIC WAY
swiss2 <- swiss # Make a copy of the data set
names(swiss) # Display variables
## [1] "Fertility"        "Agriculture"      "Examination"     
## [4] "Education"        "Catholic"         "Infant.Mortality"
str(swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
summary(swiss)
##    Fertility      Agriculture     Examination      Education        Catholic       Infant.Mortality
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00   Min.   :  2.150   Min.   :10.80
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00   1st Qu.:  5.195   1st Qu.:18.15
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00   Median : 15.140   Median :20.00
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98   Mean   : 41.144   Mean   :19.94
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00   3rd Qu.: 93.125   3rd Qu.:21.70
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00   Max.   :100.000   Max.   :26.60
swiss2$d.catholic <- NA # generate new variable in dataset
View(swiss2)
swiss2$d.catholic[swiss2$Catholic <= 50] <- 0 # replace values conditional on Catholic
swiss2$d.catholic[swiss2$Catholic > 50] <- 1  # replace values conditional on Catholic
table(swiss2$d.catholic, swiss2$Catholic) # check recoding
##     2.15 2.27 2.4 2.56 2.82 2.84 3.3 4.2 4.43 4.52 4.97 5.16 5.23 5.62 6.1 7.72 8.52 8.65 9.96 11.22 12.11 13.79 15.14 16.92 18.46 24.2 33.77 42.34 50.43 58.33 84.84 90.57 91.38 92.85 93.4 96.83 97.16 97.67 98.22 98.61 98.96 99.06 99.46 99.68 99.71 100
##   0    1    1   1    1    1    1   1   1    1    1    1    1    2    1   1    1    1    1    1     1     1     1     1     1     1    1     1     1     0     0     0     0     0     0    0     0     0     0     0     0     0     0     0     0     0   0
##   1    0    0   0    0    0    0   0   0    0    0    0    0    0    0   0    0    0    0    0     0     0     0     0     0     0    0     0     0     1     1     1     1     1     1    1     1     1     1     1     1     1     1     1     1     1   1
names(swiss2)   # show variable names
## [1] "Fertility"        "Agriculture"      "Examination"     
## [4] "Education"        "Catholic"         "Infant.Mortality"
## [7] "d.catholic"
names(swiss2)[7] <- "dummy.catholic"

# PLYR: "NEW" WAY
# To recode character variables, simply put the values in quotes ("")
library(plyr)

# mapvalues()
swiss2$Examination2 <- mapvalues(swiss2$Examination, from = c(3, 37), to = c(NA, NA))

# cut() (base R, works without plyr)
swiss2$Examination2 <- cut(swiss2$Examination2,
                     breaks=c(-Inf, 12, 22, Inf),
                     labels=c("low","medium","high")) # right-closed intervals: x <= 12, 12 < x <= 22, x > 22
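
How cut() treats values that fall exactly on a break is a frequent source of recoding errors. The sketch below uses a made-up vector `x` with values on and around the breaks to compare the default (right = TRUE) with right = FALSE.

```r
# Hypothetical values on and around the breaks 12 and 22
x <- c(11, 12, 13, 22, 23)

# Default right = TRUE: intervals (-Inf,12], (12,22], (22,Inf)
a <- cut(x, breaks = c(-Inf, 12, 22, Inf),
         labels = c("low", "medium", "high"))

# right = FALSE: intervals [-Inf,12), [12,22), [22,Inf)
b <- cut(x, breaks = c(-Inf, 12, 22, Inf),
         labels = c("low", "medium", "high"), right = FALSE)

data.frame(x, a, b)  # compare both codings side by side
class(a)             # "factor"
```

With the default, 12 is coded "low" and 22 "medium"; with right = FALSE, 12 is coded "medium" and 22 "high".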


14.7.5.2 Exercise: Recoding variables

  1. Save the data set swiss in a new object called swiss2.
  2. Recode the variable Infant.Mortality in your new data set swiss2 so that values <= 18 are coded as 0, 18 < values <= 20 as 1, 20 < values <= 21 as 2 and 21 < values <= 27 as 3. Do this using both the classic way and the cut() function and name the respective variables inf.mort.cla and inf.mort.cut.
  3. Check if your coding worked and check the class of the two new variables/objects.


14.7.5.3 Solution: Recoding variables
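
One possible solution, sketched with the approaches from the example above. It is not necessarily the solution intended by the author; other break specifications in cut() would work as well.

```r
# 1. Copy the data set
swiss2 <- swiss

# 2a. Classic way: create an empty variable and fill it step by step
swiss2$inf.mort.cla <- NA
swiss2$inf.mort.cla[swiss2$Infant.Mortality <= 18] <- 0
swiss2$inf.mort.cla[swiss2$Infant.Mortality > 18 & swiss2$Infant.Mortality <= 20] <- 1
swiss2$inf.mort.cla[swiss2$Infant.Mortality > 20 & swiss2$Infant.Mortality <= 21] <- 2
swiss2$inf.mort.cla[swiss2$Infant.Mortality > 21 & swiss2$Infant.Mortality <= 27] <- 3

# 2b. cut(): right-closed intervals match the "<=" upper bounds of the exercise
swiss2$inf.mort.cut <- cut(swiss2$Infant.Mortality,
                           breaks = c(-Inf, 18, 20, 21, 27),
                           labels = c(0, 1, 2, 3))

# 3. Check the recoding and the classes
table(swiss2$inf.mort.cla, swiss2$inf.mort.cut, useNA = "ifany")
summary(swiss2$Infant.Mortality)
class(swiss2$inf.mort.cla)  # "numeric"
class(swiss2$inf.mort.cut)  # "factor"
```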