14.7 Data frames and data management
14.7.1 The basics
- Data frames are the usual format for data sets
- …are
"lists"of vectors of the same length under the hood - …look like matrices, but columns can contain objects of different class
- …are
- Functions
data.frame(): Create a data frameas.data.frame(): Convert into a data framesummary()undstr(): Get oversight of a data frame’s contenthead()andtail(): Inspect the data- ``View()
: Show data frame (formerlyfix()) + **Beware**.. you can't continue to work when the data window offix()``` is open! names(): Display variablenames (and use to rename)na.omit(): Delete missings “listwise”, i.e. rows that contain at least one missingis.na(): Generate logical vector indicating the missings
- More
object$var1: Access variablevar1in data frameobjectas.numeric(object$var1): Convert class of variable into numericNA: What was that?!?!
14.7.2 The attach()-function
attach(): When you attach a data frame you can use variable names directly without referring to the dataframe (R understands that)- Problem: It can cause all sorts of errors since you might loose oversight (especially in more complex scripts)!
- Avoid it and…
- …work with ankers
$to access variables - …specify data frame in the function where possible, e.g.
lm(...., data=yourdataframe) - …use other ways.
- …work with ankers
14.7.3 Example: The basics
getwd() # Get working directory
library(foreign) # Load package "foreign"
ls("package:foreign") # Check package content
?swiss # Check out the object
# What is this about?
# Visit: https://opr.princeton.edu/archive/pefp/switz.aspx
swiss2 <- swiss # Load data set
# attach() function
View(swiss2)
attach(swiss2)
names(swiss2)
Education
detach(swiss2)
Education
swiss2$Education
# Get info on data set
str(swiss2)
summary(swiss2)
head(swiss2)
#fix(swiss2)
tail(swiss2)
# Create a data frame
data <- data.frame(id=1:3, # !
weight=c(20,27,24),
size=c("small", "large", "medium"))
data14.7.4 Logic of accessing subsets of data frames
- Same logic as for vectors but two dimensions
dataframe[rows,columns]- Replace rows/columns by vector indicating position (numerical, logical, character)
- Logic similar for other object classes such as lists (remember vectors/lists)
# Q: What does the following code do?
swiss[2:4, c(1,2,4)] # indices, c() necessary when numbers are not connected| Fertility | Agriculture | Education | |
|---|---|---|---|
| Delemont | 83.1 | 45.1 | 9 |
| Franches-Mnt | 92.5 | 39.7 | 5 |
| Moutier | 85.8 | 36.5 | 7 |
swiss[swiss$Fertility > 75 & swiss$Agriculture > 75, c(1:3)]| Fertility | Agriculture | Examination | |
|---|---|---|---|
| Conthey | 75.5 | 85.9 | 3 |
| Herens | 77.3 | 89.7 | 5 |
| Sierre | 92.2 | 84.6 | 3 |
subset(swiss, Fertility > 75 & Agriculture > 75)[, c(1:3)]| Fertility | Agriculture | Examination | |
|---|---|---|---|
| Conthey | 75.5 | 85.9 | 3 |
| Herens | 77.3 | 89.7 | 5 |
| Sierre | 92.2 | 84.6 | 3 |
swiss[, c("Fertility", "Agriculture")]| Fertility | Agriculture | |
|---|---|---|
| Courtelary | 80.2 | 17.0 |
| Delemont | 83.1 | 45.1 |
| Franches-Mnt | 92.5 | 39.7 |
| Moutier | 85.8 | 36.5 |
| Neuveville | 76.9 | 43.5 |
| Porrentruy | 76.1 | 35.3 |
| Broye | 83.8 | 70.2 |
| Glane | 92.4 | 67.8 |
| Gruyere | 82.4 | 53.3 |
| Sarine | 82.9 | 45.2 |
| Veveyse | 87.1 | 64.5 |
| Aigle | 64.1 | 62.0 |
| Aubonne | 66.9 | 67.5 |
| Avenches | 68.9 | 60.7 |
| Cossonay | 61.7 | 69.3 |
| Echallens | 68.3 | 72.6 |
| Grandson | 71.7 | 34.0 |
| Lausanne | 55.7 | 19.4 |
| La Vallee | 54.3 | 15.2 |
| Lavaux | 65.1 | 73.0 |
| Morges | 65.5 | 59.8 |
| Moudon | 65.0 | 55.1 |
| Nyone | 56.6 | 50.9 |
| Orbe | 57.4 | 54.1 |
| Oron | 72.5 | 71.2 |
| Payerne | 74.2 | 58.1 |
| Paysd’enhaut | 72.0 | 63.5 |
| Rolle | 60.5 | 60.8 |
| Vevey | 58.3 | 26.8 |
| Yverdon | 65.4 | 49.5 |
| Conthey | 75.5 | 85.9 |
| Entremont | 69.3 | 84.9 |
| Herens | 77.3 | 89.7 |
| Martigwy | 70.5 | 78.2 |
| Monthey | 79.4 | 64.9 |
| St Maurice | 65.0 | 75.9 |
| Sierre | 92.2 | 84.6 |
| Sion | 79.3 | 63.1 |
| Boudry | 70.4 | 38.4 |
| La Chauxdfnd | 65.7 | 7.7 |
| Le Locle | 72.7 | 16.7 |
| Neuchatel | 64.4 | 17.6 |
| Val de Ruz | 77.6 | 37.6 |
| ValdeTravers | 67.6 | 18.7 |
| V. De Geneve | 35.0 | 1.2 |
| Rive Droite | 44.7 | 46.6 |
| Rive Gauche | 42.8 | 27.7 |
# We'll learn a more convenient function later on!14.7.5 Recoding variables
- Either do it manually (see below) or…
- …using the
plyrpackagemapvalues(): Recode a categorical vectorcut(): Recode a continuous variable into a categorical one
- Always check wether recoding worked (very common error!)
table(variable1, variablevar2): Contingency table for the two variablesstr()andsummary(): Check whether variables in the data set have expected distributions and beware of missings!
14.7.5.1 Example: Recoding variables
# MANUEL CLASSIC WAY
swiss2 <- swiss # Make a copy of the data set
names(swiss) # Display variables## [1] "Fertility" "Agriculture" "Examination"
## [4] "Education" "Catholic" "Infant.Mortality"
str(swiss)## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
summary(swiss)| Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality | |
|---|---|---|---|---|---|---|
| Min. :35.00 | Min. : 1.20 | Min. : 3.00 | Min. : 1.00 | Min. : 2.150 | Min. :10.80 | |
| 1st Qu.:64.70 | 1st Qu.:35.90 | 1st Qu.:12.00 | 1st Qu.: 6.00 | 1st Qu.: 5.195 | 1st Qu.:18.15 | |
| Median :70.40 | Median :54.10 | Median :16.00 | Median : 8.00 | Median : 15.140 | Median :20.00 | |
| Mean :70.14 | Mean :50.66 | Mean :16.49 | Mean :10.98 | Mean : 41.144 | Mean :19.94 | |
| 3rd Qu.:78.45 | 3rd Qu.:67.65 | 3rd Qu.:22.00 | 3rd Qu.:12.00 | 3rd Qu.: 93.125 | 3rd Qu.:21.70 | |
| Max. :92.50 | Max. :89.70 | Max. :37.00 | Max. :53.00 | Max. :100.000 | Max. :26.60 |
swiss2$d.catholic <- NA # generate new variable in dataset
View(swiss2)
swiss2$d.catholic[swiss2$Catholic <= 50] <- 0 # replace values conditional on Catholic
swiss2$d.catholic[swiss2$Catholic > 50] <- 1 # replace values conditional on Catholic
table(swiss2$d.catholic, swiss2$Catholic) # check recoding| / | 2.15 | 2.27 | 2.4 | 2.56 | 2.82 | 2.84 | 3.3 | 4.2 | 4.43 | 4.52 | 4.97 | 5.16 | 5.23 | 5.62 | 6.1 | 7.72 | 8.52 | 8.65 | 9.96 | 11.22 | 12.11 | 13.79 | 15.14 | 16.92 | 18.46 | 24.2 | 33.77 | 42.34 | 50.43 | 58.33 | 84.84 | 90.57 | 91.38 | 92.85 | 93.4 | 96.83 | 97.16 | 97.67 | 98.22 | 98.61 | 98.96 | 99.06 | 99.46 | 99.68 | 99.71 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
names(swiss2) # show variable names## [1] "Fertility" "Agriculture" "Examination"
## [4] "Education" "Catholic" "Infant.Mortality"
## [7] "d.catholic"
names(swiss2)[7] <- "dummy.catholic"
# PLYR: "NEW" WAY
# For recoding character variables simply refer to text with ""
library(plyr)
# mapvalues()
swiss2$Examination2 <- mapvalues(swiss2$Examination, from = c(3, 37), to = c(NA, NA))
# cut()
swiss2$Examination2 <- cut(swiss2$Examination2,
breaks=c(-Inf, 12, 22, Inf),
labels=c("low","medium","high")) # greater than or equal to14.7.5.2 Exercise: Recoding variables
- Save the data set swiss in a new object called
swiss2. - Recode the variable
Infant.Mortalityin your new data setswiss2so thatvalues <= 18are coded as0,18 < values <= 20as1,20 < values <= 21as2and21 < values <= 27as3. Do this using both the classic way and thecut()function and name the respective variablesinf.mort.claandinf.mort.cut. - Check if your coding worked and check the class of the two new variables/objects.