Big data and Social Science

14.7 Data frames and data management

14.7.1 The basics

Data frames are the usual format for data sets
- …are "lists" of vectors of the same length under the hood
- …look like matrices, but columns can contain objects of different class
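
The two points above can be checked directly in the console. The following is a minimal sketch; the data frame df and its values are made up for illustration.

```r
# A small, made-up data frame (hypothetical values)
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 score = c(2.5, 3.1, 4.8),
                 stringsAsFactors = FALSE)

is_a_list   <- is.list(df)         # a data frame is a list of columns
col_classes <- sapply(df, class)   # each column keeps its own class
col_lengths <- sapply(df, length)  # all columns have the same length

is_a_list
col_classes
col_lengths
```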

Functions
- data.frame(): Create a data frame
- as.data.frame(): Convert into a data frame
- summary() and str(): Get an overview of a data frame’s content
- head() and tail(): Inspect the data
- View(): Show data frame (formerly fix())
  - Beware: you can't continue to work while the data window of fix() is open!
- names(): Display variable names (also used to rename variables)
- na.omit(): Delete missing values “listwise”, i.e. drop all rows that contain at least one missing value
- is.na(): Generate a logical vector indicating the missing values

More
- object$var1: Access variable var1 in data frame object
- as.numeric(object$var1): Convert class of variable into numeric
- NA: What was that?!?!
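
As a reminder of how NA behaves with these functions, here is a small sketch. The data frame df and the variable names var1 and var2 are hypothetical.

```r
# Hypothetical data frame with missing values (NA)
df <- data.frame(var1 = c("1", "2", NA),
                 var2 = c(10, NA, 30),
                 stringsAsFactors = FALSE)

missing_var2 <- is.na(df$var2)   # logical vector marking the missing values
df$var1 <- as.numeric(df$var1)   # character -> numeric; NA stays NA
complete <- na.omit(df)          # keep only rows without any NA

missing_var2
class(df$var1)
nrow(complete)
```

Only the first row survives na.omit() here, because rows two and three each contain one missing value.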

14.7.2 The attach()-function

attach(): When you attach a data frame, you can use its variable names directly without referring to the data frame (R looks them up in the attached data frame).
Problem: It can cause all sorts of errors because you easily lose track of which object a name refers to (especially in more complex scripts)!
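
One way this goes wrong is name masking. The sketch below assumes a made-up object called Education in the workspace; it is not part of the original example.

```r
Education <- c(1, 2, 3)         # hypothetical object in the workspace
attach(swiss)                   # R reports that Education is masked by .GlobalEnv
masked_mean <- mean(Education)  # uses the workspace object, not swiss$Education
detach(swiss)

masked_mean                     # mean of c(1, 2, 3)
mean(swiss$Education)           # mean of the actual variable in the data set
```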

Avoid it and…
- …use the $ operator to access variables
- …specify the data frame in the function where possible, e.g. lm(...., data=yourdataframe)
- …use other approaches, e.g. with().
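
A minimal sketch of these alternatives, using the built-in swiss data set. The model formula is chosen only for illustration.

```r
# Access a variable with $ instead of attaching the data frame
mean_dollar <- mean(swiss$Education)

# Pass the data frame to the modelling function via the data argument
fit <- lm(Fertility ~ Education, data = swiss)

# with() evaluates an expression inside the data frame
# without changing the search path
mean_with <- with(swiss, mean(Education))

mean_dollar
mean_with
coef(fit)
```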

14.7.3 Example: The basics

getwd() # Get working directory
library(foreign) # Load package "foreign"
ls("package:foreign") # Check package content


?swiss # Check out the object
  # What is this about?
  # Visit: https://opr.princeton.edu/archive/pefp/switz.aspx
swiss2 <- swiss # Copy the built-in data set into a new object


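# attach() function
View(swiss2)
attach(swiss2)    # Put the columns of swiss2 on the search path
names(swiss2)
Education         # Works: found via the attached data frame
detach(swiss2)    # Remove swiss2 from the search path again
Education         # Error expected: object 'Education' not found
swiss2$Education  # Works: explicit reference with $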

# Get info on data set
str(swiss2)
summary(swiss2)
head(swiss2)
#fix(swiss2)
tail(swiss2)

# Create a data frame
data <- data.frame(id=1:3, # !
                    weight=c(20,27,24),
                    size=c("small", "large", "medium"))
data

14.7.4 Logic of accessing subsets of data frames

Same logic as for vectors, but with two dimensions
- dataframe[rows,columns]
- Replace rows/columns with a vector that selects elements (numeric positions, logical values or character names)
- The logic is similar for other object classes such as lists (remember vectors/lists)
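
The three kinds of index vectors can be sketched on a small made-up data frame. The object df and its row names are hypothetical.

```r
# Hypothetical data frame with row names
df <- data.frame(x = c(5, 10, 15),
                 y = c("a", "b", "c"),
                 row.names = c("r1", "r2", "r3"),
                 stringsAsFactors = FALSE)

by_position <- df[c(1, 3), ]            # numeric: rows 1 and 3, all columns
by_logical  <- df[df$x > 7, ]           # logical: rows where the condition is TRUE
by_name     <- df["r2", "y"]            # character: row "r2", column "y"
as_vector   <- df[, "x"]                # a single column is returned as a vector
as_frame    <- df[, "x", drop = FALSE]  # drop = FALSE keeps the data frame

by_position
by_logical
by_name
```

Leaving the row or column position empty selects all rows or all columns.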

# Q: What does the following code do?

swiss[2:4, c(1,2,4)] # indices, c() necessary when numbers are not consecutive

	Fertility	Agriculture	Education
Delemont	83.1	45.1	9
Franches-Mnt	92.5	39.7	5
Moutier	85.8	36.5	7

swiss[swiss$Fertility > 75 & swiss$Agriculture > 75, c(1:3)]

	Fertility	Agriculture	Examination
Conthey	75.5	85.9	3
Herens	77.3	89.7	5
Sierre	92.2	84.6	3

subset(swiss, Fertility > 75 & Agriculture > 75)[, c(1:3)]

	Fertility	Agriculture	Examination
Conthey	75.5	85.9	3
Herens	77.3	89.7	5
Sierre	92.2	84.6	3

swiss[, c("Fertility", "Agriculture")]

	Fertility	Agriculture
Courtelary	80.2	17.0
Delemont	83.1	45.1
Franches-Mnt	92.5	39.7
Moutier	85.8	36.5
Neuveville	76.9	43.5
Porrentruy	76.1	35.3
Broye	83.8	70.2
Glane	92.4	67.8
Gruyere	82.4	53.3
Sarine	82.9	45.2
Veveyse	87.1	64.5
Aigle	64.1	62.0
Aubonne	66.9	67.5
Avenches	68.9	60.7
Cossonay	61.7	69.3
Echallens	68.3	72.6
Grandson	71.7	34.0
Lausanne	55.7	19.4
La Vallee	54.3	15.2
Lavaux	65.1	73.0
Morges	65.5	59.8
Moudon	65.0	55.1
Nyone	56.6	50.9
Orbe	57.4	54.1
Oron	72.5	71.2
Payerne	74.2	58.1
Paysd’enhaut	72.0	63.5
Rolle	60.5	60.8
Vevey	58.3	26.8
Yverdon	65.4	49.5
Conthey	75.5	85.9
Entremont	69.3	84.9
Herens	77.3	89.7
Martigwy	70.5	78.2
Monthey	79.4	64.9
St Maurice	65.0	75.9
Sierre	92.2	84.6
Sion	79.3	63.1
Boudry	70.4	38.4
La Chauxdfnd	65.7	7.7
Le Locle	72.7	16.7
Neuchatel	64.4	17.6
Val de Ruz	77.6	37.6
ValdeTravers	67.6	18.7
V. De Geneve	35.0	1.2
Rive Droite	44.7	46.6
Rive Gauche	42.8	27.7

# We'll learn a more convenient function later on!

14.7.5 Recoding variables

Either do it manually (see below) or…
…use dedicated functions
- mapvalues() (plyr package): Recode a categorical vector
- cut() (base R): Recode a continuous variable into a categorical one

Always check whether the recoding worked (errors here are very common!)
- table(variable1, variable2): Contingency table of the two variables, e.g. the original and the recoded variable
- str() and summary(): Check whether the variables in the data set have the expected distributions and beware of missing values!
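
A minimal sketch of such a check, using made-up values. Note that table() drops missing values unless useNA is set, so a recoding error that produces NAs can go unnoticed without it.

```r
x <- c(10, 60, NA, 50)       # hypothetical original variable
d <- ifelse(x > 50, 1, 0)    # recoded dummy; NA in x stays NA in d

# Cross-tabulate recoded against original values, keeping missing values
check <- table(recoded = d, original = x, useNA = "ifany")
check

# The number of missing values should not change through recoding
n_missing_x <- sum(is.na(x))
n_missing_d <- sum(is.na(d))
```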

14.7.5.1 Example: Recoding variables

# MANUAL CLASSIC WAY
swiss2 <- swiss # Make a copy of the data set
names(swiss) # Display variables

## [1] "Fertility"        "Agriculture"      "Examination"     
## [4] "Education"        "Catholic"         "Infant.Mortality"

str(swiss)

## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

summary(swiss)

Fertility	Agriculture	Examination	Education	Catholic	Infant.Mortality
Min. :35.00	Min. : 1.20	Min. : 3.00	Min. : 1.00	Min. : 2.150	Min. :10.80
1st Qu.:64.70	1st Qu.:35.90	1st Qu.:12.00	1st Qu.: 6.00	1st Qu.: 5.195	1st Qu.:18.15
Median :70.40	Median :54.10	Median :16.00	Median : 8.00	Median : 15.140	Median :20.00
Mean :70.14	Mean :50.66	Mean :16.49	Mean :10.98	Mean : 41.144	Mean :19.94
3rd Qu.:78.45	3rd Qu.:67.65	3rd Qu.:22.00	3rd Qu.:12.00	3rd Qu.: 93.125	3rd Qu.:21.70
Max. :92.50	Max. :89.70	Max. :37.00	Max. :53.00	Max. :100.000	Max. :26.60

swiss2$d.catholic <- NA # generate new variable in dataset
View(swiss2)
swiss2$d.catholic[swiss2$Catholic <= 50] <- 0 # replace values conditional on Catholic
swiss2$d.catholic[swiss2$Catholic > 50] <- 1  # replace values conditional on Catholic
table(swiss2$d.catholic, swiss2$Catholic) # check recoding

/	2.15	2.27	2.4	2.56	2.82	2.84	3.3	4.2	4.43	4.52	4.97	5.16	5.23	5.62	6.1	7.72	8.52	8.65	9.96	11.22	12.11	13.79	15.14	16.92	18.46	24.2	33.77	42.34	50.43	58.33	84.84	90.57	91.38	92.85	93.4	96.83	97.16	97.67	98.22	98.61	98.96	99.06	99.46	99.68	99.71	100
0	1	1	1	1	1	1	1	1	1	1	1	1	2	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1

names(swiss2)   # show variable names

## [1] "Fertility"        "Agriculture"      "Examination"     
## [4] "Education"        "Catholic"         "Infant.Mortality"
## [7] "d.catholic"

names(swiss2)[7] <- "dummy.catholic"

# PLYR: "NEW" WAY
# To recode character variables, simply put the values in quotes ("")
library(plyr)

# mapvalues(): recode the values 3 and 37 to missing (NA)
swiss2$Examination2 <- mapvalues(swiss2$Examination, from = c(3, 37), to = c(NA, NA))

# cut(): intervals are right-closed by default, i.e. (-Inf,12], (12,22], (22,Inf]
swiss2$Examination2 <- cut(swiss2$Examination2,
                           breaks = c(-Inf, 12, 22, Inf),
                           labels = c("low", "medium", "high"))
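
How cut() treats values that fall exactly on a break depends on the right argument. The short sketch below uses made-up values to show both settings.

```r
x <- c(12, 13, 22, 23)  # hypothetical values on and next to the breaks

# Default (right = TRUE): intervals (-Inf,12], (12,22], (22,Inf]
right_closed <- cut(x, breaks = c(-Inf, 12, 22, Inf),
                    labels = c("low", "medium", "high"))

# right = FALSE: intervals [-Inf,12), [12,22), [22,Inf)
left_closed <- cut(x, breaks = c(-Inf, 12, 22, Inf),
                   labels = c("low", "medium", "high"),
                   right = FALSE)

data.frame(x, right_closed, left_closed)
```

With the default, 12 counts as "low" and 22 as "medium". With right = FALSE, they move up to "medium" and "high".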

14.7.5.2 Exercise: Recoding variables

Save the data set swiss in a new object called swiss2.
Recode the variable Infant.Mortality in your new data set swiss2 so that values <= 18 are coded as 0, 18 < values <= 20 as 1, 20 < values <= 21 as 2 and 21 < values <= 27 as 3. Do this using both the classic way and the cut() function and name the respective variables inf.mort.cla and inf.mort.cut.
Check if your coding worked and check the class of the two new variables/objects.

14.7.5.3 Solution: Recoding variables
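
One possible solution is sketched below. It is not the official solution. It assumes the built-in swiss data set and uses the variable names given in the exercise.

```r
swiss2 <- swiss  # save the data set in a new object

# Classic way: start with NA and fill in the values step by step
swiss2$inf.mort.cla <- NA
swiss2$inf.mort.cla[swiss2$Infant.Mortality <= 18] <- 0
swiss2$inf.mort.cla[swiss2$Infant.Mortality > 18 & swiss2$Infant.Mortality <= 20] <- 1
swiss2$inf.mort.cla[swiss2$Infant.Mortality > 20 & swiss2$Infant.Mortality <= 21] <- 2
swiss2$inf.mort.cla[swiss2$Infant.Mortality > 21 & swiss2$Infant.Mortality <= 27] <- 3

# cut(): right-closed intervals (-Inf,18], (18,20], (20,21], (21,27]
swiss2$inf.mort.cut <- cut(swiss2$Infant.Mortality,
                           breaks = c(-Inf, 18, 20, 21, 27),
                           labels = c(0, 1, 2, 3))

# Check the recoding: all observations should lie on the diagonal
table(swiss2$inf.mort.cla, swiss2$inf.mort.cut, useNA = "ifany")

# Check the classes of the new variables
class(swiss2$inf.mort.cla)
class(swiss2$inf.mort.cut)
```

The classic way yields a numeric variable, whereas cut() returns a factor.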
