Chapter 2 R you Ready?
Before we can get into data analysis, we need to first get acquainted with the R environment and learn some basic functionality. As discussed in Chapter 1, most of what we do in R is one of:
- Creating objects,
- Performing functions on objects to create new objects, or
- Looking at objects
The next few sections will discuss how to accomplish these tasks.
2.1 Basic Commands
2.1.1 Assigning Objects and Basic Math
While using R we spend a lot of time creating, defining, and manipulating objects. The preferred way of creating an object is with an arrow, <-. You can also use an equals sign (=), but that’s generally frowned upon. We begin by creating an object q that is the number 42.
q <- 42
After running this line, the environment window (top right pane) in R will have a new object in it called q. You can look at the object in your environment window to see what its value is, or you can use one of the following methods to see what q contains:
print(q)
## [1] 42
or:
q
## [1] 42
Something you may struggle with a lot early on in your R journey is dealing with the fact that R is case-sensitive and is VERY SERIOUS about it! If I try typing Q into R, I get an error:
Q
## Error in eval(expr, envir, enclos): object 'Q' not found
Object names can include numbers, periods, or underscores, and must begin with a letter. I could create objects with names like q.1 or q_1, but not something like 1q (starts with a number) or q!1 (invalid character).
q.1 <- 2.718
q_1 <- 3.142
The fact that q is an object that contains the number 42 will remain in R’s memory until R is restarted, I overwrite q, or I remove q. Overwriting a variable is easy; simply assign a new value to a variable:
q <- 420
q
## [1] 420
Removing a variable is accomplished with the rm() function. If I run the command rm(q), the object q will be removed from the environment.
rm(q)
Occasionally you may want to remove all the objects from the environment; there is a useful (but tricky to remember) command that removes all objects from memory: rm(list=ls()).
rm(list=ls())
After running this command, we should see that our environment is clear. Let’s assign a couple numbers to objects for the next bit of discussion:
z <- 132
y <- 33
Objects don’t have to be just numbers. They can be words too.
a <- "Hello"
a
## [1] "Hello"
Even though 1 is a number, wrapping it in quotation marks means R treats it like a word, not a number.
b <- "1"
If we want to know what type of object something is, we can use the class() command.
class(a)
## [1] "character"
class(b)
## [1] "character"
class(y)
## [1] "numeric"
It is often useful to use R as a calculator.
2+2
## [1] 4
24-18
## [1] 6
45*8
## [1] 360
84/4
## [1] 21
2^8
## [1] 256
abs(-42)
## [1] 42
We can perform arithmetic with our variables from earlier as well:
z + y
## [1] 165
z - y
## [1] 99
z * y
## [1] 4356
z / y
## [1] 4
sqrt(y)
## [1] 5.744563
log(z)
## [1] 4.882802
exp(y)
## [1] 2.146436e+14
We can mix and match numbers with variables:
z + 4
## [1] 136
y^2
## [1] 1089
z/y + 3
## [1] 7
Why doesn’t this work then?
z + b
## Error in z + b: non-numeric argument to binary operator
Recall that when we created b, we entered b <- "1", which forced R to treat the number 1 as a character. Watch this though!
z + as.numeric(b)
## [1] 133
Note that b is still “1”, but using as.numeric(b) told R to, one time only, treat b as though it were a number if possible. Note that this won’t work with a:
z + as.numeric(a)
## Warning: NAs introduced by coercion
## [1] NA
R is clearly displeased with us.
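Both points can be checked directly. The sketch below re-creates z, a, and b so it runs on its own; the names b_num and a_num are introduced here for illustration, and suppressWarnings() is used only to keep the output tidy.

```r
# Re-create the objects from above so this block is self-contained
z <- 132
a <- "Hello"
b <- "1"

# as.numeric() returns a converted copy; b itself is unchanged
b_num <- as.numeric(b)
class(b)      # still "character"
class(b_num)  # "numeric"
z + b_num     # 133

# Text that cannot be read as a number becomes NA (normally with a warning)
a_num <- suppressWarnings(as.numeric(a))
is.na(a_num)  # TRUE
```

Checking for NA with is.na() after a conversion is a simple way to catch values that did not convert.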
We can assign objects with math too.
q <- z/y
q
## [1] 4
2.1.2 Vectors
Objects can be sets of elements as well; a sequence of elements of the same type is called a vector. We can create a simple vector with the concatenate command c(). The next bit of code creates two vectors:
num1 <- c(1,4,9,16,25)
num2 <- c(1,3,6,10,15)
Vectors can also include characters, although in econometrics this is not usually all that useful:
countries <- c("USA", "Canada", "Mexico")
Mathematical operations can be performed on vectors, though how R treats these operations often depends on context. For example, the following commands perform elementwise operations (i.e., they perform the operation on every element) on the vector num1:
num1 - 1
## [1] 0 3 8 15 24
num1 * 3
## [1] 3 12 27 48 75
sqrt(num1)
## [1] 1 2 3 4 5
Mathematical operations can be done with multiple vectors as well. Typically, you are doing mathematical operations on vectors of the same length, so R will perform pairwise arithmetic, meaning it will match the first element of each vector, the second element of each vector, and so forth:
num1 + num2
## [1] 2 7 15 26 40
num1 * num2
## [1] 1 12 54 160 375
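When the vectors are not the same length, R does not stop with an error. It recycles the shorter vector, repeating it until it matches the longer one, and warns only if the longer length is not a multiple of the shorter. A small sketch (the vectors long and short are made up for illustration):

```r
long  <- c(10, 20, 30, 40)
short <- c(1, 2)

# short is recycled to 1, 2, 1, 2 before the addition
long + short   # 11 22 31 42
```

Recycling is also what happens in num1 - 1 above: the single number 1 is repeated for every element of num1.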
We can extract elements from a vector using brackets. Element extraction is extremely powerful and useful in R. The following commands extract the fifth element from num1 (25) and the first element from num2 (1):
num1[5]
## [1] 25
num2[1]
## [1] 1
We can extract all but certain elements with the negative sign. Let’s see num1 without the third element:
num1[-3]
## [1] 1 4 16 25
A very common use of this functionality is to extract based on a condition. For example, the next command will extract all the elements of num1 that are greater than 5:
num1[num1 > 5]
## [1] 9 16 25
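It can help to see the two steps separately. The condition on its own produces a vector of TRUE and FALSE values, and the brackets keep the elements that line up with TRUE. In this sketch, keep is a name chosen for illustration:

```r
num1 <- c(1, 4, 9, 16, 25)

# The condition alone gives one TRUE/FALSE per element
keep <- num1 > 5
keep           # FALSE FALSE TRUE TRUE TRUE

# Brackets keep the elements where the condition is TRUE
num1[keep]     # 9 16 25

# which() reports the positions that satisfy the condition
which(keep)    # 3 4 5
```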
2.1.3 Packages and Libraries
Every command thus far has used what is called Base R. Base R is the basic software that contains the R programming language and many statistical and graphical tools. However, R is also extensible via packages: user-written sets of commands, often open-source (i.e., freely available), that expand upon the capabilities of R. Packages in R must be installed before they can be used, and must also be loaded every time you use them.
To install a package, you typically use the command install.packages() and put the name of the package to be installed in quotation marks inside the parentheses. For example, to install the EnvStats package you would type install.packages("EnvStats") into R. You only need to install a package once. It is generally a bad idea to include an install.packages() command within a script (more on scripts below), because this leads to reinstalling packages repeatedly, which is a waste of time and often breaks your code anyhow. If you are wondering what EnvStats is all about, try typing ?EnvStats or help(EnvStats) into R!
Once a package is installed, I need to let R know when I want to use it. When you open R via RStudio, the only thing that starts right away is Base R, so the only commands you can use natively are those from Base R. If I want to use the geoMean() function from within the EnvStats package I just installed, I need to let R know where to find the geoMean() function. There are two ways of doing so.
The first method uses the double colon operator, ::, and has the general syntax of package::function. To see this in action, let’s create a vector with 6 months of rates of return for an asset:
ror6 <- 1 + c(.04, .13, -.03, .11, -.05, .08)
Let’s assume I want to calculate the average rate of return, which is where the geometric mean comes in (arithmetic means overstate average rates of return). Next, let’s use the double colon method to calculate the geometric mean using the geoMean() function from the EnvStats package:
EnvStats::geoMean(ror6)
## [1] 1.044461
The double colon operator is useful if you only plan on using a function from a particular package once; however, it is often easier to simply load the package into memory so you can access the function without typing :: all over the place. Loading a package into memory is accomplished with the library() command. So if I wanted to use the EnvStats package, I would type library(EnvStats) (note this time I don’t have quotes) into R and then I could use all of the functions contained within. This next code chunk first loads EnvStats, so I can directly use geoMean() in the following line.
library(EnvStats)
geoMean(ror6)
## [1] 1.044461
Generally speaking, the library() approach is used far more often than the :: approach.
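To see what geoMean() is doing, the same number can be reproduced in base R. The sketch below assumes the usual definition of the geometric mean (the n-th root of the product of n values) and computes it two equivalent ways. The names geo_mean_product and geo_mean_logs are chosen here for illustration.

```r
ror6 <- 1 + c(.04, .13, -.03, .11, -.05, .08)

# n-th root of the product of the n gross returns
geo_mean_product <- prod(ror6)^(1 / length(ror6))

# Equivalent form: exponentiate the average of the logs
geo_mean_logs <- exp(mean(log(ror6)))

geo_mean_product   # about 1.044461, matching geoMean(ror6)
mean(ror6)         # the arithmetic mean is slightly larger
```

The second form is often preferred for long series because multiplying many numbers together can overflow or underflow, while averaging logs does not.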
If you want to get a head start on installing the libraries used in this book, install the following:
- knitr
- tidyverse
- kableExtra
- AER
- stargazer
- wooldridge
- fivethirtyeight
- sandwich
- lmtest
- margins
- MASS
- huxtable
- broom
- jtools
- mlogit
- censReg
- sampleSelection
- scales
- dynlm
- tseries
- collapse
- forecast
- cowplot
- tidyquant
- plm
- gifski
- gganimate
- rnaturalearth
- rnaturalearthdata
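One way to install several of these at once, without reinstalling what is already present, is sketched below. It uses only base R: installed.packages() lists what is on your machine and setdiff() finds the gap. The list is shortened here for space, wanted and missing_pkgs are names chosen for illustration, and the install.packages() call is commented out so the block changes nothing until you choose to run it.

```r
# A few of the packages from the list above (extend as needed)
wanted <- c("knitr", "tidyverse", "AER", "wooldridge", "fivethirtyeight")

# Names of packages already installed on this machine
have <- rownames(installed.packages())

# Packages from the list that still need installing
missing_pkgs <- setdiff(wanted, have)
missing_pkgs

# Uncomment to install only what is missing:
# install.packages(missing_pkgs)
```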
2.2 Working with Data
Datasets in R are typically stored in objects called data frames. Broadly speaking, there are three ways of getting data into R: loading it from a file, finding it in an R package, or using an R package to fetch live data from the internet.
Data can generally be loaded into R from nearly any other spreadsheet or statistics program. The most common format is a Comma Separated Values file, also referred to as a CSV file, which is read with the read.csv() function. There exist many other utilities for importing data into R, including packages such as readxl, haven, Hmisc, and foreign. If the data exists in a somewhat common format, somebody has probably written a package to import it into R!
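A minimal round trip with read.csv(), assuming no data file is at hand: the sketch writes a tiny made-up data frame to a temporary CSV file and reads it back. The object names (toy, csv_path, toy_in) and the values are invented for illustration; with real data you would pass the path to your own file.

```r
# A small made-up data frame
toy <- data.frame(country = c("USA", "Canada", "Mexico"),
                  score   = c(1.5, 2.5, 3.5))

# Write to a temporary file so nothing on disk is overwritten
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read it back in; the result is a data frame
toy_in <- read.csv(csv_path)
class(toy_in)   # "data.frame"
toy_in
```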
This book primarily relies on the second method: pulling data directly out of an installed package. The three data packages this text relies most heavily upon are wooldridge, AER, and fivethirtyeight, which, if you haven’t already, you should install right now:
install.packages("wooldridge")
install.packages("AER")
install.packages("fivethirtyeight")
Additionally, many data sets are built into Base R. To see what data sets are readily available, type:
data()
Let’s say we want to play with the iris data (people love that one, I don’t know why); we can load it into memory with the command:
data(iris)
Now the iris data is loaded into the environment: it is an object called iris.
If I wanted to use any of the data that is in the packages installed above, I would first need to call them into memory using the library() command:
library(wooldridge)
library(AER)
library(fivethirtyeight)
Now, a data() call will list a lot more available datasets!
Let’s load in the CPS1985 dataset from the AER package and learn some basic tools for inspecting and manipulating data:
data(CPS1985)
For most datasets that are built into packages, you can get information about the dataset using the help command:
?CPS1985
To see how big the data is, we might try looking at the dimensions with dim(), the number of columns with ncol(), and the number of rows with nrow().
dim(CPS1985)
## [1] 534 11
ncol(CPS1985)
## [1] 11
nrow(CPS1985)
## [1] 534
These show us that the dataset is 534 rows long and 11 columns wide. IMPORTANT: R always does rows first, columns second. Remembering this will help!
We can also learn about what the data generally look like by using the head() function, which will print the first 6 lines of the data set (we could also use the tail() function to see the last 6 lines of the data set):
head(CPS1985)
## wage education experience age ethnicity region gender occupation
## 1 5.10 8 21 35 hispanic other female worker
## 1100 4.95 9 42 57 cauc other female worker
## 2 6.67 12 1 19 cauc other male worker
## 3 4.00 12 4 22 cauc other male worker
## 4 7.50 12 17 35 cauc other male worker
## 5 13.07 13 9 28 cauc other male worker
## sector union married
## 1 manufacturing no yes
## 1100 manufacturing no yes
## 2 manufacturing no no
## 3 other no no
## 4 other no yes
## 5 other yes no
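We can also use square brackets to subset bits of data. Remember, brackets use the rows, columns convention mentioned above. Here I look at rows 222-225. I leave the column part empty so I get all of the columns. (The labels printed at the left are the row names stored in the dataset, not row positions, which is why they read 221-224.)
CPS1985[222:225,]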
## wage education experience age ethnicity region gender occupation
## 221 3.84 11 25 42 other south female sales
## 222 6.40 12 45 63 cauc other female sales
## 223 5.56 14 5 25 cauc south male sales
## 224 10.00 12 20 38 cauc south male sales
## sector union married
## 221 other no yes
## 222 other no yes
## 223 other no no
## 224 manufacturing no yes
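The same rows, columns convention covers every combination of the two slots. A short sketch using the built-in iris data, so it runs without any extra packages:

```r
data(iris)

dim(iris)          # 150 rows, 5 columns

iris[1, ]          # row 1, all columns
iris[, 1]          # all rows, column 1 (returned as a vector)
iris[2, 3]         # row 2, column 3: a single value
iris[1:3, c(1, 5)] # rows 1 to 3, columns 1 and 5
```

Leaving a slot empty means "everything" for that dimension, but the comma still has to be there.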
We can also do logical operators here, which requires us to learn a bit about using the dollar sign ($) operator. We use a dollar sign to refer to a variable or subobject within another object. If I want to refer to the age variable within the CPS1985 dataset, I would refer to it as CPS1985$age. With this in mind, let’s look at every row where age is over 60:
CPS1985[CPS1985$age>60 , ]
## wage education experience age ethnicity region gender occupation
## 31 4.00 12 46 64 cauc south female worker
## 62 7.00 3 55 64 hispanic south male worker
## 200 8.80 14 41 61 cauc south male management
## 217 12.50 12 43 61 cauc other male sales
## 222 6.40 12 45 63 cauc other female sales
## 230 19.98 14 44 64 cauc south male sales
## 239 13.71 12 43 61 cauc other male sales
## 262 11.67 12 43 61 cauc other female office
## 268 5.25 12 45 63 cauc other female office
## 278 9.17 12 44 62 cauc other female office
## 331 11.71 16 42 64 cauc other female office
## 340 10.62 12 45 63 cauc other female office
## 346 6.00 4 54 64 cauc other male services
## 355 3.50 9 48 63 cauc other male services
## 368 3.60 12 43 61 hispanic south female services
## 370 3.40 8 49 63 cauc other female services
## 396 8.00 12 43 61 other other female services
## 403 5.55 11 45 62 cauc other female services
## 405 8.93 8 47 61 hispanic other male services
## 413 3.50 9 47 62 cauc other male services
## 426 9.50 9 46 61 cauc other female services
## 485 22.20 18 40 64 cauc other female technical
## 496 22.83 18 37 61 cauc other female technical
## sector union married
## 31 other no no
## 62 manufacturing no yes
## 200 other no yes
## 217 manufacturing no yes
## 222 other no yes
## 230 other no yes
## 239 other yes yes
## 262 construction no yes
## 268 other no yes
## 278 manufacturing no yes
## 331 manufacturing no no
## 340 manufacturing no no
## 346 other no yes
## 355 other no no
## 368 other no yes
## 370 other no no
## 396 other yes yes
## 403 other yes no
## 405 other yes yes
## 413 other yes yes
## 426 other yes yes
## 485 other no no
## 496 manufacturing no no
We could drill down even further. What if we only want to see unmarried white females in the south? Here, we need to use the double equals sign (== is the boolean operator for “is equal to”), put quotes around the values that aren’t numbers, and incorporate a bunch of ampersands:
CPS1985[CPS1985$ethnicity == "cauc" & CPS1985$region == "south" & CPS1985$gender == "female" & CPS1985$married == "no", ]
## wage education experience age ethnicity region gender occupation sector
## 31 4.00 12 46 64 cauc south female worker other
## 33 5.00 17 1 24 cauc south female worker other
## 83 3.75 16 13 35 cauc south female worker other
## 88 6.25 12 6 24 cauc south female worker other
## 193 10.00 18 13 37 cauc south female management other
## 203 7.81 12 1 19 cauc south female management other
## 229 4.75 12 10 28 cauc south female sales other
## 274 5.25 16 2 24 cauc south female office other
## 275 10.32 13 28 47 cauc south female office other
## 283 4.25 13 0 19 cauc south female office other
## 307 7.50 12 20 38 cauc south female office other
## 310 3.55 13 1 20 cauc south female office other
## 312 4.50 13 0 19 cauc south female office other
## 323 5.00 12 26 44 cauc south female office other
## 349 6.00 15 26 47 cauc south female services other
## 402 6.88 14 10 30 cauc south female services other
## 407 3.50 10 33 49 cauc south female services other
## 425 4.55 8 45 59 cauc south female services other
## 494 24.98 16 5 27 cauc south female technical other
## 517 7.45 12 25 43 cauc south female technical other
## union married
## 31 no no
## 33 no no
## 83 no no
## 88 no no
## 193 no no
## 203 no no
## 229 no no
## 274 no no
## 275 no no
## 283 no no
## 307 no no
## 310 no no
## 312 no no
## 323 no no
## 349 no no
## 402 no no
## 407 no no
## 425 no no
## 494 no no
## 517 no no
Typing all that CPS1985$ stuff gets annoying. Understanding dollar sign notation is essential, but to make our lives a bit easier, this might be a good place to attach() our data:
attach(CPS1985)
## The following objects are masked from CPS1985 (pos = 3):
##
## age, education, ethnicity, experience, gender, married,
## occupation, region, sector, union, wage
## The following objects are masked from CPS1985 (pos = 5):
##
## age, education, ethnicity, experience, gender, married,
## occupation, region, sector, union, wage
Attaching a dataset works a bit like loading a library in that now, R will look inside CPS1985 for variables! Now, that previous command can be simplified as:
CPS1985[ethnicity == "cauc" & region == "south" & gender == "female" & married == "no", ]
## wage education experience age ethnicity region gender occupation sector
## 31 4.00 12 46 64 cauc south female worker other
## 33 5.00 17 1 24 cauc south female worker other
## 83 3.75 16 13 35 cauc south female worker other
## 88 6.25 12 6 24 cauc south female worker other
## 193 10.00 18 13 37 cauc south female management other
## 203 7.81 12 1 19 cauc south female management other
## 229 4.75 12 10 28 cauc south female sales other
## 274 5.25 16 2 24 cauc south female office other
## 275 10.32 13 28 47 cauc south female office other
## 283 4.25 13 0 19 cauc south female office other
## 307 7.50 12 20 38 cauc south female office other
## 310 3.55 13 1 20 cauc south female office other
## 312 4.50 13 0 19 cauc south female office other
## 323 5.00 12 26 44 cauc south female office other
## 349 6.00 15 26 47 cauc south female services other
## 402 6.88 14 10 30 cauc south female services other
## 407 3.50 10 33 49 cauc south female services other
## 425 4.55 8 45 59 cauc south female services other
## 494 24.98 16 5 27 cauc south female technical other
## 517 7.45 12 25 43 cauc south female technical other
## union married
## 31 no no
## 33 no no
## 83 no no
## 88 no no
## 193 no no
## 203 no no
## 229 no no
## 274 no no
## 275 no no
## 283 no no
## 307 no no
## 310 no no
## 312 no no
## 323 no no
## 349 no no
## 402 no no
## 407 no no
## 425 no no
## 494 no no
## 517 no no
Whether you want to use dollar signs or attach data is often a matter of personal preference. Personally, I found that early on learning R, attaching data was a much easier approach, but as I got into more intermediate and advanced applications, I started using dollar sign notation far more often.
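Two base R tools are worth knowing alongside attach(). with() evaluates a single expression inside a data frame without changing where R looks for names afterwards, and detach() undoes an attach(). The sketch uses the built-in iris data rather than CPS1985, and the object names big_setosa and big_setosa2 are made up for illustration:

```r
data(iris)

# with() looks up Species and Sepal.Length inside iris for this one expression
big_setosa <- with(iris, iris[Species == "setosa" & Sepal.Length > 5.5, ])

# The same rows using dollar sign notation
big_setosa2 <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5.5, ]
identical(big_setosa, big_setosa2)   # TRUE

# attach() stays in effect until the data are detached
attach(iris)
mean(Sepal.Length)
detach(iris)
```

Detaching when you are done keeps old copies of a dataset from piling up on the search path, which is what produces "masked" messages like the ones shown above.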
Sometimes it is easier to filter out everything BUT a certain group. This is accomplished with not equals signs (!=) and/or negative signs. This next line shows me the subset of data that is males over the age of 55 who are not in management. Because this subgroup only includes males, the gender column is irrelevant, so I’m getting rid of the 7th column.
CPS1985[gender == "male" & occupation != "management" & age>55, -7]
## wage education experience age ethnicity region occupation sector
## 16 8.00 7 44 57 cauc south worker other
## 25 5.75 6 45 57 cauc south worker manufacturing
## 62 7.00 3 55 64 hispanic south worker manufacturing
## 69 6.75 10 41 57 other south worker manufacturing
## 109 11.00 8 42 56 cauc other worker manufacturing
## 112 15.00 12 40 58 cauc other worker construction
## 137 3.35 7 43 56 cauc south worker manufacturing
## 145 7.50 12 38 56 cauc south worker other
## 147 11.25 12 41 59 cauc other worker other
## 217 12.50 12 43 61 cauc other sales manufacturing
## 230 19.98 14 44 64 cauc south sales other
## 239 13.71 12 43 61 cauc other sales other
## 346 6.00 4 54 64 cauc other services other
## 355 3.50 9 48 63 cauc other services other
## 405 8.93 8 47 61 hispanic other services other
## 411 6.50 11 39 56 cauc south services other
## 413 3.50 9 47 62 cauc other services other
## 468 8.00 18 33 57 cauc other technical other
## 481 7.00 18 33 57 cauc other technical other
## 482 18.00 16 38 60 cauc south technical other
## 513 15.00 12 39 57 cauc other technical other
## union married
## 16 no yes
## 25 no yes
## 62 no yes
## 69 yes yes
## 109 no yes
## 112 yes yes
## 137 no yes
## 145 no yes
## 147 yes yes
## 217 no yes
## 230 no yes
## 239 yes yes
## 346 no yes
## 355 no no
## 405 yes yes
## 411 no yes
## 413 yes yes
## 468 yes no
## 481 no yes
## 482 no yes
## 513 yes yes
Another useful way to look at the data is via the str() function, which tells us about the structure of the data:
str(CPS1985)
## 'data.frame': 534 obs. of 11 variables:
## $ wage : num 5.1 4.95 6.67 4 7.5 ...
## $ education : num 8 9 12 12 12 13 10 12 16 12 ...
## $ experience: num 21 42 1 4 17 9 27 9 11 9 ...
## $ age : num 35 57 19 22 35 28 43 27 33 27 ...
## $ ethnicity : Factor w/ 3 levels "cauc","hispanic",..: 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 2 levels "south","other": 2 2 2 2 2 2 1 2 2 2 ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 1 1 1 1 1 1 1 1 ...
## $ occupation: Factor w/ 6 levels "worker","technical",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sector : Factor w/ 3 levels "manufacturing",..: 1 1 1 3 3 3 3 3 1 3 ...
## $ union : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ married : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 1 1 2 1 ...
This tells us about the types of variables we have in our data frame. The first 4 are numeric, the rest are factor (categorical) variables.
You can also see what the variable type is with the class() command.
class(wage)
## [1] "numeric"
class(age)
## [1] "numeric"
class(married)
## [1] "factor"
class(union)
## [1] "factor"
A word of caution about the class(union) command: there is a function in base R called union, so once the data are attached the name union can refer to two different things. Here R found the attached column first, but for a name like this it is safer to use $ notation, even though we used attach() on the data set. Name clashes of this kind are also what the message printed by our original attach() command was warning us about.
class(CPS1985$union)
## [1] "factor"
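The ambiguity is easy to reproduce with a toy data frame (workers is a made-up name and the values are invented). Without attach(), the bare name union can only mean the base R function, while the dollar sign form always means the column:

```r
workers <- data.frame(union = factor(c("no", "yes", "no")))

class(union)           # "function": base R's set-union function
class(workers$union)   # "factor": the column inside the data frame
```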
If you have a categorical/factor variable, you can use the levels() function to see what all the possible values are:
levels(ethnicity)
## [1] "cauc" "hispanic" "other"
levels(occupation)
## [1] "worker" "technical" "services" "office" "sales"
## [6] "management"
We can get a generic summary of the data with the summary() command.
summary(CPS1985)
## wage education experience age
## Min. : 1.000 Min. : 2.00 Min. : 0.00 Min. :18.00
## 1st Qu.: 5.250 1st Qu.:12.00 1st Qu.: 8.00 1st Qu.:28.00
## Median : 7.780 Median :12.00 Median :15.00 Median :35.00
## Mean : 9.024 Mean :13.02 Mean :17.82 Mean :36.83
## 3rd Qu.:11.250 3rd Qu.:15.00 3rd Qu.:26.00 3rd Qu.:44.00
## Max. :44.500 Max. :18.00 Max. :55.00 Max. :64.00
## ethnicity region gender occupation sector
## cauc :440 south:156 male :289 worker :156 manufacturing: 99
## hispanic: 27 other:378 female:245 technical :105 construction : 24
## other : 67 services : 83 other :411
## office : 97
## sales : 38
## management: 55
## union married
## no :438 no :184
## yes: 96 yes:350
##
##
##
##
Note that R provides different output for the different types of data. For the numeric data, R gives us quantitative summary statistics – means, min/max, and quartiles. For the categorical data, we get raw counts.
Sometimes, datasets will code a categorical variable as a number. Here, I will code a variable called female which will be 1 for females and 0 for males. So as to not overwrite the CPS1985 data in our environment, I will first copy CPS1985 into an object called tempdata and then create a new variable called female with the ifelse() command.
tempdata <- CPS1985
tempdata$female <- ifelse(tempdata$gender == "female", 1, 0)
tempdata[200:209,c(7,12)]
## gender female
## 199 male 0
## 200 male 0
## 201 male 0
## 202 male 0
## 203 female 1
## 204 female 1
## 205 female 1
## 206 male 0
## 207 female 1
## 208 female 1
You can see from the output that my code apparently worked. We know that the female variable is essentially categorical, but what happens when we inspect the class?
class(tempdata$female)
## [1] "numeric"
And if we summarize it:
summary(tempdata$female)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4588 1.0000 1.0000
This tells us that 45.88% of our data is female, which is useful, but we might want to force R to either treat the female variable as a factor, or convert it to a factor.
summary(as.factor(tempdata$female))
## 0 1
## 289 245
Now it is giving counts for us. We could also permanently code the variable as a factor:
tempdata$female <- factor(tempdata$female, levels = c(0,1), labels = c("male", "female"))
summary(tempdata$female)
## male female
## 289 245
Another useful trick is to combine square brackets with our functions. Let’s look at the summary of wage data, split out by male vs female.
summary(wage[gender == "male"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 8.930 9.995 13.000 26.290
summary(wage[gender == "female"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.750 4.750 6.800 7.879 10.000 44.500
Maybe we want to make a new variable in our dataset to highlight which of our individuals are females in management. The ifelse command works, but another useful method is to use the cbind command. First, let’s create a vector called femanager:
femanager <- gender == "female" & occupation == "management"
summary(femanager)
## Mode FALSE TRUE
## logical 513 21
You can see that this is an object with 534 observations that are TRUE or FALSE. Apparently we have 21 female managers in our data. We can use cbind to add it to our data:
tempdata <- cbind(CPS1985, femanager)
class(tempdata$femanager)
## [1] "logical"
The class is “logical”, but this can be converted to factor easily:
tempdata$femanager <- factor(tempdata$femanager, levels = c(FALSE,TRUE), labels = c("no", "yes"))
class(tempdata$femanager)
## [1] "factor"
summary(tempdata$femanager)
## no yes
## 513 21
2.2.1 Basics of dplyr and tidyverse
It is often far easier to use the tidyverse for data manipulation than it is to use base R. The tidyverse is a family of packages, most notably dplyr and ggplot2, that will be used extensively throughout this text. You can load this whole family of packages at once with the library(tidyverse) command (keep in mind you may need to install.packages("tidyverse") first!).
The most important dplyr verbs to remember are:
- select: selects columns
- filter: filters rows
- arrange: re-orders rows
- mutate: creates new columns
- summarize: summarizes stuff
- group_by: allows you to split-apply-recombine data, along with ungroup
One of the most useful things to learn early on in dplyr is the pipe operator, which looks like this: %>%. The keyboard shortcut for %>% in RStudio is Control-Shift-M. That’s a useful one to memorize! The way to think about the pipe operator is that it “pipes” the results from one line of code into the next. I read it as sort of saying “and then”. Let’s put some of these new tools to work.
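To make the “and then” reading concrete, the sketch below writes the same calculation twice, once as nested function calls and once with pipes. It assumes the dplyr package is installed (loading it makes %>% available), and it re-creates the num1 vector from earlier so the block runs on its own. The names nested and piped are chosen for illustration.

```r
library(dplyr)

num1 <- c(1, 4, 9, 16, 25)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(num1)), 1)

# Piped calls are read top to bottom:
# take num1, and then sqrt, and then mean, and then round
piped <- num1 %>%
  sqrt() %>%
  mean() %>%
  round(1)

nested   # 3
piped    # 3
```

The pipe places whatever is on its left into the first argument of the function on its right, so num1 %>% sqrt() is just another way of writing sqrt(num1).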
Recall that above we looked at subgroups of the CPS, starting with those older than 60. Using base R, we got this with CPS1985[CPS1985$age>60 , ]; however, it might be more intuitive to use dplyr and run:
CPS1985 %>%
  filter(age>60)
## wage education experience age ethnicity region gender occupation
## 31 4.00 12 46 64 cauc south female worker
## 62 7.00 3 55 64 hispanic south male worker
## 200 8.80 14 41 61 cauc south male management
## 217 12.50 12 43 61 cauc other male sales
## 222 6.40 12 45 63 cauc other female sales
## 230 19.98 14 44 64 cauc south male sales
## 239 13.71 12 43 61 cauc other male sales
## 262 11.67 12 43 61 cauc other female office
## 268 5.25 12 45 63 cauc other female office
## 278 9.17 12 44 62 cauc other female office
## 331 11.71 16 42 64 cauc other female office
## 340 10.62 12 45 63 cauc other female office
## 346 6.00 4 54 64 cauc other male services
## 355 3.50 9 48 63 cauc other male services
## 368 3.60 12 43 61 hispanic south female services
## 370 3.40 8 49 63 cauc other female services
## 396 8.00 12 43 61 other other female services
## 403 5.55 11 45 62 cauc other female services
## 405 8.93 8 47 61 hispanic other male services
## 413 3.50 9 47 62 cauc other male services
## 426 9.50 9 46 61 cauc other female services
## 485 22.20 18 40 64 cauc other female technical
## 496 22.83 18 37 61 cauc other female technical
## sector union married
## 31 other no no
## 62 manufacturing no yes
## 200 other no yes
## 217 manufacturing no yes
## 222 other no yes
## 230 other no yes
## 239 other yes yes
## 262 construction no yes
## 268 other no yes
## 278 manufacturing no yes
## 331 manufacturing no no
## 340 manufacturing no no
## 346 other no yes
## 355 other no no
## 368 other no yes
## 370 other no no
## 396 other yes yes
## 403 other yes no
## 405 other yes yes
## 413 other yes yes
## 426 other yes yes
## 485 other no no
## 496 manufacturing no no
You start with CPS1985, and then (%>%) filter it down to everybody who is over 60. The advantages of dplyr become clearer with more complicated filters. Rather than CPS1985[CPS1985$ethnicity == "cauc" & CPS1985$region == "south" & CPS1985$gender == "female" & CPS1985$married == "no", ], we might type the following code, which is much cleaner and easier to follow:
CPS1985 %>%
  filter(ethnicity == "cauc") %>%
  filter(region == "south") %>%
  filter(gender == "female") %>%
  filter(married == "no")
## wage education experience age ethnicity region gender occupation sector
## 31 4.00 12 46 64 cauc south female worker other
## 33 5.00 17 1 24 cauc south female worker other
## 83 3.75 16 13 35 cauc south female worker other
## 88 6.25 12 6 24 cauc south female worker other
## 193 10.00 18 13 37 cauc south female management other
## 203 7.81 12 1 19 cauc south female management other
## 229 4.75 12 10 28 cauc south female sales other
## 274 5.25 16 2 24 cauc south female office other
## 275 10.32 13 28 47 cauc south female office other
## 283 4.25 13 0 19 cauc south female office other
## 307 7.50 12 20 38 cauc south female office other
## 310 3.55 13 1 20 cauc south female office other
## 312 4.50 13 0 19 cauc south female office other
## 323 5.00 12 26 44 cauc south female office other
## 349 6.00 15 26 47 cauc south female services other
## 402 6.88 14 10 30 cauc south female services other
## 407 3.50 10 33 49 cauc south female services other
## 425 4.55 8 45 59 cauc south female services other
## 494 24.98 16 5 27 cauc south female technical other
## 517 7.45 12 25 43 cauc south female technical other
## union married
## 31 no no
## 33 no no
## 83 no no
## 88 no no
## 193 no no
## 203 no no
## 229 no no
## 274 no no
## 275 no no
## 283 no no
## 307 no no
## 310 no no
## 312 no no
## 323 no no
## 349 no no
## 402 no no
## 407 no no
## 425 no no
## 494 no no
## 517 no no
Similarly, rather than the command CPS1985[gender == "male" & occupation != "management" & age>55, -7], we could specify:
CPS1985 %>%
  filter(gender == "male") %>%
  filter(occupation != "management") %>%
  filter(age>55) %>%
  select(-gender)
## wage education experience age ethnicity region occupation sector
## 16 8.00 7 44 57 cauc south worker other
## 25 5.75 6 45 57 cauc south worker manufacturing
## 62 7.00 3 55 64 hispanic south worker manufacturing
## 69 6.75 10 41 57 other south worker manufacturing
## 109 11.00 8 42 56 cauc other worker manufacturing
## 112 15.00 12 40 58 cauc other worker construction
## 137 3.35 7 43 56 cauc south worker manufacturing
## 145 7.50 12 38 56 cauc south worker other
## 147 11.25 12 41 59 cauc other worker other
## 217 12.50 12 43 61 cauc other sales manufacturing
## 230 19.98 14 44 64 cauc south sales other
## 239 13.71 12 43 61 cauc other sales other
## 346 6.00 4 54 64 cauc other services other
## 355 3.50 9 48 63 cauc other services other
## 405 8.93 8 47 61 hispanic other services other
## 411 6.50 11 39 56 cauc south services other
## 413 3.50 9 47 62 cauc other services other
## 468 8.00 18 33 57 cauc other technical other
## 481 7.00 18 33 57 cauc other technical other
## 482 18.00 16 38 60 cauc south technical other
## 513 15.00 12 39 57 cauc other technical other
## union married
## 16 no yes
## 25 no yes
## 62 no yes
## 69 yes yes
## 109 no yes
## 112 yes yes
## 137 no yes
## 145 no yes
## 147 yes yes
## 217 no yes
## 230 no yes
## 239 yes yes
## 346 no yes
## 355 no no
## 405 yes yes
## 411 no yes
## 413 yes yes
## 468 yes no
## 481 no yes
## 482 no yes
## 513 yes yes
We can combine group_by and summarize to easily get average wage by gender:
CPS1985 %>%
  group_by(gender) %>%
  summarize(wage = mean(wage))
## # A tibble: 2 x 2
## gender wage
## <fct> <dbl>
## 1 male 9.99
## 2 female 7.88
Both dplyr and base R get to the same place, but generally speaking dplyr is a bit more intuitive, especially for more complicated tasks.
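For comparison, here is a sketch of the base R route to the same kind of grouped average. It uses the built-in mtcars data rather than CPS1985 so that it runs without loading any packages; the object names by_am and by_am_df are made up for illustration.

```r
# Base R counterparts to group_by() + summarize(), shown on the built-in mtcars data
data(mtcars)

# tapply() splits mpg by the values of am and applies mean() to each group
by_am <- tapply(mtcars$mpg, mtcars$am, mean)
by_am

# aggregate() computes the same group means but returns a data frame
by_am_df <- aggregate(mpg ~ am, data = mtcars, FUN = mean)
by_am_df
```

Both calls give one average per group, much like the dplyr pipeline above; tapply() returns a named vector, while aggregate() returns a data frame.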
2.3 Scripting in R
Now that we have seen some of the building blocks of coding in R, let’s talk about how we put them together. Writing your code in a script is essential for a variety of reasons:
- Scripting makes your work reproducible.
- Scripting allows you to document your code.
- Scripting makes it easier to work on a project over multiple sessions.
A script is essentially a text file that contains a list of commands that R will execute in order. Rather than type one command at a time in the console window (bottom left of RStudio), you type the series of commands in the script window (top left of RStudio) and can run all or parts of the script at once.
R can be used not only for data analysis, but also for writing up that analysis in a document that integrates text with code and statistical output. An R script is used to create (and recreate!) a series of commands in R. An R Markdown document incorporates script elements into a high quality document that can be shared in a variety of formats.
2.3.1 R Scripts
To create a new script, use the button on the top left of RStudio. This can also be done in the file menu or with Ctrl+Shift+N. You can run the whole script by using the Source button, or just the current line or selection by using the Run button (both at the top right of the script pane). The RStudio script editor is basically a fancy text editor. This means you can’t use the Enter key to run a line of code; instead, put your cursor on that line and press Ctrl+Enter. You can also highlight a line or a set of lines with your cursor and execute them by hitting Ctrl+Enter.
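Besides the buttons and shortcuts in RStudio, base R's source() function also runs every line of a script file in order. Here is a small sketch, using a throwaway script written to a temporary file (the file contents and the object name script_mean are hypothetical):

```r
# Write a tiny two-line script to a temporary file (illustrative content only)
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10",
             "script_mean <- mean(x)"),
           script_path)

# source() executes the commands in the file from top to bottom
source(script_path)

# Objects created by the script are now available in the session
script_mean
```

In practice you would pass source() the path to your own saved script rather than a temporary file.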
2.3.2 R Markdown
You can also embed R code into an R Markdown document to make high quality documents in a variety of types:
- HTML
- LaTeX/PDF
- Word
- Slide Presentations
- Dashboards
- e-Books
There is a lot of fancy formatting that goes into all of these types that we will not be going into. Learning some basic R Markdown is useful, as it removes the need to cut and paste things from R into a word processing program. To use R Markdown, you will need to install the rmarkdown package, and installing knitr probably won’t hurt either.
This text will not delve deep into R Markdown; good places to learn the basics of R Markdown formatting syntax are http://rmarkdown.rstudio.com/lesson-1.html and chapters 1-3 of R Markdown: The Definitive Guide by Yihui Xie, https://bookdown.org/yihui/rmarkdown/.
2.3.3 Scripting Best Practices
- Comment your work!
You may work on a script with another person who has no idea what your code is all about, or you may not look at a script for a few weeks and have forgotten what you were trying to accomplish! Putting comments in your code is a way of making notes and passing them to people with whom you are working and/or your future self. By using the # symbol, you can write comments that R will not execute but that tell you what is going on.
# R will not read this line of code when I execute the script
data(mtcars) # R reads data(mtcars) but it doesn't read this. This is a good place to put comments
mean(mtcars$mpg) # especially early as you are learning R
mean(mtcars$qsec) # extensive commenting of your code is invaluable for remembering what you are doing
t.test(data = mtcars, mtcars$mpg ~ mtcars$am) # so when you look at a script later, you know what you did,
t.test(data = mtcars, mtcars$qsec ~ mtcars$am) # and why you did it.
A more serious version of commenting on the above code might look like:
# Estimate whether having an automatic vs manual transmission affects mpg or quarter-mile time.
# Note that for the variable am, automatic = 0 and manual = 1.
data(mtcars)
mean(mtcars$mpg)
mean(mtcars$qsec)
t.test(mpg ~ am, data = mtcars)
t.test(qsec ~ am, data = mtcars)
It is very useful to comment your scripts throughout, and also to put a comment at the top of your script to give a general description of what the script does. For example, if I were doing a bunch of analysis on the mtcars dataset I used in the previous example, my first couple lines of code might be:
# This is code to look at the effect of manual vs automatic transmissions in cars from the mtcars data set.
# Code developed by Matt Dobra
- Load required libraries early
Base R is powerful, but what makes R a state-of-the-art programming language is its user-written packages.
Remember, you only need to install a package once. This is done with the install.packages() function. For example:
install.packages("tidyverse")
will install the tidyverse package onto your computer. Once it is installed, you need to load it into your session whenever you use it. Never put the install.packages() line into your scripts, but always put the library() line into your script, and do it at the very top. For example, nearly every script I write includes the following line toward the very top of the script:
library(tidyverse)
And, of course, absolutely NONE of them have install.packages("tidyverse") in them anywhere.
If I wind up needing more packages as I develop the script, I try to put their library call at the beginning of the script, not at the end. For example, in my R scripts that I use to write my Principles of Macroeconomics course notes, I have the following code in lines 13-23, as I wind up using these packages in nearly every script for that class.
library(knitr) # markdown language
library(tidyverse) # Keep things tidy
library(kableExtra) # Table Formatting
library(ggthemes) # ggplot addon
library(ggpubr) # ggplot addon
library(DiagrammeR) # Makes Flowcharts
library(WDI) # World Development Indicators
library(quantmod) # Gets FRED data
library(scales) # labeling improvement for ggplots.
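If you want a script to fail early with a helpful message when a package has not been installed yet, one possible approach (a sketch, not something this text requires) is to check with requireNamespace() near the top of the script instead of calling install.packages(). The helper name check_installed is hypothetical, and the example checks two packages that ship with R so it runs anywhere.

```r
# Hypothetical helper: stop with a clear message if any listed package is not installed
check_installed <- function(pkgs) {
  # requireNamespace() returns TRUE/FALSE without attaching the package
  found <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  not_found <- pkgs[!found]
  if (length(not_found) > 0) {
    stop("Install these packages once from the console: ",
         paste(not_found, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# stats and utils come with every R installation, so this check passes
check_installed(c("stats", "utils"))
```

The script itself still never installs anything; it only tells whoever runs it what to install once from the console.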
- Break your code into labeled sections
A typical script might have 3 distinct sections:
- Preamble - a section that overviews the code, loads libraries and data, etc.
- Data Wrangling - a section that transforms the loaded data into what you will be analyzing
- Statistical Analysis and Visualization - a section that does the econometrics and makes graphs.
Use # comments to break these up. Make the section breaks stand out and obvious. For example:
#----------------#
# Data Wrangling #
#----------------#
################################
#~~~~~Statistical Analysis~~~~~#
################################
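Putting the three sections together, a complete (if tiny) script might be laid out like the sketch below. It uses only base R and the built-in mtcars data; the object names cars and mpg_test are made up for illustration.

```r
#----------#
# Preamble #
#----------#
# Looks at whether mpg differs between automatic and manual cars in mtcars.
data(mtcars)

#----------------#
# Data Wrangling #
#----------------#
# Turn the 0/1 variable am into a labeled factor (0 = automatic, 1 = manual)
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

#----------------------------------------#
# Statistical Analysis and Visualization #
#----------------------------------------#
# t-test of mpg by transmission type, then a boxplot of the same comparison
mpg_test <- t.test(mpg ~ transmission, data = cars)
mpg_test
boxplot(mpg ~ transmission, data = cars)
```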
You will probably discover more useful tips as you go along.