Chapter 2 R you Ready?

Before we can get into data analysis, we need to first get acquainted with the R environment and learn some basic functionality. As discussed in Chapter 1, most of what we do in R is one of:

  1. Creating objects,
  2. Performing functions on objects to create new objects, or
  3. Looking at objects

The next few sections will discuss how to accomplish these tasks.

2.1 Basic Commands

2.1.1 Assigning Objects and Basic Math

While using R we spend a lot of time creating, defining, and manipulating objects. The preferred way of creating an object is with an arrow <-. You can also use an equals sign (=) but that’s generally frowned upon. We begin by creating an object q that is the number 42.

q <- 42

After running this line, the environment window (top right pane) in R will have a new object in it called q. You can look at the object in your environment window to see what its value is, or you can use one of the following methods to see what q contains:

print(q)
## [1] 42

or:

q
## [1] 42

Something you may struggle with a lot early on in your R journey is dealing with the fact that R is case-sensitive and is VERY SERIOUS about it! If I try typing Q into R, I get an error:

Q
## Error in eval(expr, envir, enclos): object 'Q' not found

Object names can include letters, numbers, periods, and underscores, and should begin with a letter. I could create objects with names like q.1 or q_1, but not something like 1q (starts with a number) or q!1 (invalid character).

q.1 <- 2.718
q_1 <- 3.142

The fact that q is an object that contains the number 42 will remain in R’s memory until R is restarted, I overwrite q, or I remove q. Overwriting a variable is easy; simply assign a new value to a variable:

q <- 420
q
## [1] 420

Removing a variable is accomplished with the rm() function. If I run the command rm(q), the object q will be removed from the environment.

rm(q)
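If you want to confirm that rm() did its job without looking at the environment window, one option is the base R function exists(), which reports whether a name is currently defined. A minimal sketch, using a throwaway object name (tmp_value) that is not part of the text:

```r
# tmp_value is a made-up object used only for this illustration
tmp_value <- 10
exists("tmp_value")   # TRUE: the object is in memory

rm(tmp_value)
exists("tmp_value")   # FALSE: the object has been removed
```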

Occasionally you may want to remove all the objects from the environment; there is a useful (but tricky to remember) command that removes all objects from memory: rm(list=ls()).

rm(list=ls())

After running this command, we should see that our environment is clear. Let’s assign a couple numbers to objects for the next bit of discussion:

z <- 132
y <- 33

Objects don’t have to be just numbers. They can be words too.

a <- "Hello"
a
## [1] "Hello"

Even though 1 is a number, wrapping it in quotation marks means R treats it like a word, not a number.

b <- "1"

If we want to know what type of an object something is, we can use the class() command.

class(a)
## [1] "character"
class(b)
## [1] "character"
class(y)
## [1] "numeric"

It is often useful to use R as a calculator.

2+2
## [1] 4
24-18
## [1] 6
45*8
## [1] 360
84/4
## [1] 21
2^8
## [1] 256
abs(-42)
## [1] 42

We can perform arithmetic with our variables from earlier as well:

z+y
## [1] 165
z-y
## [1] 99
z*y
## [1] 4356
z/y
## [1] 4
sqrt(y)
## [1] 5.744563
log(z)
## [1] 4.882802
exp(y)
## [1] 2.146436e+14

We can mix and match numbers with variables:

z+4
## [1] 136
y^2
## [1] 1089
z/y+3
## [1] 7

Why doesn’t this work then?

z+b
## Error in z + b: non-numeric argument to binary operator

Recall that when we created b, we entered b <- "1", which forced R to treat the number 1 as a character. Watch this though!

z+as.numeric(b)
## [1] 133

Note that b is still “1”, but using as.numeric(b) told R to, one time only, treat b as though it were a number if possible. Note that this won’t work with a:

z+as.numeric(a)
## Warning: NAs introduced by coercion
## [1] NA

R is clearly displeased with us.
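To make the "one time only" point concrete, here is a small sketch. It recreates a and b so that it runs on its own; the names b_num and a_num are mine, not objects used elsewhere in the text. The idea is that as.numeric() hands back a converted copy, so the conversion only sticks if you assign the result to an object.

```r
b <- "1"       # a character, as in the text
a <- "Hello"

# as.numeric() returns a converted copy; b itself is unchanged
b_num <- as.numeric(b)
class(b)       # "character"
class(b_num)   # "numeric"

# Text that is not a number becomes NA
# (suppressWarnings() just hides the coercion warning)
a_num <- suppressWarnings(as.numeric(a))
is.na(a_num)   # TRUE
```

Checking for NA with is.na() after a conversion is a simple way to find out whether R was able to make sense of the input.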

We can assign objects with math too.

q <- z/y
q
## [1] 4
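One thing worth checking when you mix operations, as in z/y+3 earlier, is the order in which R evaluates them. R follows the usual order of operations: exponents first, then multiplication and division, then addition and subtraction, with parentheses overriding everything. The sketch below recreates z and y so it runs on its own:

```r
z <- 132
y <- 33

z / y + 3      # division happens first: 4 + 3 = 7
z / (y + 3)    # parentheses first: 132 / 36, about 3.67

-2^2           # the exponent is applied before the minus sign: -4
(-2)^2         # parentheses force the sign first: 4
```

When in doubt, adding parentheses costs nothing and makes the intent obvious.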

2.1.2 Vectors

Objects can be sets of elements as well; a sequence of elements of the same type is called a vector. We can create a simple vector with the concatenate command c(). The next bit of code creates two vectors:

num1 <- c(1,4,9,16,25)
num2 <- c(1,3,6,10,15)

Vectors can also include characters, although in econometrics this is not usually all that useful:

countries <- c("USA", "Canada", "Mexico")

Mathematical operations can be performed on vectors, though how R treats these operations often depends on context. For example, the following commands perform elementwise operations (i.e. they perform the operation on every element) on the vector num1:

num1-1
## [1]  0  3  8 15 24
num1*3
## [1]  3 12 27 48 75
sqrt(num1)
## [1] 1 2 3 4 5

Mathematical operations can be done with multiple vectors as well. Typically, you are doing mathematical operations on vectors of the same length, so R will perform pairwise arithmetic, meaning it will match the first element of each vector, the second element of each vector, and so forth:

num1+num2
## [1]  2  7 15 26 40
num1*num2
## [1]   1  12  54 160 375
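It is worth knowing what happens if the vectors are not the same length: R "recycles" the shorter vector, repeating it until it is as long as the longer one. The following sketch uses a made-up two-element vector called short to show this:

```r
num1 <- c(1, 4, 9, 16, 25)
short <- c(10, 20)   # made-up vector for illustration

# The first four elements of num1 plus short: short is reused as 10, 20, 10, 20
num1[1:4] + short    # 11 24 19 36

# With all five elements, R still recycles, but warns that
# 5 is not a multiple of 2:
# num1 + short
```

Recycling is the same mechanism that makes num1-1 work: the single number 1 is reused for every element.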

We can extract elements from a vector using brackets. Element extraction is extremely powerful and useful in R. The following commands extract the fifth element from num1 (25) and the first element from num2 (1):

num1[5] 
## [1] 25
num2[1]
## [1] 1

We can extract all but certain elements with the negative sign. Let’s see num1 without the third element:

num1[-3]
## [1]  1  4 16 25

A very common use of this functionality is to extract based on a condition. For example, the next command will extract all the elements of num1 that are greater than 5:

num1[num1>5]
## [1]  9 16 25
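It can help to see what is happening inside the brackets. The condition num1>5 is itself a vector of TRUE and FALSE values, and the brackets keep only the elements where it is TRUE. A short sketch (the name keep is mine):

```r
num1 <- c(1, 4, 9, 16, 25)

keep <- num1 > 5
keep                          # FALSE FALSE  TRUE  TRUE  TRUE
num1[keep]                    # 9 16 25

which(keep)                   # positions of the TRUE values: 3 4 5
sum(keep)                     # how many elements meet the condition: 3

# Conditions can be combined with & (and) or | (or)
num1[num1 > 5 & num1 < 20]    # 9 16
```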

2.1.3 Packages and Libraries

Every command thus far has used what is called Base R. Base R is the basic software that contains the R programming language and many statistical and graphical tools. However, R is also extensible via packages, user-written sets of commands that are often open-source (i.e. freely available) and that expand upon the capabilities of R. Packages in R must be installed before they can be used, and must also be loaded every time you use them.

To install a package, you typically use the command install.packages() and put the name of the package to be installed in quotation marks inside the parentheses. For example, to install the EnvStats package you would type install.packages("EnvStats") into R. You only need to install a package once. It is generally a bad idea to include an install.packages() command within a script (more on scripts below), because this leads to reinstalling packages every time the script runs, which is a waste of time and often breaks your code anyhow. If you are wondering what EnvStats is all about, try typing ?EnvStats or help(EnvStats) into R!

Once a package is installed, I need to let R know when I want to use it. When you open R via RStudio, the only thing that starts right away is Base R, so the only commands you can use natively are those from Base R. If I want to use the geoMean() function from within the EnvStats package I just installed, I need to let R know where to find the geoMean() function. There are two ways of doing so.

The first method uses the double colon operator - :: - and has the general syntax of package::function. To see this in action, let’s create a vector with 6 months of rates of return for an asset:

ror6 <- 1 + c(.04, .13, -.03, .11, -.05, .08)

Let’s assume I want to calculate the average rate of return, which is where the geometric mean comes in (arithmetic means overstate average rates of return). Next, let’s use the double colon method to calculate the geometric mean using the geoMean() function from the EnvStats package:

EnvStats::geoMean(ror6)
## [1] 1.044461
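If you are curious what geoMean() is doing, the geometric mean of n numbers is the nth root of their product, which is the same thing as exponentiating the average of the logs. The sketch below checks this with Base R only; the object names geo_by_product and geo_by_logs are mine. It is meant as a sanity check, not as a description of how EnvStats implements the function internally.

```r
ror6 <- 1 + c(.04, .13, -.03, .11, -.05, .08)

# nth root of the product
geo_by_product <- prod(ror6)^(1 / length(ror6))

# equivalently: exponentiate the mean of the logs
geo_by_logs <- exp(mean(log(ror6)))

geo_by_product   # about 1.044461
geo_by_logs      # same value

# The arithmetic mean is a bit higher, which is the overstatement mentioned above
mean(ror6)       # about 1.046667
```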

The double colon operator is useful if you only plan on using a function from a particular library once; however, it is often easier to simply load the library into memory so you can access the function without typing :: all over the place. Loading a package into memory is accomplished with the library() command. So if I wanted to use the EnvStats package, I would type library(EnvStats) (note this time I don’t have quotes) into R and then I could use all of the functions contained within. This next code chunk first loads EnvStats, so I can directly use geoMean() in the following line.

library(EnvStats)
geoMean(ror6)
## [1] 1.044461

Generally speaking, the library() approach is used far more often than the :: approach.
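If you have not installed EnvStats yet, you can try both approaches with the tools package instead. I am using tools here only because it ships with R but is not loaded automatically at startup, so it behaves like any other package for this purpose. Its toTitleCase() function capitalizes the words in a string:

```r
# Approach 1: double colon, no library() call needed
via_colons <- tools::toTitleCase("geometric mean")
via_colons    # "Geometric Mean"

# Approach 2: load the package first, then call the function directly
library(tools)
via_library <- toTitleCase("geometric mean")
via_library   # "Geometric Mean"
```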

If you want to get a head start on installing the libraries used in this book, install the following:

  • knitr
  • tidyverse
  • kableExtra
  • AER
  • stargazer
  • wooldridge
  • fivethirtyeight
  • sandwich
  • lmtest
  • margins
  • MASS
  • huxtable
  • broom
  • jtools
  • mlogit
  • censReg
  • sampleSelection
  • scales
  • dynlm
  • tseries
  • collapse
  • forecast
  • cowplot
  • tidyquant
  • plm
  • gifski
  • gganimate
  • rnaturalearth
  • rnaturalearthdata
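Rather than installing these one at a time, you can hand install.packages() a vector of names. The sketch below is one possible way to install only what is missing; it uses the Base R functions installed.packages() and setdiff(), and the object names pkgs and missing_pkgs are mine. I have shortened the list for readability and commented out the install line so that nothing is downloaded until you decide to run it.

```r
# A few of the packages from the list above; extend the vector as needed
pkgs <- c("knitr", "tidyverse", "AER", "stargazer", "wooldridge")

# Names of everything currently installed on this machine
already_installed <- rownames(installed.packages())

# Which of the packages we want are not installed yet?
missing_pkgs <- setdiff(pkgs, already_installed)
missing_pkgs

# Uncomment to install whatever is missing (requires an internet connection):
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```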

2.2 Working with Data

Datasets in R are typically stored in objects called data frames. Broadly speaking, there are three ways of getting data into R: loading it from a file, finding it in an R package, or using an R package to fetch live data from the internet.

Data can generally be loaded into R from nearly any other spreadsheet or statistics program. The most common format is the Comma Separated Values (CSV) file, which is read with the read.csv() function. There exist many other utilities for importing data into R, including packages such as readxl, haven, Hmisc, and foreign. If the data exists in a somewhat common format, somebody has probably written a package to import it into R!
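Since this book mostly uses data from packages, here is a quick sketch of what reading a CSV file looks like. To keep it self-contained, it first writes a tiny made-up file to a temporary location; the file contents, the column names, and the object names csv_path and growth are all invented for illustration. With real data you would simply pass the path of your own file to read.csv().

```r
# Write a tiny made-up CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,gdp_growth",
             "USA,2.1",
             "Canada,1.8",
             "Mexico,0.9"),
           csv_path)

# Read it back in; the first line of the file becomes the column names
growth <- read.csv(csv_path)

class(growth)        # "data.frame"
dim(growth)          # 3 rows, 2 columns
growth$gdp_growth    # the numbers are read in as numeric
```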

This book primarily relies on the second method: pulling data directly out of an installed package. The three data packages this text relies most heavily upon are wooldridge, AER, and fivethirtyeight, which, if you haven’t already, you should install right now:

install.packages("wooldridge")
install.packages("AER")
install.packages("fivethirtyeight")

Additionally, many data sets are built into Base R. To see what data sets are readily available, type:

data()

Let’s say we want to play with the iris data (people love that one, I don’t know why); we can load it into memory with the command:

data(iris)

Now the iris data is loaded into the environment: it is an object called iris.

If I wanted to use any of the data that is in the packages installed above, I would first need to call them into memory using the library() command:

library(wooldridge)
library(AER)
library(fivethirtyeight)

Now, a data() call will list a lot more available datasets!

Let’s load in the CPS1985 dataset from the AER package and learn some basic tools for inspecting and manipulating data:

data(CPS1985)

For most datasets that are built into packages, you can get information about the dataset using the help command:

?CPS1985

To see how big the data is, we might try looking at the dimensions with dim(), the number of columns with ncol(), and the number of rows with nrow().

dim(CPS1985)
## [1] 534  11
ncol(CPS1985)
## [1] 11
nrow(CPS1985)
## [1] 534

These show us that the dataset is 534 rows long and 11 columns wide. IMPORTANT: R always does rows first, columns second. Remembering this will help!

We can also learn about what the data generally look like by using the head() function, which will print the first 6 lines of the data set (we could also use the tail() function to see the last 6 lines of the data set):

head(CPS1985)
##       wage education experience age ethnicity region gender occupation
## 1     5.10         8         21  35  hispanic  other female     worker
## 1100  4.95         9         42  57      cauc  other female     worker
## 2     6.67        12          1  19      cauc  other   male     worker
## 3     4.00        12          4  22      cauc  other   male     worker
## 4     7.50        12         17  35      cauc  other   male     worker
## 5    13.07        13          9  28      cauc  other   male     worker
##             sector union married
## 1    manufacturing    no     yes
## 1100 manufacturing    no     yes
## 2    manufacturing    no      no
## 3            other    no      no
## 4            other    no     yes
## 5            other   yes      no

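We can also use square brackets to subset bits of data. Remember, brackets use the rows, columns convention mentioned above. Here I look at rows 222-225. I leave the column part empty so I get all of the columns. Note that the labels printed on the left are row names rather than positions, which is why they do not line up exactly with 222-225: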

CPS1985[222:225,]
##      wage education experience age ethnicity region gender occupation
## 221  3.84        11         25  42     other  south female      sales
## 222  6.40        12         45  63      cauc  other female      sales
## 223  5.56        14          5  25      cauc  south   male      sales
## 224 10.00        12         20  38      cauc  south   male      sales
##            sector union married
## 221         other    no     yes
## 222         other    no     yes
## 223         other    no      no
## 224 manufacturing    no     yes
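The same bracket logic works on any data frame. Here is a short sketch using mtcars, a dataset built into Base R, so you can try it without loading any packages. The object name small is mine.

```r
data(mtcars)

# First three rows, all columns (column part left empty)
mtcars[1:3, ]

# First three rows, two columns picked by name
small <- mtcars[1:3, c("mpg", "cyl")]
dim(small)          # 3 rows, 2 columns

# All rows of the first column (row part left empty);
# a single column comes back as a plain vector
mtcars[, 1]

# Rows can also be picked by row name instead of position
mtcars["Mazda RX4", ]
```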

We can also use logical conditions here, which requires us to learn a bit about the dollar sign $ operator. We use a dollar sign to refer to a variable or subobject within another object. If I want to refer to the age variable within the CPS1985 dataset, I would refer to it as CPS1985$age. With this in mind, let’s look at every row where age is over 60:

CPS1985[CPS1985$age>60 , ]
##      wage education experience age ethnicity region gender occupation
## 31   4.00        12         46  64      cauc  south female     worker
## 62   7.00         3         55  64  hispanic  south   male     worker
## 200  8.80        14         41  61      cauc  south   male management
## 217 12.50        12         43  61      cauc  other   male      sales
## 222  6.40        12         45  63      cauc  other female      sales
## 230 19.98        14         44  64      cauc  south   male      sales
## 239 13.71        12         43  61      cauc  other   male      sales
## 262 11.67        12         43  61      cauc  other female     office
## 268  5.25        12         45  63      cauc  other female     office
## 278  9.17        12         44  62      cauc  other female     office
## 331 11.71        16         42  64      cauc  other female     office
## 340 10.62        12         45  63      cauc  other female     office
## 346  6.00         4         54  64      cauc  other   male   services
## 355  3.50         9         48  63      cauc  other   male   services
## 368  3.60        12         43  61  hispanic  south female   services
## 370  3.40         8         49  63      cauc  other female   services
## 396  8.00        12         43  61     other  other female   services
## 403  5.55        11         45  62      cauc  other female   services
## 405  8.93         8         47  61  hispanic  other   male   services
## 413  3.50         9         47  62      cauc  other   male   services
## 426  9.50         9         46  61      cauc  other female   services
## 485 22.20        18         40  64      cauc  other female  technical
## 496 22.83        18         37  61      cauc  other female  technical
##            sector union married
## 31          other    no      no
## 62  manufacturing    no     yes
## 200         other    no     yes
## 217 manufacturing    no     yes
## 222         other    no     yes
## 230         other    no     yes
## 239         other   yes     yes
## 262  construction    no     yes
## 268         other    no     yes
## 278 manufacturing    no     yes
## 331 manufacturing    no      no
## 340 manufacturing    no      no
## 346         other    no     yes
## 355         other    no      no
## 368         other    no     yes
## 370         other    no      no
## 396         other   yes     yes
## 403         other   yes      no
## 405         other   yes     yes
## 413         other   yes     yes
## 426         other   yes     yes
## 485         other    no      no
## 496 manufacturing    no      no

We could drill down even further. What if we only want to see unmarried white females in the south? Here, we need to use the double equals sign (== is the boolean operator for “is equal to”), put quotes around the values that aren’t numbers, and join the conditions with a bunch of ampersands:

CPS1985[CPS1985$ethnicity == "cauc" & CPS1985$region == "south" & CPS1985$gender == "female" & CPS1985$married == "no", ]
##      wage education experience age ethnicity region gender occupation sector
## 31   4.00        12         46  64      cauc  south female     worker  other
## 33   5.00        17          1  24      cauc  south female     worker  other
## 83   3.75        16         13  35      cauc  south female     worker  other
## 88   6.25        12          6  24      cauc  south female     worker  other
## 193 10.00        18         13  37      cauc  south female management  other
## 203  7.81        12          1  19      cauc  south female management  other
## 229  4.75        12         10  28      cauc  south female      sales  other
## 274  5.25        16          2  24      cauc  south female     office  other
## 275 10.32        13         28  47      cauc  south female     office  other
## 283  4.25        13          0  19      cauc  south female     office  other
## 307  7.50        12         20  38      cauc  south female     office  other
## 310  3.55        13          1  20      cauc  south female     office  other
## 312  4.50        13          0  19      cauc  south female     office  other
## 323  5.00        12         26  44      cauc  south female     office  other
## 349  6.00        15         26  47      cauc  south female   services  other
## 402  6.88        14         10  30      cauc  south female   services  other
## 407  3.50        10         33  49      cauc  south female   services  other
## 425  4.55         8         45  59      cauc  south female   services  other
## 494 24.98        16          5  27      cauc  south female  technical  other
## 517  7.45        12         25  43      cauc  south female  technical  other
##     union married
## 31     no      no
## 33     no      no
## 83     no      no
## 88     no      no
## 193    no      no
## 203    no      no
## 229    no      no
## 274    no      no
## 275    no      no
## 283    no      no
## 307    no      no
## 310    no      no
## 312    no      no
## 323    no      no
## 349    no      no
## 402    no      no
## 407    no      no
## 425    no      no
## 494    no      no
## 517    no      no

Typing all that CPS1985$ stuff gets annoying. Understanding dollar sign notation is essential, but to make our lives a bit easier, this might be a good place to attach() our data:

attach(CPS1985)
## The following objects are masked from CPS1985 (pos = 3):
## 
##     age, education, ethnicity, experience, gender, married,
##     occupation, region, sector, union, wage
## The following objects are masked from CPS1985 (pos = 5):
## 
##     age, education, ethnicity, experience, gender, married,
##     occupation, region, sector, union, wage

Attaching a dataset works a bit like loading a library in that now, R will look inside CPS1985 for variables! Now, that previous command can be simplified as:

CPS1985[ethnicity == "cauc" & region == "south" & gender == "female" & married == "no", ]
##      wage education experience age ethnicity region gender occupation sector
## 31   4.00        12         46  64      cauc  south female     worker  other
## 33   5.00        17          1  24      cauc  south female     worker  other
## 83   3.75        16         13  35      cauc  south female     worker  other
## 88   6.25        12          6  24      cauc  south female     worker  other
## 193 10.00        18         13  37      cauc  south female management  other
## 203  7.81        12          1  19      cauc  south female management  other
## 229  4.75        12         10  28      cauc  south female      sales  other
## 274  5.25        16          2  24      cauc  south female     office  other
## 275 10.32        13         28  47      cauc  south female     office  other
## 283  4.25        13          0  19      cauc  south female     office  other
## 307  7.50        12         20  38      cauc  south female     office  other
## 310  3.55        13          1  20      cauc  south female     office  other
## 312  4.50        13          0  19      cauc  south female     office  other
## 323  5.00        12         26  44      cauc  south female     office  other
## 349  6.00        15         26  47      cauc  south female   services  other
## 402  6.88        14         10  30      cauc  south female   services  other
## 407  3.50        10         33  49      cauc  south female   services  other
## 425  4.55         8         45  59      cauc  south female   services  other
## 494 24.98        16          5  27      cauc  south female  technical  other
## 517  7.45        12         25  43      cauc  south female  technical  other
##     union married
## 31     no      no
## 33     no      no
## 83     no      no
## 88     no      no
## 193    no      no
## 203    no      no
## 229    no      no
## 274    no      no
## 275    no      no
## 283    no      no
## 307    no      no
## 310    no      no
## 312    no      no
## 323    no      no
## 349    no      no
## 402    no      no
## 407    no      no
## 425    no      no
## 494    no      no
## 517    no      no

Whether you want to use dollar signs or attach data is often a matter of personal preference. Personally, I found that early on learning R, attaching data was a much easier approach, but as I got into more intermediate and advanced applications, I started using dollar sign notation far more often.
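If you do attach data, it is a good habit to detach() it when you are done, so that variable names do not linger in R's search path. The sketch below shows the attach/detach cycle on the built-in mtcars data. It also shows with(), a Base R function this text does not rely on, which lets you use bare variable names for a single command without attaching anything. The object names attached_mean and with_mean are mine.

```r
# attach(): variable names are available until we detach
attach(mtcars)
attached_mean <- mean(mpg[cyl == 4])
detach(mtcars)

# with(): bare variable names, but only inside this one call
with_mean <- with(mtcars, mean(mpg[cyl == 4]))

# Both give the same answer as dollar sign notation
attached_mean
with_mean
mean(mtcars$mpg[mtcars$cyl == 4])
```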

Sometimes it is easier to filter out everything BUT a certain group. This is accomplished with not equals signs (!=) and/or negative signs. This next line shows me the subset of data that is males over the age of 55 who are not in management. Because this subgroup only includes males, the gender column is irrelevant, so I’m getting rid of the 7th column.

CPS1985[gender == "male" & occupation != "management" & age>55, -7]
##      wage education experience age ethnicity region occupation        sector
## 16   8.00         7         44  57      cauc  south     worker         other
## 25   5.75         6         45  57      cauc  south     worker manufacturing
## 62   7.00         3         55  64  hispanic  south     worker manufacturing
## 69   6.75        10         41  57     other  south     worker manufacturing
## 109 11.00         8         42  56      cauc  other     worker manufacturing
## 112 15.00        12         40  58      cauc  other     worker  construction
## 137  3.35         7         43  56      cauc  south     worker manufacturing
## 145  7.50        12         38  56      cauc  south     worker         other
## 147 11.25        12         41  59      cauc  other     worker         other
## 217 12.50        12         43  61      cauc  other      sales manufacturing
## 230 19.98        14         44  64      cauc  south      sales         other
## 239 13.71        12         43  61      cauc  other      sales         other
## 346  6.00         4         54  64      cauc  other   services         other
## 355  3.50         9         48  63      cauc  other   services         other
## 405  8.93         8         47  61  hispanic  other   services         other
## 411  6.50        11         39  56      cauc  south   services         other
## 413  3.50         9         47  62      cauc  other   services         other
## 468  8.00        18         33  57      cauc  other  technical         other
## 481  7.00        18         33  57      cauc  other  technical         other
## 482 18.00        16         38  60      cauc  south  technical         other
## 513 15.00        12         39  57      cauc  other  technical         other
##     union married
## 16     no     yes
## 25     no     yes
## 62     no     yes
## 69    yes     yes
## 109    no     yes
## 112   yes     yes
## 137    no     yes
## 145    no     yes
## 147   yes     yes
## 217    no     yes
## 230    no     yes
## 239   yes     yes
## 346    no     yes
## 355    no      no
## 405   yes     yes
## 411    no     yes
## 413   yes     yes
## 468   yes      no
## 481    no     yes
## 482    no     yes
## 513   yes     yes
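Here is the same pattern on the built-in iris data, in case you want to experiment without CPS1985. It keeps flowers that are not of the setosa species and have a sepal length above 7.5, and drops the fifth column (Species). The object name not_setosa_long is mine.

```r
# != means "not equal to"; -5 drops the fifth column (Species)
not_setosa_long <- iris[iris$Species != "setosa" & iris$Sepal.Length > 7.5, -5]

not_setosa_long
names(not_setosa_long)   # Species is no longer among the columns
```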

Another useful way to look at the data is via the str() function, which tells us about the structure of the data:

str(CPS1985)
## 'data.frame':    534 obs. of  11 variables:
##  $ wage      : num  5.1 4.95 6.67 4 7.5 ...
##  $ education : num  8 9 12 12 12 13 10 12 16 12 ...
##  $ experience: num  21 42 1 4 17 9 27 9 11 9 ...
##  $ age       : num  35 57 19 22 35 28 43 27 33 27 ...
##  $ ethnicity : Factor w/ 3 levels "cauc","hispanic",..: 2 1 1 1 1 1 1 1 1 1 ...
##  $ region    : Factor w/ 2 levels "south","other": 2 2 2 2 2 2 1 2 2 2 ...
##  $ gender    : Factor w/ 2 levels "male","female": 2 2 1 1 1 1 1 1 1 1 ...
##  $ occupation: Factor w/ 6 levels "worker","technical",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sector    : Factor w/ 3 levels "manufacturing",..: 1 1 1 3 3 3 3 3 1 3 ...
##  $ union     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ married   : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 1 1 2 1 ...

This tells us about the types of variables we have in our data frame. The first 4 are numeric, the rest are factor (categorical) variables.

You can also see what the variable type is with the class() command.

class(wage)
## [1] "numeric"
class(age)
## [1] "numeric"
class(married)
## [1] "factor"
class(union)
## [1] "factor"

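Be careful with the class(union) command. It returned "factor" here, but there is also a function in base R called union(), so once the data are attached the name union is ambiguous. R resolves it by searching through the attached objects in order and using the first match it finds, which can produce surprises. This is a case where you are better off using $ notation, even though we used attach() on the data set. Name clashes of this kind are also what the message printed by our original attach() command was reporting: it lists objects that are masked by the newly attached data.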

class(CPS1985$union)
## [1] "factor"

If you have a categorical/factor variable, you can use the levels function to see what all the possible values are:

levels(ethnicity)
## [1] "cauc"     "hispanic" "other"
levels(occupation)
## [1] "worker"     "technical"  "services"   "office"     "sales"     
## [6] "management"
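To get a feel for how factors work, it can help to build a small one by hand. The sketch below uses made-up data (the object sizes is mine) and shows that a factor stores each value as an integer code pointing at one of its levels, which is what the numbers at the end of each Factor line in the str() output are.

```r
# A made-up categorical variable with an explicit level order
sizes <- factor(c("small", "large", "medium", "small", "large"),
                levels = c("small", "medium", "large"))

levels(sizes)       # "small" "medium" "large"
nlevels(sizes)      # 3
table(sizes)        # counts per level: small 2, medium 1, large 2

# Behind the scenes, each value is stored as the position of its level
as.integer(sizes)   # 1 3 2 1 3
```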

We can get a generic summary of the data with the summary() command.

summary(CPS1985)
##       wage          education       experience         age       
##  Min.   : 1.000   Min.   : 2.00   Min.   : 0.00   Min.   :18.00  
##  1st Qu.: 5.250   1st Qu.:12.00   1st Qu.: 8.00   1st Qu.:28.00  
##  Median : 7.780   Median :12.00   Median :15.00   Median :35.00  
##  Mean   : 9.024   Mean   :13.02   Mean   :17.82   Mean   :36.83  
##  3rd Qu.:11.250   3rd Qu.:15.00   3rd Qu.:26.00   3rd Qu.:44.00  
##  Max.   :44.500   Max.   :18.00   Max.   :55.00   Max.   :64.00  
##     ethnicity     region       gender         occupation            sector   
##  cauc    :440   south:156   male  :289   worker    :156   manufacturing: 99  
##  hispanic: 27   other:378   female:245   technical :105   construction : 24  
##  other   : 67                            services  : 83   other        :411  
##                                          office    : 97                      
##                                          sales     : 38                      
##                                          management: 55                      
##  union     married  
##  no :438   no :184  
##  yes: 96   yes:350  
##                     
##                     
##                     
## 

Note that R provides different output for the different types of data. For the numeric data, R gives us quantitative summary statistics – means, min/max, and quartiles. For the categorical data, we get raw counts.

Sometimes, datasets will code a categorical variable as a number. Here, I will code a variable called female which will be 1 for females and 0 for males. So as to not overwrite the CPS1985 data in our environment memory, I will first copy CPS1985 into an object called tempdata and then create a new variable called female with the ifelse() command.

tempdata <- CPS1985
tempdata$female <- ifelse(tempdata$gender == "female", 1, 0)
tempdata[200:209,c(7,12)]
##     gender female
## 199   male      0
## 200   male      0
## 201   male      0
## 202   male      0
## 203 female      1
## 204 female      1
## 205 female      1
## 206   male      0
## 207 female      1
## 208 female      1

You can see from the output that my code apparently worked. We know that the female variable is essentially categorical, but what happens when we inspect the class?

class(tempdata$female)
## [1] "numeric"

And if we summarize it:

summary(tempdata$female)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4588  1.0000  1.0000

This tells us that 45.88% of our data is female, which is useful, but we might want to force R to either treat the female variable as a factor, or convert it to a factor.

summary(as.factor(tempdata$female))
##   0   1 
## 289 245

Now it is giving counts for us. We could also permanently code the variable as a factor:

tempdata$female <- factor(tempdata$female, levels = c(0,1), labels = c("male", "female"))
summary(tempdata$female)
##   male female 
##    289    245
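The steps above can be replayed on a tiny made-up vector, which makes it easier to see what each piece does. The names gender_demo, female_demo, and female_factor are mine and the five values are invented. Note in particular that the mean of a 0/1 variable is just the share of ones, which is where a figure like 45.88% comes from.

```r
# Five made-up observations
gender_demo <- c("female", "male", "male", "female", "female")

# ifelse(test, value if TRUE, value if FALSE), applied element by element
female_demo <- ifelse(gender_demo == "female", 1, 0)
female_demo          # 1 0 0 1 1

# The mean of a 0/1 variable is the proportion of ones
mean(female_demo)    # 0.6

# Converting to a factor attaches labels to the codes and gives counts
female_factor <- factor(female_demo, levels = c(0, 1),
                        labels = c("male", "female"))
table(female_factor) # male 2, female 3
```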

Another useful trick is to combine square brackets with our functions. Let’s look at the summary of wage data, split out by male vs female.

summary(wage[gender == "male"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   8.930   9.995  13.000  26.290
summary(wage[gender == "female"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.750   4.750   6.800   7.879  10.000  44.500

Maybe we want to make a new variable in our dataset to highlight which of our individuals are females in management. The ifelse command works, but another useful method is to use the cbind command. First, let’s create a vector called femanager:

femanager <- gender == "female" & occupation == "management"
summary(femanager)
##    Mode   FALSE    TRUE 
## logical     513      21

You can see that this is an object with 534 observations that are TRUE or FALSE. Apparently we have 21 female managers in our data. We can use cbind to add it to our data:

tempdata <- cbind(CPS1985, femanager)
class(tempdata$femanager)
## [1] "logical"

The class is “logical”, but this can be converted to factor easily:

tempdata$femanager <- factor(tempdata$femanager, levels = c(FALSE,TRUE), labels = c("no", "yes"))
class(tempdata$femanager)
## [1] "factor"
summary(tempdata$femanager)
##  no yes 
## 513  21
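For comparison, here is a small sketch showing that cbind() and dollar sign assignment are two routes to the same place. The data frame people and everything in it are made up for illustration.

```r
# A tiny made-up data frame
people <- data.frame(name = c("Ana", "Ben", "Cy"),
                     age  = c(31, 45, 62))

# A logical vector, one value per row
over40 <- people$age > 40

# Route 1: cbind() glues the vector on as a new column
with_cbind <- cbind(people, over40)

# Route 2: dollar sign assignment creates the column directly
with_dollar <- people
with_dollar$over40 <- over40

names(with_cbind)     # "name" "age" "over40"
with_cbind$over40     # FALSE TRUE TRUE
with_dollar$over40    # FALSE TRUE TRUE
```

One thing to keep in mind with cbind() is that it matches by position, so the vector needs to be in the same row order as the data frame.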

2.2.1 Basics of dplyr and tidyverse

It is often far easier to use the tidyverse for data manipulation than it is to use base R. The tidyverse is a family of packages, most notably dplyr and ggplot2, that will be used extensively throughout this text. You can load the whole family at once with the library(tidyverse) command (keep in mind you may need to run install.packages("tidyverse") first!).

The most important dplyr verbs to remember are:

  • select - Selects columns
  • filter - Filters rows
  • arrange - Re-orders rows
  • mutate - Creates new columns
  • summarize - Collapses rows into summary statistics
  • group_by - Splits the data into groups so that later verbs are applied to each group separately (split-apply-recombine); ungroup removes the grouping

One of the most useful things to learn early in dplyr is the pipe operator, which looks like this: %>%. In RStudio, the keyboard shortcut for %>% is Control-Shift-M (Command-Shift-M on a Mac). That’s a useful one to memorize! The way to think about the pipe operator is that it “pipes” the results from one line of code into the next. I read it as sort of saying “and then”. Let’s put some of these new tools to work.
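As a first taste, here is a sketch that chains several of the verbs together on the built-in mtcars data. It assumes dplyr is installed; the column wt_lbs and the object mpg_by_cyl are names I made up for the example. Read each %>% as "and then".

```r
library(dplyr)

# Take mtcars, and then keep fuel-efficient cars, and then keep three columns,
# and then add a new column, and then sort, and then show the top 3
mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, cyl, wt) %>%
  mutate(wt_lbs = wt * 1000) %>%   # wt is recorded in thousands of pounds
  arrange(desc(mpg)) %>%
  head(3)

# Split by number of cylinders, summarize each group, and then recombine
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), n_cars = n()) %>%
  ungroup()

mpg_by_cyl
```

The second pipeline returns one row per value of cyl, which is the split-apply-recombine idea in action.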

Recall that above we looked at subgroups of the CPS, starting with those older than 60. Using base R, we got this with CPS1985[CPS1985$age>60 , ]. However, it might be more intuitive to use dplyr and run:

CPS1985 %>% 
  filter(age>60)
##      wage education experience age ethnicity region gender occupation
## 31   4.00        12         46  64      cauc  south female     worker
## 62   7.00         3         55  64  hispanic  south   male     worker
## 200  8.80        14         41  61      cauc  south   male management
## 217 12.50        12         43  61      cauc  other   male      sales
## 222  6.40        12         45  63      cauc  other female      sales
## 230 19.98        14         44  64      cauc  south   male      sales
## 239 13.71        12         43  61      cauc  other   male      sales
## 262 11.67        12         43  61      cauc  other female     office
## 268  5.25        12         45  63      cauc  other female     office
## 278  9.17        12         44  62      cauc  other female     office
## 331 11.71        16         42  64      cauc  other female     office
## 340 10.62        12         45  63      cauc  other female     office
## 346  6.00         4         54  64      cauc  other   male   services
## 355  3.50         9         48  63      cauc  other   male   services
## 368  3.60        12         43  61  hispanic  south female   services
## 370  3.40         8         49  63      cauc  other female   services
## 396  8.00        12         43  61     other  other female   services
## 403  5.55        11         45  62      cauc  other female   services
## 405  8.93         8         47  61  hispanic  other   male   services
## 413  3.50         9         47  62      cauc  other   male   services
## 426  9.50         9         46  61      cauc  other female   services
## 485 22.20        18         40  64      cauc  other female  technical
## 496 22.83        18         37  61      cauc  other female  technical
##            sector union married
## 31          other    no      no
## 62  manufacturing    no     yes
## 200         other    no     yes
## 217 manufacturing    no     yes
## 222         other    no     yes
## 230         other    no     yes
## 239         other   yes     yes
## 262  construction    no     yes
## 268         other    no     yes
## 278 manufacturing    no     yes
## 331 manufacturing    no      no
## 340 manufacturing    no      no
## 346         other    no     yes
## 355         other    no      no
## 368         other    no     yes
## 370         other    no      no
## 396         other   yes     yes
## 403         other   yes      no
## 405         other   yes     yes
## 413         other   yes     yes
## 426         other   yes     yes
## 485         other    no      no
## 496 manufacturing    no      no

You start with CPS1985, and then (%>%) filter it down to everybody who is over 60. The advantages of dplyr become clearer with more complicated filters. Rather than CPS1985[CPS1985$ethnicity == "cauc" & CPS1985$region == "south" & CPS1985$gender == "female" & CPS1985$married == "no", ], we might type the following code, which is much cleaner and easier to follow:

CPS1985 %>% 
  filter(ethnicity == "cauc") %>% 
  filter(region == "south") %>% 
  filter(gender == "female") %>% 
  filter(married == "no")
##      wage education experience age ethnicity region gender occupation sector
## 31   4.00        12         46  64      cauc  south female     worker  other
## 33   5.00        17          1  24      cauc  south female     worker  other
## 83   3.75        16         13  35      cauc  south female     worker  other
## 88   6.25        12          6  24      cauc  south female     worker  other
## 193 10.00        18         13  37      cauc  south female management  other
## 203  7.81        12          1  19      cauc  south female management  other
## 229  4.75        12         10  28      cauc  south female      sales  other
## 274  5.25        16          2  24      cauc  south female     office  other
## 275 10.32        13         28  47      cauc  south female     office  other
## 283  4.25        13          0  19      cauc  south female     office  other
## 307  7.50        12         20  38      cauc  south female     office  other
## 310  3.55        13          1  20      cauc  south female     office  other
## 312  4.50        13          0  19      cauc  south female     office  other
## 323  5.00        12         26  44      cauc  south female     office  other
## 349  6.00        15         26  47      cauc  south female   services  other
## 402  6.88        14         10  30      cauc  south female   services  other
## 407  3.50        10         33  49      cauc  south female   services  other
## 425  4.55         8         45  59      cauc  south female   services  other
## 494 24.98        16          5  27      cauc  south female  technical  other
## 517  7.45        12         25  43      cauc  south female  technical  other
##     union married
## 31     no      no
## 33     no      no
## 83     no      no
## 88     no      no
## 193    no      no
## 203    no      no
## 229    no      no
## 274    no      no
## 275    no      no
## 283    no      no
## 307    no      no
## 310    no      no
## 312    no      no
## 323    no      no
## 349    no      no
## 402    no      no
## 407    no      no
## 425    no      no
## 494    no      no
## 517    no      no
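
As a side note, filter() also accepts several conditions separated by commas and keeps only the rows that satisfy all of them, so a chain of filter() calls can be collapsed into one. The sketch below shows this with the built-in mtcars data (used here only so the example runs on its own); the object names chained and combined are made up for the illustration.

```r
library(dplyr)

data(mtcars)

# One condition per filter() call, chained with the pipe
chained <- mtcars %>%
  filter(cyl == 4) %>%
  filter(am == 1) %>%
  filter(mpg > 30)

# The same conditions in a single filter() call, separated by commas
combined <- mtcars %>%
  filter(cyl == 4, am == 1, mpg > 30)

# Check that the two approaches keep exactly the same rows
identical(chained, combined)
```

Which style to use is mostly a matter of taste; one condition per line can be easier to comment out while you experiment.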

Similarly, rather than the command CPS1985[gender == "male" & occupation != "management" & age>55, -7], we could specify:

CPS1985 %>% 
    filter(gender == "male") %>% 
    filter(occupation != "management") %>% 
    filter(age>55) %>% 
    select(-gender)
##      wage education experience age ethnicity region occupation        sector
## 16   8.00         7         44  57      cauc  south     worker         other
## 25   5.75         6         45  57      cauc  south     worker manufacturing
## 62   7.00         3         55  64  hispanic  south     worker manufacturing
## 69   6.75        10         41  57     other  south     worker manufacturing
## 109 11.00         8         42  56      cauc  other     worker manufacturing
## 112 15.00        12         40  58      cauc  other     worker  construction
## 137  3.35         7         43  56      cauc  south     worker manufacturing
## 145  7.50        12         38  56      cauc  south     worker         other
## 147 11.25        12         41  59      cauc  other     worker         other
## 217 12.50        12         43  61      cauc  other      sales manufacturing
## 230 19.98        14         44  64      cauc  south      sales         other
## 239 13.71        12         43  61      cauc  other      sales         other
## 346  6.00         4         54  64      cauc  other   services         other
## 355  3.50         9         48  63      cauc  other   services         other
## 405  8.93         8         47  61  hispanic  other   services         other
## 411  6.50        11         39  56      cauc  south   services         other
## 413  3.50         9         47  62      cauc  other   services         other
## 468  8.00        18         33  57      cauc  other  technical         other
## 481  7.00        18         33  57      cauc  other  technical         other
## 482 18.00        16         38  60      cauc  south  technical         other
## 513 15.00        12         39  57      cauc  other  technical         other
##     union married
## 16     no     yes
## 25     no     yes
## 62     no     yes
## 69    yes     yes
## 109    no     yes
## 112   yes     yes
## 137    no     yes
## 145    no     yes
## 147   yes     yes
## 217    no     yes
## 230    no     yes
## 239   yes     yes
## 346    no     yes
## 355    no      no
## 405   yes     yes
## 411    no     yes
## 413   yes     yes
## 468   yes      no
## 481    no     yes
## 482    no     yes
## 513   yes     yes

We can combine group_by and summarize to easily get average wage by gender:

CPS1985 %>% 
    group_by(gender) %>% 
    summarize(wage = mean(wage))
## # A tibble: 2 x 2
##   gender  wage
##   <fct>  <dbl>
## 1 male    9.99
## 2 female  7.88
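
summarize can also compute several statistics at once. The following sketch, again using the built-in mtcars data so that it is self-contained, counts the cars and computes the mean and standard deviation of mpg for each transmission type. The column names n_cars, mean_mpg, and sd_mpg are my own choices.

```r
library(dplyr)

data(mtcars)

by_am <- mtcars %>%
  group_by(am) %>%           # split: one group per value of am (0 or 1)
  summarize(
    n_cars   = n(),          # apply: number of rows in each group
    mean_mpg = mean(mpg),    # apply: average mpg within each group
    sd_mpg   = sd(mpg)       # apply: standard deviation of mpg within each group
  ) %>%
  ungroup()                  # make sure no grouping is left on the result

by_am
```

With a single grouping variable the result is already ungrouped, so ungroup() changes nothing here; it is included as a safe habit for when you group by more than one variable.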

Both dplyr and base R get to the same place, but generally speaking dplyr is a bit more intuitive, especially for more complicated tasks.

2.3 Scripting in R

Now that we have seen some of the building blocks of coding in R, let’s talk about how we put them together. Writing your code in a script is essential for a variety of reasons:

  • Scripting makes your work reproducible.
  • Scripting allows you to document your code.
  • Scripting makes it easier to work on a project over multiple sessions.

A script is essentially a text file that contains a list of commands that R will execute in order. Rather than type one command at a time in the console window (bottom left of RStudio), you type the series of commands in the script window (top left of RStudio) and can run all or parts of the script at once.

R can be used not only for data analysis, but also for writing up that analysis in a document that integrates text with code and statistical output. An R script is used to create (and recreate!) a series of commands in R. An R Markdown document incorporates script elements into a high quality document that can be shared in a variety of formats.

2.3.1 R Scripts

To create a new script, use the button on the top left of RStudio. This can also be done in the File menu or with Ctrl+Shift+N. The RStudio script editor is basically a fancy text editor, which means that pressing Enter just starts a new line rather than running your code. To run a line of code, put your cursor on that line and press Ctrl+Enter. You can also highlight a set of lines with your cursor and execute them by pressing Ctrl+Enter or by clicking the Run button (top right of the script pane). To run the whole script at once, use the Source button next to it, or select everything and click Run.
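
Outside of the RStudio buttons, a saved script can also be run from the console with the base R function source(). The sketch below writes a tiny two-line script to a temporary file (a stand-in for a script you would normally save yourself) and then runs it; the object names are made up for the example.

```r
# Write a tiny script to a temporary file (a stand-in for a script saved from RStudio)
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "x <- c(2, 4, 6, 8)   # create an object",
  "x_mean <- mean(x)    # perform a function on it"
), script_path)

# Run every command in the script, in order
source(script_path)

# The objects created by the script are now in the environment
x_mean
```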

2.3.2 R Markdown

You can also embed R code into an R Markdown document to make high quality documents in a variety of types:

  • HTML
  • LaTeX/PDF
  • Word
  • Slide Presentations
  • Dashboards
  • e-Books

There is a lot of fancy formatting that goes into all of these types that we will not be going into. Learning some basic R Markdown is useful because it removes the need to cut and paste things from R into a word processing program. To use R Markdown, you will need to install the rmarkdown package, and installing knitr probably won’t hurt either.

This text will not delve deep into R Markdown; however, good places to learn the basics of R Markdown formatting syntax are http://rmarkdown.rstudio.com/lesson-1.html and chapters 1-3 of R Markdown: The Definitive Guide by Yihui Xie (https://bookdown.org/yihui/rmarkdown/).
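
To give a rough sense of what an R Markdown file contains, the sketch below builds a minimal one from R: a short header, one sentence of text, and one code chunk. The file name minimal_example.Rmd and its contents are made up for illustration, and the final step only runs if the rmarkdown package and the pandoc converter it relies on are available. In practice you would normally create the file through the RStudio menus rather than with writeLines().

```r
# Three backticks mark the start and end of a code chunk in R Markdown
fence <- strrep("`", 3)

# The lines of a minimal R Markdown document
rmd_lines <- c(
  "---",
  "title: \"A Minimal Example\"",
  "output: html_document",
  "---",
  "",
  "The average mpg in the mtcars data is computed below.",
  "",
  paste0(fence, "{r}"),
  "mean(mtcars$mpg)",
  fence
)

# Save the document to a temporary folder (hypothetical file name)
rmd_path <- file.path(tempdir(), "minimal_example.Rmd")
writeLines(rmd_lines, rmd_path)

# Knit the file to HTML, but only if rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, quiet = TRUE)
}
```

When the document is knit, the code inside the chunk is run and its output is placed directly below the text.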

2.3.3 Scripting Best Practices

  1. Comment your work!

You may work on a script with another person who has no idea what your code is all about, or you may not look at a script for a few weeks and have forgotten what you were trying to accomplish! Putting comments in your code is a way of making notes and passing them to people with whom you are working and/or your future self. By using the # in your code, you can put comments in your code that R will not execute but will tell you what is going on.

# R will not read this line of code when I execute the script
data(mtcars) # R reads data(mtcars) but it doesn't read this. This is a good place to put comments,
mean(mtcars$mpg) # especially early as you are learning R;
mean(mtcars$qsec) # extensive commenting of your code is invaluable for remembering what you are doing,
t.test(mpg ~ am, data = mtcars) # so when you look at a script later, you know what you did,
t.test(qsec ~ am, data = mtcars) # and why you did it.

A more serious version of commenting on the above code might look like:

# Test whether having an automatic vs. manual transmission affects mpg or quarter-mile time (qsec).
# Note that for the variable am, automatic = 0 and manual = 1.
data(mtcars)
mean(mtcars$mpg)
mean(mtcars$qsec)
t.test(mpg ~ am, data = mtcars)
t.test(qsec ~ am, data = mtcars)

It is very useful to comment your scripts throughout, and also to put a comment at the top of your script to give a general description of what the script does. For example, if I were doing a bunch of analysis on the mtcars dataset I used in the previous example, my first couple lines of code might be:

# This is code to look at the effect of manual vs automatic transmissions in cars from the mtcars data set.
# Code developed by Matt Dobra

  2. Load required libraries early

Base R is powerful, but much of what makes R such a widely used tool for data analysis is its user-written packages.

Remember, you only need to install a package once. This is done with the install.packages() function. For example:

install.packages("tidyverse")

will install the tidyverse package onto your computer. Once it is installed, you need to load it with library() in every session in which you use it. Never put the install.packages() line into your scripts, but always put the library() line in, and do it at the very top of your script. For example, nearly every script I write includes the following line toward the very top of the script:

library(tidyverse)

And, of course, absolutely NONE of them have install.packages("tidyverse") in them anywhere.
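
If you share a script, the person running it may not have every package installed. One possible pattern, shown here only as a sketch, is to check for the packages at the top of the script and stop with a readable message, rather than calling install.packages() inside the script. The helper name check_packages is made up for this illustration and is not part of any package.

```r
# Hypothetical helper: stop early, with a readable message, if any package is missing
check_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    stop("Please install these packages first: ",
         paste(missing_pkgs, collapse = ", "))
  }
  invisible(TRUE)
}

# stats and utils ship with R, so this check passes on any installation
check_packages(c("stats", "utils"))
```

In a real script you would pass the names of the packages you load with library(), for example check_packages("tidyverse").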

If I wind up needing more packages as I develop the script, I try to put their library call at the beginning of the script, not at the end. For example, in my R scripts that I use to write my Principles of Macroeconomics course notes, I have the following code in lines 13-23, as I wind up using these packages in nearly every script for that class.

library(knitr) # markdown language
library(tidyverse) # Keep things tidy
library(kableExtra) # Table Formatting
library(ggthemes) # ggplot addon
library(ggpubr) # ggplot add on
library(DiagrammeR) # Makes Flowcharts
library(WDI) # World Development Indicators
library(quantmod) # Gets FRED data
library(scales) # labeling improvement for ggplots.

  3. Break your code into labeled sections

A typical script might have 3 distinct sections:

  • Preamble - a section that overviews the code, loads libraries and data, etc.
  • Data Wrangling - a section that transforms the loaded data into what you will be analyzing
  • Statistical Analysis and Visualization - a section that does the econometrics and makes graphs.

Use # comments to break these up. Make the section breaks stand out and obvious. For example:

#----------------#
# Data Wrangling #
#----------------#
################################
#~~~~~Statistical Analysis~~~~~#
################################
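
Putting these practices together, a short script might be laid out as in the sketch below. It uses only base R and the built-in mtcars data, so there are no library() calls in the preamble; the names transmission and mpg_test are made up for the example.

```r
#----------#
# Preamble #
#----------#
# Compare fuel economy by transmission type in the mtcars data.
data(mtcars)

#----------------#
# Data Wrangling #
#----------------#
# Turn the 0/1 transmission indicator into a labeled factor
mtcars$transmission <- factor(mtcars$am, levels = c(0, 1),
                              labels = c("automatic", "manual"))

#----------------------------------------#
# Statistical Analysis and Visualization #
#----------------------------------------#
# Two-sample t-test of mpg by transmission type
mpg_test <- t.test(mpg ~ transmission, data = mtcars)
mpg_test

# Side-by-side boxplots of mpg for each transmission type
boxplot(mpg ~ transmission, data = mtcars, ylab = "Miles per gallon")
```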

You will probably discover more useful tips as you go along.