Chapter 3 The Quest for Causality

3.1 Introduction

To familiarize you with the R code necessary to complete the assignments in Real Econometrics, I will reproduce all examples from Chapter 1. As I present the examples, I will explain the syntax for each piece of code. You will also be introduced to using R Markdown to produce a seamless integration of your code, your output, and your reports.

In subsequent chapters, I will take you through examples of the relevant code necessary to complete the exercises in R.

3.1.1 Table 1.1

Table 1.1 contains the information needed to produce Figures 1.2 and 1.3. Creating Table 1.1 will give you an opportunity to build a data frame from four vectors. A data frame, which is used for storing data tables, is one of several data structures in R; the others include vectors, lists, matrices, factors, and tables. A data frame is a collection of vectors of the same length.

A vector is the most common and basic data structure in R. Vectors come in two kinds: atomic vectors and lists. An atomic vector is a collection of observations of a single variable, and every element of it must be of a single type. The vectors in a data frame can, however, be of different types from one another. The atomic vector types, or classes, in R are logical, integer, numeric (real or decimal), complex, and character. A logical vector is one in which every value is TRUE, FALSE, or NA. An integer vector contains only integers, a real vector contains only reals, and so on. If a vector is given more than one type of value, every element is coerced to the most general type present in the vector.
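Coercion is easiest to see by mixing types on purpose. The following is a small sketch in base R; the object names are illustrative only.

```r
# Mixing types inside c() coerces every element to the most general type present.
mixed_numeric   <- c(TRUE, 2L, 3.5)       # logical and integer are promoted to numeric
mixed_character <- c(TRUE, 2L, "three")   # everything is promoted to character

class(mixed_numeric)     # "numeric"
class(mixed_character)   # "character"
mixed_character          # "TRUE" "2" "three"
```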

Let’s start by creating each vector in Table 1.1. To assign values to a vector, use the assignment operator <- and the concatenate or combine function c().
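A minimal sketch of this step, using the values from Table 1.1 and vector names that match the column names printed later in this section:

```r
# Each vector holds one column of Table 1.1.
observation_number <- 1:13   # integer vector created with the colon operator

# Character values must be wrapped in quotation marks.
name <- c("Homer", "Marge", "Lisa", "Bart", "Comic Book Guy", "Mr. Burns",
          "Smithers", "Chief Wiggum", "Principal Skinner", "Rev. Lovejoy",
          "Ned Flanders", "Patty", "Selma")

# Numeric (decimal) vectors.
donuts_per_week <- c(14, 0, 0, 5, 20, 0.75, 0.25, 16, 3, 2, 0.8, 5, 4)
weight <- c(275, 141, 70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145)
```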

Note in the code chunk above that the symbol # is used to create comments within the code. Anything that follows a # on a line will not be executed as code. Comments are useful for leaving notes to yourself or collaborators about what certain lines of code are doing.

We now have four named vectors that we can put into a data frame. First, a note on naming conventions in R. While there are many naming conventions in R, I recommend using snake case, where each word is separated by an underscore and no capital letters are used. See Hadley Wickham's Style Guide for style suggestions for all parts of R programming, not just variable names. Following these guidelines will make your code easier to read and edit.

A tibble is an update to the traditional data frame. For most of what we will do, it will act the same as a data frame. The two main differences between data frames and tibbles are printing and subsetting. For more on tibbles, type vignette("tibble") in the console.
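The subsetting difference can be illustrated with a short sketch. It assumes the tibble package is already installed; the objects df and tbl are made up for the illustration.

```r
library(tibble)

df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- tibble(x = 1:3, y = c("a", "b", "c"))

# Selecting a single column with [ , ] drops a data frame to a plain vector,
# while a tibble stays a tibble.
class(df[, "x"])    # "integer"
class(tbl[, "x"])   # "tbl_df" "tbl" "data.frame"
```

Printing differs as well: a tibble shows only the first rows along with each column's type, as in the output later in this section.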

tidyverse is one of the many packages developed within the R community. In R, a package is shareable code that bundles together code, data, documentation, tests, etc. To use a package, it must first be installed and then loaded. To install a package, call install.packages("package_name") (see note 1 at the end of this section). To make use of a package, load it by calling library(package_name) (see note 2). Currently there are more than 14,000 packages available; to see them, visit Contributed Packages. CRAN Task Views shows relevant packages by task. You may want to visit CRAN Task View: Econometrics to see the extensive array of packages for use in econometrics.

The tidyverse package is a collection of packages that share an underlying design philosophy, grammar, and data structures. For more on the tidyverse follow this link. The dplyr package loaded below is a grammar of data manipulation that can be used to solve most data manipulation problems.
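A sketch of how the four vectors might be combined into a tibble and inspected. It assumes the tibble and dplyr packages are installed (calling library(tidyverse) would load both); the name donuts for the resulting tibble is an assumption made for illustration.

```r
library(tibble)  # provides tibble()
library(dplyr)   # provides glimpse()

# The four vectors from Table 1.1, repeated so the chunk runs on its own.
observation_number <- 1:13
name <- c("Homer", "Marge", "Lisa", "Bart", "Comic Book Guy", "Mr. Burns",
          "Smithers", "Chief Wiggum", "Principal Skinner", "Rev. Lovejoy",
          "Ned Flanders", "Patty", "Selma")
donuts_per_week <- c(14, 0, 0, 5, 20, 0.75, 0.25, 16, 3, 2, 0.8, 5, 4)
weight <- c(275, 141, 70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145)

# Column names are taken from the vector names.
donuts <- tibble(observation_number, name, donuts_per_week, weight)

donuts           # print the tibble
glimpse(donuts)  # transposed view: one line per variable
```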

# A tibble: 13 x 4
   observation_number name              donuts_per_week weight
                <int> <chr>                       <dbl>  <dbl>
 1                  1 Homer                       14       275
 2                  2 Marge                        0       141
 3                  3 Lisa                         0        70
 4                  4 Bart                         5        75
 5                  5 Comic Book Guy              20       310
 6                  6 Mr. Burns                    0.75     80
 7                  7 Smithers                     0.25    160
 8                  8 Chief Wiggum                16       263
 9                  9 Principal Skinner            3       205
10                 10 Rev. Lovejoy                 2       185
11                 11 Ned Flanders                 0.8     170
12                 12 Patty                        5       155
13                 13 Selma                        4       145
Observations: 13
Variables: 4
$ observation_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
$ name               <chr> "Homer", "Marge", "Lisa", "Bart", "Comic Bo...
$ donuts_per_week    <dbl> 14.00, 0.00, 0.00, 5.00, 20.00, 0.75, 0.25,...
$ weight             <dbl> 275, 141, 70, 75, 310, 80, 160, 263, 205, 1...

Use the kable function in knitr to create Table 1.1.
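One way the call might look is sketched below. It assumes the knitr and tibble packages are installed; the column labels and caption are copied from the table that follows, and the object names donuts and table_1_1 are illustrative.

```r
library(knitr)
library(tibble)

# Data from Table 1.1, repeated so the chunk runs on its own.
donuts <- tibble(
  observation_number = 1:13,
  name = c("Homer", "Marge", "Lisa", "Bart", "Comic Book Guy", "Mr. Burns",
           "Smithers", "Chief Wiggum", "Principal Skinner", "Rev. Lovejoy",
           "Ned Flanders", "Patty", "Selma"),
  donuts_per_week = c(14, 0, 0, 5, 20, 0.75, 0.25, 16, 3, 2, 0.8, 5, 4),
  weight = c(275, 141, 70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145)
)

# col.names replaces the variable names with reader-friendly headers.
table_1_1 <- kable(
  donuts,
  col.names = c("Observation number", "Name", "Donuts per week", "Weight (pounds)"),
  caption = "Table 1.1 Donut Consumption and Weight"
)
table_1_1
```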

Table 3.1: Table 1.1 Donut Consumption and Weight

Observation  Name               Donuts    Weight
number                          per week  (pounds)
 1           Homer              14.00     275
 2           Marge               0.00     141
 3           Lisa                0.00      70
 4           Bart                5.00      75
 5           Comic Book Guy     20.00     310
 6           Mr. Burns           0.75      80
 7           Smithers            0.25     160
 8           Chief Wiggum       16.00     263
 9           Principal Skinner   3.00     205
10           Rev. Lovejoy        2.00     185
11           Ned Flanders        0.80     170
12           Patty               5.00     155
13           Selma               4.00     145

3.1.2 Figure 1.2

To create Figure 1.2 we will use the ggplot2 package. ggplot2, also part of the tidyverse, is a system for declaratively creating graphics, based on The Grammar of Graphics. The Grammar of Graphics is built on two principles. First, graphics are built with distinct layers of grammatical elements. Second, meaningful plots are formed through aesthetic mappings.

Seven elements comprise the grammar of graphics: data, aesthetics, geometries, facets, statistics, coordinates, and themes. Every graphic must contain, at a minimum, data, aesthetics, and geometries. Data, typically a data frame or tibble, is the data set being plotted. Aesthetics are the scales onto which data are mapped. Aesthetics include x-axis, y-axis, color, fill, size, labels, alpha (transparency), shape, line width, and line type. Geometries are how we want the data plotted, e.g., as points, lines, bars, histograms, boxplots, etc. Facets allow us to split a graphic into more than one plot. Statistics allow us to add elements like error bands and regression lines. Coordinates allow us to control the space into which we plot the data. Finally, themes are all non-data ink in a graphic.

Follow this link for an overview of ggplot2. The Learning ggplot2 section points to three useful places to learn more about using ggplot2. While data visualization is not emphasized in econometrics, understanding its basic principles will help your data analysis.
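A minimal plot needs only the three required elements: data, aesthetics, and a geometry. The sketch below assumes ggplot2 and tibble are installed; the object names donuts and basic_plot are illustrative.

```r
library(ggplot2)
library(tibble)

# Data from Table 1.1, repeated so the chunk runs on its own.
donuts <- tibble(
  name = c("Homer", "Marge", "Lisa", "Bart", "Comic Book Guy", "Mr. Burns",
           "Smithers", "Chief Wiggum", "Principal Skinner", "Rev. Lovejoy",
           "Ned Flanders", "Patty", "Selma"),
  donuts_per_week = c(14, 0, 0, 5, 20, 0.75, 0.25, 16, 3, 2, 0.8, 5, 4),
  weight = c(275, 141, 70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145)
)

# Data and aesthetics go in ggplot(); the geometry is added as a layer with +.
basic_plot <- ggplot(data = donuts,
                     mapping = aes(x = donuts_per_week, y = weight)) +
  geom_point()

basic_plot
```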

This basic plot can be transformed into the figure in the text by adding layers to the graphic to change its appearance.
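Layers are added to a stored plot with the + operator. In the sketch below, the text labels, axis titles, and theme are illustrative choices and not necessarily those of the figure in the text.

```r
library(ggplot2)
library(tibble)

# Data from Table 1.1, repeated so the chunk runs on its own.
donuts <- tibble(
  name = c("Homer", "Marge", "Lisa", "Bart", "Comic Book Guy", "Mr. Burns",
           "Smithers", "Chief Wiggum", "Principal Skinner", "Rev. Lovejoy",
           "Ned Flanders", "Patty", "Selma"),
  donuts_per_week = c(14, 0, 0, 5, 20, 0.75, 0.25, 16, 3, 2, 0.8, 5, 4),
  weight = c(275, 141, 70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145)
)

basic_plot <- ggplot(donuts, aes(x = donuts_per_week, y = weight)) +
  geom_point()

# Each + adds one more element to the existing plot object.
labeled_plot <- basic_plot +
  geom_text(aes(label = name), vjust = -0.8, size = 3) +  # name above each point
  labs(x = "Donuts per week", y = "Weight (pounds)") +    # axis titles
  theme_classic()                                         # plain background

labeled_plot
```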

We can make the graph in one step, if we desire.
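In one step, every layer is chained in a single expression with no intermediate plot object. As before, the styling choices in this sketch are assumptions, not a reproduction of the published figure.

```r
library(ggplot2)
library(tibble)

# Data from Table 1.1, repeated so the chunk runs on its own.
donuts <- tibble(
  name = c("Homer", "Marge", "Lisa", "Bart", "Comic Book Guy", "Mr. Burns",
           "Smithers", "Chief Wiggum", "Principal Skinner", "Rev. Lovejoy",
           "Ned Flanders", "Patty", "Selma"),
  donuts_per_week = c(14, 0, 0, 5, 20, 0.75, 0.25, 16, 3, 2, 0.8, 5, 4),
  weight = c(275, 141, 70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145)
)

# Data, aesthetics, geometries, labels, and theme in a single chain.
figure_1_2 <- ggplot(donuts, aes(x = donuts_per_week, y = weight)) +
  geom_point() +
  geom_text(aes(label = name), vjust = -0.8, size = 3) +
  labs(x = "Donuts per week", y = "Weight (pounds)") +
  theme_classic()

figure_1_2
```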


  1. You need to install a package only once.

  2. You must load the package during each R session to make use of it.