Chapter 4 Introduction to R

R is a versatile statistical language and program. In its simplest form, it can be used as a calculator. However, as you become more fluent with this tool, you can use it to accomplish all sorts of things. Among all the popular programming languages, and despite its steep initial learning curve, R should be your first language as a social scientist. In this section, we will introduce some basics of R using RStudio, a graphical interface for R and the main program you will rely on for the rest of the semester. See here for an intro to the RStudio interface, though you will become familiar with it soon.

Comments in R start with the # sign: R does not execute anything that follows it on the same line. Comments are useful because they help you and others understand what the code is about. For instance,

# calculate 1 + 1 + 1
1 + 1 + 1
## [1] 3

As you learn more about the language, you should also develop a good R style, which makes your code easier to read (or reread, say, 5 months after submitting your paper), share, and verify.

4.1 R as a calculator

Let’s begin with some basic arithmetic operators (for more, type ?"*" in the console, which opens the help page for the arithmetic operators).

  • Addition: +
  • Subtraction: -
  • Multiplication: *
  • Division: /
  • Exponentiation: ^
  • Modulo: %%

Play around with them in your console to see how they work. In particular, try the last two operators, which might be unfamiliar to some of you.

# Exponentiation: raises a number to a certain power
3^2
## [1] 9
2^3
## [1] 8
# Modulo: returns the remainder of a division
27 %% 7
## [1] 6

In addition, you should pay attention to the order of operations. Try this,

3 + 4 / 7
## [1] 3.571429
(3 + 4) / 7
## [1] 1
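
The same idea applies to the other operators. As a small sketch of two cases that often surprise beginners: exponentiation is evaluated before multiplication, and also before a leading minus sign, so parentheses are the safest way to say what you mean.

```r
# Exponentiation is evaluated before multiplication
2 * 3^2
## [1] 18
(2 * 3)^2
## [1] 36
# Exponentiation is also evaluated before a leading minus sign
-2^2
## [1] -4
(-2)^2
## [1] 4
```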

4.2 Basic Data Types in R

R works with many data types. In the previous section, we worked with numbers, which are called numeric. Here are some basic types to start with:

  • Decimal values like 1.111 are called numeric.
  • Whole numbers like 2L are called integer (the L suffix tells R to store the value as an integer; a plain 2 is stored as numeric). Integers are also numerics.
  • Boolean values (TRUE or FALSE) are called logical.
  • Text (or string) values ("R rocks") are called character.
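
A quick way to see these types in action is the class() function, which reports the data type of a value. The sketch below simply checks one example of each type from the list; note that R stores a number typed as 2 as numeric unless you add the L suffix.

```r
class(1.111)
## [1] "numeric"
class(2)      # a plain 2 is stored as numeric
## [1] "numeric"
class(2L)     # the L suffix asks R for an integer
## [1] "integer"
class(TRUE)
## [1] "logical"
class("R rocks")
## [1] "character"
# integers also count as numeric values
is.numeric(2L)
## [1] TRUE
```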

Before we proceed, let’s also talk about variable assignment. A variable is a basic concept in statistics. In R, we use it to store a value or an object (e.g., functions, plots, and datasets). You can later access the stored value or object by simply typing the variable’s name. To assign a value to a variable, we use the operator <- or = (the former is preferred). Try this,

## assign a value to the variable apples
apples <- 4
## print out the value of apples
apples
## [1] 4
## assign a value to the variable oranges
oranges <- 6
## print out the sum of apples and oranges
apples + oranges
## [1] 10

You can write character values with either double or single quotation marks, though the former is preferred. Also, R is case sensitive. Recall that there are different data types. Sometimes, you can’t compare apples and oranges.

## assign a value to the variable oranges
oranges <- "six"
## adding a number and a character value raises an error, so this line is commented out
#apples + oranges
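
If a number happens to be stored as text, one way around this error is to convert it first. The sketch below assumes the text actually looks like a number ("6" rather than "six") and uses the base R functions is.numeric() and as.numeric().

```r
apples <- 4
oranges <- "6"
# oranges is text, so it does not count as a number yet
is.numeric(oranges)
## [1] FALSE
# convert the text "6" to the number 6 before adding
apples + as.numeric(oranges)
## [1] 10
```

Text such as "six" cannot be converted this way; as.numeric() returns a missing value (NA) with a warning instead.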

4.3 Logical Operators

We introduced some arithmetic operators previously. Now let’s look at some logical operators, which yield Boolean values (TRUE or FALSE). For a list of both types of operators, see here. Some basics to start with:

  • < less than
  • <= less than or equal to
  • > greater than
  • >= greater than or equal to
  • == exactly equal to
  • !X Not X

Now try this,

## is 1 smaller than 2
1 < 2
## [1] TRUE
## does 1 plus 1 equal 3
1 + 1 == 3
## [1] FALSE
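
The remaining operators from the list work the same way. A short sketch with one example each:

```r
## is 2 less than or equal to 2
2 <= 2
## [1] TRUE
## is 3 greater than 5
3 > 5
## [1] FALSE
## is 5 greater than or equal to 5
5 >= 5
## [1] TRUE
## negate a comparison that is TRUE
!(1 < 2)
## [1] FALSE
## == also works on character values
"apple" == "apple"
## [1] TRUE
```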

In arithmetic, R treats TRUE as 1 and FALSE as 0. Now let’s try this,

apples <- 4
oranges <- TRUE
apples + oranges
## [1] 5

Recall that R is case sensitive. Try this,

## lower case and upper case
oranges <- "six"
Oranges <- "Six"
oranges == Oranges
## [1] FALSE

4.4 Vectors

Previously, we talked about variables. Another basic statistical concept is a vector, which is a one-dimensional array that can hold as many data values as you want. In R, we use the combine function c() to create a vector. The elements you wish to place in the vector are separated by commas. For instance,

## a numeric vector
(num_vec <- c(1, 2, 3))
## [1] 1 2 3

Now try creating a character vector and a Boolean vector yourself. You can check the data type of your vector by using the class() function. After you are done, try this

## a vector mixing data types: R converts every element to character
(mix_vec <- c(1, "Hi", TRUE))
## [1] "1"    "Hi"   "TRUE"
class(mix_vec)
## [1] "character"

For arithmetic operations of vectors, R proceeds element-wise. For addition, try this

c(1, 2, 3) + c(4, 5, 6)
## [1] 5 7 9
c(1 + 4, 2 + 5, 3 + 6)
## [1] 5 7 9

How about other operations, say multiplication? What happens when the two vectors differ in length? Try these out yourself.
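
Once you have tried them, you can compare your results with the sketch below. The variable names are only for illustration. Multiplication is also element-wise, and when one vector is shorter, R reuses ("recycles") its elements until it matches the length of the longer one.

```r
## element-wise multiplication
(prod_vec <- c(1, 2, 3) * c(4, 5, 6))
## [1]  4 10 18
## the shorter vector c(10, 20) is reused as 10, 20, 10, 20
(recycled_vec <- c(1, 2, 3, 4) + c(10, 20))
## [1] 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still returns a result but prints a warning.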

To select an element, you tell R which element you want by using square brackets. To select multiple elements, use the combine function c(). For instance,

num_vec <- c(11, 21, 63, 44, 95, 86)
num_vec[3]
## [1] 63
num_vec[c(1,4)]
## [1] 11 44
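
Square brackets accept more than single positions. As a brief sketch using the same num_vec, you can select a range with the colon operator, drop an element with a negative index, or keep only the elements that satisfy a logical condition.

```r
num_vec <- c(11, 21, 63, 44, 95, 86)
## elements 2 through 4
num_vec[2:4]
## [1] 21 63 44
## everything except the first element
num_vec[-1]
## [1] 21 63 44 95 86
## only the elements greater than 50
num_vec[num_vec > 50]
## [1] 63 95 86
```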

4.5 Matrices

The third statistical concept we’ll introduce today is a matrix, which is a two-dimensional collection of elements of the same data type, arranged in a fixed number of rows and columns. You can construct a matrix using the matrix() function.

matrix(1:12, byrow=TRUE, nrow=3)
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
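
For comparison, here is a sketch of the same call without byrow=TRUE. By default, matrix() fills the matrix column by column, and dim() reports the number of rows and columns. The name m is only for illustration.

```r
## without byrow = TRUE, the values fill the matrix column by column
(m <- matrix(1:12, nrow = 3))
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
dim(m)
## [1] 3 4
```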

You can also construct matrices by combining vectors or matrices using the cbind() or the rbind() functions.

(c1 <- 1:3)
## [1] 1 2 3
(c2 <- 4:6)
## [1] 4 5 6
(c3 <- 7:9)
## [1] 7 8 9
cbind(c1,c2,c3)
##      c1 c2 c3
## [1,]  1  4  7
## [2,]  2  5  8
## [3,]  3  6  9
rbind(c1,c2,c3)
##    [,1] [,2] [,3]
## c1    1    2    3
## c2    4    5    6
## c3    7    8    9

To select an element of a matrix, we also use the square brackets. Since matrices are two dimensional, it means we need to index the element by both its row number and column number. For instance, my_matrix[1,2] gives you the element that is at the first row and second column. How about my_matrix[1:2,2:3] and my_matrix[,2]? Create a matrix that has at least two rows and three columns and play around with this code.
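
As one possible starting point, the sketch below builds my_matrix from the 3 by 4 example above and runs the three selections mentioned in the text. Leaving the row position empty, as in my_matrix[,2], selects every row of that column.

```r
my_matrix <- matrix(1:12, byrow = TRUE, nrow = 3)
## first row, second column
my_matrix[1, 2]
## [1] 2
## rows 1 to 2, columns 2 to 3
my_matrix[1:2, 2:3]
##      [,1] [,2]
## [1,]    2    3
## [2,]    6    7
## all rows of the second column
my_matrix[, 2]
## [1]  2  6 10
```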

For basic arithmetic operations, matrices also work element-wise. For instance, 11+my_matrix adds 11 to every element of my_matrix. You can also perform basic arithmetic between matrices, with an additional caveat: the matrices need to be conformable, meaning they have the same dimensions. For instance, matrix(1:6, nrow = 2)+matrix(1:9, nrow=3) gives you an error. As you become more fluent with R and statistics, you may need more involved operations, such as inner and outer products. You do not need to worry about them for now.
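
The sketch below illustrates these points with a small 2 by 3 matrix. The call to tryCatch() is only there so the example keeps running and stores the error message (shown as it appears in an English-language R session) in the illustrative variable msg.

```r
my_matrix <- matrix(1:6, nrow = 2)
## adds 11 to every element
11 + my_matrix
##      [,1] [,2] [,3]
## [1,]   12   14   16
## [2,]   13   15   17
## element-wise multiplication of two matrices with the same dimensions
my_matrix * my_matrix
##      [,1] [,2] [,3]
## [1,]    1    9   25
## [2,]    4   16   36
## adding a 2 x 3 matrix to a 3 x 3 matrix fails; tryCatch() captures the message
msg <- tryCatch(
  matrix(1:6, nrow = 2) + matrix(1:9, nrow = 3),
  error = function(e) conditionMessage(e)
)
msg
## [1] "non-conformable arrays"
```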

4.6 Data Frame

Recall all elements within a matrix need to be of the same type. In many cases, however, this is not a desirable feature. For instance, in International Relations studies, you might want to have data of different types within a dataset.

  • ‘Which country?’: character
  • GDP of the country: numeric
  • ‘Is the country a democracy?’: logical

A data frame helps you store data of various types, with the variables as columns and the observations as rows.

# print out built-in R data frame
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
summary(mtcars)
##       mpg             cyl             disp             hp             drat      
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec             vs               am              gear      
##  Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000  
##  1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000  
##  Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   Median :4.000  
##  Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688  
##  3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000  
##  Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000  
##       carb      
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.000  
##  Mean   :2.812  
##  3rd Qu.:4.000  
##  Max.   :8.000
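
You can also build a data frame yourself with the data.frame() function. The sketch below mirrors the country example from the start of this section; the country names and GDP figures are made up purely for illustration, and the $ operator is used to pull out a single column.

```r
## all values below are hypothetical placeholders, not real data
countries <- data.frame(
  country   = c("Country A", "Country B", "Country C"),
  gdp       = c(1.5, 0.8, 2.3),
  democracy = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep the country names as character values
)
## three observations (rows) and three variables (columns)
dim(countries)
## [1] 3 3
## access one column with $
countries$gdp
## [1] 1.5 0.8 2.3
class(countries$country)
## [1] "character"
class(countries$democracy)
## [1] "logical"
```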

4.7 Packages

A package in R is a collection of functions and objects. Whenever you open RStudio or R, it automatically loads a number of packages. You can check the packages that have been loaded using the sessionInfo() function. As your skills get more sophisticated, you will need more packages to help you tackle the problems at hand. To do so, you need to install a package using the install.packages() function and load it using the library() function. Once you’ve installed a package, you do not have to install it again. But every time you start a new R session, you need to load the packages you intend to use.

# Example package installation
## foreign allows you to read Stata .dta data files (Stata versions 5-12 only, not 13+), among others
#install.packages("foreign")
## plyr allows advanced data manipulation
#install.packages("plyr")
## ggplot2 facilitates the creation of excellent graphics
#install.packages("ggplot2")

# Load libraries to use the packages we just installed
#library(foreign)
#library(plyr)
#library(ggplot2)
#library(tidyverse)
# A really powerful collection of packages (install it first with install.packages("tidyverse")).
# Most of the data cleaning in my research is done with tidyverse.

For a more detailed introduction to packages, see here. In addition, while R allows you to install multiple packages in one call, library() loads only one package at a time. A helper function that installs and loads multiple packages at once is useful, and one is shown here. You don’t have to worry about this for now.
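
To give a rough idea of what such a helper could look like, here is a minimal sketch. The function name load_packages is hypothetical and this is not necessarily the approach used in the linked example; it only relies on the base R functions requireNamespace(), install.packages(), and library().

```r
# Hypothetical helper: install any missing packages, then load them all
load_packages <- function(pkgs) {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE when the package is not installed
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # character.only = TRUE lets library() accept the name as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# These two packages ship with R, so nothing is downloaded here
load_packages(c("stats", "utils"))
```

In your own scripts you would pass the packages you need, for example load_packages(c("foreign", "plyr", "ggplot2")).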

4.8 Directories

Before you begin writing an R script, you need to set up a clear file structure where you have:

  • Main project directory
    • code subdirectory
    • tables subdirectory
    • figures subdirectory
    • .tex file subdirectory

The following code might be helpful:

# Clear all existing values/models/etc.
#rm(list=ls())

# Identify the current working directory
#getwd()

# Set the working directory
## on a PC, the path must start with the drive letter, e.g., "C:/"
#setwd("/Users/timothypeterson/Documents/Teaching/Fall 2017/502/Week 1")

# Create folders for tables and figures
#dir.create("./tables")
#dir.create("./figures")
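
To set up the whole structure in one go, you could loop over the subdirectory names. The sketch below is one way to do it; the folder names (my_project, code, tables, figures, tex) are illustrative, and tempdir() is used only so the example does not touch your real files. In practice you would replace it with your own project path.

```r
# tempdir() keeps this example away from your real files
project_dir <- file.path(tempdir(), "my_project")
sub_dirs <- c("code", "tables", "figures", "tex")

for (sub in sub_dirs) {
  path <- file.path(project_dir, sub)
  # create the folder (and the project folder itself) only if it is missing
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
}

# check that every subdirectory now exists
dir.exists(file.path(project_dir, sub_dirs))
## [1] TRUE TRUE TRUE TRUE
```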

4.9 Additional Resources

This introduction draws heavily on the introduction offered by DataCamp and on Dr. Timothy Peterson’s and Dr. Yuleng Zeng's websites. In addition to what I have mentioned, the (free!) DataCamp course also covers factors, lists, and many other functions. I highly recommend taking this course both to strengthen what you have learned today and to pick up this new material. If you already have some experience with R and today’s intro is a bit easy for you, try out the two basics courses offered by RStudio. To access them, you need to sign in to RStudio Cloud and click Primers. Choose The Basics and you’ll see the two courses. Today’s material is closely related to the Programming Basics course.

Looking ahead, there are many free courses online that can help you learn more about using R. One set of courses that I worked through and would highly recommend is the Data Science Specialization offered by Coursera. It consists of nine courses (you do not need to take the capstone project or pay for these courses if you do not want a certificate). I spent around half a year on these courses in my first year. In hindsight, the specialization may have been overkill, but it helped a lot and showed me how powerful R can be. If you intend to take these courses, plan accordingly, given how time consuming they can be.