Chapter 4 Introduction to R
R is a versatile statistical language and program. In its simplest form, it can be used as a calculator. However, as you become more fluent with this tool, you can use it to accomplish all sorts of things. Among all the popular programming languages, and despite its initial learning curve, R should be your first language as a social scientist. In this section, we will introduce some basics of R using RStudio, a graphical user interface (GUI) for R, which is the main program you will rely on for the rest of the semester. See here for an intro to the RStudio interface, though you will become familiar with it soon.
Comments in an R script begin with the # sign: R does not execute anything that follows it on the same line. Comments are useful because they help you and others understand what the code is about. For instance,
# calculate 1 + 1 + 1
1 + 1 + 1
## [1] 3
As you learn more about the language, you should also develop a good R style, which makes your code easier to read (or reread, say, 5 months after submitting your paper), share, and verify.
4.1 R as a calculator
Let’s begin with some basic arithmetic operators (for more, type ?"*" in the console, which opens the help page for the arithmetic operators).
- Addition: +
- Subtraction: -
- Multiplication: *
- Division: /
- Exponentiation: ^
- Modulo: %%
Play around with them in your console to see how they work. In particular, try the last two operators, which might be unfamiliar to some of you.
# Exponentiation: raises a number to a certain power
3^2
## [1] 9
2^3
## [1] 8
# Modulo: returns the remainder of a division
27 %% 7
## [1] 6
In addition, you should pay attention to the order of operations. Try this,
3 + 4 / 7
## [1] 3.571429
(3 + 4) / 7
## [1] 1
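The order of operations can be checked with a few more cases. The following is a minimal sketch; the integer division operator %/% is not in the list above and is included here only because it pairs naturally with modulo.

```r
# Multiplication and division are evaluated before addition and subtraction
2 + 3 * 4      # 14
(2 + 3) * 4    # 20

# Exponentiation is evaluated before the minus sign in front of a number
-2^2           # -4
(-2)^2         # 4

# Integer division (%/%) gives the whole-number quotient; modulo (%%) gives the remainder
quotient  <- 27 %/% 7   # 3
remainder <- 27 %% 7    # 6
quotient * 7 + remainder  # 27, the number we started with
```

When in doubt, add parentheses: they cost nothing and make the intended order explicit.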
4.2 Basic Data Types in R
R works with many data types. In the previous section, we worked with numbers, which are called numeric. Here are some basic types to start with:
- Decimal values like 1.111 are called numeric.
- Whole numbers like 2L are called integer; the L suffix tells R to store the value as an integer, whereas a plain 2 is stored as numeric. Integers are also numeric.
- Boolean values (TRUE or FALSE) are called logical.
- Text (or string) values ("R rocks") are called character.
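You can ask R which type a value has with the class() function. The short sketch below checks one value of each type from the list; note that a whole number typed without the L suffix is stored as numeric.

```r
class(1.111)      # "numeric"
class(2)          # "numeric": a plain number is stored as a decimal value
class(2L)         # "integer": the L suffix requests integer storage
is.numeric(2L)    # TRUE: integers also count as numeric
class(TRUE)       # "logical"
class("R rocks")  # "character"
```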
Before we proceed, let’s also talk about variable assignment. A variable is a basic concept in statistics. In R, we use it to store a value or an object (e.g., a function, plot, or dataset). You can later access the stored value or object by simply typing the variable’s name. To assign a value to a variable, we use the operator <- or = (the former is preferred). Try this,
## assign a value to the variable apples
apples <- 4
## print out the value of apples
apples
## [1] 4
## assign a value to the variable oranges
oranges <- 6
## print out the sum of apples and oranges
apples + oranges
## [1] 10
You can create character values with either double or single quotation marks, though the former is preferred. Also, R is case sensitive. Recall that there are different data types. Sometimes, you can’t compare apples and oranges: adding a number to a character value produces an error, which is why the last line below is commented out.
## assign a character value to the variable oranges
oranges <- "six"
## try to add apples and oranges (uncomment to see the error)
#apples + oranges
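If you want to see the error without stopping your script, one option is to wrap the expression in tryCatch(), a base R function not otherwise covered in this chapter. The sketch below stores the error message in a variable (result is just an illustrative name) and then shows that a character value that looks like a number can be converted with as.numeric() before adding.

```r
apples  <- 4
oranges <- "six"

# tryCatch() evaluates the expression; if it fails, the handler returns the error message
result <- tryCatch(
  apples + oranges,
  error = function(e) conditionMessage(e)
)
result   # the text of the error message R would have printed

# A character value that looks like a number can be converted first
oranges <- "6"
apples + as.numeric(oranges)   # 10
```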
4.3 Logical Operators
We introduced some arithmetic operators previously. Now let’s look at some logical operators, which yield Boolean values (TRUE or FALSE). For a list of both types of operators, see here. Some basics to start with,
- < less than
- <= less than or equal to
- > greater than
- >= greater than or equal to
- == exactly equal to
- !X Not X
Now try this,
## is 1 smaller than 2
1 < 2
## [1] TRUE
## does 1 plus 1 equal 3
1 + 1 == 3
## [1] FALSE
In arithmetic, R treats TRUE as 1 and FALSE as 0. Now let’s try this,
apples <- 4
oranges <- TRUE
apples + oranges
## [1] 5
Recall that R is case sensitive. Try this,
## lower case and upper case
oranges <- "six"
Oranges <- "Six"
oranges == Oranges
## [1] FALSE
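The remaining operators from the list, and the way logical values behave in arithmetic, can be tried out in the same way. This is a small sketch; tolower() is a base R function not introduced above and is used here only to show one way around case sensitivity.

```r
2 <= 2               # TRUE
3 >= 5               # FALSE
!(1 < 2)             # FALSE: ! flips TRUE to FALSE

# Logical values become 1 and 0 when used in arithmetic
TRUE + TRUE          # 2
as.numeric(FALSE)    # 0

# Character comparisons are case sensitive unless you convert the case first
"six" == "Six"             # FALSE
tolower("Six") == "six"    # TRUE
```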
4.4 Vectors
Previously, we talked about variables. Another basic statistical concept is a vector, which is a one-dimensional array that can hold as many elements as you want. In R, we use the combine function c() to create a vector. The elements you wish to place in the vector are separated by commas. For instance,
## a numeric vector
(num_vec <- c(1, 2, 3))
## [1] 1 2 3
Now try creating a character vector and a Boolean vector yourself. You can check the data type of your vector by using the class() function. After you are done, try this:
## a vector that mixes data types
(mix_vec <- c(1, "Hi", TRUE))
## [1] "1" "Hi" "TRUE"
class(mix_vec)
## [1] "character"
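The result above is character because a vector can hold only one data type, so R converts every element to the most flexible type present: logical gives way to numeric, and numeric gives way to character. A short sketch of this conversion:

```r
class(c(TRUE, FALSE))     # "logical": all elements share one type already
c(TRUE, 2)                # 1 2: TRUE is converted to the number 1
class(c(TRUE, 2))         # "numeric"
class(c(TRUE, 2, "Hi"))   # "character": everything becomes text
```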
For arithmetic operations of vectors, R proceeds element-wise. For addition, try this
c(1, 2, 3) + c(4, 5, 6)
## [1] 5 7 9
c(1 + 4, 2 + 5, 3 + 6)
## [1] 5 7 9
How about other operations, say multiplication? What happens when the lengths of the two vectors are not equal? Try these out yourself.
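Once you have tried these yourself, you can compare your results against the sketch below. Multiplication is also element-wise, and when one vector is shorter, R reuses ("recycles") its elements until it matches the length of the longer one.

```r
# Multiplication works element by element, just like addition
c(1, 2, 3) * c(4, 5, 6)      # 4 10 18

# The shorter vector is recycled: 10, 20, 10, 20
c(1, 2, 3, 4) + c(10, 20)    # 11 22 13 24

# A single number is a vector of length one, so it is recycled too
c(1, 2, 3) + 10              # 11 12 13
```

If the longer length is not a multiple of the shorter one, for example c(1, 2, 3) + c(10, 20), R still recycles but prints a warning.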
To select an element, you tell R which element you want by using square brackets. To select multiple elements, use the combine function c(). For instance,
num_vec <- c(11, 21, 63, 44, 95, 86)
num_vec[3]
## [1] 63
num_vec[c(1,4)]
## [1] 11 44
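Square brackets accept more than positions. As a brief sketch beyond what is shown above, you can also select a range, drop an element with a negative index, or keep only the elements that satisfy a logical condition.

```r
num_vec <- c(11, 21, 63, 44, 95, 86)

num_vec[2:4]            # 21 63 44: elements two through four
num_vec[-1]             # 21 63 44 95 86: everything except the first element
num_vec[num_vec > 50]   # 63 95 86: only the elements greater than 50
```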
4.5 Matrices
The third statistical concept we’ll introduce today is a matrix, which is a two-dimensional collection of elements of the same data type, arranged in a fixed number of rows and columns. You can construct a matrix using the matrix() function.
matrix(1:12, byrow=TRUE, nrow=3)
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
You can also construct matrices by combining vectors or matrices using the cbind() or the rbind() functions.
(c1 <- 1:3)
## [1] 1 2 3
(c2 <- 4:6)
## [1] 4 5 6
(c3 <- 7:9)
## [1] 7 8 9
cbind(c1,c2,c3)
##      c1 c2 c3
## [1,]  1  4  7
## [2,]  2  5  8
## [3,]  3  6  9
rbind(c1,c2,c3)
##    [,1] [,2] [,3]
## c1    1    2    3
## c2    4    5    6
## c3    7    8    9
To select an element of a matrix, we also use the square brackets. Since matrices are two dimensional, it means we need to index the element by both its row number and column number. For instance, my_matrix[1,2] gives you the element that is at the first row and second column. How about my_matrix[1:2,2:3] and my_matrix[,2]? Create a matrix that has at least two rows and three columns and play around with this code.
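As one possible starting point, the sketch below builds my_matrix from the 3-by-4 example used earlier in this section (any matrix with at least two rows and three columns would do) and runs the three selections mentioned above.

```r
# The same 3-by-4 matrix as in the matrix() example, filled row by row
my_matrix <- matrix(1:12, byrow = TRUE, nrow = 3)

my_matrix[1, 2]        # 2: first row, second column
my_matrix[1:2, 2:3]    # rows 1-2 and columns 2-3, returned as a 2-by-2 matrix
my_matrix[, 2]         # 2 6 10: leaving the row index empty selects the whole column
```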
For basic arithmetic operations, matrices also work element-wise. For instance, 11+my_matrix adds 11 to every element of my_matrix. You can also perform basic arithmetic between matrices, with an additional caveat: the matrices need to be conformable, which for element-wise operations means they have the same dimensions. For instance, matrix(1:6, nrow = 2)+matrix(1:9, nrow=3) gives you an error. As you become more fluent with R and statistics, you may need more involved operations, such as inner and outer products. You do not need to worry about them for now.
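The sketch below illustrates these points with a small my_matrix. The last step uses tryCatch(), a base R function not covered in this chapter, so that the error from the non-conformable matrices is stored as text in msg (an illustrative name) instead of stopping the script.

```r
my_matrix <- matrix(1:12, byrow = TRUE, nrow = 3)

11 + my_matrix          # adds 11 to every element
my_matrix + my_matrix   # same dimensions, so each pair of elements is added
my_matrix * my_matrix   # also element-wise: this is not matrix multiplication

# A 2-by-3 and a 3-by-3 matrix do not have the same dimensions
msg <- tryCatch(
  matrix(1:6, nrow = 2) + matrix(1:9, nrow = 3),
  error = function(e) conditionMessage(e)
)
msg   # the text of the error message
```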
4.6 Data Frame
Recall that all elements within a matrix need to be of the same type. In many cases, however, this is not a desirable feature. For instance, in International Relations studies, you might want to have data of different types within a dataset:
- ‘Which country?’: character
- GDP of the country: numeric
- ‘Is the country a democracy?’: logical
A data frame helps you store data of various types, with the variables as columns and the observations as rows.
# print out built-in R data frame
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
summary(mtcars)
## mpg cyl disp hp drat
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
## Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
## wt qsec vs am gear
## Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000
## 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
## Median :3.325 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000
## Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688
## 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
## Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000
## carb
## Min. :1.000
## 1st Qu.:2.000
## Median :2.000
## Mean :2.812
## 3rd Qu.:4.000
## Max. :8.000
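You can also build a data frame yourself with the data.frame() function. The sketch below mirrors the International Relations example from the start of this section; the country names and all values are made-up placeholders for illustration, not real data.

```r
# Placeholder values for three hypothetical countries (not real data)
countries <- data.frame(
  country   = c("Country A", "Country B", "Country C"),  # character
  gdp       = c(1.5, 20.3, 0.8),                         # numeric
  democracy = c(TRUE, FALSE, TRUE),                      # logical
  stringsAsFactors = FALSE   # keep text as character in older versions of R
)

str(countries)                     # one column per variable, each with its own type
countries$gdp                      # the $ sign extracts a single column as a vector
countries[countries$democracy, ]   # rows for which democracy is TRUE

# The same square-bracket logic works on built-in data
mtcars[1:3, c("mpg", "cyl")]       # first three rows, two named columns
```

Data frames are indexed like matrices, by row and then column, but columns can also be referred to by name.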
4.7 Packages
A package in R is a collection of functions and objects. Whenever you open RStudio or R, it automatically loads a number of packages. You can check the packages that have been loaded using the sessionInfo() function. As your skills get more sophisticated, you will need more packages to help you tackle the problems at hand. To do so, you need to install a package using the install.packages() function and load it using the library() function. Once you’ve installed a package, you do not have to install it again. But every time you start a new R session, you need to load the packages you intend to use.
# Example package installation
## foreign allows you to read Stata .dta data files (versions 5-12 only, NOT version 13+), among others
#install.packages("foreign")
## plyr allows advanced data manipulation
#install.packages("plyr")
## ggplot2 facilitates the creation of excellent graphics
#install.packages("ggplot2")
# Load libraries to use the packages we just installed
#library(foreign)
#library(plyr)
#library(ggplot2)
#library(tidyverse)
# A really powerful package.
# My data cleaning process in doing all my research is mainly accomplished by using "tidyverse".
For a more detailed introduction to packages, see here. In addition, while install.packages() can install several packages in one call, library() loads only one package at a time. A helper function that installs and loads multiple packages at once can be useful; one is shown here. You don’t have to worry about this for now.
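For the curious, here is a minimal sketch of what such a helper might look like. The name load_packages is hypothetical (it is not a function that ships with R); it installs whatever is missing and then loads every package in the list.

```r
# load_packages() is a hypothetical helper name, not part of R itself
load_packages <- function(pkgs) {
  # requireNamespace() returns TRUE when a package is already installed
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!installed]

  # Install only what is missing (this step needs an internet connection)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }

  # library() takes one package at a time, so loop over the names
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# "stats" and "utils" ship with R, so this call installs nothing
load_packages(c("stats", "utils"))
```

In your own scripts you would pass the packages you actually need, for example load_packages(c("foreign", "plyr", "ggplot2")).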
4.8 Directories
Before you begin writing an R script, you need to set up a clear file structure where you have:
- Main project directory
- code subdirectory
- tables subdirectory
- figures subdirectory
- .tex file subdirectory
The following code might be helpful:
# Clear all existing values/models/etc.
#rm(list=ls())
# Identify the current working directory
#getwd()
# Set the working directory
## on a PC, the path must start with the drive letter (e.g., "C:/"); the path below is from a Mac
#setwd("/Users/timothypeterson/Documents/Teaching/Fall 2017/502/Week 1")
# Create folders for tables and figures
#dir.create("./tables")
#dir.create("./figures")
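The file structure listed above can also be created in one go. The sketch below is only an illustration: it builds the folders inside R's temporary directory so that nothing on your computer is changed permanently, and the names my_project and tex are assumptions. In practice, you would replace project_dir with the path to your own project.

```r
# Hypothetical project location inside R's temporary directory
project_dir <- file.path(tempdir(), "my_project")

# One subdirectory for each item in the list above
subdirs <- c("code", "tables", "figures", "tex")

for (d in subdirs) {
  path <- file.path(project_dir, d)   # file.path() joins folder names with "/"
  if (!dir.exists(path)) {
    # recursive = TRUE also creates project_dir itself if it does not exist yet
    dir.create(path, recursive = TRUE)
  }
}

list.files(project_dir)   # lists the four subdirectories just created
```

Checking dir.exists() first means the code can be run repeatedly without warnings about folders that already exist.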
4.9 Additional Resources
This introduction draws heavily on the introduction offered by DataCamp and on Dr. Timothy Peterson’s website as well as Dr. Yuleng Zeng's website. In addition to what I have mentioned, the course (free!) by DataCamp also talks about factors, lists, and many other functions. I highly recommend taking this course both to strengthen what you have learned today and to pick up this new material. If you already have some experience with R and today’s intro is a bit easy for you, try out the two basics courses offered by RStudio. To access them, you need to sign in to RStudio Cloud and click Primers. Choose The Basics and you’ll see the two courses. Today’s material is closely related to the Programming Basics course.
Looking into the future, there are many free courses online that can help you learn more about using R. One set of courses that I worked through and would highly recommend is the Data Science Specialization offered by Coursera. It consists of nine courses (you do not need to take the capstone project or pay for all these courses if you do not want a certificate). I spent around half a year on these courses in my first year. In hindsight, the specialization may have been overkill. But it did help a lot and showed me how powerful R can be. If you intend to take the courses, you might want to plan accordingly, given how time-consuming they can be.