Chapter 2 Introduction to R
R is a versatile statistical language and program. In its simplest form, it can be used as a calculator. However, as you become more fluent with this tool, you can use it to accomplish all sorts of things. Among all the popular programming languages, and despite its steep initial learning curve, R should be your first language as a social scientist. In this section, we will introduce some basics of R using RStudio, a graphical interface for R, which is the main program you will rely on for the rest of the semester. See here for an intro to the RStudio interface, though you will become familiar with it soon.
Comments in R code begin with the # sign: anything after it on the same line will not be executed. This is useful because it helps you and others understand what the code is about. For instance,
# calculate 1 + 1
1 + 1
As you learn more about the language, you should also develop a good R style, which makes your code easier to read (or reread, say, 5 months after submitting your paper), share, and verify.
2.1 R as a Calculator
Let’s begin with some basic arithmetic operators (for more, type ?"*" in the console, which gives you the help page for all arithmetic operators).
- Addition: +
- Subtraction: -
- Multiplication: *
- Division: /
- Exponentiation: ^
- Modulo: %%
Play around with them in your console to see how they work. In particular, try the last two operators, which might be unfamiliar to some of you.
# Exponentiation: raises a number to a certain power
3^2
2^3
# Modulo: returns the remainder of a division
27 %% 7
In addition, you should pay attention to the order of operations. Try this,
3 + 4 / 7
(3 + 4) / 7
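To make the order of operations concrete, here is a small sketch that stores each result in a variable so you can inspect it. The variable names are made up for illustration.

```r
# Division is evaluated before addition unless parentheses say otherwise
no_parens <- 3 + 4 / 7       # same as 3 + (4 / 7)
with_parens <- (3 + 4) / 7   # the sum is computed first, giving 1

# Exponentiation is evaluated before multiplication and before the minus sign
power_first <- 2 * 3^2       # 2 * 9 = 18
negated <- -2^2              # -(2^2) = -4, not (-2)^2

print(c(no_parens, with_parens, power_first, negated))
```

When in doubt, add parentheses: they cost nothing and make your intent clear to the reader.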
2.2 Basic Data Types in R
R works with many data types. In the previous section, we worked with numbers, which are called numeric. Here are some basics to start with,
- Decimal values like 1.111 are called numeric.
- Natural numbers like 2 are called integer. Integers are also numerics. (In R, a plain 2 is stored as numeric; write 2L to store it explicitly as an integer.)
- Boolean values (TRUE or FALSE) are called logical.
- Text (or string) values ("R rocks") are called character.
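As a quick check of the types listed above, you can ask R for the type of a value with the class() function (introduced more fully in the section on vectors). This is only a sketch; the variable names are hypothetical.

```r
# Ask R which data type each value has
type_decimal <- class(1.111)      # "numeric"
type_plain_2 <- class(2)          # "numeric": a plain 2 is stored as a decimal number
type_integer <- class(2L)         # "integer": the L suffix requests an integer
type_logical <- class(TRUE)       # "logical"
type_text    <- class("R rocks")  # "character"

# Integers still count as numerics
integer_is_numeric <- is.numeric(2L)  # TRUE

print(c(type_decimal, type_plain_2, type_integer, type_logical, type_text))
```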
Before we proceed, let’s also talk about variable assignment. A variable is a basic concept in statistics. In R, we use it to store a value or an object (e.g., a function, a plot, or a dataset). You can later access the stored value or object by simply typing the variable’s name. To assign a value to a variable, we use the operator <- or = (the former is preferred). Try this,
## assign a value to the variable apples
apples <- 4
## print out the value of apples
apples
Now try this,
## assign a value to the variable oranges
oranges <- 6
## print out the sum of apples and oranges
apples + oranges
You can create character values with either double or single quotation marks, though the former is preferred. Also, R is case sensitive. Recall there are different data types. Sometimes, you can’t compare apples and oranges: adding a number to a character value gives an error.
## assign a character value to the variable oranges
oranges <- "six"
## try to add apples and oranges (this gives an error)
apples + oranges
2.2.1 Logical Operators
We introduced some arithmetic operators previously. Now let’s look at some logical operators, which yield Boolean values (TRUE or FALSE). For a list of both types of operators, see here. Some basics to start with,
- < less than
- <= less than or equal to
- > greater than
- >= greater than or equal to
- == exactly equal to
- !X not X
Now try this,
## is 1 smaller than 2
1 < 2
## does 1 plus 1 equal 3
1 + 1 == 3
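Here is a short sketch of a few more of the operators listed above, including the negation operator. The variable names are made up for illustration.

```r
# Each comparison returns a single logical value
is_leq      <- 3 <= 3              # TRUE: 3 is equal to 3
is_greater  <- 5 > 7               # FALSE
not_greater <- !(5 > 7)            # TRUE: ! flips FALSE to TRUE
same_text   <- "apple" == "apple"  # TRUE: == also works on character values

print(c(is_leq, is_greater, not_greater, same_text))
```

Note that == tests equality, while a single = assigns a value, so the two are not interchangeable.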
In R, TRUE equals 1 and FALSE equals 0. Now let’s try this,
apples <- 4
oranges <- TRUE
apples + oranges
Recall that R is case sensitive. Try this,
## lower case and upper case
oranges <- "six"
Oranges <- "Six"
oranges == Oranges
2.3 Vectors
Previously, we talked about variables. Another basic statistical concept is a vector, which is a one-dimensional array that can hold as many data elements as you want. In R, we use the combine function c() to create a vector. The elements you wish to place in the vector are separated by commas. For instance,
## a numeric vector
num_vec <- c(1, 2, 3)
Now try creating a character vector and a Boolean vector yourself. You can check the data type of your vector by using the class() function. After you are done, try this
## a vector that mixes data types
mix_vec <- c(1, "Hi", TRUE)
class(mix_vec)
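The result may be surprising: a vector can hold only one data type, so R converts the elements to a common type. The sketch below shows what happens to the mixed vector, plus a second, hypothetical vector (num_lgl) for comparison.

```r
# Mixing a number, a string, and a logical: everything becomes character
mix_vec <- c(1, "Hi", TRUE)
print(mix_vec)             # every element is now a quoted string
mix_class <- class(mix_vec)

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
num_lgl <- c(1, TRUE, FALSE)
num_lgl_class <- class(num_lgl)

print(c(mix_class, num_lgl_class))
```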
For arithmetic operations of vectors, R proceeds element-wise. For addition, try this
c(1, 2, 3) + c(4, 5, 6)
c(1 + 4, 2 + 5, 3 + 6)
How about other operations, say multiplication? What happens when the two vectors differ in length? Try these out yourself.
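If you want to check your answers, the following sketch shows element-wise multiplication and what R does with vectors of unequal length. The variable names are made up for illustration.

```r
# Multiplication is also element-wise: 1*4, 2*5, 3*6
prod_vec <- c(1, 2, 3) * c(4, 5, 6)

# With unequal lengths, R recycles the shorter vector:
# 1+10, 2+20, 3+10, 4+20
recycled <- c(1, 2, 3, 4) + c(10, 20)

# A single number is recycled across every element
scaled <- c(1, 2, 3) * 2

print(prod_vec)
print(recycled)
print(scaled)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign that something in your code is off.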
To select an element, you tell R which element you want by using square brackets. To select multiple elements, use the combine function c(). For instance,
num_vec <- c(11, 21, 63, 44, 95, 86)
num_vec[3]
num_vec[c(1,4)]
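Beyond single positions, square brackets accept a few other kinds of index. The sketch below uses the same num_vec; the result names are hypothetical.

```r
num_vec <- c(11, 21, 63, 44, 95, 86)

# A range of positions: elements 2 through 4
middle <- num_vec[2:4]

# A negative index drops that position instead of selecting it
without_first <- num_vec[-1]

# A logical condition keeps only the elements for which it is TRUE
above_50 <- num_vec[num_vec > 50]

print(middle)
print(without_first)
print(above_50)
```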
2.4 Matrices
The third statistical concept we’ll introduce today is a matrix, which is a two-dimensional collection (with a fixed number of rows and columns) of elements of the same data type. You can construct a matrix using the matrix() function.
matrix(1:12, byrow=TRUE, nrow=3)
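To see what the byrow argument does, compare the same twelve numbers filled row by row and column by column. This is a small sketch; by_row and by_col are made-up names.

```r
# Fill the matrix one row at a time: the first row holds 1, 2, 3, 4
by_row <- matrix(1:12, byrow = TRUE, nrow = 3)

# Fill the matrix one column at a time (R's default): the first column holds 1, 2, 3
by_col <- matrix(1:12, byrow = FALSE, nrow = 3)

print(by_row)
print(by_col)

# Both have 3 rows and 4 columns
print(dim(by_row))
```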
You can also construct matrices by combining vectors or matrices using the cbind() or the rbind() functions.
c1 <- 1:3
c2 <- 4:6
c3 <- 7:9
cbind(c1,c2,c3)
rbind(c1,c2,c3)
To select an element of a matrix, we also use square brackets. Since matrices are two-dimensional, we need to index the element by both its row number and its column number. For instance, my_matrix[1,2] gives you the element in the first row and second column. How about my_matrix[1:2,2:3] and my_matrix[,2]? Create a matrix that has at least two rows and three columns and play around with this code.
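One possible way to work through this exercise is sketched below, using a 3-by-4 matrix as my_matrix. The choice of matrix and the result names are assumptions for illustration.

```r
# Rows are 1 2 3 4 / 5 6 7 8 / 9 10 11 12
my_matrix <- matrix(1:12, byrow = TRUE, nrow = 3)

# Row 1, column 2: a single element
one_element <- my_matrix[1, 2]

# Rows 1 to 2 and columns 2 to 3: a smaller 2-by-2 matrix
sub_block <- my_matrix[1:2, 2:3]

# Leaving the row index empty selects every row: the whole second column
second_col <- my_matrix[, 2]

print(one_element)
print(sub_block)
print(second_col)
```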
For basic arithmetic operations, matrices also work element-wise. For instance, 11+my_matrix adds 11 to all elements of my_matrix. You can also perform basic arithmetic between matrices, with an additional caveat: the matrices need to be conformable, meaning they have the same dimensions. For instance, matrix(1:6, nrow = 2)+matrix(1:9, nrow=3) gives you an error. As you become more fluent with R and statistics, you may need more involved operations, such as inner and outer products. You do not need to worry about them for now.
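The sketch below illustrates these points. To keep the script running past the error, it wraps the failing addition in tryCatch(), a base R function not covered in this chapter, and stores the error message as text. The variable names are hypothetical.

```r
my_matrix <- matrix(1:6, nrow = 2)

# Adding a single number applies to every element
shifted <- 11 + my_matrix

# Adding two matrices with the same dimensions works element by element
doubled <- my_matrix + my_matrix

# Adding a 2-by-3 and a 3-by-3 matrix fails; capture the message instead of stopping
error_text <- tryCatch(
  matrix(1:6, nrow = 2) + matrix(1:9, nrow = 3),
  error = function(e) conditionMessage(e)
)

print(shifted)
print(doubled)
print(error_text)
```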
2.5 Data Frame
Recall all elements within a matrix need to be of the same type. In many cases, however, this is not a desirable feature. For instance, in International Relations studies, you might want to have data of different types within a dataset.
- ‘Which country?’: character
- GDP of the country: numeric
- ‘Is the country a democracy?’: logical
A data frame helps you store data of various types, with the variables as columns and the observations as rows.
# print out built-in R data frame
mtcars
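You can also build a data frame yourself with the data.frame() function. The sketch below mirrors the International Relations example above; the country names and GDP figures are placeholders invented for illustration, not real data.

```r
# Hypothetical values, made up purely for illustration
countries <- data.frame(
  country = c("Country A", "Country B", "Country C"),  # character
  gdp = c(1.5, 20.3, 0.8),                             # numeric (placeholder figures)
  democracy = c(TRUE, FALSE, TRUE),                    # logical
  stringsAsFactors = FALSE  # keep text as character in older versions of R
)

print(countries)

# Each column keeps its own data type
col_types <- sapply(countries, class)
print(col_types)
```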
There are several functions to help you examine a data frame, including head(), tail(), str(), summary(), and dim(). Try these out yourself. Also, if you are not sure how a function works, type ? before the function name and R will give you a help page with more details. Try ?head, read the help page, and show the first 3 rows of the mtcars data.
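One possible solution, along with a few of the other inspection functions, is sketched below. The result names are made up for illustration.

```r
# The n argument of head() and tail() controls how many rows are shown
first_three <- head(mtcars, n = 3)
last_two <- tail(mtcars, n = 2)
print(first_three)
print(last_two)

# dim() returns the number of rows, then the number of columns
size <- dim(mtcars)
print(size)

# str() lists each column with its type; summary() gives basic statistics
str(mtcars)
summary(mtcars$mpg)
```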
2.6 Packages
A package in R is a collection of functions and objects. Whenever you open RStudio or R, it automatically loads a number of packages. You can check the packages that have been loaded using the sessionInfo() function. As your skills get more sophisticated, you will need more packages to help you tackle the problems at hand. To do so, you need to install a package using the install.packages() function and load it using the library() function. Once you’ve installed a package, you do not have to install it again. But every time you start a new R session, you need to load the packages you intend to use.
# Example package installation
## foreign allows you to read Stata .dta data files (but NOT those saved by Stata 13+), among others
install.packages("foreign")
## plyr allows advanced data manipulation
install.packages("plyr")
## ggplot2 facilitates the creation of excellent graphics
install.packages("ggplot2")
# Load libraries to use the packages we just installed
library(foreign)
library(plyr)
library(ggplot2)
For a more detailed introduction to packages, see here. In addition, while install.packages() accepts several package names at once, library() loads only one package per call. A helper function that installs and loads multiple packages at once can be useful; one is shown here. You don’t have to worry about this for now.
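For the curious, here is a minimal sketch of what such a helper might look like. The function names missing_packages() and install_and_load() are hypothetical and not part of R or any package; the base R functions they rely on (requireNamespace(), install.packages(), library()) are real.

```r
# Hypothetical helper: return the packages in pkgs that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install whatever is missing, then load everything
install_and_load <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  # character.only = TRUE lets library() accept package names stored as strings
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# "stats" and "utils" ship with R, so nothing is downloaded in this example
install_and_load(c("stats", "utils"))
```

In your own scripts you would pass the packages you need, for example c("foreign", "plyr", "ggplot2").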
2.7 Code to Create Directories
Before you begin writing an R script, you need to set up a clear file structure where you have:
- Main project directory
- code subdirectory
- tables subdirectory
- figures subdirectory
- .tex file subdirectory
You can do so by the following code
# Clear all existing values/models/etc.
rm(list=ls())
# Identify the current working directory
getwd()
# Set the working directory
## when setting your working directory on a PC, the path must include the drive, e.g. "C:/" (the path below is a Mac example)
setwd("/Users/timothypeterson/Documents/Teaching/Fall 2017/502/Week 1")
# Create folders for tables and figures
dir.create("./tables")
dir.create("./figures")
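Running dir.create() on a folder that already exists produces a warning. A slightly more defensive sketch checks first and creates all the subdirectories in one go. The function name create_project_dirs() and the subdirectory name "tex" are assumptions for illustration, and the example writes to a temporary folder so it does not touch your real project.

```r
# Hypothetical helper: create each subdirectory under root unless it already exists
create_project_dirs <- function(root,
                                subdirs = c("code", "tables", "figures", "tex")) {
  for (d in subdirs) {
    path <- file.path(root, d)  # builds a path such as root/tables
    if (!dir.exists(path)) {
      dir.create(path, recursive = TRUE)  # recursive = TRUE also creates root if needed
    }
  }
  invisible(file.path(root, subdirs))
}

# Demonstration in a throwaway location; use your own project directory instead
demo_root <- file.path(tempdir(), "demo_project")
created <- create_project_dirs(demo_root)
print(dir.exists(created))
```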
2.8 A Working Example
This example, written by Tim, shows how to code, summarize, and visualize data.
# First, remove all we have done so far
rm(list = ls())
# Load the diamonds data (requires ggplot2, which we loaded earlier)
data(diamonds)
names(diamonds)
head(diamonds)
str(diamonds)
summary(diamonds)
# R histogram including labels
hist(diamonds$carat, main = "Carat Histogram", xlab = "Carat")
# ggplot2 version of same histogram
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat))
# ggplot2 allows for a lot of adjustments to fine-tune the graph, which are added with "+"
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), fill = "grey50") + # note the option to change bar color
  ylab("Frequency") +
  xlab("Carats") +
  ggtitle("Count of diamonds by size") +
  theme_bw() # This changes the background color
# Save the plot to the figures folder we created
ggsave(filename = "./figures/figure1.pdf", width = 6.5, height = 5)
ggsave(filename = "./figures/figure1a.png", width = 6.5, height = 5, device = "png")
# Let's save a summary stats table to the tables folder we created
library(stargazer)
# Keep three variables for which to produce summary stats
diamonds <- subset(diamonds, select = c("carat", "depth", "price"))
## Some R example data is not stored as a normal data frame, in which case the following command is necessary
diamonds <- as.data.frame(diamonds)
sum.table1 <- stargazer(diamonds, covariate.labels = c("Size (carats)", "Depth", "Price"), title = "Summary stats for diamond data", label = "table:summary1")
write(x=sum.table1, file="./tables/Summary1.tex")
2.9 Additional Resources
This introduction draws heavily on the introduction offered by DataCamp and on Tim Peterson’s website. In addition to what I have mentioned, the (free!) DataCamp course also covers factors, lists, and many other functions. I highly recommend taking this course both to strengthen what you have learned today and to pick up this new material. If you already have some experience with R and today’s intro is a bit easy for you, try out the two basics courses offered by RStudio. To access them, you need to sign in to RStudio Cloud and click Primers. Choose The Basics and you’ll see the two courses. Today’s material is closely related to the Programming Basics course.
Looking ahead, there are many free online courses that can help you learn more about using R. One set of courses that I worked through and would highly recommend is the Data Science Specialization offered by Coursera. It consists of nine courses (you do not need to take the capstone project or pay for these courses if you do not want a certificate). I spent around half a year on these courses in my first year. In hindsight, the specialization may have been overkill, but it helped a lot and showed me how powerful R can be. If you intend to take these courses, plan accordingly, given how time-consuming they can be.