3 The Basics
In this section, I’ll introduce foundational concepts of programming in R, assuming zero experience with coding.
As of Fall 2024, my view as a practicing empirical economics researcher is that learning the basics of how languages like R work is still important, even in a world of large language models (LLMs). At the end of this section, I provide some advice about how to use LLMs as you code.
3.1 How coding works
A computer is a machine that follows instructions. In some obvious ways, computers are more powerful than the human brain. If you asked me to invert a 10-by-10 matrix, I would have a lot of trouble, but a computer can do it almost instantly. However, at least for the time being, computers need us to give them guidance. When we write code, we are creating precise instructions that direct the computer’s power towards our goal.
R is a language that you can use to communicate with your computer. It is specifically designed to make it straightforward for users to manipulate and analyze data.
3.2 R objects
R operates on named objects. Almost everything you do in R is going to create a new object, manipulate an existing object, or print output from an existing object.
That’s all a bit vague, and it helps to be concrete when getting started. In this section, I’ll walk you through creating and working with some of the most important types of objects you’re likely to work with. For a more detailed explanation of R’s object types, I recommend An Introduction to R.
3.2.1 Scalars and numeric vectors
The simplest object you can create is a single number. We already did this once in the previous section. The command x = 3 creates a new object named x with value 3. In general, this is what the = operator does: it assigns the value on the right to the name on the left.[3]
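A minimal sketch of this assignment, assuming we display the result with print():

```r
x = 3     # assign the value 3 to the name x
print(x)  # display the value stored in x
```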
## [1] 3
You can also modify existing objects. Write the object name on the left side of =, and then implement the desired changes on the right.
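A minimal sketch of such a modification, assuming x starts at 3 (as in the example above) and that the change we want is to add 1:

```r
x = 3      # starting value, assumed to match the earlier example
x = x + 1  # the right side is evaluated first, then stored back in x
print(x)
```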
## [1] 4
I call single numbers scalars. We can combine an arbitrary number of scalars into numeric vectors using the c() function.
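A sketch of building a numeric vector with c(). The name y and the particular entries are assumptions, chosen to be consistent with the output that follows:

```r
# Combine seven scalars into one numeric vector; 2/3 is evaluated before storing
y = c(1, 23, 34, 87.3, -34, 30, 2/3)
print(y)
```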
## [1] 1.0000000 23.0000000 34.0000000 87.3000000 -34.0000000 30.0000000 0.6666667
You can do lots of things to vectors. Here are some examples.
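A sketch using four base R functions, length(), sum(), max(), and mean(), on a vector y that is assumed to hold the values shown above:

```r
y = c(1, 23, 34, 87.3, -34, 30, 2/3)  # assumed values, as above

print(length(y))  # number of elements in y
print(sum(y))     # total of all elements
print(max(y))     # largest element
print(mean(y))    # average of the elements
```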
## [1] 7
## [1] 141.9667
## [1] 87.3
## [1] 20.28095
You can also perform arithmetic operations on vectors. Conveniently, you can use scalars in these expressions. So, for example, y+10 will add 10 to every element of y.
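A sketch of arithmetic between a vector and a scalar, again assuming the vector y from above. Multiplication works the same way as addition: the operation is applied to every element. The name y_plus_10 is hypothetical.

```r
y = c(1, 23, 34, 87.3, -34, 30, 2/3)  # assumed values, as above

y_plus_10 = y + 10  # adds 10 to every element of y (stored, not printed)
print(y * 4)        # multiplies every element of y by 4
```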
## [1] 4.000000 92.000000 136.000000 349.200000 -136.000000 120.000000 2.666667
3.2.2 Logical vectors
Suppose we wanted to study only above-average elements of a vector. How can we ask R which elements are above-average?
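One way to ask this question is to compare the vector to its own mean. In this sketch, the values of x are assumptions chosen to match the output that follows:

```r
x = c(1, 3, 7, 2.5, 14, 0, 2, 5)
print(x)

# Element-by-element comparison: is each entry larger than the mean of x?
print(x > mean(x))
```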
## [1] 1.0 3.0 7.0 2.5 14.0 0.0 2.0 5.0
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
The vector of TRUE and FALSE values defined by x > mean(x) is called a logical vector. You can see that the vector takes value TRUE exactly where the statement is true (i.e., where x is in fact greater than mean(x)). Logical operations like this one are often convenient when you are trying to restrict your analyses to particular subsets.
To check if two values are equal, you use the == operator. Remember, we can’t check with =: that is the assignment operator! Mixing up == and = is a common error.
z = 2 # Assignment: Sets z equal to 2.
print(z == 1) # Logical check: is TRUE if and only if z is equal to 1.
## [1] FALSE
Sometimes, we need to combine multiple statements. & is the “and” operator, and | is the “or” operator. Here is an illustration of how these operators work.
Condition1 = c(TRUE, TRUE, FALSE, FALSE)
Condition2 = c(TRUE, FALSE, TRUE, FALSE)
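print(Condition1 & Condition2) # TRUE only where both conditions are TRUE
## [1] TRUE FALSE FALSE FALSE
print(Condition1 | Condition2) # TRUE where at least one condition is TRUE
## [1] TRUE TRUE TRUE FALSE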
To negate any logical vector, you can use the ! operator. When using R, I read ! as “not.” So !(x > mean(x)) says, “x NOT greater than mean(x).” The “not equals” operator is defined similarly. x != 0 reads “x NOT equal to 0.”
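A sketch of both operators, assuming x is the same vector used in the above-average example:

```r
x = c(1, 3, 7, 2.5, 14, 0, 2, 5)  # assumed values, as in the earlier example

print(!(x > mean(x)))  # TRUE where x is NOT greater than its mean
print(x != 0)          # TRUE where x is NOT equal to 0
```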
## [1] TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE
## [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
3.2.3 Character vectors
You will often have to work with non-numeric objects. These objects are called characters. You can specify character vectors just like you specify numeric vectors.
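A minimal sketch of a character vector, assuming four state names as the entries:

```r
# Each entry is wrapped in quotes, so R treats it as text
x = c('Alabama', 'Alaska', 'Arizona', 'Arkansas')
print(x)
```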
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
Note that I have enclosed the characters in single quotes ('). Double quotes (") will also work. The quotes matter: without them, R will think I am filling the vector with the values of objects that have the given names. Consider the following example.
Alabama = 30
Alaska = 40
Arizona = 50
Arkansas = 60
x = c('Alabama', 'Alaska', 'Arizona', 'Arkansas') # Assign characters
print(x)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
y = c(Alabama, Alaska, Arizona, Arkansas) # Assign the values of the objects
print(y)
## [1] 30 40 50 60
Note the difference between x and y. In the first case, I told R that I wanted it to use the characters for the state names by enclosing them in quotes. In the second case, I told R that I wanted it to use the values taken by the objects with the state names. I emphasize this difference because this type of mix-up – mistakenly enclosing or not enclosing something in quotes – is common among beginners.
3.2.4 Converting between data types
Sometimes, you might be working with data where quantitative variables are stored as characters. To convert these vectors to numbers, you can use the as.numeric function. Similarly, to go from a numeric variable to a character, use as.character.
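example_vec = c('0001','0045', '2460','0913')
#print(mean(example_vec)) Won't work since example_vec is a character vector
example_vec = as.numeric(example_vec) # Convert characters to numbers
print(example_vec)
## [1] 1 45 2460 913
print(mean(example_vec)) # Works now that example_vec is numeric
## [1] 854.75
print(as.character(example_vec)) # Convert back to characters
## [1] "1" "45" "2460" "913"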
It’s also common to convert between logical variables and numeric variables. TRUE maps to 1, while FALSE maps to 0.
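A sketch of both directions, assuming the base R conversion functions as.numeric() and as.logical():

```r
print(as.numeric(TRUE))  # logical to numeric: TRUE becomes 1
print(as.logical(0))     # numeric to logical: 0 becomes FALSE
```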
## [1] 1
## [1] FALSE
3.3 Packages
One of the main benefits of R is that it is an open source language. This means that R is free to use and that many programmers have contributed to the language by writing their own functions that perform specialized tasks. It is not the 1980s; we do not have to write our own new function anytime we want to implement (say) a fixed effects regression. Instead, we can use one of the packages for fixed effects regression in R that are optimized for speed and accuracy, such as fixest. If you are ever considering writing complicated code to perform some analysis, first check to make sure there is no package that already implements the analysis. If we have seen further, it is because we sit on the shoulders of giants.
To use a package, you have to install it and then load it into your session. One very popular package is dplyr. To install that package, I ran install.packages('dplyr') in the console.[4] You only have to run a line like this in the console once. Then, when you want to actually use the package, you tell R by running library(dplyr).
#install.packages('dplyr') # Install: run this line once EVER
library(dplyr) # Load: run this line anytime you need to use dplyr
Think of install.packages() as something that downloads the package onto your computer. Once you do it, it’s there forever. But then you have to “open” the package anytime you want to use it. Calling library() is like opening an application when you want to use it.
3.4 Aside: indentation and line endings
Unlike some other languages, R does not care about indentation. And within a function or parenthetical expression, R does not care about line endings. So this line of code
## [1] 3
works exactly the same as this line of code.
## [1] 3
Sometimes it can be worthwhile to write one command on multiple lines in the text editor, especially if a function is long or has many components.
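As an illustration, here is a sketch that uses a simple call to sum() as a stand-in (the names one_line and multi_line are hypothetical). The two assignments are equivalent, and each print() displays 3:

```r
# The whole command on one line
one_line = sum(1, 2)
print(one_line)

# The same command split across lines inside the parentheses;
# R ignores both the line break and the indentation
multi_line = sum(1,
                 2)
print(multi_line)
```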
3.5 Data frames
Now, let’s talk about the main type of object you’ll be working with: the data frame.
Data frames are similar to matrices: they have rows and columns. I typically think of a data frame’s rows as observations and its columns as variables.
3.5.1 Creating new dataframes
You can create new data frames from scratch with the function data.frame().
x = c('Terun', 'Zareens', 'Orens Hummus', 'Jack in the Box', 'Ettan', 'Sundance')
y = c(4.1, 4.4, 4, 2.5, 4.2, 3.9)
z = c(2, 2, 2, 1, 3, 3)
df = data.frame('name' = x, 'rating' = y, 'dollar_sign' = z)
# syntax: 'column_name' = column_values
We have just created a data.frame object called df. Let’s look at this object.
print(df)
## name rating dollar_sign
## 1 Terun 4.1 2
## 2 Zareens 4.4 2
## 3 Orens Hummus 4.0 2
## 4 Jack in the Box 2.5 1
## 5 Ettan 4.2 3
## 6 Sundance 3.9 3
This is real data from Yelp about six Palo Alto restaurants.[5] There are three variables.
- name: the name of the restaurant
- rating: the average Yelp rating
- dollar_sign: the number of dollar signs on Yelp (a measure of how pricey the restaurant is).
Each row corresponds to an observation in the data. In this case, an observation is a restaurant.
Importantly, name, rating, and dollar_sign are not objects separate from the data.frame. If we tried to print(rating), R would give us an error, because no object called rating exists – it is just a variable name within df. One way of accessing an object that lives inside of another R object (such as a variable in a data.frame) is to use the $ character.
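A sketch of the $ syntax. The data frame is re-created here only so that the snippet runs on its own:

```r
# Same restaurant data as above
df = data.frame('name' = c('Terun', 'Zareens', 'Orens Hummus',
                           'Jack in the Box', 'Ettan', 'Sundance'),
                'rating' = c(4.1, 4.4, 4, 2.5, 4.2, 3.9),
                'dollar_sign' = c(2, 2, 2, 1, 3, 3))

print(df$rating)        # the rating column, returned as a numeric vector
print(mean(df$rating))  # vector functions work on the extracted column
```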
## [1] 4.1 4.4 4.0 2.5 4.2 3.9
## [1] 3.85
3.5.2 Manipulating dataframes
There are two main ways to manipulate dataframes in R: the “base” way and the “dplyr”[6] way. The former does not rely on any separate R packages while the latter uses dplyr.[7] For the vast majority of applications, either way will work fine. In my own work, I use a hybrid style. I mainly use the dplyr way, but I sometimes toggle between the two if base seems more natural for a given task. Both this guide and the sample code for the E3 class use my personal style. As you learn more about R and programming, you will probably develop your own style – some of this is a matter of personal taste!
3.5.2.1 Defining new variables
To create a new variable in base R, define another variable in the data.frame with $ and tell R what values this variable should take. For example, maybe I want to create a standardized rating variable by subtracting the mean rating from the rating and dividing by its standard deviation.
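A sketch of this definition in base R, assuming the new column is called std_rating. The data frame is re-created so the snippet runs on its own:

```r
df = data.frame('name' = c('Terun', 'Zareens', 'Orens Hummus',
                           'Jack in the Box', 'Ettan', 'Sundance'),
                'rating' = c(4.1, 4.4, 4, 2.5, 4.2, 3.9),
                'dollar_sign' = c(2, 2, 2, 1, 3, 3))

# New column: distance from the mean rating, in standard deviation units
df$std_rating = (df$rating - mean(df$rating)) / sd(df$rating)
print(df)
```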
## name rating dollar_sign std_rating
## 1 Terun 4.1 2 0.36583190
## 2 Zareens 4.4 2 0.80483017
## 3 Orens Hummus 4.0 2 0.21949914
## 4 Jack in the Box 2.5 1 -1.97549224
## 5 Ettan 4.2 3 0.51216466
## 6 Sundance 3.9 3 0.07316638
You can see that df has a new column with our variable std_rating!
You can also manipulate existing variables in this way. Maybe you realized that you actually wanted a rating index with standard deviation 100.
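A sketch of overwriting an existing column, assuming std_rating was defined as described above. Multiplying by 100 rescales the standard deviation from 1 to 100:

```r
df = data.frame('name' = c('Terun', 'Zareens', 'Orens Hummus',
                           'Jack in the Box', 'Ettan', 'Sundance'),
                'rating' = c(4.1, 4.4, 4, 2.5, 4.2, 3.9),
                'dollar_sign' = c(2, 2, 2, 1, 3, 3))
df$std_rating = (df$rating - mean(df$rating)) / sd(df$rating)

# Same column name on both sides: the old values are replaced
df$std_rating = df$std_rating * 100
print(df)
```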
## name rating dollar_sign std_rating
## 1 Terun 4.1 2 36.583190
## 2 Zareens 4.4 2 80.483017
## 3 Orens Hummus 4.0 2 21.949914
## 4 Jack in the Box 2.5 1 -197.549224
## 5 Ettan 4.2 3 51.216466
## 6 Sundance 3.9 3 7.316638
3.5.2.2 Conditional manipulations (ifelse)
Sometimes, it is desirable to make a change if and only if a certain condition holds. For example, maybe we want a new variable affordability to equal 'Affordable' if dollar_sign is less than 3 and 'Expensive' otherwise. We can use R’s ifelse function to implement this definition.
ifelse() accepts three arguments. The first is the “condition,” which is a logical vector. The second and third arguments are the values to take where the condition is true or false, respectively.
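A sketch of the condition and the resulting ifelse() call. The name is_affordable is hypothetical, and the data frame is re-created so the snippet runs on its own:

```r
df = data.frame('name' = c('Terun', 'Zareens', 'Orens Hummus',
                           'Jack in the Box', 'Ettan', 'Sundance'),
                'rating' = c(4.1, 4.4, 4, 2.5, 4.2, 3.9),
                'dollar_sign' = c(2, 2, 2, 1, 3, 3))

# First argument: a logical vector, one entry per restaurant
is_affordable = df$dollar_sign < 3
print(is_affordable)

# Second and third arguments: value where TRUE, value where FALSE
print(ifelse(is_affordable, 'Affordable', 'Expensive'))
```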
## [1] TRUE TRUE TRUE TRUE FALSE FALSE
## [1] "Affordable" "Affordable" "Affordable" "Affordable" "Expensive" "Expensive"
We can define the affordability variable accordingly.
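A sketch of that definition in base R, combining $ with ifelse(). The data frame is re-created so the snippet runs on its own:

```r
df = data.frame('name' = c('Terun', 'Zareens', 'Orens Hummus',
                           'Jack in the Box', 'Ettan', 'Sundance'),
                'rating' = c(4.1, 4.4, 4, 2.5, 4.2, 3.9),
                'dollar_sign' = c(2, 2, 2, 1, 3, 3))

# Store the result of ifelse() as a new column of df
df$affordability = ifelse(df$dollar_sign < 3, 'Affordable', 'Expensive')
```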
3.5.3 Doing it in dplyr
dplyr provides an alternative syntax for data frames that some people (myself included) find more natural than base R. I only scratch the surface of its functionality here. For a more detailed guide that uses dplyr, check out R for Data Science.[8]
Let’s start by re-creating the original dataset and loading dplyr.
library(dplyr) # Make sure to install the package first!
df = data.frame('name' = x, 'rating' = y, 'dollar_sign' = z)
dplyr is organized around several functions called verbs. Each verb takes a data frame as its first input and returns a data frame as its output. The other inputs are instructions about how to modify the data from input to output.
Most verbs accomplish specific tasks, such as altering variables or summarizing data (I’ll cover the most important verbs in Chapter 5). Therefore, to solve complex problems, we often have to chain multiple verbs together. The pipe character %>% is useful in these cases. The pipe takes the thing on the left and passes it to the function on the right as its first argument. So, for example, x %>% f(y) is equivalent to f(x, y).
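A small sketch of that equivalence, using the base R function round() in the role of f. The names values, piped, and direct are hypothetical:

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

values = c(1.234, 5.678)

piped = values %>% round(1)  # the pipe inserts values as the first argument
direct = round(values, 1)    # the same call written without the pipe

print(identical(piped, direct))
```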
mutate is the dplyr verb that alters variables. Inside dplyr verbs, you don’t need to include quotes around variable names – since you include your data frame as an argument, dplyr “knows” that you’re talking about variables within the data frame. You can pass to mutate a series of instructions that define new columns as functions of old columns. It will execute each instruction in order.
mean_rating = mean(df$rating)
sd_rating = sd(df$rating)
df = df %>%
  mutate(rating_std = (rating - mean_rating)/sd_rating,
         rating_std = rating_std*100)
print(df)
## name rating dollar_sign rating_std
## 1 Terun 4.1 2 36.583190
## 2 Zareens 4.4 2 80.483017
## 3 Orens Hummus 4.0 2 21.949914
## 4 Jack in the Box 2.5 1 -197.549224
## 5 Ettan 4.2 3 51.216466
## 6 Sundance 3.9 3 7.316638
Apart from the column name (rating_std here, std_rating before), this is exactly equivalent to the result we produced with base R. Similarly, we can make the affordability variable with mutate.
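A sketch of the affordability definition with mutate. The data frame and the rating_std column are re-created so the snippet runs on its own:

```r
library(dplyr)

df = data.frame('name' = c('Terun', 'Zareens', 'Orens Hummus',
                           'Jack in the Box', 'Ettan', 'Sundance'),
                'rating' = c(4.1, 4.4, 4, 2.5, 4.2, 3.9),
                'dollar_sign' = c(2, 2, 2, 1, 3, 3))

df = df %>%
  mutate(rating_std = 100 * (rating - mean(rating)) / sd(rating),
         # Columns are referenced by name, without df$ or quotes
         affordability = ifelse(dollar_sign < 3, 'Affordable', 'Expensive'))
print(df)
```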
## name rating dollar_sign rating_std affordability
## 1 Terun 4.1 2 36.583190 Affordable
## 2 Zareens 4.4 2 80.483017 Affordable
## 3 Orens Hummus 4.0 2 21.949914 Affordable
## 4 Jack in the Box 2.5 1 -197.549224 Affordable
## 5 Ettan 4.2 3 51.216466 Expensive
## 6 Sundance 3.9 3 7.316638 Expensive
3.6 Using LLMs and other debugging tips
Coding can be frustrating, especially as a beginner. Unless you give exactly the correct instructions, the computer will break and get confused about what you’re trying to do. For example, you might have defined a vector x_Times_y somewhere. And then you try to call mean(x_times_y). R will return an error – there is no object called x_times_y. “Wait,” you say. “I defined that object earlier!” But R cares about capitalization – x_times_y is NOT the same object as x_Times_y. I have spent a lot of time looking at RStudio trying to catch errors like that one.
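A small sketch of this pitfall, using the base R function exists() to ask whether an object with a given name has been defined. The values in the vector are made up for illustration:

```r
x_Times_y = c(2, 4, 6)  # note the capital T

print(exists('x_Times_y'))  # checks the name exactly as it was defined
print(exists('x_times_y'))  # a lowercase t makes this a different name

print(mean(x_Times_y))      # works because the capitalization matches
```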
Don’t get down on yourself if you are spending a lot of time debugging early on. You will get better at this as you go. And you can use powerful tools to help you.
Large language models like ChatGPT and Claude are one of the best ways to make coding go more smoothly. In particular, they are amazing at debugging. LLMs have been trained on a gigantic corpus of text that included a lot of code. This means they are pretty good at recognizing code that looks strange. So if something’s not working, just describe your goal to an LLM, copy and paste your current (broken) code in there, and ask it to give you detailed suggestions about how to make the code accomplish the goal. This trick has already saved me hours of headaches.
Beyond debugging, LLMs can complete many coding tasks on their own if you describe them in precise language. This functionality is incredibly powerful. In my experience, though, their suggestions are most useful if you already have a working knowledge of how R works. I find that I am most productive when I am using my own knowledge to send the LLM in the right direction. If I try to “shut my brain off” and just let the LLM write all my code, it usually makes some errors, and I spend even more time trying to fix those errors. An understanding of R will allow you to understand when the LLM’s suggestions are appropriate, and when they are nonsensical.
Using LLMs inappropriately comes with significant risks. As of Fall 2024, it happens pretty often that when you ask the LLM for code that accomplishes a task, it gives you something that works on the first try. The human might not even understand how the analysis was conducted, but still be able to view the output. As I have emphasized, this is remarkable. However, at least two problems can arise.
- First, if you are not paying attention, the LLM may not conduct exactly the analysis that you want. Being wrong can have major consequences! For example, failing to include a crucial control in a regression can cause you to estimate a treatment effect that is different from the treatment effect you thought you were estimating. This kind of thing happens fairly frequently in my own work, as of Fall 2024. We definitely want to avoid this – that’s another reason to make sure you understand what your scripts are doing.
- Second, even if your analysis is right, it often matters a lot whether you are able to explain what you did. In corporate and government settings, the final decision-makers are typically not the people who conducted the analysis. You need to be able to explain the choices that you made in your analysis to others. “I told Claude to do the analysis and this is what it gave me” is not typically an acceptable answer!
- One way to avoid this pitfall is to iterate back and forth with the LLM. You can treat the LLM like a TA who will respond to your emails within 20 seconds. If it tells you to use some lines of code, you can ask it to explain what those lines are doing. You can even ask it to connect the lines of code to the economic question you’re trying to answer! In my experience, this is where LLMs can be most powerful in classroom settings.
For all of these reasons, I tentatively suggest that the ability to work with LLMs, general coding know-how, and sound economic reasoning may be complements rather than substitutes. Luckily for E3 students, you can learn all of the above in the course.
[3] R actually has multiple assignment operators. Sometimes you will see people using <- as the assignment operator. For example, x <- 3 does exactly the same thing as x = 3. Throughout this guide, I use = as the assignment operator, since this is consistent with other languages.
[4] Alternatively, you can manually install packages by using the Packages tab in the bottom right pane of RStudio. The Packages pane is sometimes nice because you can see which packages you have already installed and don’t need to download again.
[5] Manually collected in August 2023.
[6] I pronounce this “dip-ler”, for whatever it’s worth.
[7] I know that there is also a third way, which involves the data.table package, but I do not know how to use this package myself, so I will not cover it.
[8] The whole book is excellent. The dplyr stuff starts in chapter 4.