3 The Basics

In this section, I’ll introduce foundational concepts of programming in R, assuming zero experience with coding.

3.1 How coding works

A computer is a machine that follows instructions. In some obvious ways, computers are more powerful than the human brain: if you asked me to invert even a 10-by-10 matrix, I would have a lot of trouble, but a computer can do it almost instantly. However, at least for the time being, computers need us to give them guidance. When we write code, we are creating precise instructions that direct the computer’s power towards our goal.

R is a language that you can use to communicate with your computer. It is specifically designed to make it straightforward for users to manipulate and analyze data.

3.2 R objects

R operates on named objects. Almost everything you do in R is going to create a new object, manipulate an existing object, or print output from an existing object.

That’s all a bit vague, and it helps to be concrete when getting started. In this section, I’ll walk you through creating and working with some of the most important types of objects you’re likely to work with. For a more detailed explanation of R’s object types, I recommend An Introduction to R.

3.2.1 Scalars and numeric vectors

The simplest object you can create is a single number. We already did this once in the previous section. x = 3 creates a new object named x with value 3. In general, this is what the = operator does: it assigns the value on the right to the name on the left.3

x = 3
print(x)
## [1] 3

You can also modify existing objects. Write the object name on the left side of =, and then implement desired changes on the right.

x = x + 1
print(x)
## [1] 4

I call single numbers scalars. We can combine an arbitrary number of scalars into numeric vectors using the c() function.

y = c(1, 23, 34, 87.3, -34, 5*6, x/(9-3))
print(y)
## [1]   1.0000000  23.0000000  34.0000000  87.3000000 -34.0000000  30.0000000
## [7]   0.6666667

You can do lots of things to vectors. Here are some examples.

y_len = length(y) # length = number of elements
print(y_len)
## [1] 7
y_sum = sum(y) #sum of elements
print(y_sum)
## [1] 141.9667
y_max = max(y) #maximal element (can also find min)
print(y_max)
## [1] 87.3
y_mean = mean(y) #mean element
print(y_mean)
## [1] 20.28095

You can also perform arithmetic operations on vectors. Conveniently, you can use scalars in these expressions. So, for example, y+10 will add 10 to every element of y.

z = y*x # multiply each element of y by x 
print(z)
## [1]    4.000000   92.000000  136.000000  349.200000 -136.000000  120.000000
## [7]    2.666667
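
As a further illustration of vector arithmetic, here is a minimal sketch using made-up values (the names prices and quantities are hypothetical and do not appear elsewhere in this guide). It shows a scalar being applied to every element, and two vectors of the same length being combined element by element.

```r
# Hypothetical values, chosen only for illustration
prices = c(10, 20, 30)
quantities = c(1, 2, 3)

plus_ten = prices + 10         # the scalar 10 is added to every element
revenue = prices * quantities  # element-by-element product of two vectors

print(plus_ten)
## [1] 20 30 40
print(revenue)
## [1] 10 40 90
```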

3.2.2 Logical vectors

Suppose we wanted to study only above-average elements of a vector. How can we ask R which elements are above-average?

x = c(1,3,7,2.5,14,0,2,5)
print(x)
## [1]  1.0  3.0  7.0  2.5 14.0  0.0  2.0  5.0
print(x > mean(x))
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE

The vector of TRUE and FALSE values defined by x > mean(x) is called a logical vector. You can see that the vector takes value TRUE exactly where the statement is true (i.e, where x is in fact greater than mean(x)). Logical operations like this one are often convenient when you are trying to restrict your analyses to particular subsets.
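
One common way to use a logical vector for this kind of restriction is to place it inside square brackets after the vector name. Square-bracket subsetting has not been introduced in this guide, so treat the following as a short sketch. The names above_avg and x_above are made up for the example.

```r
x = c(1, 3, 7, 2.5, 14, 0, 2, 5)

above_avg = x > mean(x)  # logical vector: TRUE where x exceeds its mean
x_above = x[above_avg]   # keep only the elements where above_avg is TRUE

print(x_above)
## [1]  7 14  5
```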

To check if two values are equal, you use the == operator. Remember, we can’t check with =: that is the assignment operator! Mixing up == and = is a common error.

z = 2 # Assignment: Sets z equal to 2.
print(z == 1) # Logical check: is TRUE if and only if z is equal to 1.
## [1] FALSE

Sometimes, we need to combine multiple statements. & is the “and” operator, and | is the “or” operator. Here is an illustration of how these operators work.

Condition1 = c(TRUE, TRUE, FALSE, FALSE)
Condition2 = c(TRUE, FALSE, TRUE, FALSE)
print(Condition1 & Condition2)
## [1]  TRUE FALSE FALSE FALSE
print(Condition1 | Condition2)
## [1]  TRUE  TRUE  TRUE FALSE

To negate any logical vector, you can use the ! operator. When using R, I read ! as “not.” So !(x > mean(x)) says, “x NOT greater than mean(x).” The “not equals” operator is defined similarly. x != 0 reads “x NOT equal to 0.”

print( !(x > mean(x)) )
## [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
print( x != 0 )
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

3.2.3 Character vectors

You will often have to work with non-numeric objects. These objects are called characters. You can specify character vectors just like you specify numeric vectors.

x = c('Alabama', 'Alaska', 'Arizona', 'Arkansas') 
print(x)
## [1] "Alabama"  "Alaska"   "Arizona"  "Arkansas"

Note that I have enclosed the characters in single quotes '. Double quotes " will also work. The quotes are necessary: if I leave them out, R will think I am referring to objects with the given names and will fill the vector with those objects’ values. Consider the following example.

Alabama = 30
Alaska = 40 
Arizona = 50
Arkansas = 60
x = c('Alabama', 'Alaska', 'Arizona', 'Arkansas') # Assign characters
print(x)
## [1] "Alabama"  "Alaska"   "Arizona"  "Arkansas"
y = c(Alabama, Alaska, Arizona, Arkansas) # Assign objects
print(y)
## [1] 30 40 50 60

Note the difference between x and y. In the first case, I told R that I wanted it to use the characters for the state names by enclosing them in quotes. In the second case, I told R that I wanted it to use the values taken by the objects with the state names. I emphasize this difference because this type of mix-up – mistakenly enclosing or not enclosing something in quotes – is common among beginners.

3.2.4 Converting between data types

Sometimes, you might be working with data where quantitative variables are stored as characters. To convert these vectors to numbers, you can use the as.numeric function. Similarly, to go from a numeric variable to a character, use as.character.

example_vec = c('0001','0045', '2460','0913')
# print(mean(example_vec)) would return NA with a warning, since example_vec is a character vector
example_vec = as.numeric(example_vec)

print(example_vec)
## [1]    1   45 2460  913
print(mean(example_vec))
## [1] 854.75
example_vec = as.character(example_vec)
print(example_vec)
## [1] "1"    "45"   "2460" "913"
# note that the leading zeros are gone now! they got lost when we converted 
# to numeric earlier.
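
Two related situations come up often, sketched below with hypothetical values. First, if a character vector contains an entry that is not a number, as.numeric turns that entry into NA (a missing value) and issues a warning. Second, if you need the leading zeros back, one option is sprintf, which formats whole numbers to a fixed width. The names codes, codes_num, and padded are made up for this example.

```r
# Hypothetical data: the last entry is not a number
codes = c('0001', '0045', 'unknown')

# as.numeric() would normally warn here; suppressWarnings() hides the warning
codes_num = suppressWarnings(as.numeric(codes))
print(codes_num)
## [1]  1 45 NA

# na.rm = TRUE tells mean() to skip the missing value
print(mean(codes_num, na.rm = TRUE))
## [1] 23

# '%04d' pads an integer with zeros to a width of 4 characters
padded = sprintf('%04d', as.integer(c(1, 45, 2460, 913)))
print(padded)
## [1] "0001" "0045" "2460" "0913"
```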

It’s also common to convert between logical variables and numeric variables. TRUE maps to 1, while FALSE maps to 0.

print(as.numeric(TRUE))
## [1] 1
print(as.logical(0))
## [1] FALSE
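
Because TRUE counts as 1 and FALSE as 0, functions like sum and mean can be applied directly to a logical vector. A small sketch, reusing the numbers from the logical vectors section:

```r
x = c(1, 3, 7, 2.5, 14, 0, 2, 5)
above_avg = x > mean(x)  # logical vector

print(sum(above_avg))    # number of TRUE values
## [1] 3
print(mean(above_avg))   # share of TRUE values
## [1] 0.375
```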

3.3 Packages

One of the main benefits of R is that it is an open source language. This means that R is free to use and that many programmers have contributed to the language by writing their own functions that perform specialized tasks. It is not the 1980s; we do not have to write our own new function anytime we want to implement (say) a fixed effects regression. Instead, we can use one of the packages for fixed effects regression in R that are optimized for speed and accuracy, such as fixest. If you are ever considering writing complicated code to perform some analysis, first check to make sure there is no package that already implements the analysis. If we have seen further, it is because we sit on the shoulders of giants.

To use a package, you have to install it and then load it into your session. One very popular package is dplyr. To install that package, I ran install.packages('dplyr')4 in the console. You only have to install a package by running a line like this in the console once. Then, when you want to actually use the package, you tell R by running library(dplyr).

#install.packages('dplyr') # Install: run this line once EVER
library(dplyr) # Load: run this line anytime you need to use dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Think of install.packages() as something that downloads the package onto your computer. Once you do it, the package stays there. But then you have to “open” the package anytime you want to use it: library() is like opening an application when you want to use it.
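
Two base R tools are sometimes handy alongside library(); they are sketched below as an optional aside. requireNamespace() reports whether a package is installed without loading it, and the package::function syntax calls a function from one specific package. The example uses the stats package only because it ships with R; the name has_stats is made up. The masking message shown earlier means that, after loading dplyr, filter refers to dplyr’s version, while stats::filter should still reach the original.

```r
# TRUE if the package is installed; 'stats' ships with R
has_stats = requireNamespace('stats', quietly = TRUE)
print(has_stats)
## [1] TRUE

# package::function calls sd() from the stats package explicitly
print(stats::sd(c(1, 2, 3)))
## [1] 1
```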

3.4 Aside: indentation and line endings

Unlike some other languages, R does not care about indentation. And within a function or parenthetical expression, R does not care about line endings. So this line of code

mean(c(1,2,3,4,5))
## [1] 3

works exactly the same as this line of code.

mean(
  c(1,2,3,
    4,5)
)
## [1] 3

Sometimes it can be worthwhile to write one command on multiple lines in the text editor, especially if a function is long or has many components.
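
One caveat applies outside of parentheses: R treats a line as finished as soon as it forms a complete expression. The sketch below (with made-up numbers) shows the difference between starting the next line with an operator and ending the current line with one.

```r
# 'a = 1 + 2' is already complete, so R stops there.
a = 1 + 2
  + 10   # runs as a separate expression and is not stored in a
## [1] 10
print(a)
## [1] 3

# Ending the line with '+' leaves the expression incomplete,
# so R continues reading on the next line.
b = 1 + 2 +
  10
print(b)
## [1] 13
```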

3.5 Data frames

Now, let’s talk about the main type of object you’ll be working with: the data frame.

Data frames are similar to matrices: they have rows and columns. I typically think of a data frame’s rows as observations and its columns as variables.

3.5.1 Creating new dataframes

You can create new data frames from scratch with the function data.frame().

x = c('Terun', 'Zareens', 'Orens Hummus', 'Jack in the Box', 'Ettan', 'Sundance')
y = c(4.1, 4.4, 4, 2.5, 4.2, 3.9)
z = c(2, 2, 2, 1, 3, 3)


df = data.frame('name' = x, 'rating' = y, 'dollar_sign' = z)
# syntax: 'column_name' = column_values

We have just created a data.frame object called df. Let’s look at this object.

print(df)
##              name rating dollar_sign
## 1           Terun    4.1           2
## 2         Zareens    4.4           2
## 3    Orens Hummus    4.0           2
## 4 Jack in the Box    2.5           1
## 5           Ettan    4.2           3
## 6        Sundance    3.9           3

This is real data from Yelp about six Palo Alto restaurants.5 There are three variables.

  • name: the name of the restaurant
  • rating: the average Yelp rating
  • dollar_sign: the number of dollar signs on Yelp (a measure of how pricey the restaurant is).

Each row corresponds to an observation in the data. In this case, an observation is a restaurant.

Importantly, name, rating, and dollar_sign are not objects separate from the data.frame. If we tried to print(rating), R would give us an error, because no object called rating exists – it is just a variable name within df. One way of accessing an object that lives inside of another R object (such as a variable in a data.frame) is to use the $ character.

#print(rating) will return an error. So instead, we...
print(df$rating)
## [1] 4.1 4.4 4.0 2.5 4.2 3.9
print(mean(df$rating))
## [1] 3.85
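
A few other base R functions are useful for inspecting a data frame. The sketch below rebuilds a three-row subset of the restaurant data so that it runs on its own. The square-bracket forms are alternatives to $ that this guide does not otherwise cover.

```r
# A small subset of the restaurant data, so the example is self-contained
df = data.frame(
  name = c('Terun', 'Zareens', 'Ettan'),
  rating = c(4.1, 4.4, 4.2)
)

print(nrow(df))        # number of rows (observations)
## [1] 3
print(names(df))       # column (variable) names
## [1] "name"   "rating"
print(df[['rating']])  # same result as df$rating
## [1] 4.1 4.4 4.2
print(df$rating[2])    # second element of the rating column
## [1] 4.4
```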

3.5.2 Manipulating dataframes

There are two main ways to manipulate dataframes in R, which I’ll call the “base” way and the “dplyr”6 way because the former does not rely on any separate R packages while the latter uses dplyr.7 For the vast majority of applications, either way will work fine. In my own work, I use a hybrid style. I mainly use the dplyr way, but I sometimes toggle between the two if base seems more natural for a given task. Both this guide and the sample code for the E3 class use my personal style. As you learn more about R and programming, you will probably develop your own style – some of this is a matter of personal taste!

3.5.2.1 Defining new variables

To create a new variable in base R, define another variable in the data.frame with $ and tell R what values this variable should take. For example, maybe I want to create a standardized rating variable by subtracting the mean rating from the rating and dividing by its standard deviation.

df$std_rating = ( df$rating - mean(df$rating) ) / sd(df$rating)
print(df)
##              name rating dollar_sign  std_rating
## 1           Terun    4.1           2  0.36583190
## 2         Zareens    4.4           2  0.80483017
## 3    Orens Hummus    4.0           2  0.21949914
## 4 Jack in the Box    2.5           1 -1.97549224
## 5           Ettan    4.2           3  0.51216466
## 6        Sundance    3.9           3  0.07316638

You can see that df has a new column with our variable std_rating!

You can also manipulate existing variables in this way. Maybe you realized that you actually wanted a rating index with standard deviation 100.

df$std_rating = df$std_rating*100
print(df)
##              name rating dollar_sign  std_rating
## 1           Terun    4.1           2   36.583190
## 2         Zareens    4.4           2   80.483017
## 3    Orens Hummus    4.0           2   21.949914
## 4 Jack in the Box    2.5           1 -197.549224
## 5           Ettan    4.2           3   51.216466
## 6        Sundance    3.9           3    7.316638

3.5.2.2 Conditional manipulations (ifelse)

Sometimes, it is desirable to make a change if and only if a certain condition holds. For example, maybe we want a new variable affordability to equal 'Affordable' if dollar_sign is less than 3 and 'Expensive' otherwise. We can use R’s ifelse function to implement this definition.

ifelse() accepts three arguments. The first is the “condition,” which is a logical vector. Then, the second and third arguments are the values to take if the condition is true or false, respectively.

print(df$dollar_sign<3) #prints the condition
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
print(ifelse(df$dollar_sign<3, 'Affordable', 'Expensive')) #prints ifelse output
## [1] "Affordable" "Affordable" "Affordable" "Affordable" "Expensive" 
## [6] "Expensive"

We can define the affordability variable accordingly.

df$affordability = ifelse(df$dollar_sign<3, 'Affordable', 'Expensive')
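
The condition passed to ifelse can also combine several statements with & or |. As a sketch, suppose we wanted to flag restaurants that are both well rated and not expensive. The variable name good_deal and the rating cutoff of 4 are assumptions made up for this example.

```r
rating = c(4.1, 4.4, 4.0, 2.5, 4.2, 3.9)
dollar_sign = c(2, 2, 2, 1, 3, 3)

# 'Yes' only where BOTH conditions hold; 'No' everywhere else
good_deal = ifelse(rating >= 4 & dollar_sign < 3, 'Yes', 'No')
print(good_deal)
## [1] "Yes" "Yes" "Yes" "No"  "No"  "No"
```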

3.5.3 Doing it in dplyr

dplyr provides an alternative syntax for data frames that some people (myself included) find more natural than base R. I only scratch the surface of its functionality here. For a more detailed guide that uses dplyr, check out R for Data Science.8

Let’s start by re-creating the original dataset and loading dplyr.

library(dplyr) # Make sure to install the package first!
df = data.frame('name' = x, 'rating' = y, 'dollar_sign' = z)

dplyr is organized around several functions called verbs. Each verb takes a data frame as its first input and returns a data frame as its output. The other inputs are instructions about how to modify the data from input to output.

Most verbs accomplish specific tasks, such as altering variables or summarizing data (I’ll cover the most important verbs in Chapter 5). Therefore, to solve complex problems, we often have to use multiple verbs at once. The pipe character %>% is useful in these cases. The pipe takes the thing on the left and passes it to the function on the right as its first input. So, for example, x %>% f(y) is equivalent to f(x, y).
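
To make the pipe concrete, here is a small sketch using ordinary base R functions rather than dplyr verbs. It assumes dplyr is installed, since loading dplyr makes %>% available. The names nested, piped, and with_arg are made up for the example.

```r
library(dplyr)  # makes %>% available (assumes dplyr is installed)

v = c(9, 16)

nested = sqrt(sum(v))            # read from the inside out
piped = v %>% sum() %>% sqrt()   # read from left to right

print(nested)
## [1] 5
print(piped)
## [1] 5

# x %>% f(y) is f(x, y): here the vector becomes the first argument of mean()
with_arg = c(1, 2, NA) %>% mean(na.rm = TRUE)
print(with_arg)
## [1] 1.5
```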

mutate is the dplyr verb that alters variables. Inside dplyr verbs, you can refer to variables by their bare names, without quotes or the df$ prefix – since you include your data frame as an argument, dplyr “knows” that you’re talking about variables within the data frame. You can pass to mutate a series of instructions that define new columns as functions of old columns. It will execute each instruction in order.

mean_rating = mean(df$rating)
sd_rating = sd(df$rating)

df = df %>% 
  mutate(rating_std = (rating - mean_rating)/sd_rating,
         rating_std = rating_std*100)

print(df)
##              name rating dollar_sign  rating_std
## 1           Terun    4.1           2   36.583190
## 2         Zareens    4.4           2   80.483017
## 3    Orens Hummus    4.0           2   21.949914
## 4 Jack in the Box    2.5           1 -197.549224
## 5           Ettan    4.2           3   51.216466
## 6        Sundance    3.9           3    7.316638

These are the same values we produced with base R; only the column name differs (rating_std here, std_rating before). Similarly, we can make the affordability variable with mutate.

df = df %>% 
  mutate(affordability = ifelse(dollar_sign<3, 'Affordable', 'Expensive'))
print(df)
##              name rating dollar_sign  rating_std affordability
## 1           Terun    4.1           2   36.583190    Affordable
## 2         Zareens    4.4           2   80.483017    Affordable
## 3    Orens Hummus    4.0           2   21.949914    Affordable
## 4 Jack in the Box    2.5           1 -197.549224    Affordable
## 5           Ettan    4.2           3   51.216466     Expensive
## 6        Sundance    3.9           3    7.316638     Expensive

3.6 Debugging tips

Coding can be frustrating, especially as a beginner. Unless you give exactly the correct instructions, the computer will break and get confused about what you’re trying to do. For example, you might have defined a vector x_Times_y somewhere. And then you try to call mean(x_times_y). R will return an error – there is no object called x_times_y. “Wait,” you say. “I defined that object earlier!” But R cares about capitalization – x_times_y is NOT the same object as x_Times_y. I have spent a lot of time looking at RStudio trying to catch errors like that one.
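
When you suspect a naming mistake like this, one way to check is to ask R directly whether an object with that exact name exists. The sketch below uses the base R functions exists() and tryCatch(); the latter captures the error message so that the rest of the script keeps running. The name msg is made up for the example.

```r
x_Times_y = c(2, 4, 6)

print(exists('x_Times_y'))  # the name exactly as defined
## [1] TRUE
print(exists('x_times_y'))  # lowercase t: a different name
## [1] FALSE

# Capture the error message instead of stopping the script
msg = tryCatch(
  mean(x_times_y),
  error = function(e) conditionMessage(e)
)
print(msg)  # the message should mention x_times_y
```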

Don’t get down on yourself if you are spending a lot of time debugging early on. You will get better at this as you go. And there are a lot of resources to help you.

  • ChatGPT has been trained on a gigantic corpus of text that included a lot of code. This means it is pretty good at recognizing code that looks strange. So if something’s not working, just describe your goal to ChatGPT, copy and paste your current (broken) code in there, and ask it to give you detailed suggestions about how to make the code accomplish the goal. This trick has already saved me hours of headaches.
    • Beyond debugging, ChatGPT can complete many code tasks on its own if you describe them in precise language. In my experience, though, its suggestions are most useful if you already have a working knowledge of how R works. An understanding of R will allow you to understand when ChatGPT’s suggestions are appropriate, and when they are nonsensical.
    • There is a risk if you are not paying attention that ChatGPT will not conduct exactly the analysis that you want. Being wrong can have major consequences! For example, failing to include a crucial control in a regression can cause you to estimate a treatment effect that is different from the treatment effect you thought you were estimating. We definitely want to avoid this – that’s another reason to make sure you understand what your scripts are doing.
    • For these reasons, the ability to prompt AI effectively and general coding know-how may be complements rather than substitutes.
    • My workflow is to try something on my own, then ask ChatGPT when I’m stuck/out of ideas, and then try out its suggestions. I then iterate back and forth with the computer until I arrive at something that works. But you should figure out what works for you!
  • When you make a mistake, R returns an error message. Read the message to try to figure out both where the error occurred (which line of code) and how it might be solved (what does the error text say the problem is?).
    • If the error text is obscure or confusing, try googling the error. Many other people might have faced the same problem as you – there are thousands of Stack Exchange posts about common errors that can help.
  • If all else fails, you could try an old-school method: talk to a rubber duck.
    • Try to explain each step of your own code aloud, line by line, to an inanimate object, such as a rubber duck.
    • As you do this, you will often discover that something you’re saying doesn’t make sense. This might lead to you catching an error. It’s kind of like how you understand something better once you have to teach it to someone else.
    • If you tire of rubber ducks, your classmates can be good interlocutors, especially in a setting (like E3) where you are all trying to complete the same task.

  3. R actually has multiple assignment operators. Sometimes you will see people using <- as the assignment operator. For example, x <- 3 does exactly the same thing as x = 3. Throughout this guide, I use = as the assignment operator, since this is consistent with other languages.

  4. Alternatively, you can manually install packages by using the packages tab in the bottom right pane of RStudio. The packages pane is sometimes helpful because you can see which packages you have already installed and don’t need to download again.

  5. Manually collected in August 2023.

  6. I pronounce this “dip-ler”, for whatever it’s worth.

  7. I know that there is also a third way, which involves the data.table package, but I do not know how to use this package myself, so I will not cover it.

  8. The whole book is excellent. The dplyr stuff starts in chapter 4.