2 Introduction to R

This chapter introduces R, a programming language. In doing so, it provides an overview of R, instructions for downloading the software, and a brief introduction to its use.

2.1 What is R?

R is a programming language and environment for data analysis that is popular with researchers from many disciplines. R refers both to the computer program one runs and to the language one uses to work with data within that environment. R only speaks R, and so, like traveling to a foreign country, it is useful to learn the local language in order to communicate. You can yell at R in English as long as you want, but it can’t produce your data unless you ask correctly. Fortunately, R’s language is based on English and it wants to be as straightforward as possible.

2.2 Why Use R?

There are other statistical packages that similar research methods classes use, including Stata and SPSS. One of the greatest benefits of R is the price: free. Access to Stata for a one-semester class costs $45-125, and extended access costs more. R is open source software that anyone can use free of charge forever. That means whatever skills you learn you can continue to develop after the class ends.

Many people have access to Excel as a spreadsheet program through Microsoft Office, but R is faster and more flexible for data analysis. Excel is a drag-and-drop program that does not produce reproducible analysis. R, as a programming language, allows users to create a ‘script’ that the computer runs in order to output analysis. That means the script can be reusable, shareable, and iterative, which will have significant benefits if you continue with data analysis after the class. Luckily, R is a relatively straightforward introduction to programming.

2.3 Why learn to program?

Data analytics is a quickly growing field with many job possibilities. The skills you learn in this class, if more fully developed, can be applicable to any industry, from Google to banks to government to a lemonade stand.

Computer programming is a flexible skill that can help you to manage laborious processes. It can stimulate creative thinking, grow your problem solving capabilities, and teach persistence. All of that comes along with a valuable labor market skill.

Data Scientist has been called the sexiest job of the 21st Century.

Let’s give it a shot in this class, and see if it’s a skill you’d like to continue developing.

2.4 The R Programming environment

R can be downloaded from r-project.org.

There is a link on the left. You’ll need to select a ‘mirror’ to download from. Don’t worry too much about that: the code for R is housed at multiple locations around the world so that it’s always available even if one site gets knocked offline. Generally, you should download from the location that is closest to you, but I have never noticed a difference. For New Orleans, that’s either Oak Ridge, Tennessee or Dallas, Texas. Click the link and follow the directions for installing the program.

2.5 R Studio

The R program you just downloaded can fully operate on its own, but we want to download an additional integrated development environment in order to make using R a little more straightforward. R Studio uses the R language while organizing our data sets, scripts, and outputs in a more user-friendly format. Luckily, it’s free too and can be downloaded from rstudio.com. Click the link for R Studio Desktop and follow the prompts to install. Note: in order for R Studio to operate, R must also be installed on the computer. R can operate on its own, and you’re welcome to use it that way, but class examples will be shown using R Studio.

2.6 Getting started in R Studio

Let’s open R Studio and see what we have downloaded.

The program opens with 3 sections (or boxes) displayed, although there are four in total. If you click the small green button in the upper left, you can create several types of documents in R. Let’s open a script, which adds the fourth section.

The upper left quadrant is called the ‘script’, which is where we can write out code to be executed. You can enter code without writing it out first, but by writing it in the script we can preserve and reuse it. If you’re going to use a line of code multiple times, it’s good to have it written because then you can re-execute it without re-writing it. Because scripts can run to thousands of lines, it’s good to have everything written out so that it can be reviewed and checked. These are like the directions for a recipe we use in baking our data.

In order to execute code that you write in the script, you need to press the ‘run’ button in the upper-center of R Studio. That’ll send the code to be executed and provide output below.

The bottom left is the ‘output’ (R Studio labels this pane the Console). If you write the command 2+5 in your script and run it, that line of code and the result will appear in the output: 7. Any code you run will display itself processing in that section, and any statistics you produce will come at the end (like the answer to 2+5).

The upper right is called the environment, which is where data you have available to you in R will appear. You won’t see the data itself, but the environment gives you a record of everything that is available in R Studio for you at that moment. If you do want to see the actual data you have saved, you can type View() with the name of the data set inside the parentheses.
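As a minimal sketch of that idea, the lines below load the built-in cars data set (the same one used later in this chapter), list what is in the environment, and open the data in a viewer. The interactive() check is my own addition so that the sketch also runs outside R Studio; in R Studio you can simply type View(cars).

```r
# Load the built-in cars data set so it shows up in the environment
data(cars)

# List the names of everything currently available in the environment
ls()

# Open the data in a spreadsheet-style viewer (only works in an interactive session)
if (interactive()) View(cars)
```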

The bottom right section actually has a few different uses, but we can concentrate now on the graphical output. If we produce a plot or graph of our data, that is where it will appear once the code has executed.

The picture below shows all 4 sections in use. You can see the brief script I wrote, the output of that script, the data I’m using (cars) and the graph I’ve created.

That’s an introduction to R Studio.

Let’s quickly review some of the things we can do in R that are most useful. I would recommend creating a new script in R Studio and entering and running these commands as they are outlined. That applies to future chapters as well.

Reading a description of operations is a good start, but much of coding is muscle memory and takes practice at the syntax and structure of commands. As you enter the commands, try to tweak them and break them. Figure out what’s optional in what I’m writing and what’s necessary.

There is a tradition that you should first introduce yourself by saying “hello world”. In R, you can do that by writing it directly into the script and executing.
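One way to write that line, as a minimal sketch (typing the quoted text on its own without print() should display the same result):

```r
# Ask R to display the text; the quotation marks tell R this is text, not a command
print("hello world!")
```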

## [1] "hello world!"

Or you can save it by giving it a name. When you save something you create, or read data into R, R creates an object, which will appear under the environment on the top right of the screen.
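A minimal sketch of saving the text as an object; the name greeting is just an illustrative choice, and any name you like would work:

```r
# Save the text under the name greeting (a hypothetical name chosen for this example)
greeting <- "hello world!"

# Typing the object's name on its own line displays what is stored in it
greeting
```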

## [1] "hello world!"

When you try to execute each command by hitting run, make sure that you’ve highlighted all of the code you want to enter. You can run thousands of lines of code at any given time just by highlighting them. But unless you tell R which lines of code you want to run, it won’t do anything.

We can use R as a calculator by entering math equations:
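For instance, two calculations along these lines would give the results shown below. The second expression is my own assumption, picked only because it produces 65.5:

```r
# Addition
2 + 5

# Division (an assumed example; any expression that equals 65.5 would do)
131 / 2
```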

## [1] 7
## [1] 65.5

R is an object-oriented programming language, so we generally want to save what we do by assigning it a name. We can assign things with an arrow: <-
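A minimal sketch of assignment, assuming we want to keep the answer to 2+5; the name total is a hypothetical choice for this example:

```r
# The arrow stores the result of 2 + 5 in an object named total
total <- 2 + 5

# Running the name by itself shows the stored value
total
```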

## [1] 7

We can also store a list of values as a single object (R calls this a vector) by placing c in front of parentheses.
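As a sketch, the three numbers in the output below could be stored like this; the name values is an illustrative choice:

```r
# c() combines several values into one object
values <- c(1, 4, 6)

# Display the stored values
values
```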

## [1] 1 4 6

Better yet, we can use whole data sets with multiple columns and rows. R actually comes with a lot of data sets built into its software. We’ll use those data sets all semester to create examples. You can see the data sets that are available in R by writing data(), and you can load one of those data sets for use by writing data() with the name of the desired data set inside.
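A minimal sketch using the built-in cars data set. Using head() to limit the display to the first 10 rows is my assumption about how the listing below was produced:

```r
# Load the built-in cars data set into the environment
data(cars)

# Show the first 10 rows of the data set
head(cars, 10)
```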

##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17

We can see the data we have, and we can make graphs with it using plot().
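As a rough sketch, a scatterplot of the cars data could be made like this. The axis labels and title are my own additions, and the dollar sign picks a single column out of the data set:

```r
# Load the built-in cars data set
data(cars)

# Plot speed on the horizontal axis against stopping distance on the vertical axis
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Speed and stopping distance")
```

Writing plot(cars) on its own should also work, since the data set has only two columns.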

There are a lot of idiosyncrasies of using any programming language. The best way to read this book is to follow along with the examples. Each chapter will provide more practice, and this book is searchable if you ever need to go back and find an example of doing some operation.

The next chapter will introduce one of the many uses we’ll have for R: creating data-driven documents with Markdown.

2.7 Getting Help

One of my favorite things about R is how much information there is online to help someone with problems. If you feel stuck, googling “introduction to R Studio” will produce thousands of links, and if you want to search “how to create a plot in R” you’ll find lots of help from a really engaged community. If this is your first time opening R it’s probably overwhelming, but the best way to move forward is to practice. As the semester goes on, we will get more comfortable.

I hardly make it through a project without searching for an answer to something. And there are some commands I just haven’t memorized. There are a few on post-it notes stuck around my computer screen, and there are others I have to search every few weeks (“how to remove duplicate results”). The goal of learning R isn’t to immediately memorize every command; it’s to know what’s possible in R. And as you get more comfortable, more will become possible.