# Lesson 1 Getting started in R

In this very first lesson, you’ll learn how to:

• get R up and running on your computer.
• interact with R by typing statements into the console and seeing the results of those statements.
• define your own objects and use those objects in subsequent statements.
• create a new script and run some statements from that script.
• get help with R.

To begin using R, you’ll need to download two pieces of software:

1. Head to cloud.r-project.org to download and install the version of R appropriate for your computer, whether Windows, Mac, or Linux. Note that if you have a newer MacBook with an Apple M1 chip, make sure to install the appropriate version of R (labeled Apple silicon arm64 on the download page), and not the version designed for older Intel-based MacBooks.1
2. Head to www.rstudio.com and download RStudio. If prompted to choose your version, you’ll want “RStudio Desktop,” which is free and, like R, available for Windows, Mac, and Linux.

Why two programs? Well, R is the program that does most of the raw computational work, while RStudio is a graphical front end for R (called an “integrated development environment,” or IDE) that offers a lot of creature comforts. You’ll work entirely within RStudio; meanwhile, RStudio will interface with R behind the scenes, without your ever needing to open the R program itself. Loosely speaking, if you imagine that analyzing data is like driving a car, then RStudio is the steering wheel and the pedals, while R is the engine.2 That’s why you need to install R in order for RStudio to work: if you try to drive a car without an engine, you’re not going to get very far.

## 1.2 First steps

Go ahead and open up RStudio for the first time. On a Mac or Windows machine, this is as simple as finding where you’ve installed the RStudio app (not the R app) and double-clicking its icon. You should see something like this (with possible minor aesthetic differences, depending on your computer’s operating system):

There’s a lot here, and it might seem overwhelming at first. Here’s a brief summary of what you’re looking at:

• The left panel, with the blinking cursor and the > symbol, is the console. This is where R statements get evaluated and where the results of those statements get printed.
• The top right panel shows your workspace. In these lessons, you’ll mainly use this panel to import data sets. But this panel allows you to do some other handy things, too, like examine a history of prior statements that you’ve asked R to evaluate in the console.
• Finally, the bottom-right panel is mainly used for plots, packages, and help. We’ll see this in action soon.

### Interacting with R

For now, just focus on the left-hand panel (the console). More specifically, focus on the blinking cursor in the console, right next to the angle bracket (>). That > symbol is a prompt; it’s R’s way of saying, “Tell me what to do by typing statements right here.” Below I’ve circled it in red:

Where see you that prompt (>), type the following statement:

2+3

Then hit Enter to run the statement. R should print out 5 right underneath where you typed 2+3, and then it should give you a new prompt (> + blinking cursor) immediately below that. In other words, your screen should now look like this:

Congratulations! You’ve just run your first line of R code — in this case, a statement that tells R to do some simple arithmetic. In the console, R is telling you that the result of evaluating the statement 2+3 is 5.

You might be confused by the little [1] in front of the 5. That’s R’s way of telling you that it has only 1 number to report, and that this number is 5. This habit of R’s might initially seem like oversharing, but it will make a lot more sense later, when we start doing calculations that produce multiple numbers as outputs and we might want some handy way of referencing, say, the 104th number.

This example, while simple, illustrates the basic idea of interacting with R:

• You write statements (also known as commands or lines of code), and run those statements in the console.
• R evaluates those statements and does something—for example, printing out the result of a calculation or making a plot.

Some statements, like 2 + 3, are simple. Other statements, like log10(1000) + 4^2 - 15.85841, are a bit more complex. (Try that one in the console.)3 Soon, you’ll start to chain multiple statements together to form a program. But in the very beginning, learning R mostly means learning which statements do which things!

### How you’ll get feedback

Every lesson in this book depends on getting immediate feedback: I’ll show you what R statements you need to write in order to accomplish a given task, together with what you should expect to see as a result of executing those statements. In my experience, this is the best way learn R (or any programming language).

Above, I provided this feedback via screen shot. This worked OK. But screen shots aren’t really the best way to give you the feedback you need. Not only are they tedious for me, but they’re also inefficient for you: they show a bunch of extraneous information, with the relevant feedback buried in a corner somewhere, perhaps with a silly red circle drawn around it.

So from here on, we’ll adopt the following convention. Whenever you’re supposed to evaluate a statement (like 2+3) and to compare your result with mine, you’ll see it written in two boxes, like this:

2+3
## [1] 5

The first box shows what you’re supposed to type. Immediately below that, the second box shows what the result should be, whether it’s something printed to console (like we see here) or a plot (like you’ll see below). Note that you won’t actually see the little ## symbols printed in your console. Those ## symbols are there so that you can distinguish intended input (first box, no ##) from expected output (second box, with ##) at a quick glance.

Let’s see one more explicit example of this convention in action. We’ll add 1 and 4, and then multiply the result by 3:

3*(1+4)
## [1] 15

The two blocks above are telling you: 1) that you should type in 3*(1+4) and then hit Enter (first block); and 2) that you should see [1] 15 printed out to the console as a result (second block). Again, the [1] 15 means that R has 1 number to report as a result of what you asked it to do, and that this number is 15.

I’ll mostly stick with this convention for providing feedback, saving screen shots only for when they’re necessary.

### R as a calculator

You probably got the sense from the examples above you can treat R as a simple calculator. Indeed, R works exactly as you’d expect it to in this regard: it obeys the standard order of operations that you learned in grade school, and it also knows all the important “fancy” functions like logarithms and cosines and exponentials that you learned in high school. Here we’ll calculate the base-10 logarithm of 1000:

log10(1000)
## [1] 3

You can also treat R like a graphing calculator, using the curve function. Try, for example, typing in the following statement:

curve(x^2 - 3*x + 1, from=0, to=5)

This statement plots the curve $$f(x) = x^2 - 3x + 1$$ over the domain $$0 \leq x \leq 5$$. The three inputs to curve are called arguments:

• x^2 - 3*x + 1 is the curve you want to plot.
• from=0 and to=5 says that you want to start the curve at $$x=0$$ and end it at $$x=5$$.

When you evaluate the statement by hitting Enter, nothing gets printed to the console, but you should see a graph pop up automatically in the Plots tab, in the lower-right panel on your screen. (Displaying plots is one of the main uses of that lower-right panel, although we’ll cover a couple of other uses later.)

If you want to get some practice, try any of the following examples, or else just make up your own.

30/10
5^2
sqrt(9)
log(7.4)
exp(2)
curve(cos(x), from=0, to=4*pi)

Note that to R:

• the four basic arithmetic operations are +, -, *, and /.
• log means natural log, while log10 means the base-10 log.
• x^a means “raise x to the power a.”
• But if you want to raise the mathematical constant $$e = 2.718...$$ to some power $$a$$, use exp(a).
• For trigonometric functions (like cos, sin, tan, etc.), angles are assumed to be in radians.

### R is case sensitive

Statements in R are case-sensitive. If you use upper case where lower case is expected, or vice versa, you’ll get an error. So if, for example, you typed Curve rather than curve, you’d get an error telling you that there is no such thing as a function called Curve:

Curve(x^2 - 3*x + 1, from=0, to=5)
## Error in Curve(x^2 - 3 * x + 1, from = 0, to = 5): could not find function "Curve"

Be aware of this case-sensitivity. It’s a common source of coding errors for beginners.

## 1.3 Objects

We’ve seen that using R as a (graphing) calculator is pretty straightforward. But to do anything more interesting than basic arithmetic, we need to learn how to assign values to objects.4

In computer programming, an object is analogous to an envelope or a file folder. It’s a place where information can be stored, given a label, and accessed later. Creating objects helps us break complex tasks down into a series of simpler tasks.

Let’s create your first object. Run the following statement in the console:

foo = 3

Nothing gets printed to the console when you run this statement. But under the hood, R has created an object called foo and stored the value 3 in that object.

Let’s unpack the statement foo = 3 piece by piece. All assignment statements in R have the same basic structure:

• 3 is the value of the object. In our file-folder analogy, this is like the contents of the folder.
• foo is the name of the object. In our file-folder analogy, this is like the label on the folder. Here we called the object foo, but you can call objects in R pretty much anything you want, with a handful of exceptions.
• = is called the “assignment operator.” It tells R to assign the value on the right (3) to the object on the left (foo).

If we now ask R what foo is, it will tell us:

foo
## [1] 3

The reason we assign values to objects is so that we can use those objects in subsequent calculations. It’s like a souped-up version of the “memory” function on your calculator. To illustrate this, let’s create an object called bar that stores the results of the computation $$4 + \log_{10}(100)$$. Since $$\log_{10}(100) = 2$$, this calculation should give us 6.

bar = 4 + log10(100)

Again, nothing gets printed to the console when we create an object and assign it a value. But, remember, once we’ve created the object, we can type its name into the console, and R will tell us what value is stored there:

bar
## [1] 6

More importantly, now we can use this object in a subsequent computation, just like we’d use any number:

bar + 4
## [1] 10

Creating objects—that is, storing intermediate results in user-defined objects, and re-using those results in subsequent calculations—may not seem like a big deal now. After all, we’re just messing around with basic arithmetic. But as you’ll soon learn, the ability to create your own objects is a source of great power. That’s because it allows us to write a complex data analysis as a sequence of many smaller, simpler steps. And that’s the way we accomplish pretty much anything complex in life, whether it’s running a data analysis, auditing a bank, or building a house:

• accomplish each task in isolation, using the products of earlier tasks to help us with the next task.
• stitch the tasks together in the proper order to accomplish the overall goal.

We could summarize this “basic mantra of data science” as follows.

The basic mantra of data science: manage complexity by breaking complex tasks down into simple tasks, and then stitching the simple tasks together.

In the next section, we’ll learn how scripts can make this process a lot more manageable.

## 1.4 Scripts

In our previous examples, you learned to interact with R by typing statements like 2+3 or sqrt(9) directly into the console, where you see the > prompt. This works OK for simple statements, but it’s actually not the best way to interact with R—especially when we start chaining together complex statements to analyze data.

Instead, you should learn to work with scripts. A script is a file that collects multiple statements (i.e. lines of R code) in a single document, which always ends in a .R suffix.

Your basic workflow in RStudio should look like this.

• Create a script with the goal of solving some specific tasks.
• Write statements in your script, saving them for subsequent modification or re-use.
• Run those statements in the console to produce the desired output or behavior.
• View the results, usually either in the console or the Plots tab of RStudio’s lower-right panel.

There are many advantages to working with scripts, which we’ll discuss below. For now let’s focus on the “how” and “what” rather than the “why.”

### Creating and running scripts

To create a new R script, go to the File Menu and choose New File > R Script. Your screen should now look something like this:

Before there were three panels; now there are four. The bottom left panel is your old friend, the console. But now there’s a (new) top left panel called the code editor. RStudio’s code editor allows you to create, open, and edit R scripts. When you opened the File menu and chose New File > R script, you conjured into existence a new, blank script whose default name is probably something like Untitled1.R. In the lessons to follow, this panel is where you’ll do most of your actual work, by creating and editing scripts (.R files) that encode the steps in a data analysis.

Go ahead and type the following two statements in the new script you just created. Do not type them directly in the console. Put each statement on its own line in your script, and then save the script, giving it whatever name you want (e.g. my_first_script.R):

foo = 1 + 2
foo + 7

These statements: 1) create an object called foo that is assigned a specific value (i.e. 3, the result of 1 + 2); and then 2) add 7 to foo. We should get 10 as a result, right? But when you type these statements in your script, nothing actually happens. That’s because you need to run these statements in the console in order to get R to evaluate them.

The easiest way to run statements from your script is to highlight those statements with your mouse, and then use the keyboard shortcut Control-Enter (Command-Enter works too if you’re on a Mac). When you do this, you should see the statements themselves, followed by their result, appear in your console, like this:

foo = 1 + 2
foo + 7
## [1] 10

And that’s it—your first R script. From here on, when I give you R commands to run, you should first write them in a script, and then run them in the console. It might feel clunky at first, but you should practice this way of doing things, because it will indispensable when things get more complex.

Two further notes on running statements from a script:

• If you just want to run a single line from your script, you don’t have to highlight it. You can just click anywhere on that line and then hit Control-Enter.
• You can also run statements (either individually or as a block) using the Run button at the top of the code editor. I personally find this less user-friendly, but your mileage may vary.

### A slightly more interesting script

OK, so that first script was a bit simple. Below, I’ve given you one that’s slightly more interesting. You can safely include the little notes after the # symbol, which are called “comments.” R ignores anything in a script the right of the # symbol, allowing you to write little explanations (for yourself or others) about what your code is doing.

# Load one of R's built-in data sets about cars
data(mtcars)

# Fit a straight line for mpg vs hp and plot the result.
mpg_model = lm(mpg ~ hp, data=mtcars)
plot(mtcars$hp, mtcars$mpg)
abline(mpg_model)
coef(mpg_model)

Create a new blank script, and then type5 this entire chunk of R code, word for word, into your script. Don’t worry too much about the details of the individual statements; we’ll cover those in later lessons.

Now highlight everything in the script, and hit Control-Enter to run it. You should see the plot below:

And you should also see the following information printed in your console:

## (Intercept)          hp
## 30.09886054 -0.06822828

Each dot in the plot represents a car. The x coordinate of the dot represents the horsepower (hp) of that car’s engine. The y coordinate represents that car’s gas mileage (mpg). The line you see is the result of fitting a linear regression model to estimate the systematic relationship between mileage and horsepower. The two numbers printed in your console tell you the $$y$$-intercept and slope of the trend line. Unsurprisingly, cars with more powerful engines tend to get lower gas mileage (hence the negative slope).

This code chunk exhibits two important ideas we’ve covered: 1) the use of = to assign values to objects; and 2) the re-use of those objects in subsequent statements. For example, the statement mpg_model = lm(mpg ~ hp, data=mtcars) fits a straight line to mpg versus hp, storing the result in an object that I called mpg_model. This mpg_model object is then re-used in two subsequent statements, abline(mpg_model) and coef(mpg_model), which draw the line through the point cloud and print the intercept/slope to the console, respectively.

So there you have it: you’ve run and visualized your first data analysis in R! I hope you’re beginning to get a feel for the basic RStudio workflow. You create a script organized around some specific task. Then you:

• write statements in your script, saving them for subsequent modification or re-use.
• run those statements in the console to produce the desired output or behavior.
• view the results, usually either in the console or the Plots panel

In more complex data analyses, you will typically iterate these three steps, gradually building up complexity until you’ve accomplished what you set out to do.

Organizing your data analyses around scripts is probably the single most important “best practice” of using R. It’s perfectly fine to type statements directly into the console every now and again, especially if you’re in more of an “exploratory” mode. But if you find yourself doing this repeatedly, especially with complex statements that build on previous statements, you should probably stop and ask yourself: “Would I be better off writing these statements in a script instead, so that I can save, re-use, and modify them later?” Usually the answer is yes!

### Why can’t I just point and click?

This way of interacting with R—writing statements in a script and then running those statements in the console—is called a “command-line interface” or a “REPL” (pronounced “reeple,” for Read-Evaluate-Print Loop). It may seem unfamiliar or even intimidating at first. It’s certainly different from popular programs you might be used to, where you do a lot of pointing and clicking. To R beginners, the command-line interface can even feel like a step backward in time—sort of like you’re interacting with a “dumb” computer from the 1970s rather than something “smart” from the 21st century, with a mouse or a touch screen.

“Why do I have to type literally everything?” you might wonder. “Why can’t I click on menus and buttons that do this stuff for me?”

I understand this reaction. But let me try to convince you that the command-line interface is actually a huge advantage for doing data science! It’s true that the learning curve is steeper than with more familiar “point and click” software packages. But with R, the results of running a complex analysis don’t require that you remember a long, detailed succession of clicks and menu options. Instead, those results rely upon a series of written commands that do exactly what they say. So, for example, if you want to re-run your analysis on a new data set—perhaps because you collected some more data—you don’t need have to remember which buttons you clicked or which menu options you chose in order to get your new results. You just have to load the new data set and re-run your script from beginning to end—possibly tweaking it here and there, as required.

Here are five other advantages to this way of interacting with R, via scripts and the console:

1. Scripts make it simple to save your work and pick up where you left off, without having to remember what you’ve accomplished already.
2. Scripts make it easy to modify a complex analysis by adding or changing steps in the middle of a long chain of statements. (No Control-Z required!)
3. Scripts make your analysis shareable: just save the .R file and post it to a site like GitHub.
4. Scripts make your analysis reproducible, since anyone—including a future version of yourself—can read the script and see/repeat exactly what you’ve done.
5. Scripts make it much easier to diagnose and correct errors in your analysis, because—unlike, say, in other widely used data-analysis programs—an error always arises from some specific statement that’s written down in black and white for anyone to see.6

As I hope you’ll come to appreciate over the lessons to come, these advantages far outweigh the familiar comforts of mice and menus.

## 1.5 Getting help

Everyone needs help with R at some point or another. In fact, the more you use R, the more you’ll find yourself relying on various help resources. In my life as a teacher and researcher, I look for R help all the time, and you will, too.

1. Search the web. There’s a huge and very active community of R users out there, and many of them like to post about their R problems, or their solutions to other people’s R problems, online. If in doubt, just Google your question, e.g. “How do I find the maximum of two numbers in R?” Chances are very, very good that your question has been asked and answered before.

2. Search R’s own help files. For example, suppose you wanted to figure out how to calculate the median of a bunch of numbers in R. You could go to the Help tab in the lower-right panel, and type median into the search bar. You’d see several options pop up, one of which is the function called median. Click on it, and the help page for that function will be displayed. It will show you how to use the function, and even give you examples at the bottom of the help page.

3. If you want help on a specific function—say, for example, the function log10—you can type a question mark, followed by the function’s name, into the console. Try, for example, running this statement:

?log10

Personally, I use option 1 about 70% of the time, option 3 about 30% of the time, and option 2 about 0% of the time (since all of R’s help files are on the web anyway, and will come up in a web search if they’re useful).

Now that we’ve covered the basics of R scripts and getting help, here’s my advice for how best to use this book to learn R.

1. Type out my commands into your own script. Don’t just copy/paste. Copying and pasting is lazy, and you’ll never build muscle memory that way, any more than you can learn to ride a bike by watching someone else ride a bike.
2. Execute the commands and compare your results to mine. (Remember our convention on How you’ll get feedback from above.)
3. Be an active learner. Once you start to learn a bit of R, try executing variations on the commands I’ve given to see what you get. Trying articulating out loud in your own words what each command is doing and why it produces the behavior it does. This is a great way to build familiarity with the R environment.

Invariably, I have students in my large data science classes at UT who think they understand R on the basis of these lessons, but then do poorly on quizzes and tests. When I talk with them about their study habits, more often than thought it turns out that they’ve been plowing through these walk-throughs, blindly copying/pasting commands and hitting command enter at high speed, without dwelling on why the commands do what they do, and without absorbing the underlying ideas.

You cannot expect to just “copy/paste/command-enter” your way through these walk-throughs and come away with an understanding of R or data science. What’s true of learning R is true of learning pretty much anything: becoming an independent learner means actively engaging in the material.

## 1.6 Libraries

R has an enormous ecosystem of libraries, ranging from the simple to the very sophisticated. A library is a piece of software that provides additional functionality to R, beyond what’s contained in the basic R installation. If R is like a smart phone, then a library is like an app for the phone. And just like a phone app, a library is something you need to install only once, but load each time you want to use it.

### Installing a library

Here we’ll install two libraries that we’ll use a lot in the lessons to follow: tidyverse and mosaic.

The first minute of this video gives a walk-through of how to install a library. (It’s with an older version of RStudio, though the process is still the same). But we’ll explain the steps here, too. Conveniently, libraries, also called “packages,” are installed from within RStudio itself.

Here are the steps to install tidyverse. The same process works for any library:

1. In the lower right panel of RStudio, you’ll see a tab called Packages. Click on it.
2. Under the Packages tab, you’ll see a button at the top left of the panel called Install. Click on it.
3. In the window that pops up, type in the name of the package you want to install: tidyverse. After a few letters, RStudio will start to auto-suggest options for you. Just keep typing until you see tidyverse as the only option. Then either click on it or hit Tab to auto-complete the full name of the library.
4. Click the Install button.

In response, R should print out a very long and seemingly ponderous “progress report” into the console. This “progress report” provides all kinds of interesting detail for R super-users, but it isn’t all that helpful if you’re a beginner—and because there’s just so much darn red text, it can even seem intimidating! But not to worry. R is just telling you, in its own long-winded way, that it’s downloading and installing a bunch of tidyverse-related files behind the scenes.7 Eventually it should stop with some kind of DONE message in the console and give you another prompt (>), at which point the library is installed and ready to be used.

Then repeat the whole process to install mosaic. If you got an error, see Dealing with installation errors, below.

Let’s practice loading the tidyverse library, which we’ll lean on heavily in most of these lessons. In R, you load a library using the library command, like this:

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::lag()    masks stats::lag()

If you see a similar set of messages to what’s shown above (possibly with some more rows called “Conflicts”), you’re good to go! Feel free to move on to the next lesson.

But if you haven’t installed the tidyverse library, executing this command will give you an error like this:

Error in library(tidyverse) : there is no package called ‘tidyverse’

To avoid the error, you’ll first need to install tidyverse like we covered above.

### Dealing with installation errors

R might issue a warning or tell you that it had a conflict when you installed these libraries. Despite sounding scary, these notices are almost certainly innocuous and can be safely ignored.

On the other hand, if R tells you that it had an error when installing, you’ll need to address that error before you move on. Luckily, most library installation errors are a result of having an old version of R, and therefore fixable with two simple steps:

Usually, it’s that simple.

If that doesn’t fix the problem, however, your next step is to try Googling the error message. (If you’re taking my class and show me the installation error message, that is precisely what I will do, unless it happens to be some particular error I’ve seen before.) Installation errors like this are rare, but they can happen, and they’re almost always due to some quirk of your particular computer (and therefore difficult to generalize about). The good news is whatever error you’re experiencing is almost surely not unprecedented in the history of R. In fact, the chances are good that someone, somewhere, has figured out what the error means and how to fix it.

In my large classes at UT-Austin, the most common (but still quite rare) library installation error I’ve seen tends to occur on Windows machines, and it looks something like the following:

The downloaded source packages are in
Updating HTML index of packages in '.Library'
Warning in install.packages :
cannot create file
'/opt/POC/lib64/Revo-7.3/R-3.1.1/lib64/R/doc/html/packages.html', reason
'Permission denied'

Yikes! Basically, what’s happening here is that you don’t have permission to write files to the directory where R wants to write files. (Hence 'Permission denied'!) This can happen if your computer is actually owned by someone else, e.g. your employer, but I’ve also seen it happen out of the blue to students who own their computers fair and square.

Fixing the error depends on which version of Windows you have, and so your best bet is to Google something like “Give permissions to files and folders in Windows” and follow whatever steps suggested by the collective wisdom of humanity.

1. If you need to check what kind of chip your Mac has, choose “About this Mac” from the Apple menu in the upper left corner of your screen. You’ll see an entry for “Processor” that will tell you whether you’ve got an Intel chip or an Apple M1 chip inside your machine.↩︎

2. This car analogy isn’t quite right. Strictly speaking, RStudio is optional, and you could opt to use base R (i.e. only the “engine”). But RStudio is both free and great. I highly recommend using it; it’s really helpful for beginners, but it also has many wonderful features that cater to advanced users and that you might explore as your R skills mature. From here on out, I’ll assume you’re using RStudio rather than base R. On a practical level, this means that you’ll start your R sessions by double-clicking the RStudio icon on your computer, rather than the R icon.↩︎

3. You can probably guess what this statement does, but to be super explicit: it takes the base-10 logarithm of 1000, to that adds the quantity $$4^2$$ (i.e. $$4 \cdot 4$$), and from that subtracts 15.85841. You might recognize the result.↩︎

4. Objects are sometimes called variables in computer programming. But we’ll use the term object to avoid confusion with the statistical concept of a variable.↩︎

5. You could just copy-paste, but I recommend actually typing it, to build “muscle memory” for what these commands are doing.↩︎

6. Rather than, for example, hidden in a formula associated with some specific cell of a spreadsheet.↩︎

7. tidyverse is actually a suite of individual libraries, kind of like Microsoft Office is a suite of individual programs. So it make take a minute or two to install.↩︎