# 1 Data in R

In this chapter, we give some information about how to use R and about common data types. We begin by showing how to do basic arithmetic and assign values to variables.

## 1.1 Arithmetic and Variable Assignment

R can be used as a calculator. Here are a few examples:

`1+1`

`## [1] 2`

`3*7`

`## [1] 21`

`7/3`

`## [1] 2.333333`

`exp(2)`

`## [1] 7.389056`

`2^3`

`## [1] 8`

`log(7)`

`## [1] 1.94591`

`log(7, base = 10)`

`## [1] 0.845098`

`sqrt(2)`

`## [1] 1.414214`

`2^(1/2)`

`## [1] 1.414214`

`pi/2`

`## [1] 1.570796`

A couple of things to note are that `pi` is a variable that contains the value \(\pi\), and that `log` computes the **natural** logarithm (base \(e\)) by default.
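As a quick check of the note about `log`, the following sketch compares the default natural logarithm with other bases. The shortcuts `log10` and `log2` are base R functions that are not otherwise used in this chapter.

```r
# log() with no base argument is the natural logarithm, so it undoes exp()
log(exp(1))           # 1

# Other bases are requested with the base argument
log(100, base = 10)   # 2

# log10() and log2() are shortcuts for two common bases
log10(100)            # 2
log2(8)               # 3
```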

You can’t get very far without storing the results of your computations in variables! The way to do so is with the `<-` operator, as shown below.

```
height <- 62 #in inches
height <- height + 2
height <- 3*height
```

Note that `Alt` + `-` is the keyboard shortcut for `<-` when working on the command line in RStudio. (That means that `<-` is one keystroke less than `=`!)

If you want to see what value is stored in a variable, you can

- type the variable name:

  `height`

  `## [1] 192`

- look in the Environment pane in the upper right-hand corner of RStudio.

- use the `str` command. This command gives other useful information about the variable, in addition to its value.

  `str(height)`

  `## num 192`

This says that `height` contains *num*eric data, and its current value is 192 (which is 3 × (62 + 2)). Note that there is a big difference between typing `height + 2` (which computes the value of `height + 2` and displays it on the screen) and typing `height <- height + 2`, which computes the value of `height + 2` *and stores the new value back in height*.
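The difference between displaying a result and storing it can be seen in a short sketch. It restarts `height` at 62 so that the example is self-contained.

```r
height <- 62           # start from a known value
height + 2             # displays 64, but height itself is unchanged
height                 # still 62
height <- height + 2   # displays nothing; the new value is stored
height                 # now 64
```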

It is important to choose your variable names wisely. Variables in R *cannot* start with a number, and they *should* not start with a period. Do not use `c`, `q`, `t`, `C`, `D`, `F`, `I`, `T` as variable names, as they are already defined. It is also a *terrible* idea (and one of the most frustrating things to debug) to use `sum`, `mean`, or other commonly used functions as variable names. `T` and `F` are variables with default values `TRUE` and `FALSE`, which can be changed. I recommend writing out `TRUE` and `FALSE` rather than using the shortcuts `T` and `F` for this reason.
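To see why the shortcuts are risky, here is a minimal sketch in which `T` is overwritten and then restored. It is meant only as an illustration; the `rm(T)` line removes the copy created here so that the built-in value is visible again.

```r
T <- 0       # legal: T is an ordinary variable, so this hides the built-in value
isTRUE(T)    # FALSE, which would silently break any code that relies on T
rm(T)        # delete our copy; the built-in T is found again
isTRUE(T)    # TRUE

# TRUE is a reserved word, so the corresponding assignment is an error:
# TRUE <- 0
```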

If you have a longish variable name in your environment, then you can use the **tab** key to autocomplete. Many variable names that you run across will contain periods to indicate some hierarchical structure.

## 1.2 Vectors

Data often takes the form of lists of values, rather than single values. We need to be able to store lists of values in order to be able to work with them. For this, we use *vectors*. A vector is a list of values, usually of length bigger than one. (Formally, a `list` is a different data type. Here, I just mean a finite sequence of values of the same type.)

There are many ways to create vectors. Perhaps the easiest is:

`c(2,3,5,7,11)`

`## [1] 2 3 5 7 11`

This vector is the list of the first 5 prime numbers. We can store vectors in variables just like we did with numbers:

`primes <- c(2,3,5,7,11)`

Another way is with the `scan()` command, which is interactive. Try `primes <- scan()` and type in a few primes. When you want to be done, just hit `return` one more time. Using `scan()` is fine when doing some work on your own, but it is not *reproducible*, in the sense that you can’t re-run your code and re-create the same output.

You can create a vector of consecutive integers using the `:` operator:

`1:10`

`## [1] 1 2 3 4 5 6 7 8 9 10`

The `rep()` function creates a vector by repeating values a given number of times: `rep(x, times)`. For example,

`rep(2,3)`

`## [1] 2 2 2`

`rep(1:3,4)`

`## [1] 1 2 3 1 2 3 1 2 3 1 2 3`

`rep(c(2,4),5)`

`## [1] 2 4 2 4 2 4 2 4 2 4`
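The examples above repeat the whole vector. As a small aside, `rep` also accepts an `each` argument (not used elsewhere in this chapter) that repeats every element in turn; the sketch below contrasts the two.

```r
rep(1:3, times = 2)   # 1 2 3 1 2 3 : the whole vector, twice
rep(1:3, each = 2)    # 1 1 2 2 3 3 : each element, twice
```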

Most of the operations in R work well with vectors. Suppose you wanted to see what the square roots of the first 5 primes were. You might guess:

`primes^(1/2)`

`## [1] 1.414214 1.732051 2.236068 2.645751 3.316625`

and you would be right! Guess what would happen when you type `primes + primes`, `primes * primes` and `primes & primes`. Were you right?
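The same element-by-element behavior can be tried on other vectors. The sketch below uses two made-up vectors, `a` and `b`, which are not part of the text.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b      # 11 22 33 : elements are added position by position
a * b      # 10 40 90 : likewise for multiplication
a * 2      # 2 4 6    : the single value 2 is reused for every element
sqrt(b)    # functions such as sqrt() are also applied to each element
```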

## 1.3 Indexing Vectors

To examine or use a single element in a vector, you need to supply its *index*. `primes[1]` is the first element in the list of primes, `primes[2]` is the second, and so on.

`primes[1]`

`## [1] 2`

`primes[2]`

`## [1] 3`

You can do many things with indexes:

`primes[1:3]`

`## [1] 2 3 5`

`primes[-1]`

`## [1] 3 5 7 11`

`primes[c(1,5)]`

`## [1] 2 11`

`primes[c(T,F,T,F,T)]`

`## [1] 2 5 11`

`primes > 6`

`## [1] FALSE FALSE FALSE TRUE TRUE`

`primes[primes > 6]`

`## [1] 7 11`
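A few further variations on these indexing ideas are sketched below. The vector `primes` is redefined so that the block runs on its own; `which` returns the positions at which a condition holds, and `&` combines two conditions.

```r
primes <- c(2, 3, 5, 7, 11)

which(primes > 6)                  # 4 5 : positions, not values
primes[primes > 2 & primes < 11]   # 3 5 7
primes[-c(1, 2)]                   # 5 7 11 : drop the first two elements
sum(primes > 6)                    # 2 : TRUE counts as 1 when summed
```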

R comes with many built-in data sets. For example, the `discoveries` dataset is a vector containing the number of “great” inventions and scientific discoveries in each year from 1860 to 1959. Try `?discoveries` to see more information about the `discoveries` dataset.

`discoveries`

```
## Time Series:
## Start = 1860
## End = 1959
## Frequency = 1
## [1] 5 3 0 2 0 3 2 3 6 1 2 1 2 1 3 3 3 5 2 4 4 0 2
## [24] 3 7 12 3 10 9 2 3 7 7 2 3 3 6 2 4 3 5 2 2 4 0 4
## [47] 2 5 2 3 3 6 5 8 3 6 6 0 5 2 2 2 6 3 4 4 2 2 4
## [70] 7 5 3 3 0 2 2 2 1 3 4 2 2 1 1 1 2 1 4 4 3 2 1
## [93] 4 1 1 1 0 0 2 0
```

For larger data sets, this book will often display only the first few cases, using `head`:

`head(discoveries)`

`## [1] 5 3 0 2 0 3`

Here are a few more things you can do with a vector:

`table(discoveries)`

```
## discoveries
## 0 1 2 3 4 5 6 7 8 9 10 12
## 9 12 26 20 12 7 6 4 1 1 1 1
```

`max(discoveries)`

`## [1] 12`

`sum(discoveries)`

`## [1] 310`

`discoveries[discoveries > 5]`

`## [1] 6 7 12 10 9 7 7 6 6 8 6 6 6 7`

`which(discoveries > 5) + 1859`

`## [1] 1868 1884 1885 1887 1888 1891 1892 1896 1911 1913 1915 1916 1922 1929`
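In the last command, adding 1859 converts positions into years, because position 1 corresponds to 1860. One way to make that step more explicit, sketched here with a helper vector `years` that is not part of the data set, is to index a vector of years directly.

```r
years <- 1860:1959                     # one year per entry of discoveries
length(years) == length(discoveries)   # TRUE : the two vectors line up

years[discoveries > 5]                 # same years as which(discoveries > 5) + 1859
years[which.max(discoveries)]          # 1885 : the year with 12 discoveries
```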

## 1.4 Data Types

There are several types of data that R understands. Data can be stored as `numeric`, `integer`, `character`, `factor` and `logical` data types, among others. There are *lots* of issues and special cases that come up when dealing with these data types, and we do not plan to go over all of them here.

`numeric` data is numerical data, including all real numbers. If you type `x <- 2`, then `x` will be stored as `numeric` data. (You can test this by typing `str(x)`.)

`integer` data is data that is integers! If you type `x <- 2L`, then `x` will be stored as an integer. (Why L? I don’t know.) When reading data in from files, R will store data that is all integer as an integer, unlike when you enter data in like `x <- 2`. Again, `str()` is your best friend here.
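The distinction between `numeric` and `integer` storage can be checked directly. This sketch uses `class` and `is.integer`, two base R functions not discussed above, alongside the `str` approach recommended in the text.

```r
x <- 2
y <- 2L

class(x)         # "numeric"
class(y)         # "integer"
is.integer(x)    # FALSE, even though 2 is a whole number
x == y           # TRUE : the values are equal, only the storage differs
class(1:3)       # "integer" : the : operator produces integers
class(y + 0.5)   # "numeric" : mixing the two gives numeric
```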

`character` data is what many languages call strings. If you type `x <- "hello"`, then `x` is a `character` variable: a vector of length one whose single element is the whole string. Compare `str("hello")` to `str(c(1,2))`. Note that if you want to access the `e` from `hello`, you **cannot** use `x[2]`. If you find yourself in the situation where you need to manipulate strings, I recommend using the `stringr` package. We will not be doing much with strings in this book.
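The following sketch shows why `x[2]` does not return a letter. It uses the base R functions `nchar` and `substr` for illustration; the `stringr` package recommended above offers its own tools for the same job.

```r
x <- "hello"

length(x)         # 1  : one string, not five letters
x[2]              # NA : there is no second element
nchar(x)          # 5  : number of characters in the string
substr(x, 2, 2)   # "e": characters from position 2 to position 2
```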

`logical` data takes the values `TRUE` and `FALSE`.

`factor` data is common in statistics, but maybe not so commonly implemented and used in other programming languages. `factor` data can take on values in a predefined set: the variable `sex` could be set up to allow only entries of `Male` or `Female`, for example. Depending on the reason for collecting sex data, you might allow `sex` to have more values.
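A minimal sketch of such a variable, using made-up values, might look as follows. The `levels` argument of `factor` fixes the allowed set; the second vector, `sex2`, is a hypothetical example of what happens to a value outside that set.

```r
# Three made-up observations, restricted to two allowed values
sex <- factor(c("Male", "Female", "Female"), levels = c("Female", "Male"))

levels(sex)       # "Female" "Male"
table(sex)        # counts: Female 2, Male 1
as.integer(sex)   # 2 1 1 : factors are stored as integer codes for the levels

# A value outside the predefined set becomes NA
sex2 <- factor(c("Male", "Other"), levels = c("Female", "Male"))
is.na(sex2)       # FALSE TRUE
```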

`NA` isn’t a data type, but a value that can take on *any* data type. It stands for Not Available, and it means that there is no data collected for that value. This is most useful if you think of a big set of data with lots of variables. Maybe you are collecting height, weight, sex and blood pressure. You aren’t able to get all of the data from all of the people, so you record what data you can get, and put `NA` in the other columns. Missing data is a problem that comes up frequently; see the subsection *Missing Values* below for a brief introduction on how to deal with missing values.
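To illustrate that `NA` fits into any data type, here is a small sketch with made-up measurements. The function `is.na` tests each element for missingness.

```r
height <- c(62, NA, 70)            # numeric vector with one missing value
sex <- c("Female", NA, "Male")     # character vector with one missing value

class(height)     # "numeric"   : the NA did not change the type
class(sex)        # "character"
is.na(height)     # FALSE TRUE FALSE
sum(is.na(sex))   # 1 : number of missing entries
```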

My experience has been that students underestimate the importance of knowing what type of data they are working with. R works really well when the data types are assigned properly. However, some bizarre things can occur when you try to force R to do something with a data type that is different from what you think it is! My strong suggestion is, whenever you examine a new data set (especially one that you read in from a file!), your first move is to use `str()` on it, followed by `head()`. Make sure that the data is stored the way you want *before* you continue with anything else.

## 1.5 Data Frames

Consider the built-in data set `rivers`. By typing `?rivers`, we learn that this data set gives the lengths (in miles) of 141 major rivers in North America, as compiled by the US Geological Survey. This data set is explored further in the exercises in this chapter. By typing `head(rivers)`, we see that `rivers` is a **vector** of values that give the length of the rivers.

Now, it would be very useful if the `rivers` data set also stored the *names* of the rivers. That is, for each river, we would like to know **both** the name of the river and the length of the river. This leads us to one of the most common data types in R, *data frames*. A data frame consists of a number of observations of variables. Some examples would be:

- The name and length of major rivers.
- The height, weight and blood pressure of a sample of healthy, adult females.
- The high and low temperature in St Louis, MO, for each day of 2016.

As a specific example, let’s look at the data set `mtcars`, which is a predefined data set in R.

Start with `str(mtcars)`. You can see that `mtcars` consists of 32 observations of 11 variables. The variable names are `mpg`, `cyl`, `disp` and so on. You can also type `?mtcars` on the console to see information on the data set. Some data sets have more detailed help pages than others, but it is always a good idea to look at the help page.

You can see that the data is from the 1974(!) *Motor Trend* magazine. You might wonder why we use such an old data set. In the R community, there are standard data sets that get used as examples when people create new code. The fact that familiar data sets are usually used lets people focus on the new aspect of the code rather than on the data set itself. In this course, we will do a mix of data sets; some will be up-to-date and hopefully interesting. Others will be so that you begin to familiarize yourself with the common data sets that developeRs use.

There are two ways to access the data in `mtcars`. You can use `$` notation or `[]` notation. To examine the weights of the cars, for example, we could do

`mtcars$wt`

```
## [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440
## [12] 4.070 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520
## [23] 3.435 3.840 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
```

Or, we could do `mtcars[,"wt"]` or `mtcars[,6]`. If we want to see what the third car’s weight is, we could use

`mtcars$wt[3]`

`## [1] 2.32`

Or, we could use `mtcars[3,6]`. If we want to form a new data frame, call it `smallmtcars`, that only contains the variables `mpg`, `cyl` and `qsec`, we could use `smallmtcars <- mtcars[,c(1,2,7)]`. If we want to look at only the first 10 observations, we could use `mtcars[1:10,]`. We can also select observations that satisfy certain properties. For example, if we want to pull out all observations that get more than 25 miles per gallon, then we could use `mtcars[mtcars$mpg > 25,]`.
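Column positions are easy to miscount, so selecting columns by name is often easier to read. The sketch below checks that the two approaches agree for `mpg`, `cyl` and `qsec`; the names `by_position` and `by_name` are introduced only for this illustration.

```r
by_position <- mtcars[, c(1, 2, 7)]
by_name <- mtcars[, c("mpg", "cyl", "qsec")]

identical(by_position, by_name)   # TRUE : columns 1, 2 and 7 are mpg, cyl, qsec
head(by_name, 3)                  # first three observations of the smaller data frame
```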

In order to test equality of two values, you use `==`. For example, in order to see which cars have 2 carburetors, we can use `mtcars[mtcars$carb == 2,]`. Finally, to combine multiple conditions, you can use the vector logical operators `&` for *and* and `|` for *or*. As an example, to see which cars either have 2 carburetors or 3 forward gears (or both), we would use `mtcars[mtcars$carb == 2 | mtcars$gear == 3,]`. There are several exercises below which will allow you to practice manipulating data frames. In the chapter on data manipulation, we will introduce `dplyr` tools which we will use to do more advanced manipulations, but it is good to be able to do basic things with `[,]` and `$` as well.
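Row conditions and column selection can be combined in a single call. As a sketch, the following keeps only the cars with 2 carburetors *and* more than 25 miles per gallon, and shows just two columns; the name `efficient_two_carb` is made up for this example.

```r
# Rows: both conditions must hold. Columns: selected by name.
efficient_two_carb <- mtcars[mtcars$carb == 2 & mtcars$mpg > 25, c("mpg", "carb")]
efficient_two_carb

# The row names of a subset of mtcars are the car models
rownames(efficient_two_carb)

# Number of cars with 2 carburetors or 3 forward gears (or both)
nrow(mtcars[mtcars$carb == 2 | mtcars$gear == 3, ])
```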

## 1.6 Reading data from files

Loading data into R is one of the most important things to be able to do. If you can’t get R to load your data, then it doesn’t matter what kinds of neat tricks you could have done. It is also one of the most frustrating things - not just in R, but in general. If you are lucky, then the data that you are trying to load into R is saved as a .csv file. The extension .csv stands for “Comma Separated Values” and means that the data is stored in rows with commas separating the variables. For example, it might look like this:

```
"Gender","Body.Temp","Heart.Rate"
"Male",96.3,70
"Male",96.7,71
"Male",96.9,74
"Female",96.4,69
"Female",96.7,62
```

This would mean that there are three variables: `Gender`, `Body.Temp` and `Heart.Rate`. There are 5 observations: 3 males and 2 females. The first male had a body temperature of 96.3 and a heart rate of 70.
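To see how R interprets this layout without creating a file, the sample can be passed to `read.csv` through its `text` argument. This is only a sketch for experimenting; the names `csv_text` and `temp_sample` are made up here.

```r
# The five sample rows from above, stored as one string
csv_text <- '"Gender","Body.Temp","Heart.Rate"
"Male",96.3,70
"Male",96.7,71
"Male",96.9,74
"Female",96.4,69
"Female",96.7,62'

temp_sample <- read.csv(text = csv_text)   # parse the string as if it were a file
str(temp_sample)                           # 5 observations of 3 variables
temp_sample$Body.Temp[1]                   # 96.3
table(temp_sample$Gender)                  # Female 2, Male 3
```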

Now, even though you are lucky that your data is in .csv format, if you are a computer novice, then you will still have some frustration getting the file into R. One way to do it is to store the file on your computer. Now, type `getwd()` in the Console to see what your current working directory is. Make sure that the file you want to read is stored in the current working directory. How do you know? Well, you can click on the `Files` tab in the lower-right panel in RStudio, and see whether you see the file. If you do not, then it isn’t in the right directory. You will need to move it into the directory specified by `getwd()`.

Once you have it in the correct directory, you are ready to load the file into R. At this point, you simply type `my_variable <- read.csv("file.name")`. For example, to load the .csv `normtemp.csv`, which contains the gender, body temperature and heart rate data mentioned above, I would type `temp_data <- read.csv("normtemp.csv")`. More advanced users may want to set up a file structure that has data stored in a separate folder, in which case they could either specify the full path or the relative path to the file they want to load. There are also **interactive** ways to load data, but we do not encourage their use in this book, as the results of an analysis from interactive loading of data will *not* be reproducible.
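The check "is the file in my working directory?" can also be done from code. The sketch below assumes a file called `normtemp.csv`, as in the text; `file.exists` is a base R function, and the message in the `else` branch is just one possible way to report the problem.

```r
getwd()                          # the directory R is currently working in

file_name <- "normtemp.csv"      # file we hope to find in that directory
if (file.exists(file_name)) {
  temp_data <- read.csv(file_name)
  str(temp_data)                 # always inspect the data after loading it
} else {
  message("Cannot find ", file_name, " in ", getwd())
}
```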

In other instances, the csv that you want to read is available as a file hosted on a web page. In this case, it is usually easier to read the file directly from the web page by using `read.csv("http://website.csv")`. As an example, there is a csv hosted at http://stat.slu.edu/~speegle/_book_data/stlTempData.csv, which you can load by typing `stlTemp <- read.csv("http://stat.slu.edu/~speegle/_book_data/stlTempData.csv")`.

I can’t emphasize enough the importance of looking at your data after you have loaded it. Start by using `str()` and `head()` on your variable after reading it in. As often as not, there will be something you will need to change in the data frame.

Finally, you can also write R data frames to a csv file, in order to share with other people. If you have a data frame that you wish to store as a .csv file, you use the `write.csv()` command. If your row names are not meaningful, then often you will want to add `row.names = FALSE`. The command `write.csv(mtcars, "mtcars_file.csv")` writes the variable `mtcars` to the file `mtcars_file.csv`, which is again stored in the directory specified by `getwd()` by default.
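A round trip through a csv file can be sketched as follows. To avoid cluttering the working directory, this example writes to a temporary file created with `tempfile` (an assumption of this sketch, not something the text requires) and then reads the file back.

```r
out_file <- tempfile(fileext = ".csv")           # throwaway file name

write.csv(mtcars, out_file, row.names = FALSE)   # drop the row names when writing
cars_back <- read.csv(out_file)                  # read the file back in

dim(cars_back)          # 32 11 : same shape as mtcars
names(cars_back)[1:3]   # "mpg" "cyl" "disp"

unlink(out_file)        # delete the temporary file
```

For `mtcars` the row names are the car models, so `row.names = FALSE` discards them; in a case like this you would probably leave that option out.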

## 1.7 Missing Values

Many times, there will be missing values in a data set, which are denoted by `NA`. If your data contains missing values, then `sum` and `max` will return the value `NA`, rather than any meaningful number. Let’s look at an example. These are daily precipitation, maximum temperature and minimum temperature readings at the St Louis Science Center for the calendar year 2016, as downloaded from noaa.gov.

`stlTempData <- read.csv("http://stat.slu.edu/~speegle/_book_data/stlTempData.csv")`

As always, we want to run `str()`, `summary` and `head` on this new data set!

`str(stlTempData)`

```
## 'data.frame': 349 obs. of 6 variables:
## $ STATION : Factor w/ 1 level "GHCND:USC00237452": 1 1 1 1 1 1 1 1 1 1 ...
## $ STATION_NAME: Factor w/ 1 level "ST LOUIS SCIENCE CTR MO US": 1 1 1 1 1 1 1 1 1 1 ...
## $ DATE : int 20160101 20160102 20160103 20160104 20160105 20160106 20160107 20160108 20160109 20160110 ...
## $ PRCP : num 0 0 0 0 0 0 0 0.21 0.22 0.15 ...
## $ TMAX : int 33 42 49 38 37 40 47 48 53 37 ...
## $ TMIN : int 26 28 29 28 24 28 33 32 37 6 ...
```

`summary(stlTempData)`

```
## STATION STATION_NAME
## GHCND:USC00237452:349 ST LOUIS SCIENCE CTR MO US:349
##
##
##
##
##
##
## DATE PRCP TMAX TMIN
## Min. :20160101 Min. :0.0000 Min. :14.00 Min. : 2.00
## 1st Qu.:20160404 1st Qu.:0.0000 1st Qu.:55.50 1st Qu.:36.50
## Median :20160702 Median :0.0000 Median :73.00 Median :55.00
## Mean :20160669 Mean :0.1304 Mean :69.45 Mean :51.76
## 3rd Qu.:20160930 3rd Qu.:0.0200 3rd Qu.:87.00 3rd Qu.:69.00
## Max. :20161231 Max. :4.3900 Max. :99.00 Max. :82.00
## NA's :1 NA's :18 NA's :18
```

`head(stlTempData)`

```
## STATION STATION_NAME DATE PRCP TMAX TMIN
## 1 GHCND:USC00237452 ST LOUIS SCIENCE CTR MO US 20160101 0 33 26
## 2 GHCND:USC00237452 ST LOUIS SCIENCE CTR MO US 20160102 0 42 28
## 3 GHCND:USC00237452 ST LOUIS SCIENCE CTR MO US 20160103 0 49 29
## 4 GHCND:USC00237452 ST LOUIS SCIENCE CTR MO US 20160104 0 38 28
## 5 GHCND:USC00237452 ST LOUIS SCIENCE CTR MO US 20160105 0 37 24
## 6 GHCND:USC00237452 ST LOUIS SCIENCE CTR MO US 20160106 0 40 28
```

From this, we can see that not all dates are represented, as there are only 349 observations, where there should have been 366. Moreover, we can also see that the max temperature, min temperature and precipitation were not recorded on 18, 18 and 1 day(s) respectively. We can also see that the hottest it got was 99 degrees (huh), and the coldest it got was 2 degrees. The most it rained on any one day was 4.39 inches.

Suppose, however, that we wanted to compute the mean of the maximum temperatures directly, using the `mean` function. We could try

`mean(stlTempData$TMAX)`

`## [1] NA`

But, this gives us the answer `NA`. If we want R to compute the mean of the values that are there, we need to add the option `na.rm = TRUE`.

`mean(stlTempData$TMAX, na.rm = TRUE)`

`## [1] 69.45317`

Similarly, if we want to compute the max using `max` or the standard deviation using `sd`, we would need to do it as follows:

`max(stlTempData$TMAX, na.rm = TRUE)`

`## [1] 99`

`sd(stlTempData$TMAX, na.rm = TRUE)`

`## [1] 19.89868`
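The effect of `na.rm = TRUE` is easiest to see on a vector small enough to check by hand. The values in `tmax` below are made up for illustration and are not taken from the St Louis data.

```r
tmax <- c(33, 42, NA, 38)     # four readings, one of them missing

mean(tmax)                    # NA : one missing value spoils the result
mean(tmax, na.rm = TRUE)      # 37.66667 : (33 + 42 + 38) / 3
sum(is.na(tmax))              # 1 : how many values are missing
mean(tmax[!is.na(tmax)])      # same answer, removing the NA by hand
```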

## 1.8 Exercises

- Let `x <- c(1,2,3)` and `y <- c(6,5,4)`. Predict what will happen when the following pieces of code are run. Check your answer.
  - `x * 2`
  - `x * y`
  - `x[1] * y[2]`

- Let `x <- c(1,2,3)` and `y <- c(6,5,4)`. What is the value of `x` after each of the following commands? (Assume that each part starts with the values of `x` and `y` given above.)
  - `x + x`
  - `x <- x + x`
  - `y <- x + x`
  - `x <- x + 1`

- Determine the values of the vector `vec` after each of the following commands is run.
  - `vec <- 1:10`
  - `vec <- 1:10 * 2`
  - `vec <- 1:10^2`
  - `vec <- 1:10 + 1`
  - `vec <- 1:(10 * 2)`

- Use R to calculate the sum of the squares of all numbers from 1 to 100: \(1^2 + 2^2 + \dotsb + 99^2 + 100^2\)

- Let `x` be the vector obtained by running the R command `x <- seq(10, 30, 2)`.
  - What is the length of `x`? (By length, we mean the number of elements in the vector. This can be obtained using the `str` function or the `length` function.)
  - What is `x[2]`?
  - What is `x[1:5]`?
  - What is `x[1:3*2]`?
  - What is `x > 25`?
  - What is `x[x > 25]`?
  - What is `x[-1]`?
  - What is `x[-1:-3]`?

- Consider the built-in data frame `airquality`.
  - How many observations of how many variables are there?
  - What are the names of the variables?
  - What type of data is each variable?
  - Do you agree with the data type that has been given to each variable? What would have been some alternative choices?

- R has a built-in vector `rivers` which contains the lengths of major North American rivers.
  - Use `?rivers` to learn about the data set.
  - Find the mean and sd of the rivers data.
  - Make a histogram (`hist`) of the rivers data.
  - Get the five number summary (`summary`) of rivers data.
  - Find the longest and shortest river in the set.
  - Make a list of all (the lengths of the) rivers longer than 1000 miles.

- There is a built-in data set `state`, which is really seven separate variables with names such as `state.name`, `state.region`, and `state.area`.
  - What are the possible regions a state can be in? How many states are in each region?
  - Which states have area less than 10000 square miles?
  - Which state’s geographic center is furthest south? (Hint: use `which.min`)

- Consider the `mtcars` data set.
  - Which cars have 4 forward gears?
  - What subset of `mtcars` does `mtcars[mtcars$disp > 150 & mtcars$mpg > 20,]` describe?
  - Which cars have 4 forward gears and manual transmission? (Note: manual transmission is 1 and automatic is 0.)
  - Which cars have 4 forward gears or manual transmission?
  - Find the mean mpg of the cars with 2 carburetors.

- Complete the Introduction to R DataCamp tutorial.

- In the text, we loaded the data at http://stat.slu.edu/~speegle/_book_data/stlTempData.csv by reading it directly from the web site. For large files, this can be time-consuming to do every time, and it also requires you to always have an internet connection when you want to use that data. Load that data set by first downloading it onto your machine, putting it in the correct directory, and using `read.csv`.

- Install the package `Lahman` by clicking on `Install` under the `Packages` tab. Type in `Lahman`. (Or, use the command `install.packages("Lahman")`.) Then, load the library into memory by typing `library(Lahman)`. Consider the data set `Batting`, which should now be available. It contains batting statistics of all major league players broken down by season since 1871. We will be using this data set extensively in the data wrangling chapter of this book.
  - How many observations of how many variables are there?
  - Use the command `head(Batting)` to get a look at the first six lines of data.
  - What is the largest number of triples (X3B) that have been hit in a single season?
  - What is the playerID(s) of the person(s) who hit the most triples in a single season? In what year did it happen?
  - Which player hit the most triples in a single season since 1960?