Chapter 3 R Essentials

This chapter offers a brief—and very basic—introduction to R. This material is covered more thoroughly elsewhere. See, for example:

3.1 Introduction to R

As noted earlier, R is an object-oriented programming language.

Use the assignment operator, “<-”, to assign a value to an object. (The equals sign, “=”, works for assignment also.) However, commands can be run directly without storing results in an object. For example, type 5 + 5 in the console.1 The most basic use of R is as a calculator.

5 + 5
## [1] 10

Let’s assign the results of that calculation to an object, x:

(x <- 5 + 5) # An example of assignment
## [1] 10

Here we have assigned the resulting sum, 10, to the object x. (We could have called this object anything, as long as the name consisted in a character string.) Notice that x, along with the value assigned to it, now shows up under Values in the environment tab of the Environment pane. The parentheses tell R not just to do the calculation, and store it in x, but also to print the result to the console.

The above code snippet also demonstrates how to add comments to a script using “#.” R will ignore any text following a hashtag. Comments are a good way to communicate details about your code to collaborators, or your future self, when coming back to a script after a break. As code gets more complicated, and analyses more involved, comments become more important.

Once we have assigned a value to x, we can use it to perform further operations, the results of which can then also be stored in an object. For example:

new_object <- x + x
new_object
## [1] 20

The power of object-oriented programming consists in this ability to store values in objects for additional manipulation and computation.

3.2 Getting data into R

Getting data into R can sometimes be a challenge! The easiest option is to acquire a dataset that is included in base R, such as mtcars, using the data() function:2

data(mtcars)

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The head() function displays the top 6 rows (the default) of a table for quick inspection. You can adjust the number of rows shown by including a value for the n argument to the head() function, like this: head(mtcars, n = 10). head() will then display 10 rows.3

You can also import data from your working drive using the read.csv() function, or, perhaps easier, use RStudio’s “import dataset” button located under the environment tab of the environment pane to browse for data on your computer. There are, in addition, other functions to import a great variety of data types, such as read.xlsx(), for MS Excel files, and read.table() for text files. The foreign package has functionality to import other, more exotic file types.

You can also import data directly from the web. Here, for example, is code to retrieve the “abalone” dataset from UC Irvine’s machine learning data repository. We will assign this dataset to an object, which, for brevity, we’ll denote “a”:

a <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header = F)

head(a) # Note: no column headers
##   V1    V2    V3    V4     V5     V6     V7    V8 V9
## 1  M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
## 2  M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070  7
## 3  F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210  9
## 4  M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
## 5  I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055  7
## 6  I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120  8

Here is info on the “abalone” dataset, which includes the variable names: abalone. Based on this information we can assign names to the columns in this table:

names(a) <- c("Sex", "Length"   ,"Diameter", "Height"   ,"Whole_weight",
              "Shucked_weight", "Viscera_weight", "Shell_weight","Rings")

names(a)
## [1] "Sex"            "Length"         "Diameter"       "Height"        
## [5] "Whole_weight"   "Shucked_weight" "Viscera_weight" "Shell_weight"  
## [9] "Rings"

The c() function (the “c” stands for “concatenate”) puts items together into a vector. names(a) is a character vector containing the column names of a table. Here we are assigning to names(a) a new vector of character variables, thereby renaming the columns. (Note that the command colnames(a) is identical to names(a).)

3.3 R Data types

These are the main data types in R:

  • numeric (or integer)
  • character
  • factor
  • logical

We can check to see which data types we have in a by using the str() command (“str” stands for “structure”). str() returns the dimensions of dataset, the underlying data types, and the first 10 observations of each variable.

str(a)
## 'data.frame':    4177 obs. of  9 variables:
##  $ Sex           : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
##  $ Length        : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ Diameter      : num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ Height        : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ Whole_weight  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ Shucked_weight: num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ Viscera_weight: num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ Shell_weight  : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ Rings         : int  15 7 9 10 7 8 20 16 9 19 ...

We can see that the a dataset has 4177 rows, 9 columns, and that all the variables are numeric or integer except for Sex which is a factor variable. Factors are character variables that have an assigned order. If we had wanted Sex to be imported as a character variable, we could have added an additional argument to the read.csv() function after header = F: stringsAsFactors = F.

R’s default action is to encode the levels of a factor based on alphabetic order. Thus, in the above output from str(a) we can see that the 3 levels of Sex are alphabetic: “F”,“I”, and “M” (standing for “female,” “indeterminate,” and “male,” in that order). Factors are extremely important in R, especially when we get to regression modeling, where the order of the factor levels determines which comparisons the lm() function will automatically report.

Note that the examples of Sex in the above call to str() are not “F,” “I” or “M” but rather 1, 2 or 3. This is because R stores factor variables in the background as numeric, in order to capture the order of the levels.

We can query and change the order of the factor levels as follows:

levels(a$Sex)
## [1] "F" "I" "M"
levels(a$Sex) <- c("M", "I", "F")
levels(a$Sex)
## [1] "M" "I" "F"

There is also a logical data type in R, consisting of TRUE or FALSE, which can be abbreviated T or F.

logical_vector <- c(T, F, F, T, T)
str(logical_vector)
##  logi [1:5] TRUE FALSE FALSE TRUE TRUE
sum(logical_vector) # The sum() function counts the Ts!
## [1] 3

It is often convenient to define a logical condition for a vector of values, transforming them into logical values, T or F. We define a logical vector using logical operators such as:

  • “==” (equal)
  • “!=” (not equal)
  • “<” (less than)
  • “<=” (less than or equal to)
  • “>” (greater than)
  • “>=” (greater than or equal to)
  • is.na()
  • !is.na()
  • “|” (pipe operator, or)
  • “&” (and)
  • isTRUE()
  • !isTRUE()

Below we define a logical vector for a$Ring based on whether the number of rings is greater than 10. We can then use sum() on the resulting vector to find out how many abalone in the dataset have more than 10 rings. This works because R counts the logical values TRUE and FALSE as 1 and 0, respectively. Using sum() and length() on our defined vector we can also calculate the proportion of a$Rings greater than 10. (The length() function returns the length of vector, or number of items, and is not to be confused with the nrow() function which counts the number of rows in a table.)

many_rings  <- a$Rings > 10
head(many_rings)
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE
sum(many_rings)
## [1] 1447
sum(many_rings)/length(many_rings) 
## [1] 0.3464209

So, 34.6420876% of the abalone in this dataset have more than 10 rings. If we wanted less precision than than R’s default 7 decimal places (which will usually be the case when reporting results) then we could use the round() function, the second argument of which is the number of desired decimal places:

round(sum(many_rings)/length(many_rings),2)
## [1] 0.35

Note that you never want to round values that will be used in subsequent operations! In that case, rounding error would compound through all of your calculations.

3.4 Working with R Data Structures

The following are the main data structures in R:

  • Vector
  • Dataframe
  • Matrix
  • List
  • Array

I discuss vectors and dataframes in some detail below, and will introduce other data structures elsewhere as necessary.

3.4.1 Dataframes and vectors

The workhorse data structure in R is the dataframe, which is essentially analogous to a spreadsheet, with rows and columns. In a dataframe the columns are necessarily vectors.

As noted above, vectors are unidimensional collections of items, with the following caveat: the items in a vector must always be of the same type—logical, numeric, character, etc. A dataframe, by contrast, is a multidimensional structure that will accommodate vectors of different data types as columns.

The str() function is, as we’ve seen, a handy way to query either vectors or dataframes to get information about their size or dimension as well as the constituent data types. Here are other commands useful for working with dataframes:

  • head() returns the first six rows by default. (Specify a value for the n argument to adjust the default: head(a, n = 20) will return 20 rows.)
  • dim() queries the dimension of a dataframe and returns a vector with the number of observations and number of variables.
  • nrow() returns the number of rows.
  • ncol() returns the number of columns.
dim(a) 
## [1] 4177    9
nrow(a)
## [1] 4177
ncol(a)
## [1] 9

3.4.2 Indexing and subsetting

A essential skill in working with data frames and vectors is indexing and subsetting. Here are some base R methods.

Square bracket notation allows you to index a position in a vector or data frame and return the observations at that position. If you are working with a vector, then use a single number, or set of numbers, in square brackets, to denote the vector position(s) for which you want observations. If you are working with a data frame then you need to specify in square brackets a row position and a column position, separated by a comma, to denote the row-column indexes for which you want observations.

# Index a vector
logical_vector[1:2]
## [1]  TRUE FALSE
# Index a dataframe
a[1:2, 1] # Returns the first two rows in the first column
## [1] F F
## Levels: M I F
a[c(1,2), 1] # Same result
## [1] F F
## Levels: M I F

Dollar sign notation (“$”) picks out a column vector from a dataframe, which can then be indexed with square brackets.

a$Sex[1:2] 
## [1] F F
## Levels: M I F

subset() in base R allows you to filter a dataframe according to logical criteria.

head(subset(a, Sex == "I" & Diameter > .3))
##     Sex Length Diameter Height Whole_weight Shucked_weight Viscera_weight
## 51    I  0.520    0.410  0.120       0.5950         0.2385         0.1110
## 113   I  0.435    0.320  0.080       0.3325         0.1485         0.0635
## 208   I  0.435    0.340  0.110       0.3795         0.1495         0.0850
## 225   I  0.425    0.380  0.105       0.3265         0.1285         0.0785
## 235   I  0.440    0.350  0.135       0.4350         0.1815         0.0830
## 316   I  0.450    0.355  0.110       0.4585         0.1940         0.0670
##     Shell_weight Rings
## 51         0.190     8
## 113        0.105     9
## 208        0.120     8
## 225        0.100    10
## 235        0.125    12
## 316        0.140     8

Here we have used head() to select just the first six rows of of a new dataset consisting of rows where Sex equals “I” and Diameter is greater than .3. We can perform this same operation with square brackets:

a[a$Sex == "I" & a$Diameter > .3, ][1:6, ]
##     Sex Length Diameter Height Whole_weight Shucked_weight Viscera_weight
## 51    I  0.520    0.410  0.120       0.5950         0.2385         0.1110
## 113   I  0.435    0.320  0.080       0.3325         0.1485         0.0635
## 208   I  0.435    0.340  0.110       0.3795         0.1495         0.0850
## 225   I  0.425    0.380  0.105       0.3265         0.1285         0.0785
## 235   I  0.440    0.350  0.135       0.4350         0.1815         0.0830
## 316   I  0.450    0.355  0.110       0.4585         0.1940         0.0670
##     Shell_weight Rings
## 51         0.190     8
## 113        0.105     9
## 208        0.120     8
## 225        0.100    10
## 235        0.125    12
## 316        0.140     8

3.5 Basic programming in R

3.5.1 Functions

When we use + or log() or ^2 we are using functions that have been programmed into base R. One of the great things about R is the ease with which we can write our own functions to simplify repeated custom calculations or operations. Here is an example for how to write a function that returns the cube of a number. The function will take one argument, x, as follows:

cube <- function(x){
  x * x * x
}

cube # Returns the function definition
## function(x){
##   x * x * x
## }
## <environment: 0x11001e318>
cube(2) # Returns the value of the function evaluated at x = 2 
## [1] 8
cube(3) # Returns the value of the function evaluated at x = 3 
## [1] 27

3.5.2 Control structures: loops

Another way of handling repeated operations is to use a loop. There are different types of loops; here we introduce one of the simplest, the for loop, which repeats an operation for a specified number of times. The for loop includes two arguments: the counter (often denoted “i” for iteration) and the length of the loop, or number of iterations. To use a for loop we first initialize a vector to store the values produced at each iteration. Here we calculate cubes for the numbers 1 - 10 using the cube() function we just created. The counter, i, will advance by 1 with each loop.

results <- numeric(10)
for(i in 1:10){
  results[i] <- cube(i)
}

results
##  [1]    1    8   27   64  125  216  343  512  729 1000

We have used the counter for each loop, i, to index our results vector, results[i], using it to store the value produced by cube(i). For loop i = 1, the calculation produced by cube(1) is stored in results[1]; likewise, for loop i = 2, we store the value produced by cube(2) in results[2], and so forth. Of course, we could also have performed this same operation in a vectorized fashion:

cube(1:10)
##  [1]    1    8   27   64  125  216  343  512  729 1000

The advantage of the second implementation is not only simplicity but also speed: loops are generally pretty slow ways of doing things in R. Nevertheless, loops are easy to understand and will be especially useful control structures when we get to bootstrapping.

3.5.3 Control structures: if-else

Some operations that could be handled by loops have been vectorized in R for speed and convenience. Examples include if() and ifelse(), which are are extremely helpful functions for performing a repeated operation on a vector, such as recoding a variable. Suppose we would like to create a binary variable that splits a$Length at the mean, coding all values above the mean as 1 and all values below or equal to the mean as 0. We’ll call the new variable Length_bin:

a$Length_bin <- ifelse(a$Length > mean(a$Length), 1, 0)

The first argument to ifelse() is the condition, the second is the stipulated value when the condition = T, and the third is the stipulated value when the condition = F. Thus, the above code says: when a$Length is greater than mean(a$Length) then make a$Length_bin 1 for that observation; but if a$Length is not greater than mean(a$Length) then make a$Length_bin 0. The result is a vector of 0s and 1s.

mean(a$Length)
## [1] 0.5239921
a$Length_bin[1:10]
##  [1] 0 0 1 0 0 0 1 1 0 1
a$Length[1:10]
##  [1] 0.455 0.350 0.530 0.440 0.330 0.425 0.530 0.545 0.475 0.550

3.6 Summarizing data

3.6.1 summary()

Base R contains many useful functions for summarizing data. The summary() function will automatically summarize the distributions of numeric variables in a dataframe, and will return tables of character or factor variables (along with NAs, where applicable).

summary(a)
##  Sex          Length         Diameter          Height      
##  M:1307   Min.   :0.075   Min.   :0.0550   Min.   :0.0000  
##  I:1342   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150  
##  F:1528   Median :0.545   Median :0.4250   Median :0.1400  
##           Mean   :0.524   Mean   :0.4079   Mean   :0.1395  
##           3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650  
##           Max.   :0.815   Max.   :0.6500   Max.   :1.1300  
##   Whole_weight    Shucked_weight   Viscera_weight    Shell_weight   
##  Min.   :0.0020   Min.   :0.0010   Min.   :0.0005   Min.   :0.0015  
##  1st Qu.:0.4415   1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300  
##  Median :0.7995   Median :0.3360   Median :0.1710   Median :0.2340  
##  Mean   :0.8287   Mean   :0.3594   Mean   :0.1806   Mean   :0.2388  
##  3rd Qu.:1.1530   3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3290  
##  Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050  
##      Rings          Length_bin    
##  Min.   : 1.000   Min.   :0.0000  
##  1st Qu.: 8.000   1st Qu.:0.0000  
##  Median : 9.000   Median :1.0000  
##  Mean   : 9.934   Mean   :0.5624  
##  3rd Qu.:11.000   3rd Qu.:1.0000  
##  Max.   :29.000   Max.   :1.0000

3.6.2 table()

The table() function can be used directly to get information about the distribution of a$Sex, while prop.table() returns proportions:

table(a$Sex)
## 
##    M    I    F 
## 1307 1342 1528
round(prop.table(table(a$Sex)), 2)
## 
##    M    I    F 
## 0.31 0.32 0.37

3.6.3 Barplots

The barplot() command in base R works with table() to visualize counts, or with prop.table() to visualize proportions:

barplot(table(a$Sex), main = "Counts  by Sex")

barplot(prop.table(table(a$Sex)), main = "Proportions by Sex")

3.6.4 Histograms

Bar plots summarize discrete variables (character of factor); histograms summarize numeric distributions.

hist() chooses a default bin size, counts the observations in that bin, and produces a plot with bars whose heights correspond to the frequency. Histograms are great for visualizing the range and shape of a distribution.

hist(a$Length)

We can see that most abalone, roughly the middle 50% of the distribution, have lengths between about .4 and .7. We can compute quartiles to get this information exactly, either using summary(a) as above, or quantile(a$Length). The default behavior of quantile() is to return quartiles:

quantile(a$Length)
##    0%   25%   50%   75%  100% 
## 0.075 0.450 0.545 0.615 0.815

Remember that quantiles are just percentiles. If we arranged each observation of a$Length from smallest to largest, then 0% would be the minimum and 100% would be the maximum; the first quartile, between .075 and .45, would contain the first 25% of all the observations, while the second quartile would contain the second 25% of all the observations, from .45 to the median, .545; the middle 50% of the distribution would extend from the first quartile, .45, to the third quartile, .615. The middle 50% of the observations can be used to characterize the central tendency of the distribution.

We can adjust the probs argument in the quantile() function to return other percentiles. Here, for example, is how we would get quintiles:

quantile(a$Length, probs = c(0, .2, .4, .6, .8, 1))
##    0%   20%   40%   60%   80%  100% 
## 0.075 0.425 0.510 0.575 0.625 0.815

Let’s include this distributional information on the histogram plot to precisely demarcate the middle 50% of the observations:

hist(a$Length)
abline(v = quantile(a$Length, probs = .25), col="red")
abline(v = quantile(a$Length, probs = .75), col="red")

We can decrease the size of the bins using the breaks argument to get finer resolution on this distribution of a$Length:

hist(a$Length, breaks = 100)
abline(v = quantile(a$Length, probs = .25), col="red")
abline(v = quantile(a$Length, probs = .75), col="red")

If we want to see what the hist() function is doing in the background we can query it using the “?” command: ?hist. The documentation for histogram indicates that hist() creates an object, consisting in a list, that is used by the function to create the plot. Use str() to look at the components of the list.

str(hist(a$Length, plot = F))
## List of 6
##  $ breaks  : num [1:17] 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 ...
##  $ counts  : int [1:16] 1 7 41 57 122 180 287 372 524 637 ...
##  $ density : num [1:16] 0.00479 0.03352 0.19631 0.27292 0.58415 ...
##  $ mids    : num [1:16] 0.075 0.125 0.175 0.225 0.275 0.325 0.375 0.425 0.475 0.525 ...
##  $ xname   : chr "a$Length"
##  $ equidist: logi TRUE
##  - attr(*, "class")= chr "histogram"

To subset a list we use double square brackets. In this case, hist(a$Length, plot=F)[[1]] returns the first list item, breaks, which indicates the interval defining each bin.

hist(a$Length, plot=F)[[1]]
##  [1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
## [15] 0.75 0.80 0.85

Single square brackets pick out elements from within list items. Thus, hist(a$Length, plot=F)[[1]][1:2] returns the first two elements of the first list item:

hist(a$Length, plot=F)[[1]][1:2]
## [1] 0.05 0.10

From the documentation for hist() we learn: “If right = TRUE (default), the histogram cells are intervals of the form (a, b], i.e., they include their right-hand endpoint, but not their left one, with the exception of the first cell when include.lowest is TRUE.” include.lowest = T is the default. Thus, the first bin contains all a$Length observations greater than or equal to 0.05 and less than or equal to 0.1.

3.6.5 Density plots

a$Length is a continuous variable, so we should really use a density plot.

plot(density(a$Length))

Density plots are basically smoothed histograms, but the density curve does not represent frequencies; instead, the height of the curve represents the probability density function, which can be used to estimate probabilities. For example, the max of the density plot is approximately 3.5 at around x = .6. If we take a small interval around .6—say, .59 to .61—then the height multiplied by the density yields the area under the curve in that region, which is equivalent to the probability for observations occurring in that interval: 3.5 x .02 = 0.07. An interval of the same length in a lower frequency region of the distribution will have a much lower associated probability. For example, the density at x = .4 is about 1.5, so the interval between .39 and .41 would be roughly equivalent to a probability of 1.5 x .02 = 0.03 The area under the density curve is a probability and thus integrates to 1. Read about the details of the algorithm that creates the default density object with?density.

3.6.6 Boxplots

Boxplots are another handy way to summarize, and compare, distributions.

boxplot(a$Length, main="Boxplot of Length", xlab="Length", ylab="Values")

The box in a boxplot represents the interquartile range, or IQR, which contains the observations between the first and third quartiles— the middle 50% of the observations in a distribution.4 The box thus represents the central tendency of the distribution, with its height (or width, depending on the orientation of the box) indicating the spread of the data. The black center line is the median; the edges of the box are known as the “hinges.” The lines extending beyond the hinges are the “whiskers”; they indicate observations that are more extreme than the IQR—less than the first quartile and greater than the third quartile. The whiskers extend 1.5 times the IQR beyond the hinges. If there are no observations that extreme then the whisker extends only as far as the minimum or the maximum. Observations beyond the whiskers—potential outliers—are represented as points. In the above boxplot we can see that there are a number of points smaller than the lower whisker.

As with hist() a call to boxplot() creates an object that we can inspect using str():

str(boxplot(a$Length, plot=F))
## List of 6
##  $ stats: num [1:5, 1] 0.205 0.45 0.545 0.615 0.815
##  $ n    : num 4177
##  $ conf : num [1:2, 1] 0.541 0.549
##  $ out  : num [1:49] 0.175 0.17 0.075 0.13 0.11 0.16 0.2 0.165 0.19 0.175 ...
##  $ group: num [1:49] 1 1 1 1 1 1 1 1 1 1 ...
##  $ names: chr "1"

boxplot(a$Length, plot=F)[[1]] in this case returns a five number summary of the distribution: the minimum or the bottom of the lower whisker, the max or the top of the upper whisker, the first and third quartile (the hinges), and the median (the center line).

Boxplots really shine when it comes to comparisons. Here is a boxplot with Sex on the x-axis and Length on the y.

boxplot(a$Length ~ a$Sex, main="Boxplot of Length by Sex", 
        xlab="Sex", ylab="Length")

Here we can see that Sex == I accounts for the small lengths that we saw in the first boxplot. While male and female abalone have similar lengths, indeterminates are consistently smaller; the top hinge for Sex == I is below the lower hinge of the other two boxes.

3.6.7 More density plots

Overlain density plots tell a similar story: one of these distributions is not like the other.

plot(density(subset(a, Sex=="M")$Length), xlim = c(0,1), main =
       "Density plot of Length by Sex: \n I  (green), M (black), F (green)")
lines(density(subset(a, Sex=="I")$Length), col=2)
lines(density(subset(a, Sex=="F")$Length), col=3)

This plot really deserves a legend. Unfortunately, legends are a pain to create in base R—much easier in the ggplot2 package. To create a legend in base R we have to keep track of colors. Black is represented by 1, red by 2 and green by 3. Here is the same plot with a legend.

plot(density(subset(a, Sex=="M")$Length), xlim = c(0,1), main =
       "Density plot of Length by Sex")
lines(density(subset(a, Sex=="I")$Length), col=2)
lines(density(subset(a, Sex=="F")$Length), col=3)
legend("topright",title="Sex", fill=1:3,
       legend =c("M", "I", "F"))

We can also put density plots side by side:

par(mfrow = c(1, 3))
plot(density(subset(a, Sex == "M")$Length), main = "Length for M")
plot(density(subset(a, Sex == "I")$Length), col = 2, main = "Length for I")
plot(density(subset(a, Sex == "F")$Length), col = 3, main = "Length for F")

par(mfrow = c(1, 1))

We have used the par(mfrow = c(1, 3)) command to tell R how many plots per row we want. Here we specified 3 columns in 1 row; par(mfrow = c(3, 1)) would have arranged 3 rows of plots all in 1 column. It is good practice afterwards to reset this parameter to the default: par(mfrow = c(1, 1)).

Clearly, it is easier to compare distributions when they are all on the same plot (or if they were arranged one above the other rather than side-by-side). Notice also that there are problems here with y-axis scales (they should all have the same scale). These again are formatting challenges handled automatically in ggplot2, so rather than laboring over these base R plots we will return to these issues in the next chapter when we introduce ggplot2.

3.6.8 Scatterplots

Scatterplots are great for showing bivariate relationships between continuous variables. (If one of your variables is a factor, then use a boxplot.) We might expect to find that abalone length and weight are correlated. Here is a scatterplot of length against weight with an ordinary least squares (OLS) line added to summarize the linear relationship:

plot(a$Whole_weight~a$Length, 
     main="Relationship between Abalone Length and Weight")
abline(lm(a$Whole_weight~a$Length), col="red")

Notice that we have used the lm() function—the main R function for fitting linear models—to add the OLS line. R knows how to extract the intercept and slope from the model object created by lm(). Clearly, however, this is a lousy linear model; the data displays an upward curve that might be better captured with a quadratic term in the model.

A plot showing a nonlinear fit can certainly be produced in base R graphics but is not entirely straightforward: this, task, too, we leave for the next chapter with ggplot2.
One thing we can accomplish fairly easily here is to color the points according to sex. Given the sex differences we’ve observed so far, it might well be the case that they explain this nonlinear relationship. It is, at any rate, worth a look. The following plot colors points by sex, which allows us to visualize how a bivariate relationship—the one between length and weight—varies by sex. Later we will describe this variation across levels of a factor as an interaction. Here length and sex interact to explain weight.

library(scales)
plot(a$Whole_weight~a$Length, 
     main="Relationship between Abalone Length and Weight \n varying by Sex", 
     col = alpha(as.numeric(a$Sex), 0.5), pch=19)

I have added transparency to the points (making the points behind somewhat visible) so we can better see which regions of the distribution are dominated by the different groups.

One possibility for what is going on here—that could be explored further—is that the indeterminate abalone are smaller than males and females, in terms of both length and weight, and that what looks like a nonlinear relationship between length and weight is actually just a linear relationship that varies by group. Plotting these separate fits to investigate this idea is, again, sort of a pain in base R.

We will revisit this topic—displaying linear fits to subsets of the data in the same plot—in the next chapter using ggplot2.

3.6.9 Scatterplot matrix

R has many functions for producing multi-panel plots that display bivariate relationships among all the variables in a data set. Here is one implementation.

pairs(a[,-1])


  1. To follow this tutorial, type (or copy and paste) the code examples into your RStudio console and press return. You may, alternatively, open an R script (a .R document) by clicking on the File menu option. Select New File > R script. Type (or copy and paste) the code examples into your practice script, then highlight them and press control-return to run them.

  2. mtcars contains information on the design and performance characteristics of older model cars (1973-4).

  3. If you need information about an R function, you may type “?” in the console followed by the function: for example, ?head. In RStudio this will automatically bring up the documentation for that function under the “Help” tab of the plot pane. The documentation contains comprehensive information about the arguments to the function, details on usage, the values returned, along with examples. In the case of head() there are two non-optional arguments: x (data) and n (size), which has a default value of 6. If an argument has a default setting, and you leave it blank, R will use automatically use that default. R functions recognize arguments either by name or position. Thus, because the head() expects first a data argument then a size argument, these two uses of head() are identical: head(mtcars, n = 10) and head(mtcars, 10).

  4. The hinges are actually “versions” of the first and third quartiles. See the documentation for boxplot.stats() for details.