Chapter 1 Data Visualization

One of the primary jobs of a data analyst is to be able to communicate meaning extracted from data, and maybe the most important way to do this is through a well-crafted visualization of the data. In practice, there are many steps in an analysis project that would precede the visualization stage, but we’ll save those steps for later chapters so we can get to the fun stuff right away.

We’ll need the tidyverse library throughout this chapter.

library(tidyverse)

1.1 Scatter Plots

Both tidyverse and base R come with several built-in data sets. One of our most often-used ones will be mpg. You can load it simply by entering mpg.

mpg
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class  
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
##  1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compact
##  2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p     compact
##  3 audi         a4           2    2008     4 manual(m6) f        20    31 p     compact
##  4 audi         a4           2    2008     4 auto(av)   f        21    30 p     compact
##  5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact
##  6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compact
##  7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compact
##  8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p     compact
##  9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p     compact
## 10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p     compact
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows

In the output displayed above, you can see that the data set is a tibble. (This is the name given to data tables in tidyverse. The name distinguishes it, as a data structure, from the data frames of base R.) This tibble has 234 rows (the observations) and 11 columns (the variables).
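If you’d rather confirm the dimensions than read them off the printout, a few base R functions will do it. This is only a quick sketch; dim, nrow, ncol, and names are base R functions that the text hasn’t introduced.

```r
library(ggplot2)  # already attached if you ran library(tidyverse); mpg lives here

mpg_dims <- dim(mpg)   # c(number of rows, number of columns)
mpg_dims
nrow(mpg)              # number of observations
ncol(mpg)              # number of variables
names(mpg)             # the variable (column) names
```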

You can only see a portion of the mpg tibble, but you can access the full scrollable data set by entering:

View(mpg)

Though we can see all of the rows and columns now, it’s still not clear what everything means. Any data set you analyze should come with a data dictionary, which contains, at least, a definition of each of the data set’s variables along with the units of measurement. For built-in data sets, we can access the data dictionary as follows:

?mpg

The main purpose of data analysis is to use data to answer questions. For example, in reference to the mpg data set, we could ask, “Do cars with big engines burn more fuel than cars with small engines?”

Like most questions in data analysis, this is somewhat open-ended, and it’s often the job of the analyst to propose a more specific version of the question. One way to do this would be to look for a relationship between engine size (which is the displ variable) and fuel efficiency (which could be either hwy or cty). There are statistical methods for detecting such relationships, which we’ll see in Chapter 5, but for now, we’ll look for a visual relationship by obtaining a scatter plot with displ on the x-axis and hwy on the y-axis.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

Before we use the above plot to propose an answer to our question, we should make a few comments about the code used to generate it.

  • ggplot is a function for generating graphics in R. It comes from the ggplot2 package, which is loaded as part of tidyverse.
  • data = is the main argument to ggplot, and it’s where you specify the data set your visualization is based on.
  • The + sign at the end of the first line indicates that we’re adding a layer to our plot as defined by the next line.
  • The types of visualizations in ggplot2 are known as geoms. geom_point is the scatter plot geom. We’ll see other geoms soon.
  • mapping = is the main argument to a geom. It specifies how variables within the data set are mapped to various aesthetic features of the visualization.
  • aes stands for aesthetic, and its arguments are the assignments of variables from the data set to various features (i.e., aesthetics) of the plot. In the above example, displ is mapped to the x aesthetic (which defines what the x-axis represents) and hwy is mapped to the y aesthetic (which defines what the y-axis represents).
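The argument names in this code can be dropped when the arguments are given in their standard order, and you’ll see this shorter style in some of the exercises. The sketch below builds the same plot both ways; the object names p_named and p_short are just labels chosen for this illustration.

```r
library(ggplot2)

# Fully named version, as in the text
p_named <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Same plot, relying on argument order:
# ggplot()'s first argument is data, geom_point()'s first is mapping,
# and aes()'s first two are x and y
p_short <- ggplot(mpg) +
  geom_point(aes(displ, hwy))

p_short
```

The named version is longer but easier to read while you’re learning, which is presumably why the text prefers it.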

The scatter plot seems to indicate that, generally, the bigger a car’s engine is, the worse its fuel efficiency is.

Before trying the exercises below, it will be necessary to distinguish between two types of variables. Consider the cyl variable, for example. A car engine can only have an integer number of cylinders – there are 4-cylinder engines, but not 4.7-cylinder engines. Thus, the values of the cyl variable can be sorted into discrete categories, one for each of the integers 4, 5, 6, and 8. We say that cyl is a categorical (or discrete) variable.

On the other hand, consider hwy. The values of this variable in mpg range from 12 to 44, so it at first seems like hwy would also be categorical, with the integers 12 through 44 being its categories. However, unlike cyl, it would make sense to assign any numerical value (within realistic bounds) to hwy. For example, it would make sense for hwy to take on a value such as 24.3 or any other number, integer or not, between some minimum and maximum. We thus say that hwy is a continuous variable.
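The values quoted above are easy to check. Here is a small sketch using base R functions (unique, sort, table, and range) that haven’t been introduced in the text:

```r
library(ggplot2)

cyl_values <- sort(unique(mpg$cyl))  # the distinct cylinder counts
cyl_values
table(mpg$cyl)                       # how many cars fall into each category

hwy_range <- range(mpg$hwy)          # smallest and largest highway mileage
hwy_range
```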

1.1.1 Exercises

  1. What would you expect a scatter plot showing the relationship between hwy and cty to look like? Create such a plot.

  2. Classify each of the 11 variables in the mpg data set as either continuous or categorical.

  3. Create a scatter plot to illustrate the relationship between fuel efficiency and a car’s drive train. What does your plot say about this relationship?

  4. How is the plot from the previous problem qualitatively different from the plot of hwy vs displ? What accounts for this difference? (Hint: Think about what type of variable drv is.)

  5. Scatter plots aren’t always appropriate ways to visualize relationships among variables. Get a scatter plot of class vs. drv and explain why it’s not very useful. (Whenever we make statements like “variable vs. variable,” the first variable is considered to be the dependent, aka y-axis, variable and the second is the independent, aka x-axis, variable.)

  6. Explain why scatter plots are most appropriate when the variables assigned to x and y are both continuous.

1.2 Adding New Aesthetics to a Scatter Plot

There are some exceptions to the hwy vs. displ trend observed in the last section. Look at the points toward the upper right of the plot, which represent cars with big engines that nonetheless have surprisingly high fuel efficiency. Why might this be? What’s going on with these cars?

One conjecture might be that these are powerful cars (a consequence of a big engine) that are lightweight and aerodynamic (which leads to better fuel efficiency). In other words, maybe they’re sports cars. Notice that the mpg data set contains a class variable. The sports cars are probably the ones whose class value is 2seater. One way to test this is by color-coding the points according to the class value. This is accomplished by assigning class to the color aesthetic:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

It seems we’re right; the pinkish-orange dots in the upper right indicate that these cars are sports cars. We also see that, as expected, SUVs and pickup trucks don’t get good gas mileage even when their engines are relatively small and that the cars with highway mileages above 40 miles per gallon are all compact or subcompact cars. (Their fl value is d, so they run on diesel.)
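What the colors suggest can be double-checked by pulling out the rows in question. The sketch below uses base R’s bracket subsetting, which hasn’t been covered yet, and the cutoffs displ > 5 and hwy > 22 are an eyeballed guess at what “upper right” means, not something defined in the data.

```r
library(ggplot2)

# Rows for the big-engine cars with better-than-expected mileage
# (cutoffs chosen by eye from the scatter plot; adjust as you see fit)
upper_right <- mpg[mpg$displ > 5 & mpg$hwy > 22, ]
upper_right[, c("manufacturer", "model", "displ", "hwy", "class")]

# How many of those rows fall into each class?
table(upper_right$class)

# Conversely: the engine sizes of all the 2seater cars
two_seaters <- mpg[mpg$class == "2seater", ]
two_seaters$displ
```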

Each geom is equipped with several aesthetics. So far we’ve seen the x, y, and color aesthetics for geom_point. You can investigate others by reading the geom_point documentation:

?geom_point

The exercises below will give you a chance to experiment with some of these aesthetics. As you’ll see, some aesthetics are designed for categorical and some for continuous variables.

1.2.1 Exercises

  1. Obtain a scatter plot of hwy vs displ, but map the variable class to the aesthetic shape rather than to color. Why do you get a warning message?

  2. Re-do the previous problem, mapping a categorical variable with fewer categories than class to the shape aesthetic.

  3. What happens if you map a continuous variable to shape?

  4. Try mapping variables, both continuous and categorical, to the size aesthetic. What do you observe?

  5. What happens when you map a continuous variable to color?

  6. Create a scatter plot with two extra aesthetics (in addition to x and y). Don’t map the same variable to each one. Why might this not be considered a very good practice?

  7. Map a single variable to two different aesthetics. Why might this sometimes be a good idea?

  8. Run the following code. Why might this be useful?

ggplot(mpg) +
  geom_point(aes(displ, hwy, color = cyl<6))

1.3 Labelling Visualizations

Recall our color-coded scatter plot from Section 1.2:

As it is, this plot would not be ready to include in a data analysis report because nothing is labeled. At a minimum, every visualization you create that someone else will see should have the following:

  • Meaningful labels for each axis
  • Meaningful labels for the legend (if applicable)
  • The units of measurement for each variable used (if applicable)

These labels are added by adding another layer to the ggplot call. (Notice the + sign after the second line.)

ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  labs(x = "engine size (liters)",
       y = "fuel efficiency (miles per gallon)",
       color = "type of vehicle")

It’s often helpful to add other labels as well, including:

  • a title (a general description of what your plot is supposed to display)
  • a subtitle (a more detailed description; often omitted unless there’s a good reason to include one)
  • a caption (a good place to say where your data comes from)

Here’s how to do it:

ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  labs(x = "engine displacement (L)",
       y = "highway fuel efficiency (mpg)",
       color = "type of vehicle",
       title = "Fuel Economy as a Function of Engine Size",
       subtitle = "Fuel Efficiency and Engine Size are Inversely Related",
       caption = "Data obtained from fueleconomy.gov")

1.4 Trend Curves

Scatter plots are a good way to visualize a relationship between two continuous variables. Another way is to use a trend curve, which is accomplished with geom_smooth. We won’t worry about labels with any of the following examples, but remember that labels should always be included on visualizations when they’ll be viewed by others.

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

The gray area surrounding the curve is called an envelope. It shows how much uncertainty there is in the position of the curve: the wider the envelope, the less precisely the data pins down the trend at that value of x, usually because the data is sparse or spread out there. (In statistical terms, it is a confidence interval, 95% by default, built from a statistic called the standard error.) The envelope can be turned off if desired:

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy), se = FALSE)

geom_smooth comes equipped with several aesthetics, which you can view in the documentation: ?geom_smooth. You can map variables to these aesthetics just like you did with geom_point. Here’s the color aesthetic:

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))

As you can see, it groups the data into the various categories of the drv variable and gives a separate curve for each category. This is a good way to compare trends among the different categories. (With geom_smooth, the color aesthetic only produces separate curves like this when it’s mapped to a categorical variable.)

Trend curves are usually used in conjunction with scatter plots:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

Notice that the code above has a lot of repetition. When all the layers in a visualization have a mapping in common, that mapping can be placed in the ggplot call instead. For example, the following code produces the same plot as above:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
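Mappings placed in the ggplot call are inherited by every layer, and a layer can still add mappings of its own on top of them. As a sketch (this particular combination isn’t one of the plots above), the following colors the points by class while keeping a single trend curve for all of the data:

```r
library(ggplot2)

p_layers <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +  # x and y inherited; color added for this layer only
  geom_smooth(se = FALSE)                     # x and y inherited; one curve for all cars

p_layers
```

Had color = class been placed in the ggplot call instead, geom_smooth would have inherited it too and drawn one curve per class.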

1.4.1 Exercises

  1. Check the geom_smooth documentation (?geom_smooth) to see some of the other aesthetics available besides color. Produce two different trend curve plots, each of which uses a different aesthetic. Display your plots and describe what your chosen aesthetics do.

  2. Obtain a scatter plot that:

    • shows the relationship between city fuel efficiency and highway fuel efficiency,
    • color-codes the points by the type of drive train,
    • has an overlaid trend curve with no envelope, and
    • has a title, properly labeled axes (including units), and a properly labeled legend.
  3. Create a trend curve of hwy vs displ and map the class variable to color. What do you think of your visualization? What recommendation would you make regarding mapping categorical variables to color?

1.5 Time Series Visualizations

A very common question posed to data analysts is how certain types of data change over time. When a data set contains a time variable of any kind (year, quarter, month, hour, etc.), we refer to it as time series data. A good way to analyze time series data is with a line graph.

The following data set gives home run totals in Major League Baseball by league every year from 1973 to 1995. Let’s analyze these home run totals as a time series. (This is not a built-in data set; running the following code chunk will import it from an external web site. We’ll learn how to import external data sets in Chapter 3.)

homeruns <- readr::read_csv("https://raw.githubusercontent.com/jafox11/MS282/main/homeruns.csv")
homeruns
## # A tibble: 46 x 3
##     year league home_run_total
##    <dbl> <chr>           <dbl>
##  1  1973 AL               1552
##  2  1973 NL               1550
##  3  1974 AL               1369
##  4  1974 NL               1280
##  5  1975 AL               1465
##  6  1975 NL               1233
##  7  1976 AL               1122
##  8  1976 NL               1113
##  9  1977 AL               2013
## 10  1977 NL               1631
## # ... with 36 more rows
## # i Use `print(n = ...)` to see more rows

The setup is the usual one; the geom that produces a line graph is geom_line:

ggplot(data = homeruns) +
  geom_line(mapping = aes(x = year, y = home_run_total))

This doesn’t look right. If you look carefully, you’ll notice that there’s a vertical line segment above each year. This is because each year in the data set is entered twice, once for each league. It would be better to get separate line graphs for each league so that we can compare. This is easily accomplished by mapping the league variable to the color aesthetic:

ggplot(data = homeruns) +
  geom_line(mapping = aes(x = year, y = home_run_total, color = league))
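To see where the vertical segments came from, it helps to count the rows per year. The sketch below avoids the download by retyping the first ten rows from the printout above into a small data frame (hr_small is a name made up for this illustration). It also shows the group aesthetic, which the text hasn’t covered: like color, it gives one line per league, but without changing the colors.

```r
library(ggplot2)

# First ten rows of homeruns, copied by hand from the printout above
hr_small <- data.frame(
  year = rep(1973:1977, each = 2),
  league = rep(c("AL", "NL"), times = 5),
  home_run_total = c(1552, 1550, 1369, 1280, 1465, 1233, 1122, 1113, 2013, 1631)
)

# Every year appears twice, once per league
rows_per_year <- table(hr_small$year)
rows_per_year

# group = league draws a separate line for each league
p_grouped <- ggplot(data = hr_small) +
  geom_line(mapping = aes(x = year, y = home_run_total, group = league))
p_grouped
```

Mapping league to color is usually the better choice for a report, since it also produces a legend that says which line is which.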

It’s often helpful to overlay a scatter plot on top of the line graph. Notice that we can move the mapping argument up to the ggplot line since it applies to both geoms.

ggplot(data = homeruns, mapping = aes(x = year, y = home_run_total, color = league)) +
  geom_line() +
  geom_point()

Line graphs are nice for time series analyses because the lines that join successive points make it easier to visually follow changes over time.

1.5.1 Exercises

  1. It looks like from 1973 onward, the American League always hit more home runs than the National League. Research what change was instituted in 1973 that might explain this.

  2. It also looks like the gap between the American and National Leagues widened significantly in 1977 and then shrank significantly in 1993. Find out why this might have been.

  3. Do some research to explain why home run totals in both leagues were much lower than usual in 1981.

1.6 Box Plots

In Exercises 3 and 4 of Section 1.1.1, we considered the following scatter plot:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = hwy))

The point of that exercise was to show that scatter plots are not the right visualization to use when either of the variables is categorical. A much better option is a box plot. Before we see how these are created, there are a few terms to define.

  • Given an ordered data set, the median is the number in the middle position if there are an odd number of points or the average of the two middle numbers if there are an even number of points. For example, in an ordered data set with 11 points, the median is the 6th point in the list. If there are 24 points, the median is the average of the 12th and 13th points. The median of a data set has the property that 50% of the points are less than or equal to the median and 50% are greater than or equal to it.

  • Given an ordered data set, the first quartile (Q1) is, like the median, a cut point in the list which divides the data points into the smallest 25% and the largest 75%. The third quartile (Q3) is a cut point which divides the list into the smallest 75% and the largest 25%. (In case you’re wondering, the second quartile (Q2) is just the median.) The data that lies between Q1 and Q3 thus comprises about 50% of the data in the set.

  • The interquartile range (IQR) is the value of Q3 \(-\) Q1.


Example: Suppose our data set consists of the points 32.1, 56.3, 27.2, 6.7, 56.5, 24.7, 12.9, 35.8, 54.1, and 71.1. To find the median and quartiles, we first have to put the numbers in order: 6.7, 12.9, 24.7, 27.2, 32.1, 35.8, 54.1, 56.3, 56.5, 71.1. Since there are 10 data points, the median is the average of the 5th and 6th numbers: 33.95.

Since our 10 data points cannot be divided into four groups of equal size, we cannot find Q1 and Q3 exactly. There are actually several competing methods for finding quartiles in this case. To avoid too much of a detour, we will use R to compute quartiles. As you can see below, the way to enter a list of data points in R is to enter them into a numeric vector: c(32.1, 56.3, 27.2, 6.7, 56.5, 24.7, 12.9, 35.8, 54.1, 71.1).

x <- c(32.1, 56.3, 27.2, 6.7, 56.5, 24.7, 12.9, 35.8, 54.1, 71.1)
quantile(x, probs = c(0.25, 0.50, 0.75))
##    25%    50%    75% 
## 25.325 33.950 55.750

We thus see that Q1 = 25.325, Q2 = 33.95 (which we already knew), and Q3 = 55.75. The IQR is 55.75 \(-\) 25.325 = 30.425.

The code above refers to the fact that quartiles are actually specific examples of quantiles. Quantiles are a set of cut points that divide a numerical data set into specified percentiles. Quartiles, then, are quantiles for which the percentiles are 25%, 50%, and 75%.
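The other quantities from this example can be computed in the same way. The following sketch uses the base R functions sort, median, and IQR, none of which appear in the text above; quantile is called with its default method, as before.

```r
x <- c(32.1, 56.3, 27.2, 6.7, 56.5, 24.7, 12.9, 35.8, 54.1, 71.1)

sort(x)                                            # the ordered data
x_median <- median(x)                              # average of the 5th and 6th ordered values
x_quartiles <- quantile(x, probs = c(0.25, 0.75))  # Q1 and Q3
x_iqr <- IQR(x)                                    # Q3 - Q1

x_median
x_quartiles
x_iqr
```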


We are now finally ready to create and interpret a box plot. The set-up is similar to that of scatter plots, trend curves, and line graphs; the new geom to use is geom_boxplot:

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = drv, y = hwy))

For each of the possible values of drv (4, f, and r), we get a diagram interpreted as follows:

  • The horizontal line inside the box is drawn at the median value of hwy for the given value of drv.

  • The top edge of the box is drawn at the third quartile, and the bottom edge is drawn at the first quartile.

  • The vertical line above the box (the upper whisker) extends from Q3 up to the largest data point that is no more than 1.5 \(\times\) IQR above Q3; if no point lies beyond that distance, it simply ends at the maximum. The line below the box likewise extends from Q1 down to the smallest data point that is no more than 1.5 \(\times\) IQR below Q1.

  • The dots above and below the vertical lines represent data points that are greater than Q3 \(+\) 1.5 \(\times\) IQR and less than Q1 \(-\) 1.5 \(\times\) IQR, respectively. These should be considered potential outliers.
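The pieces of a box can be worked out by hand. The sketch below does so for the front-wheel-drive cars, using base R subsetting (not yet covered) and quantile with its default method; the variable names are made up for this illustration.

```r
library(ggplot2)

hwy_f <- mpg$hwy[mpg$drv == "f"]   # highway mileage for front-wheel-drive cars

q1 <- unname(quantile(hwy_f, 0.25))
q3 <- unname(quantile(hwy_f, 0.75))
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr      # points below this are drawn as dots
upper_fence <- q3 + 1.5 * iqr      # points above this are drawn as dots

# Potential outliers according to the 1.5 x IQR rule
outliers <- hwy_f[hwy_f < lower_fence | hwy_f > upper_fence]
sort(outliers)

# The most extreme points that are not flagged as outliers
whisker_ends <- range(hwy_f[hwy_f >= lower_fence & hwy_f <= upper_fence])
whisker_ends
```

Comparing these numbers with the box for f in the plot is a good way to check your reading of the diagram.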

The above plot of hwy vs drv thus shows that:

  • Cars with front-wheel drive tend to get the best highway gas mileage but also that there’s a lot of variation among the values of hwy in that group, especially beyond the interquartile range.

  • Cars with rear-wheel drive have a larger interquartile range, but the values of hwy don’t stray very far outside that range.

  • Cars with 4-wheel drive have a median value that is much closer to Q1 than to Q3, which suggests that the values of hwy above the median are significantly farther from the median than those below.


It’s often helpful to produce box plots in which the boxes are displayed in ascending or descending order of height. For the box plot above, this would mean switching the boxes for r and f. We can do this automatically by specifying that we want to re-order the boxes so that the drv variable is ordered according to its mean value of hwy. Here’s how to do this:

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(drv, hwy), y = hwy))
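To see what reorder is doing behind the scenes, you can compute the group means yourself. This sketch uses the base R functions tapply and levels, which haven’t been introduced. It also shows reorder’s FUN argument, which swaps the mean for another summary such as the median.

```r
library(ggplot2)

# Mean highway mileage for each drive train
mean_hwy <- tapply(mpg$hwy, mpg$drv, mean)
mean_hwy

# reorder() sorts the categories by that mean (its default summary function)
drv_by_mean <- levels(reorder(mpg$drv, mpg$hwy))
drv_by_mean

# Passing FUN = median orders the categories by median instead
drv_by_median <- levels(reorder(mpg$drv, mpg$hwy, FUN = median))
drv_by_median
```

Since the line inside each box marks the median, ordering by median is often the more natural choice for box plots.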

Another option we have with box plots is to flip them so that the boxes are displayed horizontally. This is especially helpful when the categorical variable has a lot of values crowded onto the x-axis. We can do this by adding a coord_flip() layer:

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(drv, hwy), y = hwy)) +
  coord_flip()

Box plots also come equipped with various aesthetics, which you’ll explore in the exercises. The main thing to remember about box plots, though, is that they’re a good way to visualize how a continuous variable depends on a categorical one.

1.6.1 Exercises

These exercises will make use of the built-in diamonds data set. First take a few minutes to look through the data set and read the documentation.

View(diamonds)
?diamonds
  1. Obtain a visualization that shows the relationship between carat and price. (Pay attention to the types of variables you’re using so that you create the appropriate type of visualization.)

  2. Repeat the previous problem, but replace carat with clarity.

  3. According to your plot from the previous problem, which type of clarity is priced the highest? Which type has the most variation in price?

  4. Referring again to the plot from Exercise 2, why do you think all of the outliers are above the main cluster of data points?

  5. Obtain a box plot of price vs color. Three of the aesthetics included in geom_boxplot are color, fill, and linetype. Try mapping the cut variable to each of these aesthetics. Which of these three (in your opinion) is the most effective way to distinguish among different diamond cuts in your box plot and why?

  6. Re-do Exercise 2, but display the boxes horizontally, arranged by average price value.

1.7 Visualizing Distributions: Bar Graphs and Histograms

The visualizations we’ve seen so far - scatter plots, trend curves, line graphs, and box plots - are ways to display covariation among two or more variables. This means that they show how changes in one variable are related to changes in another variable. In this section, we look at ways to visualize how values of a single variable vary. The two primary visualizations that accomplish this are bar graphs (for categorical variables) and histograms (for continuous variables).

Bar graphs and histograms both visualize how a single variable is distributed by providing a count of how many times each value of the variable is attained. We’ll start with bar graphs. Notice that we only have to provide an x mapping; the y-variable in the plot is the value count:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

We see that diamonds with an ideal cut are the most numerous in the set. Sometimes it’s preferable to have a bar graph show the percentage of the data set that each category makes up. In this case, we can think of the bar graph as representing a probability distribution over the different categories. We can override the default mapping of count to y by instead mapping the computed proportion, after_stat(prop). (Older code writes this as stat(prop), which still runs but has been deprecated since ggplot2 3.4.0.)

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

The group = 1 mapping is included to make sure that the proportions calculated by after_stat(prop) are relative to the entire data set, not just relative to each category. Try running the code above without group = 1 and see what happens.

From the above bar graph, we can read off how likely it is that a randomly chosen diamond falls into any given category. For example, it looks like we have about a 22% chance of selecting a “Very Good” diamond at random.
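The heights of the bars can also be computed directly, which is a handy check on what you read off the graph. This is a sketch using the base R functions table and prop.table, neither of which has been introduced in the text:

```r
library(ggplot2)

cut_counts <- table(diamonds$cut)    # the bar heights in the count version
cut_counts

cut_props <- prop.table(cut_counts)  # each count divided by the total number of diamonds
round(cut_props, 3)
```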


Histograms serve the same purpose as bar graphs but are used for continuous, rather than categorical, variables. The problem with continuous data is that we don’t want to count how many times each value of a variable shows up; there would be way too many such values and thus way too many bars in the bar graph.

We solve the problem of visualizing a continuous distribution by binning. Binning is the process of subdividing the continuous interval of values, stretching from the minimum value to the maximum, into subintervals (or bins). Each bin is then treated like a category, and a count of the number of values in each bin is made and used to determine the height of a bar (just like in a bar graph). The default in ggplot is to use 30 bins.

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price))
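Binning can be imitated by hand with the base R function cut, which hasn’t been covered in the text. Treat this as a rough sketch of the idea only: ggplot places its bin edges somewhat differently, so the counts below won’t match the histogram’s bar heights exactly.

```r
library(ggplot2)

# Split the range of price into 30 equal-width subintervals
price_bins <- cut(diamonds$price, breaks = 30)

# Count how many diamonds land in each bin
bin_counts <- table(price_bins)
head(bin_counts)
```

Each entry of bin_counts plays the role of one bar: the label gives the subinterval of prices, and the count gives the bar’s height.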

You can override the default number of bins by either specifying how many bins you want or by specifying how wide you want your bins to be:

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), bins = 50)

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 1500)

Choosing the number of bins in a histogram is a subtle art; you don’t want them to be so wide that they make the distribution look too coarse, and you don’t want them to be so narrow that they make the distribution look too fine or “granular.”

Both geom_bar and geom_histogram have several aesthetics, which you’ll explore in the exercises.

1.7.1 Exercises

  1. Create a visualization of the distribution of the clarity variable in diamonds.

  2. Re-do the previous problem, but have the y-axis record the percentages of each clarity value within diamonds.

  3. Create a visualization of the distribution of the carat variable in diamonds.

  4. Create a histogram for which the bins are way too narrow. Then create one for which they’re way too wide. What do you observe in each case?

  5. Create a bar graph that shows the distribution of the price variable in diamonds. Why does it look so spiky? What should you have done to visualize the distribution of price?

  6. Two useful aesthetics for bar graphs are color and fill. For your bar graph from Exercise 1 above, try mapping cut to color and then to fill. Which seems to be the more helpful aesthetic?

  7. A more useful way to apply the color or fill aesthetics is to pair them with the optional position argument. Run the following code and reflect on how the bar graph might be better than the ones from Exercise 6.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, fill = cut), position = "dodge")

  8. Explain why the following code would produce a bar graph that would contain some redundancy but would also be visually pleasing:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

  9. Reproduce the bar graph from the previous problem, but add an extra layer: + coord_polar(). What happens?

  10. Create a visualization of the distribution of the manufacturer variable in mpg. Explain why this visualization would look better if the bars were displayed horizontally rather than vertically, then create such a visualization. (Hint: Recall how this is done for box plots.)

  11. The aesthetics for geom_histogram are the same as those for geom_bar. Create a histogram that shows the distribution of carat within diamonds and use an aesthetic to show the distribution of cut values within each bin.

  12. A geom very similar to geom_histogram is geom_freqpoly. Create a visualization of the distribution of price within diamonds using geom_freqpoly instead of geom_histogram. Why might this sometimes be preferable to a histogram?

  13. As you may have seen in Exercise 11, you can map a third variable to the color or fill aesthetic, but the result is somewhat visually confusing. Retry Exercise 11, but use geom_freqpoly rather than geom_histogram. Do you think this is a better visualization?

  14. Using the mpg data set, create a visualization of the distribution of types of cars in the data set, and include a breakdown of the distribution of drive trains present for each type of car. Label the x- and y-axes and the legend with meaningful names, and give your plot an appropriate title.

1.8 Summary

There are thousands of combinations of visualizations and aesthetics available for ggplot, and these visualizations are almost infinitely customizable. We’ve only scratched the very outermost surface here by seeing how a few commonly used visualizations are produced.

But for each visualization, the basic grammar is the same:


ggplot(data = <DATASET>) +
  geom_<GEOM TYPE>(mapping = aes(<MAPPINGS>), position = <POSITION>) +
  <NEXT GEOM LAYER> +
  <NEXT GEOM LAYER> +
  <ETC> +
  <COORDINATES>
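As an illustration, here is one way the template might be filled in, using pieces that all appeared earlier in the chapter. The particular combination, the label text, and the name p_template are just choices made for this sketch.

```r
library(ggplot2)

p_template <- ggplot(data = diamonds) +                          # <DATASET>
  geom_bar(mapping = aes(x = clarity, fill = cut),               # <GEOM TYPE> and <MAPPINGS>
           position = "dodge") +                                 # <POSITION>
  labs(x = "clarity", y = "number of diamonds", fill = "cut") +  # a further layer
  coord_flip()                                                   # <COORDINATES>

p_template
```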

And the visualizations we’ve seen at this point are:

| visualization     | x           | y                   |
|-------------------|-------------|---------------------|
| scatter plot      | continuous  | continuous          |
| trend curve       | continuous  | continuous          |
| line graph        | continuous  | continuous          |
| box plot          | categorical | continuous          |
| bar graph         | categorical | count or proportion |
| histogram         | continuous  | count or proportion |
| frequency polygon | continuous  | count or proportion |

1.9 R Markdown

The formatting you’ll use for most of our projects is provided by R Markdown, a typesetting language which allows for a combination of text, R code, and R output, including visualizations.

From the “File” menu in RStudio, choose “New File” and then “R Markdown.” Before the file opens, you’re given the option to enter your name, the title of your report, and the output type. Choose “HTML” for the output type.

When your .Rmd file is created, some sample input is included so that you can see how the document is put together. R Markdown files consist of a combination of text and code chunks. A code chunk has the form:

```{r <OPTIONAL NAME OF CHUNK>, <OPTIONAL SETTINGS>}
<CODE GOES HERE>
```

Each chunk is equipped with a run button (a green triangle) that you can use to test the code in your chunk. Note that if your code chunks make use of any tidyverse functions, an earlier chunk in the document has to contain our usual library(tidyverse).

Sometimes it might be necessary to hide the code, output, or both of a code chunk for the purpose of producing a professional looking report. These settings are applied in the heading of your code chunk (where it says <OPTIONAL SETTINGS> above). For example, in the sample .Rmd file, in the first code chunk, include=FALSE is an option that instructs the compiler to not include this code or its output in the compiled report. Here are some settings to keep in mind:

  • include=FALSE prevents both the code and the code’s output from appearing in the compiled document.

  • echo=FALSE prevents the code, but not the output, from appearing.

  • eval=FALSE prevents the code from being run, so the code appears but no output is produced.

  • message=FALSE prevents any messages generated with the output from appearing.

  • warning=FALSE prevents any warnings generated with the output from appearing.

Since you will almost never want messages or warnings to show up in your compiled document, you can set these options globally at the beginning of your .Rmd file. After the --- that follows your title, name, date, output, etc., just enter the following code chunk. Then you won’t have to turn off messages and warnings within each individual chunk.

```{r, include=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE)
```

There are other settings that might come in handy as well. There is also a lot of formatting syntax used in R Markdown, including ways to create bold font, italics, bulleted lists, hyperlinks, etc. Your best bet is to Google “how to ______ in R Markdown” when you need to apply some kind of formatting.

When you’re ready to compile your .Rmd file into a finished html document, click the “Knit” button at the top of the editor window.
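If you prefer to compile from code rather than with the button, the render function from the rmarkdown package does the same job. The sketch below is only an illustration: it writes a small stand-in file named example_report.Rmd (a hypothetical name) to a temporary folder and renders it, and it assumes that rmarkdown is installed. Rendering to html also requires pandoc, which comes bundled with RStudio, so the sketch checks for it first.

```r
# Sketch: compile an .Rmd file to html from code instead of the Knit button.
library(rmarkdown)

# A stand-in for your own report (hypothetical file name and contents).
rmd_path <- file.path(tempdir(), "example_report.Rmd")
writeLines(c(
  "---",
  'title: "Example Report"',
  "output: html_document",
  "---",
  "",
  "Some text describing the analysis.",
  "",
  "```{r}",
  "summary(cars)",
  "```"
), rmd_path)

# render() returns the path of the compiled file.
html_path <- NULL
if (pandoc_available()) {
  html_path <- render(rmd_path, quiet = TRUE)
}
print(html_path)
```

For your own report, you would skip the writeLines step and pass the path of your .Rmd file to render.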


It’s good practice to show at least part of the data set you’re using in your analysis report. However, if you try our usual View(<DATA SET>) in an R Markdown file, nothing shows up in the compiled report. Instead, use the datatable function from the DT package. You’ll first have to install this package, but don’t do so in an R Markdown code chunk. Run the installation once from an R script file instead.

install.packages("DT")

Then load the DT library in your R Markdown file. You could do this right after you load tidyverse:

library(DT)

The following code shows how to use the datatable function. The options = list(scrollX = TRUE) argument ensures that if the data table is too wide for the screen, a horizontal scroll bar will be provided in the compiled document. If you put this in a code chunk in an R Markdown file, the data table will appear in your report as it is seen below.

datatable(mpg, options = list(scrollX = TRUE))

1.10 Sample Report

1.11 Project

Project Description: The purpose of this project is to use ggplot and its various geoms to answer questions about a data set by creating meaningful and aesthetically pleasing visualizations.

In particular, we will be analyzing data relating to the TV show The Office, which, as everyone knows, is the best show of all time. You can import the data set by executing all three commands in the following code chunk. (The data set is a compilation of the ones found here and here.)

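# Read the data set from the course repository.
office_ratings <- readr::read_csv('https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv')
# Store season as text so that it is treated as a categorical variable.
office_ratings$season <- as.character(office_ratings$season)
# Convert air_date from month/day/year text into a proper Date.
office_ratings$air_date <- as.Date(office_ratings$air_date, "%m/%d/%Y")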

The name of the data set to which you’ll refer is office_ratings. And here’s the data dictionary:

| variable    | data type   | description                                        |
|-------------|-------------|----------------------------------------------------|
| season      | categorical | season during which the episode aired              |
| episode     | categorical | episode number within the season                   |
| title       | categorical | title of episode                                   |
| viewers     | continuous  | number of viewers in millions on original air date |
| imdb_rating | continuous  | average fan rating on IMDb.com from 1 to 10        |
| total_votes | continuous  | number of ratings on IMDb.com                      |
| air_date    | date        | date episode originally aired                      |
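The two conversion commands in the import code are what make season categorical and air_date a true date. Here is a small sketch of what they do, using a toy data frame with invented rows (these are not values from office_ratings) and assuming the dates are stored as month/day/year text, as the format string in the import code suggests.

```r
# Sketch: the same two conversions applied to a toy data frame.
# The rows below are invented purely for illustration.
toy <- data.frame(
  season   = c(1L, 1L, 2L),
  air_date = c("01/15/2010", "01/22/2010", "10/01/2010"),
  stringsAsFactors = FALSE
)

# Integers become text, so plots treat season as a set of categories
# rather than as a continuous number.
toy$season <- as.character(toy$season)

# Text becomes a Date, so rows can be ordered and spaced correctly in time.
toy$air_date <- as.Date(toy$air_date, format = "%m/%d/%Y")

str(toy)

# Date arithmetic now works: days elapsed between consecutive rows.
print(diff(toy$air_date))
```

Without the second conversion, air_date would be treated as plain text, and a plot over time would not space the episodes by their actual dates.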

Instructions:

Use well-chosen visualizations to answer the following questions.

  1. How is each of the three continuous variables distributed? (Where is the peak and what does it tell you? What does the shape of the distribution tell you? Are there any extreme values?)

  2. Is it the case that the more people watch an episode, the better it’s liked?

  3. Are there any exceptions to the trend you noticed in the previous problem? Use a visualization to try to explain these exceptions.

  4. Is it the case that the more people watch an episode, the more people leave an IMDb rating? Are there exceptions? If so, use a visualization to try to explain them.

  5. How did the show’s popularity change over time?

  6. How did the show’s appeal change over time? (Be careful: popularity and appeal are not the same thing. Think about which variables address these two attributes.)

  7. If the trends in the previous two problems are different, try to explain why this might be.

  8. Is there a trend in total viewership within the individual seasons? Are there any notable changes in viewership within any season? If so, can you explain the reason for these changes?

These problems should be worked out in a scratch work .R file which you will not turn in. Once you’re done, write up your results in an R Markdown report, making sure to follow the guidelines below.

Guidelines:

  • Include a title that describes your analysis, your name, and the date in the heading.
  • Also include a preliminary code chunk in which you load the libraries you’ll need, and briefly say what each one is for.
  • Begin with an introductory paragraph containing, at least, a description of the data set (including what it contains, its size, and its source) and a nicely displayed data table using the datatable function. (If the data set is too big to display when you try to knit your .Rmd file, you don’t have to display it.)
  • Clearly describe what you’ll be doing with the data, and include any questions you’ll be trying to answer.
  • Follow the introduction and problem statement with the body of your report which will contain your work, broken into relevant section headings.
  • The body should include text that provides a running narrative that guides the reader through your work. This should be thorough and complete, but it should avoid large blocks of text.
  • Do not use the problem numbering from the Instructions above in the body. That’s just for your benefit as you prepare your work. Your narrative structure, not an enumerated list, should lead the reader from one question to the next.
  • The finished report should show all of your R code and its output, but it should not show any warning or error messages. (This is for my benefit. A professional report might not include the code, depending on the audience.)
  • All graphics should look nice, be easy to read, and be fully labelled.
  • You should include insightful statements about the meaning you’re extracting from your graphics rather than just superficial descriptions of your visualizations.
  • End with an overall concluding statement which summarizes your findings.

Grading Rubric:

  • Graphics: Do the graphics convey meaning? Do they look nice? Are they fully labelled? Are the geoms used appropriate for the data being displayed? (30 points)

  • Insights: Are insights fully explained, well-written, and significant (i.e., not superficial)? Are your insights derived from the graphics? (30 points)

  • Narrative: Is it clear what you’re trying to do in this project? Do you maintain a readable narrative throughout that guides the reader through your analysis? (20 points)

  • Professionalism: Does your report look nice? Do you provide insights based on your analysis? Is your code clear and readable? (15 points)