## 4.3 Base R graphics

This chapter covers **base** R graphics, rather than **ggplot2**, which is covered in Chapter 2: Visualizing data of the ds4psy textbook.

Our main goals are:

Hands-on instructions on visualizing data in

**base**R.Distinguishing between different types of visualizations.

Adding aesthetics: Color, shape, size, etc.

Two important caveats:

Graphs often requires transforming data into a specific format or shape. We ignore this here, but will return to this topic in our part on wrangling data.

We occasionally use colors, but do not cover how to specify and select colors in R. In the following, we occasionally select colors and color functions of the

**unikn**package (see Appendix D: Using colors for details).

### 4.3.1 Basic plots

Due to the inclusion of the core packages **graphics** and **grDevices**, a standard installation of R comes with pretty powerful tools for creating a variety of visualizations.

In this chapter, we can only introduce some very basic commands for creating typical (named) graphs. We can distinguish between basic and complex plots:

*Basic plots*create an entire plot in one function call:

`hist()`

creates histograms;

`plot()`

creates point and line plots (as well as more complex ones);`barplot()`

creates bar charts;`boxplot()`

creates box plots; and`curve()`

allows drawing arbitrary lines and curves.

*Complex plots* (discussed below) are created by multiple function calls.
They are typically started with a generic call to:

`plot()`

, and then followed by more specific plot functions, like`grid()`

,`abline()`

,`points()`

,`text()`

,`title()`

, etc.

#### Histograms

Histograms are one of the simplest types of plots: They show the distribution of values of a single variable.

For demonstration purposes, we create a vector `x`

that randomly draws 500 values of a normal distribution:

`<- rnorm(n = 500, mean = 100, sd = 10) v `

In R, the `hist()`

function allows specifying the data to plot as an argument `x`

.
Providing our vector `v`

to the function yields the following:

`hist(x = v)`

This looks fairly straightforward, but due to the random nature of `x`

the distribution of its values will vary every time we re-create the vector `x`

.

Note that the `hist()`

command added some automatic elements to our plot:

- a descriptive title (above the plot);
- an x- and a y-axes, with appropriate value ranges, axis ticks and labels;
- some geometric objects (here: vertical bars or rectangles) that represent the values of data.

Under the hood, the function even re-arranged the input vector and computed something:
It categorized the values of `x`

into bins of a specific width and counted the number of values falling into each bin (to show their frequency on the y-axis).

Whenever we are unhappy with the automatic defaults, we can adjust some parameters.
In the case of histograms, an important parameter is the number of separate bins into which the data values are being categorized. This can be adjusted using the `breaks`

argument:

```
# specifying breaks:
hist(v, breaks = 5)
```

`hist(v, breaks = 25)`

Once we have settled on the basic parameters, we can adjust the labels and aesthetic aspects of a plot.
A good plot should always contain informative titles, value ranges, and labels.
In the following expression, we adjust the main title (using the `main`

argument),
the label of the x-Axis (using `xlab`

argument),
the main color of the bars and their border (using the `col`

and `border`

arguments):

```
# with aesthetics:
hist(v, breaks = 20,
main = "A basic histogram (showing the distribution of x)",
xlab = "Values of x",
col = "gold", border = "blue")
```

Note that we did not adjust the range of values on the x- and y-axes.
If we wanted to do this, we could have done so by providing the desired ranges (each as a vector with two numeric values) to the `xlim`

and `ylim`

arguments:

```
# with aesthetics:
hist(v, breaks = 20,
main = "A basic histogram (showing the distribution of x)",
xlab = "Values of x",
col = "gold", border = "maroon",
xlim = c(50, 150), ylim = c(0, 120))
```

#### Scatterplots

The `plot()`

function shows relationships between two variables `x`

and `y`

.
Actually, `plot()`

is a flexible plotting function in R:
On the one hand, it allows defining new plots (e.g., create a new plotting canvas).
On the other hand, calling the function with some data arguments aims to directly create a plot of them.

In this section, we will call it with two vectors `x`

and `y`

to create a *scatterplot* (i.e., a plot of points).
However, we will also see that `plot()`

allows creating different plots of the same data, specifically:

- a line plot;
- a step function.

We first create some data to plot.
Here are two simple numeric vectors `x`

and `y`

(where `y`

is defined as a function of `x`

):

```
# Data to plot:
<- -10:10
x <- x^2 y
```

When providing these vectors `x`

and `y`

to the `x`

and `y`

arguments of the `plot()`

function, we obtain:

```
# Minimal scatterplot:
plot(x = x, y = y)
```

Thus, the default plot chosen by `plot()`

was a scatterplot (i.e., a plot of points).
We can change the plot type by providing different values to the `type`

argument:

```
# Distinguish types:
plot(x, y, type = "p") # points (default)
```

`plot(x, y, type = "l") # lines`

`plot(x, y, type = "b") # both points and lines`

`plot(x, y, type = "o") # both overplotted `

`plot(x, y, type = "h") # histogram or height density`

`plot(x, y, type = "s") # steps`

See the documentation `?plot()`

for details on these types and additional arguments.
For most datasets, only some of these types make sense.
Actually, one of the most common uses of `plot()`

uses the type `n`

(for “no plotting”):

`plot(x, y, type = "n") # no plotting / nothing`

As we see, this does not plot any data, but created an empty plot (with appropriate axis ranges). When creating more complex plots (below), we start like this and then add various custom objects to our plot.

Once we have selected our plot, we can fiddle with its aesthetic properties and labels to make it both prettier and more informative:

```
# Set aesthetic parameters and other options:
plot(x, y, type = "b",
lty = 2, pch = 16, lwd = 2, col = "red", cex = 1.5,
main = "A basic plot (showing the relation between x and y)",
xlab = "X label", ylab = "Y label",
asp = 1/10 # aspect ratio (x/y)
)
```

#### Overplotting

A common problem with scatterplots is *overplotting* (i.e., too many points at the same locations).
For instance, suppose we wanted to plot the following data points:

```
# Data to plot:
<- runif(250, min = 0, max = 10)
x <- runif(250, min = 0, max = 10) y
```

Here is how basic scatterplot (with filled and fairly large points) would look like:

```
# Basic scatterplot:
plot(x, y, type = "p",
pch = 20, cex = 4, # filled circles with increased size
main = "An example of overplotting")
```

**Note:** In case you’re wondering what `pch`

and `cex`

mean:

- Calling
`?par()`

provides detailed information on a large variety of graphical parameters. - Calling
`par()`

shows your current system settings.

One of several solutions to the overplotting problem lies in using transparent colors, that allow viewing overlaps of graphical objects. There are several ways of obtaining transparent colors.
For instance, the following solution uses the **unikn** package to select some dark color (called `Petrol`

) and then use the `usecol()`

function to set it to 1/3 of its original opacity:

```
library(unikn)
<- usecol(Petrol, alpha = 1/3) my_col
```

Providing `my_col`

to the `col`

argument of `plot()`

yields:

```
# Set aesthetic parameters and other options:
plot(x, y, type = "p",
pch = 20, cex = 4,
col = my_col,
main = "Addressing overplotting (by color transparency)")
```

Note that the following `type`

variants of `plot()`

may look pretty, but unless we are trying to make an artistic point they make very limited sense given this data:

```
# Select colors:
<- usecol(c(Karpfenblau, Pinky, Petrol, Bordeaux), alpha = 2/3)
my_cols
# Plot 4 types of same data:
plot(x, y, type = "l", col = my_cols[1], main = "A: Line plot")
plot(x, y, type = "b", col = my_cols[2], main = "B: Both points and lines")
plot(x, y, type = "h", col = my_cols[3], main = "C: Height density plot")
plot(x, y, type = "s", col = my_cols[4], main = "D: Step plot")
```

Thus, which `type`

of `plot()`

makes sense is primarily a function of the data that is to be shown.^{1}

#### Bar plots

One of the most common, but also quite complicated types of plots are bar plots (aka. bar charts). The two main reasons why bar plots are complicated are:

the bars often represent processed data (e.g., counts, or the means, sums, or proportions of values).

the bars can be arranged in multiple ways (e.g., stacked vs. beside each other, grouped, etc.)

When we have a named vector of data values that we want to plot, the `barplot()`

command is pretty straightforward:

```
# A vector as data:
<- c(1, 3, 2, 4, 2) # some values
v names(v) <- c(LETTERS[1:5]) # add names
barplot(height = v, col = Seeblau)
```

In most cases, we have some more complicated data (e.g., a data frame or multiple vectors). To create a bar graph from data, we first create a table that contains the values we want to show.

A simple example could use the `mpg`

data from **ggplot2**:

```
# From data:
<- ggplot2::mpg
mpg
# Turn class into a factor variable:
$class <- factor(mpg$class) mpg
```

The following `table()`

function creates a simple table of data by counting the number of observations (here: cars) for each level of the `class`

variable:

```
# Create a table:
<- table(mpg$class)
tb # names(tb)
tb
```

```
##
## 2seater compact midsize minivan pickup subcompact suv
## 5 47 41 11 33 35 62
```

Providing this table as the `height`

argument of `barplot()`

creates a basic bar plot:

`barplot(height = tb) # basic version`

Adding aesthetics and labels renders the plot more colorful and informative:

```
barplot(height = tb,
main = "Counts of cars by class",
xlab = "Class of car",
las = 2, # vertical labels
col = usecol(pal_unikn_light))
```

An alternative way of creating a `barplot()`

would use the `data`

and `formula`

arguments:

Using the `UCBAdmissions`

data:

```
<- as.data.frame(UCBAdmissions)
df
<- df[df$Dept == "A", ]
df_A <- df[df$Dept == "E", ]
df_E
# Select 2 colors:
<- c(Seeblau, Bordeaux) # two colors my_cols
```

Create two bar plots:

```
# A:
barplot(data = df_A, Freq ~ Gender + Admit, beside = TRUE,
main = "Department A", col = my_cols, legend = TRUE)
```

```
# E:
barplot(data = df_E, Freq ~ Gender + Admit, beside = TRUE,
main = "Department E", col = my_cols, legend = TRUE)
```

Problem: Legend position overlaps with bars.

Two possible solutions:

```
# Moving legend position:
# Solution 1: specify args.legend (as a list)
barplot(data = df_E, Freq ~ Gender + Admit, beside = TRUE,
main = "Department E", col = my_cols,
legend = TRUE, args.legend = list(bty = "n", x = "topleft"))
```

```
# Solution 2: Adjust the size of the x-axis:
barplot(data = df_E, Freq ~ Gender + Admit, beside = TRUE,
main = "Department E", col = my_cols,
legend = TRUE, xlim = c(1, 8))
```

### 4.3.2 Complex plots

The commands covered so far provide shortcuts and all do multiple things (e.g., select default layouts, range and labels on axes, as well as drawing particular objects). If we want more control about our plots or modify some of the automatic choices, we can specify all parts of a plot in detail.

*Complex plots* are created by multiple function calls and typically start by calling `plot()`

with `type = "n"`

to create a basic plot or canvass, and then adding additional objects to it.
More specific plot functions include `grid()`

, `abline()`

, `points()`

, `text()`

, `title()`

, as well as setting graphical parameters (with `par()`

).
In the following, we will illustrate the most important functions in the context of some visualizations.

#### Composing plots as programs

More advanced plots in the **base** R plotting system are created by calling successive R functions to incrementally build a visualization.

When constructing more advanced visualizations, plotting occurs in two main stages:

Calling

`plot()`

for creating and defining the dimensions of a new plot;Using annotation functions for adding elements to the plot (e.g., points, lines, or text elements, as well as grids, titles, labels, and legends).

This two-stage process essentially turns the holistic task of “making a plot” into one of step-wise planning and composition. Rather than just calling one command that creates the entire plot, we start with a simple and reduced version (think: a blank canvas) and then use the tools provided by R (think: brushes, color, shapes, etc.) to improve our visualization. If we include the preparatory steps that precede plotting, the process of creating a visualization is structured like any other computer program:

Define data objects and other elements to be used (e.g., the values of parameters, colors),

Prepare the plotting canvas (dimensions, axes),

Add objects by calling functions that create new objects.

Note that — just as in other programs — the order of operations matters. When creating complex graphs, this is true in both a logical and a visual sense: We generally need to define an object (e.g., a value or color) prior to using it in an evaluated expression or function. But when creating a visualization and wanting to show some object A in front of another object B, we need to draw B before drawing A (so that A is plotted on top of B).

#### Starting a new plot

We have seen above that the `plot()`

function can create a scatterplot, or other types of plot depending on the type of object being plotted. However, `plot()`

can also be used to merely create a blank plot (i.e., start a new screen device, e.g., in the **Plots** window of RStudio) to which we then add the elements of our graph.

To understand this process, evaluate and compare the result of the following commands:

```
plot(x = 0)
plot(x = 0, type = "n")
```

We see that both commands create a new plot, but `plot(x = 0)`

creates a scatterplot of a single point (with coordinates of \((0, 0)\)) whereas adding the argument `type = "n"`

instructed R to merely define a new plotting canvas. However, the new plot still used `x`

to choose ranges for both axes.
If we want different ranges, we either can change our data inputs (to `x`

and/or `y`

) or explicitly control the ranges of axes:

```
# (a) use data to define ranges:
<- -10:10
xs <- xs^2
ys
plot(x = xs, y = ys, type = "n")
# (b) define ranges without data:
plot(x = 0, xlim = c(-10, 10), ylim = c(0, 100), type = "n")
```

Note that both plots just created use the same ranges, but differ in the labels used on the x- and y-axes.
If we wanted to control these manually, we could have used the `xlab`

and `ylab`

arguments of `plot()`

.

#### Annotating a plot

Once we have created a new screen device and defined the dimensions of our plotting region, we can use so-called *annotation functions* to add to an existing plot. Some key annotation functions include:

`points()`

adds points (or circles) to a plot`abline()`

,`lines()`

, and`segments()`

add various types of lines to a plot (given their`x`

and`y`

coordinates, or defining other properties, like a linear function’s intercept and slope)`rect()`

adds rectangles or boxes (given the coordinates of their corners)`text()`

adds text labels to a plot (to specific`x`

and`y`

coordinates)

Other functions allow adding or editing elements on the plot background, axes, and margins:

`axis()`

adds axis labels and ticks`grid()`

adds orientation lines`title()`

adds labels to the axes, title, and subtitle, and outer margin`mtext()`

: add arbitrary text labels to the plot’s (inner or outer) margins

Overall, these and other functions allow the composition of arbitrarily complex visualizations. As an example, the following code illustrates the process of first defining a plotting region and then using functions for adding various points, lines, shapes, and text elements. To really understand this process, it makes sense to evaluate the chunk in a step-by-step fashion (i.e., one command at a time):

```
# (0) Define some colors: -----
library(unikn) # to get color palettes
<- usecol(pal_unikn_light)
my_colors # seecol(my_colors) # view color palette
# (1) Create canvas (but specify dimensions and some labels): -----
plot(x = 0, xlim = c(-10, 10), ylim = c(0, 100),
type = "n",
main = "Plotting various shapes and text labels",
xlab = "X value", ylab = "Y value")
# (2) Add lines: -----
abline(h = 50, lty = 2)
abline(v = 0, lty = 2)
abline(a = 10, b = 10)
# (3) Add points (as circles): -----
points(x = c(-5, 0, 5), y = c(40, 50, 60),
pch = 21, cex = 20, lwd = 4,
col = my_colors[c(1, 3, 5)],
bg = usecol("grey", alpha = 1/2))
# (4) Add text labels: -----
text(x = 0, y = 50, labels = "R", cex = 4)
text(x = -5, y = 40, labels = "me", cex = 2)
text(x = 5, y = 60, labels = "stuff", cex = 2)
# (5) Add other shapes: -----
rect(xleft = c(-1, -9), xright = c(7, 1),
ybottom = c(0, 80), ytop = c(20, 95),
col = usecol(my_colors, alpha = 1),
border = my_colors[c(7, 5)], lwd = 4)
# (6) More text labels: -----
text(x = -4, y = 87, labels = "R pour l'art", col = Bordeaux, cex = 1.5)
text(x = 3, y = 10, labels = "WOW!", col = Karpfenblau, cex = 3)
# (7) Add line segments: -----
segments(x0 = c(-4, -4, 5, 9), y0 = c(20, 40, 25, 30),
x1 = c(9, 3, -8, -4), y1 = c(35, 85, 60, 80),
lwd = 5, col = my_colors)
```

Figure 4.3 shows that using transparent colors somewhat tempers the constraints imposed by sequentially plotting objects. For instance, the horizontal, vertical, and diagonal lines created by `abline()`

(in Step 2) are not completely covered by the round shapes (due to setting `alpha = 1/2`

in their background color `bg`

). By contrast, the two rectangles created by `rect()`

(in Step 5) used a setting of `alpha = 1`

in their `col`

definition. As a color’s alpha level ranges from 0 to 1 (with 0 indicating complete transprency and 1 indicating no transparency), using `alpha = 1`

is equivalent to “no transparency” — which is why the rectangles cover the lines created by `abline()`

.

Whereas Figure 4.3 arranges graphical elements without a particular purpose, the same methods can be used to produce highly versatile and useful visualizations.
For instance, the R package **riskyr** package (Neth et al., 2021) provides a range of visualizations that depict the effects of probabilistic binary distinctions based on a population of elements (aka. Bayesian reasoning or diagnostic testing).

Here is an example that illustrates the type of problem addressed by such diagrams:

Suppose there is a pandemic (e.g., called COVID-19) that infects

**5%**of some population of people.There is a diagnostic screening test with the following properties:

If someone is infected with COVID-19, the tests detects this accurately with a probability of

**99%**.If someone is not infected with COVID-19, the tests detects this accurately with a probability of

**95%**.

Someone receives a positive test result. How likely is it that this individual is infected?

In diagnostic terms, the problem specifies the *prevalence* of some condition (here: 5% of some population are infected with COVID-19) and the *sensitivity* (99%) and *specificity* (95%) of some diagnostic test.
The problem then asks for the conditional probability of being infected, given a positive test result (which is known as the test’s *positive predictive value*, PPV).

Research has shown that this problem is notoriously difficult, even for medical experts (see REFs for an analysis and meta-review). As the correct answer is the inverse conditional probability of the sensitivity provided (and we are also provided with a base rate/prevalence and the specificity), the answer could be computed using Bayes’ theorem.

**ToDo:** Bayes theorem: Show formula and result: PPV = 51%

Note: Given these values, the predictive value of a positive test result (PPV) is about 51%.
Thus, throwing a coin is about as accurate as a single result of this test.
(Importantly, this does *not* mean that the test is bad or useless. Things change dramatically if the prevalence of the condition in your environment (which is typically only a sub-part of the population) increased. For instance, if the prevalence of Covid in your environment was 20%, the same test would have a PPV of 83.2%. Alternatively, performing multiple tests with positive results would quickly raise the value of the test results.)

However, there are many other ways of computing the correct answer — and most of them involve a representational change in perspective (see Neth et al., 2021, for an analysis).
Our goal here is *not* to solve the problem, but merely to illustrate that the problem *could* be solved by drawing frequency diagrams — and that we could use **base** R functions for creating relatively complex diagrams out of lines, shapes, and text labels.
For many problems, we do not even need to do this ourselves, as the chances are quite high that someone else has already written an R package that does the job for us. In the current case, the **riskyr** package illustrates the relationship between a diagnostic test’s accuracy and predictive values in a variety of ways.
(Note that you do *not* need to install the package for yourself, unless you wanted to re-create the plots.)

Figure 4.4 illustrates the relation between the test’s ability for detecting infected vs. healthy people and its predictive values in a diagram that combines two trees (one top-down, and an inverted one bottom-up). To provide concrete frequencies, both trees assume a fixed population of 10,000 individuals. The top one first dissects the population by condition (`cd`

, i.e., 5% of the people are infected vs. 95% are not infected), then by the specificity and sensitivity of the test. The lower and inverted tree re-combines the four middle cells (in which green cells indicate correct classifications, whereas red ones indicate erroneous classifications) by test result or decision (`dc`

, i.e., positive vs. negative tests). This reveals that the desired probability of being infected, given a positive test result is only \(PPV=51\)% — much lower than most people would expect.

Importantly, we could create visualizations like Figure 4.4 from scratch by plotting boxes, lines, and labels. As this would be a lengthy and laborious process, R packages like **riskyr** facilitate creating various types of plots (see riskyr.org for interactive versions). But it is good to know that we can always create things ourselves, if we had to or wanted to do so.

### 4.3.3 Useful combinations

In practice, most R plots require not just one command, but multiple functions. We typically start out with a simple plot and then add or change elements to improve it. Here are some combinations that are quite common and worth knowing:

**Example 1:** A scatterplot with a grouping variable (and legend):

The `iris`

data contained in `datasets`

of R contains four types of measurements of three `Species`

of flowers. Suppose we wanted to plot the relationship between `Sepal.Length`

and `Petal.Length`

as a scatterplot.
The following plot would reveal a positive correlation between both measurements:

```
plot(x = iris$Sepal.Length, y = iris$Petal.Length, # x and y variables
pch = 16, cex = 2, # aesthetic parameters
xlab = "Sepal Length", ylab = "Petal Length", # axis labels
main = "Flower characteristics in Iris") # title
```

However, the plot is deficient in an important respect:
We lack all information regarding the `Species`

of each point (which happens to be a factor variable of the `iris`

data). An interesting feature of scatterplots is that we can assign its `col`

aesthetic to a factor variable (i.e., internally encoded by a different integer value for each level of a categorical variable).
The R plotting system uses a color vector `palette()`

as its default colors.
Setting `col = iris$Species`

essentially instructs R to use the 1st color of `palette()`

for the 1st species, the 2nd color of `palette()`

for the 2nd species, etc.

```
plot(x = iris$Sepal.Length, y = iris$Petal.Length, # x and y variables
col = iris$Species, # color by species!
pch = 16, cex = 2, # aesthetic parameters
xlab = "Sepal Length", ylab = "Petal Length", # axis labels
main = "Flower characteristics in Iris") # title
# Adding grid:
grid()
# Adding a legend:
legend (x = 4.5, y = 7, legend = levels(iris$Species),
pch = 16, col = c(1:3))
```

Note that we used the `legend()`

function to explicate the mapping of colors to species.
The function’s `x`

- and `y`

-coordinates placed its top-left corner inside the plot, but its two key arguments were:

`legend`

to print the levels of`iris$Species`

(as a character vector);`col`

explicitly set to the first three colors of the default`palette()`

.

Defining another color palette:

```
# (a) Default (base R):
palette() # show current default colors
```

```
## [1] "black" "#DF536B" "#61D04F" "#2297E6" "#28E2E5" "#CD0BBC" "#F5C710"
## [8] "gray62"
```

```
# Set to different values:
palette(rainbow(4)) # use color function
palette(c("steelblue", "gold", "firebrick", "forestgreen")) # use color() names
# (b) Use hcl color palettes:
# hcl.pals()
palette(hcl.colors(3, "Red-Blue"))
palette(hcl.colors(3, "Viridis"))
# (c) color packages:
library(unikn)
palette(usecol(pal_unikn_pref)) # use color function with palette
palette(c(Seeblau, Pinky, Seegruen)) # use color names
palette(usecol(c(Seeblau, Pinky, Seegruen), alpha = 2/3)) # by names, plus transparency
# palette("default") # reset to default color palette
```

Adding a linear regression line:

```
plot(x = iris$Sepal.Length, y = iris$Petal.Length, # x and y variables
col = iris$Species, # color by species!
pch = 16, cex = 2, # aesthetic parameters
xlab = "Sepal Length", ylab = "Petal Length", # axis labels
main = "Flower characteristics in Iris") # title
# Adding grid:
grid()
# Adding a legend:
legend (x = 4.5, y = 7, legend = levels(iris$Species),
pch = 16, col = c(1:3))
# Linear regression:
<- lm(Petal.Length ~ Sepal.Length, data = iris) # carry out linear regression
fit # fit
# summary(fit)
abline(fit, lty = "dashed", col = Karpfenblau, lwd = 2) # add regression line
# Adding text annotation:
text(x = 7, y = 6.7, col = Karpfenblau,
labels = "R^2 = .76\nP < 2.2e-16") # add a label to the plot at (x,y)
```

An alternative (older) example:

```
# Demo: Basic scatterplot (based on some data) with a regression line:
# Create data:
<- ds4psy::what_year(ds4psy::sample_date(from = "1990-01-01", to = "2002-12-31", size = 100),
year as_integer = TRUE)
<- 2 * (year - 1990) + runif(length(year), 0, 100)
value <- data.frame(year, value)
df
# Scatterplot:
plot(df$year, df$value, type = "p",
col = "skyblue", lwd = 2, # aesthetic parameters
xlab = "X label", ylab = "Y label", # axis labels
main = "Line plot with regression line") # plot title
# Regression:
<- lm(value ~ year, data = df) # carry out linear regression
fit fit
```

```
##
## Call:
## lm(formula = value ~ year, data = df)
##
## Coefficients:
## (Intercept) year
## -3867.288 1.971
```

```
abline(fit, lty = "dashed", col = "blue", lwd = 2) # add regression line
# Adding text annotation:
summary(fit)
```

```
##
## Call:
## lm(formula = value ~ year, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.255 -24.572 6.811 21.979 45.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3867.2876 1556.7758 -2.484 0.0147 *
## year 1.9707 0.7797 2.527 0.0131 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.8 on 98 degrees of freedom
## Multiple R-squared: 0.0612, Adjusted R-squared: 0.05162
## F-statistic: 6.388 on 1 and 98 DF, p-value: 0.01309
```

```
text(x = 2000, y = 20,
labels = "R^2 = 0.123\nP = 1.2e-12") # add a label to the plot at (x,y)
```

**Example 2:** Plotting grouped raw data as a box plot:

A scatterplot makes sense when both the x- and the y-dimension can be mapped to continuous variables.
If one variable is continuous, but another one is categorical, the `plot()`

function automatically chooses a box plot as a better alternative (see Figure **??**).
For instance, when using the `mpg`

data to plot the values of the continuous variable `cty`

as a function of the categorical variable `class`

, we get:

```
plot(cty ~ class, data = mpg,
pch = 20, cex = 2,
col = usecol(Seegruen, alpha = 1/4),
main = "Fuel consumption in city by class of car")
```

Alternatively, it often is desirable to also see the individual data points.
A means of doing so is provided by the `jitter(x, factor)`

function, that accepts a numeric vector `x`

and a factor or amount by which the values of the numeric vector are to be jittered (i.e., randomly increased and decreased). The idea behind jittering is that adding random noise can sometimes help to distinguish between elements (e.g., when plotting them). Jittering works best in combination with transparent colors, as this enables us to see the density of overlapping elements.

In our present context, we primarily would need to add noise to our x-variable `class`

, so that its values become distinguishable. Using `jitter(as.numeric(class), 3/4)`

as our x-variable automatically transforms the boxplot into a scatterplot. However, we could also add some noise to our y-variable `cty`

, which reduces the amount of overlap of points. Here are the results of both plots:

```
plot(cty ~ jitter(as.numeric(class), 3/4), data = mpg,
pch = 20, cex = 2,
col = usecol(Bordeaux, alpha = 1/4),
main = "A. cty by class (with horizontal jitter)")
plot(jitter(cty, amount = 3/4) ~ jitter(as.numeric(class), 3/4), data = mpg,
pch = 20, cex = 2,
col = usecol(Bordeaux, alpha = 1/4),
main = "B. cty by class (with horizontal and vertical jitter)")
```

The jittered raw data plots of Figure 4.6 provide some additional information over the boxplot of Figure 4.5 (e.g., they nicely illustrate the group sizes), but also have some deficiencies:

As we converted

`class`

into a numeric variable, their x-axis is numeric, rather than categorical. Can we use the previous class labels on our x-axis?The raw data plots lack the explicit information about means and dispersion of groups that the boxplot provided. Can we combine both plots to provide a more complete picture?

The answer to both questions is yes, of course. R provides several approaches for tackling these challenges, but all involve creating a plot in several steps.

To gain more control about the axes, we could use an initial `plot()`

function to define the basic dimensions and labels of our plot, but not yet include any axes (by setting `axes = FALSE`

). We then add two `axis()`

commands (to explicitly define an x- and y-axis and specify its steps, labels, and the orientation of labels) and a `grid()`

command (to show some orientation lines in the background). Finally, we can use the `points()`

annotation function with the `jitter()`

instructions from above to add our raw data points to the canvas. Overall, the resulting plot provides a better impression of the raw data points:

```
plot(x = 0, type = "n",
xlim = c(0.5, 7), ylim = c(0, 40),
axes = FALSE,
main = "Raw data plot (with random jitter)",
xlab = "Class X", ylab = "Value of Y")
axis(1, at = 1:7, labels = levels(mpg$class), las = 2) # x-axis
axis(2, at = seq(0, 40, by = 5), las = 1) # y-axis
grid() # add grid lines
# plot raw data (as points with jitter):
points(jitter(cty, 3/4) ~ jitter(as.numeric(class), 3/4), data = mpg,
pch = 20, cex = 2,
col = usecol(Bordeaux, alpha = 1/4))
```

To combine the benefits of the more abstract box plot with those of plotting the raw data, we could first draw a boxplot and later add the individual data points to it. Saving the boxplot to an R object also allows us to later use this object to add details to the plot. The following example inspects and uses the object `bp`

to add text labels of group sizes and label two outliers:

```
# boxplot:
<- boxplot(cty ~ class, data = mpg,
bp col = usecol(Seegruen, alpha = 1/4),
las = 2, ylim = c(0, 40),
main = "Boxplot (with jittered raw data and some details)",
xlab = "Class of car", ylab = "MPG in city")
grid() # add grid lines
# plot raw data (as points with jitter):
points(jitter(cty, 3/4) ~ jitter(as.numeric(class), 3/4), data = mpg,
pch = 20, cex = 2,
col = usecol(Bordeaux, alpha = 1/4))
# Add group sizes (as text labels):
text(x = 1:7, y = 0, labels = paste0("[", bp$n, "]"), cex = .85)
## Check some outliers:
# bp$out
# bp$group
# mpg[mpg$cty>30, ]
# Add some text labels:
text(x = 2, y = bp$out[4], labels = "VW New Beetle 1999",
cex = .85, col = Bordeaux, pos = 3)
text(x = 6, y = bp$out[6], labels = "VW Jetta 1999",
cex = .85, col = Bordeaux, pos = 3)
```

**Example 3:** Plotting curves from functions:

Given some diagnostic test’s `sens`

and `spec`

values, what are the test’s positive and negative predictive values (`PPV`

and `NPV`

) as a function of `prev`

?

Without understanding any details, we can use the functions provided by some package to compute the desired values.
Here, we use the **riskyr** package (mentioned above, but not needed to create the visualizations of this section):

```
library(riskyr)
comp_PPV(prev = .05, sens = .99, spec = .95)
```

`## [1] 0.5103093`

`comp_NPV(prev = .05, sens = .99, spec = .95)`

`## [1] 0.9994463`

However, rather than just making predictions for individual prevalence values `prev`

, we want a curve that shows PPV and NPV as a function of `prev`

:

```
# Prepare canvas:
plot(x = 0, type = "n",
xlim = c(0, 1), ylim = c(0, 1),
main = "Predictive values as a function of prevalence",
xlab = "Prevalence", ylab = "Predictive values")
grid()
# Set parameters and compute some points:
<- .05
prev <- comp_PPV(prev = prev, sens = .99, spec = .95)
PPV <- comp_NPV(prev = prev, sens = .99, spec = .95)
NPV
<- "darkorange"
col_PPV <- "steelblue"
col_NPV
# Curves:
curve(expr = comp_PPV(prev = x, sens = .99, spec = .95),
from = 0, to = 1, col = col_PPV, lwd = 2, add = TRUE)
curve(expr = comp_NPV(prev = x, sens = .99, spec = .95),
from = 0, to = 1, col = col_NPV, lwd = 2, add = TRUE)
# Vertical line:
abline(v = prev, lty = 2, lwd = 1, col = "grey50")
# Points:
points(x = prev, y = PPV,
pch = 20, cex = 2, col = col_PPV)
points(x = prev, y = NPV,
pch = 20, cex = 2, col = col_NPV)
# Text labels:
text(x = prev, y = PPV, label = "PPV",
cex = 1, pos = 4, col = col_PPV)
text(x = prev, y = NPV, label = "NPV",
cex = 1, pos = 1, col = col_NPV)
```

Note that creating such plots implies many organizational and structural aspects. For instance, we first define some constants so that they can be used in various commands later. Similarly, the order of functions matters, as objects are plotted on top of earlier ones.

Overall, designing such plots implies writing a small script or computer program. Each step either defines some feature or computes an object by applying a function. When the steps are arranged and evaluated in the right order, they create a graphical representation.

+++ here now +++

#### Checking and setting graphical parameters (`par()`

)

The appearance of any graphical device in R is governed by many graphical parameters.
The `par()`

function allows checking and setting these parameters.
It accepts a large number of parameters as tagged values (i.e., in a form `tag = value`

, where `tag`

is the name of a graphical parameter and `value`

On `par()`

:

See https://bookdown.org/rdpeng/exdata/the-base-plotting-system-1.html#some-important-base-graphics-parameters

`opar <- par()`

stores original (default) par settings (in an R object`opar`

, short for “original par”)

Note: As some parameters can only be read, but not set, we typically exclude “read-only” parameters by storing `opar <- par(no.readonly = TRUE)`

.

Setting

`par`

:`par(mar = c(0, 0, 0, 0))`

to remove margins`par(opar)`

restores the original (default) par settings

See `?par`

for the list of parameters and options.

#### Arranging multiple plots

- Setting
`par(mfrow = c(1, 2))`

#### Practice

- Which graphical parameters can only be read, but not set?

**Hint:** Study the documentation of `par()`

, or simply try storing and re-setting `all_par <- par(); par(all_par)`

.

### References

*riskyr: Rendering risk literacy more transparent*. https://riskyr.org/, https://CRAN.R-project.org/package=riskyr

However, remembering our notion of

*ecological rationality*yields at least two other factors that matter for designing a good visualization: What is the message to be conveyed by the plot, and who will be viewing it?↩︎