```
library(openair) # load the package
data2003 <- selectByDate(mydata, year = 2003)
scatterPlot(data2003, x = "nox", y = "no2")
```

# 22 Scatter plots

## 22.1 Purpose

Scatter plots are a very commonly used technique for considering how variables relate to one another. R has many capabilities for plotting data in this way. However, it can be tricky to add linear relationships or to split scatter plots by levels of other variables. The purpose of the `scatterPlot` function is to make it straightforward to consider how variables are related to one another in a way consistent with other openair functions. We have added several capabilities that can be used just by setting different options, some of which are shown below.

There is less need for this function now that ggplot2 is available, but it still has some benefits for interactive use.

A smooth fit is automatically added to help reveal the underlying relationship between two variables together with the estimated 95% confidence intervals of the fit. This is in general an extremely useful thing to do because it helps to show the (possibly) non-linear relationship between variables in a very robust way — or indeed whether the relationship is linear.

It is easy to add a linear regression line. The resulting equation is shown on the plot together with the R\(^2\) value.

For large data sets there is the possibility to ‘bin’ the data using hexagonal binning or kernel density estimates. This approach is very useful when there is considerable over-plotting.

It is easy to show how two variables are related to one another dependent on levels of a third variable. This capability is very useful for exploring how different variables depend on one another and can help reveal the underlying important relationships.

A plot of two variables can be colour-coded by a continuous colour scale of a third variable.

It can handle date/time x-axis formats to provide an alternative way of showing time series, which again can be colour-coded by a third variable.

The `scatterPlot` function is not specific to atmospheric sciences in the way that other openair plots are. It is more a function for convenience, written in a style that is consistent with other openair functions. Nevertheless, along with the `timePlot` function, it forms an important part of openair because of the usefulness of understanding how variables relate to one another. Furthermore, there are many options to make it easy to explore data in an interactive way without worrying about processing data or formatting plots.

## 22.2 Examples

We provide a few examples of use and, as usual, users are directed towards the help pages (type `?scatterPlot`) for more extensive examples.

First we select a subset of data (2003) using the openair `selectByDate` function and plot NO_{x} against NO_{2} (Figure 22.1).

Often with several years of data, points are over-plotted and it can be very difficult to see what the underlying relationship looks like. One very effective method to use in these situations is to ‘bin’ the data and to colour the intervals by the number of occurrences in each bin. There are various ways of doing this, but ‘hexagonal binning’ is particularly effective because of the way hexagons can be placed next to one another.^{1} To use hexagonal binning it will be necessary to install the hexbin package, for example with `install.packages("hexbin")`.

### 22.2.1 Hexagonal binning

Now it should be possible to make the plot by setting the method option to `method = "hexbin"`, as shown in Figure 22.2. The benefit of hexagonal binning is that it works equally well with enormous data sets, e.g. several million records. In this case Figure 22.2 provides a clearer indication of the relationship between NO_{x} and NO_{2} than Figure 22.1 because it reveals where most of the points lie, which is not apparent from Figure 22.1. Note that for `method = "hexbin"` it can be useful to transform the scale if it is dominated by a few very high values. This is possible by supplying two functions: one that applies the transformation and one that inverts it. For log scaling (the default), for example, use `trans = function(x) log(x)` and `inv = function(x) exp(x)`. For a square root transform use `trans = sqrt` and `inv = function(x) x^2`. To apply no transformation, use `trans = NULL` and `inv = NULL`.
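The requirement that `inv` undoes `trans` can be checked in base R before the pair is passed to `scatterPlot`. The sketch below is only an illustration: `counts` is a made-up vector of bin counts rather than openair output, and `round_trips` is a hypothetical helper name introduced here.

```
# Made-up bin counts dominated by a few very high values (illustrative only)
counts <- c(1, 3, 10, 250, 4000)

# Hypothetical helper: TRUE if inv(trans(x)) recovers x
round_trips <- function(trans, inv, x) {
  isTRUE(all.equal(inv(trans(x)), x))
}

# Log scaling
log_ok <- round_trips(function(x) log(x), function(x) exp(x), counts)

# Square root scaling
sqrt_ok <- round_trips(sqrt, function(x) x^2, counts)

# A mismatched pair: the inverse does not undo the transformation
bad_ok <- round_trips(sqrt, function(x) exp(x), counts)

print(c(log = log_ok, sqrt = sqrt_ok, mismatched = bad_ok))
```

Either of the two valid pairs can then be supplied as `trans` and `inv` in a `scatterPlot` call with `method = "hexbin"`.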

`scatterPlot(data2003, x = "nox", y = "no2", method = "hexbin", cols = "jet")`

Note that when `method = "hexbin"` there are various options that are useful, e.g. a border around each bin and the number of bins. For example, to place a grey border around each bin and set the bin size try:

```
scatterPlot(mydata, x = "nox", y = "no2",
method = "hexbin", cols = "jet",
border = "grey", xbin = 15)
```

Hexagonal binning and other binning methods are useful, but the choice of bin size is often somewhat arbitrary. Another useful approach is to use a kernel density estimate to show where most points lie. This is possible in `scatterPlot` with the `method = "density"` option. Such a plot is shown in Figure 22.3.

```
scatterPlot(selectByDate(mydata, year = 2003),
x = "nox", y = "no2",
method = "density",
cols = "jet")
```
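To give a feel for what a two-dimensional kernel density estimate returns, the sketch below computes one directly with `MASS::kde2d` on synthetic data. This is an assumption-laden illustration: the data are simulated, and it is not claimed that `scatterPlot` computes its density surface in exactly this way.

```
library(MASS) # recommended package shipped with standard R installations

set.seed(1)
# Synthetic, correlated data standing in for two pollutants (illustrative only)
x <- rgamma(2000, shape = 2, scale = 50)
y <- 0.3 * x + rnorm(2000, sd = 10)

# Two-dimensional kernel density estimate on a 50 x 50 grid
dens <- kde2d(x, y, n = 50)

# dens$z[i, j] is the estimated density at (dens$x[i], dens$y[j])
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
print(c(x = dens$x[peak[1, 1]], y = dens$y[peak[1, 2]]))

# Draw the surface; colour indicates the estimated density
image(dens, xlab = "x", ylab = "y")
```

Unlike a binned count, the surface is smooth, and its appearance depends on the kernel bandwidth rather than on a bin size.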

Sometimes it is useful to consider how the relationship between two variables varies by levels of a third. In openair this approach is possible by setting the option `type`. When `type` is a numeric variable, four plots are produced for different quantiles of that variable. We illustrate this point by considering how the relationship between NO_{x} and NO_{2} varies with different levels of O_{3}. We also take the opportunity to not plot the smooth line, but plot a linear fit instead and force the layout to be a 2 by 2 grid.

```
scatterPlot(data2003, x = "nox", y = "no2",
type = "o3", smooth = FALSE,
linear = TRUE, layout = c(2, 2))
```
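The quantile split applied when `type` is numeric can be imitated in base R. The sketch below assumes the four panels correspond to four equal-count intervals (quartiles) and uses a simulated `o3` vector rather than the `mydata` record; the interval labels that openair prints on the panels may differ.

```
set.seed(42)
# Simulated ozone-like values (illustrative only)
o3 <- rgamma(1000, shape = 2, scale = 5)

# Quartile boundaries: the 0%, 25%, 50%, 75% and 100% points
breaks <- quantile(o3, probs = seq(0, 1, by = 0.25), na.rm = TRUE)

# Assign each observation to one of four intervals
o3_level <- cut(o3, breaks = breaks, include.lowest = TRUE)

# Each interval holds about a quarter of the data
print(table(o3_level))
```

Because the intervals are defined by quantiles, each panel contains a similar number of points even when the conditioning variable is strongly skewed.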

Below is an extended example that brings together data manipulation, refined plot options and linear fitting of two variables with NO_{x}. The aim is to plot the weekly concentration of NO_{x} against PM_{10} and PM_{2.5} and fit linear equations to both relationships. To do this we need the \(x\) variable as NO_{x} and the \(y\) variable as PM_{10} or PM_{2.5}, which means we also need a column that will act as a grouping column i.e. identifies whether the \(y\) is PM_{10} or PM_{2.5}.

```
# load the packages we need
library(tidyverse)
# select the variables of interest
subdat <- select(mydata, date, nox, pm10, pm25)
# calculate weekly averages
subdat <- timeAverage(subdat, avg.time = "week")
# reshape so we have two variable columns
subdat <- pivot_longer(subdat, cols = c(pm10, pm25),
names_to = "pollutant")
head(subdat)
```

```
# A tibble: 6 × 4
  date                  nox pollutant value
  <dttm>              <dbl> <chr>     <dbl>
1 1997-12-29 00:00:00  128. pm10       21.8
2 1997-12-29 00:00:00  128. pm25      NaN
3 1998-01-05 00:00:00  189. pm10       33.6
4 1998-01-05 00:00:00  189. pm25      NaN
5 1998-01-12 00:00:00  203. pm10       29.1
6 1998-01-12 00:00:00  203. pm25      NaN
```

Now we will plot weekly NO_{x} versus PM_{10} and PM_{2.5} and fit a linear equation to both — and adjust some of the symbols (shown in Figure 22.5).

```
scatterPlot(subdat, x = "nox", y = "value",
group = "pollutant",
pch = 21:22, cex = 1.6,
fill = c("dodgerblue", "tomato"),
col = "white",
linear = TRUE,
xlab = "nox (ppb)",
ylab = "PM concentration (ug/m3)")
```
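The equation and R\(^2\) reported for each group when `linear = TRUE` can be cross-checked with `lm`. The sketch below uses a small synthetic data frame, `demo`, with the same column names as `subdat`; the values are invented, and ordinary least squares is assumed rather than confirmed to be the fitting method `scatterPlot` uses.

```
set.seed(123)
# Synthetic stand-in for subdat (values are invented for illustration)
n <- 100
nox <- runif(n, 50, 300)
demo <- data.frame(
  nox = rep(nox, times = 2),
  pollutant = rep(c("pm10", "pm25"), each = n),
  value = c(10 + 0.12 * nox + rnorm(n, sd = 3),
            5 + 0.08 * nox + rnorm(n, sd = 2))
)

# Fit value = intercept + slope * nox separately for each pollutant
fits <- lapply(split(demo, demo$pollutant),
               function(d) lm(value ~ nox, data = d))

# Collect intercept, slope and R-squared for each group
fit_summary <- t(sapply(fits, function(m) {
  c(intercept = unname(coef(m)[1]),
    slope = unname(coef(m)[2]),
    r_squared = summary(m)$r.squared)
}))
print(round(fit_summary, 3))
```

Splitting on the grouping column mirrors what `group = "pollutant"` does in the plot: one fit per level of the grouping variable.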

To gain a better idea of where the data lie and the linear fits, adding some transparency helps:

```
scatterPlot(subdat, x = "nox", y = "value",
group = "pollutant",
pch = 21:22, cex = 1.6,
fill = c("dodgerblue", "tomato"),
col = "white",
linear = TRUE,
xlab = "nox (ppb)",
ylab = "PM concentration (ug/m3)",
alpha = 0.2)
```

The above example will also work with `type`. For example, to consider how the relationship between NO_{x} and both PM_{10} and PM_{2.5} varies by season:

```
scatterPlot(subdat, x = "nox", y = "value",
group = "pollutant",
pch = 21:22, cex = 2,
fill = c("dodgerblue", "tomato"),
col = "white", linear = TRUE,
xlab = "nox (ppb)",
ylab = "PM concentration (ug/m3)",
type = "season")
```

Finally, we show how to plot a continuous colour scale for a third *numeric* variable by setting the value of `z` to the third variable. Figure 22.6 shows again the relationship between NO_{x} and NO_{2} but this time colour-coded by the concentration of O_{3}. We also take the opportunity to split the data into seasons and weekday/weekend by setting `type = c("season", "weekend")`. There is an enormous amount of information that can be gained from plots such as this. Differences between weekdays and the weekend can highlight changes in emission sources, splitting by seasons can show seasonal influences in meteorology and background O_{3}, and colouring the data by the concentration of O_{3} helps to show how O_{3} concentrations affect NO_{2} concentrations. For example, the summertime-weekday panel clearly shows that the higher NO_{2} concentrations are associated with high O_{3} concentrations. Indeed there are some hours where NO_{2} is >100 ppb at quite low concentrations of NO_{x} (\(\approx\) 200 ppb). It would also be interesting to use O_{3} from a background site instead of the O_{3} concentrations from Marylebone Road.

Figure 22.6 was very easily produced but contains a huge amount of useful information showing the relationship between NO_{x} and NO_{2} dependent upon the concentration of O_{3}, the season and the day of the week. There are of course numerous other plots that are equally easily produced.

```
scatterPlot(data2003,
x = "nox", y = "no2", z = "o3",
type = c("season", "weekend"),
limits = c(0, 30))
```

Figure 22.7 shows that `scatterPlot` can also handle dates on the x-axis; in this case shown for SO_{2} concentrations coloured by wind direction for August 2003.

```
scatterPlot(selectByDate(data2003, month = 8),
x = "date", y = "so2",
z = "wd")
```

Similar to Chapter 11, `scatterPlot` can also plot wind vector arrows if wind speed and wind direction are available in the data frame. Figure 22.8 shows an example of using the `windflow` option. The figure also sets many other options, including showing the concentration of O_{3} as a colour, setting the colour scale used and selecting a few days of interest using the `selectByDate` function. Figure 22.8 shows that when the wind direction changes to northerly, the concentration of NO_{2} decreases and that of O_{3} increases.

```
scatterPlot(selectByDate(mydata, start = "1/6/2001",
end = "5/6/2001"),
x = "date", y = "no2", z = "o3",
cols = "increment",
windflow = list(scale = 0.15),
key.footer = "o3\n (ppb)",
main = NULL, ylab = "no2 (ppb)")
```

^{1} In fact, no regular polygon with more than 6 sides can be used to form a lattice without gaps.