5 Displaying Data

Last chapter we ran into the problem of having too much data. We laid out 420 enrollment figures for California schools, and 420 data points are hard to digest (and there are many data sets with millions of rows). Descriptive statistics offered ways to summarize that data and provide some order to the mess.

Another effective way of summarizing large quantities of data is to graph it. Graphing data is also a great way to look at multiple variables simultaneously and examine relationships between them.

Creating a basic plot in R takes only a few lines of code, but it is important to be mindful of how best to display data.

5.1 Data Types

Let’s begin then by taking a half step back. Last week we started working with data, but we can first talk a bit more about different types of data.

There are three types of data we’ll worry about here:

  • Numerical
  • Categorical
  • Ordinal

5.1.1 Numerical Data

Numerical data has numbers. Numerical data is that which can be counted or measured: the number of books you own, hairs on your head, wings on a pterodactyl, ounces of water in a glass, grains of sand on the beach, and on and on. Numerical data further comes in two distinct forms: discrete and continuous.

Numerical - Discrete

Discrete data comes in amounts that can be counted. For instance, there is some number of items in your fridge currently. It may be 0, it may be 5, it could be 37. If you took them all out, you could count them, and there would be a finite number of items to count. Let’s create an object called fridge to save the information on how many items are in my fridge.

Numerical - Continuous

On the other hand, continuous data cannot be counted. Imagine timing how many seconds it takes someone to run a 40 yard dash. It could be 10 seconds, or 10.1, or 10.01, or 10.001, or… you get it. And that’s only starting at 10. There are an infinite number of times someone could possibly run a 40 yard dash. An easy rule is that if the numbers are measured with decimals, then the data is continuous.

There aren’t too many significant differences in what we do with continuous and discrete numerical data. The important thing to remember though is that numeric data has numbers, which will make a big difference in what we can do with it.

5.1.2 Categorical

Categorical data refers to categories or characteristics. Those could be hair color, color of shoes, car type, language spoken, etc. Those variables will generally not be recorded as numbers, but rather using characters or words.

5.1.3 Ordinal

Ordinal data is similar to categorical data, in that it refers to categories. However, it has a certain order. Categorical data like shoe color doesn’t have any particular hierarchy: green, brown, blue - none of those is the highest color of shoes or the lowest. Ordinal data does have an order. Take for instance happiness: how happy do you feel today? Very happy is higher than happy, which is higher than unhappy. Different answers to how happy you are fall into an implicit range that has some order.
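Here is a minimal sketch of one value of each type saved as an R object. The name fridge comes from the text; dash, shoes, and happy are names assumed here for illustration, and the values are chosen to match the output printed next.

```r
# Discrete numerical: a count of items in the fridge
fridge <- 41
# Continuous numerical: a 40 yard dash time in seconds
dash <- 9.73
# Categorical: shoe color
shoes <- "brown"
# Ordinal: a happiness rating
happy <- "Very Happy"

# c() combines the values into one vector; because numbers and text
# are mixed, R stores every element as a character string
c(fridge, dash, shoes, happy)
```

Notice the quotation marks around the numbers in the result: once numbers are stored as text, we can no longer do math with them.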

## [1] "41"         "9.73"       "brown"      "Very Happy"

Okay, so those are the typical forms of data that we’ll encounter. Let’s take a look at some real data though, and review each type. Also, we’ll go over how to convert data between types for use.

5.2 Variable Types Review

We’ll use a data set politicalInformation from the package pscl. First we’ll need to load the pscl package using library(pscl), and once we have the data loaded we can review each variable.

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
## [1] "y"             "collegeDegree" "female"        "age"          
## [5] "homeOwn"       "govt"          "length"        "id"

names() gave us all of the variables that we have available. We can use table() to see the different values the data takes within each variable.

##    Very Low  Fairly Low     Average Fairly High   Very High 
##         105         334         586         450         325

y is a rating given to each individual on their political knowledge. These ratings were provided by the surveyors. As shown above, they range from “Very Low” to “Very High”, so they’re ordinal data because of their ranking. Currently they are stored as labels (what R calls a factor) rather than numbers. They could also be written as numbers though, with 5 as the highest level and 1 as the lowest level, by using the as.numeric() command, which converts each label to its position in that order. We’d need to write down what each number means, but we can create a new variable that’s numeric. In addition, that variable would be discrete, since there are only 5 values.
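The conversion can be sketched as follows. So that the code runs without loading the survey, it rebuilds the ratings from the counts printed above; that rebuilding step and the names rating_levels and y_num are assumptions for illustration. With the real data you would start from politicalInformation$y instead.

```r
# Rebuild the ratings from the counts shown above (illustrative stand-in
# for politicalInformation$y)
rating_levels <- c("Very Low", "Fairly Low", "Average", "Fairly High", "Very High")
y <- factor(rep(rating_levels, times = c(105, 334, 586, 450, 325)),
            levels = rating_levels)

# as.numeric() returns each rating's position among the levels (1 to 5)
y_num <- as.numeric(y)

# Cross-tabulate to check that each number lines up with exactly one rating
table(y_num, y)
```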

##     Very Low Fairly Low Average Fairly High Very High
##   1      105          0       0           0         0
##   2        0        334       0           0         0
##   3        0          0     586           0         0
##   4        0          0       0         450         0
##   5        0          0       0           0       325
##   No  Yes 
## 1083  724

collegeDegree measures whether the individual had graduated college or not. There are a lot of ways to measure education. You can ask how many years the person has been in school in total (12 is a typical number for someone who graduated high school), or you could have various levels, such as “not a high school graduate”, “high school graduate”, “college graduate”. Here, they’ve just measured whether the person has graduated college or not, and it is recorded with a Yes or No. That means the data is currently categorical: some people are in the category of college graduates, others aren’t. We can convert this variable to numerical as well using the ifelse() command. With ifelse() we specify what we want to test (does collegeDegree equal Yes?) and say what we want to happen if that’s true (1) and what should happen if not (0). See the example below.

ifelse() is a really common command for creating new variables. It’s worth practicing here and getting comfortable with it.
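A minimal sketch of that ifelse() call. To stay runnable on its own it rebuilds the Yes/No answers from the counts reported above; the name college_num is an assumption. With the real data the test would be politicalInformation$collegeDegree == "Yes".

```r
# Stand-in for politicalInformation$collegeDegree, rebuilt from the counts above
collegeDegree <- rep(c("No", "Yes"), times = c(1083, 724))

# ifelse(test, value if TRUE, value if FALSE)
college_num <- ifelse(collegeDegree == "Yes", 1, 0)

# Check the result: the 1s should match the number of Yes answers
table(college_num)
```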

##    0    1 
## 1083  724
##   No  Yes 
##  790 1017

female is very similar to collegeDegree. It is categorical, with Yes if the respondent is female and No if not. We could convert that to numeric just like above by making females 1 and males 0. Or we could create a new variable for whether the respondent was male. These variables are also considered dichotomous, which means the variables have two values, typically yes/no or 1/0.
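Creating the male variable might look like the sketch below. As before, the answers are rebuilt from the reported counts so the code stands alone, and the name male is an assumption.

```r
# Stand-in for politicalInformation$female, rebuilt from the counts above
female <- rep(c("No", "Yes"), times = c(790, 1017))

# Flip the coding: respondents who are not female get a 1, everyone else a 0
male <- ifelse(female == "No", 1, 0)

table(female)
table(male)
```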

##   No  Yes 
##  790 1017
##    0    1 
## 1017  790
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 
## 12 16 23 23 23 23 26 26 28 28 29 22 34 30 36 33 40 41 45 48 40 44 46 42 48 
## 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 
## 37 41 34 35 38 36 32 39 35 26 31 30 26 36 40 33 25 21 20 17 29 16 24 17 14 
## 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 
## 17 18 17 11 16 22 17 17 14  9 12 13 10  9 15  6  9  5  1  2  6  5  2  1  1 
## 93 95 96 97 
##  2  1  1  1

age is numeric and discrete. Age could be measured in seconds or minutes, which would make it continuous and difficult to count, but here it is a series of integers representing years.

##   No  Yes 
##  602 1205

homeOwn measures whether the individual owns a home, and is categorical and dichotomous.

##   No  Yes 
## 1596  211

govt measures whether the individual works for the government, and is categorical and dichotomous.

## 18.5200004577637               19 22.0300006866455 22.1200008392334 
##                1                1                1                1 
## 23.3999996185303 24.7000007629395 25.5799999237061 28.8799991607666 
##                1                1                1                1 
## 30.2700004577637 32.0999984741211 
##                1                1

length measures the length of each interview. It’s rounded here to the second decimal place, but it is numerical and continuous. There are a lot of values, so I’m only showing the first 10.

##  1  2  3  4  5  6  7  8  9 10 
## 27 11 23 30 33 41 27 23 36 21

id is the id for the interviewer and it is numerical and discrete. It is included so that if there are differences in how the surveyors rated people consistently (maybe interviewer 1 was really nice, maybe interviewer 2 had a bad day and was mean) we can control for their role in the scores.

5.3 Graphing Categorical and Ordinal

Okay, why such a long detour into data types? After all, as I’ve already shown above, we can convert all of the data to numerical using commands like as.numeric() or ifelse(). It’s because the type of data you start with, whether it’s numerical, ordinal, or categorical, will have a significant impact on the correct way to graph it. And graphing data is really important for communicating data to your bosses or the public. People will flip through a report to find the pictures, which is good because graphs are a fantastic way to tell a story. With a good graph, a picture really is worth one thousand words. But it’s also dangerous, because a bad graph can obscure the truth.

To illustrate, let’s start with a bad graph, using the same data we’ve talked about in this chapter. How many of the individuals that were surveyed were rated with each of the 5 categories of political knowledge?

##    Very Low  Fairly Low     Average Fairly High   Very High 
##         105         334         586         450         325

We can graph that result by using the plot() command. We’ll use plot() a lot in this chapter. It’s the simplest way to create a graph in R, and then we can customize every feature around it. First we’ll need to save the output of table() showing how many people got each score; we’ll save it below as yplot.
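A sketch of that first graph. The name yplot is from the text; building it with as.table() from the printed counts is an assumption made so the code runs by itself. With the survey loaded you would write yplot <- table(politicalInformation$y).

```r
# Counts for each rating, stored as a table (stand-in for table(politicalInformation$y))
yplot <- as.table(c("Very Low" = 105, "Fairly Low" = 334, "Average" = 586,
                    "Fairly High" = 450, "Very High" = 325))

# type = "l" connects the five counts with a line
plot(yplot, type = "l")
```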

That looks okay. It’s a line graph, which you may have guessed since there is a very prominent line on it. Let me highlight two aspects of this graph, which are true for other graphs too and which you have probably encountered in other math classes. The line sticking up with the numbers here - that’s the y-axis. The other line with the ratings (but it could be numbers too) - that is the x-axis. We’ll use those terms in the future.

We can see that very few people were rated Very Low, and that the largest number were rated as Average. Doesn’t this perfectly communicate the information? The name of the game in graphing is efficiency - always aim to use the least ink to communicate the most information.

What exactly is the value at the end of that arrow? It looks to be about 430. Are there 430 people that are between Fairly Low and Average? No, the survey didn’t give that as a rating. It’s a small issue, but the line implies there are values that don’t exist. It’s inefficient and doesn’t make sense. So should we just get rid of the line?

No more line, great. But that’s a lot of blank space. It communicates how many people are in each category, but this may be too little ink. This is a scatter plot, because you scatter points across the graph (the names don’t get any more creative, trust me). This is okay, but it isn’t much to look at.

So we’ve started with two graphs that communicate the same information, and even do so mostly honestly. But they also aren’t ideal.

To put the rule simply, for categorical data you shouldn’t use a line or scatter plot. How should you graph it then? I’ve been controlling the type of plot that displays with the argument type=; if we remove that, R will use the default for this type of data.
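The scatter version discussed above and the default version can be sketched like this, again with yplot rebuilt from the printed counts as an illustrative stand-in for the real table.

```r
yplot <- as.table(c("Very Low" = 105, "Fairly Low" = 334, "Average" = 586,
                    "Fairly High" = 450, "Very High" = 325))

# type = "p" draws only a point for each category
plot(yplot, type = "p")

# With no type= argument, R falls back on its default for a table:
# one vertical line per category
plot(yplot)
```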

This is similar to a bar plot, which uses a bar that represents the value of each category. We can make it look a bit better by using a specialized command for bar plots, barplot().
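A minimal barplot() sketch, using the same stand-in for the table of ratings.

```r
yplot <- as.table(c("Very Low" = 105, "Fairly Low" = 334, "Average" = 586,
                    "Fairly High" = 450, "Very High" = 325))

# One bar per rating; the height of each bar is the number of respondents
barplot(yplot)
```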

That looks pretty good. Each person that was rated Very Low is represented in that bar, and you can quickly see which category was largest and their relative size. We’ve added some ink, but this graph is easy to see and read. There’s another option worth considering.

A pie chart can also be an effective way to communicate categorical data. You can see the relative size of each group, and they add up to 100% of the survey. However, it can be a little more difficult to compare here. Without looking above, which is larger, Fairly Low or Very High? And how many rated as Fairly Low? We’ve lost the scale with this chart. We could add that information manually, but pie charts are dangerous for that reason. They’re an inefficient way to communicate information. Let’s take that problem to an extreme with the variable for age, and then that will be the last pie chart in this book.
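For reference, the pie chart of the ratings can be sketched with pie(). This covers only the political knowledge chart, not the age example, and uses the same stand-in table as an assumption.

```r
yplot <- as.table(c("Very Low" = 105, "Fairly Low" = 334, "Average" = 586,
                    "Fairly High" = 450, "Very High" = 325))

# Each slice is sized by that rating's share of all respondents
pie(yplot)

# The shares themselves, which the chart does not print for you
round(100 * prop.table(yplot), 1)
```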

Pie charts can communicate relative values (which slice is largest) and shouldn’t be used for more than 3 categories. But they should be avoided, because there are better ways to communicate data.

Before we leave categorical data, let’s take a moment to go over the features of barplot(). I mentioned that you can customize anything, what is that ‘anything’? You can see details by typing ??barplot into the console. I’m going to use a few features, but figuring out how best to set them for any given graph will just take some practice and thinking about what communicates the information to you.

  • the first item table(politicalInformation$age), specifies the data that we’re graphing
  • width specifies how wide we want the bars
  • col changes the color of the bars, here I’ve chosen steelblue but there are a lot of colors that are options. Check out all the colors that are available here.
  • border changes the color of the borders for the bar
  • ylim changes how long the y-axis should be. R will guess how tall it should be based on the values in your data, but sometimes you’ll want to adjust it manually.
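The options in the list above can be combined in a single call. This sketch applies them to the table of ratings rather than the age table, and the specific values (a white border, a y-axis running to 700) are assumptions chosen for illustration.

```r
yplot <- as.table(c("Very Low" = 105, "Fairly Low" = 334, "Average" = 586,
                    "Fairly High" = 450, "Very High" = 325))

barplot(yplot,               # the data being graphed
        width = 1,           # width of each bar
        col = "steelblue",   # fill color of the bars
        border = "white",    # color of the bar borders
        ylim = c(0, 700))    # lower and upper limits of the y-axis
```

Running colors() in the console lists every color name R recognizes.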

One difference between categorical and ordinal data comes up in how you should arrange it on a plot. Let’s take a look at another plot of the data on political information. I’m going to reorganize the x-axis using the command sort().
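A sketch of the reordered plot, using the same stand-in table. By default sort() arranges the counts from smallest to largest.

```r
yplot <- as.table(c("Very Low" = 105, "Fairly Low" = 334, "Average" = 586,
                    "Fairly High" = 450, "Very High" = 325))

# sort() reorders the categories by their counts, smallest first
barplot(sort(yplot))

# Use decreasing = TRUE to put the largest category first instead
barplot(sort(yplot, decreasing = TRUE))
```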

Now the data is in order, which makes it easier to see the relative ranks of each of the levels of political knowledge. We could probably tell that more people were rated as Fairly Low than Very High before, but now it’s very clear. Do you see any problems with this plot?

It’s ordinal data, so it can confuse people if the values aren’t in their natural order. Seeing Very Low next to Very High can be a bit confusing. We’ve added a bit of clarity, with a bit of confusion. If you’re plotting ordinal data with a bar plot, it’s best to keep it in the original order. It can make more sense though with categorical data.

There’s no natural hierarchy to eye color, so it makes sense to place those in order on a graph. You can put the largest category on the right or left side, that doesn’t make a large difference. But what if you’re not graphing categorical or ordinal data?

5.4 Graphing Numerical Data

We’ve already discussed when not to use scatter plots or line graphs, let’s talk about the appropriate data to graph with those now: hint, it’s numerical data.

Why? Because with line and scatter plots you want the x and y-axis to have more length or values than categorical data normally offers.

Line graphs are particularly good for plotting values that we measure over time. Let’s take a look at the US population from 1790 to 1970.

That graphs the relationship pretty clearly: US population has increased over that time span. Let’s add a few more details, changing the names of the axes (ylab= and xlab=) and making the line a little thicker (lwd=) and red (col=).
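Both versions of the line graph can be sketched with uspop, a data set that ships with R and holds the census population (in millions) for each decade from 1790 to 1970. Using uspop here is an assumption that it matches the data described, and the axis labels are illustrative.

```r
# uspop is a time series, so plot() draws a line graph by default
plot(uspop)

# The same graph with relabeled axes and a thicker red line
plot(uspop,
     xlab = "Year",
     ylab = "US population (millions)",
     lwd = 3,
     col = "red")
```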

Great. But we should note one thing about this graph. Earlier I said not to use a line graph for the data on political information because we didn’t know the values between our different categories. Here, we only have data for each decade when the Census is conducted. So we know what the US population was in 1790 and 1800, but we don’t know what it was in 1795. Yet, we have a line that is estimating the population in that year. Why is that okay?

It isn’t always, but here it is because 1795 did have a value. Half-way between Very Low and Fairly Low didn’t exist, but there was a US population in 1795. We can use the two values from 1790 and 1800 to estimate it, and we might be a little off but we’re probably close. We’re assuming that the population grew at the same rate every year in between those points, which might not be right, but it gets us close to the truth we hope. We could add points for the dates we do have to better clarify this idea with type="b" to specify we want both (the b) points and a line on the graph.
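A sketch of both ideas, again assuming the built-in uspop data stands in for the population series. The call to approx() does by hand what the line does visually: it reads off a straight-line estimate for 1795 from the 1790 and 1800 values. The name est_1795 is an assumption.

```r
# type = "b" draws both the measured points and the connecting line
plot(uspop, type = "b",
     xlab = "Year", ylab = "US population (millions)")

# Straight-line estimate for a year with no census, between 1790 and 1800
est_1795 <- approx(x = as.numeric(time(uspop)),
                   y = as.numeric(uspop),
                   xout = 1795)$y
est_1795
```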

Now the dots identify where we did have a measurement, and the line is showing a guess at where the population was in between. More importantly, the line helps to highlight the trend.

A line graph typically has few observations (we have 19 here, one for each census from 1790 to 1970) and they are all connected in a continuous line (time only moves forward). One great feature of line graphs is that we can add a second line (or more). Generally, you’ll need to make sure that both lines are in the same unit. If we wanted to add a line to the plot above we’d want to make sure that it’s something measured over time, and more importantly that population is on the y-axis.

Let’s take a look at two lines for presidential elections in Louisiana. I can add the second line on top of my plot by using the command lines() to graph it. We’ll use the command subset() to limit the data just to Louisiana.

If you do add a second line to a line graph, it’s important to add a legend to identify which line is which. We’ve used the traditional colors of the Republican and Democratic party so it might be clear, but if you handed this to an alien from Mars they wouldn’t know that. We can add a legend with the legend() command. We need to specify where we want the legend on the plot, identifying where on the x and y axis it should sit, and the labels we want shown.
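The steps above (subset(), plot(), lines(), legend()) can be sketched as follows. The data frame votes and every number in it are made up purely for illustration; they are not real election results, and the column names are assumptions.

```r
# Made-up vote shares for illustration only (NOT real election results)
votes <- data.frame(
  state   = rep(c("Louisiana", "Texas"), each = 4),
  year    = rep(c(1, 2, 3, 4), times = 2),
  demVote = c(60, 50, 45, 40, 55, 50, 45, 45),
  repVote = c(40, 50, 55, 60, 45, 50, 55, 55)
)

# Keep only the rows for Louisiana
la <- subset(votes, state == "Louisiana")

# First line: Democratic share; ylim leaves room for both lines
plot(la$year, la$demVote, type = "l", col = "blue", lwd = 2,
     ylim = c(0, 100), xlab = "Election", ylab = "Percent of vote")

# Second line drawn on top of the existing plot
lines(la$year, la$repVote, col = "red", lwd = 2)

# x and y give the top-left corner of the legend in plot coordinates
legend(x = 1, y = 100,
       legend = c("Democratic", "Republican"),
       col = c("blue", "red"), lwd = 2)
```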

Check this link from the R Graph Gallery if you want more details about creating a legend, or search the help in R.

So a line graph is best for data where the x-axis moves steadily in one direction, such as time. Even if the percentage vote for Democrats and Republicans doesn’t move up or down consistently, time moves forward. On the other hand, a scatter plot is better for data that can move in any direction on both axes.

Let’s take a look at data on the relationship between murder rates and assault rates at the state level.
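A minimal scatter plot sketch, assuming the data is USArrests, a data set built into R that records arrests per 100,000 residents for murder and assault in each of the 50 states. The axis labels are illustrative.

```r
# Each point is one state: assault rate on the x-axis, murder rate on the y-axis
plot(USArrests$Assault, USArrests$Murder,
     xlab = "Assault arrests per 100,000",
     ylab = "Murder arrests per 100,000")
```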

That data is very scattered, but it forms a relationship. Knowing the assault rate of a state would help you to predict the murder rate. That is, states with higher assault rates have higher murder rates, and vice versa. That’s good to know if you’re wondering what a state’s murder rate is, but you only know the assault rate. They aren’t perfectly related, knowing one doesn’t exactly predict the other, but it’s a useful relationship.

Scatter plots shouldn’t just be used if there is a relationship between two variables, but they are a useful way for displaying correlations. That is the topic of the next chapter.

We have one more type of graph to discuss: the histogram. Histograms are very similar to the bar charts we discussed earlier, but more appropriate for numeric data. Bar charts are great when we have a few categories to plot on the x-axis. When there are many different values, as there often are with numeric data, a histogram is what you want. Let’s run an example using the age variable from the politicalInformation data and talk about what we’re seeing.

Age has a lot of different values. A histogram groups them into “bins” and counts the number of observations within each bin. So here we have bins for people roughly between 20-25, 25-30, and onward, and the bar extends upwards to represent how many of the respondents fall within each group. A histogram is a great way to see the shape of your data. From this graph you can quickly see that many of the people surveyed are between 30 and 50.

We have options to customize histograms as well. For instance, we can add more or fewer bins with breaks=. Let’s tell R to create 30 different bins in our data, which means asking it to divide the range of ages into about 30 equally wide intervals. R treats that number as a suggestion and may adjust it to get tidy break points.

We get more bars, and didn’t significantly change the graph. If we reduced the number of bars we might see more of a difference. Let’s check.

Now everyone is bundled into 20 year intervals, but the information still tells roughly the same story with less detail. So how many is the right number of breaks/bins? It depends on your data, but use your eye test and generally trust the default R gives you.
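The three histograms discussed above can be sketched as follows. To run without the survey loaded, the ages are rebuilt from the table of counts printed earlier in the chapter; that rebuilding step, the names ages and counts, and the choice of breaks = 4 for the coarse version are assumptions. With the real data you would call hist(politicalInformation$age).

```r
# Ages that appear in the survey and how many respondents reported each
ages <- c(18:93, 95:97)
counts <- c(12, 16, 23, 23, 23, 23, 26, 26, 28, 28, 29, 22, 34, 30, 36, 33,
            40, 41, 45, 48, 40, 44, 46, 42, 48, 37, 41, 34, 35, 38, 36, 32,
            39, 35, 26, 31, 30, 26, 36, 40, 33, 25, 21, 20, 17, 29, 16, 24,
            17, 14, 17, 18, 17, 11, 16, 22, 17, 17, 14, 9, 12, 13, 10, 9,
            15, 6, 9, 5, 1, 2, 6, 5, 2, 1, 1, 2, 1, 1, 1)

# One entry per respondent
age <- rep(ages, times = counts)

hist(age)               # R chooses the bins
hist(age, breaks = 30)  # ask for roughly 30 bins
hist(age, breaks = 4)   # ask for far fewer, wider bins
```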

Histograms are great for seeing visually if your data is skewed. Here we can see that it isn’t quite normally distributed, and that there is a slight right tail. It’s not a large or long tail, but in addition to seeing how the mean and median compare, a histogram is a good way to detect skew.

5.5 Best practices

I’ve already mentioned that there are a lot of features you can customize in your graphs for R. Much of the difficulty then is deciding exactly what you want to do, and how to make it look best. Let’s finish by talking about best practices. Some of this will resummarize what was mentioned above, but it’s good to end on a review.

We’ll start by looking at a basic scatter plot of the relationship between spending and math scores in California schools.

## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival

Okay, cool. Here is a basic list of the things I customize with my graphs when I produce them for presentations or reports.

  • I re-label my axes. R labels them by default, but it does so in an ugly way by just inserting the name of the data frame and the column name. You can use ylab and xlab to make that more presentable.
  • I add a title to my plot using main=, so that a reader can quickly figure out what is graphed below
  • bty= can change the shape of the box around the plot. The default is a 4-sided box, which seems a little…boxy. I prefer an l shaped box, so I generally use bty="l". You can also get rid of the entire box with bty="n".
  • I change the type of dot that is being plotted if it is a scatter plot. pch=16 makes for filled dots, which look better to me.
  • I don’t normally change the color, unless there is a reason to. Adding bright colors can make the graph look busy, but if I’m highlighting a point or adding a line to summarize the trend, I might add color. Only add color with purpose.
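Put together, the list above might look like the sketch below. It assumes the school data is CASchools from the AER package (which the loading messages above suggest) and that spending and math scores are the columns expenditure and math; the labels and title are illustrative. The code only draws the plot if AER is installed.

```r
# Only run if the AER package is available on this machine
if (requireNamespace("AER", quietly = TRUE)) {
  data("CASchools", package = "AER")

  plot(CASchools$expenditure, CASchools$math,
       xlab = "Spending per student",                          # readable x-axis label
       ylab = "Average math score",                            # readable y-axis label
       main = "School spending and math scores in California", # title
       bty = "l",                                              # L-shaped box
       pch = 16)                                               # filled dots
}
```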

That isn’t amazing, but it looks decent. That is the minimum to make a good looking graph. You can make it better and more informative, but first we walk before we run.

I want to mention that R Graph Gallery is a great resource to see how much one can do graphing in R. It’s a website where people post their graphs and the code in order to share and learn from each other. Have a look at all the incredible and effective ways one can display data. We’ve only scratched the surface in this chapter.

5.6 Video Review