Chapter 3 Frequencies and Basic Graphs

3.1 Get Ready

In this chapter, we explore how to examine the distribution of variables using some basic data tables and graphs. In order to follow along in R, you should load the anes20.rda and states20.rda data files, as shown below. If you have not already downloaded the data sets, see the instructions at the end of the “Accessing R” section of Chapter 2.


load("<FilePath>/anes20.rda")
load("<FilePath>/states20.rda")

If you get errors at this point, check to make sure the files are in your working directory or that you used the correct file path; also make sure you spelled the file names correctly and enclosed the file path and file name in quotes. Note that <FilePath> is the place where the data files are stored. If the files are stored in your working directory, you do not need to include the file path.

#If files are in your working directory, just:
load("anes20.rda")
load("states20.rda")

You should also load the libraries for descr and DescTools, two packages that provide many of the functions we will be using. You may have to install the packages (see Chapter 2) if you have not done so already.

library(descr)
library(DescTools)


3.2 Introduction

Sometimes, very simple statistics or graphs can convey a lot of information and play an important role in the presentation of data analysis findings. Advanced statistical and graphing techniques are normally required to speak with confidence about data-based findings, but it is almost always important to start your analysis with some basic tools, for two reasons. First, these tools can be used to alert you to potential problems with the data. If there are issues with the way the data categories are coded, or perhaps with missing data, those problems are relatively easy to spot with some of the simple methods discussed in this chapter. Second, the distribution of values (how spread out they are, or where they tend to cluster) on key variables can be an important part of the story told by the data, and some of this information can be hard to grasp when using more advanced statistics.

This chapter focuses on using simple frequency tables, bar charts, and histograms to tell the story of how a variable’s values are distributed. Two data sources are used to provide examples in this chapter: a data set composed of selected variables from the 2020 American National Election Study, a large-scale academic survey of public opinion in the months just before and after the 2020 U.S. presidential election (saved as anes20.rda), and a state-level data set containing information on dozens of political, social, and demographic variables from the fifty states (saved as states20.rda). In the anes20 data set, most variable names follow the format “V20####”, while in the states20 data set the variables are given descriptive names that reflect the content of the variables. Codebooks for these variables are included in the appendix to this book.

3.3 Counting Outcomes

Let’s start by examining some data from the anes20 data set. One of the most basic things we can do is count the number of times different outcomes for a variable occur. Usually, this sort of counting is referred to as getting the frequencies for a variable. There are a couple of ways to go about this. Let’s take a look at a variable from anes20 that measures the extent to which people want to see federal government spending on “aid to the poor” increased or decreased (anes20$V201320x).

One of the first things to do is familiarize yourself with this variable’s categories. You can do this with the levels() command:

#Show the labels for the categories of V201320x, from the anes20 data set
levels(anes20$V201320x)
[1] "1. Increased a lot"    "2. Increased a little" "3. Kept the same"     
[4] "4. Decreased a little" "5. Decreasaed a lot"  
#If you get an error message, make sure you have loaded the data set.

Here, you can see the category labels, which are ordered from “Increased a lot” at one end, to “Decreased a lot” at the other. This is useful information, but what we really want to know is how many survey respondents chose each of these categories. We can’t do this without using some function to organize and count the outcomes for us. This is readily apparent when you look at the way the data are organized, as illustrated below using just the first 50 out of over 8000 cases:

#Show the values of V201320x for the first 50 cases
anes20$V201320x[1:50]
 [1] 5. Decreasaed a lot   3. Kept the same      1. Increased a lot   
 [4] 3. Kept the same      3. Kept the same      2. Increased a little
 [7] 1. Increased a lot    3. Kept the same      3. Kept the same     
[10] 2. Increased a little 3. Kept the same      3. Kept the same     
[13] 3. Kept the same      2. Increased a little 3. Kept the same     
[16] 2. Increased a little 3. Kept the same      2. Increased a little
[19] 3. Kept the same      3. Kept the same      3. Kept the same     
[22] 1. Increased a lot    1. Increased a lot    3. Kept the same     
[25] 2. Increased a little 2. Increased a little 3. Kept the same     
[28] 2. Increased a little 1. Increased a lot    2. Increased a little
[31] 1. Increased a lot    5. Decreasaed a lot   2. Increased a little
[34] 2. Increased a little 2. Increased a little 3. Kept the same     
[37] 1. Increased a lot    2. Increased a little 1. Increased a lot   
[40] 3. Kept the same      3. Kept the same      1. Increased a lot   
[43] 4. Decreased a little 3. Kept the same      3. Kept the same     
[46] 4. Decreased a little 2. Increased a little 3. Kept the same     
[49] 3. Kept the same      2. Increased a little
5 Levels: 1. Increased a lot 2. Increased a little ... 5. Decreasaed a lot

In this form, it is difficult to make sense out of these responses. Does one outcome seem like it occurs a lot more than all of the others? Are there some outcomes that hardly ever occur? Do the outcomes generally lean toward the “Increase” or “Decrease” side of the scale? You really can’t tell from the data as they are listed, and these are only the first fifty cases. Having R organize and tabulate the data provides much more meaningful information.

What you need to do is create a table that summarizes the distribution of responses. This table is usually known as a frequency distribution, or a frequency table. The base R package includes a couple of commands that can be used for this purpose. First, you can use table() to get simple frequencies:

#Create a table showing how often each outcome occurs
table(anes20$V201320x)

   1. Increased a lot 2. Increased a little      3. Kept the same 
                 2560                  1617                  3213 
4. Decreased a little   5. Decreasaed a lot 
                  446                   389 

Now we see not just the category labels, but also the number of survey respondents in each category. From this, we can see that there is more support for increasing spending on the poor than for decreasing it, and it is clear that the most common choice is to keep spending the same. Okay, this is useful information and certainly an improvement over just listing all of the data and trying to make sense out of them that way. However, this information could be more useful if we expressed it in relative rather than absolute terms. As useful as the simple raw frequencies are, the drawback is that they are a bit hard to interpret on their own (at least without doing a bit of math in your head). Let’s take the 2560 “Increased a lot” responses as an example. This seems like a lot of support for this position, and we can tell that it is compared to the number of responses in most other categories; but 2560 responses can mean different things from one sample to another, depending on the sample size (2560 out of how many?). Certainly, the magnitude of this number means something different if the total sample is 4000 than if it is 10,000. Since R can be used as an overpowered calculator, we can add up the frequencies from all categories to figure out the total sample size:

#Add up the category frequencies and store them in a new object, "sample_size"
sample_size<-2560+1617+3213+446+389
#Print the value of "sample_size"
sample_size
[1] 8225

So now we know that there were 8225 total valid responses to the question on aid to the poor, and 2560 of them favored increasing spending a lot. Now we can start thinking about whether that seems like a lot of support, relative to the sample size. So what we need to do is express the frequency of the category outcomes relative to the total number of outcomes. These relative frequencies are usually expressed as percentages or proportions.
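Typing the frequencies by hand works, but it is easy to mistype a number. As a minimal sketch (using a tiny made-up factor, `x`, rather than the anes20 data), the same count of valid responses can be obtained by summing a table, because table() leaves out missing values by default:

```r
#A tiny made-up factor with one missing value (for illustration only)
x <- factor(c("Same", "Increase", NA, "Same", "Decrease"))
#Valid responses only; table() drops NAs by default
sum(table(x))
#The same count, computed directly from the data
sum(!is.na(x))
#All cases, including the NA
length(x)
```

The first two commands count only the valid responses, while length() counts every case, valid or not.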

Percentages express the relative occurrence of each value of x. For any given category, this is calculated as the number of observations in the category, divided by the total number of valid observations across all categories, multiplied by 100:

\[\text{Category Percent}=\frac{\text{Total cases in category}}{\text{Total valid cases}}\times 100\]

Or, if you are really itching for something that looks a bit more complicated (but isn’t):

\[\text{Category Percent}=\frac{f_k}{n}\times 100\]

where:

      \(f_k\) = frequency, or number of cases in any given category
      \(n\) = the number of valid cases from all categories

This simple statistic is very important for making relative comparisons. Percent literally means per one-hundred, so regardless of overall sample size, we can look at the category percentages and get a quick, standardized sense of the relative number of outcomes in each category. This is why percentages are also referred to as relative frequencies. In frequency tables, the percentages can range from 0 to 100 and should always sum to 100.

Proportions are calculated pretty much the same way, except without multiplying by 100:

\[\text{Category Proportion}=\frac{\text{Total cases in category}}{\text{Total valid cases}}\]

The main difference here is that proportions range in value from 0 to 1, rather than 0 to 100. It’s pretty straightforward to calculate both the percent and the proportion of respondents who chose “Increased a lot” when asked if they would like to see federal spending to aid the poor increased or decreased:

#Calculate Percent in "Increased a lot" category
(2560/sample_size)*100
[1] 31.12462
#Calculate Proportion in "Increased a lot" category
2560/sample_size
[1] 0.3112462

So we see that about 31% of all responses are in this category. What’s nice about percentages and proportions is that, for all practical purposes, the values have the same meaning from one sample to another. In this case, 31.1% (or .311) means that slightly less than one-third of all responses are in this category, regardless of the sample size. That said, their substantive importance can depend on the number of categories in the variable. In the present case, there are five response categories, so if responses were randomly distributed across categories, you would expect to find about 20% in each category. Knowing this, the outcome of 31% suggests that this is a pretty popular response category. Of course, we can also just look at the percentages for the other response categories to gain a more complete understanding of the relative popularity of the response choices.

Fortunately, we do not have to make this calculation manually for every category. Instead, we can use the prop.table function to get the proportions for all five categories. In order to do this, we need to store the results of the raw frequency table in a new object and then have prop.table use that object to calculate the proportions. Note that I use the extension “.tbl” when naming the new object. This serves as a reminder that this particular object is a table. When you execute commands such as this, you should see the new object appear in the Global Environment window.

#Store the frequency table in a new object called "poorAid.tbl"
poorAid.tbl<-table(anes20$V201320x)
#Create a proportion table using the contents of the frequency table
prop.table(poorAid.tbl)

   1. Increased a lot 2. Increased a little      3. Kept the same 
           0.31124620            0.19659574            0.39063830 
4. Decreased a little   5. Decreasaed a lot 
           0.05422492            0.04729483 

It’s nice to see that the resulting table confirms our calculation for the proportion in the first category (always a relief when R confirms your work!). Here’s how you might think about interpreting the full set of proportions:

There are a couple of key takeaway points from this table. First, there is very little support for decreasing spending on federal programs to aid the poor and a lot of support for increasing spending. Only about 10% of all respondents (combining the two “Decreased” categories) favor cutting spending on these programs, compared to just over 50% (combining the two “Increased” categories) who favor increasing spending. Second, the single most popular response is to leave spending levels as they are (39%). Bottom line, there is not much support in these data for cutting spending on programs for the poor.

There are some things to notice about this interpretation. First, I didn’t get too bogged down in comparing all of the reported proportions. If you are presenting information like this, the audience (e.g., your professor, classmates, boss, or client) is less interested in the minutiae of the table than the general pattern. Second, while focusing on the general patterns, I did provide some specifics. For instance, instead of just saying “There is very little support for cutting spending,” I included specific information about the percent who favored increasing spending, cutting it, or keeping it the same, but without getting too bogged down in details. Finally, you will note that I referred to percentages rather than proportions, even though the table reports proportions. This is really just a personal preference, and in most cases it is okay to do this. Just make sure you are consistent within a given discussion.
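If you prefer to have R do the conversion from proportions to percentages, one option is to multiply the proportion table by 100 and round the result. The sketch below is only an illustration: it rebuilds the frequency table from the counts reported earlier, so it runs without the anes20 file, and the shortened labels and the object names `counts`, `demo.tbl`, and `demo.pct` are my own assumptions rather than anything in the data set.

```r
#Rebuild the frequency table from the counts reported above
#(shortened labels and object names are illustrative assumptions)
counts <- c(IncreaseLot=2560, Increase=1617, Same=3213,
            Decrease=446, DecreaseLot=389)
demo.tbl <- as.table(counts)
#Convert proportions to percentages, rounded to one decimal place
demo.pct <- round(prop.table(demo.tbl)*100, 1)
demo.pct
#Combine categories, as in the interpretation above
sum(demo.pct[c("IncreaseLot", "Increase")])
sum(demo.pct[c("Decrease", "DecreaseLot")])
```

The two sums at the end correspond to the “just over 50%” and “about 10%” figures used in the interpretation.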

Okay, so we have the raw frequencies and the proportions, but note that we have to use two different tables to get this information, and those tables are not exactly “presentation ready.” It seems like a somewhat labor-intensive process to get just this far. Fear not, for there are a couple of alternatives that save steps in the process and still provide all the information you need in a single table. The first alternative is the freq command, which is provided in the descr package, a package that provides several tools for doing descriptive analysis. Here’s what you need to do:

#Provide a frequency table, but not a graph
freq(anes20$V201320x, plot=FALSE)
PRE: SUMMARY: Federal Budget Spending: aid to the poor 
                      Frequency  Percent Valid Percent
1. Increased a lot         2560  30.9179        31.125
2. Increased a little      1617  19.5290        19.660
3. Kept the same           3213  38.8043        39.064
4. Decreased a little       446   5.3865         5.422
5. Decreasaed a lot         389   4.6981         4.729
NA's                         55   0.6643              
Total                      8280 100.0000       100.000
#If you get an error here, make sure the "descr" library is attached

As you can see, we get all of the information provided in the earlier tables, plus some additional information, and the information is somewhat better organized. The first column of data shows the raw frequencies, the second shows the total percentages, and the final column is the valid percentages. The valid percentages match up with the proportions reported earlier, while the “Percent” column reports slightly different percentages based on 8280 responses (the 8225 valid responses and 55 survey respondents who did not provide a valid response). When conducting surveys, some respondents refuse to answer some questions, or may not have an opinion, or might be skipped for some reason. These 55 responses in the table above are considered missing data and are denoted as NA in R. It is important to be aware of the level of missing data and usually a good idea to have a sense of why they are missing. Sometimes, this requires going back to the original codebooks or questionnaires (if using survey data) for more information about the variable. Generally, researchers present the valid percent when reporting results.
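To see where the two percent columns come from, it can help to reproduce them by hand. This is a minimal sketch based only on the counts in the table above; the object names (`counts`, `n_missing`, `total_pct`, `valid_pct`) and the shortened labels are made up for illustration.

```r
#Counts reported in the table above (labels shortened for illustration)
counts <- c(IncreaseLot=2560, Increase=1617, Same=3213,
            Decrease=446, DecreaseLot=389)
n_missing <- 55   #Number of NA responses
#Total percent: divide by all respondents, including the NAs
total_pct <- counts/(sum(counts)+n_missing)*100
#Valid percent: divide by the valid responses only
valid_pct <- counts/sum(counts)*100
#Show the two sets of percentages side by side
round(cbind(total_pct, valid_pct), 2)
```

The only difference between the two columns is the denominator: 8280 for the total percent and 8225 for the valid percent.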

One statistic missing from this table is the cumulative percent, which can be useful for getting a sense of how a variable is distributed. The cumulative percent is the percent of observations in or below (in a numeric or ranking sense) a given category. You calculate the cumulative percent for a given ordered or numeric value by summing the percent with that value and the percent in all lower ranked values. We’ve already discussed this statistic without calling it by name. In the discussion of the proportions table, it was noted that just over 50% favored increasing spending. This is the cumulative percent for the second category (31.1% from the first category plus 19.7% from the second). Of course, it’s easier if you don’t have to do the math in your head on the fly every time you want the cumulative percent. Fortunately, there is an alternative command, Freq, that will give you a frequency table that includes cumulative percentages (note the upper-case F, as R is case-sensitive). This function is in the DescTools package, another package with several tools for descriptive analysis.

#Produce a frequency table that includes cumulative statistics
Freq(anes20$V201320x)
                   level   freq   perc  cumfreq  cumperc
1     1. Increased a lot  2'560  31.1%    2'560    31.1%
2  2. Increased a little  1'617  19.7%    4'177    50.8%
3       3. Kept the same  3'213  39.1%    7'390    89.8%
4  4. Decreased a little    446   5.4%    7'836    95.3%
5    5. Decreasaed a lot    389   4.7%    8'225   100.0%
#If you get an error here, make sure the "DescTools" library is attached

Here, you get the raw frequencies, the valid percent (note that there are no NAs listed), the cumulative frequencies, and the cumulative percent. The key addition is the cumulative percent, which can be a useful piece of information. As pointed out above, you can easily see that just over half the respondents (50.8%) favored an increase in spending on aid to the poor, and almost 90% opposed cutting spending (the percent in favor of increasing spending or keeping spending the same, which is the cumulative percentage in the third category).
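The cumulative columns are just running totals, which base R can compute with cumsum(). The following is a rough sketch using the counts from the table above, with shortened labels and object names (`counts`, `pct`, `cum.tbl`) that are my own for illustration; it assumes the categories are already listed in their natural order.

```r
#Counts from the table above, listed in category order
counts <- c(IncreaseLot=2560, Increase=1617, Same=3213,
            Decrease=446, DecreaseLot=389)
#Valid percent for each category
pct <- counts/sum(counts)*100
#Running totals of the frequencies and the percentages
cum.tbl <- cbind(freq=counts, perc=pct,
                 cumfreq=cumsum(counts), cumperc=cumsum(pct))
round(cum.tbl, 1)
```

Keep in mind that running totals like these are only meaningful when the categories have a natural order, as they do here.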

So what about table, the original function we used to get frequencies? Is it still of any use? You bet it is! In fact, many other functions make use of the information from the table command to create graphics and other statistics, as you will see shortly.

Besides providing information on the distribution of single variables, it can also be useful to compare the distributions of multiple variables if there are sound theoretical reasons for doing so. For instance, in the example used above, the data showed widespread support for federal spending on aid to the poor. It is interesting to ask, though, about how supportive people are when we refer to spending not as “aid to the poor” but as “welfare programs,” which technically are programs to aid the poor. The term “welfare” is viewed by many as a “race-coded” term, one that people associate with programs that primarily benefit racial minorities (mostly African-Americans), which leads to lower levels of support, especially among whites. As it happens, the 2020 ANES asked the identical spending question but substituted “welfare programs” for “aid to the poor.” Let’s see if the difference in labeling makes a difference in outcomes. Of course, we don’t expect to see the exact same percentages because we are using a different survey question, but based on previous research in this area, there is good reason to expect lower levels of support for spending on welfare programs than on “aid to the poor.”

freq(anes20$V201314x, plot=FALSE)
PRE: SUMMARY: Federal Budget Spending: welfare programs 
                      Frequency  Percent Valid Percent
1. Increased a lot         1289  15.5676         15.69
2. Increased a little      1089  13.1522         13.26
3. Kept the same           3522  42.5362         42.88
4. Decreased a little      1008  12.1739         12.27
5. Decreasaed a lot        1305  15.7609         15.89
NA's                         67   0.8092              
Total                      8280 100.0000        100.00

On balance, there is much less support for increasing government spending on programs when framed as “welfare” than as “aid to the poor.” Whereas almost 51% favored increasing spending on aid to the poor, only 29% favored increasing spending on welfare programs; and while only 10% favored decreasing spending on aid to the poor, 28% favored decreasing funding for welfare programs. Clearly, in their heads, respondents see these two policy areas as different, even though the primary purpose of “welfare” programs is to provide aid to the poor.
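The combined percentages in this comparison can be checked with a few lines of R. This is a sketch rather than part of the original analysis: the counts are copied from the two frequency tables above, and `collapse3` is a hypothetical helper function written just for this illustration.

```r
#Valid counts copied from the two frequency tables above
aid <- c(2560, 1617, 3213, 446, 389)
welfare <- c(1289, 1089, 3522, 1008, 1305)
#Hypothetical helper: collapse five categories into three and
#return the valid percent in each
collapse3 <- function(x) {
  grouped <- c(Increase=sum(x[1:2]), Same=x[3], Decrease=sum(x[4:5]))
  round(grouped/sum(x)*100, 1)
}
#Stack the results so the two questions can be compared row by row
comparison <- rbind(AidToPoor=collapse3(aid), Welfare=collapse3(welfare))
comparison
```

Putting the two questions in the same small table makes the gap in the “Increase” and “Decrease” columns easy to spot.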

This single comparison is a nice illustration of how even very simple statistics can reveal substantively interesting patterns in the data.

3.3.1 The Limits of Frequency Tables

As useful and accessible as frequency tables can be, they are not always as straightforward and easy to interpret as those presented above. For many numeric variables, frequency tables might not be very useful. The basic problem is that once you get beyond 7-10 categories, it can be difficult to see the patterns in the frequencies and percentages. Sometimes there is just too much information to sort through effectively. Consider the case of presidential election outcomes in the states in 2020. Here, we will use the states20 data set mentioned earlier in the chapter. The variable of interest is d2pty20, Biden’s percent of the two-party vote in the states.

#Frequency table for Biden's % of two-party vote (d2pty20) in the states
freq(states20$d2pty20, plot=FALSE)
      Frequency Percent
27.52         1       2
30.2          1       2
32.78         1       2
33.06         1       2
34.12         1       2
35.79         1       2
36.57         1       2
36.8          1       2
37.09         1       2
38.17         1       2
39.31         1       2
40.22         1       2
40.54         1       2
41.6          1       2
41.62         1       2
41.8          1       2
42.17         1       2
42.51         1       2
44.07         1       2
44.74         1       2
45.82         1       2
45.92         1       2
47.17         1       2
48.31         1       2
49.32         1       2
50.13         1       2
50.16         1       2
50.32         1       2
50.6          1       2
51.22         1       2
51.41         1       2
53.64         1       2
53.75         1       2
54.67         1       2
55.15         1       2
55.52         1       2
56.94         1       2
58.07         1       2
58.31         1       2
58.67         1       2
59.63         1       2
59.93         1       2
60.17         1       2
60.6          1       2
61.72         1       2
64.91         1       2
65.03         1       2
67.03         1       2
67.12         1       2
68.3          1       2
Total        50     100
#If you get an error, check to be sure the "states20" data set loaded.

The most useful information conveyed here is that vote share ranges from 27.5 to 68.3. Other than that, this frequency table includes too much information to absorb in a meaningful way. There are fifty different values and it is really hard to get a sense of the general pattern in the data. Do the values cluster at the high or low end? In the middle? Are they evenly spread out? In cases like this, it is useful to collapse the data into fewer categories that represent ranges of outcomes. Fortunately, the Freq command does this automatically for numeric variables.

Freq(states20$d2pty20, plot=FALSE)
     level  freq   perc  cumfreq  cumperc
1  [25,30]     1   2.0%        1     2.0%
2  (30,35]     4   8.0%        5    10.0%
3  (35,40]     6  12.0%       11    22.0%
4  (40,45]     9  18.0%       20    40.0%
5  (45,50]     5  10.0%       25    50.0%
6  (50,55]     9  18.0%       34    68.0%
7  (55,60]     8  16.0%       42    84.0%
8  (60,65]     4   8.0%       46    92.0%
9  (65,70]     4   8.0%       50   100.0%

Now we see the frequencies for nine different ranges of outcomes, found in the “level” column. When data are collapsed into ranges like this, the groupings are usually referred to as intervals, classes, or bins, and are labeled with the upper and lower limits of the category. This function uses what are called right closed intervals (indicated by ] on the right side of the interval), so the first bin (also closed on the left, [25,30]) includes all values of Biden’s vote share ranging from 25 to 30, the second bin ((30,35]) includes all values ranging from just more than 30 to 35, and so on. In this instance, binning the data makes a big difference. Now it is much easier to see that there are relatively few states with very high or very low levels of Biden support, and most states are in the middle of the distribution. The cumulative frequency and cumulative percent can provide important insights: then-candidate Biden received 50% or less of the two-party vote in exactly half of the states.
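If the interval notation is new to you, base R’s cut() function uses the same notation and is a handy way to experiment with it. The sketch below uses a handful of values taken from the d2pty20 frequency table; the object names `votes` and `bins` are made up for illustration, and I am not assuming anything about how Freq builds its bins internally.

```r
#A handful of values taken from the d2pty20 frequency table above
votes <- c(27.52, 30.2, 32.78, 49.32, 50.13, 68.3)
#right=TRUE closes each bin on the right; include.lowest=TRUE also
#closes the first bin on the left, giving [25,30] rather than (25,30]
bins <- cut(votes, breaks=seq(25, 70, by=5),
            right=TRUE, include.lowest=TRUE)
table(bins)
#A value exactly on a boundary falls in the lower of the two bins
cut(c(30, 30.01), breaks=seq(25, 70, by=5), include.lowest=TRUE)
```

The last command shows why the closed side matters: a value of exactly 30 lands in the first bin, while 30.01 lands in the second.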

If you think nine grouped categories is still too many for easy interpretation, you can designate fewer groupings by adding the breaks argument:

#Create a frequency table for "d2pty20" with just five groupings
Freq(states20$d2pty20, breaks=5, plot=FALSE)
         level  freq   perc  cumfreq  cumperc
1  [27.5,35.7]     5  10.0%        5    10.0%
2  (35.7,43.8]    13  26.0%       18    36.0%
3    (43.8,52]    13  26.0%       31    62.0%
4    (52,60.1]    11  22.0%       42    84.0%
5  (60.1,68.3]     8  16.0%       50   100.0%

I don’t see this as much of an improvement, in part because the cutoff points do not make as much intuitive sense to me. However, you can also specify exactly which values R should use to create the bins, again using the breaks argument:

#Create a frequency table with five user-specified groupings
Freq(states20$d2pty20, breaks=c(20,30,40,50,60,70), plot=FALSE)
     level  freq   perc  cumfreq  cumperc
1  [20,30]     1   2.0%        1     2.0%
2  (30,40]    10  20.0%       11    22.0%
3  (40,50]    14  28.0%       25    50.0%
4  (50,60]    17  34.0%       42    84.0%
5  (60,70]     8  16.0%       50   100.0%

We’ll explore regrouping data like this in greater detail in the next chapter.

3.4 Graphing Outcomes

As useful as frequency tables are, basic univariate graphs complement this information and are sometimes much easier to interpret. As discussed in Chapter 1, data visualizations help contextualize the results, giving the research consumer an additional perspective on the statistical findings. In the graphs examined here, the information presented is exactly the same as some of the information presented in the frequencies discussed above, albeit in a different format.

3.4.1 Bar Charts

Bar charts are simple graphs that summarize the relative occurrence of outcomes in categorical variables, providing the same information found in frequency tables. The category labels are on the horizontal axis, just below the vertical bars; ticks on the vertical axis denote the number (or percent) of cases; and the height of each bar represents the number (or percent) of cases for each category. It is important to understand that the horizontal axis represents categorical differences, not quantitative distances between categories.

The code listed below is used to generate the bar chart for V201320x, the variable measuring spending preferences on programs for the poor. Note here that the barplot command uses the initial frequency table as input, saved earlier as poorAid.tbl, rather than the name of the variable. This illustrates what I mentioned earlier, that even though the table command does not provide a lot of information, it can be used to help with other R commands. It also reinforces an important point: bar charts are the graphic representation of the raw data from a frequency table.

#Plot the frequencies of anes20$V201320x
barplot(poorAid.tbl)

Sometimes you have to tinker a bit with graphs to get them to look as good as they should. For instance, you might have noticed that not all of the category labels are printed above. This is because the labels themselves are a little bit long and clunky, and with five bars to print, some of them were dropped due to lack of room. We could add an argument to reduce the size of the labels, but that can lead to labels that are too small to read (still, we will look at that argument later). Instead, we can replace the original labels with shorter ones that still represent the meaning of the categories, using the names.arg argument. Make sure to notice the quotation marks and commas in the command. We also need to add axis labels and a main title for the graph, the same as we did in Chapter 2. Adding this information makes it much easier for your target audience to understand what’s being presented. You, as the researcher, are familiar with the data and may be able to understand the graph without this information, but others need a bit more help.

#Same as above but with labels altered for clarity in "names.arg"
barplot(poorAid.tbl,
        names.arg=c("Increase/Lot", "Increase", 
                    "Same", "Decrease","Decrease/Lot"), 
        xlab="Increase or Decrease Spending on Aid to the Poor?",
        ylab="Number of Cases",
        main="Spending Preference")

I think you’ll agree that this looks a lot better than the first graph. By way of interpretation, you don’t need to know the exact values of the frequencies or percentages to tell that there is very little support for decreasing spending and substantial support for keeping spending the same or increasing it. This is made clear simply from the visual impact of the differences in the height of the bars. Images like this often make quick and clear impressions on people.

Now let’s compare this bar chart to one for the question that asked about spending on welfare programs. Since we did not save the contents of the original frequency table for this variable to a new object, we can insert table(anes20$V201314x) into the barplot command:

#Tell R to use the contents of "table(anes20$V201314x)" for graph
barplot(table(anes20$V201314x),
        names.arg=c("Increase/Lot", "Increase", 
                    "Same", "Decrease", "Decrease/Lot"),
        xlab="Increase or Decrease Spending on Welfare?",
        ylab="Number of Cases",
        main="Spending Preference")

As was the case when we compared the frequency tables for these two variables, the biggest difference that jumps out is the lower level of support for increasing spending, and the higher level of support for decreasing welfare spending, compared to preferences for spending on aid to the poor. You can flip back and forth between the two graphs to see the differences, but sometimes it is better to have the graphs side by side, as below.8

#Set output to one row, two columns
par(mfrow=c(1,2))
barplot(poorAid.tbl,
     ylab="Number of Cases", 
#Adjust the y-axis to match the other plot (upper limit assumed here)
     ylim=c(0,3500),
     xlab="Spending Preference",
     main="Aid to the Poor",
#Reduce the size of the labels to 60% of original
     cex.names=.6,
#Use labels for end categories, others are blank
     names.arg=c("Increase/Lot", "", "", "","Decrease/Lot"))
#Use "table(anes20$V201314x)" since a table object was not created
barplot(table(anes20$V201314x),
     xlab="Spending Preference", 
     ylab="Number of Cases",
     cex.names=.6, #Reduce the size of the category labels
     names.arg=c("Increase/Lot", "", "", "","Decrease/Lot"))

#reset to one row and one column
par(mfrow=c(1,1))

This does make the differences between the two questions more apparent. Note that I had to delete a few category labels and reduce the size of the remaining labels to make everything fit in this side-by-side comparison.
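One detail that matters in side-by-side plots is that both panels use the same vertical scale; otherwise, bars of the same height can represent different numbers of cases. Rather than picking a limit by eye, you can let R find the tallest bar across both variables. This is a minimal sketch that uses the counts from the earlier frequency tables in place of the anes20 data; the object names (`aid`, `welfare`, `y_max`) and the plot titles are my own choices for illustration.

```r
#Valid counts copied from the two frequency tables shown earlier
aid <- c(2560, 1617, 3213, 446, 389)
welfare <- c(1289, 1089, 3522, 1008, 1305)
#Use one upper limit for both plots, based on the tallest bar overall
y_max <- max(aid, welfare)
#Set output to one row, two columns
par(mfrow=c(1,2))
barplot(aid, ylim=c(0, y_max),
        main="Aid to the Poor", ylab="Number of Cases")
barplot(welfare, ylim=c(0, y_max),
        main="Welfare", ylab="Number of Cases")
#Reset to one row and one column
par(mfrow=c(1,1))
```

With a shared limit, differences in bar height between the two panels reflect differences in the counts rather than differences in the axes.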

Finally, if you prefer to plot the relative rather than the raw frequencies, you just have to specify that R should use the proportions table as input, and change the y-axis label accordingly:

#Use "prop.table" as input
barplot(prop.table(poorAid.tbl),
        names.arg=c("Increase/Lot", "Increase", 
                    "Same", "Decrease","Decrease/Lot"), 
        xlab="Increase or Decrease Spending on Aid to the Poor?",
        ylab="Proportion of Cases",
        main="Spending Preference")
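To see what prop.table is doing, it can help to run it on a handful of values you can count by hand. This is a small sketch using ten made-up responses (resp is a hypothetical vector, not ANES data):

```r
# Ten made-up responses (not ANES data)
resp <- c("Increase", "Same", "Same", "Decrease", "Increase",
          "Same", "Increase", "Same", "Decrease", "Same")

counts <- table(resp)         # raw frequencies
props  <- prop.table(counts)  # each count divided by the total number of cases
print(counts)
print(props)
sum(props)                    # proportions sum to 1

barplot(props, ylab="Proportion of Cases")
```

The shape of the bar chart is the same whether you plot counts or props; only the scale of the y-axis changes. Multiplying props by 100 would give percentages instead.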

Bar Chart Limitations. Bar charts work really well for most categorical variables because they do not assume any particular quantitative distance between categories on the x-axis, and because categorical variables tend to have relatively few, discrete categories. Numeric data generally do not work well with bar charts, for reasons to be explored in just a bit. That said, there are some instances where this general rule doesn’t hold up. Let’s look at one such exception, using a state policy variable from the states20 data set. Below is a bar chart for abortion_laws, a variable measuring the number of legal restrictions on abortion in the states in 2020.9

barplot(table(states20$abortion_laws),
        xlab="Number of Laws Restricting Abortion Access",
        ylab="Number of States")

Okay, this is actually kind of a nice looking graph. It’s easy to get a sense of how this variable is distributed: most states have several restrictions and the most common outcomes are states with 9 or 10 restrictions. It is also easier to comprehend than if we got the same information in a frequency table (go ahead and get a frequency table to see if you agree). The bar chart works in this instance, because there are relatively few, discrete categories, and the categories are consecutive, with no gaps between values. So far, so good, but in most cases, numeric variables do not share these characteristics, and bar charts don’t work well. This point is illustrated quite nicely in this graph of Joe Biden’s percent of the two-party vote in the states in the 2020 election.

par(las=2) #This tells R to plot the labels vertically
barplot(table(states20$d2pty20),
        xlab="Biden % of Two-party Vote",
        ylab="Number of States",
#This tells R to shrink the labels to 70% of normal size
        cex.names = .7) 

Not to put too fine a point on it, but this is a terrible graph, for many of the same reasons the initial frequency table for this variable was of little value. Other than telling us that the outcomes range from 27.52 to 68.3, there is nothing useful conveyed in this graph. What’s worse, it gives the misleading impression that votes were uniformly distributed between the lowest and highest values. There are a couple of reasons for this. First, no two states had exactly the same outcome, so there are as many distinct outcomes and vertical bars as there are states, leading to a flat distribution. This is likely to be the case with many numeric variables, especially when the outcomes are continuous. Second, the proximity of the bars to each other reflects the rank order of outcomes, not the quantitative distance between categories. For instance, the two lowest values are 27.52 and 30.20, a difference of 2.68, and the third and fourth lowest values are 32.78 and 33.06, a difference of .28. Despite these differences in the quantitative distance between the first and second, and third and fourth outcomes, the spacing between the bars in the bar chart makes it look like the distances are the same. The bar chart is only plotting outcomes by order of the labels, not by the quantitative values of the outcomes. Bottom line, bar charts are great for most categorical data but usually are not the preferred method for graphing numeric data.
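Both problems are easy to reproduce on a small scale. In the sketch below, the first four values are the ones mentioned above and the remaining six are made up for illustration (vote_sim is a hypothetical vector, not the states20 data). It relies on the fact that barplot returns the x-axis position of each bar's midpoint.

```r
# First four values are from the text; the rest are made up for illustration
vote_sim <- c(27.52, 30.20, 32.78, 33.06, 36.80,
              41.10, 45.90, 50.30, 58.40, 68.30)

vote.tbl <- table(vote_sim)
max(vote.tbl)    # every value is distinct, so every bar has a height of 1

# barplot() returns the x-position of each bar's midpoint
mids <- barplot(vote.tbl, las=2, cex.names=.7,
                ylab="Number of States")

diff(mids)       # spacing between bars: identical for every pair
diff(vote_sim)   # actual distance between adjacent values: varies a lot
```

The bars are evenly spaced even though the gaps between adjacent values are not, which is why the chart suggests a uniform distribution that is not in the data.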

Plots with Frequencies. We have been using the barplot command to get bar charts, but these charts can also be created by modifying the freq command. You probably noticed that throughout the discussion of frequencies, I used commands that look like this:

freq(anes20$V201320x, plot=FALSE)

The plot=FALSE part of the command instructs R to not create a bar chart for the variable. If it is dropped, or if it is changed to plot=TRUE, R will produce a bar chart along with the frequency table. You still need to add commands to create labels and main titles, and to make other adjustments, but you can do all of this within the frequency command. Go ahead, give it a try.

So why not just do this from the beginning? Why go through the extra steps? Two reasons, really. First, you don’t need a frequency table every time you produce or modify a bar chart. The truth is, you will probably make several different modifications to a bar chart before you are happy with how it looks, and if you get a frequency table with every iteration of chart building, your screen will get a bit messy. More importantly, however, using the barplot command pushes you a bit more to understand “what’s going on under the hood.” For instance, telling R to use the results of the table command as input for a bar chart helps you understand a bit better what is happening when R creates the graph. If the graph just magically appears when you use the freq command, you are another step removed from the process and don’t even need to think about it. You may recall from Chapter 2 that I discussed how some parts of the data analysis process seem a bit like a black box to students: something goes in, results come out, and we have no idea what’s going on inside the box. Learning about bar charts via the barplot command gives you a little peek inside the black box. However, now that you know this, you can decide for yourself how you want to create bar charts.

3.4.2 Histograms

In Chapter 2, histograms were introduced as a tool to use for assessing a variable’s distribution. It is hard to overstate how useful histograms are for conveying information about the range of outcomes, whether outcomes tend to be concentrated or widely dispersed across that range, and if there is something approaching a “typical” outcome. In short, histograms show us the shape of the data.10

Let’s take another look at Joe Biden’s percent of the two-party vote in the states but this time using a histogram.

#Histogram for Biden's % of the two-party vote in the states
hist(states20$d2pty20,
     xlab="Biden % of Two-party Vote",
     ylab="Number of States",
     main="Histogram of Biden Support in the States")

The width of the bars represents a range of values and the height represents the number of outcomes that fall within that range. At the low end there is just one state in the 25 to 30 range, and at the high end, there are four states in the 65 to 70 range. More importantly, there is no clustering at one end or the other, and the distribution is somewhat bell-shaped but with a dip in the middle. It would be very hard to glean this information from the frequency table and bar chart for this variable presented earlier.

Recall that the bar charts we looked at earlier were the graphic representation of the raw frequencies for each outcome. We can think of histograms in the same way, except that they are the graphic representation of binned frequencies, similar to those produced by the Freq command. In fact, R uses the same rules for grouping observations in histograms as it does for the binned frequencies produced by Freq.
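The link between a histogram and a binned frequency table can be checked directly in base R. The sketch below uses a simulated variable (x_sim is hypothetical, not from states20) and rebuilds the histogram's counts with cut and table rather than with Freq, so it illustrates the idea without depending on any add-on package.

```r
set.seed(42)
x_sim <- rnorm(200, mean=50, sd=10)   # simulated numeric variable

h <- hist(x_sim, plot=FALSE)          # compute the bins without drawing them
h$breaks                              # bin boundaries chosen by R
h$counts                              # number of cases in each bin

# Rebuild the same counts as a binned frequency table.
# right=TRUE makes the intervals right-closed, the default used by hist()
binned <- table(cut(x_sim, breaks=h$breaks,
                    right=TRUE, include.lowest=TRUE))
binned

all(as.vector(binned) == h$counts)    # the two sets of counts agree
```

In other words, the height of each bar in the histogram is just the count from one row of the binned table.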

The histogram for the abortion law variable we looked at earlier (abortion_laws) is presented below.

#Histogram for the number of abortion restrictions in the states
hist(states20$abortion_laws,
     xlab="Number of Laws Restricting Access to Abortions",
     ylab="Number of States",
     main="Histogram of Abortion Laws in the States")

This information looks very similar to that provided in the bar chart, except the values are somewhat less finely grained in the histogram. What the histogram does that is useful is mask the potentially distracting influence of idiosyncratic bumps or dips in the data that might catch the eye in a bar chart and divert attention from the general trend in the data. In this case, the difference between the two modes of presentation is not very stark, but as a general rule, histograms are vastly superior to bar charts when using numeric data.

3.4.3 Density Plots

Histograms are a great tool for visualizing the distribution of numeric variables. One slight drawback, though, is that the chunky nature of the bars can sometimes obscure the continuous shape of the distribution. A density plot helps alleviate this problem by taking information from the variable’s distribution and generating a line that smooths out the bumpiness of the histogram and summarizes the shape of the distribution. The line should be thought of as an estimate of a theoretical distribution based on the underlying patterns in the data. Density plots can be used in conjunction with histograms or independent of histograms.

Adding density plots to histograms is pretty straightforward, using the lines function. This function can be used to add many different types of lines to existing graphs. Here is what this looks like for the histogram of Biden votes in the states.

hist(states20$d2pty20,
     xlab="Biden % of Two-Party Vote",
     main="Histogram of Biden Votes in the States", 
     prob=T) #Use probability densities on the vertical axis

#Superimpose a density plot on the histogram
lines(density(states20$d2pty20), lwd=3) #make the line thick

The smoothed density line reinforces the impression from the histogram that there are relatively few states with extremely low or high values, and the vast majority of states are clustered in the 40-60% range. It is not quite a bell-shaped curve: it is somewhat symmetric, but a bit flatter than a bell-shaped curve. The density values on the vertical axis are difficult to interpret on their own, so it is best to focus on the shape of the distribution and the fact that higher values mean more frequently occurring outcomes.
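One way to make sense of the density scale is to remember that the total area, not the height, is what is fixed: the areas of the histogram bars add up to 1, and so (approximately) does the area under the smoothed line. The sketch below checks this with simulated values (x_sim is hypothetical, not the state data), using a simple trapezoid approximation for the area under the curve.

```r
set.seed(7)
x_sim <- rnorm(50, mean=50, sd=9)   # simulated percentages, not state data

h <- hist(x_sim, plot=FALSE)
# On the density scale, bar height = share of cases / bin width,
# so the bar areas (height * width) add up to 1
bar.area <- sum(h$density * diff(h$breaks))
bar.area

d <- density(x_sim)
# Approximate area under the smoothed line with the trapezoid rule
curve.area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
curve.area                          # close to 1
```

This is why the numbers on the vertical axis get smaller when the variable is spread over a wider range: the same total area is stretched across more of the x-axis.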

You can also view the density plot separately from the histogram, using the plot function:

#Generate a density plot with no histogram
plot(density(states20$d2pty20),
     xlab="Biden % of Two-Party Vote",
     main="Biden Votes in the States")

The main difference here is that the density plot is not limited to the same x-axis limits as in the histogram and the solid line can extend beyond those limits as if there were data points out there. Let’s take a quick look at a density plot for the other numeric variable used in this chapter, abortion laws in the states.

#Density plot for number of abortion laws in the states
plot(density(states20$abortion_laws),
     xlab="Number of Laws Restricting Access to Abortions",
     main="Abortion Laws in the States")

This plot shows that the vast majority of the states have more than five abortion restrictions on the books, and the distribution is sort of bimodal (two primary groupings) at around five and ten restrictions.

3.4.4 A Few Add-ons for Graphing

As you progress through the chapters in this book, you will learn a lot more about how to use graphs to illuminate interesting things about your data. Before moving on to the next chapter, I want to show you a few things you can do to change the appearance of the simple bar charts and histograms we have been working with so far.

  • col=" " is used to designate the color of the bars. Gray is the default color, but you can choose to use some other color if it makes sense to you. You can get a list of all colors available in R by typing colors() at the prompt in the console window.
  • horiz=T is used if you want to flip a bar chart so the bars run horizontally from the vertical axis.
  • breaks= is used in a histogram to change the number of bars (bins) used to display the data. We used this command earlier in the discussion of setting specific bin ranges in frequency tables, but for right now we will just specify a single number that determines how many bars will be used.

The examples below add some of this information to graphs we examined earlier in this chapter.

hist(states20$d2pty20,
     xlab="Biden % of Two-party Vote",
     ylab="Number of States",
     main="Histogram of Biden Support in the States",
     col="white", #Use white to color the bars
     breaks=5) #Use just five categories

As you can see, things turned out well for the histogram with just five bins, though I think five is probably too few and obscures some of the important variation in the variable. If you prefer a color other than white or the default gray, just change col="white" to something else.
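One thing to keep in mind is that when breaks= is given a single number, R treats it as a suggestion and picks round ("pretty") break points, so the number of bars you get may not exactly match the number you asked for. The sketch below, which uses simulated values (x_sim is hypothetical, not the state data), prints the requested and actual number of bins for a few settings.

```r
set.seed(99)
x_sim <- runif(50, min=27, max=69)   # simulated values, not state data

for (k in c(5, 8, 12)) {
  h <- hist(x_sim, breaks=k, plot=FALSE)   # compute bins without drawing
  cat("requested:", k,
      " actual bins:", length(h$counts),
      " bin width:", diff(h$breaks)[1], "\n")
}
```

If you need exact control over the bins, you can instead give breaks= a vector of cut points, as was done earlier when setting specific bin ranges for frequency tables.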

The graph below is a first attempt at flipping the bar chart for attitudes toward spending on aid to the poor, using white horizontal bars.

barplot(table(anes20$V201320x),
        names.arg=c("Increase/Lot", "Increase", "Same", "Decrease","Decrease/Lot"),
        xlab="Number of Cases",
        ylab="Increase or Decrease Spending on Aid to the Poor?",
        main="Spending Preference",
        horiz=T, #Plot bars horizontally
        col="white") #Use white to color the bars

As you can see, the horizontal bar chart for spending preferences turns out to have a familiar problem: the value labels are too large and don’t all print. Frequently, when turning a chart on its side, you need to modify some of the elements a bit, as I’ve done below.11

First, par(las=2) instructs R to print the value labels sideways. Anytime you see par followed by other terms in parentheses, it is likely to be a command to alter the graphic parameters. The labels were still a bit too long to fit, so I increased the margin on the left with par(mar=c(5,8,4,2)). This command sets the margin size for the graph, where the order of numbers is c(bottom, left, top, right). Normally this is set to mar=c(5,4,4,2), so increasing the second number to eight expanded the left margin area and provided enough room for value labels. However, the horizontal category labels overlapped with the y-axis title, so I dropped the axis title and modified the main title to help clarify what the labels represent.

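#Change direction of the value labels
par(las=2)
#Change the left border to make room for the labels
par(mar=c(5,8,4,2))
barplot(table(anes20$V201320x),
        names.arg=c("Increase/Lot", "Increase", "Same", "Decrease","Decrease/Lot"),
        xlab="Number of Cases",
        main="Spending Preference: Aid to the Poor",
        horiz=T, #Plot bars horizontally
        col="white") #Use white to color the bars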

As you can see, changing the orientation of your bar chart could entail changing any number of other characteristics. If you think the horizontal orientation works best for a particular variable, then go for it. Just take a close look when you are done to make sure everything looks the way it should look.

Whenever you change the graphing parameters as we have done here, you need to change them back to what they were originally. Otherwise, those changes will affect all of your subsequent work.

#Return graph settings to their original values
par(las=0) #R's default label orientation
par(mar=c(5,4,4,2))
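If you would rather not keep track of the original values yourself, a common base R idiom is to save whatever par returns when you change a setting and hand it back to par when you are done. The sketch below uses three made-up counts in place of the survey data, and old.par is just a name chosen for this illustration.

```r
# par() returns the previous values of any settings it changes
old.par <- par(las=2, mar=c(5,8,4,2))

barplot(c(Increase=30, Same=50, Decrease=20),   # made-up counts
        horiz=TRUE, col="white",
        xlab="Number of Cases")

par(old.par)   # put las and mar back the way they were
par("las")     # check the restored values
par("mar")
```

This approach restores whatever the settings happened to be before the change, even if they were not R's defaults.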

3.5 Next Steps

Simple frequency tables, bar charts, and histograms can provide a lot of interesting information about how variables are distributed. Sometimes this information can tip you off to potential errors in coding or collecting data, but mostly it is useful for “getting to know” the data. Starting with this type of analysis provides researchers with a level of familiarity and connection to the data that can pay dividends down the road when working with more complex statistics and graphs.

As alluded to earlier, you will learn about a lot of other graphing techniques in subsequent chapters, once you’ve become familiar with the statistics that are used in conjunction with those graphs. Prior to that, though, it is important to spend a bit of time learning more about how to use R to transform variables so that you have the data you need to create the best, most useful graphs and statistics for your research. This task is taken up in the next chapter.

3.6 Exercises

3.6.1 Concepts and Calculations

  1. You might recognize this list of variables from the exercises at the end of Chapter 1. Identify whether a histogram or bar chart would be most appropriate for summarizing the distribution of each variable. Explain your choice.

    • Course letter grade
    • Voter turnout rate (%)
    • Marital status (Married, divorced, single, etc.)
    • Occupation (Professor, cook, mechanic, etc.)
    • Body weight
    • Total number of votes cast in an election
    • Number of years of education
    • Subjective social class (Poor, working class, middle class, etc.)
    • % of people living below poverty level income
    • Racial or ethnic group identification
  2. This histogram shows the distribution of medical doctors per 100,000 population across the states.

    • Assume that you want to describe this distribution to someone who does not have access to the histogram. What do you tell them?
    • Given that the intervals in the histogram are right-closed, what range of values is included in the 250 to 300 interval? How would this be different if the intervals were left-closed?

3.6.2 R Problems

  1. I’ve tried to get a bar chart for anes20$V201119, a variable that measures how happy people are with the way things are going in the U.S., but I keep getting an error message. Can you figure out what I’ve done wrong? Diagnose the problem and present the correct barplot.
        xlab="How Happy with the Way Things Are Going?",
        ylab="Number of Respondents")
  2. Choose the most appropriate type of frequency table (freq or Freq) and graph (bar chart or histogram) to summarize the distribution of values for the three variables listed below. Make sure to look at the codebooks so you know what these variables represent. Present the tables and graphs and provide a brief summary of their contents. Also, explain your choice of tables and graphs, and be sure to include appropriate axis titles with your graphs.

    • Variables: anes20$V202178, anes20$V202384, and states20$union.
  3. Create density plots for the numeric variables listed in problem 2, making sure to create appropriate axis titles. In your opinion, are the density plots or the histograms easier to read and understand? Explain your response.

  8. The code is provided here but might be a bit confusing to new R users. Look at it and get what you can from it. Skip it if it is too confusing.

  9. Note that the data for this variable do not reflect the sweeping changes to abortion laws in the states that took place in 2021 and 2022.

  10. There are other graphing methods that provide this type of information, but they require a bit more knowledge of measures of central tendency and variation, so they will be presented in later chapters.

  11. One option not shown here is to reduce the size of the labels using cex.names. Unfortunately, you have to cut the size in half to get them to fit, rendering them hard to read.