# 3 Bivariate Statistics – Case Study United States Presidential Election

## 3.1 Introduction

Studying the results of the 2020 US presidential election is highly relevant for understanding contemporary American politics and for research on electoral behavior. Analyzing voting patterns reveals the interplay of demographics, ideology, and socio-economic factors that shape political preferences. Such insights not only inform electoral strategies but also deepen our understanding of societal divisions and cleavages. The 2020 election also stands out because of the effort by Donald Trump and co-conspirators to overturn the election result. This caused turmoil and abetted the January 6 United States Capitol attack (e.g., see https://www.britannica.com/event/January-6-U-S-Capitol-attack and https://statesuniteddemocracy.org/resources/doj-charges-trump/).

Fig. 1:

Before the election, a team of researchers from the American National Election Study (ANES) conducted a large representative survey (based on a random sample of the US population) to study voting intentions. We will work with these data in this case study. We are particularly interested in which individuals are more likely to support Donald Trump than Joe Biden. To this end, we will present data in the form of cross tabulations. We will also take a closer look at correlation analysis by investigating the relationship between average education and average Trump support across US states.

## 3.2 Data overview and descriptive analyses

The data are read from a CSV file (comma-separated values). Using the “head” command with the argument 10, the first ten rows of the data are displayed.

library(kableExtra) # provides kable_paper(), scroll_box(), and the %>% pipe
data_us <- read.csv("data/data_election.csv")
knitr::kable(head(data_us, 10), booktabs = TRUE, caption = 'A table of the first 10 rows of the vote data.') %>%
  kable_paper() %>%
  scroll_box(width = "100%", height = "100%")
Table 3.1: A table of the first 10 rows of the vote data.

| vote | trump | education | age |
|---|---|---|---|
| trump | 1 | 3 | 46 |
| other | NA | 2 | 37 |
| biden | 0 | 1 | 40 |
| biden | 0 | 3 | 41 |
| trump | 1 | 4 | 72 |
| biden | 0 | 2 | 71 |
| trump | 1 | 3 | 37 |
| trump | 1 | 1 | 45 |
| refused / don’t know | NA | 3 | 43 |
| biden | 0 | 1 | 37 |

We can see from this that the following variables are included in the dataset:

vote = voting intention in the 2020 United States presidential election: “trump”, “biden”, “other”, “refused / don’t know”.

trump = numerical recode of the vote variable. 1=“trump”, 0=“biden”, NA=“other / refused / don’t know”.

education = respondent’s highest educational qualification: 1=no degree or high school, 2=some college, 3=Associate or Bachelor’s degree, 4=Master’s or postgraduate degree, NA=not specified.

age = age of respondent in years.

Question: What is the scale of measurement (nominal, ordinal, metric) of each variable?

Solution:
vote and trump –> nominal (i.e., information on whether something is the case or not; categories cannot be ranked and have no numerical meaning)
education –> ordinal (i.e., categories (or variable values) can be ranked, information on whether a variable value is “higher”/“more” or “lower”/“less”)
age –> metric (i.e., interval between ranked variable values can be compared; e.g., moving up 2 years from 20 to 22 is equivalent to moving up from 52 to 54)
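
As a minimal sketch of how these scale levels map onto R data types, the values below (copied from the first three rows of the table above) are stored as an unordered factor, an ordered factor, and a numeric vector. The object names are illustrative and not part of the case-study dataset.

```r
# Illustrative values copied from the first three rows of the vote data
vote      <- factor(c("trump", "other", "biden"))              # nominal: unordered categories
education <- factor(c(3, 2, 1), levels = 1:4, ordered = TRUE)  # ordinal: ranked categories
age       <- c(46, 37, 40)                                     # metric: differences are meaningful

is.ordered(vote)             # FALSE: categories cannot be ranked
education[1] > education[2]  # TRUE: level 3 ranks above level 2
diff(range(age))             # 9: distances in years can be computed
```

Storing education as an ordered factor keeps R from treating the codes 1 to 4 as if the distances between them were equal.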

### 3.2.1 Frequency tables

The following table shows the observed frequencies of the values of the variable vote.

dim(data_us) #total number of cases
## [1] 7272    4
table(data_us$vote)
##
##                biden                other refused / don't know
##                 3759                  274                  223
##                trump
##                 3016

We can see that 3,016 respondents chose Trump, whereas 3,759 opted for Biden, a difference of 743 responses. For a variable with only a few categories, looking at absolute numbers of observations might be informative. However, relative frequencies in terms of proportions are typically more informative. Let’s get to it!

prop.table(table(data_us$vote))
##
##                biden                other refused / don't know
##           0.51691419           0.03767877           0.03066557
##                trump
##           0.41474147

In relative terms, Biden supporters outnumber Trump supporters by roughly 10 percentage points (51.7% vs. 41.5%). To be able to compare the figures from the survey data with the official vote result, we need to exclude the category “refused / don’t know” (refusals occurred, e.g., because respondents stated that they were not going to vote). We will also exclude the category “other”, which covers the remaining candidates who could be chosen in the survey and in the election (together these candidates gathered less than 2% of the votes in the election). We will show later on how to recode variables. In the meantime, we can rely on the variable trump, in which the categories “refused / don’t know” and “other” have already been set to NA (not available, i.e., treated as missing values in the data analysis).
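
A minimal sketch of this step, assuming only the category counts reported in the frequency table above: the toy vector vote_toy rebuilds the vote variable from those counts, and trump_toy mimics the existing trump recode. Both names are hypothetical; in the real data the variable trump is already available. Because table() drops NA by default, the resulting shares refer to the two main candidates only.

```r
# Rebuild the vote variable from the frequencies reported in the frequency table
vote_toy <- rep(c("biden", "other", "refused / don't know", "trump"),
                times = c(3759, 274, 223, 3016))

# Mimic the trump recode: 1 = trump, 0 = biden, NA = everything else
trump_toy <- ifelse(vote_toy == "trump", 1,
                    ifelse(vote_toy == "biden", 0, NA))

sum(is.na(trump_toy))                   # 497 respondents set to missing
shares <- prop.table(table(trump_toy))  # table() ignores NA by default
round(100 * shares, 1)                  # 55.5 % Biden (0), 44.5 % Trump (1)
```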

data_us_counted <- subset(data_us_counted, !is.na(data_us_counted$trump_nom)) # removal of missing values

# Shares of voting intentions within each age group
ggplot(data_us_counted, aes(fill=trump_nom, y=n, x=age_cat_nom)) +
  geom_bar(position="fill", stat="identity") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Age groups", y = "Shares", fill = "Voting intention US presidential election") +
  theme_minimal()

# Shares of age groups within each voting intention (axes flipped)
ggplot(data_us_counted, aes(fill=age_cat_nom, y=n, x=trump_nom)) +
  geom_bar(position="fill", stat="identity") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  labs(x = "Voting intention US presidential election", y = "Shares", fill = "Age groups") +
  theme_minimal()

## 3.4 Scatter plot and correlation

Scatter plots and correlation are suitable for showing relationships between two variables. In the following, we use the average approval of Trump across US states in % (perc_trump) as the variable to be explained (a.k.a. the outcome or dependent variable). State-specific variations in voting patterns are illustrated in the following map. As an explanatory variable, we use the proportion of people in the state who have a high level of education (Master’s or postgraduate) in % (perc_higheducation). Both variables originate from individual-level data of the ANES study and have been aggregated to the state level (i.e., the means per region were stored in a new dataset).

Fig. 2: Regional US presidential election results

Next, the data are read and the first ten rows are displayed:

data_states <- read.csv("data/data_states.csv")
knitr::kable(head(data_states, 10), booktabs = TRUE, caption = 'A table of the first 10 rows of the regional vote data.') %>%
  kable_paper() %>%
  scroll_box(width = "100%", height = "100%")

Table 3.3: A table of the first 10 rows of the regional vote data.

| state | perc_trump | perc_higheducation |
|---|---|---|
| South Dakota | 0.7333334 | 0.0588235 |
| Nebraska | 0.5416667 | 0.0816327 |
| South Carolina | 0.5000000 | 0.0840336 |
| North Dakota | 0.7619048 | 0.0869565 |
| Montana | 0.5238095 | 0.0869565 |
| Arkansas | 0.7142857 | 0.1132075 |
| Louisiana | 0.6022728 | 0.1170213 |
| Wisconsin | 0.5031056 | 0.1180124 |
| Alaska | 0.7500000 | 0.1250000 |
| Utah | 0.5526316 | 0.1250000 |

The variable state identifies the region.

### 3.4.1 Scatter plot

To get a first impression of the empirical relationship between both variables, we use a scatter plot. Each point represents one of the 50 US states or Washington D.C. Note that, by convention, the dependent variable is shown on the y-axis, whereas the independent variable is shown on the x-axis.

sc1 <- ggplot(data=data_states, aes(x = perc_higheducation, y = perc_trump)) +
  geom_point() +
  xlab("Share of highly-educated persons in %") +
  ylab("Support of Trump in %") +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(labels = scales::percent)
sc1

Question: What can be inferred from the figure?

Your answer:

Solution:
The pattern indicates a negative relationship between the two variables. If the proportion of highly educated people in a region increases (moving to the right on the x-axis), this is associated with lower support for Trump (lower scores on the y-axis).

### 3.4.2 Correlation

In the next step, we take a graphical approach to how a correlation is determined. For this, we draw a horizontal and a vertical line, each representing the mean of one variable, through the cloud of observations represented by the black dots. We see that the dots are mainly in the upper-left quadrant (= low proportion of high education AND high Trump approval) and the lower-right quadrant (= high proportion of high education AND low Trump approval). This indicates that higher average education tends to be associated with lower average Trump approval. In other words: The correlation is negative.

#Obtaining means for each variable
mean(data_states$perc_trump, na.rm=TRUE)
## [1] 0.470922
mean(data_states$perc_higheducation, na.rm=TRUE)
## [1] 0.1927645

#Scatter plot using percentages as scale units
sc2 <- ggplot(data=data_states, aes(x = perc_higheducation, y = perc_trump)) +
  geom_point() +
  xlab("Share of highly-educated persons in %") +
  ylab("Support of Trump in %") +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(labels = scales::percent) +
  geom_hline(yintercept=0.470922, linetype="dashed", color = "red", size=1) +
  geom_vline(xintercept=0.1927645, linetype="dashed", color = "red", size=1)
sc2

To quantify the relationship, we can calculate the covariance and subsequently the correlation (= standardized covariance fitted into the value range of -1 to +1). The widely used Pearson’s correlation coefficient is given as:

$$r=\frac{Covariance(x,y)}{SD(x)\cdot SD(y)}$$

$$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

The numerator $\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$ is particularly important. It is the sum of cross-products, which equals the covariance up to the factor $1/(n-1)$ (this factor appears in the covariance and in both standard deviations and therefore cancels out in the second formula). The covariance is an unstandardized measure of association between x and y because its value heavily depends on the scales on which x and y are measured. At the same time, it determines the direction of the empirical association. To see this, the classification of observations into quadrants above/below the means is useful again. Observations that are above both means (i.e., upper right corner) have two positive deviations, so their product is positive and contributes to a positive covariance (and thus correlation). The same is true for observations in the lower left quadrant: here both deviations are negative, so their product is again positive. In contrast, observations in the upper left and lower right quadrants have deviations with opposite signs and contribute to a negative covariance (and thus correlation).
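
To make the quadrant logic concrete, here is a minimal sketch that computes the cross-products and Pearson’s r by hand. It uses only the ten states printed in the excerpt of the regional data above, so the result differs from the analysis of all 51 units; the object names (x, y, cross, cov_manual, r_manual) are illustrative.

```r
# Ten states from the printed excerpt of data_states (not the full dataset)
x <- c(0.0588235, 0.0816327, 0.0840336, 0.0869565, 0.0869565,
       0.1132075, 0.1170213, 0.1180124, 0.1250000, 0.1250000)  # perc_higheducation
y <- c(0.7333334, 0.5416667, 0.5000000, 0.7619048, 0.5238095,
       0.7142857, 0.6022728, 0.5031056, 0.7500000, 0.5526316)  # perc_trump

dx <- x - mean(x)   # deviation of each state from the mean of x
dy <- y - mean(y)   # deviation of each state from the mean of y
cross <- dx * dy    # positive in the upper-right/lower-left quadrants, negative otherwise

table(sign(cross))  # number of states with a negative (-1) or positive (+1) product

cov_manual <- sum(cross) / (length(x) - 1)              # sample covariance
r_manual   <- sum(cross) / sqrt(sum(dx^2) * sum(dy^2))  # Pearson's r from the second formula

all.equal(cov_manual, cov(x, y))  # TRUE: matches the built-in covariance
all.equal(r_manual, cor(x, y))    # TRUE: matches the built-in correlation
```
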

If observations are evenly distributed across all quadrants, positive and negative products cancel each other out, which would lead to a zero covariance (and thus correlation). The step from covariance to correlation is that the covariance is standardized by the standard deviations of x and y, which removes the scale units the covariance is measured in and transforms the measure to range between -1 and +1. Here is an overview of possible values of the correlation coefficient:

Fig. 3: Different Correlations. Scatter plots showing correlations (from left to right) of -1, -0.8, -0.4, 0, 0.4, 0.8 and 1.

In the next step, the correlation between the share of highly educated people and the share of Trump supporters is calculated and a line representing their relationship is plotted. The line essentially depicts the bivariate association in the form of a regression slope, which always runs in the direction of the correlation. However, its interpretation differs from correlation.

sc3 <- ggplot(data=data_states, aes(x = perc_higheducation, y = perc_trump)) +
  geom_point() +
  xlab("Share of highly-educated persons in %") +
  ylab("Support of Trump in %") +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(labels = scales::percent) +
  geom_smooth(method = lm, se = FALSE)
sc3

vars <- c("perc_higheducation", "perc_trump")
cor.vars <- data_states[vars]
rcorr(as.matrix(cor.vars)) # rcorr() is provided by the Hmisc package
##                    perc_higheducation perc_trump
## perc_higheducation               1.00      -0.69
## perc_trump                      -0.69       1.00
##
## n= 51
##
##
## P
##                    perc_higheducation perc_trump
## perc_higheducation                     0
## perc_trump          0

In the case of correlation and regression coefficients, the interpretation always begins with x (the independent variable), which is assumed to exert an effect on y (the dependent variable).

• Correlation coefficient: “If x increases, then y increases / decreases (depending on the sign of the correlation coefficient), on average. The magnitude of the correlation coefficient r indicates a weak/moderate/strong empirical association.” (Rule of thumb: +/- 0.1 weak, +/- 0.3 moderate, +/- 0.5 or higher strong association)
• Regression coefficient: “If x increases by one unit, then y increases / decreases (depending on the sign of the regression coefficient) by coefficient units, on average.” The specific coefficient estimate that is referred to is given in the regression output. Whether the empirical relationship is strong, moderate, or weak can be determined by standardizing the regression coefficients. Note that the details of regression analysis and its interpretation are covered in the case study on regression analysis.
• Correlation is a standardized form of covariance, which comes with the advantage of being intuitively interpretable (-1 to +1) without any reference to the underlying scales on which the two variables are measured. However, the price to pay is that it conceals the magnitude by which y changes if x changes by one unit (rather, it represents how densely observations cluster around a linear relationship). Bivariate regression quantifies the expected increase in y if x increases by 1. We thus need to standardize (or “correct”) the covariance between x and y for the scale on which x is measured. This is achieved by dividing the covariance by the variance of x:

$$b=\frac{Covariance(x,y)}{Var(x)}$$

Interpretation of the correlation coefficient from the example: The correlation between the proportion of people with high education and Trump support is negative. If the share of people with high education increases, then Trump support decreases, on average. The correlation coefficient of -0.69 indicates a strong empirical relationship between both variables.

## 3.5 Chi-square and Cramér’s V

Now we return to cross tabs and to the question of whether there is an association between the displayed variables and, if so, how strong it is.
To do so, we once again revisit the individual-level survey data and display the cross tab between education and trump.

tab_xtab(var.row = data_us$trump, var.col = data_us$education, show.col.prc = TRUE, show.obs = TRUE, show.summary = TRUE) #setting show.summary = TRUE displays summary statistics at the bottom of the table; tab_xtab() is provided by the sjPlot package

| trump | education 1 | education 2 | education 3 | education 4 | Total |
|---|---|---|---|---|---|
| 0 | 567 (45.2 %) | 687 (50.1 %) | 1483 (55.4 %) | 974 (70.5 %) | 3711 (55.5 %) |
| 1 | 687 (54.8 %) | 685 (49.9 %) | 1195 (44.6 %) | 407 (29.5 %) | 2974 (44.5 %) |
| Total | 1254 (100 %) | 1372 (100 %) | 2678 (100 %) | 1381 (100 %) | 6685 (100 %) |

χ2=196.388 · df=3 · Cramer’s V=0.171 · p=0.000

The column percentages already make it quite clear that low-educated people were more likely to support Trump than high-educated people (who were more in favor of Biden). To quantify the empirical association, we can calculate Cramér’s V. This measure quantifies empirical relationships if a nominal variable is involved (here: trump). It is based on a chi-square test for independence. The chi-square value indicates how strongly the observed frequencies deviate from the theoretical distribution expected under independence, in which observations are spread across the cells of the cross tab in proportion to the marginal distributions. The more the observed frequencies deviate from this theoretical distribution, the higher the chi-square value and subsequently Cramér’s V, which is normalized for the table size and ranges between 0 (= no association) and 1 (= perfect association). The formula for the chi-square test for cross tabs is the following:

$$\chi^2= \sum{\frac{(O-E)^2}{E}}$$

$O$ refers to the observed frequencies and $E$ to the expected ones. The expected frequency of a cell is obtained by multiplying its row total $R$ by its column total $C$ (the marginal frequencies labelled “Total”) and dividing by the total number of cases ($E = \frac{R\times C}{n}$). For the first cell in the table above, this would be (1254*3711)/6685=696.12. Thus, 696 observations would be theoretically expected in the first cell, whereas 567 were observed. Quite different.
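
The expected frequencies of all cells can be obtained in one step. The sketch below re-enters the observed counts from the cross tab above (the matrix name observed is illustrative) and applies E = R × C / n to every cell.

```r
# Observed counts from the cross tab (rows: trump = 0/1, columns: education = 1..4)
observed <- matrix(c(567, 687, 1483, 974,
                     687, 685, 1195, 407),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(trump = c("0", "1"),
                                   education = as.character(1:4)))

n <- sum(observed)                                           # 6685 cases
expected <- outer(rowSums(observed), colSums(observed)) / n  # E = R * C / n for every cell

round(expected, 2)             # first cell: 696.12
round(observed - expected, 2)  # deviations O - E; first cell: -129.12
```
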

This is done (usually through statistical software) for each cell and summed up according to the formula. Cramér’s V is a measure based on chi-square and adjusted for different-sized cross tabs. It is given by the formula:

$$V=\sqrt{\frac{\chi^2}{n\times (m-1)}}$$

where $n$ is the number of observations and $m$ is the number of categories (or variable values) of the variable with the fewer categories (here trump with 2).

Regarding interpretation, the following guidelines apply:

• Cramér’s V ranges from 0 to 1, with no negative values possible
• Hence, Cramér’s V provides no information on whether the association is positive or negative; we have to figure this out from the table (e.g., by calculating percentage differences between column values –> see above)
• Rules of thumb regarding the size of the relationship:
  • In a small table (e.g., 2x2): between 0.1 and 0.3 –> weak, more than 0.3 and less than 0.5 –> moderate, more than 0.5 –> strong association
  • In a large table (e.g., 5x5): between 0.05 and 0.15 –> weak, more than 0.15 and less than 0.25 –> moderate, more than 0.25 –> strong association

In R, chi-square and Cramér’s V can be calculated with the following commands:

chisq.test(data_us$trump, data_us$education)
##
##  Pearson's Chi-squared test
##
## data:  data_us$trump and data_us$education
## X-squared = 196.39, df = 3, p-value < 2.2e-16
cramerV_tabelle <- table(data_us$trump, data_us$education)
cramerV(cramerV_tabelle)
## Cramer V
##   0.1714
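
As a cross-check, chi-square and Cramér’s V can also be computed directly from the two formulas. This is a sketch based on the observed counts of the cross tab between trump and education; the object names are illustrative, and chisq.test() from base R is used only for comparison.

```r
# Observed counts from the cross tab (rows: trump = 0/1, columns: education = 1..4)
observed <- matrix(c(567, 687, 1483, 974,
                     687, 685, 1195, 407), nrow = 2, byrow = TRUE)

n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n  # E = R * C / n

chi2 <- sum((observed - expected)^2 / expected)  # sum of (O - E)^2 / E over all cells
m    <- min(dim(observed))                       # categories of the smaller variable (here 2)
V    <- sqrt(chi2 / (n * (m - 1)))               # Cramér's V

round(chi2, 2)  # 196.39
round(V, 4)     # 0.1714

# Comparison with the built-in test (the continuity correction only applies to 2x2 tables)
all.equal(chi2, unname(chisq.test(observed)$statistic))  # TRUE
```

Because the smaller variable has only two categories, the denominator reduces to n, so V here is simply the square root of chi-square divided by the number of cases.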

Question: How can Cramér’s V be interpreted in this case?

Solution:

Interpretation: There is an association between the variables education and trump. Given the rule of thumb for small tables, the association of 0.17 is weak. In our case, we know from the interpretation above that the association is negative (the higher the education, the lower the approval of Trump, on average).

Question: Why didn’t we calculate Pearson’s correlation coefficient instead?