# Preface

**Last updated on: 20 Feb 2024**.

## Origin Story

I have probably used half a dozen or so different textbooks over the years in my *Political Data Analysis* course, always thinking that I had finally found the right one, only to inevitably decide that I needed to find something else. I suppose I was a bit like Goldilocks: most books had many things to recommend them, but none were quite right. Some books had a social science orientation but not much to say about political science; some books were overly technical, while others were not technical enough; some books were essentially programming manuals with little to no research design or statistics instruction, while others were very good on research design or statistics but offered very little on the computing side. In the back of my mind, I figured I would eventually have to put up or shut up: I was going to have to put together my own textbook.

This book started as a collection of lecture notes that I put online for my undergraduates in the wake of the COVID-19 outbreak in March of 2020. I had always posted a rough set of online notes for my classes, but the pandemic pivot meant that the notes had to be much more detailed and thorough than before. This, coupled with my inability to find a textbook that was right for my class, led me to somewhat hastily cobble together my topic notes into something that started to look like a “book” during the summer of 2021. I used this new set of notes for the fall 2021 hybrid section of my undergraduate course, and it worked out pretty well. I then spent the better part of spring 2022 (a sabbatical semester) expanding and revising (and revising, and revising, and revising) the content, as well as trying to master Bookdown, so I could stitch together the multiple R Markdown files to look more like a book than just a collection of lecture notes. I won’t claim to have mastered either Bookdown or Markdown, but I did learn a lot as I used them to put together the first few drafts of this book.

## What this Book is (and isn’t) About

This book has two purposes: Provide students with a comprehensive, hands-on, accessible overview of how to do and interpret quantitative political and social data analysis, while also providing a gentle introduction to using the R programming environment to address those issues. I cover this in greater detail in Chapter 1, but it is worth reinforcing here that this is not a statistics textbook. Statistics are used and discussed, but the focus is on practical aspects of doing data analysis. At the same time, while this book gives students a good initiation to using R in social science research, it is not a comprehensive R manual. Also, although issues of research design and the scientific process are discussed in Chapter 1, and best practices are encouraged throughout, this is not the primary focus of this book. Finally, while the techniques and methods covered here are used across a number of disciplines in the social sciences, many of the examples and illustrations are likely to fit better with courses in political science, sociology, and kindred disciplines. Instructors of courses in disciplines that rely heavily on experimental methods may need to use supplemental materials.

This book relies on base R and a handful of additional packages. By choice, this book does not incorporate or emphasize `tidyverse`. I think there is something to be gained by learning to work with the `tidyverse` tools, but I also think doing so is much easier once you have at least a basic level of familiarity with R. In general, my approach is that you should sit up before you crawl, crawl before you walk, and walk before you run. While `tidyverse` is not emphasized here, astute readers might notice that a couple of `tidyverse` packages are used in a few places: the `Import Dataset` tab in RStudio uses `readxl` to import data sets and, as a result, creates tibbles, and `dplyr` is used to sample data in a couple of different places.

At the end of every semester in my Political Data Analysis course, I tell students that I think it would be appropriate for them to list *Data Analysis and R*, with a proficiency level of Basic, as skills on their resume, if they receive a grade of B+ or higher. This summarizes what students can get from this textbook: a good, solid introduction to political and social data analysis, using R.

## How to Use this Book

The structure of the book is motivated by an approach to education that is especially appropriate for teaching data analysis and related topics: *learn by doing.* In this spirit, virtually every graph and statistical table shown in this book is accompanied by the R code that produced the output. For a few tables and graphs, the code is a bit too complicated for new R users, so it is not included but is available as part of the online resources for this book. Generally, these are the numbered and titled figures and tables.

Students can (and should) “follow along” by running the R code on their own as they work through the chapters. Ideally, they will not encounter problems, and will obtain the same results when running the R code. However, anyone who has used R before, and especially anyone who can remember how things worked when they were first learning R, knows that “ideally” is not how things always work out. This is one of the great things about this approach to teaching data analysis: learning from errors.

Each of the chapters includes a few exercises that can be used for class assignments. First, there are *Concepts and Calculation* exercises. These require students to apply key concepts to problems and examples, calculate and interpret statistics, or interpret R output. Most of these problems lean much more to the “Concepts” than to “Calculations” side, and the calculations tend to be pretty simple. Second, there are *R Problems* that require students to use R commands shown earlier in the chapter to analyze data and interpret the results. In some cases, there is a cumulative aspect to these problems, meaning that students need to use R code they learned in previous chapters. Students who follow along and run the R code as they read the chapters will have a relatively easy time with the R problems. My own preference is to assign both types of problems, but the specific goals and circumstances of individual courses may dictate using more of one type than the other.

## Keys to Student Success

**For instructors**, there are a couple of things that can help put students on the path to success. First, the preparatory chapters are very important to student success, and it is worth taking time to make sure students get through this material successfully. For students who have never had a course on data analysis or used R before (most of my students), this material can be challenging and intimidating. Early on, the process of downloading and installing R and RStudio, and then running the first bits of R code, can be especially frustrating and intimidating. In fact, this is where some students are lost for the semester. In my experience, a little extra attention and a helping hand at this point can make a big difference to student success with the rest of the book. One way to avoid many of the problems students typically encounter in the first couple of weeks of the semester is to use the cloud version of RStudio (Posit.cloud). This is discussed at greater length in Chapter 2.

One of the interesting developments I have observed over the past decade or so is that students are increasingly disconnected from the internal structure of their laptops and personal computers. The concepts of directories, drives, folders, etc., seem to have lost meaning to many students. A classic example of this comes up almost every semester when I give instructions for loading one of the first data files I use in class, `anes20.rda`. Students take my sample text, `load("<filepath>/anes20.rda")`, and try to run it verbatim, as if `<filepath>` is the name of an actual place on their device. I bring this up not to poke fun at students, but to highlight a very real issue that may cause problems early on for some students. Patience and a helping hand are encouraged. Again, these sorts of problems can be avoided if students are using Posit.cloud, especially if instructors are able to pre-load all required packages and data sets. Alternatively, if students download RStudio directly to their own device, it is important to have them create a new folder, perhaps named for the course they are in, and use it as their working directory, where they will store their data sets, script (command) files, and other course-related materials.
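To make the `<filepath>` point concrete, here is a minimal sketch of what replacing the placeholder can look like. The folder name `PolDataAnalysis` and the tiny stand-in data frame are assumptions made only so the example runs on its own; on a student's machine, the folder would be wherever the real `anes20.rda` was saved.

```r
# Hypothetical course folder. A temporary directory stands in for a real
# location such as a "PolDataAnalysis" folder inside Documents.
course_dir <- file.path(tempdir(), "PolDataAnalysis")
dir.create(course_dir, showWarnings = FALSE)

# Stand-in for the real anes20.rda, so this sketch is self-contained.
anes20 <- data.frame(id = 1:3, vote = c("A", "B", "A"))
save(anes20, file = file.path(course_dir, "anes20.rda"))
rm(anes20)

# Option 1: replace <filepath> with the actual folder that holds the file.
data_path <- file.path(course_dir, "anes20.rda")
found <- file.exists(data_path)   # TRUE only if the path points at a real file
load(data_path)

# Option 2: make the course folder the working directory, then load by name.
old_wd <- setwd(course_dir)       # setwd() returns the previous directory
load("anes20.rda")
setwd(old_wd)                     # restore the original working directory
```

Checking `file.exists()` before calling `load()` is a quick way for students to see whether the path they typed actually points at the file.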

An important feature of RStudio is that it is relatively easy to open and work in either a Quarto or R Markdown window, if you want your students to use one of these platforms for doing homework. A key benefit of doing this is that students can run R commands within their document to see if they work, and then add text that appears after the resulting R output (statistical results and graphs), rather than copying the output from the RStudio window and pasting it into a different document, which typically requires some additional formatting. A brief guide to using Quarto is provided in the online resources for this book.

One of the most important things instructors can do is encourage students to follow along and run the R code as they work through the chapters. Running the code in this low stakes context will help prepare them for the end-of-chapter exercises and any other assignments that instructors might require. It is also important to encourage students to type in the commands, rather than copying and pasting them, as I think this reinforces the connection between the students and what R is doing. Each chapter begins with a “Get Ready” section that tells students which data sets and packages are used in the chapter. Many of the error messages students receive can be traced back to not loading the required data sets or not attaching the required libraries.
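As a rough illustration of the two kinds of errors mentioned above, the sketch below deliberately triggers each one and captures the message instead of stopping. The names `mydata_not_loaded` and `notarealpkg` are hypothetical, chosen because no such object or package should exist.

```r
# Using a data set that was never loaded: R cannot find the object.
data_msg <- tryCatch(
  summary(mydata_not_loaded),
  error = function(e) conditionMessage(e)
)
print(data_msg)

# Attaching a package that is not installed: library() fails.
pkg_msg <- tryCatch(
  library(notarealpkg),
  error = function(e) conditionMessage(e)
)
print(pkg_msg)
```

When students see messages like these, the first thing to check is whether they ran the data-loading and `library()` commands listed in the chapter's “Get Ready” section.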

**For students**, the keys to success are largely the same as for most other subject areas. However, for students using this textbook, it is particularly important that they don’t fall behind. The material builds cumulatively and it will become increasingly difficult for students to do well if they fall behind. This applies to both aspects of the book, learning about data analysis and learning how to use R. As alluded to above, it is important for students to hunker down and get their work done in the first part of the book, where they will learn some data analysis basics and get their first exposure to using R. Also, for students, ask for help! Your instructors want you to succeed and one of the keys to success is letting the experts (your instructors) help you!

## Chapter Contents

The 18 chapters listed below constitute four broadly defined parts of the book. The *Preparatory* chapters (1, 2, and 4) help orient students to key concepts and processes involved in political and social data analysis and also provide a foundation for using R in the other chapters. The *Descriptive* chapters (3, 5, 6, and 7) cover most of the important descriptive statistics, with an emphasis on matching the appropriate statistics and graphing techniques to the type of data being used. The middle chunk of the book (chapters 8-13) focuses on different aspects of *Statistical Inference and Hypothesis Testing*, again emphasizing the match between data type and appropriate statistical tests. This section also emphasizes the important role of *effect size* as a complement to measures of statistical significance. The final section (chapters 14-18) focuses on *Correlation and Regression*.

Chapter Topics | |
---|---|
1. Introduction to Research and Data | 10. Hypothesis Testing with Two Groups |
2. Using R to Do Data Analysis | 11. Hypothesis Testing with Multiple Groups |
3. Frequencies and Bar Graphs | 12. Hypothesis Testing with Crosstabs |
4. Data Preparation | 13. Measures of Association |
5. Measures of Central Tendency | 14. Correlation and Scatterplots |
6. Measures of Dispersion | 15. Simple Regression |
7. Probability | 16. Multiple Regression |
8. Sampling and Inference | 17. Advanced Regression Topics |
9. Hypothesis Testing | 18. Regression Assumptions |

One of the goals in putting this book together was to provide broad enough coverage of topics that it could fit a number of different course needs. For courses focused on “the basics,” instructors may not be interested in the chapters on regression analysis and might focus on the preparatory and descriptive chapters, perhaps adding the chapters on crosstabs, measures of association, and correlation and scatterplots. For courses that emphasize regression analysis, instructors might want to skip the chapters on crosstabs and measures of association, or some other chapters, and make sure to cover the four regression chapters. Some courses do not emphasize statistical inference as much as others, so those instructors might assign fewer of the statistical inference and hypothesis testing chapters.

## Supplemental Resources

### Data Sets and Codebooks

The primary data sets used in the text and for end-of-chapter problems are listed in the table below:

Data Set | Description |
---|---|
anes20 | A collection of individual-level attitudinal, behavioral, and demographic variables taken from the 2020 American National Election Study. 221 variables with over 8000 respondents. |
countries2 | A collection of country-level measures of demographic, health, economic, and political outcomes. 49 variables from 196 countries. |
county20large | A collection of U.S. county-level measures of demographic, health, economic, and political outcomes. 96 variables from over 3100 counties. |
states20 | A collection of U.S. state-level measures of demographic, health, economic, and political outcomes. 87 variables taken from 50 states. |

These and other data sets can be downloaded at this link (you might have to right-click on this link to open the directory in a new tab or window) and the codebooks can be found in the appendix of this book. Only a handful of variables from each of these data sets are used in the text, leaving many others for homework, research projects, or paper assignments.