Preface

First things first, this book is a work in progress and will be updated over time as I address what are no doubt a number of silly errors. I’m not a great proofreader, especially of my own work, so most of the errors will come to my attention as other people read this book (some willingly, some out of obligation). If you find errors, please contact me directly (tomholbrook12@gmail.com) and let me know about them.

How to use this book

This book has two purposes: to provide students with a comprehensive, accessible overview of important issues related to political and social data analysis, and to provide a gentle introduction to using the R programming environment to address those issues. I cover this in greater detail in chapter 1, but it is worth reinforcing here that this is not a statistics textbook. Statistics are used and discussed, but the focus is on practical aspects of doing data analysis.

The structure of the book is motivated by an approach to education that is especially appropriate for teaching data analysis and related topics: learn by doing. In this spirit, virtually every graph and statistical table used in this book is accompanied by the R code that produced the output. For a few tables and graphs, the code is a bit too complicated for new R users, so it is not included but is available upon request. Generally, these are the numbered and titled figures and tables.

Students can “follow along” by running the R code on their own as they work through the chapters. Ideally, they will not encounter problems, and will obtain the same results when running the R code. However, anyone who has used R before, and especially anyone who can remember how things worked when they were first learning R, knows that “ideally” is not how things always work out. This is the great thing about using this approach to teaching data analysis: learning from errors. The book relies on base R and a handful of additional packages (no tidyverse).

Each of the chapters includes a few exercises that can be used for assignments. First, there are Concept and Calculation exercises. These require students to calculate and interpret statistics, apply key concepts to problems and examples, or interpret R output. Most of these problems lean much more to the “Concepts” side than to the “Calculations” side, and the calculations that are used tend to be pretty simple. Second, there are R Problems that require students to analyze data and interpret the findings using R commands shown earlier in the chapter. In some cases, there is a cumulative aspect to these problems, meaning that students need to use R code they learned in previous chapters. Students who follow along and run the R code as they read the chapters will have a relatively easy time with the R Problems. Most chapters include both Concept and Calculation exercises and R Problems, though some chapters include only one type of problem.

What’s in this Book?

The 18 chapters listed below constitute four broadly defined parts of the book. The Preparatory chapters (1, 2, and 4) help orient students to key concepts and processes involved in political and social data analysis and also provide a foundation for using R in the other chapters. The Descriptive chapters (3, 5, 6, and 7) cover most of the important descriptive statistics, with an emphasis on matching the statistics to the type of data being used. The middle chunk of the book (chapters 8-13) focuses on different aspects of Statistical Inference and Hypothesis Testing, again emphasizing the match between data type and appropriate statistical tests. This section also emphasizes the important role of effect size as a complement to measures of statistical significance. The final section (chapters 14-18) focuses on Correlation and Regression.

Chapter Topics
1. Introduction to Research and Data
2. Using R to Do Data Analysis
3. Frequencies and Bar Graphs
4. Transforming Variables
5. Measures of Central Tendency
6. Measures of Dispersion
7. Probability
8. Sampling and Inference
9. Hypothesis Testing
10. Hypothesis Testing with two Groups
11. Hypothesis Testing with Multiple Groups
12. Hypothesis Testing with Non-Numeric Variables
13. Measures of Association
14. Correlation and Scatterplots
15. Simple Regression
16. Multiple Regression
17. Advanced Regression Topics
18. Regression Assumptions

Keys to Student Success

For instructors, there are a couple of things that can help put students on the path to success. First, the preparatory chapters are very important to student success, and it is worth taking time to make sure students get through this material successfully. For students who have never had a course on data analysis or used R before (most students), this material can be challenging. Early on, the process of downloading and installing R, and then running the first bits of R code, can be especially frustrating and intimidating. In my experience, a little extra attention and a helping hand at this point can make a big difference to student success with the rest of the book. One thing to consider, and something I discuss in Chapter 2, is to use RStudio Cloud to avoid many of the problems students will encounter if they have to download R and install packages in the first couple of weeks of their semester.

One of the interesting developments I have observed over the past decade or so is that students are increasingly disconnected from the internal structure of their laptops and personal computers. The concepts of directories, drives, folders, etc., seem to have lost meaning to many students. I attribute this to so many things being accessible via remote access, whether through their computer, tablet, or phone. A classic example of this comes up almost every semester when I give instructions for loading one of the first data files I use in class, anes20.rda. Students take my sample text, load("<filepath>/anes20.rda"), and try to run it verbatim, as if <filepath> is an actual place on their device. I bring this up not to poke fun at students, but to highlight a very real issue that may cause problems early on for some students. Patience and a helping hand are encouraged.
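To make the <filepath> point concrete, here is a minimal sketch in base R of what replacing the placeholder with a real location looks like. The folder used here (a temporary directory) and the tiny stand-in data frame are assumptions made only so the example runs anywhere; in practice the folder would be wherever the student saved the downloaded anes20.rda file.

```r
# Stand-in for the real data: a tiny data frame saved under the same file name.
# (The actual anes20.rda comes from the book's download link; this one is made up.)
data_dir <- tempdir()                       # hypothetical folder; yours will differ
anes20 <- data.frame(id = 1:3, vote = c("A", "B", "A"))
save(anes20, file = file.path(data_dir, "anes20.rda"))
rm(anes20)                                  # remove it so load() has to restore it

# Build the full path by replacing <filepath> with a real folder on this machine.
data_path <- file.path(data_dir, "anes20.rda")

# Check that the file is where we think it is before trying to load it.
file.exists(data_path)

# load() restores the saved object(s) under their original names.
load(data_path)
nrow(anes20)
```

Checking file.exists() first is a design choice rather than a requirement: it turns a confusing load() error into a simple TRUE/FALSE answer about whether the path points to a real file.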

One of the most important things instructors can do is encourage students to follow along and run the R code as they work through the chapters. Running the code in this low-stakes context will help prepare them for the end-of-chapter assignments and any other assignments that might be required.

While learning R might create stress among students, the same is true for other technical and substantive aspects of doing data analysis. Part of this is about statistics, even though I’ve said this is not a statistics textbook. Still, statistics are an essential tool for doing data analysis, and it is important to teach students how to use this tool. Learning R code is important, but it is almost irrelevant if students don’t understand the substantive implications of statistical applications. Most of this can be addressed by requiring regular homework assignments, whether those found here or something else.

For students, the keys to success are largely the same as for most other subject areas. However, for courses using this textbook, it is particularly important that you don’t fall behind. The material builds cumulatively, and it will become increasingly difficult to do well if you fall behind. This applies to both aspects of the course: learning about data analysis and learning how to use R. As alluded to above, it is really important to hunker down and get your work done in the first part of the book, where you will learn some data analysis basics and get your first exposure to using R. Also, ask for help! Your instructors want you to succeed, and one of the keys to success is letting the experts (your instructors) help you!

Data Sets and Codebooks

The primary data sets used for demonstration and for end-of-chapter problems are listed in the table below:

Data set	Description
anes20	A collection of individual-level attitudinal, behavioral, and demographic variables taken from the 2020 American National Election Study. 221 variables from over 8000 respondents.
countries2	A collection of country-level measures of demographic, health, economic, and political outcomes. 49 variables from 196 countries.
county20large	A collection of U.S. county-level measures of demographic, health, economic, and political outcomes. 96 variables from over 3100 counties.
states20	A collection of U.S. state-level measures of demographic, health, economic, and political outcomes. 87 variables taken from 50 states.

The data sets can be downloaded at this link¹, and the codebooks can be found in the appendix of this book. Only a handful of variables are used from each of these data sets, leaving many others to use for homework, project, or paper assignments.

Last updated on: 15 May 2022.

¹ If the link doesn’t work, copy and paste this into your browser: https://bit.ly/3PkyMBW