# Preface

**Last updated on: 20 Aug 2023**.

First things first, this book is a *work in progress* and the online version will be updated over time as I address what are no doubt a number of silly errors. I’m not a great proofreader, especially of my own work, so most of the errors will come to my attention as other people read this book (some willingly, some out of obligation). If you find errors, please contact me directly (tomholbrook12@gmail.com) and let me know about them. If you download the PDF version,^{1} you might find some strange formatting features, especially in sections with lots of graphs or tables. All of the content should still be there, but it will look a bit different from the online version. These issues will be addressed gradually over time.

## Origin Story

This book started as a collection of lecture notes that I put online for my undergraduates in the wake of the COVID-19 outbreak in March of 2020. I had always posted a rough set of online notes for my classes, but the pandemic pivot meant that the notes had to be much more detailed and thorough than before. This, coupled with my inability to find a textbook that was right for my class^{2}, led me to somewhat hastily cobble together my topic notes into something that started to look like a “book” during the summer of 2021. I used this new set of notes for the fall 2021 hybrid section of my undergraduate course, *Political Data Analysis*, and it worked out pretty well. I then spent the better part of spring 2022 (a sabbatical semester) expanding and revising (and revising, and revising, and revising) the content, as well as trying to master Bookdown, so I could stitch together the multiple R Markdown files to look more like a book than just a collection of lecture notes. I won’t claim to have mastered either Bookdown or Markdown, but I did learn a lot as I used them to put this thing together.

Right now, the book is available free online. At some point, it will be available as a physical book, probably via a commercial press, and hopefully with low-cost options for students. Feel free to use it in your classes if you think it will work for you. If you do use it, let me know how it works out for you!

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

## How to Use this Book

This book has two purposes: to provide students with a comprehensive, accessible overview of important issues related to political and social data analysis, and to provide a gentle introduction to using the R programming environment to address those issues. I cover this in greater detail in Chapter 1, but it is worth reinforcing here that this is not a statistics textbook. Statistics are used and discussed, but the focus is on practical aspects of doing data analysis.

The structure of the book is motivated by an approach to education that is especially appropriate for teaching data analysis and related topics: *learn by doing.* In this spirit, virtually every graph and statistical table used in this book is accompanied by the R code that produced the output. For a few tables and graphs, the code is a bit too complicated for new R users, so it is not included but is available upon request. Generally, these are the numbered and titled figures and tables.

Students can “follow along” by running the R code on their own as they work through the chapters. Ideally, they will not encounter problems, and will obtain the same results when running the R code. However, anyone who has used R before, and especially anyone who can remember how things worked when they were first learning R, knows that “ideally” is not how things always work out. This is the great thing about this approach to teaching data analysis: learning from errors.

This book relies on base R and a handful of additional packages. I’ve already been asked why this book does not incorporate or emphasize `tidyverse`. I think there is a lot to be gained by learning to work with the `tidyverse` tools, but I also think doing so is much easier once you have at least a basic level of familiarity with R. In general, my approach is: sit up before you crawl, crawl before you walk, and walk before you run.

Each of the chapters includes a few exercises that can be used for assignments. First, there are *Concept and Calculation* exercises. These require students to calculate and interpret statistics, apply key concepts to problems and examples, or interpret R output. Most of these problems lean much more toward the “Concepts” side than the “Calculations” side, and the calculations tend to be pretty simple. Second, there are *R Problems* that require students to analyze data and interpret the findings using R commands shown earlier in the chapter. In some cases, there is a cumulative aspect to these problems, meaning that students need to use R code they learned in previous chapters. Students who follow along and run the R code as they read the chapters will have a relatively easy time with the R problems. Most chapters include both Concept and Calculation exercises and R Problems, though some chapters include only one type or the other.

## What’s in this Book?

The 18 chapters listed below constitute four broadly defined parts of the book. The *Preparatory* chapters (1, 2, and 4) help orient students to key concepts and processes involved in political and social data analysis and also provide a foundation for using R in the other chapters. The *Descriptive* chapters (3, 5, 6, and 7) cover most of the important descriptive statistics, with an emphasis on matching the appropriate statistics to the type of data being used. The middle chunk of the book (chapters 8-13) focuses on different aspects of *Statistical Inference and Hypothesis Testing*, again emphasizing the match between data type and appropriate statistical tests. This section also emphasizes the important role of *effect size* as a complement to measures of statistical significance. The final section (chapters 14-18) focuses on *Correlation and Regression*.

| Chapter Topics | |
|---|---|
| 1. Introduction to Research and Data | 10. Hypothesis Testing with Two Groups |
| 2. Using R to Do Data Analysis | 11. Hypothesis Testing with Multiple Groups |
| 3. Frequencies and Bar Graphs | 12. Hypothesis Testing with Non-Numeric Variables (Crosstabs) |
| 4. Transforming Variables | 13. Measures of Association |
| 5. Measures of Central Tendency | 14. Correlation and Scatterplots |
| 6. Measures of Dispersion | 15. Simple Regression |
| 7. Probability | 16. Multiple Regression |
| 8. Sampling and Inference | 17. Advanced Regression Topics |
| 9. Hypothesis Testing | 18. Regression Assumptions |

## Keys to Student Success

For instructors, there are a couple of things that can help put students on the path to success. First, the preparatory chapters are very important to student success, and it is worth taking time to make sure students get through this material successfully. For students who have never had a course on data analysis or used R before (most students), this material can be challenging. Early on, the process of downloading and installing R, and then running the first bits of R code, can be especially frustrating and intimidating. In my experience, a little extra attention and a helping hand at this point can make a big difference to student success with the rest of the book. One thing to consider, and something I discuss in Chapter 2, is to use RStudio Cloud to avoid many of the problems students will encounter if they have to download R and install packages in the first couple of weeks of their semester.

One of the interesting developments I have observed over the past decade or so is that students are increasingly disconnected from the internal structure of their laptops and personal computers. The concepts of directories, drives, folders, etc., seem to have lost meaning to many students. I attribute this to so many things being accessible via remote access, whether through their computer, tablet, or phone. A classic example of this comes up almost every semester when I give instructions for loading one of the first data files I use in class, `anes20.rda`. Students take my sample text, `load("<filepath>/anes20.rda")`, and try to run it verbatim, as if `<filepath>` is an actual place on their device. I bring this up not to poke fun at students, but to highlight a very real issue that may cause problems early on for some students. Patience and a helping hand are encouraged.
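To make the `<filepath>` point concrete, here is a minimal sketch of what replacing the placeholder might look like. Everything about the location is an assumption made for illustration. `tempdir()` and the tiny stand-in data frame are there only so the code runs on any machine. In practice, `data_folder` would be the folder where the downloaded `anes20.rda` actually lives.

```r
# ASSUMPTION: tempdir() stands in for a real folder on your device.
# In practice, replace it with the folder where you saved the file,
# for example "~/Downloads" or "C:/Users/yourname/Documents".
data_folder <- tempdir()

# FOR THIS SKETCH ONLY: create a tiny stand-in file so the example runs
# anywhere. This is NOT the real ANES data; students would skip this step.
anes20 <- data.frame(id = 1:3, answer = c("yes", "no", "yes"))
save(anes20, file = file.path(data_folder, "anes20.rda"))
rm(anes20)

# Build the full path from the folder and the file name
data_path <- file.path(data_folder, "anes20.rda")

# Check that R can see the file before trying to load it
file.exists(data_path)

# Load the file; the object it contains (anes20) appears in the workspace
load(data_path)
nrow(anes20)
```

Checking `file.exists()` first tends to turn a cryptic "cannot open file" error into a simple TRUE/FALSE question about whether the path is right, which can be a useful first troubleshooting step for students.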

Another important thing instructors can do is encourage students to follow along and run the R code as they work through the chapters. Running the code in this low-stakes context will help prepare them for the end-of-chapter exercises and any other assignments that instructors might require.

While learning R might create stress among students, the same is true for other technical and substantive aspects of doing data analysis. Part of this is about statistics, even though, as mentioned above, this is not a statistics textbook. Still, statistics are an essential tool for doing data analysis, and it is important to teach students how to use this tool. Learning R code is important, but it is almost irrelevant if students don’t understand the substantive implications of statistical applications. Most of this can be addressed by requiring regular homework assignments, whether those found here or something else.

For students, the keys to success are largely the same as for most other subject areas. However, for students using this textbook, it is particularly important that they don’t fall behind. The material builds cumulatively and it will become increasingly difficult for students to do well if they fall behind. This applies to both aspects of the book, learning about data analysis and learning how to use R. As alluded to above, it is really important for students to hunker down and get their work done in the first part of the book, where they will learn some data analysis basics and get their first exposure to using R. Also, for students, ask for help! Your instructors want you to succeed and one of the keys to success is letting the experts (your instructors) help you!

## Data Sets and Codebooks

The primary data sets used for demonstration and for end-of-chapter problems are listed in the table below:

Data Set | Description |
---|---|
anes20 | A collection of individual-level attitudinal, behavioral, and demographic variables taken from the 2020 American National Election Study. 221 variables with over 8000 respondents. |
countries2 | A collection of country-level measures of demographic, health, economic, and political outcomes. 49 variables from 196 countries. |
county20large | A collection of U.S. county-level measures of demographic, health, economic, and political outcomes. 96 variables from over 3100 counties. |
states20 | A collection of U.S. state-level measures of demographic, health, economic, and political outcomes. 87 variables taken from 50 states. |

These and other sets can be downloaded at this link (you might have to right-click on this link to open the directory in a new tab or window)^{3}, and the codebooks can be found in the appendix of this book. Only a handful of variables are used from each of these data sets, leaving many others for homework, research projects, or paper assignments.

1. If the link doesn’t work, copy and paste this to your browser: https://www.dropbox.com/s/tezwantj8n4emjt/bookdown-demo_.pdf?dl=0

2. There are a lot of really good textbooks out there. I suppose I was a bit like Goldilocks: most books had many things to recommend them, but none were quite right. Some books had a social science orientation but not much to say about political science; some books were overly technical, while others were not technical enough; some books were essentially programming manuals with little to no research design or statistics instruction, while others were very good on research design and statistics but offered very little on the computing side.

3. If the link doesn’t work, copy and paste this to your browser: https://www.dropbox.com/sh/le8u4ha8veihuio/AAD6p6RQ7uFvMNXNKEcU__I7a?dl=0