This document is a follow-along tutorial on data collection and cleaning with R. I have tried to make it illustrative of the real data and real tasks that anyone from a social science student to a county government official might actually encounter. To that end, I build on actual projects that I have worked on as a graduate research assistant. For context, I previously conducted a Mississippi case study of how indoor smoking bans diffused throughout the state at the municipal level. Case studies, by nature, are difficult to generalize from, and that is a limitation I would like to overcome. I therefore want our dataset to include the following states: AL, KY, MS, MO, SC, TX, WV. These states are relevant because none of them has a preempting state-level policy, meaning that their municipalities are able to pass their own laws on the matter.

What I am presenting here is a step-by-step walkthrough of where I obtained the data for the other states, some of the challenges faced when combining data from multiple sources into one, and how to overcome them. In the First Steps section, I explain how to import and preview the MS dataset so that we can ascertain what we need to collect for the other states and so that we can have the MS data ready to merge with the other states later. Before that, the next section offers a brief discussion of what knowledge you’re expected to possess before this tutorial will be useful to you, some of the formatting conventions used throughout, and how to best approach the information presented.

How to Read This Document

In this document, I assume that you have the latest versions of R and RStudio installed. If you do not, install R first and then RStudio, in that order, from their respective official download pages.
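If you are unsure which version of R you are running, a quick check from the console can help. The following is a minimal sketch using base R only; the "4.0.0" threshold is an illustrative assumption of mine, not a stated requirement of this tutorial.

```r
# Print the full version string of the running R installation.
print(R.version.string)

# getRversion() returns a version object that can be compared to a string.
# NOTE: "4.0.0" is a hypothetical threshold chosen only for illustration.
minimum_version <- "4.0.0"
is_recent <- getRversion() >= minimum_version

if (is_recent) {
  message("R is at least version ", minimum_version, ".")
} else {
  message("R is older than version ", minimum_version, "; consider updating.")
}
```

If the message reports an older version, installing the current release of R before continuing is probably the simplest fix.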

This tutorial also assumes that you have a basic understanding of the R language’s syntax. While I will endeavor to be as explicit and exhaustive as possible in this walkthrough, complete instruction in the R programming language is out of scope here. If you don’t feel comfortable with R’s syntax, plenty of available resources already cover that material brilliantly. Personally, I recommend the following:

  • swirl, a package that teaches you R within R (Kross et al. 2019)
  • The Book of R: A First Course in Programming and Statistics (Davies 2016)
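If you would like to try swirl, the sketch below shows one way to install and launch it from the R console. It assumes an internet connection and a configured CRAN mirror; `ensure_package` is a hypothetical helper name introduced here for illustration and is not part of swirl or base R.

```r
# Hypothetical helper (not from swirl or base R): install a package from
# CRAN only if it is missing, then report whether it is now available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# swirl is interactive, so only launch it from an interactive session.
if (interactive()) {
  if (ensure_package("swirl")) {
    library(swirl)
    swirl()  # starts the lesson menu and prompts at the console
  }
}
```

Once `swirl()` is running, it walks you through choosing a course at the console prompt.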

Throughout this document, packages, function names, object names, operators, and code chunks are specially formatted to separate them clearly from plain text. They will be formatted like so.


I would like to extend a very sincere thank you to Dr. Joseph “Dallas” Breen for his continuing mentorship and interest in this project and to (soon to be Dr.) Adam Thrash for providing his invaluable, detailed feedback throughout the creation of this document.


Davies, Tilman M. 2016. The Book of R: A First Course in Programming and Statistics. San Francisco: No Starch Press.

Kross, Sean, Nick Carchedi, Bill Bauer, and Gina Grdina. 2019. Swirl: Learn R, in R.