Chapter 6 Homework

This is our last day, so I really shouldn’t be giving homework! But now would be a great time to try a data analysis/statistics project on some data that interests you. If you have a dataset already, great! Otherwise, see below for some places to look.

6.1 Understanding the basics

Start by understanding the data that is available to you. If you have a codebook, you have struck gold! If not (the more common case), you’ll need to do some detective work that often involves talking to people. At this stage, ask yourself:

Where does my data come from? How was it collected?
Is there a codebook? If not, how can I learn about it?
Are there people I can reach out to who have experience with this data?

Next, you need to load the data and clean it. Once the data is loaded, ask yourself about each table:

What is an observation?
How many observations are there?
What is the meaning of each variable?
What is the type of each variable (date, location, string, factor, number, boolean, etc.)?

Some great functions to start with are:

str() to learn about the number of variables and observations as well as the class of each variable
skim() to see many summary statistics about each variable
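
As a minimal sketch of that first look, here is what these calls might look like on airquality, a dataset that ships with R and stands in for your own data. skim() lives in the skimr package, which I am assuming may or may not be installed, so the sketch checks for it first.

```r
# airquality ships with R (datasets package); swap in your own data frame
data("airquality")

# Number of observations and variables, plus the class of each variable
str(airquality)

# Base R summary: min, max, quartiles, and NA counts for each variable
summary(airquality)

# skim() comes from the skimr package; only run it if skimr is installed
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(airquality))
}
```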

Some good things to check:

Are all the variables of the type you expect? For example, make sure that zip codes are stored as a chr or Factor variable, rather than as a num or int.
Look at the variable names in your data. Are they readable? Do they include spaces, periods, underscores, or other things that will make them hard to type? Are they helpful and contextual? For example, you would want Airport and WaterTemp, not Individuals and Treatments, and certainly not A and B as variable names.
Look at the category names. Are they readable? Are they understandable phrases or cryptic codes? For example, do they use Male and Female or something like 1 and 2? A binary variable isFemale could be coded 0 for male and 1 for female, which makes it self-documenting. A variable sex coded 1 and 2 is just asking for trouble.
Look at the min, max, and number of missing values for each of the variables. Do those values make sense?
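
The checks above can be sketched on a small made-up table. Everything in it is hypothetical: the column names A, B, and C, the values, and the assumption that a codebook says 1 means male and 2 means female.

```r
# Hypothetical raw data showing the problems described above
raw <- data.frame(
  A = c(2134, 55105, 90210),  # zip codes read in as numbers (leading zero lost)
  B = c(1, 2, 2),             # sex stored as a cryptic code
  C = c(18.5, NA, 250),       # water temperature: one missing, one suspicious value
  stringsAsFactors = FALSE
)

# Rebuild the table with readable names and sensible types
clean <- data.frame(
  ZipCode   = sprintf("%05d", as.integer(raw$A)),  # character, zero-padded to 5 digits
  isFemale  = as.integer(raw$B == 2),              # assumes 1 = male, 2 = female
  WaterTemp = raw$C,
  stringsAsFactors = FALSE
)

# Confirm the types are what we expect
str(clean)

# Min and max of a numeric variable, ignoring missing values
range(clean$WaterTemp, na.rm = TRUE)

# Number of missing values in each variable
colSums(is.na(clean))
```

A maximum of 250 for a water temperature is the kind of value that should send you back to the codebook or to the people who collected the data.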

Finally, ask yourself about the relationships between tables:

What variables link the tables (i.e., which variables can you use in join commands)?
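
Here is a small sketch of checking a linking variable before joining. The two tables and the key AirportCode are made up for illustration, and the join uses base R merge() so that it runs without extra packages.

```r
# Two hypothetical tables that share the key column AirportCode
airports <- data.frame(
  AirportCode = c("MSP", "ORD", "SEA"),
  City        = c("Minneapolis", "Chicago", "Seattle"),
  stringsAsFactors = FALSE
)
flights <- data.frame(
  FlightID    = 1:4,
  AirportCode = c("MSP", "MSP", "SEA", "JFK"),
  stringsAsFactors = FALSE
)

# Key values that appear in flights but have no match in airports
unmatched <- setdiff(flights$AirportCode, airports$AirportCode)
unmatched

# Left join: keep every flight and attach the city where the key matches
joined <- merge(flights, airports, by = "AirportCode", all.x = TRUE)
joined
```

If you are working with dplyr, left_join(flights, airports, by = "AirportCode") should give the same rows, possibly in a different order.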

6.2 Visualize and describe

Once you have the data loaded and cleaned, it is usually helpful to do some univariate visualization, e.g., plotting histograms, densities, and box plots of different variables. You might ask questions such as:

What do you see that is interesting?
Which values are most common or unusual (outliers)?
Is there a lot of missing data?
What type of variation occurs within the individual variables?
What might be causing the interesting findings?
How could you figure out whether your ideas are correct?
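
A quick sketch of univariate plots, using base R graphics and the built-in airquality data as a stand-in for your own. The choice of the Ozone variable is just an example.

```r
data("airquality")

# Histogram: which values are most common?
hist(airquality$Ozone, main = "Ozone", xlab = "Ozone (ppb)")

# Density: a smoothed view of the same distribution
plot(density(airquality$Ozone, na.rm = TRUE), main = "Ozone density")

# Box plot: points beyond the whiskers are candidate outliers
boxplot(airquality$Ozone, horizontal = TRUE, main = "Ozone")

# The values the box plot flags as outliers
ozone_outliers <- boxplot.stats(airquality$Ozone)$out
ozone_outliers

# How much missing data is there in each variable?
missing_counts <- colSums(is.na(airquality))
missing_counts
```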

Once you have done some univariate visualization, you might examine the covariation between different variables.
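
For covariation, one possible starting point is a scatterplot for two numeric variables and side-by-side box plots for a numeric variable against a categorical one. This sketch again assumes the built-in airquality data.

```r
data("airquality")

# Two numeric variables: scatterplot
plot(airquality$Temp, airquality$Ozone,
     xlab = "Temperature (F)", ylab = "Ozone (ppb)")

# A single-number summary of the linear association, dropping missing rows
temp_ozone_cor <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
temp_ozone_cor

# Numeric against categorical: one box plot per month
boxplot(Ozone ~ Month, data = airquality,
        xlab = "Month", ylab = "Ozone (ppb)")
```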

6.3 Formulate a research question

You will often end up with a ton of data, and it can be easy to be overwhelmed. How should you get started? One easy idea is to brainstorm ideas for research questions, and pick one that seems promising. This process is much easier with more than one brain! You will often be working off of a broad question posed by your business, organization, or supervisor, and be thinking about how to narrow it down. To do so, you can again revisit questions like “What patterns do you see?” or “Why might they be occurring?”

6.4 Try some models!

Test out some of the modeling techniques we have examined on your data.

How do the predictions on a test data set compare to a null model?
What measures of accuracy are you using?
Try some models that include a larger number of explanatory variables.
Try to fit a model that provides some insights (not necessarily high prediction accuracy), and demonstrate these insights with a visualization.
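
As a rough sketch of the first three points, here is one way to compare models against a null model on held-out data. I am assuming linear regression, root mean squared error as the accuracy measure, and the built-in mtcars data; your project may call for a different technique and measure.

```r
set.seed(1)  # make the random split reproducible
data("mtcars")

# Hold out 10 rows as a test set
test_rows <- sample(nrow(mtcars), size = 10)
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

# Null model: predicts the training mean of mpg for every car
null_model   <- lm(mpg ~ 1, data = train)
# One explanatory variable
small_model  <- lm(mpg ~ wt, data = train)
# A larger number of explanatory variables
larger_model <- lm(mpg ~ wt + hp + cyl, data = train)

# Root mean squared error of a model's predictions on new data
rmse <- function(model, newdata) {
  sqrt(mean((newdata$mpg - predict(model, newdata = newdata))^2))
}

# Test-set RMSE for each model; lower is better
results <- sapply(
  list(null = null_model, small = small_model, larger = larger_model),
  rmse,
  newdata = test
)
results
```

A model is only worth keeping if its test error is clearly below that of the null model.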

6.5 Communicate your findings

Once you have something you think is interesting, clean up your RMarkdown document to only include the most relevant visualizations, summary statistics, and models. Knit often to ensure that all the necessary pieces are included, but try to streamline your document. Fill in narrative to explain the project. Check out the RMarkdown cheatsheet and perhaps change some chunk options to make the narrative clearer. For example, you may want to include echo=FALSE to hide all the code, and only display the results.
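
To hide the code in a single chunk, add echo=FALSE to that chunk's header. To hide it everywhere, one option is to set the default once in a setup chunk near the top of the document, as in this sketch. Turning off message and warning as well is my own suggestion, not a requirement.

```r
# Place this in a setup chunk near the top of the .Rmd file.
# It changes the defaults for every chunk that follows.
knitr::opts_chunk$set(
  echo    = FALSE,  # hide the code, show only the results
  message = FALSE,  # hide messages such as package startup notes
  warning = FALSE   # hide warnings in the knitted output
)
```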

6.6 Places to find data

6.6.1 Inside R!

Of course, the easiest way to get data inside R is to find some that came with an R package. To see the datasets you have access to in R, type the following in your Console:

data()

A new window will pop up, near where your data preview and RMarkdown documents are. This lists all the datasets you have access to! On my computer, the first one is called AirPassengers. If I wanted to read more about that dataset, I could use the ?, again in the Console:

?AirPassengers

(That one isn’t very interesting; it only has one variable!)

Then, if I found a dataset I wanted to use for my project, I would load it in my RMarkdown document like this:

data("AirPassengers")

This replaces the read_csv() call we use when we have a .csv file.
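
One thing to watch for: not every built-in dataset is a data frame. AirPassengers, for example, is a time series object. Here is a sketch of checking the class and, if needed, converting it to a data frame. The column names Year, Month, and Passengers are my own choices.

```r
data("AirPassengers")

# Check what kind of object was loaded
class(AirPassengers)

# AirPassengers holds monthly totals from 1949 through 1960,
# so build one row per month
passengers <- data.frame(
  Year       = rep(1949:1960, each = 12),
  Month      = as.integer(cycle(AirPassengers)),  # position within the year, 1 to 12
  Passengers = as.numeric(AirPassengers)
)
str(passengers)
```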

6.6.2 On the web

If you don’t like the look of any of those datasets, you can search for a file on the web.

When you look for data, try to find a dataset that is in .csv format, or .xls (Excel) format. R can read in other types of data, but those two are the easiest.
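
Here is a minimal sketch of reading a .csv file. So that it runs without an internet connection, it first writes a tiny made-up file to a temporary location; in a real project, csv_path would be the path or URL of the file you found.

```r
# Write a small hypothetical CSV to a temporary file so the example runs offline
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Airport,WaterTemp", "MSP,18.5", "SEA,12.1"), csv_path)

# read.csv() is the base R reader and accepts a local path or a URL.
# readr::read_csv(csv_path) is the tidyverse equivalent.
my_data <- read.csv(csv_path, stringsAsFactors = FALSE)
str(my_data)

# For Excel files, readxl::read_excel() reads a local .xls or .xlsx file,
# so a file on the web needs to be downloaded first, for example with
# download.file(url, destfile, mode = "wb")
```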

Here are some of my favorite places to find data:

Data is Plural tinyletter and associated spreadsheet: This is my favorite, but it’s also the craziest. I recommend using your browser’s search function to find a keyword in the spreadsheet (e.g., search for “golf” or “race” to find data on those topics).
Tidy Tuesday is a weekly data visualization challenge/community! All of the past data is archived on GitHub. Bonus: if you complete an analysis in the designated week and tweet it, many people will retweet your project! (Tag me and I definitely will!)
FiveThirtyEight data archive: datasets that go along with stories FiveThirtyEight has written.
Data.gov: 186,000+ datasets! Good ones are the American Time Use Survey and the Youth Risk Behavior Survey, but again, search for topics that interest you.
Pew Research Center: a place that does lots of opinion surveys. If you want to know what people think about politics, technology, Millennials, etc., this is the place to go.
Google dataset search: a version of Google that just searches for data. Works just like regular Google, so you can write things like -site:figshare.com if you want to exclude figshare results (for example).
IRE and NICAR are good resources for the types of data journalists care about. For example, Energy data sources and Chrys Wu’s resource page.
Jo Hardin at Pomona College has a nice list of data sources on her website.
U.S. Bureau of Labor Statistics
U.S. Census Bureau
Gapminder, data about the world.
Nathan Yau’s (old) guide to finding data on the internet