1 Introduction


[Comic image not shown. Source: XKCD]

Welcome to your guide to learning linear regression in Stata and R. This website houses the information you need to learn the basics of coding linear regression in Stata and R. It does not contain everything taught in class, but it will help you apply that knowledge to running linear regressions on your own.

The Stata labs on this website were adapted from materials by Ewurama Okai.

1.1 Labs

This is a 10-week course with 9 labs. Each lab will focus on some topic related to coding linear regression. By the end of the course, you should be able to run a linear regression project from start to finish with reproducible code.

Each lab will contain links to download script files (in .do or .r format), overviews of key concepts, and application questions.

Lab Topics

Note: All lab topics are tentative and subject to change.

  • Lab 1: Data cleaning review & writing clean code
  • Lab 2: Running a basic linear regression
  • Lab 3: Testing the assumptions of linear regression
  • Lab 4: Transforming variables & displaying results with margins
  • Lab 5: Exporting tables & reproducible code
  • Lab 6: Evaluating linearity & interactions
  • Lab 7: Robust standard errors & multicollinearity
  • Lab 8: Review & requested topics TBD
  • Lab 9: Comparing models & running a project from start to finish

1.2 Finding Data

When selecting data, consider:

  • The research question you would like to answer
  • The model type you will be applying (linear regression in this class)
  • The unit/level of analysis in the dataset (individual? school? district? state?)
  • The main independent and dependent variables you want to analyze
  • Other relevant variables to include in your model

Some places to find datasets:

1.3 Notes on Statistical Significance

Statistical significance is a yes/no test: did the result meet the significance threshold you set, or not? A smaller p-value does not necessarily mean the association is more meaningful.

As social scientists, we need to pay more attention to whether something is socially or sociologically significant. We do this by paying attention to the interpretation of coefficients and effect size. This is especially important because it is relatively easy to get statistically significant results with large samples. In the world of “big data,” this will come up more and more.
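
To see why large samples make statistical significance easy to reach, here is a minimal R sketch using simulated data. The true slope of 0.02, the two sample sizes, the seed, and the helper name `simulate_fit` are all illustrative assumptions and are not taken from the course materials.

```r
# Simulated illustration (all values are assumptions chosen for the example):
# the same tiny true slope is fit on a small sample and on a very large one.
set.seed(2024)  # arbitrary seed so the simulation is reproducible

simulate_fit <- function(n, true_slope = 0.02) {
  x <- rnorm(n)                     # standard normal predictor
  y <- true_slope * x + rnorm(n)    # outcome with noise sd = 1
  fit <- summary(lm(y ~ x))         # ordinary least squares fit
  c(n        = n,
    estimate = fit$coefficients["x", "Estimate"],
    p_value  = fit$coefficients["x", "Pr(>|t|)"])
}

# One row per sample size: n, estimated slope, and its p-value
results <- t(sapply(c(100, 1e6), simulate_fit))
print(results)
```

In the row for one million observations, the p-value should be far below conventional thresholds even though the estimated slope stays close to 0.02, which is a very small change in y for a one-unit change in x. The p-value shrinks because the standard error shrinks with sample size, not because the association became more substantively important.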

Some questions to ask yourself when writing papers, from Bernardi, Chakhaia, and Leopold (2017):

  1. Do you avoid interpreting a statistically insignificant coefficient as evidence of no effect?
  2. Do you avoid using the adjective “significant” in an ambiguous way?
  3. Do you avoid justifying the inclusion of variables in your model on the basis of statistical significance of their estimates?
  4. Do you report coefficients in some useful and intelligible form that makes it easier to understand how large the effect is?
  5. Do you discuss the substantive significance of the model coefficients?

1.5 Data Equity

Quantitative research is often perceived as inherently “objective,” but it can perpetuate inequality in dangerous ways. The website below is a great resource for thinking about equity in quantitative research. I also recommend signing up for their newsletter.

We All Count: Project for Equity in Data Science

Some other resources: