10 Econometric, statistical, and data issues

Many (most?) projects are empirical, involving econometrics/statistics. It doesn’t make sense to give an entire course on Statistics, Econometrics, and Data Science here. However, I think it’s helpful to give somewhat of an applied overview, to:

  • link to key resources,
  • refresh your memory of these topics,
  • help you consider how to do this stuff in a real project (rather than in exams and problem sets),
  • give a sense of one economist’s (my) impressions of how to choose an approach, and
  • point out some common misunderstandings/mistakes students make in this area.

As this cannot be comprehensive, I suggest referring to other resources (texts etc) for more detailed considerations.

10.2 Regression analysis, regression logic and meaning

We need to consider:

  • What is a regression? When should you use one?

  • How to specify the regression?

  • Which dependent variable do we use?

  • Which right-hand side variables?

    • Which is/are the focal variable(s) and which are ‘control variables’?
  • Endogeneity and identification

  • Other statistical issues (e.g., functional form, error structure)

How will you interpret your results?

How to create a regression table and put it in your paper.

Writing about regression (and statistical) analysis; yours and others’.

What is regression? When should you use one?

A way of fitting a line (plane) through a bunch of dots.

  • In multiple dimensions

  • It may have a causal interpretation (or not)
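To make “fitting a line (plane) through a bunch of dots” concrete, here is a minimal sketch on simulated data. The variable names (`educ`, `exper`, `income`), the numbers, and the helper `ols` are all made up for illustration; nothing here is a causal estimate.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients (intercept first) from regressing y on the columns of X."""
    design = np.column_stack([np.ones(len(y)), X])  # add a constant term
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

# Simulated data; all names and numbers are hypothetical.
rng = np.random.default_rng(0)
n = 1_000
educ = rng.normal(13, 2, n)       # years of education
exper = rng.normal(8, 3, n)       # years of work experience
income = 5 + 2.0 * educ + 0.5 * exper + rng.normal(0, 4, n)

b_line = ols(educ, income)                             # a line: intercept and one slope
b_plane = ols(np.column_stack([educ, exper]), income)  # a plane: intercept and two slopes

print("line :", b_line.round(2))
print("plane:", b_plane.round(2))
```

With one regressor the fit is a line; with two it is a plane. In a real project you would normally use a regression package that also reports standard errors.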

Classical Linear Model (CLM): Population model is linear in parameters:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u\]

OLS: Estimating Actual Linear Relationship?

  • Best linear approximation; ‘average slopes’

  • Causal or not

Identifying restrictions; CLM model assumptions

Some tests depend on normality of the error term; others rely on an “asymptotic” justification that holds with a large enough sample

“… Regression coefficients have an ‘average derivative’ interpretation. In multivariate regression models this interpretation is unfortunately complicated by the fact that the OLS slope vector is a matrix-weighted average of the gradient of the CEF. Matrix-weighted averages are difficult to interpret except in special cases (see Chamberlain and Leamer, 1976).”

-Angrist and Pischke [year?]

10.3 How to specify a regression – some considerations

Functional form?

(e.g., linear or “loglinear”? Include quadratic terms?)

Impose restrictions?

Which dependent variable?

  • Is this meaningful to your question and interpretable?

  • Is it relevant to what you are looking for (e.g., available for right years and countries)?

  • Is it reliably collected?

  • Specified variables in logs? Linearly? Categorically?

  • Aggregated at what level?

Which right-hand side (rhs) variables? The focal variables and control variables

Typically, you care about:

The effect of one (or a few) independent variables on the dependent variable,

e.g., education on wages.

(Although you might have more complicated hypotheses/relationships to test, involving differences between coefficients etc.)

\(\rightarrow\) You should focus on credibly identifying this relationship.

Other rhs variables are typically controls (e.g., control for parent’s education, control for IQ test scores).

Be careful not to include potentially “endogenous” variables as controls, as this can bias all coefficients (more on this later).

Be careful about putting variables on the right hand side that are determined after the outcome variable (Y, the dependent variable).

10.4 Endogeneity

You care about estimating the impact of a variable \(x_1\), on \(y\).

Consider the example of regressing income at age 30 on years of education to try to get at the effect of education on income.

\(x_1\): years of education

\(x_2 ... x_k\): set of “control” variables

\(y\): income at age 30

You regress \(y\) on \(x_1, x_2, \dots, x_k\).

Suppose the true relationship (which you almost never know for sure in economics) is

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u\]

For unbiasedness/consistency of all your estimated terms, the key requirement is:

\(E(u|x_1, x_2, \dots, x_k) = 0\), which implies that all of the explanatory variables are exogenous.

Alternatively, the estimates are still ‘consistent’ under the weaker condition that \(E(u) = 0\) and \(Cov(x_j,u) = 0\), for \(j = 1, 2, \dots, k\).

There are various reasons why the above assumption might not hold; various causes of what we call “endogeneity”. Two examples are reverse causality and omitted variable bias.

Reverse causality

Education may affect income at age 30, but could income at age 30 also affect years of education?

This is probably not a problem for this example, because the education is usually finished long before age 30 (even I finished at age 30 on the nose).

However, in other examples it is an issue (e.g., consider regressing body weight on income, or vice versa).

Also, if the measure of education were determined years later, this might be a problem. For example, if your measure of years of education was based on self-reports at age 30, maybe those with a lower income would under-report, e.g., if they were ashamed to be waiting tables with a Ph.D.

Omitted variables: a third, omitted factor may affect both

Intelligence may affect both education obtained and income at age 30.

Macro/aggregate: With variation across time, there may be a common trend. E.g., suppose I were to regress “average income” on “average education” for the UK, using only a time series with one observation per year. A “trend term”, perhaps driven by technological growth, may be leading to increases in education as well as increased income.

The omitted variable bias formula; interpreting/signing the bias

You care about estimating the impact of a variable \(x_1\) on \(y\), e.g.,

\(x_1\): years of education

\(y\): income at age 30

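You estimate

\[y = \beta_0 + \beta_1 x_1 + v\]

But the true relationship is

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\]

where \(x_2\) is an unobserved or unobservable variable, say “intelligence” or “personality”, so that the error term in the equation you estimate is \(v = \beta_2 x_2 + u\).

Your estimate of the slope is likely to be biased (and “inconsistent”).

The “omitted variable (asymptotic) bias” is:

\[\operatorname{plim} \hat{\beta}_1 = \beta_1 + \beta_2 \delta_1, \quad \text{where } \delta_1 = \frac{Cov(x_1, x_2)}{Var(x_1)}\]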
In other words, the coefficient you estimate will “converge to” the true coefficient plus a bias term.

The bias is the product:

[Effect of the omitted variable on the outcome] \(\times\) [“effect” of the omitted variable on the variable of interest; strictly, the slope from a regression of the omitted variable on the variable of interest]

E.g., [effect of intelligence on income] \(\times\) [“effect” of intelligence on years of schooling]

This can be helpful in understanding whether your estimates may be biased, and if so, in which direction!

This is also a helpful mechanical relationship between “short” and “long” regressions, whether or not there is a causal relationship.
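The mechanical “short versus long regression” relationship can be checked numerically. The sketch below uses simulated data in which `ability` plays the role of the omitted variable; the names, coefficients, and the helper `ols` are hypothetical. In the sample, the short-regression slope equals the long-regression slope plus (coefficient on the omitted variable) times (slope from regressing the omitted variable on the variable of interest).

```python
import numpy as np

def ols(X, y):
    """OLS coefficients (intercept first)."""
    design = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Simulated data; all names and numbers are hypothetical.
rng = np.random.default_rng(1)
n = 5_000
ability = rng.normal(0, 1, n)                       # omitted in the short regression
educ = 12 + 1.5 * ability + rng.normal(0, 2, n)     # variable of interest
income = 10 + 2.0 * educ + 3.0 * ability + rng.normal(0, 5, n)

short = ols(educ, income)                              # income on educ only
long_ = ols(np.column_stack([educ, ability]), income)  # income on educ and ability
delta = ols(educ, ability)                             # ability on educ

# Short slope = long slope + (coef. on ability) * (slope of ability on educ)
implied_short_slope = long_[1] + long_[2] * delta[1]

print("short slope  :", round(short[1], 4))
print("implied slope:", round(implied_short_slope, 4))
print("long slope   :", round(long_[1], 4))
```

Here both terms of the bias are positive (ability raises income and is positively associated with education), so the short regression overstates the slope on education.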

Control strategies

Control for “X2-Xk” variables that have direct effects on Y; this will reduce omitted variable bias (if these variables are correlated with your “X1” of interest).

Including controls can also make your estimates more precise.

If you put in an “Xk” variable that doesn’t actually have a true effect on Y, it will tend to make your estimates less precise. However, it will only lead to a bias if it is itself endogenous (and correlated with your X1 of interest).

If you can’t observe these, you may use “proxies” for these to try to reduce omitted variable bias. E.g., IQ-test scores may be used as proxies for intelligence. Housing value might be used as a proxy for wealth.

“Bad control”

“Some variables are bad controls and should not be included in a regression model even when their inclusion might be expected to change the short regression coefficients. Bad controls are variables that are themselves outcome variables in the notional experiment at hand. That is, bad controls might just as well be dependent variables too.” - (Angrist and Pischke)

– They could also be interpreted as endogenous variables.

“Once we acknowledge the fact that college affects occupation, comparisons of wages by college degree status within occupation are no longer apples to apples, even if college degree completion is randomly assigned.”

– The question here was whether to control for the category of occupation, not the college degree.

“It is also incorrect to say that the conditional comparison captures the part of the effect of college that is ‘not explained by occupation’”

“so we would do better to control only for variables that are not themselves caused by education.” - Angrist and Pischke
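A small simulation can illustrate the “bad control” problem. In this sketch (all names, thresholds, and coefficients are invented), college is randomly assigned, occupation depends on both college and unobserved ability, and occupation has no direct effect on wages. Even so, adding the occupation dummy distorts the estimated college effect.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients (intercept first)."""
    design = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Simulated data; all names and numbers are hypothetical.
rng = np.random.default_rng(2)
n = 20_000
college = rng.integers(0, 2, n).astype(float)   # randomly assigned
ability = rng.normal(0, 1, n)                   # unobserved
# Occupation is itself an outcome: it depends on college AND ability.
white_collar = (college + ability + rng.normal(0, 1, n) > 0.5).astype(float)
wage = 10 + 2.0 * college + 3.0 * ability + rng.normal(0, 1, n)  # true effect: 2.0

effect_no_control = ols(college, wage)[1]
effect_bad_control = ols(np.column_stack([college, white_collar]), wage)[1]

print("college coefficient, no control        :", round(effect_no_control, 2))
print("college coefficient, occupation control:", round(effect_bad_control, 2))
```

Within an occupation, college graduates have lower average ability than non-graduates in this setup (non-graduates needed more ability to get the same job), so the within-occupation comparison is no longer apples to apples.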

Fixed effects/difference-in-between

The net effect of omitted variables and the truly random term may have fixed and varying components. There may be a term “Ci” that is specific to an individual or “unit”, but that does not change over time. For example, an individual may be more capable of earning, a firm may have a particularly good location, and a country may have a particularly high level of trust in institutions. There may also be a term that varies across units and over time. An individual may experience a particular negative shock to her income, a firm may be hit by a lawsuit, and a country may have a banking scandal.

If this Ci part of the “error term” may be correlated with the independent variable of interest, X1, it may help to “difference this out” by doing a Fixed Effects Regression. This essentially includes a dummy variable for each individual (or “unit”), but these dummies are usually not reported. The resulting coefficients are the same ones you would get if you “de-meaned” every X and Y variable before running the regression.

By “demeaned” I mean replace each \(X_{it}\) with \(X_{it} - \bar{X}_i\) and each \(Y_{it}\) with \(Y_{it} - \bar{Y}_i\), where the bars indicate “the mean of this variable for individual i”.
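The equivalence between unit dummies and de-meaning can be verified on a simulated panel. This is only a sketch: the panel dimensions, the coefficients, and names such as `unit_means` are assumptions made for illustration. The unit effect is built to be correlated with X, so pooled OLS is biased while the two fixed-effects calculations agree.

```python
import numpy as np

# Simulated balanced panel; all names and numbers are hypothetical.
rng = np.random.default_rng(3)
n_units, n_periods = 200, 5
unit = np.repeat(np.arange(n_units), n_periods)   # unit id for each row
c = rng.normal(0, 2, n_units)                     # fixed unit effect C_i
x = c[unit] + rng.normal(0, 1, unit.size)         # X1 is correlated with C_i
y = 1.0 * x + c[unit] + rng.normal(0, 1, unit.size)   # true slope: 1.0

def unit_means(v):
    """Mean of v within each unit, repeated for every row of that unit."""
    return (np.bincount(unit, weights=v) / np.bincount(unit))[unit]

# (1) De-mean X and Y within each unit, then run OLS (no constant needed).
x_dm, y_dm = x - unit_means(x), y - unit_means(y)
beta_within = (x_dm @ y_dm) / (x_dm @ x_dm)

# (2) Regress Y on X plus one dummy per unit.
dummies = (unit[:, None] == np.arange(n_units)).astype(float)
beta_lsdv = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)[0][0]

# (3) Pooled OLS that ignores the unit effect, for comparison.
pooled_design = np.column_stack([np.ones(unit.size), x])
beta_pooled = np.linalg.lstsq(pooled_design, y, rcond=None)[0][1]

print(round(beta_within, 4), round(beta_lsdv, 4), round(beta_pooled, 4))
```

Note that standard errors from a plain regression on de-meaned data need a degrees-of-freedom correction; panel routines in statistical packages handle this for you.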

Instrumental variables

A variable Z may be used as an “instrument” if it

  1. “causes” the X1 variable of interest but has no independent effect on Y, and

  2. is not correlated with the true error term.

For example, it might be argued (debatable) that if one’s parents had a job near a good university, this would increase one’s chances of going to a good university. To use “distance to nearest university” as an instrument you would have to argue:

  1. There is no direct effect of living near a good university on later income.

  2. The probability of living near a good university is not caused by a third unobserved factor (e.g., parent’s interest in children’s success) that might also affect later income.

As suggested above, it is hard to find a convincingly ‘valid’ instrument. This “exclusion restriction” cannot itself be easily tested, and is largely justified theoretically. (If you have multiple instruments there is something called an ‘overidentification test’ but it is controversial).

In addition, there are other issues with IV techniques that some argue make them unreliable. In particular, consider (and read about) issues of

  • weak instruments

  • heterogeneous effects (heterogeneity), differential ‘compliance’, and the ‘Local Average Treatment Effect’ (‘LATE’).


One form of instrumental variables (IV) technique is called “two stage least squares” (2SLS). This essentially involves regressing X1 on Z (and other controls) and obtaining a predicted value of X1 from this equation, \(\hat{X}_1\), and then regressing Y on \(\hat{X}_1\) (and the same set of other controls) but “excluding” Z from this second-stage regression.

You should generally report both the first and second stages in a table, and “diagnostics” of this instrument.
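The two stages can be sketched by hand on simulated data. Everything below is hypothetical (the instrument `z`, the confounder `ability`, the coefficients); the instrument is valid by construction, which you can never guarantee with real data.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients (intercept first)."""
    design = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Simulated data; all names and numbers are hypothetical.
rng = np.random.default_rng(4)
n = 20_000
ability = rng.normal(0, 1, n)                  # unobserved confounder
z = rng.normal(0, 1, n)                        # instrument: affects y only through x1
x1 = 0.8 * z + ability + rng.normal(0, 1, n)   # endogenous variable of interest
y = 1.0 * x1 + 2.0 * ability + rng.normal(0, 1, n)   # true effect of x1: 1.0

beta_ols = ols(x1, y)[1]                 # biased: x1 is correlated with ability

first_stage = ols(z, x1)                 # stage 1: regress x1 on z
x1_hat = first_stage[0] + first_stage[1] * z
beta_2sls = ols(x1_hat, y)[1]            # stage 2: regress y on predicted x1

print("OLS :", round(beta_ols, 3))
print("2SLS:", round(beta_2sls, 3))
```

The point estimate from this manual second stage matches 2SLS, but the standard errors a plain OLS routine would report for it are not correct, so use a dedicated IV routine for inference.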

Some other issues and “diagnostics”

Time series (and panel) data: issues of autocorrelation, lag structure, trends, non-stationarity

Non-normal error terms, small sample

Categorical dependent variable: consider Logit/Probit if binary, Multinomial logit if categorical, Poisson if ‘count’ data; other variants/models

Bounded/censored dependent variable: Consider Tobit and other models

Sample selection issues; self-selection, selectivity, etc.

Missing values/variables … Imputation

Errors in variables (classical, otherwise)

The meaning of R-squared; when is it useful/important?


Heteroskedasticity

Under heteroskedasticity, OLS coefficients are still unbiased/consistent, but may not be efficient

  • The usual estimated standard errors (and the tests based on them) are not unbiased/consistent

(Autocorrelation: similar considerations, but it can be a sign of a misspecified dynamic model)

Responses (to heteroskedasticity and “simple” autocorrelation)

“Feasible” GLS (only consider doing with lots of data) or

Regular OLS with robust standard errors (or clustered in a certain way)

  • “Test” for heteroskedasticity, and if you fail to reject homoskedasticity, say “whew, I can ignore this”?

(DR: Would you say “I fail to strongly statistically reject the possibility that my car’s brakes are not working, therefore I will drive the car on a mountain pass?”)

Controversial; I don’t like this because the test may not be powerful enough. So use ‘robust’ anyway.
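To see what “robust” changes, the sketch below computes the classical standard errors and the basic White (HC0) heteroskedasticity-robust standard errors by hand. The data are simulated with an error variance that grows with x; the numbers are made up, and in practice you would request robust standard errors from your regression package rather than code the formula yourself.

```python
import numpy as np

# Simulated data; all names and numbers are hypothetical.
rng = np.random.default_rng(5)
n = 5_000
x = rng.uniform(0, 10, n)
u = rng.normal(0, 1, n) * 0.1 * x**2      # error spread grows with x
y = 1.0 + 0.5 * x + u

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                  # OLS coefficients (unchanged by the SE choice)
resid = y - X @ beta

# Classical SEs: assume one common error variance.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 "sandwich" SEs: let each observation have its own error variance.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("slope:", round(beta[1], 3))
print("classical SE:", round(se_classical[1], 4), " robust SE:", round(se_robust[1], 4))
```

The coefficients are identical under both approaches; only the standard errors (and hence the tests) differ. In this particular setup the classical standard error on the slope is too small.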

Interpreting your results 1: test for significance

Simple differences (not in a regression): A variety of parametric, nonparametric and “exact” tests

Regression coefficients: t-tests

  • Difference from zero (usually 2-sided)

  • Difference from some hypothesis (e.g., difference from unity)

  • Joint test of coefficients

Evidence for ‘small or no effect’: one-sided t-test of, e.g., \(H_0: \beta \geq 10\) vs \(H_A: \beta < 10\), where 10 is a ‘small value’ in this context

Joint significance of a set of coefficients: F-tests

H0: all tested coefficients are truly =0

HA: at least one coefficient has a true value ≠0
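The F-statistic for a joint test can be built from the sums of squared residuals of a restricted and an unrestricted regression. This is a sketch on simulated data (names and coefficients are invented), and it assumes homoskedastic errors; it tests whether the coefficients on `x2` and `x3` are jointly zero.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS regression of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

# Simulated data; all names and numbers are hypothetical.
rng = np.random.default_rng(8)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.0 * x3 + rng.normal(0, 1, n)

ones = np.ones(n)
X_unrestricted = np.column_stack([ones, x1, x2, x3])
X_restricted = np.column_stack([ones, x1])   # imposes H0: coefficients on x2 and x3 are 0

q = 2                                        # number of restrictions tested
df_resid = n - X_unrestricted.shape[1]       # residual degrees of freedom
ssr_u, ssr_r = ssr(X_unrestricted, y), ssr(X_restricted, y)
f_stat = ((ssr_r - ssr_u) / q) / (ssr_u / df_resid)

print("F statistic:", round(f_stat, 2))
```

With 2 restrictions and a large sample, the 5% critical value is roughly 3.0, so a much larger F-statistic leads you to reject H0. Rejecting does not tell you which of the tested coefficients is nonzero.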

Interpreting results 2: magnitudes & sizes of effects

In a linear model in levels-on-levels the coefficients on continuous variables have a simple “slope interpretation”

Note: assuming a homogeneous effect, otherwise it gets complicated.

Dummy variables have a “difference in means, all else equal” interpretation.

But be careful to describe, understand, and explain the estimated effects (or “linear relationships”) in terms of the units of the variables (e.g., the impact of years of education on thousands of pounds of pre-tax salary at age 30).

Transformed/nonlinear variables

When some variables are transformed, e.g., expressed in logarithms, interpretation is a little more complicated (but not too difficult). Essentially, the impact of/on a logged variable represents a “proportional” or “percentage-wise” impact. Look this up and describe the effects correctly.
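As a reminder, the three standard cases with a single regressor are sketched below. The symbols \(x\), \(\beta_0\), \(\beta_1\) and \(u\) are generic notation for this illustration, and the approximations are only good for small changes.

Log-level (a one-unit change in \(x\) goes with roughly a \(100\beta_1\) percent change in \(y\)):

\[\log(y) = \beta_0 + \beta_1 x + u \quad\Rightarrow\quad \%\Delta y \approx (100\,\beta_1)\,\Delta x\]

Level-log (a one percent change in \(x\) goes with roughly a \(\beta_1/100\) unit change in \(y\)):

\[y = \beta_0 + \beta_1 \log(x) + u \quad\Rightarrow\quad \Delta y \approx (\beta_1/100)\,\%\Delta x\]

Log-log (\(\beta_1\) is an elasticity):

\[\log(y) = \beta_0 + \beta_1 \log(x) + u \quad\Rightarrow\quad \%\Delta y \approx \beta_1\,\%\Delta x\]

For larger coefficients in the log-level case, use the exact expression \(\%\Delta y = 100\,(e^{\beta_1 \Delta x} - 1)\). For example, \(\beta_1 = 0.08\) gives about 8.3 percent per unit of \(x\), close to the approximation, while \(\beta_1 = 0.5\) gives about 65 percent, not 50.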

In nonlinear models

(e.g., Logit, Tobit, Poisson/Exponential) the marginal effect of a variable is not constant; it depends on the values of the other variables and the error/unobservable term. But you can express things like “marginal effect averaged over the observed values” or (for some models) the “proportional percentage effect.”
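For a logit, the “marginal effect averaged over the observed values” can be computed directly once you have coefficient estimates. In this sketch the coefficients `b0` and `b1` are hypothetical stand-ins for fitted values, and `educ` is simulated; the point is only that the marginal effect differs across observations.

```python
import numpy as np

def logistic(v):
    """Logistic function: maps the linear index to a probability."""
    return 1.0 / (1.0 + np.exp(-v))

# Simulated regressor and hypothetical "fitted" logit coefficients.
rng = np.random.default_rng(6)
educ = rng.normal(13, 2, 1_000)
b0, b1 = -6.0, 0.4

p = logistic(b0 + b1 * educ)              # predicted probability for each observation
marginal_effects = b1 * p * (1 - p)       # dP/d(educ), different for each observation

ame = marginal_effects.mean()             # average marginal effect
p_mean = logistic(b0 + b1 * educ.mean())
me_at_mean = b1 * p_mean * (1 - p_mean)   # marginal effect at the mean of educ

print("range of marginal effects:", marginal_effects.min().round(3), marginal_effects.max().round(3))
print("average marginal effect  :", ame.round(3))
print("marginal effect at mean  :", me_at_mean.round(3))
```

The raw coefficient `b1` is therefore not the effect on the probability; the marginal effect is at most `b1 / 4`, reached where the predicted probability is 0.5.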

Interpreting results 3: interaction terms

You may run a regression such as:

\[INCOME = A + B_1 \times YEARS\_EDUC + B_2 \times FEMALE \times YEARS\_EDUC + B_3 \times FEMALE + U\]

Where FEMALE is a dummy variable that =1 if the observed individual is a woman and =0 if he is a man.

How do you interpret each coefficient estimate?

A: A constant “intercept”; fairly meaningless by itself unless the other variables are expressed as differences from the mean, in which case it represents the mean income.

\(B_1\) : “Effect” of years of education on income (at age 30, say) for males

\(B_2\) : “Additional Effect” of years of education on income for females relative to males

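\(B_3\) : “Effect” of being female on income for someone with zero years of education. Because of the interaction term, the female-male difference at a given level of education is \(B_3 + B_2 \times YEARS\_EDUC\), so it changes with education.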

What about \(B_1 + B_2\)?

= “Effect” of years of education on income for females
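These interpretations can be checked on simulated data where the true coefficients are known. The numbers below are invented for illustration (A = 5, B1 = 2, B2 = 0.5, B3 = -8); the sketch recovers the slope for each group and the female-male gap at a chosen education level.

```python
import numpy as np

# Simulated data; all names and numbers are hypothetical.
rng = np.random.default_rng(7)
n = 10_000
years_educ = rng.normal(13, 2, n)
female = rng.integers(0, 2, n).astype(float)     # 1 = woman, 0 = man
income = (5 + 2.0 * years_educ + 0.5 * female * years_educ
          - 8.0 * female + rng.normal(0, 3, n))

# Columns: constant, YEARS_EDUC, FEMALE x YEARS_EDUC, FEMALE
X = np.column_stack([np.ones(n), years_educ, female * years_educ, female])
a, b1, b2, b3 = np.linalg.lstsq(X, income, rcond=None)[0]

slope_males = b1                  # income per year of education, men
slope_females = b1 + b2           # income per year of education, women
gap_at_12_years = b3 + b2 * 12    # female-male difference at 12 years of education

print("slope, males  :", round(slope_males, 2))
print("slope, females:", round(slope_females, 2))
print("gap at 12 yrs :", round(gap_at_12_years, 2))
```

To test whether the slope for women differs from zero you need the standard error of the sum of the two coefficients, which depends on their covariance; most packages provide a command for testing linear combinations of coefficients.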

10.5 (To add or integrate into the above)

  • Simple statistics and simple tests

  • Hypothesis testing

  • (OLS vs) Probit, Tobit, and Nonlinear Specifications

  • Clustered standard errors

  • Time series data issues

10.6 Formatting figures and tables

(Give an annotated “good” and “bad” example for each)

  1. Summary statistics

  2. Simple tests

  3. Graphs and figures

  4. Regression tables (small)

  5. Regression tables (many columns or rows)