# 10 Econometric, statistical, and data issues

Many (most?) projects are *empirical*, involving econometrics/statistics. It doesn’t make sense to give an entire course on Statistics, Econometrics, and Data Science here. However, I think it’s helpful to give somewhat of an applied overview, to:

- link to key resources,
- refresh your memory of these topics,
- help you consider how to *do* this stuff in a real project (rather than in exams and problem sets),
- give a sense of one economist’s (my) impressions of how to choose an approach, and
- point out some common misunderstandings/mistakes students make in this area.

As this cannot be comprehensive, I suggest referring to other resources (texts etc) for more detailed considerations.

## 10.1 Some recommended applied econometrics and statistics resources

Angrist and Pischke: Mastering metrics

Angrist, J. D., and J. S. Pischke. Mostly Harmless Econometrics: An Empiricist’s Companion.

Peter Kennedy’s “A Guide to Econometrics”

“Introductory Econometrics” by Wooldridge

“The Mixtape”: https://www.scunning.com/mixtape.html

### Time series

Economics 452 time series with Stata **–** *econ.queensu.ca/faculty/gregory/econ 452/manual.pdf*

2 Working with economic and financial data in Stata **(Chris Baum)**

Kennedy, *A Guide to Econometrics*, Chapter 18, “Time Series Econometrics”

### With R

- *Introduction to Econometrics with R*: a free online interactive text by Christoph Hanck, Martin Arnold, Alexander Gerber and Martin Schmelzer, based on Stock and Watson

## 10.2 Regression analysis, regression logic and meaning

We need to consider:

What is a regression? When should you use one?

How to specify the regression?

Which dependent variable do we use?

Which right-hand side variables?

- Which is/are the focal variable(s) and which are ‘control variables’?

Endogeneity and identification

Other statistical issues (e.g., functional form, error structure)

How will you interpret your results?

How to create a regression table and put it in your paper.

Writing about regression (and statistical) analysis; yours and others’.

### What is regression? When should you use one?

**A way of fitting a line (plane) through a bunch of dots.**

In multiple dimensions

It may have a causal interpretation (or not)

Classical Linear Model (CLM): the population model is linear in parameters:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u\]

### OLS: Estimating Actual Linear Relationship?

Best linear approximation; ‘average slopes’

Causal or not
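To make the ‘best linear approximation’ and ‘average slopes’ idea concrete, here is a minimal sketch in Python (assuming NumPy is available). The quadratic relationship and the grid of x values are made up for illustration. The true relationship is nonlinear, yet OLS still returns a well-defined slope.

```python
import numpy as np

# Hypothetical data: an evenly spaced grid of x values and a nonlinear outcome.
x = np.linspace(0.0, 1.0, 101)
y = x ** 2  # no noise, so y is exactly the conditional expectation of y given x

# OLS of y on a constant and x.
X = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]

# The derivative of x^2 is 2x; average it over the observed x values.
average_derivative = np.mean(2 * x)

print(f"OLS slope:          {slope:.4f}")
print(f"Average derivative: {average_derivative:.4f}")
```

In this symmetric example the OLS slope and the simple average derivative coincide. In general OLS gives a *weighted* average of the slopes, so the two need not match exactly.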

### Identifying restrictions; CLM model assumptions

**Some results/tests depend on normality of the errors; others have an “asymptotic”
justification that applies with a large enough sample**

“… Regression coefficients have an ‘average derivative’ interpretation. In multivariate regression models this interpretation is unfortunately complicated by the fact that the OLS slope vector is a matrix-weighted average of the gradient of the CEF. Matrix-weighted averages are difficult to interpret except in special cases (see Chamberlain and Leamer, 1976).”

-Angrist and Pischke [year?]

## 10.3 How to specify a regression – some considerations

**Functional form?**

(e.g., linear or “loglinear”? Include quadratic terms?)

**Impose restrictions?**

### Which dependent variable?

Is this meaningful to your question and interpretable?

Is it relevant to what you are looking for (e.g., available for right years and countries)?

Is it reliably collected?

Specified variables in logs? Linearly? Categorically?

Aggregated at what level?

### Which right-hand side (rhs) variables? The focal variables and control variables

**Typically, you care about:**

The effect of one (or a few) independent variable(s) on the dependent variable,

e.g., education on wages.

*(Although you might have more complicated hypotheses/relationships to
test, involving differences between coefficients etc.)*

\(\rightarrow\) You should focus on credibly identifying *this* relationship.

Other rhs variables are typically *controls* (e.g., control for parents’
education, control for IQ test scores).

Be careful not to include potentially “endogenous” variables as
controls, as this can bias *all coefficients* (more on this later).

Be careful about putting variables on the right hand side that are
determined *after the outcome variable* (Y, the dependent variable).

## 10.4 Endogeneity

You care about estimating the impact of a variable \(x_1\) on \(y\).

Consider the example of regressing income at age 30 on years of
education to try to get at the *effect* of education on income.

**\(x_1\)**: years of education

**\(x_2 ... x_k\):** set of “control” variables

**\(y\):** income at age 30

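You regress

\[y = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_k x_k + \hat{u}\]

Suppose the *true* relationship (which you almost
never know for sure in economics) is

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u\]

For unbiasedness/consistency of all your estimated terms, the key requirement is:

\(E(u|x_1, x_2, \dots, x_k) = 0\), which implies that all of the explanatory variables are exogenous.

Alternatively, OLS is still ‘consistent’ under the weaker condition **\(E(u) = 0\) and \(Cov(x_j,u) = 0\),
for \(j = 1, 2, \dots, k\)**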

There are various reasons why the above assumption might not hold; various causes of what we call “endogeneity”. Two examples are reverse causality and omitted variable bias.

### Reverse causality

**Education may affect income at age 30, but could income at age 30 also
affect years of education?**

*This is probably not a problem for this example, because education is usually finished long before age 30 (even I finished at age 30 on the nose).*

*However, in other examples it is an issue (e.g., consider regressing
body weight on income, or vice versa).*

*Also, if the measure of education were determined years later, this
might be a problem. For example, if your measure of years of education
was based on self-reports at age 30, maybe those with a lower income
would under-report, e.g., if they were ashamed to be waiting tables with
a Ph.D.*

*Or a third, omitted factor may affect both:*

Intelligence may affect both education obtained and income at age 30

Macro/aggregate: With variation across time, there may be a common trend. E.g., suppose I were to regress “average income” on “average education” for the UK, using only a time series with one observation per year. A “trend term”, perhaps driven by technological growth, may be leading to increases in education as well as increased income.

### The omitted variable bias formula; interpreting/signing the bias

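You care about estimating the impact of a variable \(x_1\) on \(y\), e.g.,

\(x_1\): years of education

\(y\): income at age 30

You estimate

\[y = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + \tilde{u}\]

But the true relationship is

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\]

Where \(x_2\) is an unobserved or unobservable variable, say “intelligence” or “personality”.

Your estimate of the slope is likely to be biased (and “inconsistent”).

The “omitted variable (asymptotic) bias” is:

\[\text{plim}\ \tilde{\beta}_1 - \beta_1 = \beta_2 \delta_1\]

where

\[\delta_1 = \frac{Cov(x_1, x_2)}{Var(x_1)}\]

is the slope from a regression of \(x_2\) on \(x_1\).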

In other words, the coefficient you estimate will “converge to” the true coefficient plus a bias term.

**The bias is the product**:

[Effect of the omitted variable on the outcome] \(\times\) [“effect” of omitted variable on variable of interest]

E.g., [effect of intelligence on income] \(\times\) [“effect” of intelligence on years of schooling]

*This can be helpful in understanding whether your estimates may be
biased, and if so, in which direction!*

This is also a helpful mechanical relationship between “short” and “long” regressions, whether or not there is a causal relationship.
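The mechanical “short” versus “long” regression relationship can be checked numerically. The sketch below (Python with NumPy; the data-generating process and all coefficient values are invented for illustration) fits both regressions plus an auxiliary regression of the omitted variable on the included one. It then confirms that the short coefficient equals the long coefficient plus the bias term.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical data-generating process (all numbers are made up).
intelligence = rng.normal(size=n)                        # x2: omitted in the short regression
education = 12 + 2 * intelligence + rng.normal(size=n)   # x1: correlated with x2
income = 10 + 1.0 * education + 3.0 * intelligence + rng.normal(size=n)

def ols(y, *regressors):
    """OLS coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_long = ols(income, education, intelligence)   # [constant, coef on x1, coef on x2]
b_short = ols(income, education)                # [constant, coef on x1 incl. bias]
delta = ols(intelligence, education)            # auxiliary regression of x2 on x1

bias = b_long[2] * delta[1]
print(f"long-regression coefficient on education:  {b_long[1]:.3f}")
print(f"short-regression coefficient on education: {b_short[1]:.3f}")
print(f"long coefficient + bias term:              {b_long[1] + bias:.3f}")
```

The last two printed numbers agree in any sample, which is why the formula is useful for signing the bias: here both components are positive, so the short regression overstates the coefficient on education.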

### Control strategies

Control for “X2-Xk” variables that have direct effects on Y; this will reduce omitted variable bias (if these variables are correlated with your “X1” of interest).

Including controls can also make your estimates more precise.

If you put in an “Xk” variable that doesn’t actually have a true
effect on Y, it will make your estimates *less* precise. However, it
will only lead to a bias if it is itself endogenous (and correlated to
your X1 of interest).

If you can’t observe these, you may use “proxies” for these to try to reduce omitted variable bias. E.g., IQ-test scores may be used as proxies for intelligence. Housing value might be used as a proxy for wealth.

**“Bad control”**

“Some variables are bad controls and should not be included in a regression model even when their inclusion might be expected to change the short regression coefficients. Bad controls are variables that are themselves outcome variables in the notional experiment at hand. That is, bad controls might just as well be dependent variables too.” - (Angrist and Pischke)

– They could also be interpreted as endogenous variables.

“Once we acknowledge the fact that college affects occupation, comparisons of wages by college degree status within occupation are no longer apples to apples, even if college degree completion is randomly assigned.”

– The question here was whether to control for the category of occupation, not the college degree.

“It is also incorrect to say that the conditional comparison captures the part of the effect of college that is ‘not explained by occupation’”

“… so we would do better to control only for variables that are not themselves caused by education.” - Angrist and Pischke
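A small simulation can illustrate the occupation example. This is only a sketch under invented assumptions (Python with NumPy): college is randomly assigned, “ability” is unobserved, occupation depends on both, and occupation has no direct effect on wages. All variable names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical simulation: college is randomly assigned; ability is unobserved.
ability = rng.normal(size=n)
college = rng.integers(0, 2, size=n)   # 0/1, independent of ability by construction
# Occupation is itself an outcome of college (and ability): a "bad control".
professional = (college + ability > 0.5).astype(float)
# The true effect of college on wages is 1.0; occupation has no direct effect here.
wage = 1.0 * college + 1.0 * ability + rng.normal(size=n)

def ols(y, *regressors):
    """OLS coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_no_control = ols(wage, college)[1]
b_bad_control = ols(wage, college, professional)[1]

print(f"college coefficient, no occupation control: {b_no_control:.2f}")
print(f"college coefficient, controlling for occupation: {b_bad_control:.2f}")
```

Without the occupation control the estimate is close to the true effect. Conditioning on occupation compares college graduates with non-graduates of *higher* average ability within each occupation, so the estimate is pulled well below the true value even though college was randomly assigned.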

### Fixed effects/difference-in-between

The net effect of omitted variables and truly random term may have fixed and varying components. There may be a term “Ci” that is specific to an individual or “unit”, but that does not change over time. For example, an individual may be more capable of earning, a firm may have a particularly good location, and a country may have a particularly high level of trust in institutions. There may also be a term that varies across units and over time. An individual may experience a particular negative shock to her income, a firm may be hit by a lawsuit, and a country may have a banking scandal.

If this Ci part of the “error term” may be correlated with the independent variable of interest, X1, it may help to “difference this out” by doing a Fixed Effects Regression. This essentially includes a dummy variable for each individual (or “unit”), but these dummies are usually not reported. The resulting coefficients are the same ones you would get if you “de-meaned” every X and Y variable before running the regression.

By “de-meaned” I mean replace each \(X_{it}\) with \(X_{it} - \bar{X}_i\) and each \(Y_{it}\) with \(Y_{it} - \bar{Y}_i\), where the bars indicate “the mean of this variable for individual i”.
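The equivalence between the dummy-variable regression and the de-meaned regression can be verified directly. The following is a minimal sketch in Python with NumPy on a made-up panel; the panel dimensions, coefficient values, and helper name `unit_means` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_units, n_periods = 50, 6

# Hypothetical panel: the unit effect c_i is correlated with x.
unit = np.repeat(np.arange(n_units), n_periods)
c_i = rng.normal(size=n_units)
x = c_i[unit] + rng.normal(size=unit.size)
y = 2.0 * x + 3.0 * c_i[unit] + rng.normal(size=unit.size)   # true slope is 2.0

def unit_means(v):
    """Mean of v within each unit, repeated for every observation of that unit."""
    return (np.bincount(unit, weights=v) / np.bincount(unit))[unit]

# (1) De-mean x and y by unit, then regress (no constant needed after de-meaning).
x_dm, y_dm = x - unit_means(x), y - unit_means(y)
beta_within = (x_dm @ y_dm) / (x_dm @ x_dm)

# (2) Regress y on x plus one dummy variable per unit.
dummies = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
beta_dummies = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)[0][0]

# (3) Pooled OLS that ignores the unit effects, for comparison.
X_pooled = np.column_stack([np.ones_like(x), x])
beta_pooled = np.linalg.lstsq(X_pooled, y, rcond=None)[0][1]

print(f"de-meaned: {beta_within:.3f}  dummies: {beta_dummies:.3f}  pooled: {beta_pooled:.3f}")
```

The first two estimates are numerically identical, while pooled OLS is biased upward in this setup because the unit effect is positively correlated with x.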

### Instrumental variables

A variable Z that

- “causes” the X1 variable of interest but has no independent effect on Y, and
- is not correlated with the true error term,

may be used as an “instrument”.

*For example*, it might be argued (debatably) that if one’s parents had
a job near a good university, this would increase one’s chances of
going to a good university. To use “distance to nearest university” as
an instrument you would have to argue that:

- there is no *direct* effect of living near a good university on later income, and
- the probability of living near a good university is not caused by a third unobserved factor (e.g., parents’ interest in their children’s success) that might also affect later income.

As suggested above, it is hard to find a convincingly ‘valid’ instrument. This “exclusion restriction” cannot itself be easily tested, and is largely justified theoretically. (If you have multiple instruments there is something called an ‘overidentification test’ but it is controversial).

In addition, there are other issues with IV techniques that some argue make them unreliable. In particular, consider (and read about) issues of

weak instruments

heterogeneous effects (heterogeneity), differential ‘compliance’, and the ‘Local Average Treatment Effect’ (‘LATE’).

### 2SLS

One form of instrumental variables (IV) technique is called “two stage
least squares” (2SLS). This essentially involves regressing X1 on Z (and
other controls) and obtaining a predicted value of X1 from this
equation, \(\hat{X}_1\), and then regressing Y on *this* (and the same set of other
controls) but “excluding” Z from this second-stage regression.

You should generally report both the first and second stages in a table, and “diagnostics” of this instrument.
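The two stages can be sketched by hand as follows (Python with NumPy). The data are simulated so that the instrument is valid by construction, which you can never guarantee with real data; all names and numbers are hypothetical, and there are no additional controls.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Hypothetical data: `confounder` is unobserved; z is a valid instrument by construction.
confounder = rng.normal(size=n)
z = rng.normal(size=n)                                  # affects x1, unrelated to confounder
x1 = 0.8 * z + confounder + rng.normal(size=n)          # endogenous regressor
y = 1.0 + 0.5 * x1 + 2.0 * confounder + rng.normal(size=n)   # true effect of x1 is 0.5

def ols(y, *regressors):
    """OLS coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage: regress x1 on z and form the predicted values.
a0, a1 = ols(x1, z)
x1_hat = a0 + a1 * z

# Second stage: regress y on predicted x1 (z itself is excluded).
beta_2sls = ols(y, x1_hat)[1]
beta_ols = ols(y, x1)[1]

# With one instrument and one endogenous regressor, 2SLS equals Cov(z, y) / Cov(z, x1).
beta_iv_ratio = np.cov(z, y)[0, 1] / np.cov(z, x1)[0, 1]

print(f"OLS: {beta_ols:.3f}  2SLS: {beta_2sls:.3f}  IV ratio: {beta_iv_ratio:.3f}")
```

This is for intuition only: the standard errors from a manually run second stage are not correct, so in practice use a dedicated IV routine in your statistical package.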

### Some other issues and “diagnostics”

Time series (and panel) data: issues of autocorrelation, lag structure, trends, non-stationarity

Non-normal error terms, small sample

Categorical dependent variable: consider Logit/Probit if binary, Multinomial logit if categorical, Poisson if ‘count’ data; other variants/models

Bounded/censored dependent variable: Consider Tobit and other models

Sample selection issues; self-selection, selectivity, etc.

Missing values/variables … Imputation

Errors in variables (classical, otherwise)

The meaning of R-squared; when is it useful/important?

### Heteroskedasticity

OLS coefficients are still unbiased/consistent but maybe not efficient

- Estimated standard errors of estimator/tests are *not* unbiased/consistent

(Autocorrelation: similar considerations, but it can be a sign of a misspecified dynamic model)

#### Responses (to heteroskedasticity and “simple” autocorrelation)

“Feasible” GLS (only consider doing with lots of data) *or*

Regular OLS with robust standard errors (or clustered in a certain way)

- “Test” for heteroskedasticity, and if you fail to reject homoskedasticity say “whew, I can ignore this”?

(DR: Would you say “I fail to strongly statistically reject the possibility that my car’s brakes are not working, therefore I will drive the car on a mountain pass?”)

Controversial; I don’t like this because the test may not be powerful enough. So use ‘robust’ anyway.
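To show what ‘robust’ standard errors do, here is a sketch (Python with NumPy) that computes classical and heteroskedasticity-robust standard errors by hand on simulated data. It uses the basic White “HC0” sandwich formula; statistical packages often apply small-sample variants, so their numbers may differ slightly. The data-generating process is invented.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000

# Hypothetical heteroskedastic data: the spread of the error grows with |x|.
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical standard errors: sigma^2 * (X'X)^{-1}, valid only under homoskedasticity.
k = X.shape[1]
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# Robust (White / HC0) standard errors: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(f"slope estimate:      {beta[1]:.3f}")
print(f"classical SE(slope): {se_classical[1]:.4f}")
print(f"robust SE(slope):    {se_robust[1]:.4f}")
```

The coefficient estimate is the same either way; only the standard errors change. In this setup the classical standard error on the slope is too small, so tests based on it would reject too often.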

### Interpreting your results 1: test for significance

**Simple differences (not in a regression): A variety of parametric,
nonparametric and “exact” tests**

**Regression coefficients: t-tests**

Difference from zero (usually 2-sided)

Difference from some hypothesis (e.g., difference from unit)

Joint test of coefficients

Evidence for ‘small or no effect’: one-sided t-test of the coefficient \(\beta\), e.g., \(H_0: \beta \geq 10\) vs \(H_A: \beta < 10\), where 10 is a ‘small value’ in this context

### Joint significance of a set of coefficients: F-tests

H0: all tested coefficients are truly =0

HA: at least one coefficient has a true value ≠0
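One common way to compute the F-statistic compares the sum of squared residuals (SSR) from the unrestricted model with that from a restricted model that imposes H0. This version assumes homoskedastic errors. A minimal sketch in Python with NumPy, using simulated data in which the two tested coefficients are truly zero (all names and values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Hypothetical data: x2 and x3 truly have zero coefficients.
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

def ssr(y, *regressors):
    """Sum of squared residuals from OLS of y on a constant and the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid

ssr_unrestricted = ssr(y, x1, x2, x3)   # all regressors included
ssr_restricted = ssr(y, x1)             # imposes H0: coefficients on x2 and x3 are zero

q = 2   # number of restrictions being tested
k = 4   # parameters in the unrestricted model, including the constant
f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k))

print(f"F-statistic for H0 (x2 and x3 coefficients both zero): {f_stat:.3f}")
```

You would compare `f_stat` with a critical value from the F distribution with `q` and `n - k` degrees of freedom (from tables or your statistical software); a large value is evidence against H0.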

### Interpreting results 2: magnitudes & sizes of effects

**In a linear model in levels-on-levels the coefficients on continuous
variables have a simple “slope interpretation”**

Note: assuming a homogeneous effect, otherwise it gets complicated.

**Dummy variables have a “difference in means, all else equal”
interpretation.**

But be careful to describe, understand, and explain the estimated
effects (or “linear relationships”) in terms of the *units* of the
variables (e.g., impact of *years of education* on *thousands of pounds
of salary at age 30, pre-tax*).

**Transformed/nonlinear variables**

When some variables are transformed, e.g., expressed in logarithms, interpretation is a little more complicated (but not too difficult). Essentially, the impact of/on logged variables represents a “proportional” or “percentage-wise” impact. Look this up and describe the effects correctly.
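As a small worked example of the log interpretation, suppose (hypothetically) that a regression of log income on years of education gives a coefficient of 0.10. The sketch below, in Python, compares the usual “multiply by 100” approximation with the exact proportional change implied by the model.

```python
import math

# Hypothetical coefficient from a regression of log(income) on years of education.
b = 0.10

approx_pct = 100 * b                   # rule of thumb: "about 10% more income per year"
exact_pct = 100 * (math.exp(b) - 1)    # exact percentage change implied by the log model

print(f"approximate: {approx_pct:.1f}%   exact: {exact_pct:.2f}%")

# The approximation deteriorates as the coefficient gets larger.
for coef in (0.05, 0.10, 0.50, 1.00):
    print(f"b = {coef:.2f}: approx {100 * coef:.1f}%, exact {100 * (math.exp(coef) - 1):.1f}%")
```

For small coefficients the two are close, so the rule of thumb is fine; for large coefficients report the exact figure.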

**In nonlinear models**

(e.g., Logit, Tobit, Poisson/Exponential) the marginal effect of a variable is not constant, it depends on the other variables and the error/unobservable term. But you can express things like “marginal effect averaged over the observed values” or (for some models) the “proportional percentage effect.”
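For a logit model, the “marginal effect averaged over the observed values” can be computed as in the sketch below (Python with NumPy). The coefficients and x values are invented rather than estimated, and the model has a single regressor to keep things short.

```python
import numpy as np

# Hypothetical logit coefficients (not estimated from real data):
# P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x)))
b0, b1 = -1.0, 0.8
x = np.linspace(-2.0, 4.0, 200)   # made-up observed values of the regressor

p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

# Marginal effect of x on the probability at each observation: b1 * p * (1 - p).
marginal_effects = b1 * p * (1 - p)

print(f"smallest marginal effect: {marginal_effects.min():.3f}")
print(f"largest marginal effect:  {marginal_effects.max():.3f}")
print(f"average marginal effect:  {marginal_effects.mean():.3f}")
```

The marginal effect varies across observations and is largest where the predicted probability is near one half, which is why a single logit coefficient cannot be read as “the” effect on the probability.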

### Interpreting results 3: interaction terms

You may run a regression such as:

\[INCOME = A + B_1 \times YEARS\_EDUC + B_2 \times FEMALE \times YEARS\_EDUC + B_3 \times FEMALE + U\]

Where FEMALE is a dummy variable that =1 if the observed individual is a woman and =0 if he is a man.

*How do you interpret each coefficient estimate?*

A: A constant “intercept”; fairly meaningless by itself unless the other variables are expressed as differences from the mean, in which case it represents the mean income.

\(B_1\) : “Effect” of years of education on income (at age 30,
say) *for males*

\(B_2\) : “Additional Effect” of years of education on income *for
females relative to males*

\(B_3\) : “Effect” of being female on income at zero years of education; at other education levels the female-male difference is \(B_3 + B_2 \times YEARS\_EDUC\)

What about \(B_1 + B_2\)?

= “Effect” of years of education on income *for
females*
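These interpretations can be checked on simulated data. In the sketch below (Python with NumPy; the coefficient values and sample are made up), the interacted regression is compared with separate simple regressions for men and women.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2_000

# Hypothetical data (all coefficients are made up).
female = rng.integers(0, 2, size=n).astype(float)
years_educ = rng.integers(8, 21, size=n).astype(float)
income = (5.0 + 2.0 * years_educ + 0.5 * female * years_educ
          - 4.0 * female + rng.normal(scale=3.0, size=n))

# Regressors in the same order as the equation: constant, educ, female x educ, female.
X = np.column_stack([np.ones(n), years_educ, female * years_educ, female])
a, b1, b2, b3 = np.linalg.lstsq(X, income, rcond=None)[0]

def slope(y, x):
    """Slope from a simple regression of y on a constant and x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

men, women = female == 0, female == 1
slope_men = slope(income[men], years_educ[men])
slope_women = slope(income[women], years_educ[women])

print(f"B1 = {b1:.3f}           slope for men   = {slope_men:.3f}")
print(f"B1 + B2 = {b1 + b2:.3f}      slope for women = {slope_women:.3f}")
print(f"female-male gap at 16 years of education: {b3 + b2 * 16:.3f}")
```

Because the model lets both the intercept and the slope differ by gender, \(B_1\) matches the slope in the male subsample and \(B_1 + B_2\) matches the slope in the female subsample exactly.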

## 10.5 (To add or integrate into the above)

Simple statistics and simple tests

Hypothesis testing

(OLS vs) Probit, Tobit, and Nonlinear Specifications

Clustered standard errors

Time series data issues

## 10.6 Formatting figures and tables

(Give an annotated “good” and “bad” example for each)

- **Summary statistics**
- **Simple tests**
- **Graphs and figures**
- **Regression tables (small)**
- **Regression tables (many columns or rows)**