1.1 Basic econometrics reminder

  • An econometric model can be represented as a single equation or a system of equations, with either two variables (bivariate model) or more than two variables (multivariate model). Not all variables need to be numerical, and variables can play different roles

  • Basic steps of econometric analysis:

  1. Model specification (the model should be correctly specified according to financial theory)
  2. Data collection and preparation (generating new variables or transforming existing ones)
  3. Descriptive statistics of the sample data and examination of their properties
  4. Parameter estimation with the chosen estimator, e.g. OLS, WLS, GLS, ML, GMM, etc.
  5. Significance testing of the estimated parameters
  6. Diagnostic checking (whether all assumptions are met and how well the model fits the data)
  7. Interpretation and forecasting (to explain and predict changes in financial phenomena)

  • Model specification refers to: (1) selecting appropriate variables, (2) assuming the direction of causality, and (3) selecting an appropriate functional form
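
The basic steps listed above can be sketched for the simplest case, a bivariate linear model estimated by OLS. This is only a minimal illustration under stated assumptions: the data values are made up, the variable names are hypothetical, and the closed-form OLS formulas for a single regressor stand in for whatever estimator is actually chosen.

```python
import math

# Steps 1-2: specify y = b0 + b1*x + u and prepare the data.
# Hypothetical sample (made-up values): x = interest rate, y = inflation.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)

# Step 3: descriptive statistics (sample means).
mean_x = sum(x) / n
mean_y = sum(y) / n

# Step 4: OLS estimates for a single regressor (closed form).
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Step 5: t-statistic of the slope (H0: b1 = 0).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ssr = sum(e ** 2 for e in residuals)   # sum of squared residuals
sigma2 = ssr / (n - 2)                 # residual variance, 2 estimated parameters
se_b1 = math.sqrt(sigma2 / sxx)
t_b1 = b1 / se_b1

# Step 6: goodness of fit (R-squared).
sst = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1 - ssr / sst

# Step 7: forecast y for a new, hypothetical value of x.
forecast = b0 + b1 * 7.0

print(f"b0={b0:.3f}, b1={b1:.3f}, t(b1)={t_b1:.2f}, R2={r2:.3f}, forecast={forecast:.2f}")
```

A full diagnostic check in step 6 would also test the assumptions on the error term (e.g. no autocorrelation, homoskedasticity), which this sketch leaves out.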

  • Variables on the right-hand side can have different roles: some serve as control variables, while others are multiplied together to form an interaction term (one variable moderates the relationship between y and the other)

  • When dealing with time–series data (data observed over time), it is common for the dependent variable y to also appear, in lagged form, as an independent variable, making it endogenous

Example 1. Which variable is endogenous and which one is exogenous in the following equation? What does subscript t represent? Which variable is lagged?
$$ y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 x_t + u_t $$
Solution. Variable x is exogenous because x causes y, but not the other way around. Variable y is endogenous because it appears on both sides of the equation, meaning that y is both the consequence and the cause. This is common with time–series data, so subscript t denotes the time unit ($t = 1, 2, \dots, T$), such as hour, day, week, month, or year. Variable $y_{t-1}$ is lagged because it is observed in the previous time period (subscript $t-1$); e.g. lagged inflation might be used as an RHS variable to account for how past inflation affects present inflation. Likewise, a variable lagged by two periods is denoted $y_{t-2}$, a variable lagged by three periods $y_{t-3}$, etc.
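
Constructing a lagged variable amounts to shifting the series by one or more periods, which costs the first observations. The sketch below is a minimal illustration; the helper `lag` and the data values are hypothetical and not part of the original example.

```python
def lag(series, k=1):
    """Shift a series back by k periods; the first k entries become None."""
    if k < 0:
        raise ValueError("k must be non-negative")
    n = len(series)
    return [None] * min(k, n) + list(series[: max(n - k, 0)])

# Hypothetical data (made-up values), observed at t = 1, ..., 5.
inflation = [2.0, 2.4, 2.1, 3.0, 3.5]   # y_t
interest = [1.0, 1.5, 1.5, 2.0, 2.5]    # x_t

inflation_lag1 = lag(inflation, 1)      # y_{t-1}
inflation_lag2 = lag(inflation, 2)      # y_{t-2}

# Rows usable for y_t = b0 + b1*y_{t-1} + b2*x_t + u_t:
# the first period is dropped because y_{t-1} is not observed there.
rows = [
    (y_t, y_lag, x_t)
    for y_t, y_lag, x_t in zip(inflation, inflation_lag1, interest)
    if y_lag is not None
]

for row in rows:
    print(row)
```

Each lag shortens the usable sample by one observation, so a model with $y_{t-2}$ would leave T − 2 rows.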
Example 2. Is the following model bivariate or multivariate? Which terms are variables and which are parameters? Which variables are known (observed) and which are unknown? Which parameter represents the interaction term? Explain the interaction term, assuming that y = inflation, x = interest rate and z = COVID pandemic (1 for the pandemic period and 0 otherwise).
$$ y_t = \alpha + \beta x_t + \gamma z_t + \lambda (x_t z_t) + u_t $$
Solution. It is a multivariate model because it has more than one observed RHS variable (k ≥ 2). Variables are y, x, z and u, while α, β, γ and λ are parameters. Known (observed) variables are y, x and z. The error term u is an unknown (unobserved) random variable.
Parameter λ is the coefficient on the interaction term, i.e. on the product of the two variables x and z. In this example, λ represents the difference between the two periods in the change of inflation with respect to a 1-unit change in the interest rate: the effect is β in the non–pandemic period (z = 0) and β + λ in the pandemic period (z = 1). For instance, with β > 0, a negative λ (such that β + λ remains positive) indicates that the impact of the interest rate on inflation was weaker in the COVID pandemic period than in the non–pandemic period.
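
The role of λ can be checked numerically: the effect of a 1-unit change in x on y is β when z = 0 and β + λ when z = 1. The sketch below assumes purely hypothetical parameter values (they are not estimates from any data) and sets the error term to zero.

```python
def predicted_inflation(x, z, alpha, beta, gamma, lam):
    """Systematic part of y = alpha + beta*x + gamma*z + lam*(x*z)."""
    return alpha + beta * x + gamma * z + lam * x * z

def marginal_effect_of_x(z, beta, lam):
    """Change in y for a 1-unit change in x, given the dummy z."""
    return beta + lam * z

# Hypothetical parameter values, chosen only for illustration.
params = dict(alpha=1.0, beta=0.5, gamma=2.0, lam=-0.25)

effect_normal = marginal_effect_of_x(0, params["beta"], params["lam"])    # beta
effect_pandemic = marginal_effect_of_x(1, params["beta"], params["lam"])  # beta + lam

# Cross-check with the model itself: raise x by one unit in each period.
diff_normal = predicted_inflation(4.0, 0, **params) - predicted_inflation(3.0, 0, **params)
diff_pandemic = predicted_inflation(4.0, 1, **params) - predicted_inflation(3.0, 1, **params)

print(effect_normal, effect_pandemic)
print(diff_normal, diff_pandemic)
```

With these values the effect is positive in both periods but smaller during the pandemic, and the gap between the two periods equals λ.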

Example 3. Which variable is endogenous and which one is exogenous in the system of equations? How many parameters do we need to estimate? Write the system of two equations in matrix form.

$$ y_t = \beta_{1,0} + \beta_{1,1} y_{t-1} + \beta_{1,2} x_{t-1} + u_{1,t} $$
$$ x_t = \beta_{2,0} + \beta_{2,1} y_{t-1} + \beta_{2,2} x_{t-1} + u_{2,t} $$
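Solution. Considering the system of equations, both variables are endogenous, meaning that x causes y and y causes x. From this point of view, neither variable is strictly exogenous. Six coefficients have to be estimated, three in each equation (leaving aside the variances and covariance of the error terms). The matrix form of the system is:
$$ \begin{bmatrix} y_t \\ x_t \end{bmatrix} = \begin{bmatrix} \beta_{1,0} \\ \beta_{2,0} \end{bmatrix} + \begin{bmatrix} \beta_{1,1} & \beta_{1,2} \\ \beta_{2,1} & \beta_{2,2} \end{bmatrix} \begin{bmatrix} y_{t-1} \\ x_{t-1} \end{bmatrix} + \begin{bmatrix} u_{1,t} \\ u_{2,t} \end{bmatrix} $$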
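
To see that the matrix form reproduces the two equations, one period of the system can be computed both ways. The sketch below uses hypothetical coefficient values and sets the error terms to zero; the names `c`, `A` and `step` are introduced only for this illustration.

```python
# Hypothetical coefficients (for illustration only).
c = [0.5, 1.0]              # intercepts: beta_{1,0}, beta_{2,0}
A = [[0.6, 0.2],            # beta_{1,1}, beta_{1,2}
     [0.1, 0.7]]            # beta_{2,1}, beta_{2,2}

def step(prev, shocks=(0.0, 0.0)):
    """One period of the system in matrix form: c + A @ prev + shocks."""
    return [
        c[i] + sum(A[i][j] * prev[j] for j in range(2)) + shocks[i]
        for i in range(2)
    ]

prev = [2.0, 3.0]           # [y_{t-1}, x_{t-1}]
y_t, x_t = step(prev)

# The same values, written equation by equation.
y_check = c[0] + A[0][0] * prev[0] + A[0][1] * prev[1]
x_check = c[1] + A[1][0] * prev[0] + A[1][1] * prev[1]

# Number of coefficients in the system: intercepts plus slope matrix.
n_coefficients = len(c) + sum(len(row) for row in A)

print(y_t, x_t, n_coefficients)
```

Both variables enter the right-hand side of both equations with a lag, which is why the slope matrix has no zero entries here.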
  • Keep in mind that in the pre–estimation phase raw data are typically transformed:
  1. Taking logs, squares, inverse values, or square roots
  2. Seasonal and/or calendar adjustment
  3. Taking first differences and lagged values, where required
  4. Deflating nominal values
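
Several of the transformations listed above take only a line each. The sketch below uses made-up values and assumes a price index with base 100 as the deflator; seasonal and calendar adjustment is left out because it requires a dedicated procedure.

```python
import math

# Hypothetical data (made-up values).
nominal = [100.0, 104.0, 110.0, 121.0]        # nominal series
price_index = [100.0, 102.0, 105.0, 110.0]    # deflator, base period = 100

# Deflating: nominal values expressed in base-period prices.
real = [v / p * 100.0 for v, p in zip(nominal, price_index)]

# Logs of the real series.
log_real = [math.log(v) for v in real]

# First differences: change between consecutive periods (one observation is lost).
diff_real = [curr - prev for prev, curr in zip(real, real[1:])]

# Differences of logs approximate period-to-period growth rates.
dlog_real = [curr - prev for prev, curr in zip(log_real, log_real[1:])]

print([round(v, 2) for v in real])
print([round(v, 4) for v in dlog_real])
```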
  • Most common data issues:
  1. Missing values (NA)
  2. Measurement errors (collected data may not always represent the true values)
  3. Outliers (extreme values far above or below the mean)
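
Two of these issues can be screened for mechanically. The sketch below counts missing values and flags outliers with the 1.5 × IQR rule, which is one common convention chosen here as an assumption (the text does not prescribe a rule). The data are made up, and measurement errors cannot be detected this way.

```python
import statistics

# Hypothetical raw series (made-up values); None marks a missing value (NA).
raw = [2.1, None, 2.4, 2.2, 9.8, 2.3, None, 2.0]

# Missing values: count them and keep only the observed entries.
observed = [v for v in raw if v is not None]
n_missing = len(raw) - len(observed)

# Outliers: 1.5 * IQR rule (an assumed convention, not the only possible one).
q1, _median, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in observed if v < lower or v > upper]

print(n_missing, outliers)
```

Whether a flagged value is dropped, corrected, or kept is a judgment call that depends on why it is extreme.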
  • Regardless of the functional form and data type, you should always follow the parsimony principle regarding the number of variables on the right–hand side (fewer is better)

  • This principle balances the goodness of fit of a model against its simplicity to avoid overfitting
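
One way to make this trade-off concrete is the adjusted R², which penalizes every additional right-hand-side variable. It is used here only as an illustrative assumption (information criteria such as AIC or BIC serve the same purpose), and the R² values and sample size below are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors (intercept excluded)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

n = 30  # hypothetical sample size

# Hypothetical fits: the larger model gains very little in R-squared.
small_model = adjusted_r2(r2=0.80, n=n, k=2)
large_model = adjusted_r2(r2=0.81, n=n, k=8)

print(round(small_model, 3), round(large_model, 3))
```

In this made-up comparison the larger model has the higher R² but the lower adjusted R², so the more parsimonious model would be preferred.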