2.1 Variable selection

  • Assuming unidirectional causality, where \(x\) causes \(y\), several alternative terms can be used for the variables \(y\) and \(x\):
TABLE 2.2: Variables terminology
Variable \(y\)    Variable \(x\)
--------------    --------------
dependent         independent
outcome           predictor
response          explanatory
regressand        regressor
endogenous        exogenous
  • Variables on the right-hand side can play different roles: some serve as control variables, while others are multiplied together to form an interaction term

  • When dealing with time-series data (data observed over time), it is common for a lagged value of the dependent variable \(y\) to also appear as an independent variable, making \(y\) endogenous

Exercise 1. Which variable is endogenous and which one is exogenous in the following equation? What does subscript \(t\) represent? Which variable is lagged? \[y_t=\beta_0+\beta_1y_{t-1}+\beta_2x_t+u_t~~~~~~~~~t=1,~2, ...,T\]
Solution Variable \(x\) is exogenous because \(x\) causes \(y\), but not the other way around. Variable \(y\) is endogenous because it appears on both sides of the equation, meaning that \(y\) is both the consequence and the cause. This is common when dealing with time-series data, so subscript \(t\) represents the time unit (such as hour, day, week, month or year). Variable \(y_{t-1}\) is lagged because it is observed in the previous time period (subscript \(t-1\)); e.g. lagged consumption might be used as a right-hand-side (RHS) variable to account for how past consumption affects present consumption. Likewise, a variable lagged by two periods is denoted \(y_{t-2}\), a variable lagged by three periods is denoted \(y_{t-3}\), etc.
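The construction of a lagged variable can be illustrated with a minimal Python sketch. It simulates the equation from Exercise 1 and then builds \(y_{t-1}\) by shifting the series one period. The coefficient values, the sample size, the noise scale and the use of NumPy's least-squares routine are all assumptions made for illustration only; they do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

T = 200
beta0, beta1, beta2 = 1.0, 0.5, 2.0  # illustrative values (assumed)

x = rng.normal(size=T)               # exogenous variable x_t
u = rng.normal(scale=0.1, size=T)    # error term u_t
y = np.zeros(T)                      # y_0 is set to 0 as a starting value
for t in range(1, T):
    # y_t depends on its own previous value y_{t-1} and on x_t
    y[t] = beta0 + beta1 * y[t - 1] + beta2 * x[t] + u[t]

# Align current and lagged observations: y_lag[k] is y_{t-1} for y_cur[k] = y_t
y_cur = y[1:]
y_lag = y[:-1]
x_cur = x[1:]

# Least-squares fit of y_t on a constant, y_{t-1} and x_t
X = np.column_stack([np.ones(T - 1), y_lag, x_cur])
beta_hat, *_ = np.linalg.lstsq(X, y_cur, rcond=None)
print(beta_hat)  # estimates of (beta0, beta1, beta2)
```

Note that one observation is lost when the lag is created: with \(T\) observations of \(y\), only \(T-1\) pairs \((y_t,~y_{t-1})\) are available.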

Exercise 2. Which variable is endogenous and which one is exogenous in the following system of equations? How many parameters do we need to estimate? Write the system of two equations in matrix form.

\[y_t=\beta_{1,0}+\beta_{1,1}y_{t-1}+\beta_{1,2}x_{t-1}+u_{1,t}\] \[x_t=\beta_{2,0}+\beta_{2,1}y_{t-1}+\beta_{2,2}x_{t-1}+u_{2,t}\]
Solution Considering the system of equations, both variables are endogenous, meaning that \(x\) causes \(y\) and \(y\) causes \(x\). From this perspective, neither variable is strictly exogenous. Six coefficients need to be estimated: two intercepts and four slope coefficients (not counting the variances and covariance of the error terms). The matrix form of the system is: \[\begin{bmatrix}y_t \\ x_t \end{bmatrix}=\begin{bmatrix} \beta_{1,0} \\ \beta_{2,0} \end{bmatrix}+ \begin{bmatrix} \beta_{1,1} & \beta_{1,2} \\ \beta_{2,1} & \beta_{2,2} \end{bmatrix} \begin{bmatrix}y_{t-1} \\ x_{t-1} \end{bmatrix}+\begin{bmatrix} u_{1,t} \\ u_{2,t} \end{bmatrix}\]
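The matrix form of the system in Exercise 2 maps directly onto array operations. The following Python sketch simulates the two equations with a single matrix-vector product and then fits both equations by least squares. The names `c`, `A` and `w`, the coefficient values and the noise scale are hypothetical choices for illustration, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical coefficient values chosen only for illustration
c = np.array([0.5, 1.0])       # intercepts (beta_{1,0}, beta_{2,0})
A = np.array([[0.5, 0.2],      # (beta_{1,1}, beta_{1,2})
              [0.1, 0.3]])     # (beta_{2,1}, beta_{2,2})

T = 500
w = np.zeros((T, 2))           # row t holds (y_t, x_t); row 0 is the start value
for t in range(1, T):
    u_t = rng.normal(scale=0.1, size=2)   # (u_{1,t}, u_{2,t})
    w[t] = c + A @ w[t - 1] + u_t         # matrix form of the two equations

# Regress (y_t, x_t) on (1, y_{t-1}, x_{t-1}), one column per equation
Z = np.column_stack([np.ones(T - 1), w[:-1]])
B_hat, *_ = np.linalg.lstsq(Z, w[1:], rcond=None)   # shape (3, 2)

c_hat = B_hat[0]       # estimated intercepts
A_hat = B_hat[1:].T    # estimated 2x2 coefficient matrix
n_params = c_hat.size + A_hat.size
print(n_params)        # 6
```

Each column of `B_hat` holds the three coefficients of one equation, which is why the slope block is transposed to match the layout of the coefficient matrix in the matrix form.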
Exercise 3. Is the following model bivariate or multivariate? What does subscript \(i\) represent? Which terms are variables and which are parameters? Which variables are known (observed) and which are unknown? \[y_i=\alpha+\beta x_i+\gamma z_i+u_i~~~~~~~~~i=1,~2, ...,n\]
Solution It is a multivariate model because it has more than one observed RHS variable. Subscript \(i\) represents the cross-sectional unit. The variables are \(y\), \(x\), \(z\) and \(u\), while \(\alpha\), \(\beta\) and \(\gamma\) are parameters. The known (observed) variables are \(y\), \(x\) and \(z\). The error term \(u\) is an unknown (unobserved) variable.
Exercise 4. Which parameter represents the constant term, and which one represents the interaction term? Explain the interaction term, assuming that \(y=\) income, \(x=\) years of working experience and \(z=\) gender (\(1\) for males and \(0\) for females). \[y_i=\alpha+\beta x_i+\gamma z_i+ \lambda (x_i z_i) + u_i\]
Solution Parameter \(\alpha\) is the constant term, and parameter \(\lambda\) is the coefficient of the interaction term, which is the product of the two variables \(x\) and \(z\). In the given example, parameter \(\lambda\) represents the difference between males and females in the change of income with respect to a \(1\) unit change in working experience. For example, \(\lambda \gt 0\) indicates that the income increase due to \(1\) additional year of experience is greater for males than for females.
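To see why \(\lambda\) measures this difference, it may help to write the model from Exercise 4 separately for each group, using the coding of \(z\) given in the exercise: \[z_i=0~\text{(females)}:~~~y_i=\alpha+\beta x_i+u_i\] \[z_i=1~\text{(males)}:~~~y_i=(\alpha+\gamma)+(\beta+\lambda)x_i+u_i\] The slope with respect to experience is \(\beta\) for females and \(\beta+\lambda\) for males, so the two slopes differ by exactly \(\lambda\). In the same way, \(\gamma\) is the difference between the two intercepts.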