4.3 Variable selection and transformation

  • Based on prior knowledge or theoretical assumptions, we should select appropriate variables and assume the direction of causality in advance
TABLE 4.4: Common names for the variables \(Y\) and \(X\)

| \(Y\) | \(X\) |
|---|---|
| dependent variable | independent variable |
| predicted variable | predictor variable |
| regressand | regressor |
| endogenous | exogenous |

  • With time-series data it is common for the dependent variable \(Y\) to appear on the right-hand side as well, in lagged form (a lagged endogenous variable)

Example 4.2 Which variable is endogenous and which one is exogenous in the following equation? \[Y_t=\beta_0+\beta_1X_t+\beta_2Y_{t-1}+u_t,~~~~t=1,2,\dots,T\]
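
To illustrate how a lagged dependent variable enters the list of regressors, the following minimal sketch simulates data from an equation of this form and estimates it by least squares. The parameter values, the sample size and the use of NumPy are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed (hypothetical) true parameters for the simulation
beta0, beta1, beta2 = 1.0, 0.5, 0.6
T = 500

x = rng.normal(size=T)
u = rng.normal(scale=0.1, size=T)

# Generate Y_t = beta0 + beta1*X_t + beta2*Y_{t-1} + u_t
y = np.zeros(T)
y[0] = beta0 / (1 - beta2)  # start at the long-run mean
for t in range(1, T):
    y[t] = beta0 + beta1 * x[t] + beta2 * y[t - 1] + u[t]

# Regressors for t = 2, ..., T: constant, X_t and the lagged value Y_{t-1}
design = np.column_stack([np.ones(T - 1), x[1:], y[:-1]])
coef, *_ = np.linalg.lstsq(design, y[1:], rcond=None)

print("estimates of beta0, beta1, beta2:", np.round(coef, 3))
```

One observation is lost because \(Y_{t-1}\) is not available for the first period.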

  • Variable selection depends on the research objective and the specific application

| dependent | independent(s) |
|---|---|
| \(y\)=height | \(x\)=age |
| \(y\)=consumption | \(x\)=income |
| \(y\)=demand | \(x\)=price |
| \(y\)=production | \(x_1\)=labour, \(x_2\)=capital |
| \(y\)=interest rate | \(x_1\)=money supply, \(x_2\)=inflation |
| \(y\)=crop yield | \(x_1\)=average temperature, \(x_2\)=rainfall, \(x_3\)=number of sunshine days |

  • Regression with more than one independent variable is called multiple regression (sometimes loosely called multivariate regression), while regression with only one independent variable is called simple (bivariate) regression

  • Variables are not always quantitative: qualitative variables can also be included in the equation using the binary values \(0\) and \(1\) (such variables are sometimes called dummy variables)

  • In the simplest application the variable \(X\) is a single dummy variable with two attributes

Example 4.3 Consider cross-sectional data on the weight and gender of \(100\) randomly chosen students. What information do \(\beta_0\) and \(\beta_1\) give us in the following regression?

\[y_i=\beta_0+\beta_1x_i+u_i~,~~~y=\text{weight}~,~~~x=\left\{\begin{array}{cl} 1,& \text{if male}\\0,& \text{if female}\end{array}\right.\]
  • If a qualitative variable has two attributes (categories), only one dummy variable is included in a model with a constant term

    -> 0 indicates the absence of the attribute

    -> 1 indicates the presence of the attribute
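
A regression on a constant and a single 0/1 dummy can be checked numerically. The sketch below uses simulated weights (the group means of 62 and 78 kg are made up for illustration) and compares the least squares estimates with the group means: the constant term equals the mean of the group coded 0, and the slope equals the difference between the two group means.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample of 100 students: x = 1 for male, 0 for female
x = rng.integers(0, 2, size=100)
# Simulated weights in kg (assumed group means, not real data)
y = np.where(x == 1, 78.0, 62.0) + rng.normal(scale=8.0, size=100)

# Least squares fit of y on a constant and the dummy
design = np.column_stack([np.ones_like(y), x])
(b0, b1), *_ = np.linalg.lstsq(design, y, rcond=None)

mean_female = y[x == 0].mean()
mean_male = y[x == 1].mean()

print(f"b0 = {b0:.2f}, mean weight of females = {mean_female:.2f}")
print(f"b1 = {b1:.2f}, male minus female mean = {mean_male - mean_female:.2f}")
```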

  • If a qualitative variable has more than two attributes (\(m>2\)), the number of dummy variables should be one less than the number of attributes (\(m-1\))

  • Otherwise the system of normal equations obtained by the least squares method has no unique solution because of perfect multicollinearity (the dummy variable trap)
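
The multicollinearity problem can be seen directly in the rank of the design matrix. This is a small sketch with a hypothetical qualitative variable that has \(m=3\) attributes; the data and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical qualitative variable with m = 3 attributes (coded 0, 1, 2)
category = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 1])

# One 0/1 column per attribute
dummies = (category[:, None] == np.arange(3)).astype(float)
const = np.ones((len(category), 1))

X_all = np.hstack([const, dummies])        # constant + m dummies
X_ok = np.hstack([const, dummies[:, 1:]])  # constant + (m - 1) dummies

rank_all = np.linalg.matrix_rank(X_all)
rank_ok = np.linalg.matrix_rank(X_ok)

print(X_all.shape[1], rank_all)  # 4 columns, rank 3: columns are linearly dependent
print(X_ok.shape[1], rank_ok)    # 3 columns, rank 3: full column rank
```

With all \(m\) dummies the columns sum to the constant column, so the matrix \(X'X\) in the normal equations is singular. Dropping one dummy makes the omitted attribute the reference category.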

  • In regression analysis dummy variables can be used to describe:

  1. Changes in the constant term only
  2. Changes in the slope coefficient only
  3. Changes in the constant term and the slope coefficient
  4. Interaction effect between two or more qualitative variables
  5. Seasonal component of a time-series (seasonal dummy variables)
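
As a sketch of the first three cases, assume one quantitative regressor \(x\) and one dummy variable \(d\) (both symbols are introduced here only for illustration):

\[\begin{array}{ll}
1.~~y_i=\beta_0+\beta_1x_i+\beta_2d_i+u_i & \text{constant term: }\beta_0\text{ if }d_i=0,~~\beta_0+\beta_2\text{ if }d_i=1\\
2.~~y_i=\beta_0+\beta_1x_i+\beta_2(x_id_i)+u_i & \text{slope: }\beta_1\text{ if }d_i=0,~~\beta_1+\beta_2\text{ if }d_i=1\\
3.~~y_i=\beta_0+\beta_1x_i+\beta_2d_i+\beta_3(x_id_i)+u_i & \text{constant term and slope both differ between the groups}
\end{array}\]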

Example 4.4 The dependence of income (variable \(y\), in USD) on gender (variable \(d_1\)) and working sector (variable \(d_2\)) is analyzed, where \[d_1=\left\{\begin{array}{cl} 1,& \text{if male}\\0,& \text{if female}\end{array}\right.~~,~~~~~~~~~d_2=\left\{\begin{array}{cl} 1,& \text{if working in private sector}\\0,& \text{if working in public sector}\end{array}\right.\]

What changes can be explained by the given models?

\[a)~~~\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1d_{i,1}+\hat{\beta}_2d_{i,2}+\hat{\beta}_3(d_{i,1}d_{i,2})\]
\[b)~~~\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_i+\hat{\beta}_2(x_id_{i,1})+\hat{\beta}_3(x_id_{i,2})+\hat{\beta}_4(d_{i,1}d_{i,2})\]
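
One way to read model a) is to substitute the four possible combinations of \(d_{i,1}\) and \(d_{i,2}\), which gives the predicted income of each group:

\[\begin{array}{lll}
d_{i,1}=0,~d_{i,2}=0 & \text{(female, public sector)} & \hat{y}_i=\hat{\beta}_0\\
d_{i,1}=1,~d_{i,2}=0 & \text{(male, public sector)} & \hat{y}_i=\hat{\beta}_0+\hat{\beta}_1\\
d_{i,1}=0,~d_{i,2}=1 & \text{(female, private sector)} & \hat{y}_i=\hat{\beta}_0+\hat{\beta}_2\\
d_{i,1}=1,~d_{i,2}=1 & \text{(male, private sector)} & \hat{y}_i=\hat{\beta}_0+\hat{\beta}_1+\hat{\beta}_2+\hat{\beta}_3
\end{array}\]

Under this reading, \(\hat{\beta}_3\) measures how much the income difference between males and females in the private sector departs from the same difference in the public sector.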

  • After the appropriate variables have been selected, they sometimes need to be transformed before the parameters are estimated

  • Transformations are usually applied for three reasons:

  1. reducing the impact of outliers (extreme values)
  2. stabilizing the variations around the regression line
  3. establishing the linearity between nonlinearly related variables
  • After variables have been transformed, the slope coefficient must be interpreted with care

| model | equation | slope coefficient |
|---|---|---|
| lin-lin | \(y_i=\beta_0+\beta_1x_i+u_i\) | \(\beta_1=\Delta y/\Delta x\) |
| log-log | \(\log y_i=\beta_0+\beta_1\log x_i+u_i\) | \(\beta_1=\%\Delta y/\%\Delta x\) |
| log-lin | \(\log y_i=\beta_0+\beta_1x_i+u_i\) | \(\beta_1\times 100\approx\%\Delta y/\Delta x\) |
| lin-log | \(y_i=\beta_0+\beta_1\log x_i+u_i\) | \(\beta_1/ 100\approx\Delta y/\%\Delta x\) |
| polynomial | \(y_i=\beta_0+\beta_1x_i+\beta_2x_i^{2}+u_i\) | \(\beta_1+2\beta_2x_0\approx\Delta y/\Delta x_0\) |
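
The log-log interpretation can be checked on simulated data. The sketch below assumes natural logarithms and a hypothetical elasticity of \(-0.8\); it estimates the log-log model by least squares and then compares the slope with the percentage change in fitted \(y\) after a 1% rise in \(x\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical log-log data-generating process with elasticity -0.8
elasticity = -0.8
x = rng.uniform(1.0, 10.0, size=400)
u = rng.normal(scale=0.05, size=400)
y = np.exp(3.0 + elasticity * np.log(x) + u)

# The relation is nonlinear in levels but linear after taking logs
design = np.column_stack([np.ones_like(x), np.log(x)])
(b0, b1), *_ = np.linalg.lstsq(design, np.log(y), rcond=None)

# Raise x by 1% (from 2.00 to 2.02) and compare the fitted values of y
fitted_before = np.exp(b0 + b1 * np.log(2.0))
fitted_after = np.exp(b0 + b1 * np.log(2.0 * 1.01))
pct_change_y = 100 * (fitted_after / fitted_before - 1)

print(f"estimated slope b1 = {b1:.3f}")
print(f"% change in y for a 1% rise in x = {pct_change_y:.3f}")
```

The two printed numbers are close but not identical, because the percentage interpretation is exact only for infinitesimally small changes.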

Example 4.5 Transform the Cobb-Douglas production function \[y_i=\beta_0x_{i,1}^{\beta_1}x_{i,2}^{\beta_2}e^{u_i}\] into a linear model. Explain the meaning of the parameters \(\beta_1\) and \(\beta_2\) if \(y\)=production (thousand tons), \(x_1\)=number of employees and \(x_2\)=capital (million $).
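
A sketch of the transformation, assuming \(\beta_0>0\), strictly positive \(y\), \(x_1\) and \(x_2\), and natural logarithms: taking logs of both sides turns the product into a sum,

\[\log y_i=\log\beta_0+\beta_1\log x_{i,1}+\beta_2\log x_{i,2}+u_i\]

This is a log-log model that is linear in the parameters \(\log\beta_0\), \(\beta_1\) and \(\beta_2\), so the slope coefficients can be read as elasticities: holding the other input fixed, a 1% increase in \(x_1\) (or \(x_2\)) is associated with a change of about \(\beta_1\)% (or \(\beta_2\)%) in \(y\).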

Example 4.6 Explain the meaning of the slope coefficient in each model if \(y\)=monthly gas consumption (in liters) and \(x\)=price per liter (in $). \[a)~~y_i=158.49-13.48x_i+u_i~~~~~~~b)~~\log y_i=5.44-0.87\log x_i+u_i\] \[c)~~\log y_i=3.27-0.14x_i+u_i~~~~~~d)~~y_i=265.51-103.62\log x_i+u_i\]
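
The interpretation rules for the four model types can be applied to these slope coefficients as follows. This is a sketch that assumes natural logarithms; for the log-lin and lin-log models it also computes the exact effect next to the usual approximation.

```python
import math

# Slope coefficients of models a) to d) in Example 4.6
slope_lin_lin = -13.48
slope_log_log = -0.87
slope_log_lin = -0.14
slope_lin_log = -103.62

# a) lin-lin: change in liters for a $1 price increase
effect_a = slope_lin_lin
# b) log-log: % change in liters for a 1% price increase (elasticity)
effect_b = slope_log_log
# c) log-lin: % change in liters for a $1 price increase
effect_c_approx = 100 * slope_log_lin
effect_c_exact = 100 * (math.exp(slope_log_lin) - 1)
# d) lin-log: change in liters for a 1% price increase
effect_d_approx = slope_lin_log / 100
effect_d_exact = slope_lin_log * math.log(1.01)

print(f"a) {effect_a:.2f} liters per $1")
print(f"b) {effect_b:.2f}% per 1%")
print(f"c) approx {effect_c_approx:.2f}%, exact {effect_c_exact:.2f}% per $1")
print(f"d) approx {effect_d_approx:.4f}, exact {effect_d_exact:.4f} liters per 1%")
```

The approximation in c) is less accurate than in d) because a $1 price change is a large step relative to the scale of \(x\), whereas a 1% change is small.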