4.3 Variables selection and transformation

Based on prior knowledge or theoretical assumptions, we should select appropriate variables and assume the direction of causality in advance.

TABLE 4.4: Caption here
$~~~~~~~~Y$	$~~~~~~~~X$
dependent variable	independent variable
predicted variable	predictor variable
regressand	regressor
endogenous	exogenous

When dealing with time-series data, it is common for the dependent variable $Y$ to also appear, in lagged form, as an independent variable (a lagged endogenous variable).

Example 4.2 Which variable is endogenous and which one is exogenous in the following equation? $Y_t=\beta_0+\beta_1X_t+\beta_2Y_{t-1}+u_t~~~~~~~~t=1,~2, ...,T$
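
To make the structure of this equation concrete, the following minimal sketch arranges two series into the rows needed to estimate it. The function name `lagged_design` and the numbers are hypothetical and serve only as an illustration; the point is that each row combines the current $X_t$ with the previous value $Y_{t-1}$, so one observation is lost.

```python
def lagged_design(y, x):
    """Arrange data for Y_t = b0 + b1*X_t + b2*Y_{t-1} + u_t.

    Returns (targets, rows); each row is [1, X_t, Y_{t-1}].
    The first observation is dropped because it has no lagged value.
    """
    if len(y) != len(x):
        raise ValueError("y and x must have the same length")
    targets = list(y[1:])
    rows = [[1.0, x[t], y[t - 1]] for t in range(1, len(y))]
    return targets, rows


# Hypothetical series, for illustration only
y = [5.0, 5.5, 6.9, 6.5]
x = [2.0, 1.0, 3.0, 0.0]
targets, rows = lagged_design(y, x)
for target, row in zip(targets, rows):
    print(target, row)
```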

The selection of variables depends on the research objective and the specific application.

dependent	independent(s)
$y$ =height	$x$ =age
$y$ =consumption	$x$ =income
$y$ =demand	$x$ =price
$y$ =production	$x_1$ =labour, $x_2$ =capital
$y$ =interest rate	$x_1$ =money supply, $x_2$ =inflation
$y$ =crop yield	$x_1$ =average temperature, $x_2$ =rainfall, $x_3$ =number of sunshine days

Regression with more than one independent variable is called multiple regression (often loosely called multivariate regression), while regression with only one independent variable is called bivariate regression (simple regression).
Variables are not always quantitative: qualitative variables can also be included in the equation using the binary values $0$ and $1$ (such variables are sometimes called dummy variables).
In the simplest application, the variable $X$ is a single dummy variable with two attributes.

Example 4.3 Suppose that cross-sectional data consist of $100$ randomly chosen students, observed with respect to their weight and gender. What information do $\beta_0$ and $\beta_1$ give us in the following regression?

$y_i=\beta_0+\beta_1x_i+u_i~,~~y=weight~~,~~~x=\left\{\begin{array}{cl} 1,& if~male\\0,& if~female\end{array}\right.$
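
One way to explore this question is to fit the regression on a small made-up sample and compare the estimates with the group means. The sketch below uses the standard closed-form least squares formulas for a single regressor; the helper `simple_ols` and the weights (in kg) are hypothetical, not data from the example.

```python
from statistics import mean


def simple_ols(x, y):
    """OLS estimates (b0, b1) for y = b0 + b1*x + u with one regressor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical weights in kg (not real data); x = 1 for male, 0 for female
x = [1, 1, 1, 0, 0, 0, 0]
y = [80.0, 76.0, 84.0, 60.0, 64.0, 58.0, 62.0]

b0, b1 = simple_ols(x, y)
mean_female = mean([yi for xi, yi in zip(x, y) if xi == 0])
mean_male = mean([yi for xi, yi in zip(x, y) if xi == 1])
print(b0, mean_female)               # intercept vs. mean weight of females
print(b1, mean_male - mean_female)   # slope vs. difference in group means
```

In this sketch the intercept coincides with the mean of the group coded $0$ and the slope with the difference between the two group means, which is the usual reading of a single dummy regressor.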

If a qualitative variable has two attributes (categories), only one dummy variable is included in the model with a constant term:

-> 0 indicates the absence of the attribute

-> 1 indicates the presence of the attribute
If a qualitative variable has more than two attributes ( $m>2$ ), the number of dummy variables should be one less than the number of attributes ( $m-1$ ).
Otherwise, the system of normal equations obtained by the least squares method has no unique solution due to perfect multicollinearity.
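
The multicollinearity problem can be checked directly on a small design matrix. In the sketch below, which assumes a hypothetical qualitative variable with $m=3$ attributes, the constant column equals the sum of all three dummy columns, so the matrix with $m$ dummies does not have full column rank, while the matrix with $m-1$ dummies does. The helpers `matrix_rank` and `design_matrix` are illustrative names, not part of any library.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(categories, levels, drop_first):
    """Constant term plus one 0/1 dummy per level (optionally dropping the first)."""
    used = levels[1:] if drop_first else levels
    return [[1] + [int(c == lev) for lev in used] for c in categories]


# Hypothetical qualitative variable with m = 3 attributes
levels = ["low", "medium", "high"]
data = ["low", "high", "medium", "high", "low", "medium"]

full = design_matrix(data, levels, drop_first=False)     # constant + 3 dummies
reduced = design_matrix(data, levels, drop_first=True)   # constant + 2 dummies
print(matrix_rank(full), len(full[0]))        # rank is below the number of columns
print(matrix_rank(reduced), len(reduced[0]))  # rank equals the number of columns
```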
In regression analysis dummy variables can be used to describe:

Changes in the constant term only
Changes in the slope coefficient only
Changes in the constant term and the slope coefficient
Interaction effect between two or more qualitative variables
Seasonal component of a time-series (seasonal dummy variables)

Example 4.4 The dependence of income (variable $y$, in USD) on gender (variable $d_1$ ) and working sector (variable $d_2$ ) is analyzed, where $d_1=\left\{\begin{array}{cl} 1,& if~male\\0,& if~female\end{array}\right.~~,~~~~~~~~~d_2=\left\{\begin{array}{cl} 1,& if~working~in~private~sector\\0,& if~working~in~public~sector\end{array}\right.$

What changes can be explained by the given models?

$a)~~~\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1d_{i,1}+\hat{\beta}_2d_{i,2}+\hat{\beta}_3(d_{i,1}d_{i,2})~~~~~~~~~~~~~~~~~~~$ $b)~~\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_i+\hat{\beta}_2(x_id_{i,1})+\hat{\beta}_3(x_id_{i,2})+\hat{\beta}_4(d_{i,1}d_{i,2})$
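
As a starting point for model a), it may help to substitute the four possible combinations of the two dummy variables and write out the fitted income for each group. This is only a worked substitution under the coding given in the example:

```latex
\begin{aligned}
d_1=0,\ d_2=0\ (\text{female, public}):&\quad \hat{y}=\hat{\beta}_0\\
d_1=1,\ d_2=0\ (\text{male, public}):&\quad \hat{y}=\hat{\beta}_0+\hat{\beta}_1\\
d_1=0,\ d_2=1\ (\text{female, private}):&\quad \hat{y}=\hat{\beta}_0+\hat{\beta}_2\\
d_1=1,\ d_2=1\ (\text{male, private}):&\quad \hat{y}=\hat{\beta}_0+\hat{\beta}_1+\hat{\beta}_2+\hat{\beta}_3
\end{aligned}
```

Read this way, model a) shifts only the constant term across groups, and $\hat{\beta}_3$ captures the part of the male and private-sector difference that is not explained by the two separate effects.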

After appropriate variables have been selected, they sometimes need to be transformed before parameter estimation.
Transformations are usually applied for three reasons:

reducing the impact of outliers (extreme values)
stabilizing the variations around the regression line
establishing the linearity between nonlinearly related variables

When variables have been transformed, the slope coefficient must be interpreted with care.

model	equation	slope coefficient
lin-lin	$y_i=\beta_0+\beta_1x_i+u_i$	$\beta_1=\Delta y/\Delta x$
log-log	$\log y_i=\beta_0+\beta_1\log x_i+u_i$	$\beta_1=\%\Delta y/\%\Delta x$
log-lin	$\log y_i=\beta_0+\beta_1x_i+u_i$	$\beta_1\times 100\approx\%\Delta y/\Delta x$
lin-log	$y_i=\beta_0+\beta_1\log x_i+u_i$	$\beta_1/ 100\approx\Delta y/\%\Delta x$
polynomial	$y_i=\beta_0+\beta_1x_i+\beta_2x_i^{2}+u_i$	$\beta_1+2\beta_2x_0\approx\Delta y/\Delta x_0$
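
The interpretations in the table are approximations that work well for small changes. The sketch below compares them with the exact effects implied by each logarithmic model, assuming that log denotes the natural logarithm; the function names and the slope value are hypothetical.

```python
import math


def log_lin_pct_change(b1, dx=1.0):
    """Exact % change in y when x rises by dx in log(y) = b0 + b1*x."""
    return 100.0 * (math.exp(b1 * dx) - 1.0)


def lin_log_change(b1, pct=1.0):
    """Exact change in y when x rises by pct percent in y = b0 + b1*log(x)."""
    return b1 * math.log(1.0 + pct / 100.0)


def log_log_pct_change(b1, pct=1.0):
    """Exact % change in y when x rises by pct percent in log(y) = b0 + b1*log(x)."""
    return 100.0 * ((1.0 + pct / 100.0) ** b1 - 1.0)


# Hypothetical slope value used for all three models
b1 = 0.05
print(log_lin_pct_change(b1), 100 * b1)   # exact vs. approximate (b1 * 100)
print(lin_log_change(b1), b1 / 100)       # exact vs. approximate (b1 / 100)
print(log_log_pct_change(b1), b1)         # exact vs. approximate (b1)
```

For slopes that are large in absolute value in the log-lin model, the gap between the exact and the approximate percentage change grows, so the exact formula may be the safer choice there.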

Example 4.5 Transform the Cobb-Douglas production function $y_i=\beta_0x_{i,1}^{\beta_1}x_{i,2}^{\beta_2}e^{u_i}$ into a linear model. Explain the meaning of the parameters $\beta_1$ and $\beta_2$ if $y$ =production (000 tons), $x_1$ =number of employees and $x_2$ =capital (millions of USD).
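
A sketch of the transformation, assuming natural logarithms and positive values of $\beta_0$, $x_{i,1}$ and $x_{i,2}$: taking logs of both sides turns the product into a sum,

```latex
\log y_i=\log\beta_0+\beta_1\log x_{i,1}+\beta_2\log x_{i,2}+u_i
```

This is a log-log model that is linear in the parameters $\log\beta_0$, $\beta_1$ and $\beta_2$, so each slope can be read as a partial elasticity: the approximate percentage change in production for a $1\%$ change in the corresponding input, holding the other input fixed.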

Example 4.6 Explain the meaning of the slope coefficient in each model if

$y$ =monthly gas consumption (in liters) and

$x$ =price per liter (in USD).

$a)~~y_i=158.49-13.48x_i+u_i~~~~~~~b)~~\log y_i=5.44-0.87\log x_i+u_i$

$c)~~\log y_i=3.27-0.14x_i+u_i~~~~~~d)~~y_i=265.51-103.62\log x_i+u_i$
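
The approximate rules from the table of slope interpretations can be applied mechanically to the four slope estimates of this example. The sketch below does only that arithmetic; the function name `slope_effect` is hypothetical, and the verbal interpretation in terms of gas consumption and price is left to the reader.

```python
def slope_effect(model, b1):
    """Apply the approximate rules from the table to a slope coefficient.

    Returns (effect on y, unit of the effect, change in x).
    """
    rules = {
        "lin-lin": (b1, "units of y", "1 unit of x"),
        "log-log": (b1, "% of y", "1% of x"),
        "log-lin": (b1 * 100, "% of y", "1 unit of x"),
        "lin-log": (b1 / 100, "units of y", "1% of x"),
    }
    return rules[model]


# Slope estimates quoted in Example 4.6, models a) to d)
estimates = {"lin-lin": -13.48, "log-log": -0.87, "log-lin": -0.14, "lin-log": -103.62}
for model, b1 in estimates.items():
    effect, unit, dx = slope_effect(model, b1)
    print(f"{model}: {effect:+.4f} {unit} per {dx} increase")
```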