4.3 Variables selection and transformation
- Based on prior knowledge or theoretical assumptions, we should select appropriate variables and assume the direction of causality in advance
Y | X |
---|---|
dependent variable | independent variable |
predicted variable | predictor variable |
regressand | regressor |
endogenous | exogenous |
- With time-series data, the dependent variable Y commonly also appears on the right-hand side of the equation in lagged form (as a lagged endogenous variable)
Example 4.2 Which variable is endogenous and which one is exogenous in the following equation: $Y_t=\beta_0+\beta_1X_t+\beta_2Y_{t-1}+u_t$, $t=1,2,\dots,T$?
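The following is a minimal sketch of how the lagged term $Y_{t-1}$ enters the data set, assuming a short made-up series. The numbers and the helper name `build_rows` are illustrative and not part of the example. Each usable observation pairs $Y_t$ with $X_t$ and the previous value of $Y$, so the first period is lost.

```python
def build_rows(y, x):
    """Return (Y_t, X_t, Y_{t-1}) triples for t = 2, ..., T."""
    if len(y) != len(x):
        raise ValueError("y and x must have the same length")
    # The first observation has no predecessor, so the sample starts at t = 2.
    return [(y[t], x[t], y[t - 1]) for t in range(1, len(y))]


# Hypothetical series with T = 5 observations (values are made up).
y = [10.0, 12.0, 11.0, 13.0, 15.0]
x = [1.0, 2.0, 2.0, 3.0, 4.0]

rows = build_rows(y, x)
for y_t, x_t, y_lag in rows:
    print(f"Y_t={y_t}, X_t={x_t}, Y_(t-1)={y_lag}")
```

With $T$ observations only $T-1$ rows remain available for estimation.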
- The selection of variables depends on the research objective and the specific application
dependent | independent(s) |
---|---|
y=height | x=age |
y=consumption | x=income |
y=demand | x=price |
y=production | x1=labour, x2=capital |
y=interest rate | x1=money supply, x2=inflation |
y=crop yield | x1=average temperature, x2=rainfall, x3=number of sunshine days |
Regression with more than one independent variable is called multivariate regression (also known as multiple regression), while regression with only one independent variable is called bivariate regression (simple regression)
Variables are not always quantitative: qualitative variables can also be included in the equation using the binary values 0 and 1 (such variables are sometimes called dummy variables)
In the simplest application, the variable X is a single dummy variable with two attributes
Example 4.3 Consider cross-sectional data on 100 randomly chosen students, recording their weight and gender. What information do $\beta_0$ and $\beta_1$ give us in the following regression?
$$y_i=\beta_0+\beta_1x_i+u_i,\quad y=\text{weight},\quad x_i=\begin{cases}1, & \text{if male}\\ 0, & \text{if female}\end{cases}$$
If a qualitative variable has two attributes (categories), only one dummy variable is included in a model with a constant term
-> 0 indicates the absence of the attribute
-> 1 indicates the presence of the attribute
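The role of the two coefficients in a regression on a single dummy can be checked numerically. The sketch below assumes a handful of made-up weights (not the 100 students of the example) and a hand-written helper `ols_simple`, which is introduced here for illustration only. Under these assumptions, the intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means.

```python
def ols_simple(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical weights in kg; the dummy is 1 for male and 0 for female.
weight = [82.0, 75.0, 90.0, 60.0, 55.0, 65.0, 62.0]
male = [1, 1, 1, 0, 0, 0, 0]

b0, b1 = ols_simple(male, weight)

mean_female = sum(w for w, m in zip(weight, male) if m == 0) / male.count(0)
mean_male = sum(w for w, m in zip(weight, male) if m == 1) / male.count(1)

print("intercept:", b0, "mean of group coded 0:", mean_female)
print("slope:", b1, "difference in group means:", mean_male - mean_female)
```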
If a qualitative variable has more than two attributes ($m>2$), the number of dummy variables should be one less than the number of attributes ($m-1$)
Otherwise the system of normal equations obtained by the least squares method has no unique solution, because the dummy variables sum to the constant term (perfect multicollinearity, often called the dummy variable trap)
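A small numerical sketch of this multicollinearity problem follows, assuming a made-up qualitative variable with $m=3$ categories and six observations. The helpers `determinant` and `cross_product` are illustrative names, not part of any library. With a constant and all three dummies, the matrix $X'X$ of the normal equations is singular (determinant zero); dropping one dummy makes it invertible.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant by Gaussian elimination on Fractions."""
    a = [[Fraction(v) for v in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot in this column: matrix is singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def cross_product(X):
    """Return X'X for a design matrix given as a list of rows."""
    k = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(k)]
            for i in range(k)]


# Hypothetical qualitative variable with m = 3 categories (made-up data).
category = ["a", "b", "c", "a", "b", "c"]
levels = ["a", "b", "c"]

# Constant plus all m dummies: the dummies sum to the constant column.
X_full = [[1] + [int(c == lev) for lev in levels] for c in category]
# Constant plus m - 1 dummies: category "a" serves as the reference group.
X_reduced = [[1] + [int(c == lev) for lev in levels[1:]] for c in category]

det_full = determinant(cross_product(X_full))
det_reduced = determinant(cross_product(X_reduced))
print("det with m dummies:", det_full)
print("det with m-1 dummies:", det_reduced)
```

Exact fractions are used here instead of floating point so that the zero determinant is not blurred by rounding.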
In regression analysis dummy variables can be used to describe:
- Changes in the constant term only
- Changes in the slope coefficient only
- Changes in the constant term and the slope coefficient
- Interaction effect between two or more qualitative variables
- Seasonal component of a time-series (seasonal dummy variables)
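Example 4.4 Dependence of income (variable y in USD) on gender (variable $d_1$) and working sector (variable $d_2$) is analyzed, considering
$$d_1=\begin{cases}1, & \text{if male}\\ 0, & \text{if female}\end{cases},\qquad d_2=\begin{cases}1, & \text{if working in private sector}\\ 0, & \text{if working in public sector}\end{cases}$$
What changes can be explained by given models?
a) $\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1d_{i,1}+\hat{\beta}_2d_{i,2}+\hat{\beta}_3(d_{i,1}d_{i,2})$
b) $\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_i+\hat{\beta}_2(x_id_{i,1})+\hat{\beta}_3(x_id_{i,2})+\hat{\beta}_4(d_{i,1}d_{i,2})$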
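One way to read model a) of Example 4.4 is to substitute each of the four combinations of the two dummies into the fitted equation. The following is a sketch of that substitution, offered as an aid to interpretation rather than as the only possible answer.

```latex
\begin{aligned}
\text{female, public } (d_1=0,\ d_2=0): &\quad \hat{y} = \hat{\beta}_0 \\
\text{male, public } (d_1=1,\ d_2=0): &\quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \\
\text{female, private } (d_1=0,\ d_2=1): &\quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_2 \\
\text{male, private } (d_1=1,\ d_2=1): &\quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3
\end{aligned}
```

All four cases differ only in the constant term. The gender gap is $\hat{\beta}_1$ in the public sector and $\hat{\beta}_1+\hat{\beta}_3$ in the private sector, so $\hat{\beta}_3$ captures the interaction between gender and sector. In model b) the same substitution gives a slope with respect to $x$ of $\hat{\beta}_1+\hat{\beta}_2d_1+\hat{\beta}_3d_2$, so the slope varies by group, while the constant term shifts by $\hat{\beta}_4$ only for males in the private sector.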
After appropriate variables are selected, they sometimes need to be transformed before parameter estimation
Transformations are usually applied for three reasons:
- reducing the impact of outliers (extreme values)
- stabilizing the variations around the regression line
- establishing the linearity between nonlinearly related variables
- Because of variable transformations, the slope coefficient must be interpreted with care
model | equation | slope coefficient |
---|---|---|
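lin-lin | $y_i=\beta_0+\beta_1x_i+u_i$ | $\beta_1=\Delta y/\Delta x$ |
log-log | $\log y_i=\beta_0+\beta_1\log x_i+u_i$ | $\beta_1=\%\Delta y/\%\Delta x$ |
log-lin | $\log y_i=\beta_0+\beta_1x_i+u_i$ | $\beta_1\times 100\approx\%\Delta y/\Delta x$ |
lin-log | $y_i=\beta_0+\beta_1\log x_i+u_i$ | $\beta_1/100\approx\Delta y/\%\Delta x$ |
polynomial | $y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+u_i$ | $\beta_1+2\beta_2x_0\approx\Delta y/\Delta x_0$ |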
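The log-log row of the table can be illustrated with a short sketch. It assumes made-up data generated exactly from $y=3x^{0.5}$ with no error term, natural logarithms, and a hand-written helper `ols_simple` introduced only for this illustration. Under these assumptions, the estimated slope recovers the elasticity 0.5, and a 1% increase in $x$ raises $y$ by roughly 0.5%.

```python
import math


def ols_simple(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical exact relationship y = 3 * x^0.5 (no error term).
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 * xi ** 0.5 for xi in x]

# Log-log model: regress log(y) on log(x).
b0, b1 = ols_simple([math.log(v) for v in x], [math.log(v) for v in y])

# Exact percentage change in y implied by a 1% increase in x.
pct_change_y = (1.01 ** b1 - 1) * 100

print("estimated elasticity:", round(b1, 6))
print("implied constant:", round(math.exp(b0), 6))
print("% change in y for a 1% change in x:", round(pct_change_y, 4))
```

The small gap between the slope and the exact percentage change reflects that the percentage reading of the coefficient holds for small changes in $x$.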
Example 4.5 Transform the Cobb-Douglas production function $y_i=\beta_0x_{i,1}^{\beta_1}x_{i,2}^{\beta_2}e^{u_i}$ into a linear model. Explain the meaning of parameters $\beta_1$ and $\beta_2$ if $y$=production (000 tons), $x_1$=number of employees and $x_2$=capital (millions of USD).
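A sketch of the transformation, assuming natural logarithms and positive values of $y$, $x_1$, $x_2$ and $\beta_0$ (the symbol $\beta_0^{*}$ is introduced here only for convenience):

```latex
\begin{aligned}
y_i &= \beta_0\, x_{i,1}^{\beta_1}\, x_{i,2}^{\beta_2}\, e^{u_i} \\
\ln y_i &= \ln\beta_0 + \beta_1 \ln x_{i,1} + \beta_2 \ln x_{i,2} + u_i \\
\ln y_i &= \beta_0^{*} + \beta_1 \ln x_{i,1} + \beta_2 \ln x_{i,2} + u_i,
\qquad \beta_0^{*} = \ln\beta_0
\end{aligned}
```

The result is a log-log model that is linear in the parameters, so $\beta_1$ and $\beta_2$ are elasticities: a 1% increase in the number of employees changes production by about $\beta_1$%, holding capital fixed, and a 1% increase in capital changes production by about $\beta_2$%, holding the number of employees fixed. Because they are elasticities, their values do not depend on the units in which production, employment and capital are measured.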