Examples and applications
Case study I: PISA scores and GDPp
The Programme for International Student Assessment (PISA) is a study carried out by the Organisation for Economic Co-operation and Development (OECD) in 65 countries with the purpose of evaluating the performance of 15-year-old pupils in mathematics, science, and reading. A phenomenon observed over the years is that wealthy countries tend to achieve higher average scores. The purpose of this case study, motivated by the OECD (2012a) report, is to answer two questions related to the previous statement:
 Q1. Is the educational level of a country influenced by its economic wealth?
 Q2. If so, up to what precise extent?
The pisa.csv file (download) contains 65 rows corresponding to the countries that took part in the PISA study. The data was obtained by merging the statlink in OECD (2012b) with The World Bank (2012) data. Each row has the following variables: Country; MathMean, ReadingMean and ScienceMean (the average performance of the students in mathematics, reading and science); MathShareLow and MathShareTop (percentages of students with a low and top performance in mathematics); GDPp and logGDPp (the Gross Domestic Product per capita and its logarithm); HighIncome (whether the country has a GDPp larger than 20000$ or not). The GDPp of a country is a measure of how many economic resources are available per citizen. The logGDPp is the logarithm of the GDPp, taken in order to avoid scale distortions. A small subset of the data is shown in Table 2.1.
Table 2.1: First 10 rows of the pisa dataset for a selection of variables. Note the NA (Not Available) in Chinese Taipei (or Taiwan).
Country               MathMean  ReadingMean  ScienceMean   logGDPp  HighIncome
Shanghai-China             613          570          580   8.74267  FALSE
Singapore                  573          542          551  10.90506  TRUE
Hong Kong SAR, China       561          545          555  10.51074  TRUE
Chinese Taipei             560          523          523        NA  NA
Korea                      554          536          538  10.10455  TRUE
Macao SAR, China           538          509          521  11.25344  TRUE
Japan                      536          538          547  10.75152  TRUE
Liechtenstein              535          516          525  11.91278  TRUE
Switzerland                531          509          515  11.32911  TRUE
Netherlands                523          511          522  10.80922  TRUE
We definitely need a way of summarizing this amount of information!
We are going to do the following. First, import the data into R Commander
and do a basic manipulation of it. Second, fit a linear model and interpret its output. Finally, visualize the fitted line and the data.
Import the data into R Commander.
Go to 'Data' > 'Import data' > 'from text file, clipboard, or URL...'. A window like Figure 2.1 will pop up. Select the appropriate formatting options of the data file: whether the first row contains the names of the variables, what the indicator for missing data is, what the field separator is, and what the decimal point character is. Then click 'OK'.
Inspecting the data file in a text editor will give you the right formatting choices for importing the data.
Click on 'View data set' to check that the importation went fine. If the data looks weird, recheck the structure of the data file and restart from the previous point.
Since each row corresponds to a different country, we are going to name the rows with the values of the variable Country. To that end, go to 'Data' > 'Active data set' > 'Set case names...', select the variable Country and click 'OK'. The dataset should look like Figure 2.2.
In UC3M computers, altering the location of a downloaded file may cause errors in its importation to R Commander! Example:

Default download path: ‘C:/Users/g15s4021/Downloads/pisa.csv’. Importation from that path works fine.

If you move the file to another location (e.g. to ‘C:/Users/g15s4021/Desktop/pisa.csv’), the importation generates an error.
Fit a simple linear regression.
Go to 'Statistics' > 'Fit models' > 'Linear regression...'. A window like Figure 2.3 will pop up.
Select the response variable. This is the variable denoted by \(Y\) that we want to predict/explain. Then select the explanatory variable (also known as the predictor). It is denoted by \(X\) and is the variable used to predict/explain \(Y\). Recall the form of the linear model:
\[\begin{align*}
Y=\beta_0+\beta_1X+\varepsilon
\end{align*}\]
In our case \(Y=\) MathMean and \(X=\) logGDPp, so select them and click 'OK'.
If you want to deselect an option in an R Commander menu, use ‘Control’ + ‘Mouse click’.
Four buttons are common in the menus of R Commander:

‘OK’: executes the selected action, then closes the window.

‘Apply’: executes the selected action but leaves the window open. Useful if you are experimenting with different options.

‘Reset’: resets the fields and boxes of the window to their defaults.

‘Cancel’: exits the window without performing any action.
The window in Figure 2.3 generates this code and output:
pisaLinearModel <- lm(MathMean ~ logGDPp, data = pisa)
summary(pisaLinearModel)
##
## Call:
## lm(formula = MathMean ~ logGDPp, data = pisa)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -138.924  -29.109    1.381   20.239  176.166
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   185.16      61.36   3.018  0.00369 **
## logGDPp        28.79       6.13   4.696 1.51e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.48 on 62 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2624, Adjusted R-squared: 0.2505
## F-statistic: 22.06 on 1 and 62 DF, p-value: 1.512e-05
This is the linear model of MathMean regressed on logGDPp (first line) and its summary (second line). The summary gives the coefficients of the line and the \(R^2\) ('Multiple R-squared'), which – as we will see in Section 2.7 – can be regarded as an indicator of the strength of the linear relation between the variables. (\(R^2=1\) is a perfect linear fit – all the points lie on a line – and \(R^2=0\) is the poorest fit.)
The fitted regression line is MathMean \(= 185.16 + 28.79\,\times\) logGDPp. The slope coefficient is positive, which indicates that there is a positive correlation between the wealth of a country and its performance in the PISA mathematics test (this answers Q1). Hence, the evidence that wealthy countries tend to achieve higher average scores is indeed true (at least for the mathematics test). We can be more precise about the effect of the wealth of a country: according to the fitted linear model, an increase of 1 unit in the logGDPp of a country is associated with achieving, on average, 28.79 additional points in the test (Q2).
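As a sketch of what this interpretation buys us, we can plug a hypothetical GDPp into the fitted line by hand (the 25000$ value is made up for illustration):

```r
# Coefficients read from the summary above
b0 <- 185.16
b1 <- 28.79

# Expected average Math score for a hypothetical country with GDPp = 25000$
b0 + b1 * log(25000)

# With the fitted model object, predict() does the same computation:
# predict(pisaLinearModel, newdata = data.frame(logGDPp = log(25000)))
```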
Visualize the fitted regression line.
Go to 'Graphs' > 'Scatterplot...'. A window with two panels will pop up (Figures 2.4 and 2.5).
On the 'Data' panel, select the \(X\) and \(Y\) variables to be displayed in the scatterplot. On the 'Options' panel, check the 'Least-squares line' box and choose to identify '3' points 'Automatically'. This will identify the three observations that differ the most from the rest of the data.
The following R code will be generated. It produces a scatterplot of MathMean vs logGDPp, with its corresponding regression line.
scatterplot(MathMean ~ logGDPp, reg.line = lm, smooth = FALSE, spread = FALSE,
id.method = 'mahal', id.n = 3, boxplots = FALSE, span = 0.5,
ellipse = FALSE, levels = c(.5, .9),
main = "Average Math score vs. logGDPp", pch = c(16), data = pisa)
## Shanghai-China Vietnam Qatar
## 1 16 62
There are three clear outliers: Vietnam, Shanghai-China and Qatar. The first two are non-high-income economies that perform exceptionally well in the test (although Shanghai-China is a cherry-picked region of China). On the other hand, Qatar is a high-income economy with really poor scores.
We can identify countries that are above and below the linear trend in the plot. This is particularly interesting: we can assess whether a country is performing better or worse than its expected PISA score according to its economic status (this adds more insight into Q2). To do so, we want to display text labels next to the points of the scatterplot. We can take a shortcut: copy and run in the input panel the next piece of code. It is a slightly modified version of the previous code (what are the differences?).
scatterplot(MathMean ~ logGDPp, reg.line = lm, smooth = FALSE, spread = FALSE,
id.method = 'mahal', id.n = 65, id.cex = 0.75, boxplots = FALSE,
span = 0.5, ellipse = FALSE, levels = c(.5, .9),
main = "Average Math score vs. logGDPp", pch = c(16), cex = 0.75,
data = pisa)
If you understood the previous analysis, then you should be able to perform the next ones on your own.
Repeat the regression analysis (steps 2–3) for:

ReadingMean regressed on logGDPp. Are the results similar to MathMean on logGDPp?

MathMean regressed on ReadingMean. Compare it with MathMean on ScienceMean. Which pair of variables has the strongest linear relation? Is that something expected?
Save the new models with different names to avoid overwriting the previous models!
Case study II: Apportionment in the EU and US
Apportionment is the process by which seats in a legislative body are distributed among administrative divisions entitled to representation.
— Wikipedia article on Apportionment (politics)
The European Parliament and the US House of Representatives are two of the most important macro legislative bodies in the world. The distribution of seats in both chambers is designed to represent the different states that constitute the federation (US) or union (EU). Both chambers were created under very different historical and political circumstances, which is reflected in the kinds of apportionment that they present. More specifically:
In the US, the apportionment is neatly fixed by the US Constitution. Each of the 50 states is apportioned a number of seats that corresponds to its share of the total population of the 50 states, according to the most recent decennial census. Every state is guaranteed at least 1 seat. There are 435 seats.
Until now, the apportionment in the EU was set by treaties (Nice, Lisbon), in which negotiations between countries took place. The last accepted composition gives an allocation of seats based on the principle of “degressive proportionality” and somewhat vague guidelines. It concludes with a commitment to establish a system that will “allocate the seats between Member States in an objective, fair, durable and transparent way, translating the principle of degressive proportionality”. The Cambridge Compromise (Grimmett et al. 2011) was a proposal in that direction that was never effectively implemented. Currently, every state is guaranteed a minimum of 6 seats and a maximum of 96, for a grand total of 750 seats.
We know that there exist qualitative dissimilarities between both chambers, but we cannot be more specific with the description at hand. The purpose of this case study is to quantify and visualize the differences between the apportionments of the two chambers, and to show how simple linear regression can add insight into what is actually going on with the EU apportionment. The questions we want to answer are:
 Q1. Can we quantify which chamber is more proportional?
 Q2. What are the overrepresented and underrepresented states in both chambers?
 Q3. How can we quantify the ‘degressive proportionality’ in the EU apportionment system? Was the Cambridge Compromise proposing a fairer representation?
Let’s begin by reading the data:
The US_apportionment.xlsx file (download) contains the 50 US states entitled to representation. The variables are State, Population2010 (from the last census) and Seats2013.2023. This is an Excel file that we can read using 'Data' > 'Import data' > 'from Excel file...'. A window will pop up, asking for the right options. We set them as in Figure 2.6, since we want the variable State to be the case names. After clicking on 'View data set', the data should look like Figure 2.7.
The EU_apportionment.txt file (download) contains 28 rows with the member states of the EU (Country), the number of seats assigned in different years (Seats2011, Seats2014), the Cambridge Compromise apportionment (CamCom2011), and the countries' populations (Population2010, Population2013).
For this file, you should know how to:

Inspect the file in a text editor and determine its formatting.

Decide the right importation options and load it with the name EU.

Set the case names as the variable Country.
Table 2.2: The EU dataset with Country set as the case names.
Country          Population2010  Seats2011  CamCom2011  Population2013  Seats2014
Germany                81802257         99          96        80523746         96
France                 64714074         74          85        65633194         74
United Kingdom         62008048         73          81        63896071         73
Italy                  60340328         73          79        59685227         73
Spain                  45989016         54          62        46704308         54
Poland                 38167329         51          52        38533299         51
Romania                21462186         33          32        20020074         32
Netherlands            16574989         26          26        16779575         26
Greece                 11305118         22          19        11161642         21
Belgium                10839905         22          19        11062508         21
Portugal               10637713         22          18        10516125         21
Czech Republic         10506813         22          18        10487289         21
Hungary                10014324         22          18         9908798         21
Sweden                  9340682         20          17         9555893         20
Austria                 8375290         19          16         8451860         18
Bulgaria                7563710         18          15         7284552         17
Denmark                 5534738         13          12         5602628         13
Slovakia                5424925         13          12         5426674         13
Finland                 5351427         13          12         5410836         13
Ireland                 4467854         12          11         4591087         11
Croatia                 4425747         NA          NA         4262140         11
Lithuania               3329039         12          10         2971905         11
Latvia                  2248374          9           8         2058821          8
Slovenia                2046976          8           8         2023825          8
Estonia                 1340127          6           7         1324814          6
Cyprus                   803147          6           6          865878          6
Luxembourg               502066          6           6          537039          6
Malta                    412970          6           6          421364          6
We start by analyzing the US dataset. If there is indeed a direct proportionality in the apportionment, we would expect a direct, 1:1, relation between the ratio of seats and the ratio of population per state. Let’s start by constructing these variables:

Switch the active dataset to US. An alternative way to do so is by 'Data' > 'Active data set' > 'Select active data set...'.
Go to 'Data' > 'Manage variables in active data set...' > 'Compute new variable...'.

Create the variable RatioSeats2013.2023 as shown in Figure 2.8. Be careful not to overwrite the variable Seats2013.2023.

'View data set' to check that the new variable is available.

Repeat steps 1–3, conveniently adapted, to create the new variable RatioPopulation2010.
Let’s fit a regression line to the US data, with RatioSeats2013.2023 as the response and RatioPopulation2010 as the explanatory variable. If we name the model appUS, you should get the following code and output:
appUS <- lm(RatioSeats2013.2023 ~ RatioPopulation2010, data = US)
summary(appUS)
##
## Call:
## lm(formula = RatioSeats2013.2023 ~ RatioPopulation2010, data = US)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -1.118e-03 -4.955e-04  3.144e-05  4.087e-04  1.269e-03
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         0.0001066  0.0001275   0.836    0.407
## RatioPopulation2010 1.0053307  0.0042872 234.498   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0006669 on 48 degrees of freedom
## Multiple R-squared: 0.9991, Adjusted R-squared: 0.9991
## F-statistic: 5.499e+04 on 1 and 48 DF, p-value: < 2.2e-16
The fitted regression line is RatioSeats2013.2023 \(=0.000+1.005\,\times\) RatioPopulation2010 and has an \(R^2=0.9991\) ('Multiple R-squared'), which means that the data is almost perfectly linearly distributed. Furthermore, the intercept coefficient is not significant for the regression. This is seen in the column 'Pr(>|t|)', which gives the \(p\)-values for the null hypotheses \(H_0:\beta_0=0\) and \(H_0:\beta_1=0\), respectively. The null hypothesis \(H_0:\beta_0=0\) is not rejected (\(p\)-value \(=0.407\); non-significant) whereas \(H_0:\beta_1=0\) is rejected (\(p\)-value \(\approx 0\); significant). Hence, we can conclude that the apportionment of seats in the US House of Representatives is indeed directly proportional to the population of each state (this partially answers Q1).
If we make the scatterplot for the US dataset, we can see the almost perfect (up to integer rounding) 1:1 relation between the ratios “state seats”/“total seats” and “state population”/“aggregated population”. We can set the scatterplot to automatically label the '25' most atypical points (select the numeric box with the mouse and type '25' – the arrow buttons are limited to '10') with their case names. As seen in Figure 2.9, there is no state clearly over- or underrepresented (Q2).
Let’s switch to the EU dataset, for which we will focus on the 2011 variables. A quick way of visualizing this dataset and, in general, of visualizing multivariate data (up to a moderate number of dimensions) is to use a scatterplot matrix. Essentially, it displays the scatterplots between all the pairs of variables. To do so, go to 'Graphs' > 'Scatterplot matrix...' and select the variables to be displayed. If you select them as in Figures 2.10 and 2.11, you should get an output like Figure 2.12.
The scatterplot matrix has diagonal panels displaying one-variable summary plots: histogram, density estimate, boxplot and QQ-plot. Experiment and understand them.
The most interesting panels in Figure 2.12 for our study are CamCom2011 vs Population2010 – panel (1,2) – and Seats2011 vs Population2010 – panel (3,2). At first sight, it seems that the Cambridge Compromise was favoring a fairer allocation of seats than the one actually in use in the EU parliament in 2011 (recall the stepwise patterns in (3,2)). Let’s explore in depth the scatterplot of Seats2011 vs Population2010.
There are some countries clearly harmed and others clearly benefited by this apportionment. For example, France and Spain are underrepresented while, on the other hand, Germany, Hungary and the Czech Republic are overrepresented (Q2).
Let’s compute the regression line of Seats2011 on Population2010, which we save in the model appEU2011.
appEU2011 <- lm(Seats2011 ~ Population2010, data = EU)
summary(appEU2011)
##
## Call:
## lm(formula = Seats2011 ~ Population2010, data = EU)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.7031 -1.9511  0.0139  1.9799  3.2898
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    7.910e+00  5.661e-01   13.97 2.58e-13 ***
## Population2010 1.078e-06  1.915e-08   56.31  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.289 on 25 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9922, Adjusted R-squared: 0.9919
## F-statistic: 3171 on 1 and 25 DF, p-value: < 2.2e-16
The fitted line is Seats2011 \(=7.91+1.078\times10^{-6}\,\times\) Population2010. The intercept is not zero and, indeed, the fitted intercept is significantly different from zero. Therefore, there is no proportionality in the apportionment. Note that the fitted slope, despite being very small (why?), is also significantly different from zero. The \(R^2\) is slightly smaller than in the US dataset, but definitely very high. Two conclusions stem from this analysis:

The US House of Representatives is a proportional chamber whereas the EU parliament is definitely not, although it is close to perfect linearity (this completes Q1).

The principle of degressive proportionality, in practice, means an almost linear allocation of seats with respect to population (Q3). The main point is the presence of a nonzero intercept – that is, a minimum number of seats for each country – in order to overrepresent smaller countries with respect to their proportional share.
The question that remains to be answered is whether the Cambridge Compromise was favoring a fairer allocation of seats than the 2011 official agreement. In Figure 2.12 we can see that it indeed seems so, but there is an outlier outside the linear pattern: Germany. There is an explanation for that: the EU commission imposed a cap of 96 seats per country on the development of the Cambridge Compromise. With this rule, Germany is notably underrepresented.
In order to avoid this distortion, we will exclude Germany from our comparison. To do so, we specify a '-1' in the 'Subset expression' field of either 'Linear regression...' or 'Scatterplot...'. This tells R to exclude the first row of the EU dataset, corresponding to Germany. Then we compare the linear models for the official allocation, appEUNoGer2011, and the Cambridge Compromise, appCamComNoGer2011. The outputs are the following.
appEUNoGer2011 <- lm(Seats2011 ~ Population2010, data = EU, subset = -1)
summary(appEUNoGer2011)
##
## Call:
## lm(formula = Seats2011 ~ Population2010, data = EU, subset = -1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5197 -2.0722  0.2192  2.0179  3.2865
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    8.099e+00  5.638e-01   14.37 2.78e-13 ***
## Population2010 1.060e-06  2.212e-08   47.92  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.227 on 24 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9897, Adjusted R-squared: 0.9892
## F-statistic: 2296 on 1 and 24 DF, p-value: < 2.2e-16
appCamComNoGer2011 <- lm(CamCom2011 ~ Population2010, data = EU, subset = -1)
summary(appCamComNoGer2011)
##
## Call:
## lm(formula = CamCom2011 ~ Population2010, data = EU, subset = -1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.47547 -0.22598  0.01443  0.27471  0.46766
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    5.459e+00  7.051e-02   77.42   <2e-16 ***
## Population2010 1.224e-06  2.766e-09  442.41   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2784 on 24 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.957e+05 on 1 and 24 DF, p-value: < 2.2e-16
We see that the Cambridge Compromise has a larger \(R^2\) and a lower intercept than the official allocation of seats. This means that it favors a more proportional allocation, which is fairer in the sense that the deviations from the linear trend are smaller (Q3). We conclude the case study by illustrating both fits.
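Comparing two fits by their \(R^2\) takes one line with summary()$r.squared. The sketch below fabricates a noisy and a nearly exact linear relation to show the extraction; for the case study, the toy models would be replaced by appEUNoGer2011 and appCamComNoGer2011:

```r
# Fabricated data: one noisy and one nearly exact linear relation
set.seed(42)
x <- 1:20
y1 <- x + rnorm(20, sd = 2)    # noisy
y2 <- x + rnorm(20, sd = 0.1)  # nearly exact
m1 <- lm(y1 ~ x)
m2 <- lm(y2 ~ x)

# Extract and compare the R^2 of both fits
c(summary(m1)$r.squared, summary(m2)$r.squared)
```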
In 2014, a new EU apportionment was negotiated, collected in Seats2014, according to the population of 2013, Population2013, and motivated by the inclusion of Croatia in the EU. Answer these questions:

Which countries were the most favored and unfavored by such apportionment?

Was the apportionment proportional?

Was the degree of linearity higher or lower than in the 2011 apportionment? (Exclude Germany.)

Was the degree of linearity higher or lower than in the Cambridge Compromise for 2011? (Exclude Germany.)
We have performed a decent number of operations in R Commander. If we have to exit the session, we can save the data and models in an .RData file, which contains all the objects we have computed so far (but not the code – that has to be saved separately).
To exit R Commander, save all your progress and reload it later, do:

Save the .RData file. Go to ‘File’ > ‘Save R workspace as…’.

Save the .R file. Go to ‘File’ > ‘Save script as…’.

Exit R Commander + R. Go to ‘File’ > ‘Exit’ > ‘From Commander and R’. Choose not to save any file.

Start R Commander and load your files: the .RData file in ‘Data’ > ‘Load data set…’, and the .R file in ‘File’ > ‘Open script file…’.
If you just want to save a dataset, you have two options:

‘Data’ > ‘Active data set’ > ‘Save active data set…’: it will be saved as an .RData file. This is the easiest way of importing it back into R.

‘Data’ > ‘Active data set’ > ‘Export active data set…’: it will be saved as a text file with the format that you choose. Useful for exporting data to other programs.
References
OECD. 2012b. PISA 2012 Results: What Students Know and Can Do (Volume I, Revised Edition, February 2014): Student Performance in Mathematics, Reading and Science. Paris: OECD Publishing. doi:10.1787/9789264208780-en.
Computing simple linear regressions
Import the iris dataset, either from iris.txt or datasets.RData. This dataset contains measurements for 150 iris flowers. The purpose of this exercise is to do the following analyses through R Commander while inspecting and understanding the output code, identifying what parts change, how and why.
 Fit the regression line for Petal.Width (response) on Petal.Length (predictor) and summarize it.
 Make the scatterplot of Petal.Width (y) vs Petal.Length (x) with a regression line.
 Set the ‘Graph title’ to “iris dataset: petal width vs petal length”, the ‘x-axis label’ to “petal length” and the ‘y-axis label’ to “petal width”.
 Identify the 5 most outlying points ‘Automatically’.
 Redo the linear regression and scatterplot excluding the points labelled as outliers (exclude them in ‘Subset expression’ with a c(…)).
 Check that the summary of the fitted line and the scatterplot displayed are coherent.
 Make the scatterplot matrix for the four variables, including ‘Least-squares lines’.
 Set the ‘On Diagonal’ plots to ‘Histograms’ and ‘Boxplots’.
 Set the ‘Graph title’ to “iris matrix scatterplot”.
 Identify the 5 most outlying points ‘Automatically’.
 Modify the code to identify 15 points.
 Compute the regression line for the plot in the third row and fourth column and create the scatterplot for it.
 Redo the scatterplot by selecting the option ‘Plot by groups…’ and then selecting ‘Species’.
The last scatterplots are an illustration of Simpson’s paradox. The paradox arises when there are two or more well-defined groups in the data that all have positive (negative) correlation but, when the dataset is taken as a whole, the correlation is the opposite.
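A minimal sketch of the paradox with fabricated data: two groups, each with a clearly positive correlation, whose pooled correlation is negative.

```r
# Two fabricated groups; within each, y grows with x (positive correlation),
# but the groups are shifted so that the pooled trend is decreasing
set.seed(1)
g1x <- rnorm(50, mean = 0); g1y <- g1x + rnorm(50, sd = 0.3) + 5
g2x <- rnorm(50, mean = 5); g2y <- g2x + rnorm(50, sd = 0.3) - 5

cor(g1x, g1y)                  # clearly positive
cor(g2x, g2y)                  # clearly positive
cor(c(g1x, g2x), c(g1y, g2y))  # pooled correlation is negative
```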
R
basics
Answer briefly in your own words:
 What is the operator <- doing? What does it mean, for example, that a <- 1:5?
 What is the difference between a matrix and a data frame? Why is the latter useful?
 What are the differences between a vector and a matrix?
 What is c employed for?
 Consider the expression lm(a ~ b, data = myData). What does lm stand for? What does a ~ b mean? What are the roles of a and b? Is myData a matrix or a data frame? What must be the relation between myData and a and b? Explain the differences with lm(b ~ a, data = myData).
 What are the differences between running a <- 1; a, a <- 1 and 1?
.  What are the differences between a list and a data frame? What are their common parts?
 Why is $ employed? How can you know for which variables you can use $?
 If you have a vector x, what are x^2 and x + 1 doing to its elements?
Do the following:
 Create the vectors \(x = (1.17, 0.41, 0.34, 1.11, 1.02, 0.22, 0.24, 0.27, 0.40, 1.38)\) and \(y = (3.63, 1.69, 0.27, 5.83, 2.64, 1.33, 1.22, 0.62, 1.29, 0.43)\).
 Set the positions 3, 4 and 8 of \(x\) to 0. Set the positions 1, 4, 9 of \(y\) to 0.5, 0.75 and 0.3, respectively.
 Create a new vector \(z\) containing \(\log(x^2) - y^3\sqrt{\exp(x)}\).
 Create the vector \(t=(1, 4, 9, 16, 25, \ldots, 100)\).
 Access all the elements of \(t\) except the third and fifth.
 Create the matrix \(A=\begin{pmatrix}1 & 3\\0 & 2\end{pmatrix}\). Hint: use rbind or cbind.
 Using \(A\), what is a short way (least amount of code) of computing \(B=\begin{pmatrix}1+\sqrt{2}\sin(3) & 3+\sqrt{2}\sin(3)\\0+\sqrt{2}\sin(3) & 2+\sqrt{2}\sin(3)\end{pmatrix}\)?
 Compute A * B. Check that it makes sense with the results of A[1, 1] * B[1, 1], A[1, 2] * B[1, 2], A[2, 1] * B[2, 1] and A[2, 2] * B[2, 2]. Why?
 Create a data frame named worldPopulation such that: the first variable is called Year and contains the values c(1915, 1925, 1935, 1945, 1955, 1965, 1975, 1985, 1995, 2005, 2015); the second variable is called Population and contains the values c(1798.0, 1952.3, 2197.3, 2366.6, 2758.3, 3322.5, 4061.4, 4852.5, 5735.1, 6519.6, 7349.5).
 Write names(worldPopulation). Access the two variables.
 Create a new variable in worldPopulation called logPopulation that contains log(Population).
 Compute the standard deviation, mean and median of the variables in worldPopulation.
 Regress logPopulation on Year. Save the result as mod.
 Compute the summary of the model and save it as sumMod.
 Do a str on A, worldPopulation, mod and sumMod.
 Access the \(R^2\) and \(\hat\sigma\) in sumMod.
 Check that \(R^2\) is the same as the squared correlation between predictor and response, and also the squared correlation between the response and mod$fitted.values.
Assumptions of the linear model
The dataset moreAssumptions.RData (download) contains the variables x1, …, x9 and y1, …, y9. For each regression y1 ~ x1, …, y9 ~ x9, describe whether the assumptions of the linear model are satisfied or not. Justify your answer and state which assumption(s) you think are violated.
Nonlinear relations
Load moreAssumptions. For the regressions y1 ~ x1, y2 ~ x2, y6 ~ x6 and y9 ~ x9, identify which nonlinear transformation yields the largest \(R^2\). For that transformation, check whether the assumptions are verified.

Hints: use the transformations in Figure 2.26 for the first three regressions. For y9 ~ x9, try with (5 - x9)^2, abs(x9 - 5) and abs(x9 - 5)^3.
Case study: Moore’s law
Moore’s law (Moore 1965) is an empirical law that states that the power of a computer doubles approximately every two years. More precisely:
Moore’s law is the observation that the number of transistors in a dense integrated circuit [e.g. a CPU] doubles approximately every two years.
— Wikipedia article on Moore’s law
Translated into a mathematical formula, Moore’s law is
\[\begin{align*}\text{transistors}\approx 2^{\text{years}/2}.\end{align*}\]Applying logarithms to both sides gives (why?)
\[\begin{align*}\log(\text{transistors})\approx \frac{\log(2)}{2}\text{years}.\end{align*}\]We can write the above formula more generally
\[\begin{align*}\log(\text{transistors})=\beta_0+\beta_1 \text{years}+\varepsilon,\end{align*}\]where
\(\varepsilon\) is a random error. This is a linear model!
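A sketch of the law as a linear model, with simulated data (the year range and the noise level are made up): fitting lm to data generated from the formula above recovers a slope close to \(\log(2)/2\approx 0.347\).

```r
# Simulate Moore's law: log-transistor counts grow linearly in time
set.seed(1)
years <- 0:40                                               # years since a reference date
logTransistors <- log(2) / 2 * years + rnorm(41, sd = 0.5)  # linear trend + random error
fit <- lm(logTransistors ~ years)

coef(fit)["years"]   # estimated slope, close to log(2)/2
```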
The dataset cpus.txt (download) contains the transistor counts for the CPUs that appeared in the period 1971–2015. For this data, do the following:
 Import the data conveniently and name it cpus.
 Show a scatterplot of Transistor.count vs Date.of.introduction with a linear regression.
 Are the assumptions verified in Transistor.count ~ Date.of.introduction? Which ones are more “problematic”?
 Create a new variable, named Log.Transistor.count, containing the logarithm of Transistor.count.
 Show a scatterplot of Log.Transistor.count vs Date.of.introduction with a linear regression.
 Are the assumptions verified in Log.Transistor.count ~ Date.of.introduction? Which ones are more “problematic”?
 Regress Log.Transistor.count ~ Date.of.introduction.
 Summarize the fit. What are the estimates \(\hat\beta_0\) and \(\hat\beta_1\)? Is \(\hat\beta_1\) close to \(\frac{\log(2)}{2}\)?
 Compute the CI for \(\beta_1\) at \(\alpha=0.05\). Is \(\frac{\log(2)}{2}\) inside it? What happens at levels \(\alpha=0.10, 0.01\)?
 We want to forecast the average log-number of transistors for the CPUs to be released in 2017. Compute the adequate prediction and CI.
 A new CPU design is expected for 2017. What is the range of log-number of transistors expected for it, at a 95% level of confidence?
 Compute the ANOVA table for Log.Transistor.count ~ Date.of.introduction. Is \(\beta_1\) significant?
The dataset gpus.txt (download) contains the transistor counts for the GPUs that appeared in the period 1997–2016. Repeat the previous analysis for this dataset.
Case study: Growth in a time of debt
In the aftermath of the 2007–2008 financial crisis, the paper Growth in a time of debt (Reinhart and Rogoff 2010), by Carmen M. Reinhart and Kenneth Rogoff (both at Harvard), provided important economic support for pro-austerity policies. The paper claimed that for levels of external debt in excess of 90% of the GDP, the GDP growth of a country was dramatically different than for lower levels of external debt. Therefore, it concluded the existence of a magical threshold – 90% – below which the level of external debt must be kept in order to have a growing economy. Figure 2.29, extracted from Reinhart and Rogoff (2010), illustrates the main finding.
Herndon, Ash, and Pollin (2013) replicated the analysis of Reinhart and Rogoff (2010) and found that “selective exclusion of available data, coding errors and inappropriate weighting of summary statistics lead to serious miscalculations that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies”. The authors concluded that “both mean and median GDP growth when public debt levels exceed 90% of GDP are not dramatically different from when the public debt/GDP ratios are lower”. As a consequence, Reinhart and Rogoff (2010) led to unjustified support for the adoption of austerity policies for countries with various levels of public debt.
You can read the full story at BBC, The New York Times and The Economist. Also, the video in Figure 2.30 contains a quick summary of the story by Nobel Prize laureate Paul Krugman.
Herndon, Ash, and Pollin (2013) made the data of the study publicly available. You can download it here.
The dataset hap.txt
(download) contains data for 20 advanced economies in the time period 1946–2009, and is the data source for the aforementioned papers. The variable dRGDP
represents the real GDP growth (as a percentage) and debtgdp
represents the percentage of public debt with respect to the GDP.
 Import the data and save it as
hap
.  Set the case names of
hap
as Country.Year
.  Summarize
dRGDP
and debtgdp
. What are their minimum and maximum values?  What is the correlation between
dRGDP
and debtgdp
? What is the standard deviation of each variable?  Show the scatterplot of
dRGDP
vs debtgdp
with the regression line. Is this coherent with what was stated in the video at 1:30?  Do you see any gap in the data around 90%? Is there any substantial change for
dRGDP
around there?  Compute the linear regression of
dRGDP
on debtgdp
and summarize the fit.  What are the fitted coefficients? What are their standard errors? What is the R^{2}?
 Compute the ANOVA table. How many degrees of freedom are there? What is the SSR? What is the SSE? What is the p-value for H_{0} : β_{1} = 0?
 Is SSR larger than SSE? Is this coherent with the resulting R^{2}?
 Are β_{0} and β_{1} significant for the regression at level α = 0.05? And at level α = 0.10, 0.01?
 Compute the CIs for the coefficients. Can we conclude that the effect of
debtgdp
on dRGDP
is positive at α = 0.05? And negative?  Predict the average growth for levels of debt of 60%, 70%, 80%, 90%, 100% and 110%. Compute the 95% CIs for all of them.
 Predict the growth for the previous levels of debt. Compute also the CIs for them. Is there a marked difference in the CIs for debt levels below and above 90%?
 Which assumptions of the linear model do you think are satisfied? Should we blindly trust the inferential results obtained under the assumption that they were satisfied?
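The quantities asked for in the last questions can be obtained directly from the fitted model. A minimal sketch, assuming the data was imported as hap:

```r
# Sketch, assuming the data was imported as `hap` with variables dRGDP and debtgdp
mod <- lm(dRGDP ~ debtgdp, data = hap)
tab <- anova(mod)                  # ANOVA table: Df, Sum Sq, F value, Pr(>F)
ssr <- tab["debtgdp", "Sum Sq"]    # regression sum of squares (SSR)
sse <- tab["Residuals", "Sum Sq"]  # residual sum of squares (SSE)
ssr / (ssr + sse)                  # coincides with the R^2 given by summary(mod)
```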
Multiple linear regression
The multiple linear regression is an extension of the simple linear regression seen in Chapter 2. Where the simple linear regression employed a single predictor \(X\) to explain the response \(Y\), the multiple linear regression employs multiple predictors \(X_1,\ldots,X_k\) for explaining a single response \(Y\): \[Y=\beta_0+\beta_1X_1+\beta_2X_2+\ldots+\beta_kX_k+\varepsilon\] To see why it is useful, let’s begin by seeing what it can deliver in real-case scenarios!
Examples and applications
Case study I: The Bordeaux equation
Calculate the winter rain and the harvest rain (in millimeters). Add summer heat in the vineyard (in degrees centigrade). Subtract 12.145. And what do you have? A very, very passionate argument over wine.
— “Wine Equation Puts Some Noses Out of Joint”, The New York Times, 04/03/1990
This case study is motivated by the study of Princeton professor Orley Ashenfelter (Ashenfelter, Ashmore, and Lalonde 1995) on the quality of red Bordeaux vintages. The study became mainstream after disputes with the wine press, especially with Robert Parker, Jr., one of the most influential wine critics in America. See a short review of the story at the Financial Times (Google’s cache) and in the video in Figure 3.3.
Red Bordeaux wines have been produced in Bordeaux, one of the most famous and prolific wine regions in the world, in a very similar way for hundreds of years. However, the quality of vintages varies largely from one season to another due to a long list of random factors, such as the weather conditions. Because Bordeaux wines taste better when they are older (young wines are astringent; as wines age, they lose their astringency), there is an incentive to store the young wines until they are mature. Due to this important difference in taste, it is hard to determine the quality of the wine just by tasting it when it is so young, because it will change substantially by the time the aged wine reaches the market. Therefore, being able to predict the quality of a vintage is valuable information for investing resources, for determining a fair price for vintages, and for understanding what factors affect the wine quality. The purpose of this case study is to answer:
 Q1. Can we predict the quality of a vintage effectively?
 Q2. What is the interpretation of such prediction?
The wine.csv
file (download) contains 27 red Bordeaux vintages. The data is the one originally employed by Ashenfelter, Ashmore, and Lalonde (1995), except for the inclusion of the variable Year
, the exclusion of NAs and the reference price used for the wine. The original source is here. Each row has the following variables:
Year
: year in which grapes were harvested to make wine.Price
: logarithm of the average market price for Bordeaux vintages according to 1990–1991 auctions. This is a nonlinear transformation of the response (hence different from what we did in Section 2.8) made to linearize the response.WinterRain
: winter rainfall (in mm).AGST
: Average Growing Season Temperature (in Celsius degrees).HarvestRain
: harvest rainfall (in mm).Age
: age of the wine measured as the number of years stored in a cask.FrancePop
: population of France at Year
(in thousands).
The quality of the wine is quantified as the Price
, a clever way of quantifying a qualitative measure. The data is shown in Table 3.1.
Table 3.1: First 15 rows of the wine
dataset.
Year  Price  WinterRain  AGST  HarvestRain  Age  FrancePop
1952  7.4950  600  17.1167  160  31  43183.57 
1953  8.0393  690  16.7333  80  30  43495.03 
1955  7.6858  502  17.1500  130  28  44217.86 
1957  6.9845  420  16.1333  110  26  45152.25 
1958  6.7772  582  16.4167  187  25  45653.81 
1959  8.0757  485  17.4833  187  24  46128.64 
1960  6.5188  763  16.4167  290  23  46584.00 
1961  8.4937  830  17.3333  38  22  47128.00 
1962  7.3880  697  16.3000  52  21  48088.67 
1963  6.7127  608  15.7167  155  20  48798.99 
1964  7.3094  402  17.2667  96  19  49356.94 
1965  6.2518  602  15.3667  267  18  49801.82 
1966  7.7443  819  16.5333  86  17  50254.97 
1967  6.8398  714  16.2333  118  16  50650.41 
1968  6.2435  610  16.2000  292  15  51034.41 
Let’s begin by summarizing the information in Table 3.1. First, correctly import the dataset into R Commander
and 'Set case names...'
as the variable Year
. Let’s summarize and inspect the data in two ways:
Numerically. Go to 'Statistics' > 'Summaries' > 'Active data set'
.
summary(wine)
##      Price          WinterRain         AGST        HarvestRain
##  Min.   :6.205   Min.   :376.0   Min.   :14.98   Min.   : 38.0
##  1st Qu.:6.508   1st Qu.:543.5   1st Qu.:16.15   1st Qu.: 88.0
##  Median :6.984   Median :600.0   Median :16.42   Median :123.0
##  Mean   :7.042   Mean   :608.4   Mean   :16.48   Mean   :144.8
##  3rd Qu.:7.441   3rd Qu.:705.5   3rd Qu.:17.01   3rd Qu.:185.5
##  Max.   :8.494   Max.   :830.0   Max.   :17.65   Max.   :292.0
##       Age           FrancePop
##  Min.   : 3.00   Min.   :43184
##  1st Qu.: 9.50   1st Qu.:46856
##  Median :16.00   Median :50650
##  Mean   :16.19   Mean   :50085
##  3rd Qu.:22.50   3rd Qu.:53511
##  Max.   :31.00   Max.   :55110
Additionally, other summary statistics are available in 'Statistics' > 'Summaries' > 'Numerical summaries...'
.
Graphically. Make a scatterplot matrix with all the variables. Add the 'Least-squares lines'
, 'Histograms'
on the diagonals and choose to identify 2 points.
scatterplotMatrix(~ Age + AGST + FrancePop + HarvestRain + Price + WinterRain,
                  reg.line = lm, smooth = FALSE, spread = FALSE, span = 0.5,
                  ellipse = FALSE, levels = c(.5, .9), id.n = 2,
                  diagonal = 'histogram', data = wine)
Recall that the objective is to predict Price
. Based on the above scatterplot matrix, the best way to predict Price
by a simple linear regression seems to be with AGST
or HarvestRain
. Let’s see which one yields the larger \(R^2\).
modAGST <- lm(Price ~ AGST, data = wine)
summary(modAGST)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.78370 -0.23827  0.03421  0.29973  0.90198
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -3.5469     2.3641  -1.500 0.146052
## AGST          0.6426     0.1434   4.483 0.000143 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4819 on 25 degrees of freedom
## Multiple R-squared:  0.4456, Adjusted R-squared:  0.4234
## F-statistic: 20.09 on 1 and 25 DF,  p-value: 0.0001425

modHarvestRain <- lm(Price ~ HarvestRain, data = wine)
summary(modHarvestRain)
##
## Call:
## lm(formula = Price ~ HarvestRain, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.03792 -0.27679  0.07892  0.40434  1.21958
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.679856   0.241911  31.747  < 2e-16 ***
## HarvestRain -0.004405   0.001497  -2.942  0.00693 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5577 on 25 degrees of freedom
## Multiple R-squared:  0.2572, Adjusted R-squared:  0.2275
## F-statistic: 8.658 on 1 and 25 DF,  p-value: 0.00693
In Price ~ AGST
, the intercept is not significant for the regression but the slope is, and AGST
has a positive effect on the Price
. For Price ~ HarvestRain
, both intercept and slope are significant and the effect is negative.
Complete the analysis by computing the linear models Price ~ FrancePop
, Price ~ Age
and Price ~ WinterRain
. Name them as modFrancePop
, modAge
and modWinterRain
. Check if the intercepts and slopes are significant for the regression.
If we do the simple regressions of Price
on the remaining predictors, we obtain a table like this for the \(R^2\):
AGST  \(0.4456\) 
HarvestRain  \(0.2572\) 
FrancePop  \(0.2314\) 
Age  \(0.2120\) 
WinterRain  \(0.0181\) 
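The table above can be reproduced in a single call. A sketch (it assumes wine is loaded with Year set as the case names):

```r
# R^2 of each simple linear regression Price ~ predictor
predictors <- c("AGST", "HarvestRain", "FrancePop", "Age", "WinterRain")
r2 <- sapply(predictors, function(pred)
  summary(lm(reformulate(pred, response = "Price"), data = wine))$r.squared)
round(sort(r2, decreasing = TRUE), 4)
```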
A natural question to ask is:
Can we combine these simple regressions to increase both the \(R^2\) and the prediction accuracy for Price
?
The answer is yes, by means of the multiple linear regression. In order to make our first one, go to 'Statistics' > 'Fit models' > 'Linear model...'
. A window like Figure 3.2 will pop up.
Set the response as
Price
and add the rest of variables as predictors, in the form
Age + AGST + FrancePop + HarvestRain + WinterRain
. Note the
use of +
for including all the predictors. This does
not mean that they are all summed and then the regression is done on the sum! Instead, this notation is designed to
resemble the multiple linear model:
\[\begin{align*}Y=\beta_0+\beta_1X_1+\beta_2X_2+\ldots+\beta_kX_k+\varepsilon\end{align*}\]If the model is named modWine1
, we get the following summary when clicking on 'OK'
:
modWine1 <- lm(Price ~ Age + AGST + FrancePop + HarvestRain + WinterRain, data = wine)
summary(modWine1)
##
## Call:
## lm(formula = Price ~ Age + AGST + FrancePop + HarvestRain + WinterRain,
##     data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.46541 -0.24133  0.00413  0.18974  0.52495
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.343e+00  7.697e+00  -0.304  0.76384
## Age          1.377e-02  5.821e-02   0.237  0.81531
## AGST         6.144e-01  9.799e-02   6.270 3.22e-06 ***
## FrancePop   -2.213e-05  1.268e-04  -0.175  0.86313
## HarvestRain -3.837e-03  8.366e-04  -4.587  0.00016 ***
## WinterRain   1.153e-03  4.991e-04   2.311  0.03109 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.293 on 21 degrees of freedom
## Multiple R-squared:  0.8278, Adjusted R-squared:  0.7868
## F-statistic: 20.19 on 5 and 21 DF,  p-value: 2.232e-07
The main difference with simple linear regressions is that we have more rows in the 'Coefficients'
section, since these correspond to each of the predictors. The fitted regression is Price
\(= -2.343 + 0.013\,\times\) Age
\(+ 0.614\,\times\) AGST
\(- 0.000\,\times\) FrancePop
\(- 0.003\,\times\) HarvestRain
\(+ 0.001\,\times\) WinterRain
. Note that the 'Multiple R-squared'
has almost doubled with respect to the best simple linear regression! This tells us that we can explain up to \(82.78\%\) of the Price
variability by the predictors.
Note however that many predictors are not significant for the regression: FrancePop
, Age
and the intercept. This is an indication of an excess of predictors adding little information to the response. Note the almost perfect correlation between FrancePop
and Age
shown in Figure 3.1: one of them is not adding any extra information to explain Price
. This complicates the model unnecessarily and, more importantly, it has the undesirable effect of making the coefficient estimates less precise. We opt to remove the predictor FrancePop
from the model since it is exogenous to the wine context.
Two useful tips about lm
’s syntax for including/excluding predictors faster:
Price ~ .
includes all the variables in the dataset as predictors. It is equivalent to Price ~ Age + AGST + FrancePop + HarvestRain + WinterRain
.Price ~ . - FrancePop
includes all the variables except the ones preceded by -
as predictors. It is equivalent to Price ~ Age + AGST + HarvestRain + WinterRain
.
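As a quick check of the equivalence (assuming Year was set as the case names, so it is no longer a column of wine), the two syntaxes below fit exactly the same model:

```r
# The explicit formula and the `. -` shorthand give identical fits
mod1 <- lm(Price ~ Age + AGST + HarvestRain + WinterRain, data = wine)
mod2 <- lm(Price ~ . - FrancePop, data = wine)
all.equal(sort(coef(mod1)), sort(coef(mod2)))  # TRUE
```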
Then, the model without FrancePop
is
modWine2 <- lm(Price ~ . - FrancePop, data = wine)
summary(modWine2)
##
## Call:
## lm(formula = Price ~ . - FrancePop, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.46024 -0.23862  0.01347  0.18601  0.53443
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6515703  1.6880876  -2.163  0.04167 *
## WinterRain   0.0011667  0.0004820   2.420  0.02421 *
## AGST         0.6163916  0.0951747   6.476 1.63e-06 ***
## HarvestRain -0.0038606  0.0008075  -4.781 8.97e-05 ***
## Age          0.0238480  0.0071667   3.328  0.00305 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared:  0.8275, Adjusted R-squared:  0.7962
## F-statistic: 26.39 on 4 and 22 DF,  p-value: 4.057e-08
All the coefficients are significant at level \(\alpha=0.05\). Therefore, there is no clear redundant information. In addition, the \(R^2\) is very similar to that of the full model, but the 'Adjusted R-squared'
, a weighting of the \(R^2\) that accounts for the number of predictors used by the model, is slightly larger. This means that, relative to the number of predictors used, modWine2
explains more variability of Price
than modWine1
. Later in this chapter we will see the precise meaning of the adjusted \(R^2\).
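Both quantities can also be extracted programmatically from the fitted models (the values below are the ones reported in the summaries):

```r
# R^2 and adjusted R^2 of both fits
summary(modWine1)$r.squared      # 0.8278 (5 predictors)
summary(modWine1)$adj.r.squared  # 0.7868
summary(modWine2)$r.squared      # 0.8275 (4 predictors)
summary(modWine2)$adj.r.squared  # 0.7962, larger: better fit per predictor used
```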
The comparison of the coefficients of both models can be done with 'Models' > 'Compare model coefficients...'
:
compareCoefs(modWine1, modWine2)
##
## Call:
## 1: lm(formula = Price ~ Age + AGST + FrancePop + HarvestRain +
##   WinterRain, data = wine)
## 2: lm(formula = Price ~ . - FrancePop, data = wine)
##                Est. 1     SE 1    Est. 2     SE 2
## (Intercept) -2.34e+00 7.70e+00 -3.65e+00 1.69e+00
## Age          1.38e-02 5.82e-02  2.38e-02 7.17e-03
## AGST         6.14e-01 9.80e-02  6.16e-01 9.52e-02
## FrancePop   -2.21e-05 1.27e-04
## HarvestRain -3.84e-03 8.37e-04 -3.86e-03 8.07e-04
## WinterRain   1.15e-03 4.99e-04  1.17e-03 4.82e-04
Note how the coefficients for modWine2
have smaller standard errors than those of modWine1
.
As a conclusion, modWine2
is a model that explains \(82.75\%\) of the variability in a nonredundant way and with all its coefficients significant. Therefore, we have a formula for effectively explaining and predicting the quality of a vintage (this answers Q1).
The interpretation of modWine2
agrees with well-known facts in viticulture that make perfect sense (Q2):
 Higher temperatures are associated with better quality (higher priced) wine.
 Rain before the growing season is good for the wine quality, but during harvest is bad.
 The quality of the wine improves with age.
Although these were known facts, keep in mind that the model allows us to quantify the effect of each variable on the wine quality and provides us with a precise way of predicting the quality of future vintages.
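For example, we could predict the (log-)price of a hypothetical vintage with modWine2; the weather and age values below are made up purely for illustration:

```r
# Predicting a hypothetical vintage with modWine2 (illustrative values)
newVintage <- data.frame(WinterRain = 600, AGST = 17, HarvestRain = 100, Age = 25)
predict(modWine2, newdata = newVintage, interval = "confidence")
```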
Create a new variable in wine
named PriceOrley
, defined as Price - 8.4937
. Check that the model PriceOrley ~ . - FrancePop - Price
essentially coincides with the formula given in the second paragraph of the Financial Times article (Google’s cache). (There are a couple of typos in the article’s formula: the Age
term is missing and the ACGS
coefficient has an extra zero. We emailed the author, and his answer was: “Thanks for the heads up on this. Ian Ayres.”.)
Case study II: Housing values in Boston
The second case study is motivated by Harrison and Rubinfeld (1978), who proposed a hedonic model for determining the willingness of house buyers to pay for clean air. A hedonic model decomposes the price of an item into separate components that determine it. For example, a hedonic model for the price of a house may decompose it into the house characteristics, the kind of neighborhood, and the location. The study of Harrison and Rubinfeld (1978) employed data from the Boston metropolitan area, containing 506 suburbs and 14 variables. The Boston
dataset is available through the file Boston.xlsx
file (download) and through the dataset Boston
in the MASS
package (load MASS
by 'Tools' > 'Load package(s)...'
).
The description of the related variables can be found in ?Boston
and Harrison and Rubinfeld (1978), but we summarize here the most important ones as they appear in Boston
. They are aggregated into five topics:
 Dependent variable:
medv
, the median value of owneroccupied homes (in thousands of dollars).  Structural variables indicating the house characteristics:
rm
(average number of rooms “in owner units”) and age
(proportion of owneroccupied units built prior to 1940).  Neighborhood variables:
crim
(crime rate), zn
(proportion of residential areas), indus
(proportion of nonretail business area), chas
(river limitation), tax
(cost of public services in each community), ptratio
(pupil-teacher ratio), black
(variable \(1000(B - 0.63)^2\), where \(B\) is the black proportion of population – low and high values of \(B\) increase housing prices) and lstat
(percent of lower status of the population).  Accessibility variables:
dis
(distances to five Boston employment centers) and rad
(accessibility to radial highways – larger index denotes better accessibility).  Air pollution variable:
nox
, the annual concentration of nitrogen oxide (in parts per ten million).
A summary of the data is shown below:
summary(Boston)
##       crim                zn             indus            chas
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000
##       nox               rm             age              dis
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127
##       rad              tax           ptratio          black
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90
##      lstat            medv
##  Min.   : 1.73   Min.   : 5.00
##  1st Qu.: 6.95   1st Qu.:17.02
##  Median :11.36   Median :21.20
##  Mean   :12.65   Mean   :22.53
##  3rd Qu.:16.95   3rd Qu.:25.00
##  Max.   :37.97   Max.   :50.00
The two goals of this case study are:
 Q1. Quantify the influence of the predictor variables in the housing prices.
 Q2. Obtain the “best possible” model for decomposing the housing variables and interpret it.
We begin by making an exploratory analysis of the data with a matrix scatterplot. Since the number of variables is high, we opt to plot only five variables: crim
, dis
, medv
, nox
and rm
. Each of them represents one of the five topics into which the variables were classified.
scatterplotMatrix(~ crim + dis + medv + nox + rm, reg.line = lm, smooth = FALSE,
                  spread = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5, .9),
                  id.n = 0, diagonal = 'density', data = Boston)
The diagonal panels are showing an estimate of the unknown density of each variable. Note the peculiar distribution of crim
, very concentrated at zero, and the asymmetry in medv
, with a second mode associated with the most expensive properties. Inspecting the individual panels, it is clear that some nonlinearity exists in the data. For simplicity, we disregard that analysis for the moment (but see the final exercise).
Let’s fit a multiple linear regression for explaining medv
. There are a good number of variables now, and some of them might be of little use for predicting medv
. However, there is no clear intuition of which predictors will yield better explanations of medv
with the information at hand. Therefore, we can start by doing a linear model on all the predictors:
modHouse <- lm(medv ~ ., data = Boston)
summary(modHouse)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.595  -2.730  -0.518   1.777  26.199
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288
## chas         2.687e+00  8.616e-01   3.118 0.001925 **
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
There are a couple of nonsignificant variables, but so far the model has an \(R^2=0.74\) and the fitted coefficients are sensible given what would be expected. For example, crim
, tax
, ptratio
and nox
have negative effects on medv
, while rm
, rad
and chas
have positive ones. However, the nonsignificant coefficients do not improve the model significantly; they only add artificial noise and decrease the overall accuracy of the coefficient estimates!
Let’s polish the previous model a little. Instead of manually removing each nonsignificant variable to reduce the complexity, we employ an automatic tool in R
called stepwise model selection. It has different flavors, which we will see in detail in Section 3.7, but essentially this powerful tool usually ends up selecting “a” best model: a model that delivers the maximum fit with the minimum number of variables.
The stepwise model selection is located at 'Models' > 'Stepwise model selection...'
and is always applied on the active model. Apply it with the default options to modBest
:
modBest <- stepwise(modHouse, direction = 'backward/forward', criterion = 'BIC')
##
## Direction:  backward/forward
## Criterion:  BIC
##
## Start:  AIC=1648.81
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
##     tax + ptratio + black + lstat
##
##           Df Sum of Sq   RSS    AIC
## - age      1      0.06 11079 1642.6
## - indus    1      2.52 11081 1642.7
## <none>                 11079 1648.8
## - chas     1    218.97 11298 1652.5
## - tax      1    242.26 11321 1653.5
## - crim     1    243.22 11322 1653.6
## - zn       1    257.49 11336 1654.2
## - black    1    270.63 11349 1654.8
## - rad      1    479.15 11558 1664.0
## - nox      1    487.16 11566 1664.4
## - ptratio  1   1194.23 12273 1694.4
## - dis      1   1232.41 12311 1696.0
## - rm       1   1871.32 12950 1721.6
## - lstat    1   2410.84 13490 1742.2
##
## Step:  AIC=1642.59
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
##     ptratio + black + lstat
##
##           Df Sum of Sq   RSS    AIC
## - indus    1      2.52 11081 1636.5
## <none>                 11079 1642.6
## - chas     1    219.91 11299 1646.3
## - tax      1    242.24 11321 1647.3
## - crim     1    243.20 11322 1647.3
## - zn       1    260.32 11339 1648.1
## - black    1    272.26 11351 1648.7
## + age      1      0.06 11079 1648.8
## - rad      1    481.09 11560 1657.9
## - nox      1    520.87 11600 1659.6
## - ptratio  1   1200.23 12279 1688.4
## - dis      1   1352.26 12431 1694.6
## - rm       1   1959.55 13038 1718.8
## - lstat    1   2718.88 13798 1747.4
##
## Step:  AIC=1636.48
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
##     black + lstat
##
##           Df Sum of Sq   RSS    AIC
## <none>                 11081 1636.5
## - chas     1    227.21 11309 1640.5
## - crim     1    245.37 11327 1641.3
## - zn       1    257.82 11339 1641.9
## - black    1    270.82 11352 1642.5
## + indus    1      2.52 11079 1642.6
## - tax      1    273.62 11355 1642.6
## + age      1      0.06 11081 1642.7
## - rad      1    500.92 11582 1652.6
## - nox      1    541.91 11623 1654.4
## - ptratio  1   1206.45 12288 1682.5
## - dis      1   1448.94 12530 1692.4
## - rm       1   1963.66 13045 1712.8
## - lstat    1   2723.48 13805 1741.5
Note the different steps: the procedure starts with the full model and, at each step, the rows preceded by - (+) correspond to excluding (including) that variable. The procedure seeks to minimize an information criterion (BIC or AIC). An information criterion balances the fitness of a model with the number of predictors employed. Hence, it determines objectively the best model: the one that minimizes the information criterion. Remember to save the output to a variable if you want to keep the final model (you need to do this in R
)!
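As a preview of Section 3.7: for a linear model with \(p\) predictors fitted on \(n\) observations, the criteria employed by the stepwise selection are, up to an additive constant, \[\begin{align*}\text{AIC}=n\log\left(\frac{\text{RSS}}{n}\right)+2(p+1),\quad\text{BIC}=n\log\left(\frac{\text{RSS}}{n}\right)+\log(n)(p+1),\end{align*}\] so both penalize a large RSS and a large number of predictors, the BIC more heavily whenever \(\log(n)>2\).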
The summary of the final model is:
summary(modBest)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
##     tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.5984  -2.7386  -0.5046   1.7273  26.2373
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 **
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 **
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16
Let’s compute the confidence intervals at level \(\alpha=0.05\):
confint(modBest)
##                     2.5 %        97.5 %
## (Intercept)  26.384649126  46.29764088
## crim         -0.172817670  -0.04400902
## zn            0.019275889   0.07241397
## chas          1.040324913   4.39710769
## nox         -24.321990312 -10.43005655
## rm            3.003258393   4.59989929
## dis          -1.857631161  -1.12779176
## rad           0.175037411   0.42417950
## tax          -0.018403857  -0.00515209
## ptratio      -1.200109823  -0.69293932
## black         0.004037216   0.01454447
## lstat        -0.615731781  -0.42937513
We have quantified the influence of the predictor variables on the housing prices (Q1) and we can conclude that, in the final model and at significance level \(\alpha=0.05\):
zn
, chas
, rm
, rad
and black
have a significantly positive influence on medv
.crim
, nox
, dis
, tax
, ptratio
and lstat
have a significantly negative influence on medv
.
The model employed in Harrison and Rubinfeld (1978) is different from modBest
. In the paper, several nonlinear transformations of the predictors (remember Section 2.8) and the response are done to improve the linear fit. Also, different units are used for medv
, black
, lstat
and nox
. The authors considered these variables:
 Response:
log(1000 * medv)
 Linear predictors:
age
, black / 1000
(this variable corresponds to their \((B - 0.63)^2\)), tax
, ptratio
, crim
, zn
, indus
and chas
.  Nonlinear predictors:
rm^2
, log(dis)
, log(rad)
, log(lstat / 100)
and (10 * nox)^2
.
Do the following:
Check if the model with such predictors corresponds to the one in the first column, Table VII, page 100 of Harrison and Rubinfeld (1978) (open-access paper available here). To do so, save this model as modelHarrison
and summarize it. Hint: the formula should be something like I(log(1000 * medv)) ~ age + I(black / 1000) + ... + I(log(lstat / 100)) + I((10 * nox)^2)
.
Make a stepwise
selection of the variables in modelHarrison
(use defaults) and save it as modelHarrisonSel
. Summarize it.
 Which model has a larger \(R^2\)? And adjusted \(R^2\)? Which is simpler and has more significant coefficients?
Model formulation and estimation by least squares
The multiple linear model extends the simple linear model by describing the relation between the random variables
\(X_1,\ldots,X_k\) and
\(Y\). For example, in the last model for the
wine
dataset, we had
\(k=4\) variables
\(X_1=\)WinterRain
,
\(X_2=\)AGST
,
\(X_3=\)HarvestRain
and
\(X_4=\)Age
, and
\(Y=\) Price
. Therefore, as in Section
2.3, the multiple linear model is
constructed by assuming that the linear relation
\[\begin{align}Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k + \varepsilon \tag{3.1}\end{align}\]holds between the predictors
\(X_1,\ldots,X_k\) and the response
\(Y\). In
(3.1),
\(\beta_0\) is the
intercept and
\(\beta_1,\ldots,\beta_k\) are the
slopes, respectively.
\(\varepsilon\) is a random variable with mean zero and independent from
\(X_1,\ldots,X_k\). Another way of looking at
(3.1) is
\[\begin{align}\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]=\beta_0+\beta_1x_1+\ldots+\beta_kx_k, \tag{3.2}\end{align}\]since \(\mathbb{E}[\varepsilon|X_1=x_1,\ldots,X_k=x_k]=0\).
The LHS of (3.2) is the conditional expectation of \(Y\) given \(X_1,\ldots,X_k\). It represents how the mean of the random variable \(Y\) changes according to particular values, denoted by \(x_1,\ldots,x_k\), of the random variables \(X_1,\ldots,X_k\). With the RHS, what we are saying is that the mean of \(Y\) changes linearly with respect to the values of \(x_1,\ldots,x_k\). Hence the interpretation of the coefficients:
 \(\beta_0\): is the mean of \(Y\) when \(X_1=\ldots=X_k=0\).
 \(\beta_j\), \(1\leq j\leq k\): is the increment in mean of \(Y\) for an increment of one unit in \(X_j=x_j\), provided that the remaining variables \(X_1,\ldots,X_{j1},X_{j+1},\ldots,X_k\) do not change.
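This interpretation can be checked empirically on any fitted model. A sketch with modWine2 and some made-up values of the predictors:

```r
# Increasing AGST by one unit, with the rest fixed, shifts the fitted mean
# of Price by exactly the AGST coefficient
x0 <- data.frame(WinterRain = 600, AGST = 16, HarvestRain = 150, Age = 20)
x1 <- within(x0, AGST <- AGST + 1)
predict(modWine2, newdata = x1) - predict(modWine2, newdata = x0)
coef(modWine2)["AGST"]  # same value
```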
Figure 3.5 illustrates the geometrical interpretation of a multiple linear model: a plane in the \((k+1)\)-dimensional space. If \(k=1\), the plane is the regression line for simple linear regression. If \(k=2\), then the plane can be visualized in a three-dimensional plot.
The estimation of \(\beta_0,\beta_1,\ldots,\beta_k\) is done as in simple linear regression, by minimizing the Residual Sum of Squares (RSS). First we need to introduce some helpful matrix notation. In the following, boldface is used to distinguish vectors and matrices from scalars:
A sample of \((X_1,\ldots,X_k,Y)\) is \((X_{11},\ldots,X_{1k},Y_1),\ldots,(X_{n1},\ldots,X_{nk},Y_n)\), where \(X_{ij}\) denotes the \(i\)th observation of the \(j\)th predictor \(X_j\). We denote by \(\mathbf{X}_i=(X_{i1},\ldots,X_{ik})\) the \(i\)th observation of \((X_1,\ldots,X_k)\), so the sample simplifies to \((\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n)\).
The design matrix contains all the information of the predictors and a column of ones \[\mathbf{X}=\begin{pmatrix}1 & X_{11} & \cdots & X_{1k}\\\vdots & \vdots & \ddots & \vdots\\1 & X_{n1} & \cdots & X_{nk}\end{pmatrix}_{n\times(k+1)}\]
 The vector of responses \(\mathbf{Y}\), the vector of coefficients \(\boldsymbol\beta\) and the vector of errors are, respectively, \[\mathbf{Y}=\begin{pmatrix}Y_1 \\\vdots \\Y_n\end{pmatrix}_{n\times 1},\quad\boldsymbol\beta=\begin{pmatrix}\beta_0 \\\beta_1 \\\vdots \\\beta_k\end{pmatrix}_{(k+1)\times 1}\text{ and }\boldsymbol\varepsilon=\begin{pmatrix}\varepsilon_1 \\\vdots \\\varepsilon_n\end{pmatrix}_{n\times 1}.\] Thanks to the matrix notation, we can turn the sample version of the multiple linear model, namely\[\begin{align*}Y_i&=\beta_0 + \beta_1 X_{i1} + \ldots +\beta_k X_{ik} + \varepsilon_i,\quad i=1,\ldots,n\end{align*}\]into something as compact as\[\begin{align*}\mathbf{Y}=\mathbf{X}\boldsymbol\beta+\boldsymbol\varepsilon.\end{align*}\]
Recall that if \(k=1\) we have the simple linear model. In this case: \[\mathbf{X}=\begin{pmatrix}1 & X_{11}\\\vdots & \vdots\\1 & X_{n1}\end{pmatrix}_{n\times2}\text{ and } \boldsymbol\beta=\begin{pmatrix}\beta_0 \\\beta_1\end{pmatrix}_{2\times 1}.\]
The RSS for the multiple linear regression is \[\begin{align}\text{RSS}(\boldsymbol\beta)&=\sum_{i=1}^n\left(Y_i-\beta_0-\beta_1X_{i1}-\ldots-\beta_kX_{ik}\right)^2\nonumber\\&=(\mathbf{Y}-\mathbf{X}\boldsymbol\beta)^T(\mathbf{Y}-\mathbf{X}\boldsymbol\beta).\tag{3.3}\end{align}\] The RSS aggregates the squared vertical distances from the data to a regression plane given by \(\boldsymbol\beta\). Remember that the vertical distances are considered because we want to minimize the error in the prediction of \(Y\). The least squares estimators are the minimizers of the RSS:
\[\begin{align*}\hat{\boldsymbol{\beta}}=\arg\min_{\boldsymbol{\beta}\in\mathbb{R}^{k+1}} \text{RSS}(\boldsymbol{\beta}).\end{align*}\]Luckily, thanks to the matrix form of
(3.3), it is simple to compute a closed-form expression for the least squares estimates:
\[\begin{align}\hat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}.\tag{3.4}\end{align}\]There are some similarities between
(3.4) and
\(\hat\beta_1=(s_x^2)^{-1}s_{xy}\) from the simple linear model: both are related to the covariance between \(\mathbf{X}\) and \(Y\) weighted by the variance of \(\mathbf{X}\).
The data of the illustration has been generated with the following code:
```r
# Generates 50 points from a N(0, 1): predictors and error
set.seed(34567) # Fixes the seed for the random generator
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + rnorm(50, sd = 0.05) # Make variables dependent
eps <- rnorm(50)

# Responses
yLin <- 0.5 + 0.5 * x1 + 0.5 * x2 + eps
yQua <- 0.5 + x1^2 + 0.5 * x2 + eps
yExp <- 0.5 + 0.5 * exp(x2) + x3 + eps

# Data
leastSquares3D <- data.frame(x1 = x1, x2 = x2, yLin = yLin, yQua = yQua,
                             yExp = yExp)
```
Let’s check that indeed the coefficients given by `lm` are the ones given by equation (3.4) for the regression `yLin ~ x1 + x2`.
```r
# Matrix X
X <- cbind(1, x1, x2)

# Vector Y
Y <- yLin

# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
# %*% multiplies matrices
# solve() computes the inverse of a matrix
# t() transposes a matrix
beta
##         [,1]
##    0.5702694
## x1 0.4832624
## x2 0.3214894

# Output from lm
mod <- lm(yLin ~ x1 + x2, data = leastSquares3D)
mod$coefficients
## (Intercept)          x1          x2 
##   0.5702694   0.4832624   0.3214894
```
Compute \(\hat{\boldsymbol{\beta}}\) for the regressions `yLin ~ x1 + x2`, `yQua ~ x1 + x2` and `yExp ~ x2 + x3` using:
- equation (3.4) and
- the function `lm`.

Check that the fitted plane and the coefficient estimates are coherent.
Once we have the least squares estimates \(\hat{\boldsymbol{\beta}}\), we can define the next two concepts:
 The fitted values \(\hat Y_1,\ldots,\hat Y_n\), where\[\begin{align*}\hat Y_i=\hat\beta_0+\hat\beta_1X_{i1}+\cdots+\hat\beta_kX_{ik},\quad i=1,\ldots,n.\end{align*}\]
They are the vertical projections of \(Y_1,\ldots,Y_n\) onto the fitted plane (see Figure 3.5). In matrix form, plugging in (3.4), \[\hat{\mathbf{Y}}=\mathbf{X}\hat{\boldsymbol{\beta}}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}=\mathbf{H}\mathbf{Y},\] where \(\mathbf{H}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) is called the hat matrix because it “puts the hat on \(\mathbf{Y}\)”. What it does is to project \(\mathbf{Y}\) onto the regression plane (see Figure 3.5).
 The estimated residuals \(\hat \varepsilon_1,\ldots,\hat \varepsilon_n\), where\[\begin{align*}\hat\varepsilon_i=Y_i-\hat Y_i,\quad i=1,\ldots,n.\end{align*}\]
They are the vertical distances between actual data and fitted data.
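These definitions can be checked numerically. The following sketch (with simulated data; all names are illustrative) verifies that the hat matrix reproduces the fitted values and residuals of `lm`, and that it is indeed a projection:

```r
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)

# Design matrix and hat matrix H = X (X'X)^{-1} X'
X <- cbind(1, x1, x2)
H <- X %*% solve(t(X) %*% X) %*% t(X)

# Fitted values and residuals obtained by projecting y
yHat <- H %*% y
epsHat <- y - yHat

# They match the output of lm (up to numerical error)
mod <- lm(y ~ x1 + x2)
max(abs(yHat - mod$fitted.values))
max(abs(epsHat - mod$residuals))

# H is idempotent: projecting twice changes nothing
max(abs(H %*% H - H))
```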
We conclude with an insight on the relation of multiple and simple linear regressions. It is illustrated in Figure 3.6.
Consider the multiple linear model \(Y=\beta_0+\beta_1X_1+\beta_2X_2+\varepsilon\) and its associated simple linear models \(Y=\alpha_0+\alpha_1X_1+\varepsilon\) and \(Y=\gamma_0+\gamma_1X_2+\varepsilon\). Assume that we have a sample \((X_{11},X_{12},Y_1),\ldots,(X_{n1},X_{n2},Y_n)\). Then, in general, \(\hat\alpha_0\neq\hat\beta_0\), \(\hat\alpha_1\neq\hat\beta_1\), \(\hat\gamma_0\neq\hat\beta_0\) and \(\hat\gamma_1\neq\hat\beta_2\). That is, in general, the inclusion of a new predictor changes the coefficient estimates.
The data employed in Figure 3.6 is:
```r
set.seed(212542)
n <- 100
x1 <- rnorm(n, sd = 2)
x2 <- rnorm(n, mean = x1, sd = 3)
y <- 1 + 2 * x1 - x2 + rnorm(n, sd = 1)
data <- data.frame(x1 = x1, x2 = x2, y = y)
```
With the above `data`, check how the fitted coefficients change for `y ~ x1`, `y ~ x2` and `y ~ x1 + x2`.
Assumptions of the model
Some probabilistic assumptions are required for performing inference on the model parameters. In other words, to infer properties about the unknown population coefficients \(\boldsymbol{\beta}\) from the sample \((\mathbf{X}_1, Y_1),\ldots,(\mathbf{X}_n, Y_n)\).
The assumptions of the multiple linear model are an extension of the simple linear model:
 Linearity: \(\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]=\beta_0+\beta_1x_1+\ldots+\beta_kx_k\).
 Homoscedasticity: \(\mathbb{V}\text{ar}(\varepsilon_i)=\sigma^2\), with \(\sigma^2\) constant for \(i=1,\ldots,n\).
 Normality: \(\varepsilon_i\sim\mathcal{N}(0,\sigma^2)\) for \(i=1,\ldots,n\).
 Independence of the errors: \(\varepsilon_1,\ldots,\varepsilon_n\) are independent (or uncorrelated, \(\mathbb{E}[\varepsilon_i\varepsilon_j]=0\), \(i\neq j\), since they are assumed to be Normal).
A good oneline summary of the linear model is the following (independence is assumed)
\[\begin{align}Y|(X_1=x_1,\ldots,X_k=x_k)\sim \mathcal{N}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k,\sigma^2).\tag{3.5}\end{align}\]Recall:
Compared with simple linear regression, the only assumption that differs is linearity.
Nothing is said about the distribution of \(X_1,\ldots,X_k\). They could be deterministic or random. They could be discrete or continuous.
\(X_1,\ldots,X_k\) are not required to be independent between them.
 \(Y\) has to be continuous, since the errors are normal – recall (2.1).
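Simulating from (3.5) makes the one-line summary concrete. A minimal sketch, where `beta` and `sigma` are illustrative choices, not values taken from the text:

```r
set.seed(12345)
n <- 200
beta <- c(1, 2, -1) # beta0, beta1, beta2 (illustrative)
sigma <- 0.5
x1 <- runif(n)
x2 <- runif(n)

# Y | (X1 = x1, X2 = x2) ~ N(beta0 + beta1 * x1 + beta2 * x2, sigma^2)
y <- beta[1] + beta[2] * x1 + beta[3] * x2 + rnorm(n, sd = sigma)

# The least squares fit recovers beta and sigma approximately
mod <- lm(y ~ x1 + x2)
coef(mod)
summary(mod)$sigma
```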
Figure 3.8 represents situations where the assumptions of the model are respected and violated, for the case of two predictors. Clearly, the inspection of the scatterplots for identifying strange patterns is more complicated than in simple linear regression – and here we are dealing only with two predictors. In Section 3.8 we will see more sophisticated methods for checking whether the assumptions hold or not for an arbitrary number of predictors.
To conclude this section, let’s see how to make a 3D scatterplot with the regression plane, in order to evaluate visually how good the fit of the model is. We will do it with the `iris` dataset, which can be imported in `R` simply by running `data(iris)`. In `R Commander`, go to `'Graphs' > '3D Graphs' > '3D scatterplot...'`. A window like Figures 3.9 and 3.10 will pop up. The options are similar to the ones for `'Graphs' > 'Scatterplot...'`.
If you select the options as shown in Figures 3.9 and 3.10, you should get something like this:
```r
data(iris)
# scatter3d() comes with the car package, loaded by R Commander
scatter3d(Petal.Length ~ Petal.Width + Sepal.Length, data = iris,
          fit = "linear", residuals = TRUE, bg = "white", axis.scales = TRUE,
          grid = TRUE, ellipsoid = FALSE, id.method = 'mahal', id.n = 2)
```
Inference for model parameters
The assumptions introduced in the previous section allow us to specify the distribution of the random vector \(\hat{\boldsymbol{\beta}}\). The distribution is derived conditionally on the sample predictors \(\mathbf{X}_1,\ldots,\mathbf{X}_n\). In other words, we assume that the randomness of \(\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol\varepsilon\) comes only from the error terms and not from the predictors. To denote this, we employ lowercase for the sample predictors \(\mathbf{x}_1,\ldots,\mathbf{x}_n\).
Distributions of the fitted coefficients
The distribution of
\(\hat{\boldsymbol{\beta}}\) is:
\[\begin{align}\hat{\boldsymbol{\beta}}\sim\mathcal{N}_{k+1}\left(\boldsymbol\beta,\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right)\tag{3.6}\end{align}\]where \(\mathcal{N}_{m}\) is the \(m\)-dimensional Normal, this is, the extension of the usual Normal distribution to deal with \(m\) random variables. The interpretation of (3.6) is not as easy as in the simple linear case. Here are some broad remarks:
 Bias. The estimates are unbiased.
Variance. Depending on:
 Sample size \(n\). Hidden inside \(\mathbf{X}^T\mathbf{X}\). As \(n\) grows, the precision of the estimators increases.
 Error variance \(\sigma^2\). The larger \(\sigma^2\) is, the less precise \(\hat{\boldsymbol{\beta}}\) is.
 Predictor sparsity \((\mathbf{X}^T\mathbf{X})^{-1}\). The more sparse the predictor is (small \((\mathbf{X}^T\mathbf{X})^{-1}\)), the more precise \(\hat{\boldsymbol{\beta}}\) is.
The problem with
(3.6) is that
\(\sigma^2\) is unknown in practice, so we need to estimate
\(\sigma^2\) from the data. We do so by computing a rescaled sample variance of the fitted residuals
\(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\):
\[\begin{align*}\hat\sigma^2=\frac{\sum_{i=1}^n\hat\varepsilon_i^2}{n-k-1}.\end{align*}\]Note the \(n-k-1\) in the denominator. Now \(n-k-1\) are the degrees of freedom, the number of data points minus the number of already fitted parameters (\(k\) slopes and \(1\) intercept). As in simple linear regression, the mean of the fitted residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\) is zero.
If we use the estimate
\(\hat\sigma^2\) instead of
\(\sigma^2\), we get more useful distributions, this time for the
individual \(\beta_j\)’s:
\[\begin{align}\frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim t_{n-k-1},\quad\hat{\mathrm{SE}}(\hat\beta_j)^2=\hat\sigma^2v_j\tag{3.7}\end{align}\]where \(t_{n-k-1}\) represents the Student’s \(t\) distribution with \(n-k-1\) degrees of freedom and \[v_j\text{ is the $j$th element of the diagonal of }(\mathbf{X}^T\mathbf{X})^{-1}.\] The LHS of (3.7) is the \(t\)-statistic for \(\beta_j\), \(j=0,\ldots,k\). They are employed for building confidence intervals and hypothesis tests.
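Both \(\hat\sigma^2\) and the standard errors in (3.7) can be reproduced by hand. A sketch with simulated data (variable names are illustrative):

```r
set.seed(42)
n <- 100
k <- 2
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 - 0.25 * x2 + rnorm(n)
mod <- lm(y ~ x1 + x2)

# Estimate of sigma^2 with n - k - 1 degrees of freedom
sigma2Hat <- sum(mod$residuals^2) / (n - k - 1)
c(sqrt(sigma2Hat), summary(mod)$sigma) # equal

# SE(beta_j)^2 = sigma2Hat * v_j, with v_j the j-th diagonal entry of (X'X)^{-1}
X <- cbind(1, x1, x2)
seHat <- sqrt(sigma2Hat * diag(solve(t(X) %*% X)))
cbind(seHat, summary(mod)$coefficients[, "Std. Error"]) # equal columns
```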
Confidence intervals for the coefficients
Thanks to
(3.7), we can have the
\(100(1\alpha)\%\) CI for the coefficient
\(\beta_j\),
\(j=0,\ldots,k\):
\[\begin{align}\left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)t_{n-k-1;\alpha/2}\right)\tag{3.8}\end{align}\]where \(t_{n-k-1;\alpha/2}\) is the \(\alpha/2\)-upper quantile of the \(t_{n-k-1}\). Note that with \(k=1\) we have the same CI as in (2.8).
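The CI (3.8) can be computed by hand and compared with `confint`. A sketch with simulated data (names are illustrative):

```r
set.seed(7)
n <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- -1 + x1 + 2 * x2 + rnorm(n)
mod <- lm(y ~ x1 + x2)

# Ingredients of (3.8): estimates, standard errors and t quantile
betaHat <- summary(mod)$coefficients[, "Estimate"]
seHat <- summary(mod)$coefficients[, "Std. Error"]
tQuant <- qt(0.975, df = n - 2 - 1) # alpha = 0.05, k = 2

ciManual <- cbind(betaHat - tQuant * seHat, betaHat + tQuant * seHat)
ciManual
confint(mod) # same intervals
```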
Let’s see how we can compute the CIs. We return to the `wine` dataset, so in case you do not have it loaded, you can download it here as an `.RData` file. We analyse the CIs for the coefficients of `Price ~ Age + WinterRain`.
```r
# Fit model
mod <- lm(Price ~ Age + WinterRain, data = wine)

# Confidence intervals at 95%
confint(mod)
##                    2.5 %      97.5 %
## (Intercept)  4.746010626 7.220074676
## Age          0.007702664 0.064409106
## WinterRain  -0.001030725 0.002593278

# Confidence intervals at other levels
confint(mod, level = 0.90)
##                      5 %        95 %
## (Intercept)  4.9575969417 7.008488360
## Age          0.0125522989 0.059559471
## WinterRain  -0.0007207941 0.002283347
confint(mod, level = 0.99)
##                    0.5 %      99.5 %
## (Intercept)  4.306650310 7.659434991
## Age         -0.002367633 0.074479403
## WinterRain  -0.001674299 0.003236852
```
In this example, the 95% confidence interval for \(\beta_0\) is \((4.7460, 7.2201)\), for \(\beta_1\) is \((0.0077, 0.0644)\) and for \(\beta_2\) is \((-0.0010, 0.0026)\). Therefore, we can say with a 95% confidence that the coefficient of `WinterRain` is not significant. But in Section 3.1.1 we saw that it was significant in the model `Price ~ Age + AGST + HarvestRain + WinterRain`! How is this possible? The answer is that the presence of extra predictors affects the coefficient estimate, as we saw in Figure 3.6. Therefore, the precise statement to make is: in the model `Price ~ Age + WinterRain`, with \(\alpha=0.05\), the coefficient of `WinterRain` is not significant. Note that this does not mean that it will always be non-significant: in `Price ~ Age + AGST + HarvestRain + WinterRain` it is.
Compute and interpret the CIs for the coefficients, at levels \(\alpha = 0.10, 0.05, 0.01\), for the following regressions:
- `medv ~ . - lstat - chas - zn - crim` (`Boston`)
- `nox ~ chas + zn + indus + lstat + dis + rad` (`Boston`)
- `Price ~ WinterRain + HarvestRain + AGST` (`wine`)
- `AGST ~ Year + FrancePop` (`wine`)
Testing on the coefficients
The distributions in (3.7) also allow us to conduct formal hypothesis tests on the coefficients \(\beta_j\), \(j=0,\ldots,k\). For example, the test for significance is especially important:
\[\begin{align*}H_0:\beta_j=0\end{align*}\]for \(j=0,\ldots,k\). The test of \(H_0:\beta_j=0\) with \(1\leq j\leq k\) is especially interesting, since it allows us to answer whether the variable \(X_j\) has a significant linear effect on \(Y\). The statistic used for testing for significance is the \(t\)-statistic
\[\begin{align*}\frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)},\end{align*}\]which is distributed as a \(t_{n-k-1}\) under the (veracity of the) null hypothesis. \(H_0\) is tested against the bilateral alternative hypothesis \(H_1:\beta_j\neq 0\).
Remember two important insights regarding hypothesis testing.
In a hypothesis test, the \(p\)-value measures the degree of veracity of \(H_0\) according to the data. The rule of thumb is the following:
Is the \(p\)-value lower than \(\alpha\)?
- Yes \(\rightarrow\) reject \(H_0\).
- No \(\rightarrow\) do not reject \(H_0\).

The connection between the \(t\)-test for \(H_0:\beta_j=0\) and the CI for \(\beta_j\), both at level \(\alpha\), is the following:
Is \(0\) inside the CI for \(\beta_j\)?
- Yes \(\leftrightarrow\) do not reject \(H_0\).
- No \(\leftrightarrow\) reject \(H_0\).
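The \(t\)-statistics and two-sided \(p\)-values returned by `summary` can be reproduced with `pt`. A sketch with simulated data (names are illustrative):

```r
set.seed(5)
n <- 80
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 - x1 + 0.1 * x2 + rnorm(n)
mod <- lm(y ~ x1 + x2)
coefs <- summary(mod)$coefficients

# t-statistic and two-sided p-value for H0: beta_j = 0
tStat <- coefs[, "Estimate"] / coefs[, "Std. Error"]
pVal <- 2 * pt(abs(tStat), df = n - 2 - 1, lower.tail = FALSE)
cbind(tStat, coefs[, "t value"], pVal, coefs[, "Pr(>|t|)"]) # columns agree pairwise
```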
The tests for significance are built into the `summary` function, as we saw in Section 3. For `mod`, the regression of `Price ~ Age + WinterRain`, we have:
```r
summary(mod)
## 
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88964 -0.51421  0.00066  0.43103  1.06897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.9830427  0.5993667   9.982 5.09e-10 ***
## Age         0.0360559  0.0137377   2.625   0.0149 *  
## WinterRain  0.0007813  0.0008780   0.890   0.3824    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared:  0.2371, Adjusted R-squared:  0.1736 
## F-statistic:  3.73 on 2 and 24 DF,  p-value: 0.03884
```
The unilateral test \(H_0:\beta_j\geq 0\) (respectively, \(H_0:\beta_j\leq 0\)) vs \(H_1:\beta_j<0\) (\(H_1:\beta_j>0\)) can be done by means of the CI for \(\beta_j\). If \(H_0\) is rejected, we can conclude that \(\hat\beta_j\) is significantly negative (positive) and that, for the considered regression model, \(X_j\) has a significant negative (positive) effect on \(Y\). We have been doing these tests using the following rule of thumb:
Is the CI for \(\beta_j\) below (above) \(0\) at level \(\alpha\)?
- Yes \(\rightarrow\) reject \(H_0\) at level \(\alpha\). Conclude that \(X_j\) has a significant negative (positive) effect on \(Y\) at level \(\alpha\).
- No \(\rightarrow\) the criterion is not conclusive.
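A unilateral \(p\)-value can also be obtained directly from the \(t\)-statistic. A sketch with simulated data in which \(\beta_1\) is truly negative (names are illustrative):

```r
set.seed(17)
n <- 90
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 - 0.8 * x1 + x2 + rnorm(n)
mod <- lm(y ~ x1 + x2)
tStat <- summary(mod)$coefficients["x1", "t value"]

# One-sided p-value for H0: beta_1 >= 0 vs H1: beta_1 < 0
pOneSided <- pt(tStat, df = n - 2 - 1)
pOneSided # small: x1 has a significantly negative effect on y

# Coherent with the CI criterion: the CI for beta_1 lies below 0
confint(mod)["x1", ]
```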
Prediction
As in the simple linear model, the forecast of \(Y\) from \(\mathbf{X}=\mathbf{x}\) (that is, \(X_1=x_1,\ldots,X_k=x_k\)) can be approached in two different ways:
- Inference on the conditional mean of \(Y\) given \(\mathbf{X}=\mathbf{x}\), \(\mathbb{E}[Y|\mathbf{X}=\mathbf{x}]\). This is a deterministic quantity, which equals \(\beta_0+\beta_1x_1+\ldots+\beta_{k}x_k\).
- Prediction of the conditional response \(Y|\mathbf{X}=\mathbf{x}\). This is a random variable distributed as \(\mathcal{N}(\beta_0+\beta_1x_1+\ldots+\beta_{k}x_k,\sigma^2)\).
The prediction and computation of CIs can be done with the `R` function `predict` (unfortunately, there is no `R Commander` shortcut for this one). The objects required for `predict` are: first, the output of `lm`; second, a `data.frame` containing the locations \(\mathbf{x}=(x_1,\ldots,x_k)\) where we want to predict \(\beta_0+\beta_1x_1+\ldots+\beta_{k}x_k\). The prediction is \(\hat\beta_0+\hat\beta_1x_1+\ldots+\hat\beta_{k}x_k\).
It is mandatory to name the columns of the data frame with the same names as the predictors used in `lm`. Otherwise `predict` will generate an error; see below.
To illustrate the use of `predict`, we return to the `wine` dataset.
```r
# Fit a linear model for the price on WinterRain, HarvestRain and AGST
modelW <- lm(Price ~ WinterRain + HarvestRain + AGST, data = wine)
summary(modelW)
## 
## Call:
## lm(formula = Price ~ WinterRain + HarvestRain + AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62816 -0.17923  0.02274  0.21990  0.62859 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.9506001  1.9694011  -2.514  0.01940 *  
## WinterRain   0.0012820  0.0005765   2.224  0.03628 *  
## HarvestRain -0.0036242  0.0009646  -3.757  0.00103 ** 
## AGST         0.7123192  0.1087676   6.549 1.11e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3436 on 23 degrees of freedom
## Multiple R-squared:  0.7407, Adjusted R-squared:  0.7069 
## F-statistic:  21.9 on 3 and 23 DF,  p-value: 6.246e-07

# Data for which we want a prediction
# Important! You have to name the column with the predictor name!
weather <- data.frame(WinterRain = 500, HarvestRain = 123, AGST = 18)

## Prediction of the mean

# Prediction of the mean at 95% - the default
predict(modelW, newdata = weather)
##        1 
## 8.066342

# Prediction of the mean with 95% confidence interval (the default)
# CI: (lwr, upr)
predict(modelW, newdata = weather, interval = "confidence")
##        fit      lwr      upr
## 1 8.066342 7.714178 8.418507
predict(modelW, newdata = weather, interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 8.066342 7.714178 8.418507

# Other levels
predict(modelW, newdata = weather, interval = "confidence", level = 0.90)
##        fit      lwr      upr
## 1 8.066342 7.774576 8.358108
predict(modelW, newdata = weather, interval = "confidence", level = 0.99)
##        fit      lwr      upr
## 1 8.066342 7.588427 8.544258

## Prediction of the response

# Prediction of the response at 95% - the default
predict(modelW, newdata = weather)
##        1 
## 8.066342

# Prediction of the response with 95% prediction interval (the default)
# PI: (lwr, upr)
predict(modelW, newdata = weather, interval = "prediction")
##        fit      lwr      upr
## 1 8.066342 7.273176 8.859508
predict(modelW, newdata = weather, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 8.066342 7.273176 8.859508

# Other levels
predict(modelW, newdata = weather, interval = "prediction", level = 0.90)
##        fit      lwr      upr
## 1 8.066342 7.409208 8.723476
predict(modelW, newdata = weather, interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 8.066342 6.989951 9.142733

# Predictions for several values
weather2 <- data.frame(WinterRain = c(500, 200), HarvestRain = c(123, 200),
                       AGST = c(17, 18))
predict(modelW, newdata = weather2, interval = "prediction")
##        fit      lwr      upr
## 1 7.354023 6.613835 8.094211
## 2 7.402691 6.533945 8.271437
```
For the `wine` dataset, do the following:
- Regress `WinterRain` on `HarvestRain` and `AGST`. Name the fitted model `modExercise`.
- Compute the estimate for the conditional mean of `WinterRain` for `HarvestRain` \(= 123.0\) and `AGST` \(= 16.15\). What is the CI at \(\alpha = 0.01\)?
- Compute the estimate for the conditional response for `HarvestRain` \(= 125.0\) and `AGST` \(= 15\). What is the CI at \(\alpha = 0.10\)?
- Check that `modExercise$fitted.values` is the same as `predict(modExercise, newdata = data.frame(HarvestRain = wine$HarvestRain, AGST = wine$AGST))`. Why is that so?
Similarities and differences in the prediction of the conditional mean \(\mathbb{E}[Y|\mathbf{X}=\mathbf{x}]\) and the conditional response \(Y|\mathbf{X}=\mathbf{x}\):
- Similarities. The estimate is the same, \(\hat y=\hat\beta_0+\hat\beta_1x_1+\ldots+\hat\beta_kx_k\). Both CIs are centered at \(\hat y\).
- Differences. \(\mathbb{E}[Y|\mathbf{X}=\mathbf{x}]\) is deterministic and \(Y|\mathbf{X}=\mathbf{x}\) is random. Therefore, the variance is larger for the prediction of \(Y|\mathbf{X}=\mathbf{x}\) than for the prediction of \(\mathbb{E}[Y|\mathbf{X}=\mathbf{x}]\).
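These similarities and differences can be seen numerically. A sketch with simulated data (names are illustrative):

```r
set.seed(3)
n <- 70
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + x1 + x2 + rnorm(n)
mod <- lm(y ~ x1 + x2)
newData <- data.frame(x1 = 0.5, x2 = -0.5)

# Confidence interval for the conditional mean vs prediction interval
ci <- predict(mod, newdata = newData, interval = "confidence")
pi <- predict(mod, newdata = newData, interval = "prediction")
ci
pi
# Same center ("fit"), but the prediction interval is wider
```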
ANOVA and model fit
ANOVA
The ANOVA decomposition for multiple linear regression is quite analogous to the one in simple linear regression. The ANOVA decomposes the variance of \(Y\) into two parts, each one corresponding to the regression and to the error, respectively. Since the difference between simple and multiple linear regression is the number of predictors – the response \(Y\) is unique in both cases – the ANOVA decompositions are highly similar, as we will see.
As in simple linear regression, the mean of the fitted values \(\hat Y_1,\ldots,\hat Y_n\) is the mean of \(Y_1,\ldots, Y_n\). This is an important result that can be checked using the matrix notation. The ANOVA decomposition considers the following measures of variation related to the response:
 \(\text{SST}=\sum_{i=1}^n\left(Y_i-\bar Y\right)^2\), the total sum of squares. This is the total variation of \(Y_1,\ldots,Y_n\), since \(\text{SST}=ns_y^2\), where \(s_y^2\) is the sample variance of \(Y_1,\ldots,Y_n\).
 \(\text{SSR}=\sum_{i=1}^n\left(\hat Y_i-\bar Y\right)^2\), the regression sum of squares. This is the variation explained by the regression plane, that is, the variation from \(\bar Y\) that is explained by the estimated conditional mean \(\hat Y_i=\hat\beta_0+\hat\beta_1X_{i1}+\ldots+\hat\beta_kX_{ik}\). \(\text{SSR}=ns_{\hat y}^2\), where \(s_{\hat y}^2\) is the sample variance of \(\hat Y_1,\ldots,\hat Y_n\).
 \(\text{SSE}=\sum_{i=1}^n\left(Y_i-\hat Y_i\right)^2\), the sum of squared errors. This is the variation around the conditional mean. Recall that \(\text{SSE}=\sum_{i=1}^n \hat\varepsilon_i^2=(n-k-1)\hat\sigma^2\), where \(\hat\sigma^2\) is the rescaled sample variance of \(\hat \varepsilon_1,\ldots,\hat \varepsilon_n\).
The ANOVA decomposition is exactly the same as in simple linear regression:
\[\begin{align}\underbrace{\text{SST}}_{\text{Variation of }Y_i's} = \underbrace{\text{SSR}}_{\text{Variation of }\hat Y_i's} + \underbrace{\text{SSE}}_{\text{Variation of }\hat \varepsilon_i's} \tag{3.9}\end{align}\]or, equivalently (dividing by
\(n\) in
(3.9)),
\[\begin{align*}\underbrace{s_y^2}_{\text{Variance of $Y_i$'s}} = \underbrace{s_{\hat y}^2}_{\text{Variance of $\hat Y_i$'s}} + \underbrace{(n-k-1)/n\times\hat\sigma^2}_{\text{Variance of $\hat\varepsilon_i$'s}}.\end{align*}\]Notice the \(n-k-1\) instead of simple linear regression’s \(n-2\), which is the main change. The graphical interpretation of (3.9) when \(k=2\) is shown in Figures 3.11 and 3.12.
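The decomposition (3.9) can be verified numerically. A sketch with simulated data (names are illustrative):

```r
set.seed(11)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + x1 - x2 + rnorm(n)
mod <- lm(y ~ x1 + x2)

# The three sums of squares of the ANOVA decomposition
SST <- sum((y - mean(y))^2)
SSR <- sum((mod$fitted.values - mean(y))^2)
SSE <- sum(mod$residuals^2)
SST - (SSR + SSE) # practically zero
```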
The ANOVA table summarizes the decomposition of the variance:

|  | Degrees of freedom | Sum of squares | Mean squares | \(F\)-value | \(p\)-value |
|---|---|---|---|---|---|
| Predictors | \(k\) | SSR | \(\frac{\text{SSR}}{k}\) | \(\frac{\text{SSR}/k}{\text{SSE}/(n-k-1)}\) | \(p\) |
| Residuals | \(n-k-1\) | SSE | \(\frac{\text{SSE}}{n-k-1}\) |  |  |
The “\(F\)-value” of the ANOVA table represents the value of the \(F\)-statistic \(\frac{\text{SSR}/k}{\text{SSE}/(n-k-1)}\). This statistic is employed to test
\[\begin{align*}H_0:\beta_1=\ldots=\beta_k=0\quad\text{vs.}\quad H_1:\beta_j\neq 0\text{ for some $j$},\end{align*}\]that is, the hypothesis of no linear dependence of \(Y\) on \(X_1,\ldots,X_k\) (the plane is completely flat, with no inclination). If \(H_0\) is rejected, it means that at least one \(\beta_j\) is significantly different from zero. It happens that
\[\begin{align*}F=\frac{\text{SSR}/k}{\text{SSE}/(n-k-1)}\stackrel{H_0}{\sim} F_{k,n-k-1},\end{align*}\]where \(F_{k,n-k-1}\) is the Snedecor’s \(F\) distribution with \(k\) and \(n-k-1\) degrees of freedom. If \(H_0\) is true, then \(F\) is expected to be small, since SSR will be close to zero (little variation is explained by the regression model since \(\hat{\boldsymbol{\beta}}\approx\mathbf{0}\)). The \(p\)-value of this test is not the same as the \(p\)-value of the \(t\)-test for \(H_0:\beta_1=0\); they only coincide in simple linear regression, where \(k=1\)!
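The \(F\)-statistic and its \(p\)-value can be computed by hand and compared with the output of `summary`. A sketch with simulated data (names are illustrative):

```r
set.seed(23)
n <- 100
k <- 2
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 - 0.5 * x2 + rnorm(n)
mod <- lm(y ~ x1 + x2)

# F-statistic from the sums of squares
SSR <- sum((mod$fitted.values - mean(y))^2)
SSE <- sum(mod$residuals^2)
Fval <- (SSR / k) / (SSE / (n - k - 1))
pval <- pf(Fval, df1 = k, df2 = n - k - 1, lower.tail = FALSE)
c(Fval, pval)
summary(mod)$fstatistic # same F-value and degrees of freedom
```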
The “ANOVA table” is a broad concept in statistics, with different variants. Here we are only covering the basic ANOVA table from the relation SST = SSR + SSE. However, further sophistications are possible when SSR is decomposed into the variations contributed by each predictor. In particular, for multiple linear regression R
’s anova
implements a sequential (type I) ANOVA table, which is not the previous table!
The anova
function in R
takes a model as an input and returns the following sequential ANOVA table:
|  | Degrees of freedom | Sum of squares | Mean squares | \(F\)-value | \(p\)-value |
|---|---|---|---|---|---|
| Predictor 1 | \(1\) | SSR\(_1\) | \(\frac{\text{SSR}_1}{1}\) | \(\frac{\text{SSR}_1/1}{\text{SSE}/(n-k-1)}\) | \(p_1\) |
| Predictor 2 | \(1\) | SSR\(_2\) | \(\frac{\text{SSR}_2}{1}\) | \(\frac{\text{SSR}_2/1}{\text{SSE}/(n-k-1)}\) | \(p_2\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| Predictor \(k\) | \(1\) | SSR\(_k\) | \(\frac{\text{SSR}_k}{1}\) | \(\frac{\text{SSR}_k/1}{\text{SSE}/(n-k-1)}\) | \(p_k\) |
| Residuals | \(n-k-1\) | SSE | \(\frac{\text{SSE}}{n-k-1}\) |  |  |
Here SSR\(_j\) represents the regression sum of squares associated to the inclusion of \(X_j\) in the model with predictors \(X_1,\ldots,X_{j-1}\), that is: \[\text{SSR}_j=\text{SSR}(X_1,\ldots,X_j)-\text{SSR}(X_1,\ldots,X_{j-1}).\] The \(p\)-values \(p_1,\ldots,p_k\) correspond to the testing of the hypotheses
\[\begin{align*}H_0:\beta_j=0\quad\text{vs.}\quad H_1:\beta_j\neq 0,\end{align*}\]carried out inside the linear model \(Y=\beta_0+\beta_1X_1+\ldots+\beta_jX_j+\varepsilon\). This is like the \(t\)-test for \(\beta_j\) in the model with predictors \(X_1,\ldots,X_j\).
Let’s see how we can compute both ANOVA tables in `R`. The sequential table is simple to obtain: use `anova`. We illustrate it with the `Boston` dataset.
```r
# Load data
library(MASS)
data(Boston)

# Fit a linear model
model <- lm(medv ~ crim + lstat + zn + nox, data = Boston)
summary(model)
## 
## Call:
## lm(formula = medv ~ crim + lstat + zn + nox, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.972  -3.956  -1.344   2.148  25.076 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.93462    1.72076  17.977   <2e-16 ***
## crim        -0.08297    0.03677  -2.257   0.0245 *  
## lstat       -0.90940    0.05040 -18.044   <2e-16 ***
## zn           0.03493    0.01395   2.504   0.0126 *  
## nox         -5.42234    3.24241  -1.672   0.0951 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.169 on 501 degrees of freedom
## Multiple R-squared:  0.5537, Adjusted R-squared:  0.5502 
## F-statistic: 155.4 on 4 and 501 DF,  p-value: < 2.2e-16

# ANOVA table with sequential test
anova(model)
## Analysis of Variance Table
## 
## Response: medv
##            Df  Sum Sq Mean Sq  F value  Pr(>F)    
## crim        1  6440.8  6440.8 169.2694 < 2e-16 ***
## lstat       1 16950.1 16950.1 445.4628 < 2e-16 ***
## zn          1   155.7   155.7   4.0929 0.04360 *  
## nox         1   106.4   106.4   2.7967 0.09509 .  
## Residuals 501 19063.3    38.1                     
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The last p-value is the one of the last t-test
```
In order to compute the simplified ANOVA table, we need to rely on an ad-hoc function. The function takes a fitted `lm` object as input:

```r
# This function computes the simplified ANOVA table from a linear model
simpleAnova <- function(object, ...) {

  # Compute anova table
  tab <- anova(object, ...)

  # Obtain number of predictors
  p <- nrow(tab) - 1

  # Add predictors row
  predictorsRow <- colSums(tab[1:p, 1:2])
  predictorsRow <- c(predictorsRow, predictorsRow[2] / predictorsRow[1])

  # F-quantities
  Fval <- predictorsRow[3] / tab[p + 1, 3]
  pval <- pf(Fval, df1 = p, df2 = tab$Df[p + 1], lower.tail = FALSE)
  predictorsRow <- c(predictorsRow, Fval, pval)

  # Simplified table
  tab <- rbind(predictorsRow, tab[p + 1, ])
  row.names(tab)[1] <- "Predictors"
  return(tab)

}

# Simplified ANOVA
simpleAnova(model)
## Analysis of Variance Table
## 
## Response: medv
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## Predictors   4  23653  5913.3  155.41 < 2.2e-16 ***
## Residuals  501  19063    38.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Recall that the \(F\)statistic, its \(p\)value and the degrees of freedom are also given in the output of summary
.
Compute the ANOVA table for the regression `Price ~ WinterRain + AGST + HarvestRain + Age` in the `wine` dataset. Check that the \(p\)-value for the \(F\)-test given by `summary` and by `simpleAnova` are the same.
The \(R^2\)
The coefficient of determination \(R^2\) is defined as in simple linear regression:
\[\begin{align*}R^2=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SSR}}{\text{SSR}+\text{SSE}}=\frac{\text{SSR}}{\text{SSR}+(n-k-1)\hat\sigma^2}.\end{align*}\]\(R^2\) measures the proportion of variation of the response variable \(Y\) that is explained by the predictors \(X_1,\ldots,X_k\) through the regression. Intuitively, \(R^2\) measures the tightness of the data cloud around the regression plane. Check in Figure 3.12 how changing the value of \(\sigma^2\) (not \(\hat\sigma^2\), although \(\hat\sigma^2\) obviously depends on \(\sigma^2\)) affects the \(R^2\). Also, as we saw in Section 2.7, \(R^2=r^2_{y\hat y}\), that is, \(R^2\) is the square of the sample correlation coefficient between \(Y_1,\ldots,Y_n\) and \(\hat Y_1,\ldots,\hat Y_n\).
Blindly trusting the \(R^2\) can lead to catastrophic conclusions in model selection. Here is a counterexample of a multiple regression where the \(R^2\) is apparently large but the assumptions discussed in Section 3.3 are clearly not satisfied.
```r
# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x1 <- seq(0.15, 1, l = 100)
set.seed(123456)
x2 <- runif(100, -3, 3)
eps <- rnorm(n = 100, sd = 0.25 * x1^2)
y <- 1 - 3 * x1 * (1 + 0.25 * sin(4 * pi * x1)) + 0.25 * cos(x2) + eps

# Great R^2!?
reg <- lm(y ~ x1 + x2)
summary(reg)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78737 -0.20946  0.01031  0.19652  1.05351 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.788812   0.096418   8.181  1.1e-12 ***
## x1          -2.540073   0.154876 -16.401  < 2e-16 ***
## x2           0.002283   0.020954   0.109    0.913    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3754 on 97 degrees of freedom
## Multiple R-squared:  0.744,  Adjusted R-squared:  0.7388 
## F-statistic:   141 on 2 and 97 DF,  p-value: < 2.2e-16

# But prediction is obviously problematic
scatter3d(y ~ x1 + x2, fit = "linear")
```
Remember that:
- \(R^2\) does not measure the correctness of a linear model but its usefulness, assuming the model is correct.
- \(R^2\) is the proportion of variance of \(Y\) explained by \(X_1,\ldots,X_k\), but, of course, only when the linear model is correct.
We finalize by pointing out a nice connection between the \(R^2\), the ANOVA decomposition and the least squares estimator \(\hat{\boldsymbol{\beta}}\):
The ANOVA decomposition gives another interpretation of the least-squares estimates: \(\hat{\boldsymbol{\beta}}\) are the estimated coefficients that maximize the \(R^2\) (among all the possible estimates we could think of). To see this, recall that
\[R^2=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SST}-\text{SSE}}{\text{SST}}=\frac{\text{SST}-\text{RSS}(\hat{\boldsymbol{\beta}})}{\text{SST}},\]
so if \(\text{RSS}(\hat{\boldsymbol{\beta}})=\min_{\boldsymbol{\beta}\in\mathbb{R}^{k+1}}\text{RSS}(\boldsymbol{\beta})\), then \(R^2\) is maximal for \(\hat{\boldsymbol{\beta}}\)!
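Both facts – \(R^2=r^2_{y\hat y}\) and the maximality of \(R^2\) at \(\hat{\boldsymbol{\beta}}\) – can be checked numerically. A sketch with simulated data; the perturbation of the coefficients is arbitrary:

```r
set.seed(21)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + x1 + 0.5 * x2 + rnorm(n)
mod <- lm(y ~ x1 + x2)

# R^2 equals the squared correlation between response and fitted values
c(summary(mod)$r.squared, cor(y, mod$fitted.values)^2)

# Any other coefficients give a larger RSS, hence a smaller R^2
X <- cbind(1, x1, x2)
perturbed <- coef(mod) + c(0.1, -0.1, 0.1)
sum((y - X %*% coef(mod))^2) < sum((y - X %*% perturbed)^2) # TRUE
```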
The \(R^2_{\text{Adj}}\)
As we saw, these are equivalent forms for \(R^2\):
\[\begin{align}R^2&=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SST}-\text{SSE}}{\text{SST}}=1-\frac{\text{SSE}}{\text{SST}}\nonumber\\&=1-\frac{\hat\sigma^2}{\text{SST}}\times(n-k-1).\tag{3.10}\end{align}\]The SSE always decreases as more predictors are added to the model, even if they are not significant. As a consequence, the \(R^2\) always increases with \(k\). Why is that so? Intuitively, because the complexity – hence the flexibility – of the model augments when we use more predictors to explain \(Y\). Mathematically, because as \(k\) grows, \(\hat\sigma^2\) decreases, so the second term in (3.10) shrinks and, as a consequence, \(R^2\) grows.
The adjusted \(R^2\) is an important quantity specifically designed to correct this flaw of the \(R^2\), which is ubiquitous in multiple linear regression. The purpose is to have a better tool for comparing models without systematically favouring more complex models. This alternative coefficient is defined as
\[\begin{align}R^2_{\text{Adj}}&=1-\frac{\text{SSE}/(n-k-1)}{\text{SST}/(n-1)}=1-\frac{\text{SSE}}{\text{SST}}\times\frac{n-1}{n-k-1}\nonumber\\&=1-\frac{\hat\sigma^2}{\text{SST}}\times (n-1).\tag{3.11}\end{align}\]The \(R^2_{\text{Adj}}\) is independent of \(k\), at least explicitly. If \(k=1\), then \(R^2_{\text{Adj}}\) is almost \(R^2\) (practically identical if \(n\) is large). Both (3.10) and (3.11) are quite similar except for the last factor, which in the latter does not depend on \(k\). Therefore, (3.11) will only increase if \(\hat\sigma^2\) is reduced with \(k\); in other words, only if the new variables contribute to reducing the variability around the regression plane.
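Formula (3.11) is easy to compute by hand. A sketch, again with the iris dataset as a stand-in for any dataset:

```r
# Adjusted R^2 from (3.11) vs. the one reported by summary()
mod <- lm(Petal.Length ~ Sepal.Width + Petal.Width, data = iris)
n <- nrow(iris)
k <- 2 # Number of predictors
SSE <- sum(residuals(mod)^2)
SST <- sum((iris$Petal.Length - mean(iris$Petal.Length))^2)
1 - (SSE / (n - k - 1)) / (SST / (n - 1)) # Adjusted R^2 from (3.11)
summary(mod)$adj.r.squared                # Same value
```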
The different behavior of \(R^2\) and \(R^2_\text{Adj}\) can be visualized with a small simulation. Suppose that we generate a random dataset with \(n=200\) observations of a response \(Y\) and two predictors \(X_1,X_2\). That is, the sample \(\{(X_{i1},X_{i2},Y_i)\}_{i=1}^n\) with \[Y_i=\beta_0+\beta_1X_{i1}+\beta_2X_{i2}+\varepsilon_i,\quad \varepsilon_i\sim\mathcal{N}(0,1).\] To this data, we add \(196\) garbage predictors that are completely independent of \(Y\). Therefore, we end up with \(k=198\) predictors. Now we compute the \(R^2(j)\) and \(R^2_\text{Adj}(j)\) for the models \[Y=\beta_0+\beta_1X_{1}+\ldots+\beta_jX_{j}+\varepsilon,\] with \(j=1,\ldots,k\), and we plot them as the curves \((j,R^2(j))\) and \((j,R_\text{Adj}^2(j))\). Since \(R^2\) and \(R^2_\text{Adj}\) are random variables, we repeat the procedure \(100\) times to have a measure of their variability.
Figure 3.13 contains the results of this experiment. As you can see, \(R^2\) increases linearly with the number of predictors considered, although only the first two are relevant! In contrast, \(R^2_\text{Adj}\) only increases for the first two variables and then stays flat on average, but it has a huge variability when \(k\) approaches \(n-2\). This is a consequence of the explosive variance of \(\hat\sigma^2\) in that degenerate case (as we will see in Section 3.7). The experiment evidences that \(R^2_\text{Adj}\) is more adequate than the \(R^2\) for evaluating the fit of a multiple linear regression.
An example of a simulated dataset considered in the experiment of Figure 3.13:
# Generate data
k <- 198
n <- 200
set.seed(3456732)
beta <- c(0.5, 0.5, rep(0, k - 2))
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
Y <- drop(X %*% beta + rnorm(n, sd = 3))
data <- data.frame(y = Y, x = X)

# Regression on the two meaningful predictors
summary(lm(y ~ x.1 + x.2, data = data))

# Adding 20 garbage variables
summary(lm(y ~ X[, 1:22], data = data))
The R_{Adj}^{2} no longer measures the proportion of variation of Y explained by the regression, but the result of correcting this proportion by the number of predictors employed. As a consequence of this, R_{Adj}^{2} ≤ 1 but it can be negative!
The next code illustrates a situation where we have two predictors completely independent from the response. The fitted model has a negative \(R^2_\text{Adj}\).
# Three independent variables
set.seed(234599)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + rnorm(100)

# Negative adjusted R^2
summary(lm(y ~ x1 + x2))
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5081 -0.5021 -0.0191  0.5286  2.4750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.97024    0.10399   9.330 3.75e-15 ***
## x1           0.09003    0.10300   0.874    0.384    
## x2           0.05253    0.11090   0.474    0.637    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.034 on 97 degrees of freedom
## Multiple R-squared:  0.009797,   Adjusted R-squared:  -0.01062 
## F-statistic: 0.4799 on 2 and 97 DF,  p-value: 0.6203
Construct more predictors (x3, x4, …) by sampling 100 points from a normal (rnorm(100)). Check that when these predictors are added to the model, the R_{Adj}^{2} decreases and the R^{2} increases.
Model selection
In Section 3.1.1 we briefly saw that the inclusion of more predictors is not for free: there is a price to pay in terms of more variability in the coefficients. Indeed, there is a maximum number of predictors \(k\) that can be considered in a linear model for a sample size \(n\): \(k\leq n-2\). Or, equivalently, there is a minimum sample size \(n\) required for fitting a model with \(k\) predictors: \(n\geq k + 2\).
The interpretation of this fact is simple if we think on the geometry for \(k=1\) and \(k=2\):
- If \(k=1\), we need at least \(n=2\) points to uniquely fit a line. However, this line gives no information on the vertical variation around it, and hence \(\hat\sigma^2\) cannot be estimated (applying its formula would give the indeterminate quotient \(\frac{0}{0}\)). Therefore we need at least \(n=3\) points, or in other words, \(n\geq k + 2=3\).
- If \(k=2\), we need at least \(n=3\) points to uniquely fit a plane. But this plane gives no information on the variation of the data around it, and hence \(\hat\sigma^2\) cannot be estimated. Therefore we need \(n\geq k + 2=4\).
Another interpretation is the following:
The fitting of a linear model with \(k\) predictors involves the estimation of the \(k+2\) parameters \((\boldsymbol{\beta},\sigma^2)\) from \(n\) data points. The closer \(k+2\) and \(n\) are, the more variable the estimates \((\hat{\boldsymbol{\beta}},\hat\sigma^2)\) will be, since less information is available for computing each one. In the limit case \(n=k+2\), each sample point determines a parameter estimate.
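The degenerate case in the first bullet above can be reproduced directly. In this toy sketch (data made up for the illustration), a line is fitted through exactly \(n=2\) points: the fit is perfect, there are zero residual degrees of freedom, and \(\hat\sigma\) is not estimable:

```r
# A line through exactly two points: perfect fit, but sigma cannot be estimated
x <- c(0, 1)
y <- c(1, 3)
mod <- lm(y ~ x)
df.residual(mod)   # 0 degrees of freedom left for the residuals
summary(mod)$sigma # NaN: the 0/0 indeterminate in the formula of sigma^2
```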
The degrees of freedom \(n-k-1\) quantify the increase in the variability of \((\hat{\boldsymbol{\beta}},\hat\sigma^2)\) as \(n-k-1\) decreases. For example:
- \(t_{n-k-1;\alpha/2}\) appears in (3.7) and influences the length of the CIs for \(\beta_j\), see (3.8). It also influences the length of the CIs for the prediction. As Figure 3.14 shows, when the degrees of freedom decrease, \(t_{n-k-1;\alpha/2}\) increases, thus the intervals become wider.
- \(\hat\sigma^2=\frac{1}{n-k-1}\sum_{i=1}^n\hat\varepsilon_i^2\) influences the \(R^2\) and \(R^2_\text{Adj}\). If no relevant variables are added to the model, then \(\sum_{i=1}^n\hat\varepsilon_i^2\) will not change substantially. However, the denominator \(n-k-1\) decreases as \(k\) grows, inflating \(\hat\sigma^2\) and its variance. This is exactly what happened in Figure 3.13.
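The widening of the intervals in the first bullet can be seen by evaluating the \(t\) quantile for decreasing degrees of freedom (a quick sketch at \(\alpha=0.05\)):

```r
# t_{df; 0.025} grows as the degrees of freedom decrease
sapply(c(100, 50, 10, 5, 2, 1), function(df) qt(0.975, df = df))
# Roughly 1.98, 2.01, 2.23, 2.57, 4.30, 12.71
```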
Now that we have shed more light on the problem of having an excess of predictors, we turn our focus to selecting the most adequate predictors for a multiple regression model. This is a challenging task without a unique solution and, what is worse, without a method that is guaranteed to work in all cases. However, there is a well-established procedure that usually gives good results: stepwise regression. Its principle is to compare multiple linear regression models with different predictors (and, of course, with the same response).
Before introducing the method, we need to understand what an information criterion is. An information criterion balances the fitness of a model with the number of predictors it employs. Hence, it objectively determines the best model as the one that minimizes the information criterion. Two common criteria are the
Bayesian Information Criterion (BIC) and the
Akaike Information Criterion (AIC). Both are based on a
balance between the model fitness and its complexity:
\[\begin{align}\text{BIC}(\text{model}) = \underbrace{-2\log\text{lik(model)}}_{\text{Model fitness}} + \underbrace{\text{npar(model)}\times\log n}_{\text{Complexity}}, \tag{3.12}\end{align}\]where \(\text{lik(model)}\) is the likelihood of the model (how well the model fits the data) and \(\text{npar(model)}\) is the number of parameters of the model, \(k+2\) in the case of a multiple linear regression model with \(k\) predictors. The AIC replaces \(\log n\) by \(2\) in (3.12), so it penalizes complexity less heavily. This is one of the reasons why the BIC is preferred by some practitioners for model comparison; another is that the BIC is consistent in selecting the true model: if enough data is provided, the BIC is guaranteed to select the data-generating model among a list of candidate models.
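Formula (3.12) can be verified against R's built-in BIC and AIC functions. A sketch (the iris model is an arbitrary choice):

```r
# BIC computed from (3.12) vs. the built-in BIC()
mod <- lm(Petal.Length ~ Sepal.Width, data = iris)
n <- nobs(mod)
npar <- length(coef(mod)) + 1 # k + 2 parameters: coefficients plus sigma^2
-2 * as.numeric(logLik(mod)) + npar * log(n) # (3.12) by hand
BIC(mod)                                     # Same value
# The AIC analog replaces log(n) by 2
-2 * as.numeric(logLik(mod)) + npar * 2
AIC(mod)
```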
The BIC and AIC can be computed in R
through the functions BIC
and AIC
. They take a model as the input.
# Load iris dataset
data(iris)

# Two models with different predictors
mod1 <- lm(Petal.Length ~ Sepal.Width, data = iris)
mod2 <- lm(Petal.Length ~ Sepal.Width + Petal.Width, data = iris)

# BICs
BIC(mod1)
## [1] 579.7856
BIC(mod2) # Smaller -> better
## [1] 208.0366

# Check the summaries
summary(mod1)
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Width, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7721 -1.4164 -0.1719  1.2094  4.2307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.0632     0.9289   9.757  < 2e-16 ***
## Sepal.Width  -1.7352     0.3008  -5.768 4.51e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.6 on 148 degrees of freedom
## Multiple R-squared:  0.1836, Adjusted R-squared:  0.178 
## F-statistic: 33.28 on 1 and 148 DF,  p-value: 4.513e-08
summary(mod2)
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Width + Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33753 -0.29251 -0.00989  0.21447  1.24707 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.25816    0.31352   7.203 2.84e-11 ***
## Sepal.Width -0.35503    0.09239  -3.843  0.00018 ***
## Petal.Width  2.15561    0.05283  40.804  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4574 on 147 degrees of freedom
## Multiple R-squared:  0.9338, Adjusted R-squared:  0.9329 
## F-statistic:  1036 on 2 and 147 DF,  p-value: < 2.2e-16
Let’s go back to the selection of predictors. If we have \(k\) predictors, a naive procedure would be to check all the possible models that can be constructed with them and then select the best one in terms of BIC/AIC. The problem is that there are \(2^{k+1}\) possible models! Fortunately, the stepwise
procedure helps us navigate this ocean of models. The function takes as input a model employing all the available predictors.
# Explain NOx in Boston dataset
mod <- lm(nox ~ ., data = Boston)

# With BIC
modBIC <- stepwise(mod, trace = 0)
## 
## Direction:  backward/forward
## Criterion:  BIC
summary(modBIC)
## 
## Call:
## lm(formula = nox ~ indus + age + dis + rad + ptratio + medv, 
##     data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.117146 -0.034877 -0.005863  0.031655  0.183363 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7649531  0.0347425  22.018  < 2e-16 ***
## indus        0.0045930  0.0005972   7.691 7.85e-14 ***
## age          0.0008682  0.0001381   6.288 7.03e-10 ***
## dis         -0.0170889  0.0020226  -8.449 3.24e-16 ***
## rad          0.0033154  0.0003730   8.888  < 2e-16 ***
## ptratio     -0.0130209  0.0013942  -9.339  < 2e-16 ***
## medv        -0.0021057  0.0003413  -6.170 1.41e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05485 on 499 degrees of freedom
## Multiple R-squared:  0.7786, Adjusted R-squared:  0.7759 
## F-statistic: 292.4 on 6 and 499 DF,  p-value: < 2.2e-16

# Different search directions
stepwise(mod, trace = 0, direction = "forward")
## 
## Direction:  forward
## Criterion:  BIC
## 
## Call:
## lm(formula = nox ~ dis + indus + age + rad + ptratio + medv, 
##     data = Boston)
## 
## Coefficients:
## (Intercept)          dis        indus          age          rad  
##   0.7649531   -0.0170889    0.0045930    0.0008682    0.0033154  
##     ptratio         medv  
##  -0.0130209   -0.0021057
stepwise(mod, trace = 0, direction = "backward")
## 
## Direction:  backward
## Criterion:  BIC
## 
## Call:
## lm(formula = nox ~ indus + age + dis + rad + ptratio + medv, 
##     data = Boston)
## 
## Coefficients:
## (Intercept)        indus          age          dis          rad  
##   0.7649531    0.0045930    0.0008682   -0.0170889    0.0033154  
##     ptratio         medv  
##  -0.0130209   -0.0021057

# With AIC
modAIC <- stepwise(mod, trace = 0, criterion = "AIC")
## 
## Direction:  backward/forward
## Criterion:  AIC
summary(modAIC)
## 
## Call:
## lm(formula = nox ~ crim + indus + age + dis + rad + ptratio + 
##     medv, data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.122633 -0.035593 -0.004273  0.030938  0.182914 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7750308  0.0349903  22.150  < 2e-16 ***
## crim        -0.0007603  0.0003753  -2.026   0.0433 *  
## indus        0.0044875  0.0005976   7.509 2.77e-13 ***
## age          0.0008656  0.0001377   6.288 7.04e-10 ***
## dis         -0.0175329  0.0020282  -8.645  < 2e-16 ***
## rad          0.0037478  0.0004288   8.741  < 2e-16 ***
## ptratio     -0.0132746  0.0013956  -9.512  < 2e-16 ***
## medv        -0.0022716  0.0003499  -6.491 2.06e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05468 on 498 degrees of freedom
## Multiple R-squared:  0.7804, Adjusted R-squared:  0.7773 
## F-statistic: 252.8 on 7 and 498 DF,  p-value: < 2.2e-16
The model selected by stepwise
is a good starting point for further additions or deletions of predictors. For example, in modAIC
we could remove crim
.
When applying stepwise
for BIC/AIC, different final models might be selected depending on the choice of direction
. This is the interpretation:
- “backward”: starts from the full model and sequentially removes predictors.
- “forward”: starts from the simplest model and sequentially adds predictors.
- “backward/forward” (default) and “forward/backward”: combinations of the above.
The advice is to try several of these methods and retain the one with minimum BIC/AIC. Set trace = 0
to omit the lengthy output of the search procedure.
stepwise
assumes no NA
’s (missing values) are present in the data. It is advised to remove any missing values beforehand, since their presence might lead to errors. To do so, employ data = na.omit(dataset)
in the call to lm
(if your dataset is dataset
).
We conclude by highlighting a caveat on the use of the BIC and AIC: they are constructed assuming that the sample size \(n\) is much larger than the number of parameters in the model (\(k+2\)). Therefore, they will work reasonably well if \(n\gg k+2\), but if this is not true they may favor unrealistically complex models. An illustration of this phenomenon is Figure 3.15, which is the BIC/AIC version of Figure 3.13 for the experiment done in Section 3.6. The BIC and AIC curves tend to have local minima close to \(k=2\) and then increase. But when \(k+2\) gets close to \(n\), they quickly drop down. Note also how the BIC penalizes complexity more than the AIC, whose curve is flatter.
Model diagnostics and multicollinearity
As we saw in Section 3.3, checking the assumptions of the multiple linear model through the data scatterplots becomes tricky even when \(k=2\). To solve this issue, a series of diagnostic plots have been designed to evaluate graphically, and in a simple way, the validity of the assumptions. For illustration, we return to the wine
dataset (download).
mod <- lm(Price ~ Age + AGST + HarvestRain + WinterRain, data = wine)
We will focus on only three plots:
Residuals vs. fitted values plot. This plot serves mainly to check the linearity, although lack of homoscedasticity or independence can also be detected. Here is an example:
Under linearity, we expect the red line (a nonlinear fit of the mean of the residuals) to be almost flat, which means that the trend of \(Y_1,\ldots,Y_n\) is linear with respect to the predictors. Heteroskedasticity can also be detected in the form of irregular vertical dispersion around the red line. The dependence between residuals can be detected too (although it is harder), in the form of non-randomly spread residuals.
QQ-plot. Checks the normality:
Under normality, we expect the points (sample quantiles of the standardized residuals vs. theoretical quantiles of a \(\mathcal{N}(0,1)\)) to align with the diagonal line, which represents the ideal position of the points if they were sampled from a \(\mathcal{N}(0,1)\). It is usual to have larger departures from the diagonal in the extremes than in the center, even under normality, although these departures are clearer if the data is non-normal.
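A quick way of training the eye for QQ-plots is to generate data whose distribution is known. A sketch with simulated samples (the seed and sample size are arbitrary choices):

```r
# QQ-plots for normal and non-normal data
set.seed(1234)
z <- rnorm(100) # Normal data: points should track the diagonal
qqnorm(z); qqline(z)
e <- rexp(100)  # Right-skewed data: clear departure in the upper tail
qqnorm(e); qqline(e)
```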
Scale-location plot. Serves to check the homoscedasticity. It is similar to the first diagnostic plot, but now the residuals are standardized and transformed by a square root (of their absolute value). This change turns the task of spotting heteroskedasticity, by looking for irregular vertical dispersion patterns, into the somewhat simpler task of spotting nonlinearities.
Under homoscedasticity, we expect the red line to be almost flat. If there are consistent nonlinear patterns, then there is evidence of heteroskedasticity.
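A sketch of how heteroskedasticity shows up in this plot, with simulated data whose error dispersion grows with the predictor (all values chosen arbitrarily for the illustration):

```r
# Heteroskedastic toy data: the vertical spread of y grows with x
set.seed(42)
x <- runif(200, 1, 10)
y <- 1 + x + rnorm(200, sd = 0.5 * x)
mod <- lm(y ~ x)
plot(mod, 3) # Scale-location plot: the red line should trend upwards
```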
If you type plot(mod)
, several diagnostic plots will be shown sequentially. In order to advance them, hit ‘Enter’
in the R
console.
The next figures present datasets where the assumptions are satisfied and violated.
Load the dataset assumptions3D.RData (download) and compute the regressions y.3 ~ x1.3 + x2.3, y.4 ~ x1.4 + x2.4, y.5 ~ x1.5 + x2.5 and y.8 ~ x1.8 + x2.8. Use the three diagnostic plots to test the assumptions of the linear model.
A common problem that arises in multiple linear regression is multicollinearity. This is the situation when two or more predictors are highly linearly related. Multicollinearity has important effects on the fit of the model:
- It reduces the precision of the estimates. As a consequence, the signs of fitted coefficients may be reversed and valuable predictors may appear as nonsignificant.
- It is difficult to determine how each of the highly related predictors affects the response, since one masks the other. This may result in numerical instabilities.
An approach to detecting multicollinearity is to compute the correlation matrix between the predictors with cor
(in R Commander
: 'Statistics' > 'Summaries' > 'Correlation matrix...'
)
cor(wine)
##                  Price  WinterRain        AGST HarvestRain         Age
## Price        1.0000000  0.13488004  0.66752483 -0.50718463  0.46040873
## WinterRain   0.1348800  1.00000000  0.32113230  0.26798907  0.05118354
## AGST         0.6675248  0.32113230  1.00000000  0.02708361 -0.29488335
## HarvestRain -0.5071846  0.26798907  0.02708361  1.00000000  0.05884976
## Age          0.4604087  0.05118354 -0.29488335  0.05884976  1.00000000
## FrancePop   -0.4810720 -0.02945091  0.30126148 -0.03201463 -0.99227908
##               FrancePop
## Price       -0.48107195
## WinterRain  -0.02945091
## AGST         0.30126148
## HarvestRain -0.03201463
## Age         -0.99227908
## FrancePop    1.00000000
Here we can see what we already knew from Section 3.1.1, that Age
and Year
are perfectly linearly related and that Age
and FrancePop
are highly linearly related.
However, inspecting pairwise correlations is not enough for detecting and removing multicollinearity. Here is a counterexample:
# Create predictors with multicollinearity: x4 depends on the rest
set.seed(45678)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
x3 <- 0.5 * x2 + rnorm(100)
x4 <- -x1 + x2 + rnorm(100, sd = 0.25)

# Response
y <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4 + rnorm(100)
data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y)

# Correlations - none seems suspicious
cor(data)
##            x1          x2         x3         x4           y
## x1  1.0000000  0.38254782  0.2142011 -0.5261464  0.31194689
## x2  0.3825478  1.00000000  0.5167341  0.5673174  0.04428223
## x3  0.2142011  0.51673408  1.0000000  0.2500123 -0.77482655
## x4 -0.5261464  0.56731738  0.2500123  1.0000000 -0.28677304
## y   0.3119469  0.04428223 -0.7748265 -0.2867730  1.00000000
A better approach is to compute the Variance Inflation Factor (VIF) of each coefficient \(\hat\beta_j\). This is a measure of how linearly dependent \(X_j\) is on the rest of the predictors: \[\text{VIF}(\hat\beta_j)=\frac{1}{1-R^2_{X_j|X_{-j}}},\] where \(R^2_{X_j|X_{-j}}\) is the \(R^2\) of the regression of \(X_j\) onto the remaining predictors. The next rule of thumb gives direct insight into which predictors are multicollinear:
- VIF close to 1: absence of multicollinearity.
- VIF larger than 5 or 10: problematic amount of multicollinearity. It is advised to remove the predictor with the largest VIF.
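The definition of the VIF can be applied by hand, which makes clear why it detects near-linear relations that pairwise correlations miss. A sketch with freshly simulated predictors (seed and coefficients made up for the illustration):

```r
# VIF from its definition: regress one predictor on the others
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2 + rnorm(100, sd = 0.25) # Nearly a linear combination of x1, x2
R2 <- summary(lm(x3 ~ x1 + x2))$r.squared
1 / (1 - R2) # VIF of x3: well above 10, flags the multicollinearity
```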
The VIFs are computed by the function vif, which takes a linear model as argument (in R Commander
: 'Models' > 'Numerical diagnostics' > 'Variance-inflation factors'
). We continue with the previous example.
# Abnormal variance inflation factors: largest for x4, we remove it
modMultiCo <- lm(y ~ x1 + x2 + x3 + x4)
vif(modMultiCo)
##        x1        x2        x3        x4 
## 26.361444 29.726498  1.416156 33.293983

# Without x4
modClean <- lm(y ~ x1 + x2 + x3)

# Comparison
summary(modMultiCo)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9762 -0.6663  0.1195  0.6217  2.5568 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0622     0.1034  10.270  < 2e-16 ***
## x1            0.9224     0.5512   1.673  0.09756 .  
## x2            1.6399     0.5461   3.003  0.00342 ** 
## x3           -3.1652     0.1086 -29.158  < 2e-16 ***
## x4           -0.5292     0.5409  -0.978  0.33040    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.028 on 95 degrees of freedom
## Multiple R-squared:  0.9144, Adjusted R-squared:  0.9108 
## F-statistic: 253.7 on 4 and 95 DF,  p-value: < 2.2e-16
summary(modClean)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91297 -0.66622  0.07889  0.65819  2.62737 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0577     0.1033   10.24  < 2e-16 ***
## x1            1.4495     0.1162   12.47  < 2e-16 ***
## x2            1.1195     0.1237    9.05 1.63e-14 ***
## x3           -3.1450     0.1065  -29.52  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.028 on 96 degrees of freedom
## Multiple R-squared:  0.9135, Adjusted R-squared:  0.9108 
## F-statistic:   338 on 3 and 96 DF,  p-value: < 2.2e-16

# Normal variance inflation factors
vif(modClean)
##       x1       x2       x3 
## 1.171942 1.525501 1.364878
Logistic regression
As we saw in Chapters 2 and 3, linear regression assumes that the response variable \(Y\) is continuous. In this chapter we will see how logistic regression can deal with a discrete response \(Y\). The simplest case is with \(Y\) being a binary response, that is, a variable encoding two categories. In general, we assume that we have \(X_1,\ldots,X_k\) predictors for explaining \(Y\) (multiple logistic regression) and cover the peculiarities for \(k=1\) as particular cases.
More R
basics
In order to implement some of the contents of this chapter we need to cover more R
basics, mostly related with flexible plotting that is not implemented directly in R Commander
. The R
functions we will see are also very useful for simplifying some R Commander
approaches.
In the following sections, type – not copy and paste systematically – the code in the 'R Script'
panel and send it to the output panel. Remember that you should get the same outputs (which are preceded by ## [1]
).
Data frames revisited
# Let's begin importing the iris dataset
data(iris)

# names gives you the variables in the data frame
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

# The beginning of the data
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# So we can access variables by $ or as in a matrix
iris$Sepal.Length[1:10]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris[1:10, 1]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris[3, 1]
## [1] 4.7

# Information on the dimension of the data frame
dim(iris)
## [1] 150   5

# str gives the structure of any object in R
str(iris)
## 'data.frame': 150 obs. of 5 variables:
##  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# Recall the species variable: it is a categorical variable (or factor),
# not a numeric variable
iris$Species[1:10]
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

# Factors can only take certain values
levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"

# If a file contains a variable with character strings as observations (either
# encapsulated by quotation marks or not), the variable will become a factor
# when imported into R
Do the following:
- Import auto.txt into R as the data frame auto. Check how the character strings in the file give rise to factor variables.
- Get the dimensions of auto and show the beginning of the data.
- Retrieve the fifth observation of horsepower in two different ways.
- Compute the levels of name.
Logical conditions and subsetting
# Relational operators: x < y, x > y, x <= y, x >= y, x == y, x != y
# They return TRUE or FALSE

# Smaller than
0 < 1
## [1] TRUE

# Greater than
1 > 1
## [1] FALSE

# Greater or equal to
1 >= 1 # Remember: ">=" and not "=>"!
## [1] TRUE

# Smaller or equal to
2 <= 1 # Remember: "<=" and not "=<"!
## [1] FALSE

# Equal
1 == 1 # Tests equality. Remember: "==" and not "="!
## [1] TRUE

# Unequal
1 != 0 # Tests inequality
## [1] TRUE

# TRUE is encoded as 1 and FALSE as 0
TRUE + 1
## [1] 2
FALSE + 1
## [1] 1

# In a vector-like fashion
x <- 1:5
y <- c(0, 3, 1, 5, 2)
x < y
## [1] FALSE  TRUE FALSE  TRUE FALSE
x == y
## [1] FALSE FALSE FALSE FALSE FALSE
x != y
## [1] TRUE TRUE TRUE TRUE TRUE

# Subsetting of vectors
x
## [1] 1 2 3 4 5
x[x >= 2]
## [1] 2 3 4 5
x[x < 3]
## [1] 1 2

# Easy way of working with parts of the data
data <- data.frame(x = c(0, 1, 3, 3, 0), y = 1:5)
data
##   x y
## 1 0 1
## 2 1 2
## 3 3 3
## 4 3 4
## 5 0 5

# Data such that x is zero
data0 <- data[data$x == 0, ]
data0
##   x y
## 1 0 1
## 5 0 5

# Data such that x is larger than 2
data2 <- data[data$x > 2, ]
data2
##   x y
## 3 3 3
## 4 3 4

# In an example
iris$Sepal.Width[iris$Sepal.Width > 3]
##  [1] 3.5 3.2 3.1 3.6 3.9 3.4 3.4 3.1 3.7 3.4 4.0 4.4 3.9 3.5 3.8 3.8 3.4
## [18] 3.7 3.6 3.3 3.4 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.4
## [35] 3.5 3.2 3.5 3.8 3.8 3.2 3.7 3.3 3.2 3.2 3.1 3.3 3.1 3.2 3.4 3.1 3.3
## [52] 3.6 3.2 3.2 3.8 3.2 3.3 3.2 3.8 3.4 3.1 3.1 3.1 3.1 3.2 3.3 3.4

# Problem - what happened? (x is the global vector 1:5, not the column data$x!)
data[x > 2, ]
##   x y
## 3 3 3
## 4 3 4
## 5 0 5

# In an example
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
## 
summary(iris[iris$Sepal.Width > 3, ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width    
##  Min.   :4.400   Min.   :3.100   Min.   :1.000   Min.   :0.1000  
##  1st Qu.:5.000   1st Qu.:3.200   1st Qu.:1.450   1st Qu.:0.2000  
##  Median :5.400   Median :3.400   Median :1.600   Median :0.4000  
##  Mean   :5.684   Mean   :3.434   Mean   :2.934   Mean   :0.9075  
##  3rd Qu.:6.400   3rd Qu.:3.600   3rd Qu.:5.000   3rd Qu.:1.8000  
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.5000  
##        Species  
##  setosa    :42  
##  versicolor: 8  
##  virginica :17  
## 

# On the factor variable, only == and != make sense
summary(iris[iris$Species == "setosa", ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
##        Species  
##  setosa    :50  
##  versicolor: 0  
##  virginica : 0  
## 

# Subset argument in lm
lm(Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > 3)
## 
## Call:
## lm(formula = Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > 
##     3)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##      3.59439      -0.05455
lm(Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > 3)
## 
## Call:
## lm(formula = Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > 
##     3)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##      3.59439      -0.05455
# Both iris$Sepal.Width and Sepal.Width in subset are fine: data = iris
# tells R to look for Sepal.Width in the iris dataset
# Same thing for the subset field in R Commander's menus

# AND operator &
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE

# OR operator |
TRUE | TRUE
## [1] TRUE
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE

# Both operators are useful for checking for ranges of data
y
## [1] 0 3 1 5 2
index1 <- (y <= 3) & (y > 0)
y[index1]
## [1] 3 1 2
index2 <- (y < 2) | (y > 4)
y[index2]
## [1] 0 1 5

# In an example
summary(iris[iris$Sepal.Width > 3 & iris$Sepal.Width < 3.5, ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.400   Min.   :3.100   Min.   :1.200   Min.   :0.100  
##  1st Qu.:4.925   1st Qu.:3.125   1st Qu.:1.500   1st Qu.:0.200  
##  Median :5.950   Median :3.200   Median :4.450   Median :1.400  
##  Mean   :5.781   Mean   :3.245   Mean   :3.460   Mean   :1.145  
##  3rd Qu.:6.700   3rd Qu.:3.400   3rd Qu.:5.375   3rd Qu.:2.075  
##  Max.   :7.200   Max.   :3.400   Max.   :6.000   Max.   :2.500  
##        Species  
##  setosa    :20  
##  versicolor: 8  
##  virginica :14  
## 
Do the following for the iris dataset:

- Compute the subset corresponding to Petal.Length either smaller than 1.5 or larger than 2. Save this dataset as irisPetal.
- Compute and summarize a linear regression of Sepal.Width on Petal.Width + Petal.Length for the dataset irisPetal. What is the R^{2}? (Solution: 0.101)
- Check that the previous model is the same as regressing Sepal.Width on Petal.Width + Petal.Length for the dataset iris with the appropriate subset expression.
- Compute the variance of Petal.Width when Petal.Width is smaller than or equal to 1.5 and larger than 0.3. (Solution: 0.1266541)
Plotting functions
# plot is the main function for plotting in R
# It has a different behaviour depending on the kind of object that it receives
# For example, for a regression model, it produces diagnostic plots
mod <- lm(Sepal.Width ~ Sepal.Length, data = iris)
plot(mod, 1)
# How to plot some data
plot(iris$Sepal.Length, iris$Sepal.Width, main = "Sepal.Length vs Sepal.Width")
# Change the axis limits
plot(iris$Sepal.Length, iris$Sepal.Width, xlim = c(0, 10), ylim = c(0, 10))
# How to plot a curve (a parabola)
x <- seq(-1, 1, l = 50)
y <- x^2
plot(x, y)
plot(x, y, main = "A dotted parabola")
plot(x, y, main = "A parabola", type = "l")
plot(x, y, main = "A red and thick parabola", type = "l", col = "red", lwd = 3)
# Plotting a more complicated curve between -pi and pi
x <- seq(-pi, pi, l = 50)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l") # Kind of rough...
# More detailed plot
x <- seq(-pi, pi, l = 500)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l")
# Remember that we are joining points for creating a curve!

# For more options in the plot customization see
?plot
?par

# plot is a first-level plotting function. That means that, whenever it is called,
# it creates a new plot. If we want to add information to an existing plot, we
# have to use a second-level plotting function such as points, lines or abline
plot(x, y) # Create a plot
lines(x, x^2, col = "red") # Add lines
points(x, y + 10, col = "blue") # Add points
abline(a = 5, b = 1, col = "orange", lwd = 2) # Add a straight line y = a + b * x
Distributions
The operations on distributions described here are implemented in R Commander
through the menu 'Distributions'
, but it is convenient for you to grasp how they work.
# R allows to sample [r], compute density/probability mass [d],
# compute distribution function [p] and compute quantiles [q] for several
# continuous and discrete distributions. The format employed is [rdpq]name,
# where name stands for:
# - norm -> Normal
# - unif -> Uniform
# - exp -> Exponential
# - t -> Student's t
# - f -> Snedecor's F
# - chisq -> Chi squared
# - pois -> Poisson
# - binom -> Binomial
# More distributions:
?Distributions

# Sampling from a Normal - 10 random points from a N(0, 1)
rnorm(n = 10, mean = 0, sd = 1)
##  [1] 1.8367426 1.3366952 0.4906582 1.0215158 0.1637865 2.5039127
##  [7] 1.3113124 0.1352548 0.1846896 0.6373963

# If you want to have always the same result, set the seed of the random number
# generator
set.seed(45678)
rnorm(n = 10, mean = 0, sd = 1)
##  [1] 1.4404800 0.7195761 0.6709784 0.4219485 0.3782196 1.6665864
##  [7] 0.5082030 0.4433822 1.7993868 0.6179521

# Plotting the density of a N(0, 1) - the Gauss bell
x <- seq(-4, 4, l = 100)
y <- dnorm(x = x, mean = 0, sd = 1)
plot(x, y, type = "l")
# Plotting the distribution function of a N(0, 1)
x <- seq(-4, 4, l = 100)
y <- pnorm(q = x, mean = 0, sd = 1)
plot(x, y, type = "l")
# Computing the 95% quantile for a N(0, 1)
qnorm(p = 0.95, mean = 0, sd = 1)
## [1] 1.644854

# All distributions have the same syntax: rname(n,...), dname(x,...), pname(q,...)
# and qname(p,...), but the parameters in ... change. Look them up in ?Distributions
# For example, here is the same for the uniform distribution

# Sampling from a U(0, 1)
set.seed(45678)
runif(n = 10, min = 0, max = 1)
##  [1] 0.9251342 0.3339988 0.2358930 0.3366312 0.7488829 0.9327177 0.3365313
##  [8] 0.2245505 0.6473663 0.0807549

# Plotting the density of a U(0, 1)
x <- seq(-2, 2, l = 100)
y <- dunif(x = x, min = 0, max = 1)
plot(x, y, type = "l")
# Computing the 95% quantile for a U(0, 1)
qunif(p = 0.95, min = 0, max = 1)
## [1] 0.95

# Sampling from a Bi(10, 0.5)
set.seed(45678)
samp <- rbinom(n = 200, size = 10, prob = 0.5)
table(samp) / 200
## samp
##     1     2     3     4     5     6     7     8     9 
## 0.010 0.060 0.115 0.220 0.210 0.215 0.115 0.045 0.010

# Plotting the probability mass of a Bi(10, 0.5)
x <- 0:10
y <- dbinom(x = x, size = 10, prob = 0.5)
plot(x, y, type = "h") # Vertical bars
# Plotting the distribution function of a Bi(10, 0.5)
x <- 0:10
y <- pbinom(q = x, size = 10, prob = 0.5)
plot(x, y, type = "h")
Do the following:
 Compute the 90%, 95% and 99% quantiles of a F distribution with df1 = 1 and df2 = 5. (Answer: c(4.060420, 6.607891, 16.258177))
 Plot the distribution function of a U(0, 1). Does it make sense with its density function?
 Sample 100 points from a Poisson with lambda = 5.
 Sample 100 points from a U(−1, 1) and compute its mean.
 Plot the density of a t distribution with df = 1 (use a sequence spanning from -4 to 4). Add lines of different colors with the densities for df = 5, df = 10, df = 50 and df = 100. Do you see any pattern?
Defining functions
# A function is a way of encapsulating a block of code so it can be reused easily.
# Functions are useful for simplifying repetitive tasks and organizing the analysis

# For example, in Section 3.7 we had to make use of simpleAnova for computing
# the simple ANOVA table in multiple regression.

# This is a silly function that takes x and y and returns their sum
add <- function(x, y) {
  x + y
}

# Calling add - you need to run the definition of the function first!
add(1, 1)
## [1] 2
add(x = 1, y = 2)
## [1] 3

# A more complex function: computes a linear model and its posterior summary.
# Saves us a few keystrokes when computing a lm and a summary
lmSummary <- function(formula, data) {
  model <- lm(formula = formula, data = data)
  summary(model)
}

# Usage
lmSummary(Sepal.Length ~ Petal.Width, iris)
## 
## Call:
## lm(formula = formula, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.38822 -0.29358 -0.04393  0.26429  1.34521 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.77763    0.07293   65.51   <2e-16 ***
## Petal.Width  0.88858    0.05137   17.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared:  0.669,  Adjusted R-squared:  0.6668 
## F-statistic: 299.2 on 1 and 148 DF,  p-value: < 2.2e-16

# Recall: there is no variable called model in the workspace.
# The function works on its own workspace!
model
## 
## Call:
## lm(formula = medv ~ crim + lstat + zn + nox, data = Boston)
## 
## Coefficients:
## (Intercept)         crim        lstat           zn          nox  
##    30.93462     -0.08297     -0.90940      0.03493     -5.42234

# Add a line to a plot
addLine <- function(x, beta0, beta1) {
  lines(x, beta0 + beta1 * x, lwd = 2, col = 2)
}

# Usage
plot(x, y)
addLine(x, beta0 = 0.1, beta1 = 0)
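As a further sketch (illustrative, not from the examples above; `powerSum` is a made-up name), functions can also declare default arguments, which are used whenever the caller does not supply a value:

```r
# Functions can have default arguments, used when the caller omits them
powerSum <- function(x, y, p = 1) {
  x^p + y^p
}
powerSum(2, 3)        # p defaults to 1: 2 + 3 = 5
powerSum(2, 3, p = 2) # 2^2 + 3^2 = 13
```

Defaults keep calls short for the common case while still allowing the behavior to be overridden by name.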
Examples and applications
Case study: The Challenger disaster
The Challenger disaster occurred on the 28th of January of 1986, when the NASA Space Shuttle orbiter Challenger broke apart and disintegrated 73 seconds into its flight, leading to the deaths of its seven crew members. The accident deeply shocked US society, in part due to the attention the mission had received because of the presence of Christa McAuliffe, who would have been the first astronaut-teacher. Because of this, NASA TV broadcast the launch live to US public schools, which resulted in millions of school children witnessing the accident. The accident had serious consequences for NASA's credibility and resulted in an interruption of 32 months in the shuttle program. The Presidential Rogers Commission (formed by astronaut Neil A. Armstrong and Nobel laureate Richard P. Feynman, among others) was created to investigate the disaster.
The Rogers Commission elaborated a report (Presidential Commission on the Space Shuttle Challenger Accident 1986) with all the findings. The commission determined that the disintegration began with the failure of an O-ring seal in the solid rocket motor due to the unusually cold temperature (-0.6 Celsius degrees) during the launch. This failure produced a breach of burning gas through the solid rocket motor that compromised the whole shuttle structure, resulting in its disintegration due to the extreme aerodynamic forces. The problem with the O-rings was known: the night before the launch, there was a three-hour teleconference between motor engineers and NASA management, discussing the effect of the low temperature forecasted for the launch on the O-ring performance. The conclusion, influenced by Figure 4.2a, was:
“Temperature data [are] not conclusive on predicting primary O-ring blowby.”
The Rogers Commission noted a major flaw in Figure 4.2a: the flights with zero incidents were excluded from the plot because it was felt that these flights did not contribute any information about the temperature effect (Figure 4.2b). The Rogers Commission concluded:
“A careful analysis of the flight history of O-ring performance would have revealed the correlation of O-ring damage in low temperature”.
The purpose of this case study, inspired by Dalal, Fowlkes, and Hoadley (1989), is to quantify the influence of the temperature on the probability of having at least one incident related to the O-rings. Specifically, we want to address the following questions:
 Q1. Is the temperature associated with O-ring incidents?
 Q2. In which way was the temperature affecting the probability of O-ring incidents?
 Q3. What was the predicted probability of an incident in an O-ring for the temperature of the launch day?
To try to answer these questions we have the challenger
dataset (download). The dataset contains (shown in Table 4.1) information regarding the state of the solid rocket boosters after launch for 23 flights. Each row has, among others, the following variables:
 fail.field, fail.nozzle: binary variables indicating whether there was an incident with the O-rings in the field joints or in the nozzles of the solid rocket boosters. 1 codifies an incident and 0 its absence. In the analysis, we focus on the O-rings of the field joints as the most determinant for the accident.
 temp: temperature on the day of launch. Measured in Celsius degrees.
 pres.field, pres.nozzle: leak-check pressure tests of the O-rings. These tests assured that the rings would seal the joint.
Table 4.1: The challenger dataset.

flight  date  fail.field  fail.nozzle  temp
1  12/04/81  0  0  18.9 
2  12/11/81  1  0  21.1 
2  12/11/81  1  0  21.1 
3  22/03/82  0  0  20.6 
5  11/11/82  0  0  20.0 
6  04/04/83  0  1  19.4 
7  18/06/83  0  0  22.2 
8  30/08/83  0  0  22.8 
9  28/11/83  0  0  21.1 
41B  03/02/84  1  1  13.9 
41C  06/04/84  1  1  17.2 
41D  30/08/84  1  1  21.1 
41G  05/10/84  0  0  25.6 
51A  08/11/84  0  0  19.4 
51C  24/01/85  1  1  11.7 
51D  12/04/85  0  1  19.4 
51B  29/04/85  0  1  23.9 
51G  17/06/85  0  1  21.1 
51F  29/07/85  0  0  27.2 
51I  27/08/85  0  0  24.4 
51J  03/10/85  0  0  26.1 
61A  30/10/85  1  0  23.9 
61B  26/11/85  0  1  24.4 
61C  12/01/86  1  1  14.4 
Let’s begin the analysis by replicating Figures 4.2a and 4.2b and checking that linear regression is not the right tool for answering Q1–Q3. For that, we make two scatterplots of nfails.field
(number of total incidents in the field joints) versus temp
, the first one excluding the launches without incidents (subset = nfails.field > 0
) and the second one for all the data. Doing it through R Commander
as we saw in Chapter 2, you should get something similar to:
scatterplot(nfails.field ~ temp, reg.line = lm, smooth = FALSE, spread = FALSE, boxplots = FALSE, data = challenger, subset = nfails.field > 0)
scatterplot(nfails.field ~ temp, reg.line = lm, smooth = FALSE, spread = FALSE, boxplots = FALSE, data = challenger)
There is a fundamental problem in using linear regression for these data: the response is not continuous. As a consequence, there is no linearity and the errors around the mean are not normal (indeed, they are strongly non-normal). Let’s check this with the corresponding diagnostic plots:
mod <- lm(nfails.field ~ temp, data = challenger)
par(mfrow = 1:2)
plot(mod, 1)
plot(mod, 2)
Although linear regression is not the adequate tool for these data, it is able to detect the obvious difference between the two plots:
 The trend for launches with incidents is flat, hence suggesting there is no dependence on the temperature (Figure 4.2a). This was one of the arguments behind NASA’s decision of launching the rocket at a temperature of -0.6 degrees.
 However, the trend for all launches indicates a clear negative dependence between temperature and number of incidents! (Figure 4.2b). Think about it in this way: the minimum temperature for a launch without incidents ever recorded was above 18 degrees, and the Challenger was launched at -0.6 degrees without clearly knowing the effects of such low temperatures.
Instead of trying to predict the number of incidents, we will concentrate on modelling the
probability of expecting at least one incident given the temperature, a simpler but also revealing approach. In other words, we look to estimate the following curve:
\[p(x)=\mathbb{P}(\text{incident}=1\mid\text{temperature}=x)\] from
fail.field
and
temp
. This probability cannot be properly modeled as a linear function like \(\beta_0+\beta_1x\), since it will inevitably fall outside \([0,1]\) for some value of \(x\) (some values will have negative probabilities or probabilities larger than one). The technique that solves this problem is the logistic regression. The idea behind it is quite simple: transform a linear model \(\beta_0+\beta_1x\) – which is aimed for a response in \(\mathbb{R}\) – so that it yields a value in \([0,1]\). This is achieved by the logistic function\[\begin{align}\text{logistic}(t)=\frac{e^t}{1+e^t}=\frac{1}{1+e^{-t}}.\tag{4.1}\end{align}\]The logistic model in this case is \[\mathbb{P}(\text{incident}=1\mid\text{temperature}=x)=\text{logistic}\left(\beta_0+\beta_1x\right)=\frac{1}{1+e^{-(\beta_0+\beta_1x)}},\] with \(\beta_0\) and \(\beta_1\) unknown. Let’s fit the model to the data by estimating \(\beta_0\) and \(\beta_1\).
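A quick numerical sketch of (4.1): the function below (the name `logistic` is illustrative; base R provides the equivalent plogis) maps any real linear predictor into \((0,1)\), which is exactly what a probability requires.

```r
# Sketch of the logistic function (4.1); base R's plogis() is equivalent
logistic <- function(t) 1 / (1 + exp(-t))

logistic(0)   # 0.5: a zero linear predictor maps to probability 1/2
logistic(10)  # close to 1
logistic(-10) # close to 0
```

Note that the curve is strictly increasing, so larger linear predictors always translate into larger probabilities.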
In order to fit a logistic regression to the data, go to 'Statistics' > 'Fit models' > 'Generalized linear model...'
. A window like Figure 4.3 will pop up, which you should fill in as indicated.
A code like this will be generated:
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summary(nasa)
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5

exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")
## (Intercept)        temp 
## 1965.9743592   0.6592539
The summary of the logistic model is notably different from that of the linear regression, as the methodology behind it is quite different. Nevertheless, we have tests for the significance of each coefficient. Here we obtain that temp is significantly different from zero, at least at level \(\alpha=0.05\). Therefore we can conclude that the temperature is indeed affecting the probability of an incident with the O-rings (which answers Q1).
The precise interpretation of the coefficients will be given in the next section. For now, the coefficient of temp
, \(\hat\beta_1\), can be regarded as the “correlation between the temperature and the probability of having at least one incident”. This correlation, as evidenced by the sign of \(\hat\beta_1\), is negative. Let’s plot the fitted logistic curve to see that indeed the probability of incident and temperature are negatively correlated:
# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30),
     xlab = "Temperature", ylab = "Incident probability")

# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)

# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)
At the sight of this curve and the summary of the model we can conclude that lower temperatures increased the probability of an O-ring incident (which answers Q2). Indeed, the confidence intervals for the coefficients show a significant negative correlation at level \(\alpha=0.05\):
confint(nasa, level = 0.95)
##                  2.5 %     97.5 %
## (Intercept)  1.3364047 17.7834329
## temp        -0.9237721 -0.1089953
Finally, the probability of having at least one incident with the O-rings on the launch day was \(0.9996\) according to the fitted logistic model (which answers Q3). This is easily obtained:
predict(nasa, newdata = data.frame(temp = -0.6), type = "response")
##        1 
## 0.999604
Be aware that type = "response" has a different meaning in logistic regression than in linear models. As you can see, it does not return a CI for the prediction as in linear models. Instead, type = "response" means that the probability is returned, instead of the value of the link function, which is returned with type = "link" (the default).
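A minimal sketch of this relation, on simulated data (the data and coefficient values are illustrative; any logistic fit behaves the same way): the "response" predictions are exactly the logistic transformation of the "link" predictions.

```r
# Simulate a toy logistic dataset and fit it
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(-1 + 2 * x))
fit <- glm(y ~ x, family = "binomial")

# "link" gives beta0 + beta1 * x; "response" gives logistic(beta0 + beta1 * x)
newd <- data.frame(x = c(-1, 0, 1))
linkPred <- predict(fit, newdata = newd, type = "link")
probPred <- predict(fit, newdata = newd, type = "response")
all.equal(probPred, 1 / (1 + exp(-linkPred))) # TRUE
```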
Recall that there is a serious problem of extrapolation in the prediction, which makes it less precise (or more variable). But this extrapolation, together with the evidence raised by a simple analysis like the one we did, should have been strong arguments for postponing the launch.
To conclude this section, we refer to a funny and comprehensive exposition by Juan Cuesta (University of Cantabria) on the flawed statistical analysis that contributed to the Challenger disaster.
Model formulation and estimation by maximum likelihood
As we saw in Section
3.2, the multiple linear model described the relation between the random variables
\(X_1,\ldots,X_k\) and
\(Y\) by assuming the linear relation
\[\begin{align*}Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k + \varepsilon.\end{align*}\]Since we assume
\(\mathbb{E}[\varepsilon\mid X_1=x_1,\ldots,X_k=x_k]=0\), the previous equation was equivalent to
\[\begin{align}\mathbb{E}[Y\mid X_1=x_1,\ldots,X_k=x_k]=\beta_0+\beta_1x_1+\ldots+\beta_kx_k,\tag{4.2}\end{align}\]where \(\mathbb{E}[Y\mid X_1=x_1,\ldots,X_k=x_k]\) is the mean of \(Y\) for a particular value of the set of predictors. As remarked in Section 3.3, it was a necessary condition that \(Y\) was continuous in order to satisfy the normality of the errors, hence the linear model assumptions. In other words, the linear model is designed for a continuous response.
The situation when \(Y\) is discrete (naturally ordered values; e.g. number of failures, number of students) or categorical (non-ordered categories; e.g. territorial divisions, ethnic groups) requires a special treatment. The simplest situation is when \(Y\) is binary (or dichotomous): it can only take two values, codified for convenience as \(1\) (success) and \(0\) (failure). For example, in the Challenger case study we used fail.field
as an indicator of whether “there was at least an incident with the Orings” (1
= yes, 0
= no). For binary variables there is no fundamental distinction between the treatment of discrete and categorical variables.
More formally, a binary variable is known as a
Bernoulli variable, which is the simplest nontrivial random variable. We say that
\(Y\sim\mathrm{Ber}(p)\),
\(0\leq p\leq1\), if
\[Y=\left\{\begin{array}{ll}1,&\text{with probability }p,\\0,&\text{with probability }1-p,\end{array}\right.\] or, equivalently, if
\(\mathbb{P}[Y=1]=p\) and
\(\mathbb{P}[Y=0]=1-p\), which can be written compactly as
\[\begin{align}\mathbb{P}[Y=y]=p^y(1-p)^{1-y},\quad y=0,1.\tag{4.3}\end{align}\]Recall that a binomial variable with size \(n\) and probability \(p\), \(\mathrm{Bi}(n,p)\), is obtained by summing \(n\) independent \(\mathrm{Ber}(p)\) (so \(\mathrm{Ber}(p)\) is the same as \(\mathrm{Bi}(1,p)\)). This is why we need to use family = "binomial"
in glm
, to indicate that the response is binomial.
A Bernoulli variable Y is completely determined by the probability p, and so are its mean and variance:
 𝔼[Y]=p × 1 + (1 − p)×0 = p
 𝕍ar[Y]=p(1 − p)
In particular, recall that ℙ[Y = 1]=𝔼[Y]=p.
This is something relatively uncommon (for a 𝒩(μ, σ^{2}), μ determines the mean and σ^{2} the variance) that has important consequences for the logistic model: we do not need a σ^{2}.
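These identities are easy to check by simulation (a sketch; the sample size and the value of p below are arbitrary). Recall that a Ber(p) can be sampled as a Bi(1, p):

```r
# Numerical check of E[Y] = p and Var[Y] = p(1 - p) for a Ber(p)
set.seed(42)
p <- 0.3
y <- rbinom(1e5, size = 1, prob = p)
mean(y) # close to p = 0.3
var(y)  # close to p * (1 - p) = 0.21
```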
Are these Bernoulli variables? If so, what is the value of p and what could the codes 0 and 1 represent?
 The toss of a fair coin.
 A variable with mean p and variance p(1 − p).
 The roll of a die.
 A binary variable with mean 0.5 and variance 0.45.
 The winner of an election with two candidates.
Assume then that \(Y\) is a binary/Bernoulli variable and that \(X_1,\ldots,X_k\) are predictors associated to \(Y\) (no particular assumptions on them). The purpose in logistic regression is to estimate \[p(x_1,\ldots,x_k)=\mathbb{P}[Y=1\mid X_1=x_1,\ldots,X_k=x_k]=\mathbb{E}[Y\mid X_1=x_1,\ldots,X_k=x_k],\] that is, how the probability of \(Y=1\) changes according to particular values, denoted by \(x_1,\ldots,x_k\), of the random variables \(X_1,\ldots,X_k\). \(p(x_1,\ldots,x_k)=\mathbb{P}[Y=1\mid X_1=x_1,\ldots,X_k=x_k]\) stands for the conditional probability of \(Y=1\) given \(X_1,\ldots,X_k\). At sight of (4.2), a tempting possibility is to consider the model \[p(x_1,\ldots,x_k)=\beta_0+\beta_1x_1+\ldots+\beta_kx_k.\] However, such a model will inevitably run into serious problems: negative probabilities and probabilities larger than one. The solution is to consider a function that takes the value of \(z=\beta_0+\beta_1x_1+\ldots+\beta_kx_k\), in \(\mathbb{R}\), and maps it back to \([0,1]\). There are several alternatives to do so, based on distribution functions \(F:\mathbb{R}\longrightarrow[0,1]\) that deliver \(y=F(z)\in[0,1]\) (see Figure 4.5). Different choices of \(F\) give rise to different models:
 Uniform. Truncate \(z\) to \(0\) and \(1\) when \(z<0\) and \(z>1\), respectively.
 Logit. Consider the logistic distribution function: \[F(z)=\mathrm{logistic}(z)=\frac{e^z}{1+e^z}=\frac{1}{1+e^{z}}.\]
 Probit. Consider the normal distribution function, this is, \(F=\Phi\).
The
logistic transformation is the most employed due to its
tractability, interpretability and smoothness. Its inverse,
\(F^{1}:[0,1]\longrightarrow\mathbb{R}\), known as the
logit function, is
\[\mathrm{logit}(p)=\mathrm{logistic}^{-1}(p)=\log\frac{p}{1-p}.\] This is a link function, that is, a function that maps a given space (in this case
\([0,1]\)) into
\(\mathbb{R}\). The term link function is employed in
generalized linear models, which follow exactly the same philosophy of the logistic regression – mapping the domain of
\(Y\) to
\(\mathbb{R}\) in order to apply there a linear model. We will concentrate here exclusively on the logit as a link function. Therefore, the
logistic model is
\[\begin{align}p(x_1,\ldots,x_k)=\mathrm{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}.\tag{4.4}\end{align}\]The linear form inside the exponent has a clear interpretation:
 If \(\beta_0+\beta_1x_1+\ldots+\beta_kx_k=0\), then \(p(x_1,\ldots,x_k)=\frac{1}{2}\) (\(Y=1\) and \(Y=0\) are equally likely).
 If \(\beta_0+\beta_1x_1+\ldots+\beta_kx_k<0\), then \(p(x_1,\ldots,x_k)<\frac{1}{2}\) (\(Y=1\) less likely).
 If \(\beta_0+\beta_1x_1+\ldots+\beta_kx_k>0\), then \(p(x_1,\ldots,x_k)>\frac{1}{2}\) (\(Y=1\) more likely).
To be more precise on the interpretation of the coefficients
\(\beta_0,\ldots,\beta_k\) we need to introduce the
odds. The
odds is an equivalent way of expressing the distribution of probabilities in a binary variable. Since
\(\mathbb{P}[Y=1]=p\) and
\(\mathbb{P}[Y=0]=1-p\), both the success and failure probabilities can be inferred from
\(p\). Instead of using
\(p\) to characterize the distribution of
\(Y\), we can use
\[\begin{align}\mathrm{odds}(Y)=\frac{p}{1-p}=\frac{\mathbb{P}[Y=1]}{\mathbb{P}[Y=0]}.\tag{4.5}\end{align}\]The odds is the ratio between the probability of success and the probability of failure. It is extensively used in betting due to its better interpretability. For example, if a horse \(Y\) has a probability \(p=2/3\) of winning a race (\(Y=1\)), then the odds of the horse is \[\text{odds}=\frac{p}{1-p}=\frac{2/3}{1/3}=2.\] This means that the probability of the horse winning is twice as large as the probability of it losing. This is sometimes written as \(2:1\) or \(2 \times 1\) (spelled “two-to-one”). Conversely, if the odds of \(Y\) is given, we can easily know the probability of success \(p\) using the inverse of (4.5): \[p=\mathbb{P}[Y=1]=\frac{\text{odds}(Y)}{1+\text{odds}(Y)}.\] For example, if the odds of the horse were \(5\), that would correspond to a probability of winning \(p=5/6\).
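The horse example can be reproduced in one line each way:

```r
# Odds of the horse example: p = 2/3 gives odds 2 ("two-to-one"), and the
# probability is recovered from the odds with the inverse of (4.5)
p <- 2 / 3
odds <- p / (1 - p)
odds              # 2
odds / (1 + odds) # back to p = 2/3
```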
Recall that the odds is a number in [0, +∞]. The 0 and +∞ values are attained for p = 0 and p = 1, respectively. The logodds (or logit) is a number in [ − ∞, +∞].
We can rewrite
(4.4) in terms of the odds
(4.5). If we do so, we have:
\[\begin{align}\mathrm{odds}(Y\mid X_1=x_1,\ldots,X_k=x_k)&=\frac{p(x_1,\ldots,x_k)}{1-p(x_1,\ldots,x_k)}\nonumber\\&=e^{\beta_0+\beta_1x_1+\ldots+\beta_kx_k}\nonumber\\&=e^{\beta_0}e^{\beta_1x_1}\ldots e^{\beta_kx_k}.\tag{4.6}\end{align}\]or, taking logarithms, the log-odds (or logit) \[\begin{align}\log(\mathrm{odds}(Y\mid X_1=x_1,\ldots,X_k=x_k))=\beta_0+\beta_1x_1+\ldots+\beta_kx_k.\tag{4.7}\end{align}\]The conditional log-odds (4.7) plays here the role of the conditional mean for multiple linear regression. Therefore, we have an analogous interpretation for the coefficients:
 \(\beta_0\): is the log-odds when \(X_1=\ldots=X_k=0\).
 \(\beta_j\), \(1\leq j\leq k\): is the additive increment of the log-odds for an increment of one unit in \(X_j=x_j\), provided that the remaining variables \(X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k\) do not change.
The log-odds is not as easy to interpret as the odds. For that reason, an equivalent way of interpreting the coefficients, this time based on (4.6), is:
 \(e^{\beta_0}\): is the odds when \(X_1=\ldots=X_k=0\).
 \(e^{\beta_j}\), \(1\leq j\leq k\): is the multiplicative increment of the odds for an increment of one unit in \(X_j=x_j\), provided that the remaining variables \(X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k\) do not change. If the increment in \(X_j\) is of \(r\) units, then the multiplicative increment in the odds is \((e^{\beta_j})^r\).
As a consequence of this last interpretation, we have:
If
\(\beta_j>0\) (respectively,
\(\beta_j<0\)) then
\(e^{\beta_j}>1\) (
\(e^{\beta_j}<1\)) in
(4.6). Therefore, an increment of one unit in
\(X_j\), provided that the remaining variables
\(X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k\) do not change, results in an increment (decrement) of the odds, that is, in an increment (decrement) of
\(\mathbb{P}[Y=1]\).
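The multiplicative reading of the coefficients can be checked numerically (the coefficient values below are illustrative, not estimates from any dataset):

```r
# A one-unit increment in x multiplies the odds by exp(beta1), and an
# r-unit increment multiplies them by exp(beta1)^r, as given by (4.6)
beta0 <- -1
beta1 <- 0.5
odds <- function(x) exp(beta0 + beta1 * x) # odds for a single predictor
odds(3) / odds(2) # equals exp(beta1)
odds(5) / odds(2) # equals exp(beta1)^3: an increment of r = 3 units
```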
Since the relationship between p(X_{1}, …, X_{k}) and X_{1}, …, X_{k} is not linear, β_{j} does not correspond to the change in p(X_{1}, …, X_{k}) associated with a oneunit increase in X_{j}.
Let’s quickly visualize these concepts with the output of the Challenger case study:
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summary(nasa)
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5

exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")
## (Intercept)        temp 
## 1965.9743592   0.6592539

# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30),
     xlab = "Temperature", ylab = "Incident probability")

# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)

# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)
The exponentials of the estimated coefficients are:
 \(e^{\hat\beta_0}=1965.974\). This means that, when the temperature is zero, the fitted odds is \(1965.974\), so the probability of having an incident (\(Y=1\)) is \(1965.974\) times larger than the probability of not having an incident (\(Y=0\)). In other words, the probability of having an incident at temperature zero is \(\frac{1965.974}{1965.974+1}=0.999\).
 \(e^{\hat\beta_1}=0.659\). This means that each Celsius degree increment in the temperature multiplies the fitted odds by a factor of \(0.659\approx\frac{2}{3}\), hence reducing it.
The estimation of
\(\boldsymbol{\beta}=(\beta_0,\beta_1,\ldots,\beta_k)\) from a sample
\((\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n)\) is different than in linear regression. It is not based on minimizing the RSS but on the principle of
Maximum Likelihood Estimation (MLE). MLE is based on the following
leitmotiv:
what are the coefficients \(\boldsymbol{\beta}\) that make the sample more likely? Or in other words,
what coefficients make the model more probable, based on the sample. Since
\(Y_i\sim \mathrm{Ber}(p(\mathbf{X}_i))\),
\(i=1,\ldots,n\), the likelihood of
\(\boldsymbol{\beta}\) is
\[\begin{align}\text{lik}(\boldsymbol{\beta})=\prod_{i=1}^np(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}.\tag{4.8}\end{align}\]\(\text{lik}(\boldsymbol{\beta})\) is the probability of the data based on the model. Therefore, it is a number between \(0\) and \(1\). Its detailed interpretation is the following:
 \(\prod_{i=1}^n\) appears because the sample elements are assumed to be independent and we are computing the probability of observing the whole sample \((\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n)\). This probability is equal to the product of the probabilities of observing each \((\mathbf{X}_{i},Y_i)\).
 \(p(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}\) is the probability of observing \((\mathbf{X}_{i},Y_i)\), as given by (4.3). Remember that \(p\) depends on \(\boldsymbol{\beta}\) due to (4.4).
Usually, the loglikelihood is considered instead of the likelihood for stability reasons – the estimates obtained are exactly the same and are \[\hat{\boldsymbol{\beta}}=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{k+1}}\log \text{lik}(\boldsymbol{\beta}).\] Unfortunately, due to the nonlinearity of the optimization problem there are no explicit expressions for \(\hat{\boldsymbol{\beta}}\). These have to be obtained numerically by means of an iterative procedure (the number of iterations required is printed in the output of summary
). In low sample situations with perfect classification, the iterative procedure may not converge.
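The log-likelihood in (4.8) can also be computed by hand and compared with logLik, which reports the maximized log-likelihood found by glm. A sketch on simulated data (the data-generating values are illustrative):

```r
# Fit a logistic model on simulated data
set.seed(1234)
x <- rnorm(50)
y <- rbinom(50, size = 1, prob = plogis(0.5 + x))
fit <- glm(y ~ x, family = "binomial")

# Logarithm of (4.8), evaluated at the estimated coefficients
p <- fitted(fit) # p(X_i) at the estimates
logLikHand <- sum(y * log(p) + (1 - y) * log(1 - p))
all.equal(logLikHand, as.numeric(logLik(fit))) # TRUE
```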
Figure
4.6 shows how the loglikelihood changes with respect to the values for
\((\beta_0,\beta_1)\) in three data patterns.
The data of the illustration has been generated with the following code:
# Data
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))

# Data frame
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)
Let’s check that indeed the coefficients given by glm
are the ones that maximize the likelihood given in the animation of Figure 4.6. We do so for y1 ~ x.
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
mod$coefficients
## (Intercept)           x 
##  -0.1691947   2.4281626
For the regressions y2 ~ x and y3 ~ x, do the following:
 Check that \(\hat{\boldsymbol{\beta}}\) is indeed maximizing the likelihood, as compared with Figure 4.6.
 Plot the fitted logistic curve and compare it with the one in Figure 4.6.
In linear regression we relied on least squares estimation, in other words, the minimization of the RSS. Why do we need MLE in logistic regression and not least squares? The answer is twofold:
 MLE is asymptotically optimal when estimating unknown parameters in a model. That means that when the sample size n is large, it is guaranteed to perform better than any other estimation method. Therefore, considering a least squares approach for logistic regression will result in suboptimal estimates.
 In multiple linear regression, due to the normality assumption, MLE and least squares estimation coincide. So MLE is hidden under the form of the least squares, which is a more intuitive estimation procedure. Indeed, the maximized likelihood \(\text{lik}(\hat{\boldsymbol{\beta}})\) in the linear model and the RSS are intimately related.
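The second point can be verified directly: fitting a linear model by least squares (lm) and by Gaussian maximum likelihood (glm with family = "gaussian") yields the same coefficients. The iris regression used above serves as an example:

```r
# Least squares (lm) and Gaussian MLE (glm) coincide in the linear model
coefLS <- coef(lm(Sepal.Length ~ Petal.Width, data = iris))
coefML <- coef(glm(Sepal.Length ~ Petal.Width, data = iris, family = "gaussian"))
all.equal(coefLS, coefML) # TRUE
```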
As in the linear model, the inclusion of a new predictor changes the coefficient estimates of the logistic model.
Assumptions of the model
Some probabilistic assumptions are required for performing inference on the model parameters \(\boldsymbol\beta\) from the sample \((\mathbf{X}_1, Y_1),\ldots,(\mathbf{X}_n, Y_n)\). These assumptions are somewhat simpler than the ones for linear regression.
The assumptions of the logistic model are the following:
 Linearity in the logit: \(\mathrm{logit}(p(\mathbf{x}))=\log\frac{ p(\mathbf{x})}{1p(\mathbf{x})}=\beta_0+\beta_1x_1+\ldots+\beta_kx_k\).
 Binariness: \(Y_1,\ldots,Y_n\) are binary variables.
 Independence: \(Y_1,\ldots,Y_n\) are independent.
A good one-line summary of the logistic model is the following (independence is assumed):
\[\begin{align}Y\mid(X_1=x_1,\ldots,X_k=x_k)&\sim\mathrm{Ber}\left(\mathrm{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)\right)\nonumber\\&=\mathrm{Ber}\left(\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}\right).\tag{4.9}\end{align}\]There are three important points of the linear model assumptions missing in the ones for the logistic model:
 Why is homoscedasticity not required? As seen in the previous section, Bernoulli variables are determined only by the probability of success, in this case \(p(\mathbf{x})\). That determines also the variance, which is variable, so there is heteroskedasticity. In the linear model, we have to control \(\sigma^2\) explicitly due to the higher flexibility of the normal.
 Where are the errors? The errors played a fundamental role in the linear model assumptions, but are not employed in logistic regression. The errors are not fundamental for building the linear model but just a helpful concept related to least squares. The linear model can be constructed without errors as (3.5), which has a logistic analogous in (4.9).
 Why is normality not present? A normal distribution is not adequate to replace the Bernoulli distribution in (4.9), since the response \(Y\) has to be binary and the normal or any other continuous distribution would yield illegal values for \(Y\).
Recall that:
 Nothing is said about the distribution of X_{1}, …, X_{k}. They could be deterministic or random. They could be discrete or continuous.
 X_{1}, …, X_{k} are not required to be independent between them.
Inference for model parameters
The assumptions on which the logistic model is constructed allow us to specify the asymptotic distribution of the random vector \(\hat{\boldsymbol{\beta}}\). Again, the distribution is derived conditionally on the sample predictors \(\mathbf{X}_1,\ldots,\mathbf{X}_n\). In other words, we assume that the randomness of \(Y\) comes only from \(Y\mid(X_1=x_1,\ldots,X_k=x_k)\sim\mathrm{Ber}(p(\mathbf{x}))\) and not from the predictors. To denote this, we employ lowercase for the sample predictors \(\mathbf{x}_1,\ldots,\mathbf{x}_n\).
There is an important difference between the inference results for the linear model and for logistic regression:
 In linear regression the inference is exact. This is due to the nice properties of the normal, least squares estimation and linearity. As a consequence, the distributions of the coefficients are perfectly known assuming that the assumptions hold.
 In logistic regression the inference is asymptotic. This means that the distributions of the coefficients are unknown except for large sample sizes \(n\), for which we have approximations. The reason is the greater complexity of the model in terms of nonlinearity. This is the usual situation for the majority of regression models.
Distributions of the fitted coefficients
The distribution of
\(\hat{\boldsymbol{\beta}}\) is given by the asymptotic theory of MLE:
\[\begin{align}\hat{\boldsymbol{\beta}}\sim\mathcal{N}_{k+1}\left(\boldsymbol\beta,I(\hat{\boldsymbol{\beta}})^{-1}\right)\tag{4.10}\end{align}\]where
\(\sim\) must be understood as
approximately distributed as […] when \(n\to\infty\) for the rest of this chapter.
\(I(\boldsymbol\beta)\) is known as the
Fisher information matrix, and receives that name because
it measures the information available in the sample for estimating \(\boldsymbol\beta\). Therefore, the
larger the matrix is, the more precise is the estimation of
\(\boldsymbol\beta\), because that results in smaller variances in
(4.10). The inverse of the Fisher information matrix is
\[\begin{align}I(\hat{\boldsymbol{\beta}})^{-1}=(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1},\tag{4.11}\end{align}\]where \(\mathbf{V}\) is a diagonal matrix containing the different variances for each \(Y_i\) (remember that \(p(\mathbf{x})=1/(1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)})\)): \[\mathbf{V}=\begin{pmatrix}p(\mathbf{X}_1)(1-p(\mathbf{X}_1)) & & &\\& p(\mathbf{X}_2)(1-p(\mathbf{X}_2)) & & \\& & \ddots & \\& & & p(\mathbf{X}_n)(1-p(\mathbf{X}_n))\end{pmatrix}\] In the case of multiple linear regression, \(I(\hat{\boldsymbol{\beta}})^{-1}=\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\) (see (3.6)), so the presence of \(\mathbf{V}\) here reveals the heteroskedasticity of the model.
The interpretation of (4.10) and (4.11) gives some useful insights into which concepts affect the quality of the estimation:
 Bias. The estimates are asymptotically unbiased.
Variance. It depends on:
 Sample size \(n\). Hidden inside \(\mathbf{X}^T\mathbf{V}\mathbf{X}\). As \(n\) grows, the precision of the estimators increases.
 Weighted predictor sparsity \((\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}\). The more sparse the predictor is (small \((\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}\)), the more precise \(\hat{\boldsymbol{\beta}}\) is.
The precision of \(\hat{\boldsymbol{\beta}}\) is affected by the value of \(\boldsymbol{\beta}\), which is hidden inside
\(\mathbf{V}\). This contrasts sharply with the linear model, where the precision of the least squares estimator was not affected by the value of the unknown coefficients (see
(3.6)). The reason is partially due to the
heteroskedasticity of logistic regression, which implies a dependence of the variance of
\(Y\) in the logistic curve, hence in
\(\boldsymbol{\beta}\).
Similar to linear regression, the problem with
(4.10) and
(4.11) is that
\(\mathbf{V}\) is unknown in practice because it depends on
\(\boldsymbol{\beta}\). Plugging in the estimate
\(\hat{\boldsymbol{\beta}}\) for
\(\boldsymbol{\beta}\) in
\(\mathbf{V}\) results in
\(\hat{\mathbf{V}}\). Now we can use
\(\hat{\mathbf{V}}\) to get
\[\begin{align}\frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim \mathcal{N}(0,1),\quad\hat{\mathrm{SE}}(\hat\beta_j)^2=v_j\tag{4.12}\end{align}\]where \[v_j\text{ is the $j$th element of the diagonal of }(\mathbf{X}^T\hat{\mathbf{V}}\mathbf{X})^{-1}.\] The LHS of (4.12) is the Wald statistic for \(\beta_j\), \(j=0,\ldots,k\). These statistics are employed for building confidence intervals and hypothesis tests.
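The standard errors in (4.12) can be computed by hand. The following sketch, on simulated data (not the challenger dataset used later), builds \(\hat{\mathbf{V}}\) and checks that the diagonal of \((\mathbf{X}^T\hat{\mathbf{V}}\mathbf{X})^{-1}\) matches the standard errors reported by summary:

```r
# Sketch: Wald standard errors by hand, on simulated data
set.seed(1)
x <- rnorm(200)
y <- rbinom(n = 200, size = 1, prob = 1 / (1 + exp(-(0.5 - x))))
fit <- glm(y ~ x, family = "binomial")

X <- model.matrix(fit)               # design matrix
p <- fit$fitted.values               # hat p(x_i)
V <- diag(p * (1 - p))               # hat V, diagonal
seManual <- sqrt(diag(solve(t(X) %*% V %*% X)))

# Compare with the "Std. Error" column of summary(fit)
cbind(seManual, summary(fit)$coefficients[, "Std. Error"])
```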
Confidence intervals for the coefficients
Thanks to
(4.12), we can have the
\(100(1-\alpha)\%\) CI for the coefficient
\(\beta_j\),
\(j=0,\ldots,k\):
\[\begin{align}\left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\right)\tag{4.13}\end{align}\]where \(z_{\alpha/2}\) is the \(\alpha/2\)-upper quantile of the \(\mathcal{N}(0,1)\). In case we are interested in the CI for \(e^{\beta_j}\), we can simply take exponentials on the above CI. So the \(100(1-\alpha)\%\) CI for \(e^{\beta_j}\), \(j=0,\ldots,k\), is \[e^{\left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\right)}.\] Of course, this CI is not the same as \(\left(e^{\hat\beta_j}\pm e^{\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}}\right)\), which is not a valid CI for \(e^{\beta_j}\).
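As a sketch of (4.13) in action (on simulated data; the same recipe applies to any fitted glm), the CI can be assembled from the estimates and their standard errors, and it coincides with R's confint.default:

```r
# Sketch: the asymptotic CI (4.13) by hand, on simulated data
set.seed(1)
x <- rnorm(200)
y <- rbinom(n = 200, size = 1, prob = 1 / (1 + exp(-(0.5 - x))))
fit <- glm(y ~ x, family = "binomial")

betaHat <- coef(fit)
se <- summary(fit)$coefficients[, "Std. Error"]
za <- qnorm(0.975)  # z_{alpha/2} for alpha = 0.05
ci <- cbind(lwr = betaHat - za * se, upr = betaHat + za * se)
ci        # same as confint.default(fit)
exp(ci)   # CI for e^{beta_j}, by exponentiating the limits
```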
Let’s see how we can compute the CIs. We return to the challenger
 dataset; in case you do not have it loaded, you can download it here. We analyse the CIs for the coefficients of fail.field ~ temp
.
```r
# Fit model
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)

# Confidence intervals at 95%
confint(nasa)
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept)  1.3364047 17.7834329
## temp        -0.9237721 -0.1089953

# Confidence intervals at other levels
confint(nasa, level = 0.90)
## Waiting for profiling to be done...
##                    5 %       95 %
## (Intercept)  2.2070301 15.7488590
## temp        -0.8222858 -0.1513279

# Confidence intervals for the factors affecting the odds
exp(confint(nasa))
## Waiting for profiling to be done...
##                 2.5 %       97.5 %
## (Intercept) 3.8053375 5.287456e+07
## temp        0.3970186 8.967346e-01
```
In this example, the 95% confidence interval for \(\beta_0\) is \((1.3364, 17.7834)\) and for \(\beta_1\) it is \((-0.9238, -0.1090)\). For \(e^{\beta_0}\) and \(e^{\beta_1}\), the CIs are \((3.8053, 5.2874\times10^7)\) and \((0.3970, 0.8967)\), respectively. Therefore, we can say with 95% confidence that:
 When
temp
=0, the probability of fail.field
=1 is significantly larger than the probability of fail.field
=0 (using the CI for \(\beta_0\)). Indeed, fail.field
=1 is between \(3.8053\) and \(5.2874\times10^7\) times more likely than fail.field
=0 (using the CI for \(e^{\beta_0}\)). temp
has a significantly negative effect on the probability of fail.field
=1 (using the CI for \(\beta_1\)). Indeed, each unit increase in temp
produces a reduction of the odds of fail.field
by a factor between \(0.3970\) and \(0.8967\) (using the CI for \(e^{\beta_1}\)).
Compute and interpret the CIs for the exponentiated coefficients, at level \(\alpha=0.05\), for the following regressions (challenger
dataset):
fail.field ~ temp + pres.field
fail.nozzle ~ temp + pres.nozzle
fail.field ~ temp + pres.nozzle
fail.nozzle ~ temp + pres.field
The interpretation of the variables is given above Table
4.1.
Testing on the coefficients
The distributions in
(4.12) also allow us to conduct formal hypothesis tests on the coefficients
\(\beta_j\),
\(j=0,\ldots,k\). For example, the test for significance:
\[\begin{align*}H_0:\beta_j=0\end{align*}\]for
\(j=0,\ldots,k\). The test of
\(H_0:\beta_j=0\) with
\(1\leq j\leq k\) is especially interesting, since it allows us to answer whether
the variable \(X_j\) has a significant effect on \(\mathbb{P}[Y=1]\). The statistic used for testing for significance is the Wald statistic
\[\begin{align*}\frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)},\end{align*}\]which is asymptotically distributed as a \(\mathcal{N}(0,1)\) under the null hypothesis. \(H_0\) is tested against the bilateral alternative hypothesis \(H_1:\beta_j\neq 0\).
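The Wald statistic and its bilateral \(p\)-value can be reproduced by hand from the summary table (sketch on simulated data):

```r
# Sketch: Wald tests for H0: beta_j = 0 by hand, on simulated data
set.seed(1)
x <- rnorm(200)
y <- rbinom(n = 200, size = 1, prob = 1 / (1 + exp(-(0.5 - x))))
fit <- glm(y ~ x, family = "binomial")

est <- summary(fit)$coefficients[, "Estimate"]
se <- summary(fit)$coefficients[, "Std. Error"]
zObs <- (est - 0) / se          # Wald statistics
pVal <- 2 * pnorm(-abs(zObs))   # bilateral p-values from the N(0, 1)
cbind(zObs, pVal)               # matches "z value" and "Pr(>|z|)" in summary
```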
The tests for significance are built into the summary
 function. However, a note of caution is required when applying the rule of thumb:
Is the CI for \(\beta_j\) below (above) \(0\) at level \(\alpha\)?
 Yes \(\rightarrow\) reject \(H_0\) at level \(\alpha\).
 No \(\rightarrow\) the criterion is not conclusive.
The significances given in summary
 and the output of confint
 are slightly incoherent, and the previous rule of thumb does not apply. The reason is that MASS
’s confint
 uses a more sophisticated method (profile likelihood) to compute the CI of \(\beta_j\), and not the asymptotic distribution behind the Wald statistic and its standard error \(\hat{\mathrm{SE}}(\hat\beta_j)\).
By changing confint
 to R
’s default confint.default
, the results will be completely equivalent to the significances in summary
, and the rule of thumb remains valid. For the contents of this course we prefer confint.default
 due to its better interpretability.
To illustrate this we consider the regression of fail.field ~ temp + pres.field
:
```r
# Significances with asymptotic approximation for the standard errors
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",
             data = challenger)
summary(nasa2)
## 
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial", 
##     data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2109  -0.6081  -0.4292   0.3498   2.0913  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  6.642709   4.038547   1.645   0.1000  
## temp        -0.435032   0.197008  -2.208   0.0272 *
## pres.field   0.009376   0.008821   1.063   0.2878  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 19.078  on 20  degrees of freedom
## AIC: 25.078
## 
## Number of Fisher Scoring iterations: 5

# CIs with asymptotic approximation - coherent with summary
confint.default(nasa2, level = 0.90)
##                      5 %        95 %
## (Intercept)  0.000110501 13.28552771
## temp        -0.759081468 -0.11098301
## pres.field  -0.005132393  0.02388538
confint.default(nasa2, level = 0.99)
##                   0.5 %      99.5 %
## (Intercept) -3.75989977 17.04531697
## temp        -0.94249107  0.07242659
## pres.field  -0.01334432  0.03209731

# CIs with profile likelihood - incoherent with summary
confint(nasa2, level = 0.90) # intercept still significant
## Waiting for profiling to be done...
##                      5 %        95 %
## (Intercept)  0.945372123 14.93392497
## temp        -0.845250023 -0.16532086
## pres.field  -0.004184814  0.02602181
confint(nasa2, level = 0.99) # temp still significant
## Waiting for profiling to be done...
##                   0.5 %      99.5 %
## (Intercept) -1.86541750 21.49637422
## temp        -1.17556090 -0.04317904
## pres.field  -0.01164943  0.03836968
```
For the previous exercise, check the differences between using confint
 and confint.default
 for computing the CIs.
Prediction
Prediction in logistic regression focuses mainly on predicting the values of the logistic curve \[p(x_1,\ldots,x_k)=\mathbb{P}[Y=1\mid X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}\] by means of \[\hat p(x_1,\ldots,x_k)=\hat{\mathbb{P}}[Y=1\mid X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\hat\beta_0+\hat\beta_1x_1+\ldots+\hat\beta_kx_k)}}.\] From the perspective of the linear model, this is the same as predicting the conditional mean (not the conditional response) of the response, but this time this conditional mean is also a conditional probability. The prediction of the conditional response is not so interesting since it follows immediately from \(\hat p(x_1,\ldots,x_k)\): \[\hat{Y}(X_1=x_1,\ldots,X_k=x_k)=\left\{\begin{array}{ll}1,&\text{with probability }\hat p(x_1,\ldots,x_k),\\0,&\text{with probability }1-\hat p(x_1,\ldots,x_k).\end{array}\right.\] As a consequence, we can predict \(Y\) as \(1\) if \(\hat p(x_1,\ldots,x_k)>\frac{1}{2}\) and as \(0\) if \(\hat p(x_1,\ldots,x_k)<\frac{1}{2}\).
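The fitted curve \(\hat p\) is just the logistic transformation of the fitted linear predictor, as the following sketch on simulated data shows:

```r
# Sketch: hat p(x) from the estimated coefficients, on simulated data
set.seed(1)
x <- rnorm(200)
y <- rbinom(n = 200, size = 1, prob = 1 / (1 + exp(-(0.5 - x))))
fit <- glm(y ~ x, family = "binomial")

eta <- coef(fit)[1] + coef(fit)[2] * x   # estimated log-odds
pHat <- 1 / (1 + exp(-eta))              # estimated logistic curve
max(abs(pHat - predict(fit, type = "response")))  # essentially zero
```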
Let’s focus then on how to make predictions and compute CIs in practice with predict
. Similarly to the linear model, the objects required for predict
are: first, the output of glm
; second, a data.frame
containing the locations \(\mathbf{x}=(x_1,\ldots,x_k)\) where we want to predict \(p(x_1,\ldots,x_k)\). However, there are two differences with respect to the use of predict
for lm
:
 The argument
type
. type = "link"
 gives the predictions in the log-odds, this is, it returns \(\log\frac{\hat p(x_1,\ldots,x_k)}{1-\hat p(x_1,\ldots,x_k)}\). type = "response"
 gives the predictions in the probability space \([0,1]\), this is, it returns \(\hat p(x_1,\ldots,x_k)\).  There is no
interval
 argument when using predict
 for glm
. That means that there is no easy way of computing CIs for the prediction.
Since it is a bit cumbersome to compute the CIs by yourself, we can code the function predictCIsLogistic
 so that it computes them automatically for you, see below.
```r
# Data for which we want a prediction
# Important! You have to name the column with the predictor name!
newdata <- data.frame(temp = -0.6)

# Prediction of the conditional log-odds - the default
predict(nasa, newdata = newdata, type = "link")
##        1 
## 7.833731

# Prediction of the conditional probability
predict(nasa, newdata = newdata, type = "response")
##        1 
## 0.999604

# Function for computing the predictions and CIs for the conditional probability
predictCIsLogistic <- function(object, newdata, level = 0.95) {
  # Compute predictions in the log-odds
  pred <- predict(object = object, newdata = newdata, se.fit = TRUE)
  # CI in the log-odds
  za <- qnorm(p = (1 - level) / 2)
  lwr <- pred$fit + za * pred$se.fit
  upr <- pred$fit - za * pred$se.fit
  # Transform to probabilities
  fit <- 1 / (1 + exp(-pred$fit))
  lwr <- 1 / (1 + exp(-lwr))
  upr <- 1 / (1 + exp(-upr))
  # Return a matrix with column names "fit", "lwr" and "upr"
  result <- cbind(fit, lwr, upr)
  colnames(result) <- c("fit", "lwr", "upr")
  return(result)
}

# Simple call
predictCIsLogistic(nasa, newdata = newdata)
##        fit       lwr       upr
## 1 0.999604 0.4838505 0.9999999
# The CI is large because there is no data around temp = -0.6 and
# that makes the prediction more variable (and also because we only
# have 23 observations)
```
For the challenger
dataset, do the following:
 Regress
fail.nozzle
on temp
and pres.nozzle
.  Compute the predicted probability of
fail.nozzle=1
for temp
=15 and pres.nozzle
=200. What is the predicted probability for fail.nozzle=0
?  Compute the confidence interval for the two predicted probabilities at level 95%.
Finally, Figure 4.9 gives an interactive visualization of the CIs for the conditional probability in simple logistic regression. Their interpretation is very similar to the CIs for the conditional mean in the simple linear model, see Section 2.6 and Figure 2.23.
Deviance and model fit
The
deviance is a key concept in logistic regression. Intuitively, it measures the
deviance of the fitted logistic model with respect to a perfect model for \(\mathbb{P}[Y=1\mid X_1=x_1,\ldots,X_k=x_k]\). This perfect model, known as the
saturated model, denotes an abstract model that fits perfectly the sample, this is, the model such that
\[\hat{\mathbb{P}}[Y=1\mid X_1=X_{i1},\ldots,X_k=X_{ik}]=Y_i,\quad i=1,\ldots,n.\] This model assigns probability
\(0\) or
\(1\) to
\(Y\) depending on the actual value of
\(Y_i\). To clarify this concept, Figure
4.10 shows a saturated model and a fitted logistic regression.
More precisely, the deviance is defined as minus twice the difference of log-likelihoods between the fitted model and the saturated model: \[D=-2\log\text{lik}(\hat{\boldsymbol{\beta}})+2\log\text{lik}(\text{saturated model}).\] Since the likelihood of the saturated model is exactly one, the deviance is simply another expression of the likelihood: \[D=-2\log\text{lik}(\hat{\boldsymbol{\beta}}).\] As a consequence, the deviance is always greater than or equal to zero, being zero only if the fit is perfect.
A benchmark for evaluating the magnitude of the deviance is the null deviance, \[D_0=-2\log\text{lik}(\hat{\beta}_0),\] which is the deviance between the worst model, the one fitted without any predictor, and the perfect model: \[Y(X_1=x_1,\ldots,X_k=x_k)\sim \mathrm{Ber}(\mathrm{logistic}(\beta_0)).\] In this case, \(\hat\beta_0=\mathrm{logit}\left(\frac{m}{n}\right)=\log\frac{\frac{m}{n}}{1-\frac{m}{n}}\), where \(m\) is the number of \(1\)’s in \(Y_1,\ldots,Y_n\) (see Figure 4.10).
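Both quantities follow directly from the Bernoulli log-likelihood. A sketch on simulated data, checking against the deviances reported by glm:

```r
# Sketch: deviance and null deviance from the log-likelihood, simulated data
set.seed(1)
x <- rnorm(200)
y <- rbinom(n = 200, size = 1, prob = 1 / (1 + exp(-(0.5 - x))))
fit <- glm(y ~ x, family = "binomial")

p <- fit$fitted.values
D <- -2 * sum(y * log(p) + (1 - y) * log(1 - p))     # -2 * loglik(beta hat)
p0 <- mean(y)                                        # null model: hat p = m / n
D0 <- -2 * sum(y * log(p0) + (1 - y) * log(1 - p0))  # null deviance
c(D = D, D0 = D0)  # compare with fit$deviance and fit$null.deviance
```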
The null deviance serves for comparing how much the model has improved by adding the predictors
\(X_1,\ldots,X_k\). This can be done by means of the
\(R^2\) statistic, which is a
generalization of the determination coefficient in multiple linear regression:
\[\begin{align}R^2=1-\frac{D}{D_0}=1-\frac{\text{deviance(fitted logistic, saturated model)}}{\text{deviance(null model, saturated model)}}.\tag{4.14}\end{align}\]This global measure of fit is similar to, and indeed shares some important properties with, the determination coefficient in linear regression:
 It is a quantity between \(0\) and \(1\).
 If the fit is perfect, then \(D=0\) and \(R^2=1\). If the predictors do not add anything to the regression, then \(D=D_0\) and \(R^2=0\).
In logistic regression, \(R^2\) does not have the same interpretation as in linear regression:
 It is not the percentage of variance explained by the logistic model, but rather a ratio indicating how close the fit is to being perfect (\(R^2=1\)) or the worst (\(R^2=0\)).
 It is not related to any correlation coefficient.
The
\(R^2\) in
(4.14) is valid for the whole family of
generalized linear models, of which linear and logistic regression are particular cases. The connection between
(4.14) and the determination coefficient is given by the expressions of the deviance and the null deviance for the linear model:
\[D=\mathrm{SSE}\text{ (or $D=\mathrm{RSS}$) and }D_0=\mathrm{SST}.\]Let’s see how these concepts are given by the summary
function:
```r
# Summary of model
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summaryLog <- summary(nasa)
summaryLog # 'Residual deviance' is the deviance; 'Null deviance' is the null deviance
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5

# Null model - only intercept
null <- glm(fail.field ~ 1, family = "binomial", data = challenger)
summaryNull <- summary(null)
summaryNull
## 
## Call:
## glm(formula = fail.field ~ 1, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8519  -0.8519  -0.8519   1.5425   1.5425  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.8267     0.4532  -1.824   0.0681 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 28.267  on 22  degrees of freedom
## AIC: 30.267
## 
## Number of Fisher Scoring iterations: 4

# Computation of the R^2 with a function - useful for repetitive computations
r2Log <- function(model) {
  summaryLog <- summary(model)
  1 - summaryLog$deviance / summaryLog$null.deviance
}

# R^2
r2Log(nasa)
## [1] 0.280619
r2Log(null)
## [1] 2.220446e-16
```
Another way of evaluating the model fit is its predictive accuracy. The motivation is that most of the times we are interested simply in classifying, for an observation of the predictors, the value of \(Y\) as either \(0\) or \(1\), but not in predicting the value of \(p(x_1,\ldots,x_k)=\mathbb{P}[Y=1X_1=x_1,\ldots,X_k=x_k]\). The classification in prediction is simply done by the rule \[\hat{Y}=\left\{\begin{array}{ll}1,&\hat{p}(x_1,\ldots,x_k)>\frac{1}{2},\\0,&\hat{p}(x_1,\ldots,x_k)<\frac{1}{2}.\end{array}\right.\] The overall predictive accuracy can be summarized with the hit matrix
  \(\hat Y=0\)  \(\hat Y=1\) 
\(Y=0\)  Correct\(_{0}\)  Incorrect\(_{01}\) 
\(Y=1\)  Incorrect\(_{10}\)  Correct\(_{1}\) 
and with the hit ratio \(\frac{\text{Correct}_0+\text{Correct}_1}{n}\). The hit matrix is easily computed with the table
 function which, whenever called with two vectors, computes the cross-table between them.
```r
# Fitted probabilities for Y = 1
nasa$fitted.values
##          1          2          3          4          5          6 
## 0.42778935 0.23014393 0.26910358 0.32099837 0.37772880 0.15898364 
##          7          8          9         10         11         12 
## 0.12833090 0.23014393 0.85721594 0.60286639 0.23014393 0.04383877 
##         13         14         15         16         17         18 
## 0.37772880 0.93755439 0.37772880 0.08516844 0.23014393 0.02299887 
##         19         20         21         22         23 
## 0.07027765 0.03589053 0.08516844 0.07027765 0.82977495

# Classified Y's
yHat <- nasa$fitted.values > 0.5

# Hit matrix:
# - 16 correctly classified as 0
# - 4 correctly classified as 1
# - 3 incorrectly classified as 0
tab <- table(challenger$fail.field, yHat)
tab
##    yHat
##     FALSE TRUE
##   0    16    0
##   1     3    4

# Hit ratio (ratio of correct classification)
(16 + 4) / 23 # Manually
## [1] 0.8695652
sum(diag(tab)) / sum(tab) # Automatically
## [1] 0.8695652
```
It is important to recall that the hit matrix will always be biased towards unrealistically good classification rates if it is computed on the same sample used for fitting the logistic model. A familiar analogy is asking your mother (data) whether you (model) are a good-looking human being (good predictive accuracy) – the answer will be highly positively biased. To get a fair hit matrix, the right approach is to randomly split the sample in two: a training dataset, used for fitting the model, and a test dataset, used for evaluating its predictive accuracy.
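A minimal sketch of this training/test approach, on simulated data (in practice, replace the simulated data frame with your dataset):

```r
# Sketch: fair hit ratio via a training/test split, on simulated data
set.seed(1)
n <- 400
x <- rnorm(n)
y <- rbinom(n = n, size = 1, prob = 1 / (1 + exp(-(0.5 - x))))
data <- data.frame(x = x, y = y)

trainIndex <- sample(n, size = n / 2)  # random half for training
train <- data[trainIndex, ]
test <- data[-trainIndex, ]

fit <- glm(y ~ x, family = "binomial", data = train)
pHat <- predict(fit, newdata = test, type = "response")

hitTrain <- mean((fit$fitted.values > 0.5) == (train$y == 1))
hitTest <- mean((pHat > 0.5) == (test$y == 1))  # the fair estimate
c(train = hitTrain, test = hitTest)  # the test ratio is typically lower
```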
Model selection and multicollinearity
The same discussion we did in Section 3.7 is applicable to logistic regression with small changes:
 The deviance of the model (correspondingly, the likelihood and the \(R^2\)) always decreases (increase) with the inclusion of more predictors – no matter whether they are significant or not.
 The excess of predictors in the model is paid for with a larger variability in the estimation of the model, which results in less precise predictions.
 Multicollinearity may hide significant variables, change their signs and result in an increase of the variability of the estimation, at the expense of little improvement in fit.
The use of information criteria, stepwise
and vif
 allow us to fight these issues back efficiently. Let’s review them quickly from the perspective of logistic regression.
First, remember that the BIC/AIC information criteria are based on a
balance between the model fitness, given by the likelihood, and its complexity. In the logistic regression, the BIC is
\[\begin{align*}\text{BIC}(\text{model}) &= -2\log \text{lik}(\hat{\boldsymbol{\beta}}) + (k + 1)\times\log n\\&=D+(k+1)\times \log n,\end{align*}\]where \(\text{lik}(\hat{\boldsymbol{\beta}})\) is the likelihood of the model. The AIC replaces \(\log n\) by \(2\), hence penalizing model complexity less. The BIC and AIC can be computed in R
through the functions BIC
and AIC
, and we can check manually that they match their definitions.
```r
# Models
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",
             data = challenger)

# nasa
summary1 <- summary(nasa)
summary1
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5
BIC(nasa)
## [1] 26.60584
summary1$deviance + 2 * log(dim(challenger)[1])
## [1] 26.60584
AIC(nasa)
## [1] 24.33485
summary1$deviance + 2 * 2
## [1] 24.33485

# nasa2
summary2 <- summary(nasa2)
summary2
## 
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial", 
##     data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2109  -0.6081  -0.4292   0.3498   2.0913  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  6.642709   4.038547   1.645   0.1000  
## temp        -0.435032   0.197008  -2.208   0.0272 *
## pres.field   0.009376   0.008821   1.063   0.2878  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 19.078  on 20  degrees of freedom
## AIC: 25.078
## 
## Number of Fisher Scoring iterations: 5
BIC(nasa2)
## [1] 28.48469
summary2$deviance + 3 * log(dim(challenger)[1])
## [1] 28.48469
AIC(nasa2)
## [1] 25.07821
summary2$deviance + 3 * 2
## [1] 25.07821
```
Second, stepwise
 works analogously to the linear regression situation. Here is an illustration for a binary variable that measures whether a Boston suburb (Boston
 dataset) is wealthy or not. The binary variable is medv > 25
: it is TRUE
 (1
) for suburbs with a median house value larger than 25000$ and FALSE
 (0
) otherwise. The cutoff 25000$ corresponds to the 25% richest suburbs.
```r
# Boston dataset
data(Boston)

# Model whether a suburb has a median house value larger than 25000$
mod <- glm(I(medv > 25) ~ ., data = Boston, family = "binomial")
summary(mod)
## 
## Call:
## glm(formula = I(medv > 25) ~ ., family = "binomial", data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3498  -0.2806  -0.0932   0.0006   3.3781  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.312511   4.876070   1.090 0.275930    
## crim        -0.011101   0.045322  -0.245 0.806503    
## zn           0.010917   0.010834   1.008 0.313626    
## indus       -0.110452   0.058740  -1.880 0.060060 .  
## chas         0.966337   0.808960   1.195 0.232266    
## nox         -6.844521   4.483514  -1.527 0.126861    
## rm           1.886872   0.452692   4.168 3.07e-05 ***
## age         -0.003491   0.011133  -0.314 0.753853    
## dis         -0.589016   0.164013  -3.591 0.000329 ***
## rad          0.318042   0.082623   3.849 0.000118 ***
## tax         -0.010826   0.004036  -2.682 0.007314 ** 
## ptratio     -0.353017   0.122259  -2.887 0.003884 ** 
## black        0.002264   0.003826   0.592 0.554105    
## lstat       -0.367355   0.073020  -5.031 4.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 209.11  on 492  degrees of freedom
## AIC: 237.11
## 
## Number of Fisher Scoring iterations: 7
r2Log(mod)
## [1] 0.628923

# With BIC - ends up with only the significant variables and a similar R^2
modBIC <- stepwise(mod, trace = 0)
## 
## Direction:  backward/forward
## Criterion:  BIC
summary(modBIC)
## 
## Call:
## glm(formula = I(medv > 25) ~ indus + rm + dis + rad + tax + ptratio + 
##     lstat, family = "binomial", data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3077  -0.2970  -0.0947   0.0005   3.2552  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.556433   3.948818   0.394 0.693469    
## indus       -0.143236   0.054771  -2.615 0.008918 ** 
## rm           1.950496   0.441794   4.415 1.01e-05 ***
## dis         -0.426830   0.111572  -3.826 0.000130 ***
## rad          0.301060   0.076542   3.933 8.38e-05 ***
## tax         -0.010240   0.003631  -2.820 0.004800 ** 
## ptratio     -0.404964   0.112086  -3.613 0.000303 ***
## lstat       -0.384823   0.069121  -5.567 2.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 215.03  on 498  degrees of freedom
## AIC: 231.03
## 
## Number of Fisher Scoring iterations: 7
r2Log(modBIC)
## [1] 0.6184273
```
Finally, multicollinearity can also be present in logistic regression. Despite the logistic curve being nonlinear, the predictors are combined linearly in (4.4). Due to this, if two or more predictors are highly correlated with each other, the fit of the model will be compromised, since the individual linear effect of each predictor is hard to disentangle from the rest of the correlated predictors.
In addition to inspecting the correlation matrix and looking for high correlations, a powerful tool to detect multicollinearity is the generalized Variance Inflation Factor (gVIF) of each coefficient \(\hat\beta_j\). This is a measure of how much the variability in the estimation of \(\beta_j\) has increased due to the correlation of \(X_j\) with the rest of the predictors. The next rule of thumb gives direct insight into which predictors are multicollinear:
 gVIF close to 1: absence of multicollinearity.
 gVIF larger than 5 or 10: problematic amount of multicollinearity. Advised to remove the predictor with largest gVIF.
Here is an example illustrating the use of the gVIF, through vif
, in practice. It also shows how a simple inspection of the correlation matrix is not enough for detecting collinearity in tricky situations.
```r
# Create predictors with multicollinearity: x4 depends on the rest
set.seed(45678)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
x3 <- 0.5 * x2 + rnorm(100)
x4 <- x1 + x2 + rnorm(100, sd = 0.25)

# Response
z <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4
y <- rbinom(n = 100, size = 1, prob = 1 / (1 + exp(-z)))
data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y)

# Correlations - none seems suspicious
cor(data)
##            x1          x2         x3         x4           y
## x1  1.0000000  0.38254782  0.2142011  0.5261464 -0.20198825
## x2  0.3825478  1.00000000  0.5167341  0.5673174 -0.07456324
## x3  0.2142011  0.51673408  1.0000000  0.2500123 -0.49853746
## x4  0.5261464  0.56731738  0.2500123  1.0000000 -0.11188657
## y  -0.2019882 -0.07456324 -0.4985375 -0.1118866  1.00000000

# Generalized variance inflation factors - abnormal: largest for x4, we remove it
modMultiCo <- glm(y ~ x1 + x2 + x3 + x4, family = "binomial")
vif(modMultiCo)
##       x1       x2       x3       x4 
## 27.84756 36.66514  4.94499 36.78817

# Without x4
modClean <- glm(y ~ x1 + x2 + x3, family = "binomial")

# Comparison
summary(modMultiCo)
## 
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4743  -0.3796   0.1129   0.4052   2.3887  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.2527     0.4008   3.125  0.00178 ** 
## x1            3.4269     1.8225   1.880  0.06007 .  
## x2            6.9627     2.1937   3.174  0.00150 ** 
## x3           -4.3688     0.9312  -4.691 2.71e-06 ***
## x4           -5.0047     1.9440  -2.574  0.01004 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.81  on 99  degrees of freedom
## Residual deviance:  59.76  on 95  degrees of freedom
## AIC: 69.76
## 
## Number of Fisher Scoring iterations: 7
summary(modClean)
## 
## Call:
## glm(formula = y ~ x1 + x2 + x3, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0952  -0.4144   0.1839   0.4762   2.5736  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.9237     0.3221   2.868 0.004133 ** 
## x1           -1.2803     0.4235  -3.023 0.002502 ** 
## x2            1.7946     0.5290   3.392 0.000693 ***
## x3           -3.4838     0.7491  -4.651 3.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.813  on 99  degrees of freedom
## Residual deviance:  68.028  on 96  degrees of freedom
## AIC: 76.028
## 
## Number of Fisher Scoring iterations: 6
r2Log(modMultiCo)
## [1] 0.5500437
r2Log(modClean)
## [1] 0.4877884

# Generalized variance inflation factors - normal
vif(modClean)
##       x1       x2       x3 
## 1.674300 2.724351 3.743940
```
For the Boston
dataset, do the following:
 Compute the hit matrix and hit ratio for the regression
I(medv > 25) ~ .
(hint: do table(medv > 25, …)
).  Fit
I(medv > 25) ~ .
but now using only the first 300 observations of Boston
, the training dataset (hint: use subset
).  For the previous model, predict the probability of the responses and classify them into
0
or 1
in the last 206 observations, the testing dataset (hint: use predict
on that subset).  Compute the hit matrix and hit ratio for the new predictions. Check that the hit ratio is smaller than the one in the first point. The hit ratio on the testing dataset, and not the first hit rate, is an estimator of how well the model is going to classify future observations.
Principal component analysis
Principal Component Analysis (PCA) is a powerful multivariate technique designed to summarize the most important features and relations of \(k\) numerical random variables \(X_1,\ldots,X_k\). PCA does dimension reduction of the original dataset by computing a new set of variables, the principal components \(\text{PC}_1,\ldots \text{PC}_k\), which explain the same information as \(X_1,\ldots,X_k\) but in an ordered way: \(\text{PC}_1\) explains the most of the information and \(\text{PC}_k\) the least.
There is no response \(Y\) or particular variable in PCA that deserves a particular attention – all variables are treated equally.
Examples and applications
Case study: Employment in European countries in the late 70s
The purpose of this case study, motivated by Hand et al. (1994) and Bartholomew et al. (2008), is to reveal the structure of the job market and economy in different developed countries. The final aim is to have a meaningful and rigorous plot that is able to show the most important features of the countries in a concise form.
The dataset eurojob
(download) contains the data employed in this case study. It contains the percentage of workforce employed in 1979 in 9 industries for 26 European countries. The industries measured are:
 Agriculture (Agr)
 Mining (Min)
 Manufacturing (Man)
 Power supply industries (Pow)
 Construction (Con)
 Service industries (Ser)
 Finance (Fin)
 Social and personal services (Soc)
 Transport and communications (Tra)
If the dataset is imported into R
and the case names are set as Country
(important in order to have only numerical variables), then the data should look like this:
Table 5.1: The eurojob dataset.

Country  Agr  Min  Man  Pow  Con  Ser  Fin  Soc  Tra
Belgium  3.3  0.9  27.6  0.9  8.2  19.1  6.2  26.6  7.2
Denmark  9.2  0.1  21.8  0.6  8.3  14.6  6.5  32.2  7.1 
France  10.8  0.8  27.5  0.9  8.9  16.8  6.0  22.6  5.7 
WGerm  6.7  1.3  35.8  0.9  7.3  14.4  5.0  22.3  6.1 
Ireland  23.2  1.0  20.7  1.3  7.5  16.8  2.8  20.8  6.1 
Italy  15.9  0.6  27.6  0.5  10.0  18.1  1.6  20.1  5.7 
Luxem  7.7  3.1  30.8  0.8  9.2  18.5  4.6  19.2  6.2 
Nether  6.3  0.1  22.5  1.0  9.9  18.0  6.8  28.5  6.8 
UK  2.7  1.4  30.2  1.4  6.9  16.9  5.7  28.3  6.4 
Austria  12.7  1.1  30.2  1.4  9.0  16.8  4.9  16.8  7.0 
Finland  13.0  0.4  25.9  1.3  7.4  14.7  5.5  24.3  7.6 
Greece  41.4  0.6  17.6  0.6  8.1  11.5  2.4  11.0  6.7 
Norway  9.0  0.5  22.4  0.8  8.6  16.9  4.7  27.6  9.4 
Portugal  27.8  0.3  24.5  0.6  8.4  13.3  2.7  16.7  5.7 
Spain  22.9  0.8  28.5  0.7  11.5  9.7  8.5  11.8  5.5 
Sweden  6.1  0.4  25.9  0.8  7.2  14.4  6.0  32.4  6.8 
Switz  7.7  0.2  37.8  0.8  9.5  17.5  5.3  15.4  5.7 
Turkey  66.8  0.7  7.9  0.1  2.8  5.2  1.1  11.9  3.2 
Bulgaria  23.6  1.9  32.3  0.6  7.9  8.0  0.7  18.2  6.7 
Czech  16.5  2.9  35.5  1.2  8.7  9.2  0.9  17.9  7.0 
EGerm  4.2  2.9  41.2  1.3  7.6  11.2  1.2  22.1  8.4 
Hungary  21.7  3.1  29.6  1.9  8.2  9.4  0.9  17.2  8.0 
Poland  31.1  2.5  25.7  0.9  8.4  7.5  0.9  16.1  6.9 
Romania  34.7  2.1  30.1  0.6  8.7  5.9  1.3  11.7  5.0 
USSR  23.7  1.4  25.8  0.6  9.2  6.1  0.5  23.6  9.3 
Yugoslavia  48.7  1.5  16.8  1.1  4.9  6.4  11.3  5.3  4.0 
So far, we know how to compute summaries for each variable, and how to quantify and visualize the relations between variables with the correlation matrix and the scatterplot matrix. But even for a moderate number of variables like this one, the outputs of these tools are hard to process.
# Summary of the data - marginal
summary(eurojob)
##       Agr             Min              Man             Pow
##  Min.   : 2.70   Min.   :0.100   Min.   : 7.90   Min.   :0.1000
##  1st Qu.: 7.70   1st Qu.:0.525   1st Qu.:23.00   1st Qu.:0.6000
##  Median :14.45   Median :0.950   Median :27.55   Median :0.8500
##  Mean   :19.13   Mean   :1.254   Mean   :27.01   Mean   :0.9077
##  3rd Qu.:23.68   3rd Qu.:1.800   3rd Qu.:30.20   3rd Qu.:1.1750
##  Max.   :66.80   Max.   :3.100   Max.   :41.20   Max.   :1.9000
##       Con              Ser             Fin              Soc
##  Min.   : 2.800   Min.   : 5.20   Min.   : 0.500   Min.   : 5.30
##  1st Qu.: 7.525   1st Qu.: 9.25   1st Qu.: 1.225   1st Qu.:16.25
##  Median : 8.350   Median :14.40   Median : 4.650   Median :19.65
##  Mean   : 8.165   Mean   :12.96   Mean   : 4.000   Mean   :20.02
##  3rd Qu.: 8.975   3rd Qu.:16.88   3rd Qu.: 5.925   3rd Qu.:24.12
##  Max.   :11.500   Max.   :19.10   Max.   :11.300   Max.   :32.40
##       Tra
##  Min.   :3.200
##  1st Qu.:5.700
##  Median :6.700
##  Mean   :6.546
##  3rd Qu.:7.075
##  Max.   :9.400

# Correlation matrix
cor(eurojob)
##            Agr        Min       Man        Pow        Con       Ser
## Agr 1.00000000 0.03579884 0.6710976 0.40005113 0.53832522 0.7369805
## Min 0.03579884 1.00000000 0.4451960 0.40545524 0.02559781 0.3965646
## Man 0.67109759 0.44519601 1.0000000 0.38534593 0.49447949 0.2038263
## Pow 0.40005113 0.40545524 0.3853459 1.00000000 0.05988883 0.2019066
## Con 0.53832522 0.02559781 0.4944795 0.05988883 1.00000000 0.3560216
## Ser 0.73698054 0.39656456 0.2038263 0.20190661 0.35602160 1.0000000
## Fin 0.21983645 0.44268311 0.1558288 0.10986158 0.01628255 0.3655553
## Soc 0.74679001 0.28101212 0.1541714 0.13241132 0.15824309 0.5721728
## Tra 0.56492047 0.15662892 0.3506925 0.37523116 0.38766214 0.1875543
##            Fin       Soc       Tra
## Agr 0.21983645 0.7467900 0.5649205
## Min 0.44268311 0.2810121 0.1566289
## Man 0.15582884 0.1541714 0.3506925
## Pow 0.10986158 0.1324113 0.3752312
## Con 0.01628255 0.1582431 0.3876621
## Ser 0.36555529 0.5721728 0.1875543
## Fin 1.00000000 0.1076403 0.2459257
## Soc 0.10764028 1.0000000 0.5678669
## Tra 0.24592567 0.5678669 1.0000000

# Scatterplot matrix
scatterplotMatrix(eurojob, reg.line = lm, smooth = FALSE, spread = FALSE,
                  span = 0.5, ellipse = FALSE, levels = c(.5, .9), id.n = 0,
                  diagonal = 'histogram')
We definitely need a way of visualizing and quantifying the relations between variables for a moderate to large amount of variables. PCA will be a handy way. In a nutshell, what PCA does is:
 Takes the data for the variables \(X_1,\ldots,X_k\).
 Using this data, looks for new variables \(\text{PC}_1,\ldots \text{PC}_k\) such that:
 \(\text{PC}_j\) is a linear combination of \(X_1,\ldots,X_k\), \(1\leq j\leq k\). This is, \(\text{PC}_j=a_{1j}X_1+a_{2j}X_2+\ldots+a_{kj}X_k\).
 \(\text{PC}_1,\ldots,\text{PC}_k\) are sorted decreasingly in terms of variance. Hence \(\text{PC}_j\) has more variance than \(\text{PC}_{j+1}\), \(1\leq j\leq k-1\).
 \(\text{PC}_{j_1}\) and \(\text{PC}_{j_2}\) are uncorrelated, for \(j_1\neq j_2\).
 \(\text{PC}_1,\ldots \text{PC}_k\) have the same information, measured in terms of total variance, as \(X_1,\ldots,X_k\).
 Produces three key objects:
 Variances of the PCs. They are sorted decreasingly and give an idea of which PCs contain most of the information of the data (the ones with more variance).
 Weights of the variables in the PCs. They give the interpretation of the PCs in terms of the original variables, as they are the coefficients of the linear combination. The weights of the variables \(X_1,\ldots,X_k\) on \(\text{PC}_j\), \(a_{1j},\ldots,a_{kj}\), are normalized: \(a_{1j}^2+\ldots+a_{kj}^2=1\), \(j=1,\ldots,k\). In R, they are called loadings.
 Scores of the data in the PCs: this is the data with the variables \(\text{PC}_1,\ldots,\text{PC}_k\) instead of \(X_1,\ldots,X_k\). The scores are uncorrelated. They are useful for knowing which PCs have more effect on a certain observation.
Hence, PCA rearranges our variables in an information-equivalent, but more convenient, layout where the variables are sorted according to the amount of information they are able to explain. From this position, the next step is clear: keep only a limited number of PCs such that they explain most of the information (e.g., 70% of the total variance) and do dimension reduction. The effectiveness of PCA in practice depends on the structure present in the dataset. For example, with highly dependent data, just two PCs could explain more than 90% of the variability of a dataset with tens of variables.
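The properties above are easy to verify numerically. A minimal sketch on simulated data (the variable names and sample size are arbitrary): with cor = TRUE, the variances of the PCs are decreasing and add up to the number of standardized variables.

```r
# Check two PCA properties on simulated data: the variances of the PCs
# are sorted decreasingly, and their sum equals the total variance of
# the (standardized) original variables
set.seed(12345)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
df$x4 <- df$x1 + 0.5 * df$x2 + rnorm(200)

pcaSim <- princomp(df, cor = TRUE)
vars <- pcaSim$sdev^2
all(diff(vars) <= 0)  # TRUE: decreasing variances
sum(vars)             # 4: same total variance as the 4 standardized variables
```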
Let’s see how to compute a full PCA in R
.
# The main function - use cor = TRUE to avoid scale distortions
pca <- princomp(eurojob, cor = TRUE)

# What is inside?
str(pca)
## List of 7
##  $ sdev    : Named num [1:9] 1.867 1.46 1.048 0.997 0.737 ...
##   ..- attr(*, "names")= chr [1:9] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ loadings: loadings [1:9, 1:9] 0.52379 0.00132 0.3475 0.25572 0.32518 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:9] "Agr" "Min" "Man" "Pow" ...
##   .. ..$ : chr [1:9] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ center  : Named num [1:9] 19.131 1.254 27.008 0.908 8.165 ...
##   ..- attr(*, "names")= chr [1:9] "Agr" "Min" "Man" "Pow" ...
##  $ scale   : Named num [1:9] 15.245 0.951 6.872 0.369 1.614 ...
##   ..- attr(*, "names")= chr [1:9] "Agr" "Min" "Man" "Pow" ...
##  $ n.obs   : int 26
##  $ scores  : num [1:26, 1:9] 1.71 0.953 0.755 0.853 0.104 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:26] "Belgium" "Denmark" "France" "WGerm" ...
##   .. ..$ : chr [1:9] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ call    : language princomp(x = eurojob, cor = TRUE)
##  - attr(*, "class")= chr "princomp"

# The standard deviation of each PC
pca$sdev
##      Comp.1      Comp.2      Comp.3      Comp.4      Comp.5      Comp.6
## 1.867391569 1.459511268 1.048311791 0.997237674 0.737033056 0.619215363
##      Comp.7      Comp.8      Comp.9
## 0.475135828 0.369851221 0.006754636

# Weights: the expression of the original variables in the PCs
# E.g. Agr = 0.524 * PC1 + 0.213 * PC5 - 0.152 * PC6 + 0.806 * PC9
# And also: PC1 = 0.524 * Agr + 0.347 * Man + 0.256 * Pow + 0.325 * Con + ...
# (Because the matrix is orthogonal, so the transpose is the inverse)
pca$loadings
##
## Loadings:
##     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## Agr 0.524 0.213 0.153 0.806
## Min 0.618 0.201 0.164 0.101 0.726
## Man 0.347 0.355 0.150 0.346 0.385 0.288 0.479 0.126 0.366
## Pow 0.256 0.261 0.561 0.393 0.295 0.357 0.256 0.341
## Con 0.325 0.153 0.668 0.472 0.130 0.221 0.356
## Ser 0.379 0.350 0.115 0.284 0.615 0.229 0.388 0.238
## Fin 0.454 0.587 0.280 0.526 0.187 0.174 0.145
## Soc 0.387 0.222 0.312 0.412 0.220 0.263 0.191 0.506 0.351
## Tra 0.367 0.203 0.375 0.314 0.513 0.124 0.545
##
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.111  0.111  0.111  0.111  0.111  0.111  0.111  0.111
## Cumulative Var  0.111  0.222  0.333  0.444  0.556  0.667  0.778  0.889
##                Comp.9
## SS loadings     1.000
## Proportion Var  0.111
## Cumulative Var  1.000

# Scores of the data on the PCs: how the data is re-expressed in the PCs
head(pca$scores, 10)
##            Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
## Belgium 1.7104977 1.22179120 0.11476476 0.33949201 0.32453569
## Denmark 0.9529022 2.12778495 0.95072216 0.59394893 0.10266111
## France  0.7546295 1.12120754 0.49795370 0.50032910 0.29971876
## WGerm   0.8525525 0.01137659 0.57952679 0.11046984 1.16522683
## Ireland 0.1035018 0.41398717 0.38404787 0.92666396 0.01522133
## Italy   0.3754065 0.76954739 1.06059786 1.47723127 0.64518265
## Luxem   1.0594424 0.75582714 0.65147987 0.83515611 0.86593673
## Nether  1.6882170 2.00484484 0.06374194 0.02351427 0.63517966
## UK      1.6304491 0.37312967 1.14090318 1.26687863 0.81292541
## Austria 1.1764484 0.14310057 1.04336386 0.15774745 0.52098078
##             Comp.6     Comp.7    Comp.8       Comp.9
## Belgium 0.04725409 0.34008766 0.4030352 0.0010904043
## Denmark 0.82730228 0.30292281 0.3518357 0.0156187715
## France  0.11580705 0.18547802 0.2661924 0.0005074307
## WGerm   0.61809939 0.44455923 0.1944841 0.0065393717
## Ireland 1.42419990 0.03704285 0.3340389 0.0108793301
## Italy   1.00210439 0.14178212 0.1302796 0.0056017552
## Luxem   0.21879618 1.69417817 0.5473283 0.0034530991
## Nether  0.21197502 0.30339781 0.5906297 0.0109314745
## UK      0.03605094 0.04128463 0.3485948 0.0054775709
## Austria 0.80190706 0.41503736 0.2150993 0.0028164222

# Scatterplot matrix of the scores - they are uncorrelated!
scatterplotMatrix(pca$scores, reg.line = lm, smooth = FALSE, spread = FALSE,
                  span = 0.5, ellipse = FALSE, levels = c(.5, .9), id.n = 0,
                  diagonal = 'histogram')
# Means of the variables - before PCA the variables are centered
pca$center
##        Agr        Min        Man        Pow        Con        Ser
## 19.1307692  1.2538462 27.0076923  0.9076923  8.1653846 12.9576923
##        Fin        Soc        Tra
##  4.0000000 20.0230769  6.5461538

# Rescaling done to each variable:
# - if cor = FALSE (default), a vector of ones
# - if cor = TRUE, a vector with the standard deviations of the variables
pca$scale
##        Agr        Min        Man        Pow        Con        Ser
## 15.2446654  0.9512060  6.8716767  0.3689101  1.6136300  4.4864045
##        Fin        Soc        Tra
##  2.7520622  6.6969171  1.3644471

# Summary of the importance of components - the third row is key
summary(pca)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     1.8673916 1.4595113 1.0483118 0.9972377 0.73703306
## Proportion of Variance 0.3874613 0.2366859 0.1221064 0.1104981 0.06035753
## Cumulative Proportion  0.3874613 0.6241472 0.7462536 0.8567517 0.91710919
##                            Comp.6     Comp.7     Comp.8       Comp.9
## Standard deviation     0.61921536 0.47513583 0.36985122 6.754636e-03
## Proportion of Variance 0.04260307 0.02508378 0.01519888 5.069456e-06
## Cumulative Proportion  0.95971227 0.98479605 0.99999493 1.000000e+00

# Scree plot - the variance of each component
plot(pca)
# With connected lines - useful for looking for the "elbow"
plot(pca, type = "l")
# PC1 and PC2
pca$loadings[, 1:2]
##          Comp.1     Comp.2
## Agr 0.523790989 0.05359389
## Min 0.001323458 0.61780714
## Man 0.347495131 0.35505360
## Pow 0.255716182 0.26109606
## Con 0.325179319 0.05128845
## Ser 0.378919663 0.35017206
## Fin 0.074373583 0.45369785
## Soc 0.387408806 0.22152120
## Tra 0.366822713 0.20259185
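Internally, the scores are nothing but the centered (and, with cor = TRUE, scaled) data expressed in the basis given by the loadings. A self-contained check on simulated data (eurojob would work analogously):

```r
# Recompute the scores by hand: center and scale the data with the same
# quantities princomp used, then project onto the loadings
set.seed(42)
X <- matrix(rnorm(50 * 3), nrow = 50, ncol = 3)
pcaX <- princomp(X, cor = TRUE)
byHand <- scale(X, center = pcaX$center, scale = pcaX$scale) %*%
  unclass(pcaX$loadings)
max(abs(byHand - pcaX$scores))  # essentially zero
```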
PCA produces uncorrelated variables from the original set \(X_1,\ldots,X_k\). This implies that:
 The PCs are uncorrelated, but not independent (uncorrelated does not imply independent).
 A variable of \(X_1,\ldots,X_k\) that is uncorrelated with the rest will get a PC associated only to it. In the extreme case where all of \(X_1,\ldots,X_k\) are uncorrelated, they coincide with the PCs (up to sign flips).
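The second point can be illustrated numerically. A sketch with three independent variables of different variances (PCA on the covariance matrix, so that the variances of the PCs are well separated; the names are arbitrary):

```r
# With independent variables, each PC essentially picks a single original
# variable (up to a sign flip), ordered by decreasing variance
set.seed(1)
indep <- data.frame(a = rnorm(1000, sd = 3), b = rnorm(1000, sd = 2),
                    c = rnorm(1000, sd = 1))
pcaIndep <- princomp(indep)  # cor = FALSE: covariance matrix
round(unclass(pcaIndep$loadings), 2)  # close to a (signed) identity matrix
```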
Based on the weights of the variables on the PCs, we can extract the following interpretation:
 PC1 is roughly a linear combination of Agr, with negative weight, and (Man, Pow, Con, Ser, Soc, Tra), with positive weights. So it can be interpreted as an indicator of the kind of economy of the country: agricultural (negative values) or industrial (positive values).
 PC2 has negative weights on (Min, Man, Pow, Tra) and positive weights on (Ser, Fin, Soc). It can be interpreted as the contrast between relatively large and relatively small service sectors, so it tends to be negative in communist countries and positive in capitalist countries.
The interpretation of the PCs involves inspecting the weights and interpreting the resulting linear combination of the original variables, which might separate two clear characteristics of the data.
To conclude, let’s see how we can represent our original data into a plot called biplot that summarizes all the analysis for two PCs.
# Biplot: plot together the scores for PC1 and PC2 and the
# variables expressed in terms of PC1 and PC2
biplot(pca)
Case studies: Analysis of USArrests
, USJudgeRatings
and La Liga 2015/2016 metrics
The selection of the number of PCs and their interpretation through the weights and biplots are key aspects of a successful application of PCA. In this section we will see examples of both points through the datasets USArrests
, USJudgeRatings
and La Liga 2015/2016 (download).
The selection of the number of components \(l\), \(1\leq l\leq k\), is a tricky problem without a unique and wellestablished criterion for what is the best number of components. The reason is that selecting the number of PCs is a tradeoff between the variance of the original data that we want to explain and the price we want to pay in terms of a more complex dataset. Obviously, except for particular cases, none of the extreme situations \(l=1\) (potential low explained variance) or \(l=k\) (same number of PCs as the original data – no dimension reduction) is desirable.
There are several heuristic rules in order to determine the number of components:
 Select \(l\) up to a threshold of the percentage of variance explained, such as \(70\%\) or \(80\%\). We do so by looking at the third row of the summary(...) of a PCA.
 Plot the variances of the PCs and look for an “elbow” in the graph whose location gives \(l\). Ideally, this elbow appears at the PC for which the following variances are almost equal and notably smaller than the first ones. Use plot(..., type = "l") for creating the plot.
 Select \(l\) based on a threshold for the individual variance of each component. For example, select only the PCs with variance larger than the mean of the variances of all the PCs. If we are working with standardized variables (cor = TRUE), this amounts to taking the PCs with standard deviation larger than one. We do so by looking at the first row of the summary(...) of a PCA.
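The first and third heuristics are easy to automate from the PCA output. A sketch using the USArrests dataset analyzed next (the \(70\%\) threshold is one common choice):

```r
# Automate heuristics 1 and 3 for a princomp fit
pcaArr <- princomp(USArrests, cor = TRUE)
propVar <- pcaArr$sdev^2 / sum(pcaArr$sdev^2)
cumVar <- cumsum(propVar)

# Heuristic 1: smallest l explaining at least 70% of the variance
lVar <- unname(which(cumVar >= 0.70)[1])
# Heuristic 3: PCs with variance above the average of all the variances
# (above one for standardized variables)
lKaiser <- sum(pcaArr$sdev^2 > mean(pcaArr$sdev^2))
c(lVar, lKaiser)  # 2 and 1
```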
In addition to these three heuristics, in practice we might apply a justified bias towards:
 \(l=1,2\), since these are the ones that allow for a simple graphical representation of the data. Even if the variability explained by the \(l\) PCs is low (lower than \(50\%\)), these graphical representations are usually insightful. \(l=3\) is a second option, although its graphical representation is more cumbersome (see the end of this section).
 Values of \(l\) that yield interpretable PCs. Interpreting PCs is not as straightforward as interpreting the original variables, and it becomes harder the larger the index of the PC is, since the PC explains less information of the data.
Let’s see these heuristics in practice with the USArrests
dataset (arrest statistics and population of US states).
# Load data
data(USArrests)

# Snapshot of the data
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

# PCA
pcaUSArrests <- princomp(USArrests, cor = TRUE)
summary(pcaUSArrests)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4
## Standard deviation     1.5748783 0.9948694 0.5971291 0.41644938
## Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
## Cumulative Proportion  0.6200604 0.8675017 0.9566425 1.00000000

# Plot of variances (screeplot)
plot(pcaUSArrests, type = "l")
The selections of \(l\) for this PCA, based on the previous heuristics, are:
 \(l=2\), since it explains \(86\%\) of the variance, whereas \(l=1\) only explains \(62\%\).
 \(l=2\), since from \(l=2\) onward the variances are very similar.
 \(l=1\), since the \(\text{PC}_2\) has standard deviation smaller than \(1\) (limit case).
 \(l=2\) is fine, it can be easily represented graphically.
 \(l=2\) is fine, both components are interpretable, as we will see later.
Therefore, we can conclude that \(l=2\) PCs is a good compromise for representing the USArrests
dataset.
Let’s see what happens for the USJudgeRatings
dataset (lawyers’ ratings of US Superior Court judges).
# Load data
data(USJudgeRatings)

# Snapshot of the data
head(USJudgeRatings)
##                CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
## AARONSON,L.H.   5.7  7.9  7.7  7.3  7.1  7.4  7.1  7.1  7.1  7.0  8.3  7.8
## ALEXANDER,J.M.  6.8  8.9  8.8  8.5  7.8  8.1  8.0  8.0  7.8  7.9  8.5  8.7
## ARMENTANO,A.J.  7.2  8.1  7.8  7.8  7.5  7.6  7.5  7.5  7.3  7.4  7.9  7.8
## BERDON,R.I.     6.8  8.8  8.5  8.8  8.3  8.5  8.7  8.7  8.4  8.5  8.8  8.7
## BRACKEN,J.J.    7.3  6.4  4.3  6.5  6.0  6.2  5.7  5.7  5.1  5.3  5.5  4.8
## BURNS,E.B.      6.2  8.8  8.7  8.5  7.9  8.0  8.1  8.0  8.0  8.0  8.6  8.6

# PCA
pcaUSJudgeRatings <- princomp(USJudgeRatings, cor = TRUE)
summary(pcaUSJudgeRatings)
## Importance of components:
##                           Comp.1     Comp.2    Comp.3     Comp.4
## Standard deviation     3.1833165 1.05078398 0.5769763 0.50383231
## Proportion of Variance 0.8444586 0.09201225 0.0277418 0.02115392
## Cumulative Proportion  0.8444586 0.93647089 0.9642127 0.98536661
##                             Comp.5      Comp.6      Comp.7      Comp.8
## Standard deviation     0.290607615 0.193095982 0.140295449 0.124158319
## Proportion of Variance 0.007037732 0.003107172 0.001640234 0.001284607
## Cumulative Proportion  0.992404341 0.995511513 0.997151747 0.998436354
##                              Comp.9      Comp.10      Comp.11      Comp.12
## Standard deviation     0.0885069038 0.0749114592 0.0570804224 0.0453913429
## Proportion of Variance 0.0006527893 0.0004676439 0.0002715146 0.0001716978
## Cumulative Proportion  0.9990891437 0.9995567876 0.9998283022 1.0000000000

# Plot of variances (screeplot)
plot(pcaUSJudgeRatings, type = "l")
The selections of \(l\) for this PCA, based on the previous heuristics, are:
 \(l=1\), since it alone explains \(84\%\) of the variance.
 \(l=1\), since from \(l=2\) onward the variances are very small compared to the first one.
 \(l=2\), since the \(\text{PC}_3\) has standard deviation smaller than \(1\).
 \(l=1,2\) are fine, they can be easily represented graphically.
 \(l=1,2\) are fine, both components are interpretable, as we will see later.
Based on the previous criteria, we can conclude that \(l=1\) PC is a reasonable compromise for representing the USJudgeRatings
dataset.
We analyse now a slightly more complicated dataset. It contains the standings and team statistics for La Liga 2015/2016:
Table 5.2: Selection of variables for La Liga 2015/2016 dataset.

Team  Points  Matches  Wins  Draws  Loses
Barcelona  91  38  29  4  5
Real Madrid  90  38  28  6  4 
Atlético Madrid  88  38  28  4  6 
Villarreal  64  38  18  10  10 
Athletic  62  38  18  8  12 
Celta  60  38  17  9  12 
Sevilla  52  38  14  10  14 
Málaga  48  38  12  12  14 
Real Sociedad  48  38  13  9  16 
Betis  45  38  11  12  15 
Las Palmas  44  38  12  8  18 
Valencia  44  38  11  11  16 
Eibar  43  38  11  10  17 
Espanyol  43  38  12  7  19 
Deportivo  42  38  8  18  12 
Granada  39  38  10  9  19 
Sporting Gijón  39  38  10  9  19 
Rayo Vallecano  38  38  9  11  18 
Getafe  36  38  9  9  20 
Levante  32  38  8  8  22 
# PCA - we remove the second variable, matches played, since it is constant
pcaLaliga <- princomp(laliga[, -2], cor = TRUE)
summary(pcaLaliga)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4
## Standard deviation     3.4372781 1.5514051 1.14023547 0.91474383
## Proportion of Variance 0.6563823 0.1337143 0.07222983 0.04648646
## Cumulative Proportion  0.6563823 0.7900966 0.86232642 0.90881288
##                            Comp.5     Comp.6    Comp.7      Comp.8
## Standard deviation     0.85799980 0.60295746 0.4578613 0.373829925
## Proportion of Variance 0.04089798 0.02019765 0.0116465 0.007763823
## Cumulative Proportion  0.94971086 0.96990851 0.9815550 0.989318830
##                             Comp.9    Comp.10      Comp.11      Comp.12
## Standard deviation     0.327242606 0.22735805 0.1289704085 0.0991188181
## Proportion of Variance 0.005949318 0.00287176 0.0009240759 0.0005458078
## Cumulative Proportion  0.995268148 0.99813991 0.9990639840 0.9996097918
##                             Comp.13      Comp.14      Comp.15      Comp.16
## Standard deviation     0.0837498223 2.860411e-03 1.238298e-03 3.519922e-08
## Proportion of Variance 0.0003896685 4.545528e-07 8.518782e-08 6.883250e-17
## Cumulative Proportion  0.9999994603 9.999999e-01 1.000000e+00 1.000000e+00
##                        Comp.17 Comp.18
## Standard deviation           0       0
## Proportion of Variance       0       0
## Cumulative Proportion        1       1

# Plot of variances (screeplot)
plot(pcaLaliga, type = "l")
The selections of \(l\) for this PCA, based on the previous heuristics, are:
 \(l=2,3\), since they explain \(79\%\) and \(86\%\) of the variance, respectively (the choice depends on whether the threshold is \(70\%\) or \(80\%\)).
 \(l=3\), since from \(l=4\) onward the variances are very similar and notably smaller than the first ones.
 \(l=3\), since the \(\text{PC}_4\) has standard deviation smaller than \(1\).
 \(l=2\) is preferred to \(l=3\).
 \(l=1,2\) are fine, both components are interpretable, as we will see later. \(l=3\) is harder to interpret.
Based on the previous criteria, we can conclude that \(l=2\) PCs is a reasonable compromise for representing the La Liga 2015/2016 dataset.
Let’s focus now on the interpretation of the PCs. In addition to the weights present in the loadings slot, biplot provides a powerful and succinct way of displaying the relevant information for \(1\leq l\leq 2\). The biplot shows:
 The scores of the data in PC1 and PC2 as points (with optional text labels, depending on whether there are case names). This is the representation of the data in the first two PCs.
 The variables represented in the PC1 and PC2 by the arrows. These arrows are centered at \((0, 0)\).
Let’s examine the arrow associated to the variable \(X_j\). \(X_j\) is expressed in terms of \(\text{PC}_1\) and \(\text{PC}_2\) by the weights \(a_{j1}\) and \(a_{j2}\): \[X_j=a_{j1}\text{PC}_{1} + a_{j2}\text{PC}_{2} + \ldots + a_{jk}\text{PC}_{k}\approx a_{j1}\text{PC}_{1} + a_{j2}\text{PC}_{2}.\] \(a_{j1}\) and \(a_{j2}\) have the same sign as \(\mathrm{Cor}(X_j,\text{PC}_{1})\) and \(\mathrm{Cor}(X_j,\text{PC}_{2})\), respectively. The arrow associated to \(X_j\) is given by the segment joining \((0,0)\) and \((a_{j1},a_{j2})\). Therefore:
 If the arrow points right (\(a_{j1}>0\)), there is positive correlation between \(X_j\) and \(\text{PC}_1\). Analogous if the arrow points left.
 If the arrow is approximately vertical (\(a_{j1}\approx0\)), \(X_j\) and \(\text{PC}_1\) are approximately uncorrelated.
Analogously:
 If the arrow points up (\(a_{j2}>0\)), there is positive correlation between \(X_j\) and \(\text{PC}_2\). Analogous if the arrow points down.
 If the arrow is approximately horizontal (\(a_{j2}\approx0\)), \(X_j\) and \(\text{PC}_2\) are approximately uncorrelated.
In addition, the magnitude of the arrow informs about the strength of the correlation.
The biplot also provides the direct relation between variables, at sight of their expressions in \(\text{PC}_1\) and \(\text{PC}_2\). The angle between the arrows of the variables \(X_j\) and \(X_m\) gives an approximation to the correlation between them, \(\mathrm{Cor}(X_j,X_m)\):
 If \(\text{angle}\approx 0^\circ\), the two variables are highly positively correlated.
 If \(\text{angle}\approx 90^\circ\), they are approximately uncorrelated.
 If \(\text{angle}\approx 180^\circ\), the two variables are highly negatively correlated.
The approximation to the correlation by means of the arrow angles is as good as the percentage of variance explained by \(\text{PC}_1\) and \(\text{PC}_2\).
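A quick numerical check of the angle rule, sketched on the USArrests PCA computed above (the cosine of the angle between two arrows plays the role of the correlation):

```r
# Angle between the arrows of Murder and Assault vs. their correlation
pcaArr <- princomp(USArrests, cor = TRUE)
A <- pcaArr$loadings[, 1:2]  # the arrows (a_{j1}, a_{j2})

u <- A["Murder", ]
v <- A["Assault", ]
cosAngle <- sum(u * v) / sqrt(sum(u^2) * sum(v^2))
cosAngle  # close to 1: small angle, high positive correlation
cor(USArrests$Murder, USArrests$Assault)  # indeed strongly positive
```

The two figures need not coincide exactly, since PC1 and PC2 explain \(86\%\) of the variance, not all of it.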
Let’s see an in-depth illustration of the previous concepts for pcaUSArrests:
# Weights and biplot
pcaUSArrests$loadings
##
## Loadings:
##          Comp.1 Comp.2 Comp.3 Comp.4
## Murder   -0.536  0.418  0.341  0.649
## Assault  -0.583  0.188  0.268  0.743
## UrbanPop -0.278 -0.873  0.378  0.134
## Rape     -0.543 -0.167  0.818
##
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00
biplot(pcaUSArrests)
We can extract the following conclusions regarding the arrows and PCs:
 Murder, Assault and Rape are negatively correlated with \(\text{PC}_1\), which might be regarded as an indicator of the absence of crime (positive for fewer crimes, negative for more). The three variables are highly correlated among themselves, and their arrows are:\[\begin{align*}\vec{\text{Murder}} = (-0.536, 0.418),\\\vec{\text{Assault}} = (-0.583, 0.188),\\\vec{\text{Rape}} = (-0.543, -0.167).\end{align*}\]
 Murder and UrbanPop are approximately uncorrelated.
 UrbanPop is the variable most correlated with \(\text{PC}_2\) (positive for low urban population, negative for high). Its arrow is:\[\begin{align*}\vec{\text{UrbanPop}} = (-0.278, -0.873).\end{align*}\]
Therefore, the biplot shows that states like Florida, South Carolina and California have high crime rates, whereas states like North Dakota or Vermont have low crime rates. California, in addition to having a high crime rate, has a large urban population, whereas South Carolina has a low urban population. With the biplot we can visualize the differences between states according to crime rate and urban population in a simple way.
Let’s see now the biplot for the USJudgeRatings
, which has a clear interpretation:
# Weights and biplot
pcaUSJudgeRatings$loadings
##
## Loadings:
##      Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## CONT 0.933 0.335
## INTG 0.289 0.182 0.549 0.174 0.370 0.450 0.334 0.275
## DMNR 0.287 0.198 0.556 0.124 0.229 0.395 0.467 0.247 0.199
## DILG 0.304 0.164 0.321 0.302 0.599 0.210 0.355
## CFMG 0.303 0.168 0.207 0.448 0.247 0.714 0.143
## DECI 0.302 0.128 0.298 0.424 0.393 0.536 0.302 0.258
## PREP 0.309 0.152 0.214 0.203 0.335 0.154 0.109
## FAMI 0.307 0.195 0.201 0.507 0.102 0.223
## ORAL 0.313 0.246 0.150 0.300
## WRIT 0.311 0.137 0.306 0.238 0.126
## PHYS 0.281 0.154 0.841 0.118 0.299 0.266
## RTEN 0.310 0.173 0.184 0.256 0.221 0.756
##      Comp.10 Comp.11 Comp.12
## CONT
## INTG 0.109 0.113
## DMNR 0.134
## DILG 0.383
## CFMG 0.166
## DECI 0.128
## PREP 0.680 0.319 0.273
## FAMI 0.573 0.422
## ORAL 0.256 0.639 0.494
## WRIT 0.475 0.696
## PHYS
## RTEN 0.250 0.286
##
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.083  0.083  0.083  0.083  0.083  0.083  0.083  0.083
## Cumulative Var  0.083  0.167  0.250  0.333  0.417  0.500  0.583  0.667
##                Comp.9 Comp.10 Comp.11 Comp.12
## SS loadings     1.000   1.000   1.000   1.000
## Proportion Var  0.083   0.083   0.083   0.083
## Cumulative Var  0.750   0.833   0.917   1.000
biplot(pcaUSJudgeRatings, cex = 0.75)
\(\text{PC}_1\) gives an indicator of how poorly the lawyers rate the judge’s conduct of a trial. The variable CONT, which measures the number of contacts between judge and lawyer, is almost uncorrelated with the rest of the variables and is captured by \(\text{PC}_2\) (hence the lawyers’ ratings are not affected by the number of contacts with the judge). We can identify the high-rated and low-rated judges on the left and right of the plot, respectively.
Let’s see an application of the biplot to La Liga 2015/2016, a dataset with more variables and a harder interpretation of the PCs.
# Weights and biplot
pcaLaliga$loadings
##
## Loadings:
##                           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Points 0.280 0.136 0.145 0.137
## Wins 0.277 0.219 0.146
## Draws 0.157 0.633 0.278 0.332 0.102 0.382
## Loses 0.269 0.118 0.215 0.137 0.131 0.313
## Goals.scored 0.271 0.220 0.101
## Goals.conceded 0.232 0.336 0.178 0.374
## Difference.goals 0.288 0.171
## Percentage.scored.goals 0.271 0.219
## Percentage.conceded.goals 0.232 0.336 0.178 0.375
## Shots 0.229 0.299 0.133 0.325 0.272 0.249
## Shots.on.goal 0.252 0.265 0.209
## Penalties.scored 0.160 0.272 0.410 0.636 0.389 0.160
## Assistances 0.271 0.186 0.158 0.129 0.176
## Fouls.made 0.189 0.561 0.178 0.213 0.592
## Matches.without.conceding 0.222 0.364 0.163 0.138 0.105 0.239
## Yellow.cards 0.244 0.108 0.358 0.161 0.144
## Red.cards 0.158 0.340 0.594 0.192 0.385 0.303
## Offsides 0.163 0.341 0.453 0.426 0.429 0.283
##                           Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
## Points 0.135 0.290 0.137 0.213
## Wins 0.263 0.120 0.230
## Draws 0.150 0.126 0.249
## Loses 0.220 0.338 0.171 0.103 0.154
## Goals.scored 0.129 0.104 0.251 0.238 0.289 0.225
## Goals.conceded 0.103 0.225 0.144
## Difference.goals 0.125 0.272 0.151 0.109
## Percentage.scored.goals 0.132 0.103 0.251 0.252 0.297 0.223
## Percentage.conceded.goals 0.231 0.161
## Shots 0.236 0.267 0.452 0.478 0.188
## Shots.on.goal 0.471 0.325 0.439 0.488 0.192
## Penalties.scored 0.131 0.328
## Assistances 0.154 0.570 0.458 0.504
## Fouls.made 0.363 0.187 0.135 0.108 0.135
## Matches.without.conceding 0.282 0.303 0.388 0.258 0.554
## Yellow.cards 0.733 0.369 0.152 0.219
## Red.cards 0.384 0.216 0.127
## Offsides 0.302 0.255 0.146 0.128
##                           Comp.14 Comp.15 Comp.16 Comp.17 Comp.18
## Points 0.239 0.789
## Wins 0.820 0.136
## Draws 0.293 0.203
## Loses 0.424 0.563
## Goals.scored 0.260 0.511 0.505
## Goals.conceded 0.424 0.477 0.374
## Difference.goals 0.374 0.103 0.775
## Percentage.scored.goals 0.501 0.568
## Percentage.conceded.goals 0.601 0.422
## Shots
## Shots.on.goal
## Penalties.scored
## Assistances
## Fouls.made
## Matches.without.conceding
## Yellow.cards
## Red.cards
## Offsides
##
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.056  0.056  0.056  0.056  0.056  0.056  0.056  0.056
## Cumulative Var  0.056  0.111  0.167  0.222  0.278  0.333  0.389  0.444
##                Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## SS loadings     1.000   1.000   1.000   1.000   1.000   1.000   1.000
## Proportion Var  0.056   0.056   0.056   0.056   0.056   0.056   0.056
## Cumulative Var  0.500   0.556   0.611   0.667   0.722   0.778   0.833
##                Comp.16 Comp.17 Comp.18
## SS loadings      1.000   1.000   1.000
## Proportion Var   0.056   0.056   0.056
## Cumulative Var   0.889   0.944   1.000
biplot(pcaLaliga, cex = 0.75)
Some interesting highlights:
 \(\text{PC}_1\) can be regarded as the non-performance of a team during the season: it is negatively correlated with Wins and Points, and positively correlated with Draws, Loses and Yellow.cards. The best-performing teams are not surprising: Barcelona, Real Madrid and Atlético Madrid. On the other hand, among the worst-performing teams are Levante, Getafe and Granada.
 \(\text{PC}_2\) can be seen as the inefficiency of a team (conceding points with little participation in the game). Under this interpretation, Rayo Vallecano and Real Madrid were the most inefficient teams, while Atlético Madrid and Villarreal were the most efficient.
 Offsides is approximately uncorrelated with Red.cards.
 \(\text{PC}_3\) does not have a clear interpretation.
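These PC interpretations can be double-checked numerically by correlating the scores with the original variables. A minimal, self-contained sketch on the built-in USArrests dataset (the same idea applies verbatim to laliga and pcaLaliga):

```r
# Correlate the PC scores with the original variables to support the
# interpretation of each component (here on the built-in USArrests data)
data(USArrests)
pcaArrests <- princomp(USArrests, cor = TRUE)

# Rows are components, columns are the original variables
corScores <- cor(pcaArrests$scores, USArrests)
round(corScores, 2)
# Comp.1 correlates strongly with Murder, Assault and Rape, so it can be
# read as an overall "crime level" component
```

Large correlations (in absolute value) flag the variables that drive each component, which is exactly the information the biplot displays graphically.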
If you are wondering about the 3D representation of the biplot, it can be computed through:
# Install this package with install.packages("pca3d")
library(pca3d)
pca3d(pcaLaliga, show.labels = TRUE, biplot = TRUE)
## [1] 0.1822244 0.0677458 0.0800248
Finally, we mention that R Commander
has a menu entry for performing PCA: 'Statistics' > 'Dimensional analysis' > 'Principal-components analysis...'
. Alternatively, the plugin FactoMineR
implements a PCA with more options and graphical outputs. It can be loaded (if installed) in 'Tools' > 'Load Rcmdr plugin(s)...' > 'RcmdrPlugin.FactoMineR'
(you will need to restart R Commander
). For performing a PCA in FactoMineR
, go to 'FactoMineR' > 'Principal Component Analysis (PCA)'
. In that menu you will have more advanced options than in R Commander
’s PCA.
Cluster analysis
Cluster analysis is the collection of techniques designed to find subgroups or clusters in a dataset of variables \(X_1,\ldots,X_k\). Depending on the similarities between the observations, these are partitioned into groups that are as homogeneous as possible internally and as separated as possible from each other. Clustering methods can be classified into two main categories:
 Partition methods. Given a fixed number of clusters \(k\), these methods aim to assign each observation to a unique cluster, in such a way that the within-cluster variation is as small as possible (the clusters are as homogeneous as possible) while the between-cluster variation is as large as possible (the clusters are as separated as possible).
 Hierarchical methods. These methods construct a hierarchy for the observations in terms of their similarities. This results in a tree-based representation of the data, the dendrogram, which depicts how the observations are clustered at different levels – from the smallest groups of one element to the largest group representing the whole dataset.
We will see the basics of the most well-known partition method, namely \(k\)-means clustering, and of agglomerative hierarchical clustering.
\(k\)-means clustering
The \(k\)-means clustering looks for \(k\) clusters in the data that are as compact and as separated as possible. In clustering terminology, the clusters minimize the within-cluster variation with respect to the cluster centroid, while maximizing the between-cluster variation. The distance used for measuring proximity is the usual Euclidean distance between points. As a consequence, this clustering method tends to yield spherical or rounded clusters and is not adequate for categorical variables.
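The objective function just described can be verified by hand: the tot.withinss component returned by kmeans is exactly the sum of squared Euclidean distances of each observation to its cluster centroid. A self-contained sketch on the built-in USArrests dataset:

```r
# Recompute k-means' total within-cluster sum of squares by hand
data(USArrests)
X <- as.matrix(USArrests)
set.seed(1)
km <- kmeans(X, centers = 2, nstart = 20)

# Sum, over clusters, of the squared distances to the cluster centroid
wss <- sum(sapply(1:2, function(j) {
  sum(sweep(X[km$cluster == j, , drop = FALSE], 2, km$centers[j, ])^2)
}))
all.equal(wss, km$tot.withinss)  # both quantities coincide
```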
Let’s analyze the possible clusters in a smaller subset of La Liga 2015/2016 (download) dataset, where the results can be easily visualized. To that end, import the data as laliga
.
# We consider only a smaller dataset (Points and Yellow.cards)
head(laliga, 2)
## Points Matches Wins Draws Loses Goals.scored Goals.conceded
## Barcelona 91 38 29 4 5 112 29
## Real Madrid 90 38 28 6 4 110 34
## Difference.goals Percentage.scored.goals
## Barcelona 83 2.95
## Real Madrid 76 2.89
## Percentage.conceded.goals Shots Shots.on.goal Penalties.scored
## Barcelona 0.76 600 277 11
## Real Madrid 0.89 712 299 6
## Assistances Fouls.made Matches.without.conceding Yellow.cards
## Barcelona 79 385 18 66
## Real Madrid 90 420 14 72
## Red.cards Offsides
## Barcelona 1 120
## Real Madrid 5 114
pointsCards <- laliga[, c(1, 17)]
plot(pointsCards)
# kmeans uses a random initialization of the clusters, so the results may vary
# from one call to another. We use set.seed() to have reproducible outputs.
set.seed(2345678)
# kmeans call:
# - centers is the k, the number of clusters.
# - nstart indicates how many different starting assignments should be
#   considered (useful for avoiding suboptimal clusterings)
k <- 2
km <- kmeans(pointsCards, centers = k, nstart = 20)
# What is inside km?
km
## K-means clustering with 2 clusters of sizes 4, 16
## 
## Cluster means:
## Points Yellow.cards
## 1 82.7500 78.25
## 2 44.8125 113.25
## 
## Clustering vector:
## Barcelona Real Madrid Atlético Madrid Villarreal 
## 1 1 1 2 
## Athletic Celta Sevilla Málaga 
## 1 2 2 2 
## Real Sociedad Betis Las Palmas Valencia 
## 2 2 2 2 
## Eibar Espanyol Deportivo Granada 
## 2 2 2 2 
## Sporting Gijón Rayo Vallecano Getafe Levante 
## 2 2 2 2 
## 
## Within cluster sum of squares by cluster:
## [1] 963.500 4201.438
## (between_SS / total_SS = 62.3 %)
## 
## Available components:
## 
## [1] "cluster" "centers" "totss" "withinss" 
## [5] "tot.withinss" "betweenss" "size" "iter" 
## [9] "ifault"
str(km)
## List of 9
## $ cluster : Named int [1:20] 1 1 1 2 1 2 2 2 2 2 ...
## ..- attr(*, "names")= chr [1:20] "Barcelona" "Real Madrid" "Atlético Madrid" "Villarreal" ...
## $ centers : num [1:2, 1:2] 82.8 44.8 78.2 113.2
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:2] "Points" "Yellow.cards"
## $ totss : num 13691
## $ withinss : num [1:2] 964 4201
## $ tot.withinss: num 5165
## $ betweenss : num 8526
## $ size : int [1:2] 4 16
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
# between_SS / total_SS gives a criterion to select k similar to PCA.
# Recall that between_SS / total_SS = 100% if k = n
# Centroids of each cluster
km$centers
## Points Yellow.cards
## 1 82.7500 78.25
## 2 44.8125 113.25
# Assignments of observations to the k clusters
km$cluster
## Barcelona Real Madrid Atlético Madrid Villarreal 
## 1 1 1 2 
## Athletic Celta Sevilla Málaga 
## 1 2 2 2 
## Real Sociedad Betis Las Palmas Valencia 
## 2 2 2 2 
## Eibar Espanyol Deportivo Granada 
## 2 2 2 2 
## Sporting Gijón Rayo Vallecano Getafe Levante 
## 2 2 2 2
# Plot data with colors according to clusters
plot(pointsCards, col = km$cluster)
# Add the names of the observations above the points
text(x = pointsCards, labels = rownames(pointsCards), col = km$cluster, pos = 3, cex = 0.75)
# Clustering with k = 3
k <- 3
set.seed(2345678)
km <- kmeans(pointsCards, centers = k, nstart = 20)
plot(pointsCards, col = km$cluster)
text(x = pointsCards, labels = rownames(pointsCards), col = km$cluster, pos = 3, cex = 0.75)
# Clustering with k = 4
k <- 4
set.seed(2345678)
km <- kmeans(pointsCards, centers = k, nstart = 20)
plot(pointsCards, col = km$cluster)
text(x = pointsCards, labels = rownames(pointsCards), col = km$cluster, pos = 3, cex = 0.75)
So far, we have only taken the information of two variables for performing clustering. Using PCA, we can visualize the clustering performed with all the available variables in the dataset.
By default, kmeans does not standardize the variables, which will affect the clustering result. As a consequence, the clustering of a dataset will be different if one of its variables is expressed in millions or in tenths. If you want to avoid this distortion, you can use scale to automatically center and standardize a data frame (the result will be a matrix, so you need to transform it into a data frame again).
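A minimal sketch of this standardization step, on the built-in USArrests dataset:

```r
# scale() centers and standardizes; the result is a matrix, so we wrap it
# back into a data frame
data(USArrests)
USArrestsStd <- data.frame(scale(USArrests))

# Sanity check: every standardized variable has mean 0 and sd 1
colMeans(USArrestsStd)     # all (numerically) zero
sapply(USArrestsStd, sd)   # all equal to 1
```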
# Work with standardized data (and remove Matches)
laligaStd <- data.frame(scale(laliga[, -2]))
# Clustering with all the variables - unstandardized data
set.seed(345678)
kme <- kmeans(laliga, centers = 3, nstart = 20)
kme$cluster
## Barcelona Real Madrid Atlético Madrid Villarreal 
## 1 1 3 2 
## Athletic Celta Sevilla Málaga 
## 3 2 2 2 
## Real Sociedad Betis Las Palmas Valencia 
## 3 3 3 2 
## Eibar Espanyol Deportivo Granada 
## 2 2 3 2 
## Sporting Gijón Rayo Vallecano Getafe Levante 
## 2 2 2 2
table(kme$cluster)
## 
## 1 2 3 
## 2 12 6
# Clustering with all the variables - standardized data
set.seed(345678)
kme <- kmeans(laligaStd, centers = 3, nstart = 20)
kme$cluster
## Barcelona Real Madrid Atlético Madrid Villarreal 
## 1 1 1 2 
## Athletic Celta Sevilla Málaga 
## 2 2 2 2 
## Real Sociedad Betis Las Palmas Valencia 
## 2 2 2 2 
## Eibar Espanyol Deportivo Granada 
## 3 3 2 3 
## Sporting Gijón Rayo Vallecano Getafe Levante 
## 3 3 3 3
table(kme$cluster)
## 
## 1 2 3 
## 3 10 7
# PCA
pca <- princomp(laliga[, -2], cor = TRUE)
summary(pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 3.4372781 1.5514051 1.14023547 0.91474383
## Proportion of Variance 0.6563823 0.1337143 0.07222983 0.04648646
## Cumulative Proportion 0.6563823 0.7900966 0.86232642 0.90881288
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 0.85799980 0.60295746 0.4578613 0.373829925
## Proportion of Variance 0.04089798 0.02019765 0.0116465 0.007763823
## Cumulative Proportion 0.94971086 0.96990851 0.9815550 0.989318830
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.327242606 0.22735805 0.1289704085 0.0991188181
## Proportion of Variance 0.005949318 0.00287176 0.0009240759 0.0005458078
## Cumulative Proportion 0.995268148 0.99813991 0.9990639840 0.9996097918
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 0.0837498223 2.860411e-03 1.238298e-03 3.519922e-08
## Proportion of Variance 0.0003896685 4.545528e-07 8.518782e-08 6.883250e-17
## Cumulative Proportion 0.9999994603 9.999999e-01 1.000000e+00 1.000000e+00
## Comp.17 Comp.18
## Standard deviation 0 0
## Proportion of Variance 0 0
## Cumulative Proportion 1 1
# Biplot (the scores of the first two PCs)
biplot(pca)
# Redo the biplot with colors indicating the cluster assignments
plot(pca$scores[, 1:2], col = kme$cluster)
text(x = pca$scores[, 1:2], labels = rownames(pca$scores), pos = 3, col = kme$cluster)
# Recall: this is a visualization with PC1 and PC2 of the clustering done
# with all the variables, not just PC1 and PC2
# Clustering with only the first two PCs - different and less accurate
# result, but still insightful
set.seed(345678)
kme2 <- kmeans(pca$scores[, 1:2], centers = 3, nstart = 20)
plot(pca$scores[, 1:2], col = kme2$cluster)
text(x = pca$scores[, 1:2], labels = rownames(pca$scores), pos = 3, col = kme2$cluster)
\(k\)-means can also be performed through the help of R Commander. To do so, go to 'Statistics' > 'Dimensional Analysis' > 'Clustering' > 'k-means cluster analysis...'. If you do this for the USArrests dataset after rescaling it, select 'Assign clusters to the data set' and name the 'Assignment variable' as 'KMeans', you should get something like this:
# Load data and scale it
data(USArrests)
USArrests <- as.data.frame(scale(USArrests))
# Statistics > Dimensional Analysis > Clustering > k-means cluster analysis...
.cluster <- KMeans(model.matrix(~-1 + Assault + Murder + Rape + UrbanPop, USArrests), centers = 2, iter.max = 10, num.seeds = 10)
.cluster$size # Cluster Sizes
## [1] 20 30
.cluster$centers # Cluster Centroids
## new.x.Assault new.x.Murder new.x.Rape new.x.UrbanPop
## 1 1.0138274 1.004934 0.8469650 0.1975853
## 2 -0.6758849 -0.669956 -0.5646433 -0.1317235
.cluster$withinss # Within Cluster Sum of Squares
## [1] 46.74796 56.11445
.cluster$tot.withinss # Total Within Sum of Squares
## [1] 102.8624
.cluster$betweenss # Between Cluster Sum of Squares
## [1] 93.1376
remove(.cluster)
.cluster <- KMeans(model.matrix(~-1 + Assault + Murder + Rape + UrbanPop, USArrests), centers = 2, iter.max = 10, num.seeds = 10)
.cluster$size # Cluster Sizes
## [1] 30 20
.cluster$centers # Cluster Centroids
## new.x.Assault new.x.Murder new.x.Rape new.x.UrbanPop
## 1 -0.6758849 -0.669956 -0.5646433 -0.1317235
## 2 1.0138274 1.004934 0.8469650 0.1975853
.cluster$withinss # Within Cluster Sum of Squares
## [1] 56.11445 46.74796
.cluster$tot.withinss # Total Within Sum of Squares
## [1] 102.8624
.cluster$betweenss # Between Cluster Sum of Squares
## [1] 93.1376
biplot(princomp(model.matrix(~-1 + Assault + Murder + Rape + UrbanPop, USArrests)), xlabs = as.character(.cluster$cluster))
USArrests$KMeans <- assignCluster(model.matrix(~-1 + Assault + Murder + Rape + UrbanPop, USArrests), USArrests, .cluster$cluster)
remove(.cluster)
How many clusters k do we need in practice? There is no single answer: the advice is to try several values and compare. Inspecting the 'between_SS / total_SS' ratio for a good trade-off between the number of clusters and the percentage of the total variation explained usually gives a good starting point for deciding on k.
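This "try several k and compare" advice can be automated into an elbow-type plot. A sketch on the (standardized) built-in USArrests dataset; the seed and the range of k are arbitrary choices:

```r
# Compute between_SS / total_SS for k = 1, ..., 8 and look for the "elbow"
# where adding more clusters stops paying off
data(USArrests)
USArrestsStd <- data.frame(scale(USArrests))
ks <- 1:8
ratio <- sapply(ks, function(k) {
  set.seed(123456)  # reproducible assignments
  km <- kmeans(USArrestsStd, centers = k, nstart = 20)
  km$betweenss / km$totss
})
plot(ks, ratio, type = "b", xlab = "k", ylab = "between_SS / total_SS")
```

The ratio always increases with k (it reaches 100% when k = n), so the point where the curve flattens, rather than its maximum, is what suggests a sensible k.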
For the iris dataset, do sequentially:
 1. Apply scale to the dataset and save it as irisStd. Note: the fifth variable is a factor, so you must skip it.
 2. Fix the seed to 625365712.
 3. Run k-means with 20 runs for k = 2, 3, 4. Save the results as km2, km3 and km4.
 4. Compute the PCA of irisStd.
 5. Plot the first two scores, colored by the assignments of km2.
 6. Do the same for km3 and km4.
 7. Which k do you think gives the most sensible partition, based on the previous plots?
Agglomerative hierarchical clustering
Hierarchical clustering starts by considering that each observation is its own cluster, and then sequentially merges the clusters with the lowest degree of dissimilarity \(d\) (the lower the dissimilarity, the larger the similarity). For example, if there are three clusters, \(A\), \(B\) and \(C\), and their dissimilarities are \(d(A,B)=0.1\), \(d(A,C)=0.5\) and \(d(B,C)=0.9\), then the three clusters will be reduced to just two: \((A,B)\) and \(C\).
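The merging rule of the example above can be replicated directly in R by feeding hclust a hand-made dissimilarity matrix:

```r
# Dissimilarities d(A,B) = 0.1, d(A,C) = 0.5, d(B,C) = 0.9, as in the example
m <- matrix(c(0.0, 0.1, 0.5,
              0.1, 0.0, 0.9,
              0.5, 0.9, 0.0), nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
tree <- hclust(as.dist(m))

tree$merge[1, ]      # A and B (observations 1 and 2) are merged first...
tree$height[1]       # ...at height d(A, B) = 0.1
cutree(tree, k = 2)  # two clusters: (A, B) and C
```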
The advantages of hierarchical clustering are several:
 We do not need to specify a fixed number of clusters \(k\).
 The clusters are naturally nested within each other, something that does not happen in \(k\)-means. It is possible to visualize this nested structure through the dendrogram.
 It can deal with categorical variables, through the specification of proper dissimilarity measures. In particular, it deals with numerical variables using the Euclidean distance.
The linkage employed by hierarchical clustering refers to how the clusters are fused:
 Complete. Takes the maximum of all the pairwise dissimilarities between the observations in cluster A and those in cluster B.
 Single. Takes the minimum of all the pairwise dissimilarities between the observations in cluster A and those in cluster B.
 Average. Takes the average of all the pairwise dissimilarities between the observations in cluster A and those in cluster B.
Hierarchical clustering is quite sensitive to the kind of dissimilarity and the kind of linkage employed. In addition, the hierarchical property might force the clusters to behave unnaturally. In particular, single linkage may result in extended, chained clusters in which a single observation is added at each new level. As a consequence, complete and average linkages are usually recommended in practice.
Let’s illustrate how to perform hierarchical clustering in laligaStd
.
# Compute dissimilarity matrix - in this case, the Euclidean distance
d <- dist(laligaStd)
# Hierarchical clustering with complete linkage
treeComp <- hclust(d, method = "complete")
plot(treeComp)
# With average linkage
treeAve <- hclust(d, method = "average")
plot(treeAve)
# With single linkage
treeSingle <- hclust(d, method = "single")
plot(treeSingle) # Chaining
# Set the number of clusters after inspecting visually the dendrogram for
# "long" groups of hanging leaves
# These are the cluster assignments
cutree(treeComp, k = 2) # (Barcelona, Real Madrid, Atlético Madrid) and (rest)
## Barcelona Real Madrid Atlético Madrid Villarreal 
## 1 1 1 2 
## Athletic Celta Sevilla Málaga 
## 2 2 2 2 
## Real Sociedad Betis Las Palmas Valencia 
## 2 2 2 2 
## Eibar Espanyol Deportivo Granada 
## 2 2 2 2 
## Sporting Gijón Rayo Vallecano Getafe Levante 
## 2 2 2 2
cutree(treeComp, k = 3) # (Barcelona, Real Madrid), (Atlético Madrid) and (rest)
## Barcelona Real Madrid Atlético Madrid Villarreal 
## 1 1 2 3 
## Athletic Celta Sevilla Málaga 
## 3 3 3 3 
## Real Sociedad Betis Las Palmas Valencia 
## 3 3 3 3 
## Eibar Espanyol Deportivo Granada 
## 3 3 3 3 
## Sporting Gijón Rayo Vallecano Getafe Levante 
## 3 3 3 3
# Compare differences - treeComp makes more sense than treeAve
cutree(treeComp, k = 4)
## Barcelona Real Madrid Atlético Madrid Villarreal 
## 1 1 2 3 
## Athletic Celta Sevilla Málaga 
## 3 3 3 3 
## Real Sociedad Betis Las Palmas Valencia 
## 3 3 3 4 
## Eibar Espanyol Deportivo Granada 
## 4 4 3 4 
## Sporting Gijón Rayo Vallecano Getafe Levante 
## 4 4 4 4
cutree(treeAve, k = 4)
## Barcelona Real Madrid Atlético Madrid Villarreal 
## 1 1 2 3 
## Athletic Celta Sevilla Málaga 
## 3 3 3 3 
## Real Sociedad Betis Las Palmas Valencia 
## 3 3 3 3 
## Eibar Espanyol Deportivo Granada 
## 3 3 4 3 
## Sporting Gijón Rayo Vallecano Getafe Levante 
## 3 3 3 3
# We can plot the results in the first two PCs, as we did in k-means
cluster <- cutree(treeComp, k = 2)
plot(pca$scores[, 1:2], col = cluster)
text(x = pca$scores[, 1:2], labels = rownames(pca$scores), pos = 3, col = cluster)
cluster <- cutree(treeComp, k = 3)
plot(pca$scores[, 1:2], col = cluster)
text(x = pca$scores[, 1:2], labels = rownames(pca$scores), pos = 3, col = cluster)
cluster <- cutree(treeComp, k = 4)
plot(pca$scores[, 1:2], col = cluster)
text(x = pca$scores[, 1:2], labels = rownames(pca$scores), pos = 3, col = cluster)
If categorical variables are present, replace dist by daisy from the cluster package (you need to run library(cluster) first). For example, let’s cluster the iris dataset.
# Load data
data(iris)
# The fifth variable is a factor
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Compute dissimilarity matrix using the Gower dissimilarity measure
# This dissimilarity is able to handle both numerical and categorical variables
# daisy automatically detects whether there are factors present in the data
# and applies Gower (otherwise it applies the Euclidean distance)
library(cluster)
d <- daisy(iris)
tree <- hclust(d)
# 3 main clusters
plot(tree)
# The clusters correspond to the Species
cutree(tree, k = 3)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3
table(iris$Species, cutree(tree, k = 3))
## 
## 1 2 3
## setosa 50 0 0
## versicolor 0 50 0
## virginica 0 0 50
Performing hierarchical clustering in practice depends on several decisions that may have big consequences on the final output:
 What kind of dissimilarity and linkage should be employed? There is no single answer: try several and compare.
 Where to cut the dendrogram? The general advice is to look for groups of branches hanging over a long distance and to cut at their top.
There is no single best solution to the previous questions. What is advisable in practice is to analyze several choices, report the general patterns that arise, and highlight the different features of the data that each method exposes.
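One quantitative aid for such comparisons (a sketch, not part of the main example) is the cophenetic correlation, which measures how faithfully each dendrogram preserves the original pairwise distances:

```r
# Compare linkages through the cophenetic correlation on scaled USArrests
data(USArrests)
d <- dist(scale(USArrests))
sapply(c("complete", "average", "single"), function(m) {
  cor(d, cophenetic(hclust(d, method = m)))
})
# Values closer to 1 indicate a more faithful dendrogram; average linkage
# often scores highest on this criterion
```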
Hierarchical clustering can also be performed through the help of R Commander. To do so, go to 'Statistics' > 'Dimensional Analysis' > 'Clustering' > 'Hierar...'. If you do this for the USArrests dataset after rescaling it, you should get something like this:
HClust.1 <- hclust(dist(model.matrix(~-1 + Assault + Murder + Rape + UrbanPop, USArrests)), method = "complete")
plot(HClust.1, main = "Cluster Dendrogram for Solution HClust.1", xlab = "Observation Number in Data Set USArrests", sub = "Method=complete; Distance=euclidian")
Import the eurojob
(download) dataset and standardize it properly. Perform a hierarchical clustering analysis with the three kinds of linkage seen.
Glossary of important R
commands
Basic usage
The following table contains important R
commands for its basic usage.
Assign values to a variable  <-  x <- 1 
Compute several expressions at once  ;  x <- 1; 2 + 2; 3 * 8 
Create vectors by concatenating numbers  c  c(1, 2, 1) 
Create sequential integer vectors  :  1:10 
Create a matrix by columns  cbind  cbind(1:3, c(0, 2, 0)) 
Create a matrix by rows  rbind  rbind(1:3, c(0, 2, 0)) 
Create a data frame  data.frame  data.frame(name1 = c(1, 3), name2 = c(0.4, 1)) 
Create a list  list  list(obj1 = c(1, 3), obj2 = 1:5, obj3 = rbind(1:2, 3:2)) 
Access elements of a…   
… vector  []  c(0.5, 2)[1]; c(0.5, 2)[-1]; c(0.5, 2)[2:1] 
… matrix  [, ]  cbind(1:2, 3:4)[1, 2]; cbind(1:2, 3:4)[1, ] 
… data frame  [, ] and $  data.frame(name1 = c(1, 3), name2 = c(0.4, 1))$name1; data.frame(name1 = c(1, 3), name2 = c(0.4, 1))[2, 1] 
… list  $  list(x = 2, y = 7:0)$y 
Summarize any object  summary  summary(1:10) 
Linear regression
Some useful commands for performing simple and multiple linear regression are given in the next table. We assume that:
 dataset is an imported dataset such that:
 resp is the response variable
 pred1 is the first predictor
 pred2 is the second predictor
 …
 predk is the last predictor
 model is the result of applying lm
 newPreds is a data.frame with variables named as the predictors
 num is 1, 2 or 3
 level is a number between 0 and 1
Fit a simple linear model  lm(response ~ pred1, data = dataset) 
Fit a multiple linear model…  
… on two predictors  lm(response ~ pred1 + pred2, data = dataset) 
… on all predictors  lm(response ~ ., data = dataset) 
… on all predictors except pred1  lm(response ~ . - pred1, data = dataset) 
Summarize linear model: coefficient estimates, standard errors, \(t\)values, \(p\)values for \(H_0:\beta_j=0\), \(\hat\sigma\) (Residual standard error), degrees of freedom, \(R^2\), Adjusted \(R^2\), \(F\)test, \(p\)value for \(H_0:\beta_1=\ldots=\beta_k=0\)  summary(model) 
ANOVA decomposition  anova(model) 
CIs coefficients  confint(model, level = level) 
Prediction  predict(model, newdata = newPreds) 
CIs predicted mean  predict(model, newdata = newPreds, interval = "confidence", level = level) 
CIs predicted response  predict(model, newdata = newPreds, interval = "prediction", level = level) 
Variable selection  stepwise(model) 
Multicollinearity detection  vif(model) 
Compare model coefficients  compareCoefs(model1, model2) 
Diagnostic plots  plot(model, num) 
More basic usage
The following table contains important R
commands for its basic usage. We assume the following dataset is available:
data <- data.frame(x = 1:10, y = c(1, 2, 3, 0, 3, 1, 1, 3, 0, 1))
Data frame management   
variable names  names  names(data) 
structure  str  str(data) 
dimensions  dim  dim(data) 
beginning  head  head(data) 
Vector related functions   
create sequences  seq  seq(0, 1, l = 10); seq(0, 1, by = 0.25) 
reverse a vector  rev  rev(1:5) 
length of a vector  length  length(1:5) 
count repetitions in a vector  table  table(c(1:5, 4:2)) 
Logical conditions   
relational operators  < , <= , > , >= , == , !=  1 < 0; 1 <= 1; 2 > 1; 3 >= 4; 1 == 0; 1 != 0 
“and”  &  TRUE & FALSE 
“or”  |  TRUE | FALSE 
Subsetting   
vector   data$x[data$x > 0]; data$x[data$x > 2 & data$x < 8] 
 data frame   data[data$x > 0, ]; data[data$x < 2 | data$x > 8, ] 
Distributions   
sampling  rxxxx  rnorm(n = 10, mean = 0, sd = 1) 
density  dxxxx  x <- seq(-4, 4, l = 20); dnorm(x = x, mean = 0, sd = 1) 
distribution  pxxxx  x <- seq(-4, 4, l = 20); pnorm(q = x, mean = 0, sd = 1) 
quantiles  qxxxx  p <- seq(0.1, 0.9, l = 10); qnorm(p = p, mean = 0, sd = 1) 
Plotting   
scatterplot  plot  plot(rnorm(100), rnorm(100)) 
plot a curve  plot, seq  x <- seq(0, 1, l = 100); plot(x, x^2, type = "l") 
add lines  lines  x <- seq(0, 1, l = 100); plot(x, x^2 + rnorm(100, sd = 0.1)); lines(x, x^2, col = 2, lwd = 2) 
Logistic regression
Some useful commands for performing logistic regression are given in the next table. We assume that:
 dataset is an imported dataset such that:
 resp is the binary response variable
 pred1 is the first predictor
 pred2 is the second predictor
 …
 predk is the last predictor
 model is the result of applying glm
 newPreds is a data.frame with variables named as the predictors
 level is a number between 0 and 1
Fit a simple logistic model  glm(response ~ pred1, data = dataset, family = "binomial") 
Fit a multiple logistic model…  
… on two predictors  glm(response ~ pred1 + pred2, data = dataset, family = "binomial") 
… on all predictors  glm(response ~ ., data = dataset, family = "binomial") 
… on all predictors except pred1  glm(response ~ . - pred1, data = dataset, family = "binomial") 
Summarize logistic model: coefficient estimates, standard errors, Wald statistics ('z value' ), \(p\)values for \(H_0:\beta_j=0\), Null deviance, deviance ('Residual deviance' ), AIC, number of iterations  summary(model) 
CIs coefficients  confint(model, level = level); confint.default(model, level = level) 
CIs exp-coefficients  exp(confint(model, level = level)); exp(confint.default(model, level = level)) 
Prediction  predict(model, newdata = newPreds, type = "response") 
CIs predicted probability  Not immediate. Use predictCIsLogistic(model, newdata = newPreds, level = level) as seen in Section 4.6 
Variable selection  stepwise(model) 
Multicollinearity detection  vif(model) 
\(R^2\)  Not immediate. Use r2Log(model = model) as seen in Section 4.8 
Hit matrix  table(data$resp, model$fitted.values > 0.5) 
Principal Component Analysis
Some useful commands for performing principal component analysis are given in the next table. We assume that:
 dataset is an imported dataset with several non-categorical variables (the variables must be continuous or discrete)
 pca is a PCA object, that is, the output of princomp
Compute a PCA…  
… unnormalized (if variables have the same scale)  princomp(dataset) 
… normalized (if variables have different scales)  princomp(dataset, cor = TRUE) 
Summarize PCA: standard deviation explained by each PC, proportion of variance explained by each PC, cumulative proportion of variance explained up to a given component  summary(pca) 
Weights  pca$loadings 
Scores  pca$scores 
Standard deviations of the PCs  pca$sdev 
Means of the original variables  pca$center 
Screeplot  plot(pca); plot(pca, type = "l") 
Biplot  biplot(pca) 
Use of qualitative predictors in regression
An important situation not covered in Chapters 2, 3 and 4 is how to deal with qualitative, rather than quantitative, predictors. Qualitative predictors, also known as categorical variables or, in R’s terminology, factors, are ubiquitous in social sciences. Dealing with them requires some care and a proper understanding of how these variables are represented in statistical software such as R.
Two levels
The simplest case is the situation with two levels, that is, the binary case covered in logistic regression. There we saw that a binary variable \(C\) with two levels (for example, a and b) could be represented as \[D=\left\{\begin{array}{ll}1,&\text{if }C=b,\\0,& \text{if }C=a.\end{array}\right.\] \(D\) is now a dummy variable: it codifies with zeros and ones the two possible levels of the categorical variable. An example of \(C\) could be gender, which has levels male and female. The associated dummy variable is \(D=0\) if the gender is male and \(D=1\) if the gender is female.
The advantage of this dummification is its interpretability in regression models. Since level a corresponds to \(0\), it can be seen as the reference level to which level b is compared. This is the key point in dummification: set one level as the reference and use ones to codify the departures of the remaining levels from it.
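The dummification R performs internally can be inspected with model.matrix, which returns the design matrix that lm and glm build from a factor. A minimal sketch with a made-up factor C:

```r
# A binary factor with levels "a" (reference) and "b"
C <- factor(c("a", "b", "a", "b", "b"))
model.matrix(~ C)
# The column "Cb" is the dummy variable D: it equals 1 when C = "b" and
# 0 when C = "a" (the reference level, absorbed by the intercept)
```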
The previous interpretation translates easily to regression models. Assume that the dummy variable \(D\) is available together with other predictors \(X_1,\ldots,X_k\). Then:
Linear model \[\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k,D=d]=\beta_0+\beta_1x_1+\ldots+\beta_kx_k+\beta_{k+1}d.\] The coefficient associated to \(D\) is easily interpretable: \(\beta_{k+1}\) is the increment in the mean of \(Y\) associated to changing \(D=0\) (reference) to \(D=1\), while the rest of the predictors are fixed. In other words, \(\beta_{k+1}\) is the increment in the mean of \(Y\) associated to changing the level of the categorical variable from a to b.
Logistic model \[\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k,D=d]=\text{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k+\beta_{k+1}d).\] We have two interpretations of \(\beta_{k+1}\), either in terms of log-odds or odds:
 \(\beta_{k+1}\) is the additive increment in the log-odds of \(Y\) associated to changing the level of the categorical variable from a (reference, \(D=0\)) to b (\(D=1\)).
 \(e^{\beta_{k+1}}\) is the multiplicative increment in the odds of \(Y\) associated to changing the level of the categorical variable from a (reference, \(D=0\)) to b (\(D=1\)).
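A quick numerical check of the odds interpretation, with hypothetical coefficients \(\beta_0=-1\) and \(\beta_{k+1}=1.5\) (any values would do):

```r
# Switching the dummy from 0 to 1 multiplies the odds by exp(beta1)
logistic <- function(x) 1 / (1 + exp(-x))
beta0 <- -1; beta1 <- 1.5  # hypothetical coefficients
p0 <- logistic(beta0)            # P(Y = 1) when D = 0
p1 <- logistic(beta0 + beta1)    # P(Y = 1) when D = 1
odds0 <- p0 / (1 - p0)
odds1 <- p1 / (1 - p1)
odds1 / odds0   # equals exp(beta1)
exp(beta1)
```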
R
does the dummification automatically (translates a categorical variable \(C\) into its dummy version \(D\)) if it detects that a factor variable is present in the regression model. Let’s see an example of this in linear and logistic regression.
# Load the Boston dataset
library(MASS)
data(Boston)
# Structure of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# chas is a dummy variable measuring if the suburb is close to the river (1)
# or not (0). In this case it is not codified as a factor but as a 0 or 1.
# Summary of a linear model
mod <- lm(medv ~ chas + crim, data = Boston)
summary(mod)
## 
## Call:
## lm(formula = medv ~ chas + crim, data = Boston)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -16.540 -5.421 -1.878 2.575 30.134 
## 
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) 23.61403 0.41862 56.409 < 2e-16 ***
## chas 5.57772 1.46926 3.796 0.000165 ***
## crim -0.40598 0.04339 -9.358 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.373 on 503 degrees of freedom
## Multiple R-squared: 0.1744, Adjusted R-squared: 0.1712 
## F-statistic: 53.14 on 2 and 503 DF, p-value: < 2.2e-16
# The coefficient associated to chas is 5.57772. That means that if the suburb
# is close to the river, the mean of medv increases by 5.57772 units.
# chas is significant (the presence of the river adds valuable information
# for explaining medv)
# Create a binary response (1 expensive suburb, 0 inexpensive)
Boston$expensive <- Boston$medv > 25
# Summary of a logistic model
mod <- glm(expensive ~ chas + crim, data = Boston, family = "binomial")
summary(mod)
## 
## Call:
## glm(formula = expensive ~ chas + crim, family = "binomial", data = Boston)
## 
## Deviance Residuals: 
## Min 1Q Median 3Q Max 
## -1.26764 -0.84292 -0.67854 0.00099 2.87470 
## 
## Coefficients:
## Estimate Std. Error z value Pr(>|z|) 
## (Intercept) -0.82159 0.12217 -6.725 1.76e-11 ***
## chas 1.04165 0.36962 2.818 0.00483 ** 
## crim -0.22816 0.05265 -4.333 1.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 513.44 on 503 degrees of freedom
## AIC: 519.44
## 
## Number of Fisher Scoring iterations: 6
# The coefficient associated to chas is 1.04165. That means that if the suburb
# is close to the river, the log-odds of expensive increases by 1.04165.
# Alternatively, the odds of expensive increases by a factor of exp(1.04165).
# chas is significant (the presence of the river adds valuable information
# for explaining expensive)
More than two levels
Let’s see now the case with more than two levels, for example, a categorical variable \(C\) with levels a, b and c. If we take a as the reference level, this variable can be represented by two dummy variables: \[D_1=\left\{\begin{array}{ll}1,&\text{if }C=b,\\0,& \text{if }C\neq b\end{array}\right.\] and \[D_2=\left\{\begin{array}{ll}1,&\text{if }C=c,\\0,& \text{if }C\neq c.\end{array}\right.\] Then \(C=a\) is represented by \(D_1=D_2=0\), \(C=b\) is represented by \(D_1=1,D_2=0\) and \(C=c\) is represented by \(D_1=0,D_2=1\). The interpretation of the regression models in the presence of \(D_1\) and \(D_2\) is very similar to the previous one. For example, for the linear model, the coefficient associated to \(D_1\) gives the increment in the mean of \(Y\) when the category of \(C\) changes from a to b. The coefficient of \(D_2\) gives the increment in the mean of \(Y\) when it changes from a to c.
In general, if we have a categorical variable with \(J\) levels, then the number of dummy variables required is \(J-1\). Again, R
does the dummification automatically for you if it detects that a factor variable is present in the regression model.
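A quick way to see this dummification is to inspect the design matrix that R builds internally. A minimal sketch (the three-level factor `C` below is made up for illustration):

```r
# A hypothetical three-level factor C with levels a, b, c (a is the reference)
C <- factor(c("a", "b", "c", "b", "a"))

# model.matrix() returns the design matrix that lm()/glm() would build:
# an intercept column plus the J - 1 = 2 dummies Cb (D1) and Cc (D2)
M <- model.matrix(~ C)
colnames(M)
## [1] "(Intercept)" "Cb"          "Cc"
```

Observations with `C = a` get `Cb = Cc = 0`, those with `C = b` get `Cb = 1, Cc = 0`, and those with `C = c` get `Cb = 0, Cc = 1`, exactly as described above.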
# Load dataset - factors in the last column
data(iris)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  

# Summary of a linear model
mod1 <- lm(Sepal.Length ~ ., data = iris)
summary(mod1)
## 
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79424 -0.21874  0.00899  0.20255  0.73103 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
## Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
## Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
## Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
## Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
## Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared:  0.8673, Adjusted R-squared:  0.8627 
## F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16

# Speciesversicolor (D1) coefficient: -0.72356. The average increment of
# Sepal.Length when the species is versicolor instead of setosa (reference).
# Speciesvirginica (D2) coefficient: -1.02350. The average increment of
# Sepal.Length when the species is virginica instead of setosa (reference).
# Both dummy variables are significant

# How to set a different level as reference (versicolor)
iris$Species <- relevel(iris$Species, ref = "versicolor")

# Same estimates except for the dummy coefficients
mod2 <- lm(Sepal.Length ~ ., data = iris)
summary(mod2)
## 
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79424 -0.21874  0.00899  0.20255  0.73103 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.44770    0.28149   5.143 8.68e-07 ***
## Sepal.Width       0.49589    0.08607   5.761 4.87e-08 ***
## Petal.Length      0.82924    0.06853  12.101  < 2e-16 ***
## Petal.Width      -0.31516    0.15120  -2.084  0.03889 *  
## Speciessetosa     0.72356    0.24017   3.013  0.00306 ** 
## Speciesvirginica -0.29994    0.11898  -2.521  0.01280 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared:  0.8673, Adjusted R-squared:  0.8627 
## F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16

# Speciessetosa (D1) coefficient: 0.72356. The average increment of
# Sepal.Length when the species is setosa instead of versicolor (reference).
# Speciesvirginica (D2) coefficient: -0.29994. The average increment of
# Sepal.Length when the species is virginica instead of versicolor (reference).
# Both dummy variables are significant

# Confidence intervals for the coefficients
confint(mod2)
##                       2.5 %      97.5 %
## (Intercept)       0.8913266  2.00408209
## Sepal.Width       0.3257653  0.66601260
## Petal.Length      0.6937939  0.96469395
## Petal.Width      -0.6140049 -0.01630542
## Speciessetosa     0.2488500  1.19827390
## Speciesvirginica -0.5351144 -0.06475727

# The Speciessetosa coefficient is significantly positive and the
# Speciesvirginica coefficient is significantly negative. Therefore, there
# are significant differences in the mean of Sepal.Length between species
Do not codify a categorical variable as a discrete variable. This is a major methodological error that will flaw the subsequent statistical analysis.
For example, if you have a categorical variable party with levels partyA, partyB and partyC, do not encode it as a discrete variable taking the values 1, 2 and 3, respectively. If you do so:

- You implicitly assume an order in the levels of party, since partyA would be closer to partyB than to partyC.
- You implicitly assume that partyC is three times larger than partyA.
- The codification is completely arbitrary – why not consider 1, 1.5 and 1.75 instead?
The right way of dealing with categorical variables in regression is to set the variable as a factor and let R do the dummification internally.
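The difference is easy to see on made-up data (the party factor and the response y below are hypothetical): the numeric codification forces a single slope along the arbitrary 1, 2, 3 scale, while the factor codification estimates a free effect for each non-reference level.

```r
set.seed(42)
party <- factor(rep(c("partyA", "partyB", "partyC"), each = 10))
y <- rnorm(30, mean = rep(c(0, 5, 1), each = 10))  # partyC is NOT 3x partyA

# Wrong: arbitrary numeric codification (partyA = 1, partyB = 2, partyC = 3)
modWrong <- lm(y ~ as.numeric(party))  # a single forced linear trend

# Right: keep the factor and let lm() dummify it internally
modRight <- lm(y ~ party)              # one free effect per non-reference level

c(length(coef(modWrong)), length(coef(modRight)))
## [1] 2 3
```

The wrong model cannot capture that partyB has the largest mean, since it is constrained to a monotone trend along the arbitrary codes.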
Multinomial logistic regression
The logistic model can be generalized to categorical variables \(Y\) with more than two possible levels, namely \(\{1,\ldots,J\}\). Given the predictors \(X_1,\ldots,X_k\), multinomial logistic regression models the probability of each level \(j\) of \(Y\) by \[\begin{align}p_j(\mathbf{x})=\mathbb{P}[Y=j|X_1=x_1,\ldots,X_k=x_k]=\frac{e^{\beta_{0j}+\beta_{1j}X_1+\ldots+\beta_{kj}X_k}}{1+\sum_{l=1}^{J-1}e^{\beta_{0l}+\beta_{1l}X_1+\ldots+\beta_{kl}X_k}} \tag{C.1}\end{align}\] for \(j=1,\ldots,J-1\) and, for the last level \(J\), \[\begin{align}p_J(\mathbf{x})=\mathbb{P}[Y=J|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+\sum_{l=1}^{J-1}e^{\beta_{0l}+\beta_{1l}X_1+\ldots+\beta_{kl}X_k}}. \tag{C.2}\end{align}\] Note that (C.1) and (C.2) imply that \(\sum_{j=1}^J p_j(\mathbf{x})=1\) and that there are \((J-1)\times(k+1)\) coefficients (\(J-1\) intercepts and \((J-1)\times k\) slopes). Also, (C.2) reveals that the last level, \(J\), receives a different treatment: it is the reference level (it could be a different one, but it is tradition to choose the last).
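Equations (C.1) and (C.2) are straightforward to evaluate by hand, which serves as a sanity check. The coefficients below are hypothetical, chosen only to exercise the formulas for \(J=3\) and \(k=2\):

```r
# Row j holds (beta_{0j}, beta_{1j}, beta_{2j}), j = 1, ..., J - 1
beta <- rbind(c(0.5, 1.0, -0.3),
              c(-0.2, 0.4, 0.1))
x <- c(1.5, -0.7)  # a point (x_1, x_2)

# Linear predictors beta_{0j} + beta_{1j} * x_1 + beta_{2j} * x_2
eta <- beta %*% c(1, x)

# (C.1) for j = 1, ..., J - 1 and (C.2) for the reference level J
p <- c(exp(eta), 1) / (1 + sum(exp(eta)))
sum(p)  # the J probabilities add up to 1
## [1] 1
```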
The multinomial logistic model has an interesting interpretation in terms of logistic regressions. Taking the quotient between (C.1) and (C.2) gives \[\begin{align}\frac{p_j(\mathbf{x})}{p_J(\mathbf{x})}=e^{\beta_{0j}+\beta_{1j}X_1+\ldots+\beta_{kj}X_k}\tag{C.3}\end{align}\] for \(j=1,\ldots,J-1\). Therefore, applying a logarithm to both sides we have \[\begin{align}\log\frac{p_j(\mathbf{x})}{p_J(\mathbf{x})}=\beta_{0j}+\beta_{1j}X_1+\ldots+\beta_{kj}X_k.\tag{C.4}\end{align}\] This equation is indeed very similar to (4.7). If \(J=2\), it is the same up to a change in the codes for the levels: the logistic regression giving the probability of \(Y=1\) versus \(Y=2\). On the LHS of (C.4) we have the log of the ratio of two probabilities and on the RHS a linear combination of the predictors. If the probabilities on the LHS were complementary (if they added to one), then we would have a log-odds and hence a logistic regression for \(Y\). This is not the situation, but it is close: instead of odds and log-odds, we have ratios and log-ratios of non-complementary probabilities. It also gives a good insight into what multinomial logistic regression is: a set of \(J-1\) independent “logistic regressions” for the probability of \(Y=j\) versus the probability of the reference \(Y=J\).
Equation (C.3) also gives an interpretation of the coefficients of the model, since \[p_j(\mathbf{x})=e^{\beta_{0j}+\beta_{1j}X_1+\ldots+\beta_{kj}X_k}p_J(\mathbf{x}).\] Therefore:

- \(e^{\beta_{0j}}\): the ratio \(p_j(\mathbf{0})/p_J(\mathbf{0})\) between the probabilities of \(Y=j\) and \(Y=J\) when \(X_1=\ldots=X_k=0\). If \(e^{\beta_{0j}}>1\) (equivalently, \(\beta_{0j}>0\)), then \(Y=j\) is more likely than \(Y=J\). If \(e^{\beta_{0j}}<1\) (\(\beta_{0j}<0\)), then \(Y=j\) is less likely than \(Y=J\).
- \(e^{\beta_{lj}}\), \(l\geq1\): the multiplicative increment of the ratio \(p_j(\mathbf{x})/p_J(\mathbf{x})\) for an increment of one unit in \(X_l=x_l\), provided that the remaining variables \(X_1,\ldots,X_{l-1},X_{l+1},\ldots,X_k\) do not change. If \(e^{\beta_{lj}}>1\) (equivalently, \(\beta_{lj}>0\)), then \(Y=j\) becomes more likely than \(Y=J\) for each increment in \(X_l\). If \(e^{\beta_{lj}}<1\) (\(\beta_{lj}<0\)), then \(Y=j\) becomes less likely than \(Y=J\).
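Both interpretations can be checked numerically. The sketch below uses hypothetical coefficients for \(J=3\) and a single predictor: increasing \(X_1\) by one unit multiplies the ratio \(p_1(\mathbf{x})/p_J(\mathbf{x})\) by exactly \(e^{\beta_{11}}\).

```r
# p_j(x) from (C.1)-(C.2); beta is a (J - 1) x 2 matrix of made-up values
probs <- function(x, beta) {
  eta <- beta[, 1] + beta[, 2] * x
  c(exp(eta), 1) / (1 + sum(exp(eta)))
}
beta <- rbind(c(0.5, 0.8),    # beta_{01}, beta_{11}
              c(-0.2, -0.4))  # beta_{02}, beta_{12}

p0 <- probs(x = 2, beta)      # probabilities at x
p1 <- probs(x = 3, beta)      # probabilities at x + 1

# The multiplicative increment of p_1 / p_J equals exp(beta_{11})
(p1[1] / p1[3]) / (p0[1] / p0[3])
## [1] 2.225541
exp(0.8)
## [1] 2.225541
```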
The following code illustrates how to compute a basic multinomial regression in R
.
# Package included in R that implements multinomial regression
library(nnet)

# Data from the voting intentions in the 1988 Chilean national plebiscite
data(Chile)
summary(Chile)
##  region     population         sex           age         education  
##  C :600   Min.   :  3750   F:1379   Min.   :18.00   P   :1107  
##  M :100   1st Qu.: 25000   M:1321   1st Qu.:26.00   PS  : 462  
##  N :322   Median :175000            Median :36.00   S   :1120  
##  S :718   Mean   :152222            Mean   :38.55   NA's:  11  
##  SA:960   3rd Qu.:250000            3rd Qu.:49.00              
##           Max.   :250000            Max.   :70.00              
##                                     NA's   :1                  
##      income          statusquo         vote     
##  Min.   :  2500   Min.   :-1.80301   A   :187  
##  1st Qu.:  7500   1st Qu.:-1.00223   N   :889  
##  Median : 15000   Median :-0.04558   U   :588  
##  Mean   : 33876   Mean   : 0.00000   Y   :868  
##  3rd Qu.: 35000   3rd Qu.: 0.96857   NA's:168  
##  Max.   :200000   Max.   : 2.04859             
##  NA's   :98       NA's   :17                   

# vote is a factor with levels A (abstention), N (against Pinochet),
# U (undecided), Y (for Pinochet)

# Fit of the model done by multinom: Response ~ Predictors
# It is an iterative procedure (maxit sets the maximum number of iterations)
# Read the documentation in ?multinom for more information
mod1 <- multinom(vote ~ age + education + statusquo, data = Chile, maxit = 1e3)
## # weights:  24 (15 variable)
## initial  value 3476.826258 
## iter  10 value 2310.201176
## iter  20 value 2135.385060
## final  value 2132.416452 
## converged

# Each row of coefficients gives the coefficients of the logistic
# regression of a level versus the reference level (A)
summary(mod1)
## Call:
## multinom(formula = vote ~ age + education + statusquo, data = Chile, 
##     maxit = 1000)
## 
## Coefficients:
##   (Intercept)         age educationPS educationS  statusquo
## N   0.3002851 0.004829029   0.4101765 -0.1526621 -1.7583872
## U   0.8722750 0.020030032  -1.0293079 -0.6743729  0.3261418
## Y   0.5093217 0.016697208  -0.4419826 -0.6909373  1.8752190
## 
## Std. Errors:
##   (Intercept)         age educationPS educationS statusquo
## N   0.3315229 0.006742834   0.2659012  0.2098064 0.1292517
## U   0.3183088 0.006630914   0.2822363  0.2035971 0.1059440
## Y   0.3333254 0.006915012   0.2836015  0.2131728 0.1197440
## 
## Residual Deviance: 4264.833 
## AIC: 4294.833

# Set a different level as the reference (N) to ease interpretation
Chile$vote <- relevel(Chile$vote, ref = "N")
mod2 <- multinom(vote ~ age + education + statusquo, data = Chile, maxit = 1e3)
## # weights:  24 (15 variable)
## initial  value 3476.826258 
## iter  10 value 2393.713801
## iter  20 value 2134.438912
## final  value 2132.416452 
## converged
summary(mod2)
## Call:
## multinom(formula = vote ~ age + education + statusquo, data = Chile, 
##     maxit = 1000)
## 
## Coefficients:
##   (Intercept)         age educationPS educationS statusquo
## A  -0.3002035 -0.00482911  -0.4101274  0.1525608  1.758307
## U   0.5720544  0.01519931  -1.4394862 -0.5217093  2.084491
## Y   0.2091397  0.01186576  -0.8521205 -0.5382716  3.633550
## 
## Std. Errors:
##   (Intercept)         age educationPS educationS statusquo
## A   0.3315153 0.006742654   0.2658887  0.2098012 0.1292494
## U   0.2448452 0.004819103   0.2116375  0.1505854 0.1091445
## Y   0.2850655 0.005700894   0.2370881  0.1789293 0.1316567
## 
## Residual Deviance: 4264.833 
## AIC: 4294.833
exp(coef(mod2))
##   (Intercept)       age educationPS educationS statusquo
## A   0.7406675 0.9951825   0.6635657  1.1648133  5.802607
## U   1.7719034 1.0153154   0.2370495  0.5935052  8.040502
## Y   1.2326171 1.0119364   0.4265095  0.5837564 37.846937

# Some highlights:
# - intercepts do not have too much interpretation (correspond to age = 0).
#   A possible solution is to center age by its mean (so age = 0 would
#   represent the mean of the ages)
# - both age and statusquo increase the probability of voting Y, A or U
#   with respect to voting N -> conservativeness increases with age
# - both age and statusquo increase more the probability of voting Y and U
#   than A -> elderly and status quo supporters more decided to participate
# - a PS level of education increases the probability of voting N. Same for
#   an S level of education, but more prone to A

# Prediction of votes - three profiles of voters
newdata <- data.frame(age = c(23, 40, 50),
                      education = c("PS", "S", "P"),
                      statusquo = c(-1, 0, 2))

# Probabilities of belonging to each class
predict(mod2, newdata = newdata, type = "probs")
##             N           A          U          Y
## 1 0.856057623 0.064885869 0.06343390 0.01562261
## 2 0.208361489 0.148185871 0.40245842 0.24099422
## 3 0.000288924 0.005659661 0.07076828 0.92328313

# Predicted class
predict(mod2, newdata = newdata, type = "class")
## [1] N U Y
## Levels: N A U Y
Multinomial logistic regression may suffer from numerical instabilities, and its iterative fitting algorithm might even fail to converge, if the levels of the categorical variable are very separated by the predictors (e.g., two clearly separated data clouds, each corresponding to a different level of the categorical variable).
The multinomial model employs \((J-1)\times(k+1)\) parameters. It is easy to end up with complex models – that require a large sample size to be fitted properly – if the response has more than a few levels and there are several predictors. For example, with 5 levels and 8 predictors we will have 36 parameters. Estimating this model with 50-100 observations will probably result in overfitting.
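The count is quick to verify. For instance, the Chile fit above has \(J=4\) levels and \(k=4\) slopes after dummification, giving the 15 parameters reported in the '15 variable' message of multinom():

```r
# Number of parameters of a multinomial logistic model
nParams <- function(J, k) (J - 1) * (k + 1)

nParams(J = 4, k = 4)  # the Chile model: 3 intercepts + 12 slopes
## [1] 15
nParams(J = 5, k = 8)  # 5 levels and 8 predictors: already complex
## [1] 36
```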
Reporting with R and R Commander

A nice feature of R Commander is that it integrates seamlessly with R Markdown, which is able to create .html, .pdf and .docx reports directly from the outputs of R. Depending on the kind of report that we want, we will need the following auxiliary software:
- .html. No extra software is required.
- .docx and .rtf. You must install Pandoc, a document converter software. Download it here.
- .pdf (only recommended for experts). An installation of LaTeX, in addition to Pandoc, is needed. Download LaTeX here.
The workflow is simple. Once you have done some statistical analysis, either by using R Commander’s menus or R code directly, you will end up with an R script, on the 'R Script' tab, that contains all the commands you have run so far. Switch then to the 'R Markdown' tab and you will see the same commands in a different layout, which essentially encapsulates the code into chunks delimited by ```{r} and ```. Clicking on the 'Generate report' button will then generate the report.
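As an illustration of that chunk layout (title and contents hypothetical), the 'R Markdown' tab wraps each command into chunks similar to:

````markdown
Analysis of the Boston dataset

```{r}
library(MASS)
data(Boston)
summary(Boston$medv)
```
````

Everything between the ```{r} and ``` delimiters is run as R code when the report is generated, and its output is inserted into the final document.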
Let’s illustrate this process through an example. Suppose we were analyzing the Boston dataset, as we did in Section 3.1.2. Ideally, our final script would be something like this:
# A simple and non-exhaustive analysis for the price of the houses in the
# Boston dataset. The purpose is to quantify, by means of a multiple linear
# model, the effect of 13 variables on the price of a house in the suburbs
# of Boston.

# Import data
library(MASS)
data(Boston)

# Make a multiple linear regression of medv on the rest of the variables
mod <- lm(medv ~ ., data = Boston)
summary(mod)

# Check the linearity assumption
plot(mod, 1) # Clear nonlinearity

# Let's consider the transformations given in Harrison and Rubinfeld (1978)
modTransf <- lm(I(log(medv * 1000)) ~ I(rm^2) + age + log(dis) + log(rad) +
                  tax + ptratio + I(black / 1000) + I(log(lstat / 100)) +
                  crim + zn + indus + chas + I((10 * nox)^2), data = Boston)
summary(modTransf)

# The nonlinearity is more subtle now
plot(modTransf, 1)

# Look for the best model in terms of the BIC
# (stepwise comes with the RcmdrMisc package, loaded by R Commander)
modTransfBIC <- stepwise(modTransf)
summary(modTransfBIC)