Chapter 6 Lab 4 - 29/03/2022
In this lecture we will learn how to implement the K-nearest neighbors (KNN) method for classification and regression problems.
The following packages are required: class
, FNN
and tidyverse
.
library(class)
library(FNN)
##
## Attaching package: 'FNN'
## The following objects are masked from 'package:class':
##
## knn, knn.cv
library(tidyverse)
6.1 KNN for regression problems
For KNN regression we will use data regarding bike sharing (link). The data are stored in the file named bikesharing.csv
which is available in the e-learning. The data regard the bike sharing counts aggregated on daily basis.
We start by importing the data. Please, note that in the following code my file path is shown; obviously for your computer it will be different:
<- read.csv("./files/bikesharing.csv") bikesharing
glimpse(bikesharing)
## Rows: 731
## Columns: 16
## $ instant <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday <fct> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 2011-01-05,…
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ weekday <int> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
## $ workingday <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ weathersit <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
## $ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
## $ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
## $ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
## $ casual <int> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
## $ registered <int> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
## $ cnt <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…
The dataset is composed by 731 rows (i.e. days) and 16 variables. We are in particular interested in the following quantitative variables (in this case \(p=3\) regressors):
atemp
: normalized feeling temperature in Celsiushum
: normalized humiditywindspeed
: normalized wind speedcnt
: count of total rental bikes (response variable)
The response variable cnt
can be summarized as follows:
summary(bikesharing$cnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22 3152 4548 4504 5956 8714
%>%
bikesharing ggplot()+
geom_histogram(aes(cnt),bins=10,col="white")
Obviously, the number of bike rentals depends strongly on weather conditions. In the following plot we represent cnt
as a function of atemp
and windspeed
, specifying a scale color from yellow to dark blue with 12 levels:
%>%
bikesharing ggplot()+
geom_point(aes(x=atemp,y=windspeed,col=cnt))+
scale_colour_gradientn(colours = rev(hcl.colors(12)))
It can be observed that the number of bike rentals increases with temperature.
6.1.1 Creation of the training and testing set: method 1
We divide the dataset into 3 subsets:
- training set: containing 70% of the observations
- validating set: containing 15% of the observations
- test set: containing 15% of the observations
The procedure is done randomly by using the function sample
. In order to have reproducible results, it is necessary to set the seed.
set.seed(1, sample.kind = "Rejection")
# Sample the indexes for the training observations
<- sample(1:3,
index size=nrow(bikesharing),
prob=c(0.7,0.15,0.15),
replace=T) # the same unit (1,2 or 3) must be resampled!
head(index)
## [1] 1 1 1 2 1 2
# Create a new dataframe selecting the training observations
<- bikesharing[index==1,]
bike_training head(bike_training)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 8 8 2011-01-08 1 0 1 0 6 0 2
## 9 9 2011-01-09 1 0 1 0 0 0 1
## temp atemp hum windspeed casual registered cnt
## 1 0.344167 0.363625 0.805833 0.160446 331 654 985
## 2 0.363478 0.353739 0.696087 0.248539 131 670 801
## 3 0.196364 0.189405 0.437273 0.248309 120 1229 1349
## 5 0.226957 0.229270 0.436957 0.186900 82 1518 1600
## 8 0.165000 0.162254 0.535833 0.266804 68 891 959
## 9 0.138333 0.116175 0.434167 0.361950 54 768 822
# Create a new test dataframe selecting the validation observations
<- bikesharing[index==2,]
bike_validation head(bike_validation)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## 7 7 2011-01-07 1 0 1 0 5 1 2
## 18 18 2011-01-18 1 0 1 0 2 1 2
## 21 21 2011-01-21 1 0 1 0 5 1 1
## 29 29 2011-01-29 1 0 1 0 6 0 1
## temp atemp hum windspeed casual registered cnt
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606
## 7 0.196522 0.208839 0.498696 0.1687260 148 1362 1510
## 18 0.216667 0.232333 0.861667 0.1467750 9 674 683
## 21 0.177500 0.157833 0.457083 0.3532420 75 1468 1543
## 29 0.196522 0.212126 0.651739 0.1453650 123 975 1098
# Create a new test dataframe selecting the test observations
<- bikesharing[index==3,]
bike_test head(bike_test)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 15 15 2011-01-15 1 0 1 0 6 0 2
## 17 17 2011-01-17 1 0 1 1 1 0 2
## 20 20 2011-01-20 1 0 1 0 4 1 2
## 35 35 2011-02-04 1 0 2 0 5 1 2
## 37 37 2011-02-06 1 0 2 0 0 0 1
## 39 39 2011-02-08 1 0 2 0 2 1 1
## temp atemp hum windspeed casual registered cnt
## 15 0.233333 0.248112 0.498750 0.157963 222 1026 1248
## 17 0.175833 0.176771 0.537500 0.194017 117 883 1000
## 20 0.261667 0.255050 0.538333 0.195904 83 1844 1927
## 35 0.211304 0.228587 0.585217 0.127839 88 1620 1708
## 37 0.285833 0.291671 0.568333 0.141800 354 1269 1623
## 39 0.220833 0.198246 0.537917 0.361950 64 1466 1530
The validation set will be used to tune the model, i.e. to choose the best value of the tuning parameter (the number of neighbors) which minimizes the validation error. Finally, we will rejoin together training and validation datasets and we will calculate the Test MSE.
6.1.2 Implementation of KNN regression with \(k=1\)
The function used to implement KNN regression is knn.reg
from the FNN
package (see ?knn.reg
). We start by considering the case when \(k=1\), corresponding to the most flexible model, when only one neighbor is considered.
In the argument train
and test
of the knn.reg
function it is necessary to provide the 3 regressors from the training and test dataset. It would be possible to create two new objects (dataframes) containing this selection of variables. In this case however we select directly the necessary variables from the existing dataframes (bike_training
and bike_validation
) by using the function select
from the dplyr
package. Finally, the arguments y
and k
refer to the training response variable vector and the number of neighbors, respectively.
= knn.reg(train = dplyr::select(bike_training,atemp,windspeed,hum),
KNN1 test = dplyr::select(bike_validation,atemp,windspeed,hum),
y = bike_training$cnt,
k = 1)
The object KNN1
contains different objects
names(KNN1)
## [1] "call" "k" "n" "pred" "residuals" "PRESS"
## [7] "R2Pred"
We are in particular interested in the object pred
which contains the predictions \(\hat y_0\) for the test observations. It is then possible to compute the test mean square error (MSE):
\[
\text{mean}(y_0-\hat y_0)^2
\]
where average is computed over all the observations in the validation set.
# Compute the MSE
mean((bike_validation$cnt - KNN1$pred)^2)
## [1] 3124805
It is also possible to plot observed and predicted values. To do this we first create a new dataframe containing the observed values (\(y_0\)) and the KNN predictions (\(\hat y_0\)):
= data.frame(obs=bike_validation$cnt,
pred_df_k1 pred=KNN1$pred)
head(pred_df_k1)
## obs pred
## 1 1562 2431
## 2 1606 1746
## 3 1510 1204
## 4 683 2177
## 5 1543 822
## 6 1098 1985
Then we represent graphically the observed and predicted values and assign to the points a color given by the values of the squared error:
<- pred_df_k1 %>%
plot1 ggplot() +
geom_point(aes(x=obs,y=pred,col=(obs-pred)^2)) +
scale_color_gradient(low="blue",high="red")
Blue observations are characterized by a smaller squared error meaning that predictions are close to observed values.
6.1.3 Implementation of KNN regression with different values of \(k\)
We now use a for
loop to implement automatically the KNN regression for different values of \(k\) . In particular, we consider the values 1, 10, 25, 50,100, 200, 500 and the maximum available observation that can be used for our training algorithm ( = the entire training dataset records). Each step of the loop, indexed by a variable i
, considers a different value of \(k\). We want to save in a vector all the values of the test MSE.
# Create an empty vector where values of the MSE will be saved
= c()
mse_vec # Create a vector with the values of k
= c(1, 10, 25, 50, 100, 200, 500, nrow(bike_training))
k_vec # Start the for loop
for(i in 1:length(k_vec)){
# Run KNN
= knn.reg(train = dplyr::select(bike_training,atemp, windspeed,hum),
KNN test = dplyr::select(bike_validation,atemp, windspeed,hum),
y = bike_training$cnt,
k = k_vec[i])
# Save the MSE
= mean((bike_validation$cnt-KNN$pred)^2)
mse_vec[i]
}
The value of the test MSE are the following:
mse_vec
## [1] 3124805 1922806 1912324 1890743 1991518 2243669 3638538 3705685
It is useful to plot the values of the test MSE as a function of \(k\):
plot(k_vec,mse_vec,type="b")
Note the U-shape of the plotted line which is typical of the test error.
The value of \(k\) which minimizes the MSE is in position 4 and is given by \(k\) equal to 50:
# Smallest MSE
= min(mse_vec)
MSE_KNN MSE_KNN
## [1] 1890743
# Its position
which.min(mse_vec)
## [1] 4
# Corresponding value of k
= k_vec[which.min(mse_vec)]
kbest kbest
## [1] 50
6.1.4 Assessment of the tuned model
Now we can finally calculate the Test MSE with the selected parameter:
<- rbind(bike_validation, bike_training) # recall to rejoin together test + validation
bike_newtraining
<- knn.reg(train = dplyr::select(bike_newtraining, atemp, hum, windspeed),
KNNbest test = dplyr::select(bike_test, atemp, hum, windspeed),
y = bike_newtraining$cnt,
k=kbest)
<- mean((bike_test$cnt - KNNbest$pred)^2)
MSE_KNN_final MSE_KNN_final
## [1] 1725030
6.1.5 Comparison of KNN with the multiple linear model
We consider for the same set of data also the multiple linear model with 3 regressors: \[ Y = \beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3 +\epsilon \]
This can be implemented by using the lm
function applied to the training set of data:
= lm(cnt ~ atemp+windspeed+hum,
modlm data = bike_training)
summary(modlm)
##
## Call:
## lm(formula = cnt ~ atemp + windspeed + hum, data = bike_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4846.9 -1006.8 -127.4 1128.9 3545.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3877.8 399.8 9.699 < 2e-16 ***
## atemp 7530.4 397.4 18.950 < 2e-16 ***
## windspeed -5010.5 845.6 -5.925 5.75e-09 ***
## hum -3152.7 449.8 -7.009 7.67e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1418 on 508 degrees of freedom
## Multiple R-squared: 0.4665, Adjusted R-squared: 0.4633
## F-statistic: 148.1 on 3 and 508 DF, p-value: < 2.2e-16
The corresponding predictions and test MSE can be obtained as follows by making use of the predict
function:
= predict(modlm,
predlm newdata = bike_test)
= mean((bike_test$cnt-predlm)^2)
MSE_lm MSE_lm
## [1] 2094849
Note that the test MSE of the linear regression model is higher than the KNN MSE with \(k=50\).
6.1.6 Comparison of KNN with the multiple linear model with quadratic terms
For introducing some flexibility we try to extend the previous linear model by including quadratic terms of the regressors: \[ Y = \beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3 + \beta_4 X_1^2 + \beta_5 X_2^2 + \beta_6 X_3^2+\epsilon \]
This is still a linear model in the parameters (not in the variables). Polynomial regression models are able to produce non-linear fitted functions.
This model can be implemented by using the poly
function (with degree=2
in this case) in the formula specification. This specifies a model including \(X\) and \(X^2\).
= lm(cnt ~ poly(atemp,2) +
modlmpoly poly(windspeed,2)+
poly(hum,2), data=bike_training)
summary(modlmpoly)
##
## Call:
## lm(formula = cnt ~ poly(atemp, 2) + poly(windspeed, 2) + poly(hum,
## 2), data = bike_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2842.4 -956.0 -117.9 1059.5 2893.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4522.47 55.23 81.886 < 2e-16 ***
## poly(atemp, 2)1 26839.05 1289.41 20.815 < 2e-16 ***
## poly(atemp, 2)2 -14526.42 1297.14 -11.199 < 2e-16 ***
## poly(windspeed, 2)1 -7517.18 1325.41 -5.672 2.38e-08 ***
## poly(windspeed, 2)2 -1259.03 1274.83 -0.988 0.324
## poly(hum, 2)1 -13058.60 1324.06 -9.863 < 2e-16 ***
## poly(hum, 2)2 -7814.03 1309.96 -5.965 4.60e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1250 on 505 degrees of freedom
## Multiple R-squared: 0.588, Adjusted R-squared: 0.5831
## F-statistic: 120.1 on 6 and 505 DF, p-value: < 2.2e-16
As before we compute the test MSE:
= predict(modlmpoly,
predlmpoly newdata = bike_test)
= mean((bike_test$cnt-predlmpoly)^2)
MSE_lm2 MSE_lm2
## [1] 1872116
6.1.7 Final comparison
We can now compare the MSE of the best KNN model and of the 2 linear regression models:
MSE_KNN_final
## [1] 1725030
MSE_lm
## [1] 2094849
MSE_lm2
## [1] 1872116
Results show that the multiple linear regression model has the worst performance. The polynomial model and the KNN with \(k=50\) have a similar performance.
6.2 KNN for classification problems
For KNN classification we will use the iris
dataset which is included in R (see ?iris
). It contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
The variable Species
is the response categorical variables with 3 categories.
As usual we start with visualizing the data with a plot. In this case we represent Species
as a function of Sepal.Length
and Sepal.Width
:
%>%
iris ggplot()+
geom_point(aes(Sepal.Length, Sepal.Width,col=Species))+
geom_smooth(aes(Sepal.Length, Sepal.Width,col=Species))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The following plot instead considers Species
as a function of Petal.Length
and Petal.Width
by using separate plots:
%>%
iris ggplot()+
geom_point(aes(Petal.Length, Petal.Width,col=Species))+
facet_wrap(~Species)
We split the dataset into 2 subsets. In other words, we are not properly tuning our algorithm, mainly because we don’t have a lot of data. In particular, we assume to know that the best value of the tuning parameter is \(k=1\) (and no tuning is required). The dataset will be:
- training set: containing 70% of the observations
- test set: containing 30% of the observations
set.seed(1, sample.kind="Rejection")
= sample(1:2,
index nrow(iris),
replace = TRUE,
prob = c(0.7, 0.3)
)
table(index)
## index
## 1 2
## 106 44
head(index)
## [1] 1 1 1 2 1 2
# create 2 separate dataframes
= iris[index == 1, ]
iris_training = iris[index == 2, ] iris_test
6.2.1 KNN classification with \(k=1\)
We now implement KNN classification with \(k=1\) by using the knn
function from the class
package (see ?knn
). Note that there are two functions both named knn
in the class
and FNN
package. In order to specify that we want to use the function in the class
package we use class::knn
. As before we have to specify the regressor dataframe for the training and test observations and the response variable vector (cl
). The option prob
makes it possible to get the probability of the majority class.
= class::knn(train = dplyr::select(iris_training,-Species),
knniris test = dplyr::select(iris_test,-Species),
cl = iris_training$Species,
k = 1, #just 1 neighbors
prob=T) # we are interest in having a look at the winning class probabilities
knniris
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] versicolor versicolor virginica virginica virginica virginica
## [37] virginica virginica virginica versicolor virginica virginica
## [43] virginica virginica
## attr(,"prob")
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1 1 1
## Levels: setosa versicolor virginica
Predictions are contained in the knniris
object; in this case all the probabilities are equal to one because just 1 neighbor is considered (and thus the majority class will have a 100% probability).
To compare predictions \(\hat y_0\) and observed values \(y_0\) it is useful to create the confusion matrix which is a double-entry matrix with predicted and observed categories and the corresponding frequencies:
table(pred=knniris, obs=iris_test$Species)
## obs
## pred setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 16 1
## virginica 0 0 11
In the main diagonal we can read the number of correctly classified observations. Outside from the main diagonal we have the number of missclassified observations.
The percentage test error rate is computed as follows:
#test error rate
mean(knniris!=iris_test$Species)*100
## [1] 2.272727
In this case we see that 2.27% of the observations are missclassified.
6.3 Exercises Lab 4
6.3.1 Exercise 1
Use the Weekly
data set contained in the ISLR
package. See ?Weekly
.
library(ISLR)
library(tidyverse)
?Weeklyhead(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
glimpse(Weekly)
## Rows: 1,089
## Columns: 9
## $ Year <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, …
## $ Lag1 <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0…
## $ Lag2 <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0…
## $ Lag3 <dbl> -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -…
## $ Lag4 <dbl> -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, …
## $ Lag5 <dbl> -3.484, -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514,…
## $ Volume <dbl> 0.1549760, 0.1485740, 0.1598375, 0.1616300, 0.1537280, 0.154…
## $ Today <dbl> -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0.041, 1…
## $ Direction <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up…
How many data are available for each year? What is the temporal frequency of the data?
Produce the boxplot which represents
Today
returns as a function ofYear
. Do you observe any difference across years?Provide the scatterplot matrix for
Today
andLag1
. Do you observe strong correlations?Plot the distribution of the variable
Direction
(with a barplot) for year 2010. Use percentage and not absolute frequencies.Plot the distribution of the variable
Direction
(with a barplot) separately for all the available years. Use percentage and not absolute frequencies. Attention: percentages should be computed by considering the number of observations available for each year and not the total number of observations.Consider for training the data available from 1990 to 2008, for validation the data for year 2009, and for testing the data for year 2010. The variables
Lag1
,Lag2
andLag3
are the predictors, whileDirection
is the response variable. How many data do you have in each set?Compute the validation error rate for the KNN classifier with \(k=1\).
Compute the validation error rate for the KNN classifier with \(k\) from 1 to 50. Use the
for
loop and save the results in a vector.Plot the validation error rate as a function of \(k\). Which value of \(k\) do you suggest to use?
Given the chosen value of \(k\), compute the predictions for the test set by using for training the remaining data. Which is the test error rate?
6.3.2 Exercise 2
In this exercise we will develop a classifier to predict if a given car gets high or low gas mileage using the Auto
dataset. The data are included in the ISLR
package (see Auto
).
library(ISLR)
?Auto
- Explore the data.
- Create using the
ifelse
function a binary variable, namedmpg01
, that is equal to 1 ifmpg
is bigger than its median, and a 0 otherwise. Add the variablempg01
to the existing dataframe. How many observations are classified with 1? - Study how
cylinders
,weight
,displacement
andhorsepower
vary according to the factormpg01
. - Consider as training set the cars released in even years. Use as regressor the following variables:
cylinders
,weight
,displacement
andhorsepower
. How many data do you have in the training and test datasets? Suggestion: use the modulus operator%%
to check if a year is even (e.g.4%%2
and5%%2
). - Run KNN with \(k=3\) (best tuned value). Which is the value of the test error rate? Provide also the confusion matrix.
6.3.3 Exercise 3
Consider the Carseats
data set available from the ISLR
library. Read the help to understand which variables are available (and if they are qualitative or quantitative).
library(ISLR)
library(tidyverse)
?Carseats
Plot
Sales
as a function ofPrice
.Split the dataset in 3 separate sets: the training set with 70% of the observations, 15% for validation and 30% for the test. Use 14 as seed.
Implement KNN regression to predict
Sales
usingPrice
. Consider all the values of \(k\) between 1 and 200. Which is the best value of \(k\) in terms of validation error?Given the best value of \(k\) chosen before, compute the test MSE using as training data the remaining data.
Fit a regression model to predict
Sales
usingPrice
(the validation set is not required). Compute the test MSE.Compare the regression model and the KNN regression (with the chosen value of \(k\)). Which is the best model?
6.3.4 Exercise 4
This exercise uses simulated data.
Use the
rnorm
function to create a vector namedx
containing \(n=100\) values simulated from a N(0,1) distribution. Set the seed equal to 1. These values represent the predictor’s values.Use the
rnorm
function to create a vector namedeps
containing \(n=100\) values simulated from a N(0,0.025) distribution (variance=0.025). Set the seed equal to 2. These values represent the error’s values (\(\epsilon\)).Using
x
andeps
, obtain a vectory
(the response values) according to the following (true) model. Create also a data frame containing the values ofx
andy
. \[ Y = -1+0.5X+\epsilon \]Which is the length of
y
? What are the values of \(\beta_0\) and \(\beta_1\) in this model?Plot the simulated data.
Fit a least squares linear model to predict
y
usingx
. Compare the true values of the parameters with the corresponding estimates.Display the least squares line in the scatterplot obtained before. Add to the plot also the straight line corresponding to the true model (suggestion: use
geom_abline
).
Fit a polynomial regression model to predict
y
usingx
andx^2
. Is there evidence that the quadratic term improves the model fit?Repeat points 1.-3. changing the data generation process in such a way that there is less noise in the data (the values of
x
don’t change). Consider for example a variance for the error equal to 0.001 (use a new name for the response data frame, for exampledata2
). Describe your results.Repeat points 1.-3. changing the data generation process in such a way that there is more noise in the data (the values of
x
don’t change). Consider for example a variance for the error equal to 1 (use a new name for the response data, for exampledata3
). Describe your results.