1 Supervised and unsupervised learning
1.1 Introduction
The methods in machine learning can be broadly divided into supervised and unsupervised learning methods.
In supervised learning, we have an outcome variable Y (also called output, response, dependent variable) and a set of predictors X (also called features, dependent variables, covariates). The objective is to estimate the function f(X) that connects Y and X. For example,
Y=f(X)+ε
(if Y is categorical we generally consider the association of X and the Pr(Y=k))
Having a dataset with observations of Y and X (called training set), we can use these data to find an estimate ˆf(X) according to some optimisation principle.
If we specify a functional for f(X), such as the linear regression model:f(X)=β0+β1X1+...+βpXp, we say that we are using a parametric method. If instead we allow the data to estimate the functional form of f(X), such as k-nearest neighbourhood regression, we call the method non-parametric.
In unsupervised learning , the goal is to model the underlying structure or distribution in the data without having the outcome Y. In other words, we want to analyse how the features X are clustered or associated. Examples of these methods are principle components analysis and k-means clustering.
1.2 Readings
Read the following chapters of An introduction to statistical learning:
- 2.1 What is statistical learning?
If you are using R for the first time, I encourage you to complete the section
- 2.3 Introduction to R
1.3 R review
Task 1 - Read a dataset and create a new variable
Read the bmd.csv dataset in R. You can download and read the data or use the path https://www.dropbox.com/s/7wjsfdaf0wt2kg2/bmd.csv?dl=1
If you have downloaded the file, you need to substitute the url above by the path to file in your computer, e.g., “c:\my files\bmd.csv”
The variable id is the identification of the subjects in the dataset, and the variable age is their age. What is the age of the subject id=197?
bmd.data$age
is the vector with the ages of all the subjects. Within that
vector we need to find the one where bmd.data$id == 197
. Notice that double
equal sign “==” is a logic statement.
If we write bmd.data$id = 197
we are
assigning the value 197 to the variable id and every subject in the dataset
gets a new value 197 for that variable.
## [1] 69.72845
## [1] 69.72845
## [1] 69.72845
Let’s now create a new variable age.months which is age in months (the original is in years).
# we just need to multiply the original by 12
bmd.data$age.months <- bmd.data$age * 12
# we can use = instead of <-
bmd.data$age.months = bmd.data$age * 12
Finally, let’s recode the variable age into age categories <=50, (50,60], (60,70] and >60.
# we can use the function cut() to
#create the categories
bmd.data$age.cat <- cut(bmd.data$age,
c(0,50,60,70, 100))
# alternative
bmd.data$age.cat2[bmd.data$age<50] <- 1
bmd.data$age.cat2[bmd.data$age>=50 & bmd.data$age<60] <- 2
bmd.data$age.cat2[bmd.data$age>=60 & bmd.data$age<70] <- 3
bmd.data$age.cat2[bmd.data$age>=70] <- 4
#frequencies of the categories
table(bmd.data$age.cat)
##
## (0,50] (50,60] (60,70] (70,100]
## 24 45 48 52
TRY IT YOURSELF:
- Get the ages for all the male subjects, i.e.,
bmd.data$sex == "M"
.
- What is the length of the vector above? Or, in other words, how many subjects are male?
- Using the variables weight_kg and height_cm, compute the body mass index? weight in Kg(heigh in m)2
Task 2 - Histogram
There are several packages specific to graphs
ggplot2
, lattice
, plotly
, highcharter
, sunburstR
, dygraphs
, rgl
,…
The R Graph Gallery is an excellent
learning resource for complex plots and visualisation
We will start with basic functions. Let’s produce the histogram of the variable bmi created in task 1
Let’s edit the title and x-axis label
And we can also plot the distribution for males and females, separately
#histogram for males
hist(bmd.data$bmi[bmd.data$sex=="M" ],
breaks = 10, # number of cutoff for the bars
main= "My 1st histogram",
xlab= "Body Mass Index",
col=rgb(0,0,1,.5)) # color of the bars
#histogram for females
hist(bmd.data$bmi[bmd.data$sex=="F" ],
breaks = 10,
main= "My 1st histogram",
xlab= "Body Mass Index",
col=rgb(1,0,0,.5),
add=T) # superimposes the histograms
# Add legend
legend("topright",
legend=c("Males","Females"),
col=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)),
pt.cex=2, pch=15 )
Task 3 - Scatter and boxplot
To see the relation between bone mineral density (BMD) and age we can produce a scatter plot:
Or a boxplot between bone mineral density (BMD) by categories of age created in task 1
And with a bit more work:
# BMD
boxplot(bmd.data$bmd ~ bmd.data$age.cat,
main = "A nice boxplot (at least for a colorblind)",
xlab = "Age categories",
ylab = "Bone Mineral Density",
col = terrain.colors(4) )
# Add data points
mylevels <- levels(bmd.data$age.cat)
levelProportions <- summary(bmd.data$age.cat)/nrow(bmd.data)
for(i in 1:length(mylevels)){
thislevel <- mylevels[i]
thisvalues <- bmd.data[bmd.data$age.cat == thislevel, "bmd"]
#take the x-axis indices and add a jitter,
#proportional to the N in each level
myjitter <- jitter(rep(i, length(thisvalues)),
amount=levelProportions[i]/2)
points(myjitter, thisvalues,
pch=20, col=rgb(0,0,0,.9))
}
TRY IT YOURSELF:
- Plot a scatter for the relation of bmi and bmd.
- Plot a scatter for the relation of bmi and bmd with the dots painted by the variable sex.
See the solution code
plot(bmd.data$bmi,bmd.data$bmd,
pch=20,
col= ifelse(bmd.data$sex=="M", #blue for males
"blue", #red for females
"red")
)
# Add legend
legend("topright",
legend=c("Males","Females"),
col=c("blue", "red"),
pt.cex=3, pch=20 )