1.4 Scripts and datasets

The code snippets in the notes are collected in the following scripts. To download a script, save its link as a file in your browser.

Chapter 1: 01-intro.R.
Chapter 2: 02-lm-i.R.
Chapter 3: 03-lm-ii.R.
Chapter 4: 04-lm-iii.R.
Chapter 5: 05-glm.R. Generation of Figures 5.12–5.23: hypothesisGlm.R.
Chapter 6: 06-npreg.R.
Appendices A and B: 07-appendix.R.
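Once a script has been saved, it can be run from R with source(). The following is a minimal sketch, assuming the script was saved in the current working directory; run_script is a hypothetical helper introduced here for illustration and is not part of the notes.

```r
# Run a downloaded script if it is present, echoing each expression as it is
# evaluated. run_script() is a hypothetical helper, not part of the notes.
run_script <- function(path) {
  if (!file.exists(path)) {
    message("Script not found: ", path)
    return(invisible(FALSE))
  }
  source(path, echo = TRUE)  # evaluates the script in the global environment
  invisible(TRUE)
}

# Assumes 01-intro.R was saved in the current working directory
run_script("01-intro.R")
```

If the file is saved elsewhere, pass its full path or change the working directory with setwd() first.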

The following is a handy list of all the relevant datasets used in the course together with brief descriptions. The list is sorted according to the order of appearance of the datasets in the notes. To download them, simply save the link as a file in your browser.

wine.csv. The dataset is formed by the auction Price of $27$ red Bordeaux vintages, five vintage descriptors (WinterRain, AGST, HarvestRain, Age, Year), and the population of France in the year of the vintage (FrancePop).
least-squares.RData. Contains a single data.frame, named leastSquares, with $50$ observations of the variables x, yLin, yQua, and yExp. These are generated as $X\sim\mathcal{N}(0,1),$ $Y_\mathrm{lin}=-0.5+1.5X+\varepsilon,$ $Y_\mathrm{qua}=-0.5+1.5X^2+\varepsilon,$ and $Y_\mathrm{exp}=-0.5+1.5\cdot2^X+\varepsilon,$ with $\varepsilon\sim\mathcal{N}(0,0.5^2).$ The purpose of the dataset is to illustrate least squares fitting.
least-squares-3D.RData. Contains a single data.frame, named leastSquares3D, with $50$ observations of the variables x1, x2, x3, yLin, yQua, and yExp. These are generated as $X_1,X_2\sim\mathcal{N}(0,1),$ $X_3=X_1+\mathcal{N}(0,0.05^2),$ $Y_\mathrm{lin}=-0.5 + 0.5 X_1 + 0.5 X_2 +\varepsilon,$ $Y_\mathrm{qua}=-0.5 + X_1^2 + 0.5 X_2+\varepsilon,$ and $Y_\mathrm{exp}=-0.5 + 0.5 e^{X_2} + X_3+\varepsilon,$ with $\varepsilon\sim\mathcal{N}(0,1).$ The purpose of the dataset is to illustrate least squares fitting with several predictors.
assumptions.RData. Contains the data frame assumptions with $200$ observations of the variables x1, …, x9 and y1, …, y9. The purpose of the dataset is to identify which regression y1 ~ x1, …, y9 ~ x9 fulfills the assumptions of the linear model. The moreAssumptions.RData dataset has the same structure.
assumptions3D.RData. Contains the data frame assumptions3D with $200$ observations of the variables x1.1, …, x1.8, x2.1, …, x2.8 and y.1, …, y.8. The purpose of the dataset is to identify which regression y.1 ~ x1.1 + x2.1, …, y.8 ~ x1.8 + x2.8 fulfills the assumptions of the linear model.
Boston.xlsx. The dataset contains $14$ variables describing $506$ suburbs in Boston. Among those variables, medv is the median house value, rm is the average number of rooms per house, and crim is the per capita crime rate. The full description is available in ?MASS::Boston.
cpus.txt and gpus.txt. The datasets contain $102$ and $35$ rows, respectively, of commercial CPUs and GPUs released from the first models up to the present day. The variables in the datasets are Processor, Transistor count, Date of introduction, Manufacturer, Process, and Area.
la-liga-2015-2016.xlsx. Contains 19 performance metrics for the 20 football teams in La Liga 2015/2016.
challenger.txt. Contains data for $23$ space-shuttle launches. There are $8$ variables. Among them: temp (the temperature in degrees Celsius at the time of launch), and fail.field and fail.nozzle (indicators of whether there were incidents in the O-rings of the field joints and nozzles of the solid rocket boosters).
species.txt. Contains data for $90$ plots of land in which the Biomass, the pH of the terrain (a categorical variable), and the number of Species were measured.
heart.txt. Contains data for $226$ patients suspected of having a future heart attack. The variables are CK (level of creatine kinase), and ha and ok (numbers of patients who suffered a heart attack and who did not, respectively).
Chile.txt. Contains data for $2700$ respondents to a survey on voting intentions in the 1988 Chilean national plebiscite. There are $8$ variables: region, population, sex, age, education, income, statusquo (scale of support for the status quo), and vote. The variable vote is a factor with levels A (abstention), N (against Pinochet), U (undecided), and Y (for Pinochet). Retrieved from data(Chile, package = "carData").
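As an illustration of the generating model described for least-squares.RData, the sketch below simulates a data frame with the same structure and then passes it through an .RData file, which shows how load() restores an object under its saved name. The seed is arbitrary, so the simulated values differ from those in the distributed file. Using a single noise vector for the three responses is an assumption, since the description does not say whether $\varepsilon$ is shared. The temporary file stands in for a downloaded one.

```r
# Simulate a data frame with the same structure as leastSquares.
# The seed is arbitrary: values will NOT match those in least-squares.RData.
set.seed(1)
n <- 50
x <- rnorm(n)                         # X ~ N(0, 1)
eps <- rnorm(n, mean = 0, sd = 0.5)   # eps ~ N(0, 0.5^2); assumed shared by all responses
leastSquares <- data.frame(
  x = x,
  yLin = -0.5 + 1.5 * x + eps,
  yQua = -0.5 + 1.5 * x^2 + eps,
  yExp = -0.5 + 1.5 * 2^x + eps
)

# Round trip through an .RData file: load() restores objects under their saved names
path <- tempfile(fileext = ".RData")  # stand-in for a downloaded file
save(leastSquares, file = path)
rm(leastSquares)
loaded <- load(path)                  # character vector with the restored object names
print(loaded)
str(leastSquares)

# Least squares fit of the linear response on the simulated data
fit <- lm(yLin ~ x, data = leastSquares)
coef(fit)
```

For the downloaded file itself, load("least-squares.RData") should make leastSquares available in the same way, provided the file is in the working directory. The .csv and .txt files can be read with read.csv() and read.table(); the appropriate header and separator arguments depend on each file and should be checked against its contents.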