## 1.6 Datasets for the course

This is a list of the datasets used in the course, each with a short description and a download link. To download a dataset, simply **save the link as a file** in your browser.
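Once a file is saved, it has to be read into `R` with a function that matches its format. The following is a minimal sketch, not part of the course material: `readCourseData` is a hypothetical helper, and the header and separator settings are assumptions that may need adjusting for a particular file. Reading Excel files additionally assumes the `readxl` package is installed.

```r
# Hypothetical helper: pick a reader from the file extension.
# Header and separator settings are assumptions, not documented properties
# of the course files.
readCourseData <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = read.csv(path),
    # Some .txt files may need an explicit separator, e.g. sep = "\t"
    txt = read.table(path, header = TRUE),
    # .RData files restore named objects; collect them in a list
    rdata = {
      env <- new.env()
      load(path, envir = env)
      as.list(env)
    },
    # Excel files: assumes the readxl package is available
    xls = ,
    xlsx = readxl::read_excel(path),
    stop("Unsupported extension: ", ext)
  )
}

# Self-contained check using a temporary file instead of a course dataset
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c(2.5, 3.1, 4.8)), tmp, row.names = FALSE)
demo <- readCourseData(tmp)
str(demo)
```

With a downloaded file in the working directory, a call such as `readCourseData("pisa.csv")` would follow the same pattern.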

- `pisa.csv` (download). Contains 65 rows corresponding to the countries that took part in the PISA study. Each row has the variables `Country`, `MeanMath`, `MathShareLow`, `MathShareTop`, `ReadingMean`, `ScienceMean`, `GDPp`, `logGDPp` and `HighIncome`. `logGDPp` is the logarithm of `GDPp`, taken to avoid scale distortions.

- `US_apportionment.xls` (download). Contains the 50 US states entitled to representation in the US House of Representatives. The recorded variables are `State`, `Population2010` and `Seats2013–2023`.

- `EU_apportionment.txt` (download). Contains 28 rows with the member states of the EU (`Country`), the number of seats assigned in different years (`Seats2011`, `Seats2014`), the Cambridge Compromise apportionment (`CamCom2011`), and the states' populations (`Population2010`, `Population2013`).

- `least-squares.RData` (download). Contains a single `data.frame`, named `leastSquares`, with 50 observations of the variables `x`, `yLin`, `yQua` and `yExp`. These are generated as \(X\sim\mathcal{N}(0,1)\), \(Y_\mathrm{lin}=-0.5+1.5X+\varepsilon\), \(Y_\mathrm{qua}=-0.5+1.5X^2+\varepsilon\) and \(Y_\mathrm{exp}=-0.5+1.5\cdot2^X+\varepsilon\), with \(\varepsilon\sim\mathcal{N}(0,0.5^2)\). The purpose of the dataset is to illustrate least squares fitting.

- `assumptions.RData` (download). Contains the data frame `assumptions` with 200 observations of the variables `x1`, …, `x9` and `y1`, …, `y9`. The purpose of the dataset is to identify which of the regressions `y1 ~ x1`, …, `y9 ~ x9` fulfill the assumptions of the linear model. The dataset `moreAssumptions.RData` (download) has the same structure.

- `cpus.txt` (download) and `gpus.txt` (download). The datasets contain 102 and 35 rows, respectively, of commercial CPUs and GPUs released from the first models up to the present. The variables in the datasets are `Processor`, `Transistor count`, `Date of introduction`, `Manufacturer`, `Process` and `Area`.

- `hap.txt` (download). Contains data for 20 advanced economies in the period 1946–2009, measured for 31 variables. Among those, the variable `dRGDP` represents the real GDP growth (as a percentage) and `debtgdp` represents public debt as a percentage of GDP.

- `wine.csv` (download). The dataset is formed by the auction `Price` of 27 red Bordeaux vintages, five vintage descriptors (`WinterRain`, `AGST`, `HarvestRain`, `Age`, `Year`) and the population of France in the year of the vintage, `FrancePop`.

- `Boston.xlsx` (download). The dataset contains 14 variables describing 506 suburbs in Boston. Among those variables, `medv` is the median house value, `rm` is the average number of rooms per house and `crim` is the per capita crime rate. The full description is available in `?Boston`.

- `assumptions3D.RData` (download). Contains the data frame `assumptions3D` with 200 observations of the variables `x1.1`, …, `x1.8`, `x2.1`, …, `x2.8` and `y.1`, …, `y.8`. The purpose of the dataset is to identify which of the regressions `y.1 ~ x1.1 + x2.1`, …, `y.8 ~ x1.8 + x2.8` fulfill the assumptions of the linear model.

- `challenger.txt` (download). Contains data for 23 Space Shuttle launches, measured for 8 variables. Among them: `temp`, the temperature in degrees Celsius at the time of launch, and `fail.field` and `fail.nozzle`, indicators of whether there was an incident with the O-rings of the field joints and nozzles of the solid rocket boosters.

- `eurojob.txt` (download). Contains employment data for 26 European countries. There are 9 variables, giving the percentage of employment in 9 sectors: `Agr` (Agriculture), `Min` (Mining), `Man` (Manufacture), `Pow` (Power), `Con` (Construction), `Ser` (Services), `Fin` (Finance), `Soc` (Social) and `Tra` (Transport).

- `Chile.txt` (download). Contains data for 2700 respondents to a survey on voting intentions in the 1988 Chilean national plebiscite. There are 8 variables: `region`, `population`, `sex`, `age`, `education`, `income`, `statusquo` (scale of support for the status quo) and `vote`. `vote` is a factor with levels `A` (abstention), `N` (against Pinochet), `U` (undecided) and `Y` (for Pinochet). Available in `R` through the package `car` and `data(Chile)`.

- `USArrests.txt` (download). Arrest statistics for `Assault`, `Murder` and `Rape` in each of the 50 US states in 1973. The percentage of the population living in urban areas, `UrbanPop`, is also given. Available in `R` through `data(USArrests)`.

- `USJudgeRatings.txt` (download). Lawyers’ ratings of state judges in the US Superior Court. The dataset contains 43 observations of 12 variables measuring the performance of the judge when conducting a trial. Available in `R` through `data(USJudgeRatings)`.

- `la-liga-2015-2016.xlsx` (download). Contains 19 performance metrics for the 20 football teams in La Liga 2015/2016.

- `pisaUS2009.csv` (download). Reading scores of 3663 US students in the PISA test, with 23 variables describing the student profile and family background.
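
The generating model given above for `least-squares.RData` can be reproduced with a short simulation. This is only a sketch under stated assumptions: the seed, the name `simLeastSquares` and the use of one shared error vector for the three responses are illustrative choices, so the values will not match those in the course file.

```r
# Sketch of the model described for the leastSquares data frame.
# Assumptions: arbitrary seed, and the same error vector eps is reused
# for the three responses (the source does not say whether it is shared).
set.seed(1)
n <- 50
x <- rnorm(n)                        # X ~ N(0, 1)
eps <- rnorm(n, mean = 0, sd = 0.5)  # eps ~ N(0, 0.5^2)

simLeastSquares <- data.frame(
  x = x,
  yLin = -0.5 + 1.5 * x + eps,    # linear relation
  yQua = -0.5 + 1.5 * x^2 + eps,  # quadratic relation
  yExp = -0.5 + 1.5 * 2^x + eps   # exponential relation
)

# Least squares fit of the linear relation; the estimated intercept and
# slope should be close to -0.5 and 1.5, respectively
fit <- lm(yLin ~ x, data = simLeastSquares)
coef(fit)
```

Fitting `yQua ~ x` or `yExp ~ x` in the same way shows how a straight line behaves when the true relation is not linear.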