1.5 Datasets for the course

This is a list of the datasets used in the course, each with a short description and a download link. To download a dataset, save its link as a file from your browser.

  • pisa.csv (download). Contains 65 rows corresponding to the countries that took part in the PISA study. Each row has the variables Country, MeanMath, MathShareLow, MathShareTop, ReadingMean, ScienceMean, GDPp, logGDPp and HighIncome. The variable logGDPp is the logarithm of GDPp, which is taken in order to avoid scale distortions.

  • US_apportionment.xlsx (download). Contains the 50 US states entitled to representation in the US House of Representatives. The recorded variables are State, Population2010 and Seats2013–2023.

  • EU_apportionment.txt (download). Contains 28 rows with the member states of the EU (Country), the number of seats assigned in different years (Seats2011, Seats2014), the Cambridge Compromise apportionment (CamCom2011) and the states' populations (Population2010, Population2013).

  • least-squares.RData (download). Contains a single data.frame, named leastSquares, with 50 observations of the variables x, yLin, yQua and yExp. These are generated as \(X\sim\mathcal{N}(0,1)\), \(Y_\mathrm{lin}=-0.5+1.5X+\varepsilon\), \(Y_\mathrm{qua}=-0.5+1.5X^2+\varepsilon\) and \(Y_\mathrm{exp}=-0.5+1.5\cdot2^X+\varepsilon\), with \(\varepsilon\sim\mathcal{N}(0,0.5^2)\). The purpose of the dataset is to illustrate least squares fitting.

  • assumptions.RData (download). Contains the data frame assumptions with 200 observations of the variables x1, …, x9 and y1, …, y9. The purpose of the dataset is to identify which of the regressions y1 ~ x1, …, y9 ~ x9 fulfill the assumptions of the linear model. The dataset moreAssumptions.RData (download) has the same structure.

  • cpus.txt (download) and gpus.txt (download). The datasets contain 102 and 35 rows, respectively, of commercial CPUs and GPUs released from the first models up to the present day. The variables in the datasets are Processor, Transistor count, Date of introduction, Manufacturer, Process and Area.

  • hap.txt (download). Contains data for 20 advanced economies in the time period 1946–2009, measured for 31 variables. Among those, the variable dRGDP represents the real GDP growth (as a percentage) and debtgdp represents the public debt as a percentage of GDP.

  • wine.csv (download). The dataset contains the auction Price of 27 red Bordeaux vintages, five vintage descriptors (WinterRain, AGST, HarvestRain, Age, Year) and the population of France in the year of the vintage (FrancePop).

  • Boston.xlsx (download). The dataset contains 14 variables describing 506 suburbs in Boston. Among those variables, medv is the median house value, rm is the average number of rooms per house and crim is the per capita crime rate. The full description is available in ?Boston.

  • assumptions3D.RData (download). Contains the data frame assumptions3D with 200 observations of the variables x1.1, …, x1.8, x2.1, …, x2.8 and y.1, …, y.8. The purpose of the dataset is to identify which of the regressions y.1 ~ x1.1 + x2.1, …, y.8 ~ x1.8 + x2.8 fulfill the assumptions of the linear model.

  • challenger.txt (download). Contains data for 23 Space Shuttle launches. There are 8 variables. Among them: temp, the temperature in degrees Celsius at the time of launch, and fail.field and fail.nozzle, indicators of whether there were incidents in the O-rings of the field joints and nozzles of the solid rocket boosters.

  • eurojob.txt (download). Contains data on employment in 26 European countries. There are 9 variables, giving the percentage of employment in each of 9 sectors: Agr (Agriculture), Min (Mining), Man (Manufacture), Pow (Power), Con (Construction), Ser (Services), Fin (Finance), Soc (Social) and Tra (Transport).

  • Chile.txt (download). Contains data for 2700 respondents to a survey on voting intentions in the 1988 Chilean national plebiscite. There are 8 variables: region, population, sex, age, education, income, statusquo (scale of support for the status quo) and vote. The variable vote is a factor with levels A (abstention), N (against Pinochet), U (undecided) and Y (for Pinochet). Available in R through the package car and data(Chile).

  • USArrests.txt (download). Arrest statistics for Assault, Murder and Rape in each of the 50 US states in 1973. The percentage of the population living in urban areas, UrbanPop, is also given. Available in R through data(USArrests).

  • USJudgeRatings.txt (download). Lawyers’ ratings of state judges in the US Superior Court. The dataset contains 43 observations of 12 variables measuring the performance of the judge when conducting a trial. Available in R through data(USJudgeRatings).

  • la-liga-2015-2016.xlsx (download). Contains 19 performance metrics for the 20 football teams in La Liga 2015/2016.

  • pisaUS2009.csv (download). Reading score of 3663 US students in the PISA test, with 23 variables describing the student profile and family background.
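
The generating model described for least-squares.RData can be made concrete with a short simulation. The following is only a sketch: the seed, the object name leastSquaresSim and the choice of reusing one error vector for the three responses are assumptions, so the simulated values will not coincide with those stored in the file.

```r
# Sketch: simulate data with the same structure as leastSquares.
# Assumptions (not from the dataset itself): the seed, the name
# leastSquaresSim, and a single error vector shared by all responses.
set.seed(12345)
n <- 50

x <- rnorm(n, mean = 0, sd = 1)      # X ~ N(0, 1)
eps <- rnorm(n, mean = 0, sd = 0.5)  # epsilon ~ N(0, 0.5^2)

leastSquaresSim <- data.frame(
  x = x,
  yLin = -0.5 + 1.5 * x + eps,    # linear relation
  yQua = -0.5 + 1.5 * x^2 + eps,  # quadratic relation
  yExp = -0.5 + 1.5 * 2^x + eps   # exponential relation
)

# Least squares fit of the linear response on x
fit <- lm(yLin ~ x, data = leastSquaresSim)
coef(fit)
```

The estimated intercept and slope should lie close to -0.5 and 1.5, the coefficients used to generate yLin. Fitting yQua ~ x or yExp ~ x with the same call shows how a straight line behaves when the true relation is not linear.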