Data Analytics Module
Lecturer: Hans van der Zwan
Lab 02
Topic: summary statistics


EXERCISE 2.1

Use the HP_LONDON_JAN19.xlsx file to generate Table 1.1

Table 1
Summary statistics properties sold in London in January 2019

TYPE

COUNT

AVERAGE

MEDIAN

SD

PERC_ABOVE_1mln

FLAT

782

592,700

445,000

612,700

9.1

DETACHED

37

1,498,800

1,250,000

873,100

62.2

SEMI-DETACHED

125

982,200

805,000

642,400

32.8

TERRACED

442

876,100

630,000

919,700

23.5

Note. Average, median and standard deviation (SD) are in GB Pounds. The last column gives the percentages of properties sold with a selling price above 1 mln GPB. Underlying data are retrieved from http://landregistry.data.gov.uk/app/ppd/.


EXERCISE 2.2

The file hp_london_January16171819.xlsx contains information about the properties sold in London in January 2016, January 2017, January 2018 and January 2019.

Analyze the data in this file:

  1. compare the number of properties sold in January 2016, 2017, 2018 and 2019.
  2. plot side-by-side and stacked barplots with the number of properties sold in January 2016, 2017, 2018 and 2019 divided by property type.
  3. analyze the selling prices using boxplots and histograms.
  4. create a table with the most common summary statistics (count, minimum, median, maximum, avergae and standard deviation) to describe the selling prizes of Flats for the different years.


EXERCISE 2.3

Gather the GDP per capita data per country from the World Bank open data website. Copy the data for the year 2017 to a new worksheet.
Gather the life expectancy data from the same website. Copy the data for the year 2017 to a new worksheet.
Gather the population data from the same website. Copy the data for the year 2017 in a new worksheet.
Join the GDP per capita data for 2017 with the life expectancy data and with the population data.

  1. Create a bubble chart of these data for 2017, with the GDP data mapped on the x-axis, the life expectancy data on the y-axis and the population data mapped on the bubble size.
  2. Create a table with summary statistics for the GDP per capita in the different continents (for this the continent has to be added as a variable; the continent for each country can be found in the worksheet Metadata-country).
  3. Same as (ii) for the life-expectancy.


EXERCISE 2.4

Use the file 2019_PM10_NL.csv.
This file contains hourly PM10 levels in 2019 for … stations in The Netherlands; data source: https://luchtmeetnet.nl.
Choose one of these four locations: (1) Amsterdam-Stadhouderskade, (2) Rotterdam-Schiedamsevest, (3) Den Haag-Amsterdamse Veerkade or (4) Utrecht-Kardinaal de Jong weg. COpy paste the data for this loation to a new worksheet in Excel.

  1. Check if the dataset contains negative PM10 values. If such is the case, remove them.
  2. Explore the measured PM10 levels using graphical analysis. Detect outliers and if there are any analyze them.
  3. Summarize the data per day using the average PM10 level per day. Plot a time series graph of these averages.
  4. Create a boxplot in which the measured PM10 levels on Sunday, Monday, etc are compared.
  5. Create a table with summary statistics of the PM10 levels per weekday.
  6. Create a boxplot with the measured PM10 levels on the different hours of the day.
  7. Create a table with summary statistics of the PM10 levels per hour of the day.

  1. one way to calculate the median per category in MS Excel is by the use of a so called array function. See this website for an example.