2 Summary statistics

Data Analytics Module
Lecturer: Hans van der Zwan
Handout 02
Topic: summary statistics

Preparation class
See module description

Graphs show the form of the distribution of the data and are a very usefull tool in exploring a dataset. Besides graphs, statistics that summarize the (distribution of the) data, are used to transform data into information. The five-number summary, which forms the basis for a boxplot, is a good example of summarizing data.
The most important statistics are statistics which measure central tendency and spread. Which summary statistics are appropriate depends on the type of a variable (categorical or numerical) and the context of the data to be studied.

2.1 Central tendency metrics

Metrics for central tendency: (arithmetric) mean and median. For grouped data, also the mode can be used to measure the central tendency. With individual cases the mode is in most cases not a good metric for central tendency, because an individual observation can have the highest frequency without being in the center of the distribution.

2.2 Measuring spread

Most common metrics to measure variation in the data: range, interquartile range (IQR), variance and standard deviation.

2.3 Other statistics used to summarize the data

There are many other statistics that can be used to describe a data set. Some examples:
- the number of observations in the data set; actually this is almost always the first statistic reported;
- the proportion of the data that has a certain property; e.g. the proportion of voters supporting the proposal in the last tweet of president Trump;
- 5% trimmed mean: the mean of the data after removing the 5% lowest and the 5% highest values;
- coefficent of variation (CV): \(\frac{SD}{MEAN}\);
- skewness coefficient.

2.4 Aggregating and summarizing data

In many cases data are summarized per group. E.g. houseprices in the Netherlands can be summarized per province or per municipality, healthcare costs can be sumarized per age group etc.

Example: table with summary statistics of houses sold in London in January 2019

Table 1
Summary statistics houses sold in London in January 2019

TYPE

COUNT

AVERAGE

MEDIAN

SD

PERC_ABOVE_1mln

FLAT

782

592,700

445,000

612,700

5.1

DETACHED

37

1,498,800

1,250,000

873,100

1.7

SEMI-DETACHED

125

982,200

805,000

642,400

3.0

TERRACED

442

876,100

630,000

919,700

7.5

Note. Average, median and standard deviation (SD) are in GB Pounds. The last column gives the percentages of properties sold with a selling price above 1 mln GPB. Underlying data are retrieved from http://landregistry.data.gov.uk/app/ppd/.

2.5 Summarizing data using MS Excel

MS Excel comes with a lot of statistical functions which can be used to create a data summary. For an overview of statistical functions in Excel, see here.

Another possibility to create an overview of the most common summary statistics is using the analyses toolpak. On the internet, one can find many instruction manuals how to use this add-in, see for instance here.

2.6 Standardizing variables

In many cases it is useful to transform values to another unit of measurement. E.g. if in a data set there is a temperature variable (TEMP) measured in degrees Fahrenheit and for one reason or another the unit should be degrees Celsius, the following variable transformation can be used: TEMP_CELCIUS = (TEMP – 32) x 5/9.

A common transformation used is standardizing; the values of a variable (X-variable) are transformed into a new variable (Z-variable) which is the location of the observation relative to the mean, expressed in standard deviations: \(z = \frac{x\;-\;mean}{standard\;deviation}\) .
The z-value gives information about the place of a certain observation in the distribution.

2.7 Collecting data: open data sources

There are many open data sources with a variety of data available at the internet. Some interesting examples are listed below.
- Statistics Netherlands (CBS): https://opendata.cbs.nl/#/CBS/en/
- World Bank: https://data.worldbank.org/
- EU Open data portal: https://data.europa.eu/euodp/en/home
- Open data US government: https://www.data.gov/
- Data banks accessible via THUAS University website
- Lexis Nexis - Company.info

Open data sources offer data in a variety of different formats. A common used format is the csv (comma seperated values) format. Be aware that sometimes the semi-colon instead of the comma is used as a seperator in such a file.

2.8 Long and wide data format

In many cases the same data can be presensted in different ways. A common used transformation is from long to wide data format or the other way around.

2.9 Some MS Excel tips and tricks

Some commonlhy used possibilities when working with MS Excel are: - the use of $-signs to copy formulas with cells fixed - naming cells
- advanced filter
- pivot tables
- common statistical functions
- transforming data from long to wide format and vice versa

2.10 Example research case

Below an example of an outline for a quantitative research is given.

Casus: comparison of the air quality in the four major Dutch cities

1. Introduction
In May 2019 the municipality administration of Amsterdam announced that from 2030 on cars that run on diesel or petrol are not allowed in the city, [https://www.cnet.com/roadshow/news/amsterdam-ban-gas-diesel-cars-motorcycles-2030/]. This because of the effect of these vehicles on the air quality in the city.
Although there were a lot of negative reactions on this intention, it is well known that bad air quality has a negative influence on people’s health, see for instance this WHO publication. The question can be asked if the other major cities in The Netherlands should consider measures similar to these of Amsterdam.
The above considerations lead to the research question: Are the four largest cities in the Netherlands different in terms of air quality? To answer this question, it has been subdivided in a couple of sub questions:
(Q1) How can air quality be measured?
(Q2) What are standards for healthy air quality?
(Q3) How does air quality relate to human health?
(Q4) Which patterns can be identified in the air quality in the four major cities over the year?
(Q5) Which patterns can be identified in the air quality in the four major cities during a day?
(Q6) Are there significant differences between the quality of the air in Amsterdam, Rotterdam, Utrecht and The Hague.

2 Theoretical background
Description of results from literature study.

3 Emperical part: research methodology
Assume that, based on the answer on Q1 and Q2, the PM10 level has been used as metric for air quality in this research. To answer the field research questions (Q4, Q5 and Q6) use has been made of secundary data which is collected by the Netherlands National Institute for Public Health and the Environment (RIVM) and published on [luchtmeetnet.nl]. This research is based on the published figures from 2018.

3.1 PM10 data 2018 for Amsterdam, Rotterdam and The Hague
PM10 levels as measured at six locations in Amsterdam city centre, two locations in The Hague city centre, two locations in Utrecht and four locations in Rotterdam city centre, are used.

Insert: - table with locations
- map with position of these locations

3.2 Data collecting and cleaning
At all locations the PM10 level is measured every hour. The data set contains negative values. These are due to problems with the measurement instruments and are replaced by NA (not available). A graphical data analysis has been performed to detect outliers. Outliers are examined and it has been discussed how to deal with these values.

3.3 Operationalization (1): pattern over the year
Patterns over the year are analyzed through:

  • graphical analysis of time series
    • summarizing the data per day using the average PM10 level per day per city
    • plotting a time series graph of these averages for the different cities
    • plotting other graphs that give insight in the yearly pattern
  • summary statistics per day for the different days of the week
  • summary statistics per week
  • summary statistics per month

3.4 Operationalization (2): daily pattern identification
Identify daily patterns in the data has been done by:

  • graphical analysis of hourly data (how to do that with hourly data from different locations from a whole year?)
  • summary statistics per city for the different hours per day

3.5 Operationalization (3): comparing the major cities
The major citeis are compared using:

  • graphical analysis
    • boxplots per city
  • interpreting the different graphs
  • comparing the data per city with the standards set by the WHO
  • binomial test if the number of days with an unhealthy level is higher in Amsterdam than in the other two cities
  • t test to test if the average of the daily averages is higher in Amsterdam than in the other two cities

4 Results data analysis

4.1 Data per city

  • table with number of valid measurements per station and per city
  • table with problems in the data set; for instance days with only a small couple of valid measurements
  • graph with time series of daily averages per city
  • table with summary statistics per weekday and comments
  • tables with summary statistics per week and month and comments
  • graphs to analyze daily patterns in the data (e.g. boxplot)
  • table with summary statistics per hour

4.2 Comparing air quality in the four major Dutch cities
- graphical analysis: boxplots (and histograms)
- comments - comparing the cities with the WHO standards; table with number of days for the different healthy levels defined by the WHO
- results binomial tests en t-tests

4.3 Conclusion

Present the conclusions from the research and answer the central question.

5 Conclusions and recommendations
Summarize the main findings and recommendations. (no new findings in this part).

2.11 Homework group assignment

Option 1.
See the research example above about air pollution in the four major Dutch cities. Choose one of these cities and investigate the pattern over the year (see section Operationalization (1): pattern over the year).

Option 2.
The file hp_london_January161718.xlsx contains information about the houses sold in London in January 2016, January 2017 and in January 2018.

Analyze the data in this file:
(i) compare the number of properties sold in January 2016, 2017 and 2018.
(ii) plot side-by-side and stacked barplots with the number of properties sold in Janaury 2016, 2017 and 2018, divided by property type.
(iii) analyze the selling prices using boxplots and histograms.
(iv) create tables with the most common summary statistics to analyze the selling prizes of the different property types over the years.

Option 3.
Analyze the data in your own data set using graphs and summary statistics.