Chapter 1 Data

1.1 Tables and .csv files

Data comes in many forms. The most common one is a table with features (variables) in columns and observations (records) in rows.

The most basic file format for tabular data is .csv (comma-separated values), a plain-text file with one record per line.

There are two prevailing .csv versions:

  • USA/UK version – a comma (,) is the column separator, a period (.) is the decimal separator

  • European version – a semicolon (;) is the column separator, a comma (,) is the decimal separator

Figure 1.1 shows an example from Wikipedia:

Excerpt from the Wikipedia page describing the two csv versions: comma-based (USA/UK) and semicolon-based (European).

Figure 1.1: Two prevailing csv format versions (Wikipedia)
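
The difference between the two versions matters as soon as a program has to read the file. The sketch below uses only Python's standard library and parses the same two records written in both versions. The data, the column names and the helper read_table are made up for illustration and are not part of the exercise datasets.

```python
import csv
import io

# Toy data made up for illustration: the same two records in both versions.
US_UK_TEXT = "name,height\nAnna,1.72\nBen,1.80\n"
EUROPEAN_TEXT = "name;height\nAnna;1,72\nBen;1,80\n"


def read_table(text, column_sep, decimal_sep):
    """Parse csv text into a list of dicts, converting 'height' to a float."""
    reader = csv.DictReader(io.StringIO(text), delimiter=column_sep)
    rows = []
    for row in reader:
        # Normalise the decimal separator before converting to a number.
        row["height"] = float(row["height"].replace(decimal_sep, "."))
        rows.append(row)
    return rows


us_rows = read_table(US_UK_TEXT, column_sep=",", decimal_sep=".")
eu_rows = read_table(EUROPEAN_TEXT, column_sep=";", decimal_sep=",")
print(us_rows == eu_rows)  # both versions describe the same table
```

If the pandas library is available, pandas.read_csv accepts the two separators through its sep and decimal parameters, which is usually more convenient than converting the numbers by hand.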

1.2 Data for exercises

In the exercises below, we are going to use several datasets:

  1. statistics2018.csv - Statistics 2018 (second semester) results in one of the student groups

  2. statistics2021.csv - Statistics (first semester) results in one of the student groups last year

  3. covidpl.csv - data on covid deaths in Poland between March and October 2020

  4. SpeedRadarData.csv - data collected by a speed radar set up by parents in front of one of the schools in Warsaw (see this Facebook post)

  5. orders.csv - data on orders in an internet shop

  6. football2021.csv - salaries of football players in Bundesliga, La Liga, Ligue 1, Premier League and Serie A in 2021 (data collected by Edd Webster)

1.3 Exercises

Exercise 1.1 Download the data and answer the questions below for each dataset:

  1. What is the csv version of the dataset (US/UK or European)? You can view text files in Notepad++.

  2. How many observations (records) are there?

  3. How many variables (columns) are there?
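
The three questions can also be answered programmatically. The following is a minimal sketch, assuming that the file has a header row and that the column separator is whichever of ';' and ',' appears more often in that row. The function describe_csv and the sample text are hypothetical and only illustrate the idea.

```python
import csv
import io


def describe_csv(text):
    """Guess the csv version, then count observations and variables."""
    header_line = text.splitlines()[0]
    # Heuristic (an assumption): the separator is whichever of ';' and ','
    # appears more often in the header line.
    sep = ";" if header_line.count(";") > header_line.count(",") else ","
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    version = "European" if sep == ";" else "US/UK"
    n_variables = len(rows[0])
    n_observations = len(rows) - 1  # the header row is not an observation
    return version, n_observations, n_variables


# Made-up sample in the European version: 3 observations, 2 variables.
sample = "id;score\n1;55,5\n2;71,0\n3;64,5\n"
print(describe_csv(sample))
```

For a real file, the text can be obtained with open(path, encoding="utf-8", newline="").read(), where path is the location of the downloaded file and the encoding is an assumption that may need adjusting.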

Exercise 1.2 What was the maximum total score in the Statistics 2021 group? What was the ID of the student who got this score? What was the minimum total score? What were the average and median scores?

Exercise 1.3 In 2021 and 2018, what was the proportion (you can state the percentage share) of students who passed? The minimum total score to pass was 60 points.
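
The summary measures asked for in Exercises 1.2 and 1.3 can be sketched with Python's statistics module. The student IDs and scores below are made up for illustration. Only the pass mark of 60 points comes from the exercise text.

```python
import statistics

# Hypothetical student IDs and total scores, made up for illustration.
scores = {"s01": 48.0, "s02": 91.5, "s03": 60.0, "s04": 73.0}
PASS_MARK = 60  # minimum total score to pass, as stated in Exercise 1.3

best_id = max(scores, key=scores.get)  # ID of the student with the top score
highest = scores[best_id]
lowest = min(scores.values())
mean_score = statistics.mean(scores.values())
median_score = statistics.median(scores.values())

# A comparison yields True/False, which sum() counts as 1/0.
share_passed = sum(s >= PASS_MARK for s in scores.values()) / len(scores)

print(best_id, highest, lowest, mean_score, median_score, share_passed)
```

Note that a score exactly equal to the pass mark counts as a pass here, because 60 is described as the minimum score needed to pass.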

Exercise 1.4 What was the average age of the people on the list who died of Covid? What was the proportion of people younger than 50 years?

Exercise 1.5 What was the average age at death for women? For men? What was the average age of a person who died of Covid in the mazowieckie voivodeship?

Exercise 1.6 In the data collected by the speed radar: how many two-wheelers were there? What was their average speed? What was their median speed?

Exercise 1.7 What was the average order size? How many orders were there below 10 dollars?

Exercise 1.8 What was the average yearly salary of a goalkeeper? What was the maximum salary of a forward player?
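
Exercises 1.4 to 1.8 share one pattern: keep only the rows that meet a condition, then compute a summary of one numeric column. The sketch below shows that pattern in plain Python. The column names category and value and all the numbers are hypothetical, so they need to be replaced with the actual columns of each dataset.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records with made-up values; real column names will differ.
records = [
    {"category": "A", "value": 12.0},
    {"category": "B", "value": 7.5},
    {"category": "A", "value": 20.0},
    {"category": "B", "value": 9.5},
]

# Group the numeric column by category, then average within each group.
values_by_category = defaultdict(list)
for record in records:
    values_by_category[record["category"]].append(record["value"])
mean_by_category = {c: mean(v) for c, v in values_by_category.items()}

# Filter on one category, then summarise the remaining rows.
a_values = [r["value"] for r in records if r["category"] == "A"]
median_a = median(a_values)
max_a = max(a_values)

# Count and share of rows below a threshold.
n_below_10 = sum(r["value"] < 10 for r in records)
share_below_10 = n_below_10 / len(records)

print(mean_by_category, median_a, max_a, n_below_10, share_below_10)
```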