Appendix

Textbook

Introduction to Data Science

Data Analysis and Prediction Algorithms with R

Available online: link

Descriptive Statistics

To understand large amounts of data we need to reduce them to comparable metrics that accurately describe the nature of our data. We call those metrics descriptive statistics, and they are generally a good place to start when analyzing any data since they give us an overview of our observations. Some of the most used descriptive statistics are:

Minimum & Maximum: The smallest and largest values in our data are of interest since they show the full extent of how much our data can vary. The span between the largest and smallest observation is the Range. The range can be used to compare two data sets on how far apart their most extreme observations lie.

Median: The median is the middle observation of the data set when it is arranged in ascending order (with an even number of observations, it is the average of the two middle ones). It is of interest since it denotes the boundary between the lower half and upper half of the data and thus gives a good indication of the structure of the data. It is often similar, or equal, to the mean. Note, however, that unlike the mean it is largely unaffected by extreme values, so the two statistics can differ drastically and should not be used interchangeably without care.

Quartiles: Like the median, the quartiles denote boundaries in the ordered data: the first quartile separates the lowest 25% from the rest, and the third quartile separates the highest 25% from the rest. Together with the median they give an indication of the structure of the data and of its potential skewness. Skewness describes whether one half of the data generally lies farther from the median than the other half. If the data set is very skewed, the difference between the mean and the median will typically be large.

Mean: A statistic that measures the centrality of the data, and allows us to compare the magnitude or impact of the variable. One way to think about the mean is “If I removed all variation from my observations, so that they were all the same, what would they have to be in order for me to reach the same aggregate result as in my actual data?”

Standard Deviation: A statistic that shows the typical spread of the data around the mean, which allows us to compare how much variation our variable has.
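As a rough illustration of the statistics listed above, the following Python sketch computes each of them with the standard library. The sample values are invented for this example, and the "inclusive" quartile method is only one of several conventions in use; other tools may return slightly different quartiles for the same data.

```python
import statistics

# Hypothetical sample (invented for illustration); 40 is an extreme value.
data = [2, 4, 4, 5, 7, 9, 40]

minimum = min(data)
maximum = max(data)
value_range = maximum - minimum  # span between largest and smallest value

median = statistics.median(data)
mean = statistics.mean(data)

# Quartiles: the three cut points that split the ordered data into four parts.
# The "inclusive" method is an assumption; other conventions exist.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Sample standard deviation (divides by n - 1).
std_dev = statistics.stdev(data)

# Dropping the extreme value moves the mean far more than the median.
trimmed = [value for value in data if value != maximum]
trimmed_median = statistics.median(trimmed)
trimmed_mean = statistics.mean(trimmed)

print("min, max, range:", minimum, maximum, value_range)
print("quartiles:", q1, q2, q3)
print("median:", median, "-> without extreme value:", trimmed_median)
print("mean:", round(mean, 2), "-> without extreme value:", round(trimmed_mean, 2))
print("standard deviation:", round(std_dev, 2))
```

In this sample the mean is roughly twice the median, which reflects the single large observation pulling the mean upward.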

Bar Plots

One of the most straightforward ways to illustrate the difference between observations is to represent the values as rectangles next to each other. If every observation is given the same width, then the height of the rectangles will show that difference. Comparing two or more observations from different categories in this way is done with a bar graph. Bar graphs are used to get a quick overview of the difference in magnitude between observations from different categories.
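The idea can be sketched without any plotting library. The snippet below draws a text-only bar graph, with the bars running horizontally rather than vertically so that they fit on text lines. The categories and counts are invented, and `text_bar_chart` is a hypothetical helper written for this example.

```python
# Hypothetical counts per category (invented for illustration).
counts = {"apples": 12, "pears": 5, "plums": 8}


def text_bar_chart(values):
    """Return one line per category; the bar length equals the value."""
    label_width = max(len(label) for label in values)
    lines = []
    for label, value in values.items():
        bar = "#" * value  # every unit gets the same width
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return lines


chart = text_bar_chart(counts)
print("\n".join(chart))
```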

Box Plots

There are many aspects to consider when comparing data sets. One efficient way of illustrating many of the descriptive statistics calculated in A1 is through a box plot. Box plots use the five-number summary (minimum, first quartile, median, third quartile and maximum) to illustrate the centrality, spread, skewness, range, and the existence of extreme values. The box plot is one of the few plots that clearly identify extreme values.
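A minimal sketch of the numbers behind a box plot is given below. It computes the five-number summary and flags extreme values using the common rule of 1.5 times the interquartile range beyond the quartiles. That rule, the "inclusive" quartile method, the sample data, and both function names are assumptions made for this example, not something the text above prescribes.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def flag_extreme_values(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    k = 1.5 is a widely used convention, assumed here for illustration.
    """
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1  # interquartile range: width of the box
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if value < lower or value > upper]


# Hypothetical sample (invented for illustration).
data = [2, 4, 4, 5, 7, 9, 40]
summary = five_number_summary(data)
extremes = flag_extreme_values(data)

print("five-number summary:", summary)
print("extreme values:", extremes)
```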

Scatter Plots

The scatter plot is a good way to show several aspects of two linked data sets. One key to making a meaningful scatter plot is that there must be exactly one way to pair the observations of the two data sets with each other. A scatter plot shows the range and distribution of both data sets, indicates extreme values, and suggests whether the two variables may be correlated. Scatter plots are thus often used to illustrate possible relationships (or the lack thereof) between two variables.

The correlation between two variables shows to what extent a change in one variable, in a particular direction and of a specific magnitude, is consistently associated with a change in the other variable, in a particular direction and of a specific magnitude. Unlike causality, correlation only describes an observed association in the fluctuation of the two variables and makes no claim about the nature of that association.

A causal relationship is one in which a change in one variable brings about a change in the other variable. Note that a scatter plot can at best suggest such a relationship; further investigation would be necessary to show that it exists.
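One common way to put a number on the correlation described above is the Pearson correlation coefficient, which ranges from -1 to 1. The text does not name a specific measure, so its use here is an assumption, and the paired data are invented. Note that the pairing requirement shows up directly in the code: the i-th value of one list must belong with the i-th value of the other.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long, paired lists."""
    if len(xs) != len(ys):
        raise ValueError("each x must be paired with exactly one y")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)


# Hypothetical paired observations (invented for illustration).
hours_studied = [1, 2, 3, 4, 5]
test_score = [2, 4, 5, 4, 5]

r = pearson_correlation(hours_studied, test_score)
print("correlation:", round(r, 3))
```

A value near 1 or -1 indicates a consistent association, while a value near 0 indicates little linear association. In neither case does the coefficient say anything about whether one variable causes the other.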

Histograms

Histograms show how the observations are distributed over a numerical variable by counting how many fall into each of a series of consecutive intervals. Like bar plots, they show differences through the height of rectangles; unlike bar plots, however, the order in which the data are presented is fixed. Histograms are used when comparing numerical data, while bar plots are used for categorical data.
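The counting step behind a histogram can be sketched as follows. The bin boundaries, the sample ages, and the helper name `histogram_counts` are all assumptions chosen for this example; in practice the choice of bin width noticeably changes how the histogram looks.

```python
def histogram_counts(values, start, bin_width, n_bins):
    """Count the values falling in each bin [start + i*w, start + (i+1)*w).

    Values outside the covered interval are ignored in this sketch.
    """
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // bin_width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical ages (invented for illustration).
ages = [21, 23, 25, 31, 34, 35, 38, 52]
counts = histogram_counts(ages, start=20, bin_width=10, n_bins=4)

# The bins follow the numerical order of the variable, so their order is fixed.
for i, count in enumerate(counts):
    left = 20 + i * 10
    print(f"{left}-{left + 10}: {'#' * count}")
```

Note that the empty 40-50 bin is still shown, since leaving it out would distort the numerical axis.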

Time Series Plots

Time series plots are used to illustrate and compare a variable over time. Since they show changes in the variable associated with changes in time, the sequence of the observations is important. Time series are used to observe trends and to predict what future observations might be.
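Why the sequence matters can be illustrated with a small sketch: the observations are first put in chronological order, and only then do the changes from one period to the next become meaningful. The monthly figures below are invented for this example.

```python
# Hypothetical monthly observations (invented), deliberately out of order.
observations = [
    ("2023-03", 130),
    ("2023-01", 100),
    ("2023-02", 120),
    ("2023-04", 125),
]

# "YYYY-MM" strings sort chronologically when sorted as plain text.
series = sorted(observations)

# Change from each period to the next; only valid once the data are ordered.
changes = [
    (later[0], later[1] - earlier[1])
    for earlier, later in zip(series, series[1:])
]

for month, change in changes:
    print(f"{month}: {change:+d}")
```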

Pie Charts

Pie charts are mainly used to draw attention to differences in proportions between categories. Since the pie represents the whole, the relative sizes of the segments indicate which categories make up a large share of the total and which make up a small share. This makes the pie chart a natural choice when comparing the relative sizes of a small number of categories.
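The arithmetic behind the segments is simple enough to sketch: each category's share of the total is converted into a share of the full 360-degree circle. The categories and counts below are invented for this example.

```python
# Hypothetical counts per category (invented for illustration).
counts = {"bus": 30, "car": 50, "bike": 20}

total = sum(counts.values())

# Each category's proportion of the whole; the proportions sum to 1.
shares = {label: value / total for label, value in counts.items()}

# The same proportions expressed as angles of the pie's segments.
angles = {label: share * 360 for label, share in shares.items()}

for label in counts:
    print(f"{label}: {shares[label]:.0%} of the whole, {angles[label]:.0f} degrees")
```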