Chapter 11 Measures of variation

11.0.1 Range

The range is commonly used, and many consider it the most basic measure of variation. If $x_1$ and $x_2$ are the maximum and minimum values in a dataset, then the range is obtained by subtracting $x_2$ from $x_1$.
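As a minimal sketch of the subtraction described above, the range can be computed in R as follows. The `chips` vector is hypothetical data invented for illustration, not the biscuit dataset used later in the chapter.

```r
# Hypothetical chocolate-chip counts for six biscuits (illustrative values only)
chips <- c(12, 15, 9, 20, 14, 11)

x1 <- max(chips)  # maximum value: 20
x2 <- min(chips)  # minimum value: 9

# The range is the maximum minus the minimum
data_range <- x1 - x2
print(data_range)  # 11

# Base R's range() returns c(min, max), so diff() gives the same number
print(diff(range(chips)))  # 11
```

Note that `range()` in R returns the two extreme values rather than their difference, which is why `diff()` is applied to it.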

11.0.2 Variance and Standard deviation

The variance is obtained by finding the arithmetic mean of the squared deviations $d_i^2$, $i = 1, \dots, n$, where $d_i$ denotes the difference between $x_i$ and the mean. The standard deviation is obtained by taking the square root of the variance.

11.0.2.0.1 Population variance

It is important to note that when calculating the population variance $\sigma^2$, the number of data points $n$ is used as the denominator, as in the following formula:

$$\sigma^2 = \frac{\sum_{i=1}^{n} d_i^2}{n}$$

where $d_i$ is expressed as $d_i = x_i - \mu$ and $\mu$ represents the population mean.

11.0.2.0.2 Sample variance

When calculating the sample variance (where the dataset in question represents data sampled from the population), the sum of squared deviations is divided by $n - 1$ instead of $n$:

$$s^2 = \frac{\sum_{i=1}^{n} d_i^2}{n - 1}$$

where $d_i$ is expressed as $d_i = x_i - \bar{x}$ and $\bar{x}$ represents the sample mean.

In the fictional chocolate biscuit example, the sample variance for a sample of biscuits from a given shipment would be calculated with $n - 1$ in the denominator, whereas the variance calculated if every biscuit in that shipment were evaluated would use $n$.
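The difference between the two denominators can be sketched in R as follows. The `chips` vector is hypothetical data invented for illustration. The built-in `var()` and `sd()` functions use the $n - 1$ denominator, so the population versions are computed by hand here.

```r
# Hypothetical chocolate-chip counts for six biscuits (illustrative values only)
chips <- c(12, 15, 9, 20, 14, 11)
n <- length(chips)

# Deviations of each value from the mean (the mean here is 13.5)
d <- chips - mean(chips)

# Population variance: treat the six biscuits as the whole shipment
pop_var <- sum(d^2) / n         # 73.5 / 6 = 12.25
pop_sd  <- sqrt(pop_var)        # 3.5

# Sample variance: treat the six biscuits as a sample from the shipment
samp_var <- sum(d^2) / (n - 1)  # 73.5 / 5 = 14.7
samp_sd  <- sqrt(samp_var)

# var() and sd() divide by n - 1, so they match the sample versions
print(all.equal(samp_var, var(chips)))  # TRUE
print(all.equal(samp_sd, sd(chips)))    # TRUE
```

For the same data the sample variance is always the larger of the two, and the gap shrinks as $n$ grows.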

11.0.3 Quartiles

If the data points are ordered from the lowest to the highest value, the lower quartile, $Q_1$, denotes the value below which 25% of the data lie. The second quartile, $Q_2$, which is also the median, is the value below which 50% of the data lie, and the third quartile, $Q_3$, is the value below which 75% lie. The interquartile range, calculated by subtracting $Q_1$ from $Q_3$, is another useful measure of the spread of the data.
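The quartiles and the interquartile range can be sketched in R as follows. The `chips` vector is hypothetical data invented for illustration. R's `quantile()` interpolates between data points and offers several definitions through its `type` argument; the default is assumed here, and other definitions can give slightly different quartiles for small datasets.

```r
# Hypothetical chocolate-chip counts for seven biscuits (illustrative values only)
chips <- c(9, 11, 12, 14, 15, 20, 22)

# Q1, Q2 (the median) and Q3 using R's default quantile definition
q <- quantile(chips, probs = c(0.25, 0.5, 0.75))
print(q)  # 25%: 11.5, 50%: 14, 75%: 17.5

# Interquartile range: Q3 minus Q1
iqr <- q[["75%"]] - q[["25%"]]
print(iqr)  # 6

# The built-in IQR() function gives the same value
print(IQR(chips))  # 6
```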

Figure 2 shows the three quartiles associated with the chocolate chip data in the biscuit sample before the policy. It is a boxplot, which is one way of presenting the distribution of data. The plot is associated with five numbers. From top to bottom, these are: the maximum observed value, the third quartile, the second quartile, the first quartile and the minimum observed value.

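library(ggplot2)  # provides ggplot(), geom_boxplot() and stat_summary()

# Load the biscuit data and keep only the rows recorded before the policy
thebiscuits <- read.csv("cookie_second_recipe.csv")
before <- thebiscuits[which(thebiscuits$timeframe == "2017 - before no eating policy"), ]

# Boxplot of chip counts, labelled with the five numbers returned by quantile()
p <- ggplot(before, aes(y = chocolate_chips, x = "Chocolates")) +
  geom_boxplot(width = 0.3) +
  theme(text = element_text(size = 50)) +
  ylab("Number of chocolate chips") +
  ggtitle("Figure 2: Boxplot for chocolate chips") +
  # fun and after_stat() replace the deprecated fun.y and ..y.. (ggplot2 >= 3.3.0)
  stat_summary(geom = "text", fun = quantile,
               aes(label = sprintf("%1.1f", after_stat(y))),
               position = position_nudge(x = 0.45), size = 40)
p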