Chapter 14 Descriptive Statistics for a Vector
Let us look at functions used to describe the distribution of a vector. These functions include the mean, median, standard deviation, and more. We will use the dataset called ldeaths, which is built into R. This dataset is stored as a monthly time series, which the functions below treat as a numeric vector. It gives the monthly deaths from bronchitis, emphysema and asthma, for both sexes, in the UK from 1974 to 1979.
14.1 Describing Distribution
To find the mean, use the function mean( ).
## [1] 2056.625
To find the variance, use the function var( ).
## [1] 371911.8
To find the standard deviation, use the function sd( ).
## [1] 609.8457
You can also compute the standard deviation by taking the square root of the variance.
## [1] 609.8457
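The outputs above are shown without the calls that produced them. A minimal sketch of the likely calls, assuming ldeaths is passed directly to each function (the variable names are illustrative and not from the original):

```r
# ldeaths ships with base R (datasets package), so nothing needs to be loaded.
deaths_mean <- mean(ldeaths)   # arithmetic mean
deaths_var  <- var(ldeaths)    # sample variance (n - 1 denominator)
deaths_sd   <- sd(ldeaths)     # sample standard deviation

deaths_mean        # the chapter reports 2056.625
deaths_var         # the chapter reports 371911.8
deaths_sd          # the chapter reports 609.8457
sqrt(deaths_var)   # square root of the variance, same value as sd(ldeaths)
```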
In R, the function range( ) returns the minimum and maximum values in the dataset. This is not the same as the range used in statistics, which is a single number.
## [1] 1300 3891
To calculate the range, as defined in statistics, we subtract the minimum value from the maximum value.
## [1] 2591
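The two results above can be reproduced along these lines. This is a sketch; the name deaths_range is illustrative, and diff( ) is only one of several ways to do the subtraction:

```r
deaths_range <- range(ldeaths)   # c(minimum, maximum)
deaths_range                     # the chapter reports 1300 3891

# Statistical range: maximum minus minimum.
diff(deaths_range)               # the chapter reports 2591
max(ldeaths) - min(ldeaths)      # equivalent, written out explicitly
```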
The following functions all report the five-number summary, each in a slightly different form.
The function summary( ) also includes the mean in its result.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 1552 1870 2057 2552 3891
The function, fivenum( ), will list the results but will not show any labels. Keep in mind, the results follow the order: minimum, first quartile, median, third quartile and maximum.
## [1] 1300.0 1549.5 1870.0 2553.0 3891.0
The function quantile( ) returns the values at the requested percentiles. By default these are the 0%, 25%, 50%, 75% and 100% points, which together form a five-number summary.
## 0% 25% 50% 75% 100%
## 1300.00 1551.75 1870.00 2552.50 3891.00
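The three summaries above most likely come from calls such as the following. This is a sketch that assumes the default arguments were used in each case:

```r
summary(ldeaths)    # labelled output; includes the mean
fivenum(ldeaths)    # unlabelled: min, lower hinge, median, upper hinge, max
quantile(ldeaths)   # labelled by percentile; uses type = 7 by default
```

The hinges from fivenum( ) and the 25% and 75% values from quantile( ) are computed by different rules, which is why the middle entries do not match exactly.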
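The function boxplot.stats( ) returns several results: the five-number summary used to draw the boxplot ($stats), the length of the vector ($n), the limits of the notch, which form an approximate confidence interval around the median ($conf), and any outliers ($out).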
## $stats
## [1] 1300.0 1549.5 1870.0 2553.0 3891.0
##
## $n
## [1] 72
##
## $conf
## [1] 1683.143 2056.857
##
## $out
## numeric(0)
To narrow the result to the five-number summary, append $stats to the function, boxplot.stats( ).
## [1] 1300.0 1549.5 1870.0 2553.0 3891.0
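A sketch of the calls behind the two outputs above. Storing the result in bp is an illustrative choice, not something the original shows:

```r
# boxplot.stats() lives in grDevices, which is attached by default.
bp <- boxplot.stats(ldeaths)
bp         # list with components $stats, $n, $conf and $out

bp$stats   # only the five numbers used to draw the box and whiskers
boxplot.stats(ldeaths)$stats   # same thing in a single step
```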
The results produced by boxplot.stats( )$stats are also used to draw the boxplot in base R.
As you can see, depending on which function is used, the results for the first and third quartiles can differ slightly (here, 1549.5 versus 1551.75 for the first quartile). That is because there are several ways of calculating quartiles, and different functions use different ones. The function fivenum( ) calculates the five-number summary as we do in our course.
The interquartile range function, IQR( ), subtracts the 25% value returned by quantile( ) from the 75% value to get the interquartile range.
## [1] 1000.75
This result may differ from the interquartile range computed in our course, depending on the first and third quartiles that quantile( ) returns. To get our course’s IQR result, add the argument type = 2 to the IQR( ) function.
## [1] 1003.5
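The two interquartile ranges above can be reproduced as follows. The last line is an added cross-check, not part of the original, showing that the type = 2 result equals the gap between the hinges from fivenum( ) for this dataset:

```r
IQR(ldeaths)             # default quantile type (7); the chapter reports 1000.75
IQR(ldeaths, type = 2)   # the chapter reports 1003.5

# Cross-check: upper hinge minus lower hinge from fivenum().
diff(fivenum(ldeaths)[c(2, 4)])
```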
14.2 Calculating Outliers
From the previous chapters, we saw that the dataset, rivers, has numerous outliers. Let us calculate the outliers.
Using Boxplot Function
One way to extract the outliers is to use the function, boxplot.stats( )$out.
## [1] 1459 1450 1243 2348 3710 2315 2533 1306 1270 1885 1770
The result shows 11 outliers. Note that boxplot.stats( ) takes its first and third quartiles from the hinges that fivenum( ) computes, not from the default quantile( ) method. A function that calculates the quartiles differently may therefore report different outliers.
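A sketch of the call that likely produced the list of outliers above (the name river_out is illustrative):

```r
# rivers is another dataset built into R.
river_out <- boxplot.stats(rivers)$out
river_out           # values flagged as outliers
length(river_out)   # how many there are; the chapter reports 11
```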
Using Fivenum Function
Let us take a look at another way of calculating outliers that gives the same result as in our course. We know that suspected outliers fall more than 1.5*IQR below Q1 or above Q3. In other words,
- anything below Q1 – 1.5*IQR is a suspected outlier
- anything above Q3 + 1.5*IQR is a suspected outlier
Remember to pass type = 2 when calculating the IQR. For Q1, we will use the second entry in the result of fivenum(rivers). For Q3, we will use the fourth entry in the result of fivenum(rivers). Let us calculate the lower and upper inner fences, beyond which values are suspected outliers.
## [1] -245
## [1] 1235
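The two fences above follow from the rule just described. One way to compute them is sketched here; all variable names are illustrative:

```r
five <- fivenum(rivers)                # min, Q1, median, Q3, max
q1   <- five[2]                        # first quartile (lower hinge)
q3   <- five[4]                        # third quartile (upper hinge)
iqr  <- IQR(rivers, type = 2)          # IQR computed the course's way

lower_fence <- q1 - 1.5 * iqr          # the chapter reports -245
upper_fence <- q3 + 1.5 * iqr          # the chapter reports 1235

lower_fence
upper_fence
```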
Take a look at the five-number summary to see if we have any outliers.
## [1] 135 310 425 680 3710
The minimum value, 135, is greater than the lower inner fence, -245, so there are no outliers at the low end. However, the maximum value, 3710, is greater than the upper inner fence, 1235. This means that there must be at least one outlier.
Forming Subsets
In order to list the outliers, we need to make a subset of the dataset, rivers. The function is subset(object_to_be_subsetted, logical_expression).
Let us make a subset of the dataset, rivers, containing all elements greater than the upper inner fence. We will call the subset rivers_sub.
## [1] 1459 1450 1243 2348 3710 2315 2533 1306 1270 1885 1770
We can also use [ ] to form a subset.
## [1] 1459 1450 1243 2348 3710 2315 2533 1306 1270 1885 1770
This method also shows 11 outliers.
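Both subsetting approaches can be sketched as follows. The fence is recomputed here so the example runs on its own, and upper_fence is an illustrative name; only rivers_sub comes from the text:

```r
# Upper inner fence: Q3 + 1.5 * IQR, with the course's type = 2 IQR.
upper_fence <- fivenum(rivers)[4] + 1.5 * IQR(rivers, type = 2)

# Approach 1: subset() with a logical expression.
rivers_sub <- subset(rivers, rivers > upper_fence)
rivers_sub

# Approach 2: square-bracket indexing with the same logical expression.
rivers[rivers > upper_fence]

length(rivers_sub)   # the chapter reports 11 outliers
```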
14.3 No Outliers
Let us take a look at the dataset, ldeaths. We compute its five-number summary, then its lower and upper inner fences.
## [1] 1300.0 1549.5 1870.0 2553.0 3891.0
## [1] 44.25
## [1] 4058.25
We see that the lower inner fence, 44.25, is smaller than the minimum, 1300.0 and the upper inner fence, 4058.25, is larger than the maximum, 3891.0. Therefore, there are no outliers.
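The same fence calculation applied to ldeaths might look like this. It is a sketch with illustrative names; the final check is an addition that confirms programmatically what the comparison above shows by eye:

```r
deaths <- as.numeric(ldeaths)             # drop the time-series attributes
five   <- fivenum(deaths)                 # min, Q1, median, Q3, max
iqr    <- IQR(deaths, type = 2)

lower_fence <- five[2] - 1.5 * iqr        # the chapter reports 44.25
upper_fence <- five[4] + 1.5 * iqr        # the chapter reports 4058.25

five
lower_fence
upper_fence

# TRUE would mean at least one value lies outside the fences.
any(deaths < lower_fence | deaths > upper_fence)
```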
If we use the function boxplot.stats( )$out on a dataset with no outliers, the result looks like this.
## numeric(0)
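A sketch of that call, with an added length check that is convenient when the result is used in code and not read on screen (no_out is an illustrative name):

```r
no_out <- boxplot.stats(ldeaths)$out
no_out                 # an empty numeric vector, printed as numeric(0)
length(no_out) == 0    # TRUE when no outliers were found
```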