# Chapter 14 Descriptive Statistics for a Vector

Let us look at functions used to describe the distribution of a vector. These functions include the mean, median, standard deviation, and more. We will use the dataset called ldeaths, which is built into R. This dataset is a monthly time series, which the functions in this chapter treat like a numeric vector. It gives the monthly deaths from bronchitis, emphysema and asthma, for both sexes, in the UK from 1974 to 1979.

## 14.1 Describing Distribution

To find the mean, use the function mean( ).

mean(ldeaths)
##  2056.625

To find the variance, use the function var( ).

var(ldeaths)    
##  371911.8

To find the standard deviation, use the function sd( ).

sd(ldeaths) 
##  609.8457

You can also compute the standard deviation by taking the square root of the variance.

sqrt(var(ldeaths))  
##  609.8457
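As a check on what var( ) and sd( ) compute, the sample variance can be rebuilt from its definition: the sum of squared deviations from the mean, divided by n - 1. This is a minimal sketch; the names x, n and manual_var are illustrative and do not come from the original text.

```r
x <- as.numeric(ldeaths)   # drop the time-series attributes, keep the values
n <- length(x)             # number of observations

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_var, var(x))        # TRUE if the hand computation agrees with var( )
all.equal(sqrt(manual_var), sd(x))   # TRUE if its square root agrees with sd( )
```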

In R, the function range( ) returns the minimum and maximum values in the dataset. This is not the same as the range used in statistics, which is a single number.

range(ldeaths)  
##  1300 3891

To calculate the range, as defined in statistics, we subtract the minimum value from the maximum value.

max(ldeaths) - min(ldeaths) 
##  2591
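The same subtraction can also be written by applying diff( ) to the result of range( ). This is a short sketch; the name stat_range is illustrative.

```r
# range( ) returns c(minimum, maximum); diff( ) subtracts the first from the second
stat_range <- diff(range(ldeaths))
stat_range
```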

The following functions all give the five-number summary.

The function summary( ) gives the five-number summary and also includes the mean in the result.

summary(ldeaths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1300    1552    1870    2057    2552    3891

The function fivenum( ) lists the results without any labels. Keep in mind that the results follow the order: minimum, first quartile, median, third quartile, and maximum.

fivenum(ldeaths)
##  1300.0 1549.5 1870.0 2553.0 3891.0
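Because fivenum( ) returns an unlabeled vector, one option is to attach labels yourself with setNames( ). The label text and the name five below are illustrative choices, not part of the original text.

```r
five <- setNames(
  fivenum(ldeaths),
  c("Min", "Q1", "Median", "Q3", "Max")   # same order that fivenum( ) uses
)

five
five["Median"]   # pick out a single value by its label
```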

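The function quantile( ) returns the values of the dataset at given percentiles. By default, it returns the 0%, 25%, 50%, 75% and 100% percentiles, which correspond to the five-number summary.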

quantile(ldeaths)
##      0%     25%     50%     75%    100%
## 1300.00 1551.75 1870.00 2552.50 3891.00
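quantile( ) also accepts a probs argument for other percentiles and a type argument that selects the calculation rule. The sketch below asks for the 10th and 90th percentiles, then compares the default quartiles with the type = 2 quartiles. For ldeaths, the type = 2 quartiles agree with the ones reported by fivenum( ). The names x, q_default and q_type2 are illustrative.

```r
x <- as.numeric(ldeaths)   # plain numeric values of the dataset

# Percentiles other than the default five
quantile(x, probs = c(0.10, 0.90))

# Quartiles under the default rule (type = 7) and under type = 2
q_default <- quantile(x, probs = c(0.25, 0.75))
q_type2   <- quantile(x, probs = c(0.25, 0.75), type = 2)
q_default
q_type2

# Positions 2 and 4 of fivenum( ) hold its first and third quartiles
fivenum(x)[c(2, 4)]
```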

The function boxplot.stats( ) returns a list of results: the five-number summary ($stats), the length of the vector ($n), a confidence interval ($conf), and any outliers ($out).

boxplot.stats(ldeaths)
## $stats
##  1300.0 1549.5 1870.0 2553.0 3891.0
##
## $n
##  72
##
## $conf
##  1683.143 2056.857
##
## $out
## numeric(0)

To narrow the result to the five-number summary, append $stats to the function, boxplot.stats( ).

boxplot.stats(ldeaths)$stats
##  1300.0 1549.5 1870.0 2553.0 3891.0
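The $conf component can be reproduced from the other components. According to the R documentation for boxplot.stats( ), its limits are the median plus and minus 1.58 * IQR / sqrt(n), where the IQR is the distance between the second and fourth entries of $stats. The sketch below recomputes it; the names bs and manual_conf are illustrative.

```r
bs <- boxplot.stats(ldeaths)

# median +/- 1.58 * (fourth entry - second entry of $stats) / sqrt(n)
manual_conf <- bs$stats[3] + c(-1.58, 1.58) * diff(bs$stats[c(2, 4)]) / sqrt(bs$n)

manual_conf
all.equal(manual_conf, bs$conf)   # TRUE if the recomputed limits match $conf
```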

boxplot.stats(rivers)$out
##  1459 1450 1243 2348 3710 2315 2533 1306 1270 1885 1770

The result shows 11 outliers.

Note that the function boxplot.stats( ) calculates the first and third quartiles differently from how our course does it. So, sometimes its outliers may differ from the outliers we compute.

### Using Fivenum Function

Let us take a look at another way of calculating outliers that gives the same result as in our course.

We know that suspected outliers fall more than 1.5*IQR below Q1 or above Q3. In other words,

- anything below Q1 - 1.5*IQR is a suspected outlier
- anything above Q3 + 1.5*IQR is a suspected outlier

Remember to use the type = 2 way of calculating IQR. For Q1, we will use the second entry in the result of fivenum(rivers). For Q3, we will use the fourth entry in the result of fivenum(rivers).

Let us calculate the fences for suspected outliers.

fivenum(rivers)[2] - 1.5*IQR(rivers, type = 2)  # Lower inner fence of rivers
##  -245

fivenum(rivers)[4] + 1.5*IQR(rivers, type = 2)  # Upper inner fence of rivers
##  1235

Take a look at the five-number summary to see if we have any outliers.

fivenum(rivers)
##  135 310 425 680 3710

The minimum value, 135, is greater than the lower inner fence, -245, so there are no outliers at the low end. However, the maximum value, 3710, is greater than the upper inner fence, 1235. This means that there must be at least one outlier.

### Forming Subsets

In order to list the outliers, we need to make a subset of the dataset, rivers. The function is subset(object_to_be_subsetted, logical_expression).

Let us make a subset of the dataset, rivers, containing all elements greater than the upper inner fence. We will call the subset rivers_sub.

rivers_sub <- subset(rivers, rivers > 1235)
rivers_sub  # List all outliers of dataset, rivers
##  1459 1450 1243 2348 3710 2315 2533 1306 1270 1885 1770

We can also use [ ] to form a subset.

rivers_sub2 <- rivers[rivers > 1235]
rivers_sub2  # List all the outliers of the dataset, rivers
##  1459 1450 1243 2348 3710 2315 2533 1306 1270 1885 1770

This method also shows 11 outliers.

## 14.3 No Outliers

Let us take a look at the dataset, ldeaths.

fivenum(ldeaths)
##  1300.0 1549.5 1870.0 2553.0 3891.0

fivenum(ldeaths)[2] - 1.5*IQR(ldeaths, type = 2)  # Lower inner fence
##  44.25

fivenum(ldeaths)[4] + 1.5*IQR(ldeaths, type = 2)  # Upper inner fence
##  4058.25

We see that the lower inner fence, 44.25, is smaller than the minimum, 1300.0, and the upper inner fence, 4058.25, is larger than the maximum, 3891.0. Therefore, there are no outliers.

If we had used boxplot.stats( )$out instead, this is how the result looks when there are no outliers.

boxplot.stats(ldeaths)$out
## numeric(0)
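The fence calculation used in this chapter can be wrapped into a small helper so that it can be reused on any numeric vector. This is only a sketch under the chapter's conventions (quartiles from fivenum( ), IQR with type = 2). The function name suspected_outliers is hypothetical and is not built into R.

```r
suspected_outliers <- function(x) {
  x     <- as.numeric(x)          # work with plain numeric values
  five  <- fivenum(x)             # minimum, Q1, median, Q3, maximum
  iqr   <- IQR(x, type = 2)       # IQR computed the type = 2 way
  lower <- five[2] - 1.5 * iqr    # lower inner fence
  upper <- five[4] + 1.5 * iqr    # upper inner fence
  x[x < lower | x > upper]        # values outside either fence
}

suspected_outliers(rivers)    # the 11 values above the upper inner fence, 1235
suspected_outliers(ldeaths)   # numeric(0), since ldeaths has no suspected outliers
```

Returning the values themselves, rather than only a count, keeps the result comparable with the output of subset( ) and [ ] shown earlier.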