Chapter 14 Descriptive Statistics for a Vector

Let us look at functions used to describe the distribution of a vector. These functions include the mean, median, standard deviation, and more. We will use the dataset called ldeaths, which is built into R. This dataset is a monthly time series that can be treated as a numeric vector. It gives the monthly deaths from bronchitis, emphysema and asthma, for both sexes, in the UK from 1974 to 1979.

14.1 Describing Distribution

To find the mean, use the function mean( ).
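A minimal sketch of the call, assuming the built-in ldeaths dataset (it ships with base R, so nothing needs to be loaded):

```r
# Arithmetic mean of the monthly death counts
mean(ldeaths)
```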

## [1] 2056.625

To find the variance, use the function var( ).
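For ldeaths, the call would look something like this:

```r
# Sample variance (divides by n - 1)
var(ldeaths)
```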

## [1] 371911.8

To find the standard deviation, use the function sd( ).
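Applied to the same dataset, a sketch of the call is:

```r
# Sample standard deviation
sd(ldeaths)
```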

## [1] 609.8457

You can also compute the standard deviation by taking the square root of the variance.
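One way to write this, nesting var( ) inside sqrt( ):

```r
# Square root of the sample variance equals the standard deviation
sqrt(var(ldeaths))
```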

## [1] 609.8457

In R, the function, range( ), returns the minimum and maximum values in the dataset. This is not the same as the range used in statistics, which is a single number.
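A sketch of the call on ldeaths:

```r
# Returns a vector of two values: the minimum and the maximum
range(ldeaths)
```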

## [1] 1300 3891

To calculate the range, as defined in statistics, we subtract the minimum value from the maximum value.
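The subtraction can be sketched as follows. The diff(range( )) form in the comment is an equivalent alternative, not necessarily the one used in the course:

```r
# Statistical range: maximum minus minimum
max(ldeaths) - min(ldeaths)

# Equivalent alternative: diff(range(ldeaths))
```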

## [1] 2591

The following functions all give the five-number summary.

The function, summary( ), gives the five-number summary and also includes the mean in the result.
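For example, assuming we are still working with ldeaths:

```r
# Labelled summary: minimum, quartiles, median, mean, and maximum
summary(ldeaths)
```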

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    1552    1870    2057    2552    3891

The function, fivenum( ), will list the results but will not show any labels. Keep in mind, the results follow the order: minimum, first quartile, median, third quartile and maximum.
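A sketch of the call:

```r
# Unlabelled: minimum, Q1 (lower hinge), median, Q3 (upper hinge), maximum
fivenum(ldeaths)
```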

## [1] 1300.0 1549.5 1870.0 2553.0 3891.0

The function, quantile( ), returns the values of the data at the given percentiles. By default, these are the 0%, 25%, 50%, 75% and 100% percentiles.
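Called with its defaults on ldeaths, this would be:

```r
# Default percentiles: 0%, 25%, 50%, 75%, 100% (default method is type = 7)
quantile(ldeaths)
```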

##      0%     25%     50%     75%    100% 
## 1300.00 1551.75 1870.00 2552.50 3891.00

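The function, boxplot.stats( ), returns a list of results: the five-number summary ($stats), the length of the vector ($n), the limits of the boxplot notch, which act as an approximate confidence interval for the median ($conf), and any outliers ($out). When outliers are present, the first and last entries of $stats are the ends of the whiskers rather than the minimum and maximum.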
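A sketch of the call. boxplot.stats( ) lives in the grDevices package, which is attached by default in a standard R session:

```r
# Returns a list with components stats, n, conf, and out
boxplot.stats(ldeaths)
```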

## $stats
## [1] 1300.0 1549.5 1870.0 2553.0 3891.0
## 
## $n
## [1] 72
## 
## $conf
## [1] 1683.143 2056.857
## 
## $out
## numeric(0)

To narrow the result to the five-number summary, append $stats to the function, boxplot.stats( ).
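For instance:

```r
# Extract only the stats component of the returned list
boxplot.stats(ldeaths)$stats
```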

## [1] 1300.0 1549.5 1870.0 2553.0 3891.0

The results produced by boxplot.stats( )$stats are also used to draw the boxplot in base R.

As you can see, depending on which function is used, the results for the first and third quartiles can differ slightly. That is because there are several ways of calculating the first and third quartiles, and different functions calculate them differently. The function, fivenum( ), calculates the five-number summary as we do in our course.

The interquartile range function, IQR( ), subtracts the 25% value returned by quantile( ) from the 75% value to get the interquartile range.
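A sketch of the default call on ldeaths:

```r
# Default IQR: 75% quantile minus 25% quantile, using quantile()'s default type
IQR(ldeaths)
```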

## [1] 1000.75

This result may differ from the interquartile range computed in our course, depending on how quantile( ) calculates the first and third quartiles. To get our course’s IQR result, add the argument, type = 2, to the IQR( ) function.
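With the extra argument, the call becomes:

```r
# type = 2 is passed on to quantile() and changes how the quartiles are computed
IQR(ldeaths, type = 2)
```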

## [1] 1003.5

14.2 Calculating Outliers

From the previous chapters, we saw that the dataset, rivers, has numerous outliers. Let us calculate the outliers.

Using Boxplot Function

One way to extract the outliers is to use the function, boxplot.stats( )$out.
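Applied to the built-in rivers dataset, a sketch of the call is:

```r
# Values lying beyond the boxplot whiskers (default coefficient 1.5)
boxplot.stats(rivers)$out
```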

##  [1] 1459 1450 1243 2348 3710 2315 2533 1306 1270 1885 1770

The result shows 11 outliers. Note that the function, boxplot.stats( ), takes its first and third quartiles from fivenum( ) and uses their difference as the IQR, rather than IQR( ) with type = 2. So, sometimes its outliers may be different from our computed outliers.

Using Fivenum Function

Let us take a look at another way of calculating outliers that gives the same result as in our course. We know that suspected outliers fall more than 1.5*IQR below Q1 or above Q3. In other words,

  • anything below Q1 - 1.5*IQR is a suspected outlier
  • anything above Q3 + 1.5*IQR is a suspected outlier

Remember to use the type = 2 way of calculating IQR. For Q1, we will use the second entry that appears in the result of fivenum(rivers). For Q3, we will use the fourth entry that appears in the result of fivenum(rivers). Let us calculate suspected outliers.
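The two fences can be sketched as follows. The variable names q1, q3 and iqr are illustrative assumptions, not names from the course:

```r
# Illustrative names (q1, q3, iqr); only fivenum() and IQR() come from the text
q1  <- fivenum(rivers)[2]      # second entry: first quartile
q3  <- fivenum(rivers)[4]      # fourth entry: third quartile
iqr <- IQR(rivers, type = 2)   # interquartile range, type = 2

q1 - 1.5 * iqr                 # lower inner fence
q3 + 1.5 * iqr                 # upper inner fence
```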

## [1] -245
## [1] 1235

Take a look at the five-number summary to see if we have any outliers.
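A sketch of the call:

```r
# Five-number summary of rivers, to compare against the two fences
fivenum(rivers)
```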

## [1]  135  310  425  680 3710

The minimum value, 135, is greater than the lower inner fence, -245, so there are no outliers at the low end. However, the maximum value, 3710, is greater than the upper inner fence, 1235. This means that there must be at least one outlier.

Forming Subsets

In order to list the outliers, we need to make a subset of the dataset, rivers. The function is subset(object_to_be_subsetted, logical_expression).

Let us make a subset of the dataset, rivers, containing all elements greater than the upper inner fence. We will call the subset rivers_sub.
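A minimal sketch, assuming the upper inner fence of 1235 computed earlier is typed in directly:

```r
# Keep only the river lengths above the upper inner fence (1235)
rivers_sub <- subset(rivers, rivers > 1235)
rivers_sub
```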

##  [1] 1459 1450 1243 2348 3710 2315 2533 1306 1270 1885 1770

We can also use [ ] to form a subset.
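The same condition goes inside the brackets:

```r
# Logical indexing: keep the elements where the condition is TRUE
rivers[rivers > 1235]
```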

##  [1] 1459 1450 1243 2348 3710 2315 2533 1306 1270 1885 1770

This method also shows 11 outliers.

14.3 No Outliers

Let us take a look at the dataset, ldeaths.
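The same steps, sketched for ldeaths. The name iqr_ld is an illustrative assumption:

```r
fivenum(ldeaths)                     # five-number summary

iqr_ld <- IQR(ldeaths, type = 2)     # illustrative name for the type = 2 IQR
fivenum(ldeaths)[2] - 1.5 * iqr_ld   # lower inner fence
fivenum(ldeaths)[4] + 1.5 * iqr_ld   # upper inner fence
```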

## [1] 1300.0 1549.5 1870.0 2553.0 3891.0
## [1] 44.25
## [1] 4058.25

We see that the lower inner fence, 44.25, is smaller than the minimum, 1300.0, and the upper inner fence, 4058.25, is larger than the maximum, 3891.0. Therefore, there are no outliers.

If we had used the function, boxplot.stats( )$out, this is how the result would look when there are no outliers.
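A sketch of that call:

```r
# An empty numeric vector means no values lie beyond the whiskers
boxplot.stats(ldeaths)$out
```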

## numeric(0)