Chapter 12 Summary Statistics

Now that we have the psych package loaded, we can use the describe() function. The summary() function is part of base R, so we could have used it already.

summary() is a generic function, which makes it handy for getting a feel for your data, but it also means the output differs quite a bit depending on the type of data you give it. Try it out on some of the different data frames you’ve created today.

summary(exp)

##        ID              dose       effect     
##  Min.   : 1.0   Placebo  :5   Min.   :1.000  
##  1st Qu.: 4.5   Low_dose :5   1st Qu.:2.000  
##  Median : 8.0   High_dose:5   Median :3.000  
##  Mean   : 8.0                 Mean   :3.467  
##  3rd Qu.:11.5                 3rd Qu.:4.500  
##  Max.   :15.0                 Max.   :7.000
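
To see how much the output depends on the input, here is a small sketch that calls summary() on a numeric vector, a factor, and a logical vector. The exp_demo data frame is a hypothetical stand-in: its values are assumed, chosen only to be consistent with the output above, and are not necessarily the real exp data.

```r
# Hypothetical stand-in for exp (assumed values, consistent with the
# summary() output above but not necessarily the original data)
exp_demo <- data.frame(
  ID = 1:15,
  dose = factor(rep(c("Placebo", "Low_dose", "High_dose"), each = 5),
                levels = c("Placebo", "Low_dose", "High_dose")),
  effect = c(1, 1, 2, 3, 4,  2, 2, 3, 4, 5,  3, 4, 5, 6, 7)
)

num_summary <- summary(exp_demo$effect)      # numeric: min, quartiles, mean, max
fac_summary <- summary(exp_demo$dose)        # factor: a count for each level
lgl_summary <- summary(exp_demo$effect > 3)  # logical: counts of FALSE and TRUE

print(num_summary)
print(fac_summary)
print(lgl_summary)
```

Calling summary() on the whole data frame simply applies the matching summary to each column, which is why the dose column above shows counts while ID and effect show quartiles.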

describe() is a little different: it is mostly intended for data measured on an interval or ratio scale. It calculates the same set of descriptive statistics for every variable you give it, whatever its type. Try it out!

describe(exp)

##        vars  n mean   sd median trimmed  mad min max range skew kurtosis
## ID        1 15 8.00 4.47      8    8.00 5.93   1  15    14 0.00    -1.44
## dose*     2 15 2.00 0.85      2    2.00 1.48   1   3     2 0.00    -1.69
## effect    3 15 3.47 1.77      3    3.38 1.48   1   7     6 0.34    -0.97
##          se
## ID     1.15
## dose*  0.22
## effect 0.46

Did you notice the * beside one of the variable names in the output? The describe() function converts factors and logical variables to numeric values in order to do the calculations. Those variables are then marked with a *, and their statistics generally won’t make much sense.
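
To see where the numbers in the dose* row come from, here is a minimal base R sketch of that kind of conversion. It assumes the dose factor has the level order Placebo, Low_dose, High_dose with five cases each, as the output above suggests. The variable names are made up for illustration.

```r
# Assumed reconstruction of the dose factor (level order taken from the output above)
dose <- factor(rep(c("Placebo", "Low_dose", "High_dose"), each = 5),
               levels = c("Placebo", "Low_dose", "High_dose"))

# Converting a factor to numeric yields the underlying level codes:
# Placebo = 1, Low_dose = 2, High_dose = 3
dose_codes <- as.numeric(dose)

dose_mean <- mean(dose_codes)  # the "mean" of the level codes
dose_sd <- sd(dose_codes)      # the "sd" of the level codes

print(round(c(mean = dose_mean, sd = dose_sd), 2))
```

The mean of 2 only tells you that the three level codes are balanced. It says nothing about an "average dose", which is why the starred rows are best ignored.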

What if you wanted to see descriptive statistics by group? This is pretty easy to do in R. Instead of describe(), you just use the describeBy() function. These two functions are very similar, but in describeBy() you need to specify your group variable. Let’s try it out:

describeBy(exp$effect, group = exp$dose)

## 
##  Descriptive statistics by group 
## group: Placebo
##    vars n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 5  2.2 1.3      2     2.2 1.48   1   4     3 0.26    -1.96 0.58
## -------------------------------------------------------- 
## group: Low_dose
##    vars n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 5  3.2 1.3      3     3.2 1.48   2   5     3 0.26    -1.96 0.58
## -------------------------------------------------------- 
## group: High_dose
##    vars n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 5    5 1.58      5       5 1.48   3   7     4    0    -1.91 0.71

Now you have separate descriptive statistics for each of the dose groups.

There are other ways of doing this, such as by using the by() and aggregate() functions. We won’t be covering these today.
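
If you are curious, here is a rough sketch of what those two base R alternatives might look like. It is only a pointer and not something you need today. The exp_demo data frame is a hypothetical stand-in with assumed values that are consistent with the output above.

```r
# Hypothetical stand-in for exp (assumed values, not necessarily the original data)
exp_demo <- data.frame(
  ID = 1:15,
  dose = factor(rep(c("Placebo", "Low_dose", "High_dose"), each = 5),
                levels = c("Placebo", "Low_dose", "High_dose")),
  effect = c(1, 1, 2, 3, 4,  2, 2, 3, 4, 5,  3, 4, 5, 6, 7)
)

# aggregate(): one row per dose group, holding the mean of effect
group_means <- aggregate(effect ~ dose, data = exp_demo, FUN = mean)

# by(): split the data frame by dose and apply a function to each piece
group_sds <- by(exp_demo, exp_demo$dose, function(d) sd(d$effect))

print(group_means)
print(group_sds)
```

aggregate() returns an ordinary data frame, which is convenient if you want to keep working with the group statistics. by() returns a list-like object that is mainly meant for printing.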

12.1 Data Cleaning

A tedious part of data analysis is addressing the problem of miscoded data that need to be converted to NA or some other value.

In our exp_lowscore example, we can use the scrub() function to change the values of 7 to NA for us. Here, any value in column 3 that falls outside the range 1 to 6 is replaced with NA:

library(psych)
# In column 3, replace values below 1 or above 6 with NA (the default newvalue)
clean_lowscore <- scrub(exp_lowscore, where = 3, min = 1, max = 6)

This function can be very helpful when working with large data sets, and can be applied to full data frames or selected columns.
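
To make the logic concrete, here is a base R sketch of roughly the same recoding for a single column. The lowscore_demo data frame is a hypothetical stand-in for exp_lowscore, and the valid range of 1 to 6 is an assumption taken from the scrub() call above.

```r
# Hypothetical stand-in for exp_lowscore: column 3 contains a miscoded 7
lowscore_demo <- data.frame(
  ID = 1:5,
  dose = factor(c("Placebo", "Placebo", "Low_dose", "Low_dose", "High_dose")),
  effect = c(2, 7, 3, NA, 6)
)

clean_demo <- lowscore_demo
scores <- clean_demo[[3]]

# Flag non-missing values outside the assumed valid range of 1 to 6
out_of_range <- !is.na(scores) & (scores < 1 | scores > 6)

# Recode the flagged values to NA; everything else is left untouched
clean_demo[[3]][out_of_range] <- NA

print(clean_demo)
```

Writing it out by hand like this works for one column. The advantage of scrub() is that it can handle several columns, each with its own limits, in a single call.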