Summarizing Data

R provides built-in functions for the most common summary statistics. Some examples follow.

data$score1             # note that 'score1' is a continuous variable
##  [1] 35 23 14 17 23 35 27 33 32 31 34 27 51 36 39 45 31 40 25 32
mean(data$score1)       # returns the mean of 'score1'
## [1] 31.5
median(data$score1)     # returns the median of 'score1'
## [1] 32
var(data$score1)        # returns the sample variance of 'score1'
## [1] 78.36842
sd(data$score1)         # returns the sample standard deviation of 'score1'
## [1] 8.852594
max(data$score1)        # returns the maximum value of 'score1'
## [1] 51
min(data$score1)        # returns the minimum value of 'score1'
## [1] 14
range(data$score1)      # returns the minimum and maximum of 'score1'
## [1] 14 51
summary(data$score1)    # returns the minimum, quartiles, median, mean, and maximum of 'score1'
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   26.50   32.00   31.50   35.25   51.00
data$group            # note that 'group' is a categorical variable
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
table(data$group)     # counts the number of observations for each 'group'
## 
##  1  2 
## 10 10
cov(scores)           # covariance matrix of the 'scores' data matrix (score1, score2)
##          score1    score2
## score1 78.36842  51.31579
## score2 51.31579 151.68421
cor(scores)           # correlation matrix of the 'scores' data matrix (score1, score2)
##           score1    score2
## score1 1.0000000 0.4706632
## score2 0.4706632 1.0000000
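
The objects data and scores are used above without being defined in this section. The following is a minimal sketch for reproducing the printed results. It assumes that score2 equals the printed row sums of scores (the output of apply(scores, 1, sum) in this section) minus score1; the way data and scores are built here is an assumption, not necessarily how the original objects were created.

```r
# Values of score1 and group as printed in this section.
score1 <- c(35, 23, 14, 17, 23, 35, 27, 33, 32, 31,
            34, 27, 51, 36, 39, 45, 31, 40, 25, 32)
group <- rep(1:2, each = 10)

# Assumption: score2 is never printed directly, so it is inferred here
# as (row sum of 'scores') - score1, using the printed row sums.
row_sums <- c(80, 37, 40, 42, 50, 82, 64, 83, 47, 68,
              82, 43, 96, 62, 76, 86, 56, 57, 40, 59)
score2 <- row_sums - score1

# Assumed layout: a data frame with all variables, and a two-column matrix.
data <- data.frame(group = group, score1 = score1, score2 = score2)
scores <- cbind(score1 = score1, score2 = score2)

mean(data$score1)   # 31.5, matching the output above
cov(scores)         # should match the covariance matrix printed above
cor(scores)         # should match the correlation matrix printed above
```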

We can compute row-wise or column-wise summaries of a matrix, such as sums, means, or standard deviations, with the apply() function.

# apply(matrix, margin, function)  # if margin=1, the function is applied row-wise; if 2, column-wise
apply(scores, 1, sum)      # returns the row sums of the 'scores' matrix
##  [1] 80 37 40 42 50 82 64 83 47 68 82 43 96 62 76 86 56 57 40 59
apply(scores, 2, mean)     # returns the column means of the 'scores' matrix
## score1 score2 
##   31.5   31.0
apply(scores, 2, sd)       # returns the standard deviations for each column
##    score1    score2 
##  8.852594 12.316014
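
For sums and means specifically, base R also offers rowSums(), colSums(), rowMeans(), and colMeans(), while apply() remains the general tool for any other function. The sketch below compares the two on a small matrix; the name scores_small is hypothetical, and its score2 values are inferred from the row sums printed above rather than taken from the original data.

```r
# Hypothetical 4-row subset; score2 values are inferred, not printed in the text.
scores_small <- cbind(score1 = c(35, 23, 14, 17),
                      score2 = c(45, 14, 26, 25))

# Row sums: general apply() versus the dedicated helper.
row_totals_apply <- apply(scores_small, 1, sum)
row_totals_fast  <- rowSums(scores_small)       # 80 37 40 42

# Column means: general apply() versus the dedicated helper.
col_means_apply <- apply(scores_small, 2, mean)
col_means_fast  <- colMeans(scores_small)       # score1 = 22.25, score2 = 27.5

all.equal(row_totals_apply, row_totals_fast)    # TRUE
all.equal(col_means_apply, col_means_fast)      # TRUE

# apply() also accepts an anonymous function, here the spread (max - min) per column.
col_spread <- apply(scores_small, 2, function(x) max(x) - min(x))
col_spread                                      # score1 = 21, score2 = 31
```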

To compute a summary statistic of one variable within each category of a categorical variable, use the tapply() function. tapply() takes a vector, an index (i.e., one or more factors), and a function to apply. Note that the vector and the factor must have the same length.

# tapply(vector object, index, function)
tapply(data$score1, as.factor(data$group), mean)  # returns the mean of 'score1' for each 'group'
##  1  2 
## 27 36
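
Any function that takes a vector can be passed to tapply(), and extra arguments after the function are forwarded to it. The sketch below is self-contained and uses the score1 and group values printed above; the vector score_na, with one value replaced by NA, is a hypothetical variation added only to illustrate na.rm.

```r
score1 <- c(35, 23, 14, 17, 23, 35, 27, 33, 32, 31,
            34, 27, 51, 36, 39, 45, 31, 40, 25, 32)
group <- factor(rep(1:2, each = 10))

group_means   <- tapply(score1, group, mean)     # 27 and 36, as above
group_medians <- tapply(score1, group, median)   # 29 and 35
group_max     <- tapply(score1, group, max)      # 35 and 51

# Hypothetical variation: make the first observation missing.
score_na <- score1
score_na[1] <- NA

tapply(score_na, group, mean)                    # group 1 becomes NA
group_means_narm <- tapply(score_na, group, mean, na.rm = TRUE)
group_means_narm                                 # group 1 is now 235 / 9
```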

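To write code more concisely, you can use the with() function, which evaluates an expression using the columns of a data frame as variables, so you do not have to repeat the data$ prefix. For example: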

# with(data, expression)
with(data, mean(score1))   # returns the mean of 'score1' in 'data'
## [1] 31.5
with(data, tapply(score1, group, mean))   # returns the mean of 'score1' for each 'group' in 'data'
##  1  2 
## 27 36
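
As a further illustration, the sketch below contrasts the data$ form with the with() form on a self-contained data frame. The score1 and group values are those printed above; the score2 column is an assumption, inferred from the row sums of scores printed earlier.

```r
data <- data.frame(
  group  = rep(1:2, each = 10),
  score1 = c(35, 23, 14, 17, 23, 35, 27, 33, 32, 31,
             34, 27, 51, 36, 39, 45, 31, 40, 25, 32),
  # Assumption: inferred as (printed row sum) - score1.
  score2 = c(45, 14, 26, 25, 27, 47, 37, 50, 15, 37,
             48, 16, 45, 26, 37, 41, 25, 17, 15, 27)
)

# Mean of score1 in group 1, written with and without with().
m_dollar <- mean(data$score1[data$group == 1])
m_with   <- with(data, mean(score1[group == 1]))
identical(m_dollar, m_with)            # TRUE; both equal 27

# Expressions that involve several columns benefit the most.
r <- with(data, cor(score1, score2))
r                                      # about 0.4707, as in the matrix above
```

with() does not modify data; it only changes where the names in the expression are looked up.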