5.1 Summarizing categorical data

To summarize a categorical variable, compute the frequency (N) and proportion (%) of each value of that variable, along with the number of missing values. For example, the code below summarizes the variable income.

# Frequency (N)
# Include useNA = "ifany" to see the number of missing values
table(mydat$income, useNA = "ifany")

## 
##            < $25,000 $25,000 to < $55,000             $55,000+                 <NA> 
##                   76                   86                   67                   21

# Proportion (%) of all values, including missing values
prop.table(table(mydat$income, useNA = "ifany"))

## 
##            < $25,000 $25,000 to < $55,000             $55,000+                 <NA> 
##                0.304                0.344                0.268                0.084

# Proportion (%) of non-missing values
prop.table(table(mydat$income))

## 
##            < $25,000 $25,000 to < $55,000             $55,000+ 
##               0.3319               0.3755               0.2926
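
The two sets of proportions differ only in their denominator: all 250 values versus the 229 non-missing values. The sketch below illustrates this with a stand-in vector rebuilt from the counts shown above, since the original dataset is not reproduced here. The names income_levels and income_demo are hypothetical, and the result is rescaled to percentages purely for illustration.

```r
# Hypothetical stand-in for mydat$income, rebuilt from the counts shown above
income_levels <- c("< $25,000", "$25,000 to < $55,000", "$55,000+")
income_demo <- factor(
  rep(c(income_levels, NA), times = c(76, 86, 67, 21)),
  levels = income_levels
)

# Frequencies, including the missing values
n_all <- table(income_demo, useNA = "ifany")

# Percentages (0 to 100) with all values in the denominator
pct_all <- round(100 * prop.table(n_all), 1)

# Percentages (0 to 100) with only non-missing values in the denominator
pct_nonmiss <- round(100 * prop.table(table(income_demo)), 1)

pct_all
pct_nonmiss
```

Which denominator to report depends on the question: the first describes the whole sample, the second describes only those who reported an income.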

It would be nice to have all this information in one summary. Let’s write a function that does that.

myfun_cat <- function(x) {
  # Count the number of missing values
  nmiss <- sum(is.na(x))
  # Frequency (table() excludes missing values by default)
  n     <- table(x)
  # Proportion of non-missing values
  p     <- prop.table(n)
  # Putting it together
  OUT   <- cbind(n, p)
  # Add nmiss, but first pad to have the right number of rows
  nmiss <- c(nmiss, rep(NA, nrow(OUT)-1))
  OUT   <- cbind(OUT, nmiss)
  return(OUT)
}

myfun_cat(mydat$income)

##                       n      p nmiss
## < $25,000            76 0.3319    21
## $25,000 to < $55,000 86 0.3755    NA
## $55,000+             67 0.2926    NA

# NOTE: nmiss is NOT the number of missing values at each level
# (a missing value has no level, so that would not make sense).
# It is the overall number of missing values of income.
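
One possible variation, sketched here under stated assumptions, avoids the NA padding by giving the missing values their own row. The function name summarize_cat, the "(Missing)" label, and the stand-in vector income_demo are all hypothetical. This is an alternative layout, not a replacement for myfun_cat().

```r
# Hypothetical alternative to myfun_cat(): missing values get their own row
summarize_cat <- function(x) {
  n <- table(x)        # frequencies of non-missing values
  p <- prop.table(n)   # proportions among non-missing values
  out <- data.frame(
    level = as.character(names(n)),
    n     = as.vector(n),
    p     = as.vector(p),
    stringsAsFactors = FALSE
  )
  # Append the overall missing count; its proportion is left as NA
  miss <- data.frame(
    level = "(Missing)",
    n     = sum(is.na(x)),
    p     = NA_real_,
    stringsAsFactors = FALSE
  )
  rbind(out, miss)
}

# Hypothetical stand-in for mydat$income, rebuilt from the counts shown earlier
income_levels <- c("< $25,000", "$25,000 to < $55,000", "$55,000+")
income_demo <- factor(
  rep(c(income_levels, NA), times = c(76, 86, 67, 21)),
  levels = income_levels
)

res <- summarize_cat(income_demo)
res
```

Returning a data frame rather than a matrix lets the level labels live in an ordinary column, which may be easier to pass on to table-formatting functions.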