5.1 Summarizing categorical data
To summarize a categorical variable, compute the frequency (N) and proportion (%) of each value of that variable, along with the number of missing values. For example, summarize the variable income.
# Frequency (N)
# Include "exclude = NULL" to see the number of missing values
table(mydat$income, exclude = NULL)
##
##            < $25,000 $25,000 to < $55,000             $55,000+                 <NA>
##                   76                   86                   67                   21
# Proportion (%) of all values
prop.table(table(mydat$income, exclude = NULL))
##
##            < $25,000 $25,000 to < $55,000             $55,000+                 <NA>
##                0.304                0.344                0.268                0.084
# Proportion (%) of non-missing values
prop.table(table(mydat$income))
##
##            < $25,000 $25,000 to < $55,000             $55,000+
##               0.3319               0.3755               0.2926
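The same three summaries can be tried on a small made-up factor, which may help when mydat is not at hand. This is only a sketch: income_toy and its values are hypothetical and are not part of the dataset used above. It also uses the useNA = "ifany" argument of table(), which is another way to ask for the missing-value count.

```r
# Hypothetical toy data, NOT the mydat dataset used in the text
income_toy <- factor(
  c("low", "mid", "high", "mid", NA, "low", "mid", NA),
  levels = c("low", "mid", "high")
)

# Frequency (N), including the number of missing values
n_all <- table(income_toy, useNA = "ifany")
n_all

# Proportion of all values (missing values count toward the denominator)
p_all <- prop.table(n_all)
round(p_all, 3)

# Proportion of non-missing values only
p_obs <- prop.table(table(income_toy))
round(p_obs, 3)
```

Including the missing values changes the denominator, so the proportions of the observed levels are smaller in p_all than in p_obs.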
It would be nice to have all this information in one summary. Let’s write a function that does that.
myfun_cat <- function(x) {
  # Count the number of missing values
  nmiss <- sum(is.na(x))
  # Frequency
  n <- table(x)
  # Proportion
  p <- prop.table(n)
  # Putting it together
  OUT <- cbind(n, p)
  # Add nmiss, but first pad to have the right number of rows
  nmiss <- c(nmiss, rep(NA, nrow(OUT) - 1))
  OUT <- cbind(OUT, nmiss)
  return(OUT)
}
myfun_cat(mydat$income)
##                       n      p nmiss
## < $25,000            76 0.3319    21
## $25,000 to < $55,000 86 0.3755    NA
## $55,000+             67 0.2926    NA
# NOTE: nmiss is NOT the number of missing values at each level
# (which actually does not make sense)
# It is the number of missing values of income overall.
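A function like this can also be applied to several categorical variables at once. The sketch below assumes a hypothetical data frame, toy, with made-up columns income and smoker. It repeats the definition of myfun_cat so that it runs on its own, and uses lapply() to build one summary per column.

```r
# Function from the text, repeated so this block runs on its own
myfun_cat <- function(x) {
  nmiss <- sum(is.na(x))                      # number of missing values
  n     <- table(x)                           # frequency
  p     <- prop.table(n)                      # proportion of non-missing values
  OUT   <- cbind(n, p)
  nmiss <- c(nmiss, rep(NA, nrow(OUT) - 1))   # pad to the number of rows
  OUT   <- cbind(OUT, nmiss)
  return(OUT)
}

# Hypothetical toy data frame, NOT the mydat dataset used in the text
toy <- data.frame(
  income = factor(c("low", "mid", "high", "mid", NA, "low"),
                  levels = c("low", "mid", "high")),
  smoker = factor(c("no", "yes", "no", NA, NA, "no"))
)

# A data frame is a list of columns, so lapply() returns
# a named list with one summary matrix per variable
summaries <- lapply(toy, myfun_cat)
summaries$income
summaries$smoker
```

As before, the nmiss column holds the overall number of missing values for each variable in its first row and NA elsewhere.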