12 Creating summary statistics

Data Wrangling Recipes in R: Creating summary statistics: Tristan Naidoo & Hilary Watt

12.1 Reducing data (summarize) from several lines per cluster to one summary line

This section uses syntax from Tidyverse’s dplyr. They require either library(tidyverse) or library(dplyr).

Use this when you have one line per repeated measurement and want to reduce this to a single summary measure per group.

For example, consider the data below which measures two consecutive husband anxiety (gad) and depression scores (phq).

print(ad_df)

To reduce the data to a single line per husband the syntax below can be used:

# To summarize the tibble, we need to choose what to do for each variables. 
# We can ignore familyid since we are grouping by it.
ad_family_grouped_df <- group_by(ad_df, familyid) 

ad_family_summarised_df <- summarize(
    ad_family_grouped_df,   # name dataframe to be transformed
    timept = max(timept), # string in the group starting with the latest letter
    phq = mean(phq, na.rm=TRUE), # mean of phq within familyid
    gad = mean(gad, na.rm=TRUE), # mean of gad within familyid
    age_husband = max(age_husband, na.rm=TRUE), # the oldest the husband
    ethnic = min(ethnic, na.rm=TRUE) # string in the group starting with the 
                                     # earliest letter
    ) 

We decide what to do for each variable in the summarize statement. You should replace familyid with your grouping variable (which could be gp_practice_no or familyno or similar). Although our data does not have any missing values, for purposes of illustration, in each function used in the summarize statement you can exclude rows with missing values by specifying na.rm=TRUE. Only columns specified in the summarize statement will be returned. The result of this statement is shown below. We now have one row per husband where the data has been combined using the functions specified in the summarize statement.

print(ad_family_summarised_df)

We can specify some other summary measurement if we prefer. For example, we may want to report the min phq and the max gad. The summarize statement is very flexible; you can use any function that returns a single value given a set of values.

ad_family_summarised_df <- summarize(
    ad_family_grouped_df, # name dataframe to be transformed
    timept = max(timept), # string in the group starting with the latest letter
    phq = min(phq, na.rm=TRUE), # min of phq within familyid
    gad = max(gad, na.rm=TRUE), # max of gad within familyid
    age_husband = max(age_husband, na.rm=TRUE), # the oldest the husband
    ethnic = min(ethnic, na.rm=TRUE) # string in the group starting with the
                                     # earliest letter
    ) 

print(ad_family_summarised_df)

We can choose what to name each output column in the summarize statement. In the statements above we chose the same names as the original columns. If we want to report two summary measures using the same column, then it’s good practice to rename both columns.

ad_family_summarised_df <- summarize(
    ad_family_grouped_df, # name dataframe to be transformed
    timept = max(timept), # string in the group starting with the latest letter
    mean_phq = mean(phq, na.rm=TRUE), # mean of phq within familyid
    min_phq = min(phq, na.rm=TRUE), # min of phq within familyid
    gad = max(gad, na.rm=TRUE), # max of gad within familyid
    age_husband = max(age_husband, na.rm=TRUE), # the oldest the husband
    ethnic = min(ethnic, na.rm=TRUE) # string in the group starting with the
                                     # earliest letter
    ) 

print(ad_family_summarised_df)

See help(summarize) for more information and for useful summary functions.

The mode function is not routinely available, but here is a function that gives the mode

# The function below, counts the number of rows at each level of the factor and
# returns the factor with the max
mode <- function(x){levels(x)[(which.max(tabulate(x)))]} 

12.2 Summary statistics variables without changing data structure

This section uses syntax from Tidyverse’s dplyr. They require either library(tidyverse) or library(dplyr).

Generating new variables which are summary statistics of one variable, or summary statistics for each variable across a number of variables can be done using group_by and mutate. The syntax below creates a new column called mean_phq which returns the mean phq for each family.

ad_mutate_df <- mutate(
    ad_family_grouped_df, # name of dataframe to be transformed
    mean_phq=mean(phq) # new column we want to create
    ) 

Unlike summarize, we only need to specify the new column we want to create. All existing columns will also be included in the output. The value will be the same for each group that the summary statistic is calculated for.

print(ad_mutate_df)

We can also create new variables which are functions of more than one existing variable. In addition, we can change the grouping variable. The syntax below returns the average sum of the phq and gad score by ethnicity.

# Grouping by ethnicity
ad_ethnic_grouped_df <- group_by(ad_df, ethnic)

ad_mutate_df <- mutate(
    ad_ethnic_grouped_df, # name of dataframe to be transformed
    mean_phq_gad = mean(phq+gad) # new column we want to create
    )  

The mean_phq_gad variable is now constant across each ethnic group since that was our grouping variable.

print(ad_mutate_df)

To use the mutate statement row-wise, you need to use the rowwise statement. This tells the mutate statement that you want to apply your function to each row. The syntax below calculates the mean of the phq and gad score per row. We use c() since we need to give the mean function a vector of values.

ad_rowwise_df <- rowwise(ad_df)

ad_mutate_df <- mutate(
    ad_rowwise_df, # name of dataframe to be transformed
    mean_phq_gad = mean(c(phq, gad)) # new column we want to create
)

Now each row has a row-specific mean_phq_gad score.

print(ad_mutate_df)

See help(mutate) for more information and for useful mutate functions.

12.3 Tidying up data with (occasional) duplication of patientid

See earlier section 8.1 which cover checking for and removing duplicate ID variables. As previously acknowledged, some datasets accidentally contain a few lines of data with the same patientid. If we want some information from each line, then be aware that we can use the above section on reducing datasets to one line per cluster for this purpose.

Note that section 8.1 gives syntax to inspect rows of data that have duplicated ID variables.

12.4 Exercises

  1. With the puppies.csv dataset, find the mean birth weight of each litter, using the mutate and group_by (or rowwise) command with the mean function.

  2. Read the wife_deprn.csv dataset in and (without reshaping/ pivoting) use the rowwise and mutate command with the mean function to find mean depression score in each woman.

  3. Read the wife_deprn.csv dataset in again and use the summarize command to find the mean depression score for each woman, and also record the number of scores used to create the average (hint, you can use the n() function to count the number of values in a group). Use help(summarize) to see what other default summary statistics are available. Notice that you now have only one line of data per woman, and probably fewer variables.

  4. With ihddata.csv, you want variables which say whether any member of the household has had CVD added to this dataset. How would you go about producing this?

  5. Suppose you want to create a new variable which says whether each household member has had either diagnosed high BP, or CVD. How do you create this?

  6. Look at the couple_deprn_anxiety.csv dataset. Note that there are a few lines of data with one familyid (=255), yet we only expect one line per family. Tidy up the dataset, so that there is just one line for this familyid – the summarize command might be useful here.


Datasets are available here: https://github.com/hcwatt/data_wrangling_open.

Data Wrangling Recipes in R: Hilary Watt and Tristan Naidoo. Imperial College London.