Chan & Raffa 2017

Excerpt from published methods

A three-step cleaning process was used to identify outlier weights recorded in the EHR. The first step was to exclude 70,636 (0.27%) weights in the EHR with values less than 22.7 kg (50 lbs) or greater than 340.2 kg (750 lbs), as these extreme values are implausible. The second step was to calculate each Veteran’s BMI using the average height, in order to identify the 45,318 weights that were similarly implausible (defined as a BMI ≤10 or BMI ≥100) for exclusion. Then, for the third step, each Veteran’s mean weight and standard deviation were determined from the remaining weights available. Weights in the EHR that were greater than 3 standard deviations from the mean, an additional 162,349 weights, were excluded. Overall, following the three-step cleaning procedure, a total of 278,303 (1.05%) weights were excluded from the analysis leaving 26,263,946 weights available for inclusion.
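
The three published steps can be sketched in base R as follows. This is only an illustration under stated assumptions, not the authors' code: the helper name `clean_weights_sketch` and the column names `id`, `weight_lb`, and `height_in` are hypothetical, and weights and heights are assumed to be recorded in pounds and inches and converted to kg and m for the BMI step.

```r
# Hypothetical helper: a base-R sketch of the three published cleaning steps.
# Column names (id, weight_lb, height_in) are assumptions for illustration.
clean_weights_sketch <- function(df, sd_threshold = 3) {
  w <- df$weight_lb

  # Step 1: drop weights below 50 lb (22.7 kg) or above 750 lb (340.2 kg)
  w[!is.na(w) & (w < 50 | w > 750)] <- NA

  # Step 2: BMI from each patient's average height; drop BMI <= 10 or >= 100
  mean_height_m <- ave(df$height_in, df$id,
                       FUN = function(h) mean(h, na.rm = TRUE)) * 0.0254
  bmi <- (w * 0.45359237) / mean_height_m^2
  w[!is.na(bmi) & (bmi <= 10 | bmi >= 100)] <- NA

  # Step 3: per-patient mean and SD of the remaining weights;
  # drop weights more than sd_threshold SDs from the mean
  m <- ave(w, df$id, FUN = function(x) mean(x, na.rm = TRUE))
  s <- ave(w, df$id, FUN = function(x) sd(x, na.rm = TRUE))
  too_far <- !is.na(w) & !is.na(s) & abs(w - m) > sd_threshold * s
  w[too_far] <- NA

  df$weight_clean <- w
  df
}

# Toy data (made up): patient "a" has one extreme weight among 12,
# patient "b" has weights outside the 50-750 lb range,
# patient "c" has an implausibly short recorded height.
toy <- data.frame(
  id        = c(rep("a", 12), "b", "b", "b", "c"),
  weight_lb = c(175:185, 400, 20, 200, 800, 150),
  height_in = c(rep(70, 15), 30)
)
out <- clean_weights_sketch(toy)
out[is.na(out$weight_clean), ]
```

The removed values are marked `NA` in a new column rather than dropped, so the rows excluded at each step can still be inspected.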

Algorithm in R Code

#' @title Chan 2017 Measurement Cleaning Algorithm
#' @param DF object of class `data.frame`, containing `id` and `measures`
#' @param id string corresponding to the name of the column of patient identifiers in `DF`
#' @param measures string corresponding to the name of the column of measurements in `DF`
#' @param tmeasures string corresponding to the name of the column of measurement collection dates or times in `DF`. If `tmeasures` is a date object, there may be more than one weight on the same day; if it is a precise datetime object, there may not be more than one weight with the same timestamp
#' @param outliers object of type `list` with numeric inputs giving the lower and upper bounds for plausible measurements, in that order. Default is `list(LB = 50, UB = 750)`, i.e., 50 lbs and 750 lbs
#' @param SDthreshold numeric scalar to be multiplied by the `SDMeasures` per `id`. E.g., from Chan 2017, "Weights in the EHR that were greater than 3 standard deviations from the mean ... were excluded" implies `SDthreshold = 3`
Chan2017.f <- function(DF,
                       id,
                       measures,
                       tmeasures,
                       outliers = list(LB = 50, UB = 750),
                       SDthreshold = 3) {
  
  # fail early instead of installing packages as a side effect
  if (!requireNamespace("data.table", quietly = TRUE)) {
    stop("package 'data.table' is required")
  }
  
  if (!is.numeric(DF[[measures]])) {
    stop("measure data must be a numeric vector")
  }
  
  if (!is.list(outliers)) {
    stop("outliers must be placed into a list object")
  }
  
  # convert to data.table
  DT <- data.table::as.data.table(DF)
  data.table::setkeyv(DT, id)
  
  # Step 1: set measurements outside the plausible range to NA
  # (outliers[[1]] is the lower bound, outliers[[2]] the upper bound)
  DT[,
     measures_aug_ := ifelse(get(measures) < outliers[[1]]
                             | get(measures) > outliers[[2]], 
                             NA,
                             get(measures))
     ]
  
  # calc mean of measures per group
  DT[, meanMeasures := mean(measures_aug_, na.rm = TRUE), by = id]
  # calc SD of measures per group
  DT[, SDMeasures := sd(measures_aug_,   na.rm = TRUE), by = id]
  
  # calc UB and LB
  DT[, LB := meanMeasures - (SDthreshold * SDMeasures)]
  DT[, UB := meanMeasures + (SDthreshold * SDMeasures)]
  
  # Step 2: outliers bounded by SDthreshold
  # (the third step in the published methods; the BMI step is not
  # implemented in this function, which takes no height input)
  DT[,
     measures_aug_ := ifelse(measures_aug_ < LB | measures_aug_ > UB,
                             NA, 
                             measures_aug_)
     ]
  
  # drop the working bounds; meanMeasures and SDMeasures are kept
  DT[, c("LB", "UB") := NULL]
  DT[]
}
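
One consequence of the per-patient SD rule is worth checking before applying it to patients with few measurements. With the sample standard deviation that R's `sd()` computes (n - 1 denominator), no value in a sample of size n can lie more than (n - 1) / sqrt(n) standard deviations from the mean. The sketch below, with hypothetical helper names `max_abs_z` and `z_extreme`, tabulates that bound and confirms it on made-up weights.

```r
# Largest possible |z| in a sample of size n when the SD uses the
# n - 1 denominator, as sd() does: (n - 1) / sqrt(n)
max_abs_z <- function(n) (n - 1) / sqrt(n)

n <- 2:15
bounds <- data.frame(n = n, max_abs_z = round(max_abs_z(n), 3))
bounds

# Smallest per-patient sample size at which a 3 SD rule can flag a value
min_n <- min(n[max_abs_z(n) > 3])
min_n

# Empirical check (made-up data): one extreme weight among otherwise
# identical weights attains the bound
z_extreme <- function(n) {
  w <- c(rep(180, n - 1), 700)
  max(abs(w - mean(w)) / sd(w))
}
z_extreme(10)  # below 3: the extreme weight would not be flagged
z_extreme(11)  # above 3: the extreme weight would be flagged
```

Under these assumptions, with `SDthreshold = 3` the SD step can only remove a weight from a patient who has at least 11 non-missing measurements after step 1; for smaller samples, a lower `SDthreshold` would be needed for the step to have any effect.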

Algorithm in SAS Code

Example in R

A vignette of 16 selected patients, each with at least one weight observation removed.

Distribution of Weight Measurements between Raw and Algorithm-Processed Values


 Descriptive statistics by group 
group: Input
   vars       n   mean    sd median trimmed   mad min    max  range skew kurtosis   se
X1    1 1175995 207.82  48.6  202.3  204.62 44.18   0 1486.2 1486.2 0.98      5.6 0.04
------------------------------------------------------------ 
group: Output
   vars       n   mean    sd median trimmed   mad min    max  range skew kurtosis   se
X1    1 1170114 207.86 48.27  202.4  204.66 44.18  54  727.5  673.5 0.81     1.45 0.04

The left boxplot shows the raw data from the 2016 PCP visit subjects, while the right boxplot shows the output from running Chan2017.f().