Noel 2012

Excerpt from published methods

… we used BMI derived from heights and weights obtained during routine clinical encounters. These data are stored in facility information systems and uploaded into the Corporate Data Warehouse; validation work indicates that some of these height and weight values probably reflect data entry errors. Therefore, we used an iterative process to eliminate or control for height and weight outliers while avoiding suspect BMI values. In specifying the original cohort of obese primary care patients, we removed biologically “implausible” values (i.e., weights ≤70 lbs or ≥700 lbs and heights ≤48 inches or ≥84 inches). We then divided each of the five study years into quarters and determined the median value for weights recorded during each quarter for every patient, yielding up to 20 quarterly median weights.

Translation in Pseudocode

DEFINE weight_ij for person i IN 1:I and measurement j IN 1:J, recorded at time t_ij
Divide the time vector into q IN 1:Q Fiscal Quarters
FOR i IN 1:I
    FOR j IN 1:J
        IF (weight_ij > 70 lbs. & weight_ij < 700 lbs.)
            weight_ij := weight @ t_ij
        ELSE
            weight_ij := NA
        END IF
    END FOR
    FOR q IN 1:Q
        weight_iq := MEDIAN({weight_ij : t_ij within quarter q}) (non-missing weights only)
    END FOR
END FOR
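
To make the pseudocode concrete, here is a small hand-checkable sketch in base R. The toy data, the column names (`id`, `quarter`, `weight`, `weight_clean`) and the pre-assigned quarter labels are made up for illustration; the strict bounds follow the IF condition in the pseudocode.

```r
# Toy data (hypothetical): two patients, weights in lbs, quarter labels already assigned
toy <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2, 2),
  quarter = c("Q1", "Q1", "Q1", "Q2", "Q1", "Q1", "Q2"),
  weight  = c(180, 185, 1800, 182, 250, 65, 248)
)

# Keep only weights strictly between 70 and 700 lbs; everything else becomes NA
toy$weight_clean <- ifelse(toy$weight > 70 & toy$weight < 700, toy$weight, NA)

# Median of the remaining weights per patient and quarter
# (the formula interface of aggregate() drops rows with NA by default)
quarterly <- aggregate(weight_clean ~ id + quarter, data = toy, FUN = median)
quarterly <- quarterly[order(quarterly$id, quarterly$quarter), ]
print(quarterly)
```

The 1800 lb and 65 lb entries are dropped before the medians are taken, so patient 1 in Q1 is summarized by the median of 180 and 185 only.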

This algorithm is very similar to Jackson et al. 2015, with the windows here defined by fiscal quarters. If you know the exact start and end dates of the fiscal quarters you want to use, you could use Jackson2015.f to get something similar. Here I build the algorithm to handle a fiscal year starting in any month and let the function convert measurement dates to fiscal quarters automatically.
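
As a rough illustration of what converting dates to fiscal quarters involves, the following base R sketch assigns a fiscal year and quarter to each date. The helper name `fiscal_quarter()` and its output columns are hypothetical and not part of the algorithm below, which relies on `lubridate::quarter()` instead. The sketch assumes the Federal convention that a fiscal year is named for the calendar year in which it ends.

```r
# Hypothetical helper: map dates to fiscal year and fiscal quarter in base R.
# fiscal_start is the month (1-12) in which the fiscal year begins.
fiscal_quarter <- function(dates, fiscal_start = 10) {
  dates <- as.Date(dates)
  m <- as.integer(format(dates, "%m"))
  y <- as.integer(format(dates, "%Y"))

  # Months elapsed since the start of the fiscal year (0 to 11)
  shifted <- (m - fiscal_start) %% 12
  fq <- shifted %/% 3 + 1

  # Assumed convention: months on or after the start month belong to the
  # fiscal year named for the following calendar year
  fy <- ifelse(fiscal_start > 1 & m >= fiscal_start, y + 1L, y)

  data.frame(date = dates, FY = fy, FQ = fq)
}

fq <- fiscal_quarter(c("2008-09-30", "2008-10-01", "2009-01-15"))
print(fq)
```

With the default start month of October, 30 September 2008 falls in the fourth quarter of fiscal year 2008, while 1 October 2008 opens the first quarter of fiscal year 2009.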

Algorithm in R Code

#' @title Noel 2012 Measurement Cleaning Algorithm
#' @param DF object of class `data.frame`, containing `id`, `measures`, and `tmeasures`
#' @param id string corresponding to the name of the column of patient identifiers in `DF`
#' @param measures string corresponding to the name of the column of measurements in `DF`
#' @param tmeasures string corresponding to the name of the column of measurement collection dates or times in `DF`. If `tmeasures` is a date object, there may be more than one weight on the same day; if it is a precise datetime object, there may not be more than one weight on the same day
#' @param outliers numeric vector of length 2 giving the lower and upper bound for plausible measurements; values outside these bounds are set to `NA`. Default is `c(70, 700)`
#' @param fiscal_start integer to be passed to `lubridate::quarter()`. Defaults to 10, indicating October, the Federal Fiscal Year starting month.
#TODO: add @param collapse = FALSE - aggregate to Fiscal Quarter? Default is FALSE, returning `DF` appended with `Qmedian` to be collapsed after calling the function.
Noel2012.f <- function(DF,
                       id,
                       measures,
                       tmeasures,
                       outliers = c(70, 700),
                       fiscal_start = 10) {
  
  # Both packages are called via `::` below, so they only need to be installed
  for (pkg in c("data.table", "lubridate")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("package '", pkg, "' is required; install it with install.packages()")
    }
  }
  
  if (!is.numeric(DF[[measures]])) {
    stop("measure data must be a numeric vector")
  }
  
  if (!is.numeric(outliers) || length(outliers) != 2) {
    stop("outliers must be a numeric vector of length 2: c(lower, upper)")
  }
  
  # convert to data.table
  DT <- data.table::as.data.table(DF)
  
  # Step 1: Set outliers to NA (values exactly equal to a bound are kept)
  DT[,
     output := ifelse(get(measures) < outliers[1]
                      | get(measures) > outliers[2], 
                      NA,
                      get(measures))
     ]
  
  # Step 2: Set Fiscal Years and Quarters
  DT[,
     FYQ := lubridate::quarter(get(tmeasures),
                               with_year = TRUE,
                               fiscal_start = fiscal_start)
     ]
  
  # Step 3: median weight by ID, Fiscal Year and Quarter, appended to every row
  key_cols <- c(id, "FYQ")
  data.table::setkeyv(DT, key_cols)
  # as.numeric() keeps the column type consistent across groups
  DT[, Qmedian := as.numeric(median(output, na.rm = TRUE)), by = key_cols]
  DT
}

Algorithm in SAS Code

#TODO

Example in R

Noel2012.df <- Noel2012.f(DF,
                          id = "PatientICN",
                          measures = "Weight",
                          tmeasures = "WeightDateTime")
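
Because the function returns one row per measurement with `Qmedian` repeated within each patient-quarter, the result still has to be collapsed to get one quarterly median per patient. A minimal base R sketch of that step follows; the data frame `out` is a hypothetical stand-in for the function's output, and the `FYQ` values assume the numeric year.quarter form.

```r
# Hypothetical stand-in for the output of Noel2012.f(): one row per measurement,
# with the quarterly median repeated on every row of the same patient-quarter
out <- data.frame(
  PatientICN = c("A", "A", "A", "B", "B"),
  FYQ        = c(2009.1, 2009.1, 2009.2, 2009.1, 2009.1),
  Weight     = c(180, 185, 182, 250, 254),
  Qmedian    = c(182.5, 182.5, 182, 252, 252)
)

# Keep one row per patient and fiscal quarter
collapsed <- unique(out[, c("PatientICN", "FYQ", "Qmedian")])
rownames(collapsed) <- NULL
print(collapsed)
```

Dropping the raw `Weight` column before calling `unique()` is what allows the repeated rows to be recognized as duplicates.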

Displaying a Vignette of 16 selected patients.

Distribution of Weight Measurements between Raw and Algorithm-Processed Values


 Descriptive statistics by group 
group: Input
   vars       n   mean    sd median trimmed   mad min    max  range skew kurtosis   se
X1    1 1175995 207.82  48.6  202.3  204.62 44.18   0 1486.2 1486.2 0.98      5.6 0.04
------------------------------------------------------------ 
group: Output
   vars       n   mean    sd median trimmed   mad min    max  range skew kurtosis   se
X1    1  683008 207.25 46.13    202  204.22 42.25  70  588.9  518.9  0.8     1.44 0.06

The left boxplot shows the raw data for the 2008 PCP visit subjects, while the right boxplot shows the output from running Noel2012.f().