Kazerooni 2016

Excerpt from published methods

Patients were excluded from this study if they were missing weight values from any of the three measuring points of the study (pre-, mid-, and post-weight) … Pre-weight was defined as the most recent weight within 30 days before starting topiramate therapy. Mid-study period was defined as the earliest weight taken between 3 and 6 months post initiation of therapy. Post-study period was defined as the earliest weight taken between 6 and 12 months post initiation of therapy.

Translation in pseudocode

DEFINE time t_ij for person i IN 1:I, j weights In 1:J {pre-trt, mid-trt, post-trt}
FOR i IN 1:I
    FOR j IN 1:J
        weight_i1 := weight @ t_0 (treatment start) - 30 days
        weight_i2 := {t_0 + 90 days <= weight <= t_0 + 180 days}
        weight_i3 := {t_0 + 180 days < weight <= t_0 + 365 days}
    END FOR
    IF (weight_ij IS NULL) (weight measure missing at jth time point)
        EXCLUDE person i
END FOR

Algorithm in R Code

#' @title Kazerooni et al. 2016 Weight Cleaning Algorithm
#' @param DF object of class `data.frame`, containing `id` and `measures`
#' @param id string corresponding to the name of the column of patient identifiers in `DF`
#' @param measures string corresponding to the name of the column of measures in `DF`, e.g., numeric weight data if using to clean weight data.
#' @param tmeasures string corresponding to the name of the column of measure dates and/or times in `DF`
#' @param startPoint string corresponding to the name of the column in `DF` holding the time at which subsequent measurement dates will be assessed, should be the same for each person. Eg., if t = 0 (t[1]) corresponds to an index visit held by the variable 'VisitDate', then `startPoint` should be set to 'VisitDate'
#' @param t numeric vector of time points to collect measurements, eg. `c(0, 182.5, 365)` for measure collection at t = 0, t = 180 (6 months from t = 0), and t = 365 (1 year from t = 0). Default is `c(0, 182.5, 365)` according to Kazerooni et al. 2016
#' @param windows numeric list of two vectors of measurement collection windows to use around each time point in `t`. E.g. Kazerooni et al. 2016 use `c(30, 0, 0)` for the lower bound and `c(0, 0, 185)` for the upper bound at t of `c(0, 90, 180)`, implying that the closest measurement to t[1] (=0) will be within the window [-30, 0], then the closest to t[2] (=90) will be within [90, 180], t[3] (=180) within (180, 365]
Kazerooni2016.f <- function(DF,
                            id,
                            measures,
                            tmeasures,
                            startPoint,
                            t = c(0, 90, 180),
                            windows = list(LB = c(30, 0, 0),
                                           UB = c(0, 90, 185))) {
  
  if (!require(dplyr)) install.packages("dplyr")
  if (!require(rlang)) install.packages("rlang")
  
  tryCatch(
    if (class(DF[[tmeasures]])[1] != class(DF[[startPoint]])[1]) {
      stop(
            print(
                  paste0("date type of tmeasures (",
                  class(DF[[tmeasures]]),
                  ") != date type of startPoint (",
                  class(DF[[startPoint]])[1],
                  ")"
                 )
          )
      )
    }
  )
  
  tryCatch(
    if (class(t) != "numeric") {
      stop(
        print("t parameter must be a numeric vector")
      )
    }
  )
  
  tryCatch(
    if (!is.list(windows)) {
      stop(
        print("windows must be placed into a list object")
      )
    }
  )
  
  # compute difference in time between t0 and all t_j
  id         <- rlang::sym(id)
  tmeasures  <- rlang::sym(tmeasures)
  startPoint <- rlang::sym(startPoint) 
  
  DF <- DF %>%
    mutate(
      time = as.numeric(
        difftime(
          !!tmeasures, !!startPoint,
          tz = "utc", units = "days"
        )
      )
    )
  
  # loop through each time point in `t`, place into list
  meas_tn <- vector("list", length(t)) # set empty list
  for (i in 1:length(t)) {
    meas_tn[[i]] <- DF %>%
      filter(time >= t[i] - windows$LB[i] & time <= t[i] + windows$UB[i]) %>%
      group_by(!!id) %>%
      arrange(abs(time - t[i])) %>%
      slice(1)
  }
  
  # count number of time points available for each subject i
  do.call(rbind, meas_tn) %>%
    arrange(!!id, !!tmeasures) %>%
    group_by(!!id) %>%
    filter(max(row_number()) >= 3) %>% # must have all 3 time points
    ungroup()
}

Algorithm in SAS Code

#TODO

Example in R

Kazerooni2016.df <- Kazerooni2016.f(DF,
                                    id = "PatientICN",
                                    measures = "Weight",
                                    tmeasures = "WeightDateTime",
                                    startPoint = "VisitDateTime")

Distribution of Weight Measurements between Raw and Algorithm-Processed Values


 Descriptive statistics by group 
group: Input
   vars       n   mean   sd median trimmed   mad min    max  range skew
X1    1 1175995 207.82 48.6  202.3  204.62 44.18   0 1486.2 1486.2 0.98
   kurtosis   se
X1      5.6 0.04
------------------------------------------------------------ 
group: Output
   vars     n mean    sd median trimmed   mad min    max  range skew kurtosis
X1    1 71961  209 47.95    204  206.02 44.33   0 1233.7 1233.7 0.88     4.27
     se
X1 0.18

We won’t show a vignette for Kazerooni & Lim 2016, as it just excludes people, retaining only those with a certain number of measurements.

Left boxplot is raw data from 2016, PCP visit subjects while the right boxplot describes the output from running Kazerooni2016.f()