Goodrich 2016
Goodrich et al. take a more conservative approach to weight cleaning: instead of setting outliers to missing, they remove an individual from consideration entirely if that individual fails any of the four criteria (see the excerpt from the published methods).
Excerpt from published methods
…cohort criteria required participants to have a baseline weight documented within 1 month before or after MOVE! enrollment (index date) and at least one follow-up weight at 6 or 12 months after enrollment…Participant records with implausible values for weights (<80 lb or >500 lb) and heights (<48 inches or >84 inches) were excluded, as were those with implausible 6- and 12-month weight changes (>100 lb)
Breakdown of weight measurement removal from cohort:
- Missing enrollment weight (n = 12,589)
- No 6- or 12-month weights (n = 12,870)
- 100+ lb. weight change (n = 3,351)
- Outlier baseline weights (n = 80)
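The counts above can be tallied with a few lines of R. This is only a sketch: it assumes the four categories were applied one after another and do not overlap, which the excerpt does not state explicitly, and the vector names are illustrative labels rather than names from the paper.

```r
# Exclusion counts as listed above; the names are illustrative labels only
excluded <- c(
  missing_enrollment_weight = 12589,
  no_followup_weight        = 12870,
  weight_change_100lb       = 3351,
  outlier_baseline_weight   = 80
)

# Total number of people excluded, ASSUMING the categories do not overlap
total_excluded <- sum(excluded)

# Share of all exclusions attributable to each criterion, in percent
share <- round(100 * excluded / total_excluded, 1)

print(total_excluded)
print(share)
```

Under that assumption, missing baseline or follow-up weights account for most of the exclusions, while implausible values and large weight changes remove comparatively few people.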
Translation in pseudocode
DEFINE time t_ij for person i IN 1:I, j weights IN 1:J {0mo., 6mo., 12mo.}
FOR i IN 1:I
    weight_i1 := weight @ t_i1 +/- 30 days (baseline)
    IF (weight_i1 == NA)
        EXCLUDE person i
    IF (weight_i1 < 80 lbs. OR weight_i1 > 500 lbs.)
        EXCLUDE person i
    FOR j IN 2:J
        weight_ij := weight @ t_ij +/- 60 days
        IF (weight_ij < 80 lbs. OR weight_ij > 500 lbs.)
            EXCLUDE person i
    END FOR
    IF (weight_ij == NA for all j IN 2:J)
        EXCLUDE person i
    FOR j IN 1:(J - 1)
        IF (abs(weight_{i, j+1} - weight_{ij}) > 100 lbs.)
            EXCLUDE person i
    END FOR
END FOR
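As a minimal, self-contained illustration of the excerpt and pseudocode above, the person-level rules can be sketched in base R. The data, the column names (w0, w6, w12), and the helper keep_person() are hypothetical and exist only for this example; weights are assumed to have already been matched to their collection windows, and the weight-change rule is applied to consecutive observed weights.

```r
# Toy data (hypothetical): one row per person, weights in lb at 0, 6, and 12
# months after windowing; NA means no weight was found in that window
wts <- data.frame(
  id  = c("A", "B", "C", "D", "E"),
  w0  = c(200,  NA, 510, 250, 180),
  w6  = c(195, 210, 480, 140,  NA),
  w12 = c(190, 205, 470, 150,  NA)
)

# Hypothetical helper: TRUE if the person passes every cohort criterion
keep_person <- function(w, lb = 80, ub = 500, max_change = 100) {
  baseline <- w[1]
  followup <- w[-1]
  # a baseline weight must exist
  if (is.na(baseline)) return(FALSE)
  # at least one follow-up weight must exist
  if (all(is.na(followup))) return(FALSE)
  # any observed weight below lb or above ub is implausible
  obs <- w[!is.na(w)]
  if (any(obs < lb | obs > ub)) return(FALSE)
  # change between consecutive observed weights must not exceed max_change
  if (any(abs(diff(obs)) > max_change)) return(FALSE)
  TRUE
}

keep   <- apply(wts[, c("w0", "w6", "w12")], 1, keep_person)
cohort <- wts[keep, ]
print(cohort)
```

In this toy data, B has no baseline weight, C has an implausible baseline, D changes by more than 100 lb between baseline and 6 months, and E has no follow-up weight, so only A remains in the cohort.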
Algorithm in R Code
This code builds on the Janney et al. 2016 implementation (Janney2016.f), adding only the weight-change step and adjusting the outlier bounds and collection windows to match the criteria of Goodrich et al.
#----------------------- weight change moving forward --------------------------
#' @param DF object of class `data.frame`, containing `id` and `measures`
#' @param id string corresponding to the name of the column of patient identifiers in `DF`
#' @param measures string corresponding to the name of the column of measures in `DF`, e.g., numeric weight data if using to clean weight data
#' @param tmeasures string corresponding to the name of the column of measure dates and/or times in `DF`
#' @param wtchng_thresh numeric scalar used as a cutoff for higher than (or lower than) expected weight changes from one time point j to time point j + 1, by person or group. Default is 100.
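lookForwardAndRemove.f <- function(DF,
                                   id,
                                   measures,
                                   tmeasures,
                                   wtchng_thresh = 100) {

  # attach dependencies, stopping with a clear message if one is not installed
  for (pkg in c("data.table", "dplyr")) {
    if (!require(pkg, character.only = TRUE)) {
      stop("Package '", pkg, "' is required but not installed.")
    }
  }

  # convert to data.table (copies DF) and sort by person, then measurement time
  DT <- data.table::as.data.table(DF)
  setkeyv(DT, id)
  setorderv(DT, c(id, tmeasures))

  # fast lead with data.table: the next measurement within each person
  DT[,
     "forward" := shift(get(measures),
                        n    = 1L,
                        fill = NA,
                        type = "lead"),
     by = id
  ]

  # flag measurements that differ from the next one by more than wtchng_thresh;
  # the last measurement per person has no lead, so its flag is NA
  DT$outlier <- abs(DT[[measures]] - DT[["forward"]]) > wtchng_thresh

  # set flagged measurements to NA; all other measurements are kept as they are
  DT$output <- ifelse(DT$outlier %in% TRUE, NA_real_, DT[[measures]])

  DT
}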
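To see what the look-forward step does without installing anything, the same idea can be sketched in base R on made-up data. This is an illustrative re-implementation, not the function above: the data frame toy and its columns are hypothetical, and ave() stands in for the data.table lead.

```r
# Toy long-format data (hypothetical ids, dates, and weights in lb)
toy <- data.frame(
  id     = c(1, 1, 1, 2, 2),
  date   = as.Date(c("2016-01-01", "2016-07-01", "2017-01-01",
                     "2016-02-01", "2016-08-01")),
  weight = c(200, 320, 310, 180, 175)
)

# sort within person by date, mirroring the setorderv() call
toy <- toy[order(toy$id, toy$date), ]

# lead (next) weight within each person; the last row per person gets NA
toy$forward <- ave(toy$weight, toy$id,
                   FUN = function(x) c(x[-1], NA_real_))

# flag changes larger than the threshold and blank out the flagged weight
wtchng_thresh <- 100
toy$outlier <- abs(toy$weight - toy$forward) > wtchng_thresh
toy$output  <- ifelse(toy$outlier %in% TRUE, NA_real_, toy$weight)

print(toy)
```

Note that the flag lands on the earlier measurement of an offending pair (here the 200 lb value before the jump to 320 lb), regardless of which of the two values is actually the erroneous one.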
#-------------------- Add weight change to Janney2016.f ---------------------
#' @title Goodrich et al. 2016 Measurement Cleaning Algorithm
#' @param DF object of class `data.frame`, containing `id` and `measures`
#' @param id string corresponding to the name of the column of patient IDs in `DF`
#' @param measures string corresponding to the name of the column of measures in `DF`, e.g., numeric weight data if using to clean weight data
#' @param tmeasures string corresponding to the name of the column of measure dates and/or times in `DF`
#' @param startPoint string corresponding to the name of the column in `DF` holding the time from which subsequent measurement dates will be assessed; should be the same for each person. E.g., if t = 0 (t[1]) corresponds to an index visit held by the variable 'VisitDate', then startPoint should be set to 'VisitDate'
#' @param t numeric vector of time points, in days from `startPoint`, at which to collect measurements, e.g., c(0, 182, 365) for measure collection at t = 0, t = 182 (6 months from t = 0), and t = 365 (1 year from t = 0). Default is c(0, 182, 365).
#' @param windows numeric vector of measurement collection windows to use around each time point in `t`. E.g., Janney et al. 2016 use c(30, 60, 60) for t of c(0, 182.5, 365), meaning that the measurement closest to t = 0 is taken from within 30 days before to 30 days after startPoint. Subsequent measurements are taken from within 60 days before to 60 days after t0+182.5 days and t0+365 days
#' @param outliers optional. object of type list with numeric inputs corresponding to the lower and upper bound for each time entry in parameter `t`. Default is list(LB = c(80, 80, 80), UB = c(500, 500, 500)) for t = c(0, 182, 365), i.e., the same bounds at baseline and at each subsequent measurement collection date. If not specified then only the subsetting and window functions will be applied.
#' @param wtchng_thresh numeric scalar used as a cutoff for higher than (or lower than) expected weight changes from one time point j to time point j + 1, by person or group. Default is 100.
#' @param excludeSubject logical. If TRUE, every person with at least one measurement set to NA in `output` is removed entirely. If FALSE, flagged measurements are set to NA in `output` and all persons are kept. Default is FALSE.
Goodrich2016.f <- function(DF,
                           id,
                           measures,
                           tmeasures,
                           startPoint,
                           t = c(0, 182, 365),
                           windows = c(30, 60, 60),
                           outliers = list(LB = c(80, 80, 80),
                                           UB = c(500, 500, 500)),
                           wtchng_thresh = 100,
                           excludeSubject = FALSE) {

  # step 1: collection windows and outlier bounds, reusing Janney2016.f
  WindowsAndOutliers.df <-
    Janney2016.f(
      DF,
      id,
      measures,
      tmeasures,
      startPoint,
      t = t,
      windows = windows,
      outliers = outliers
    )

  # step 2: weight change between consecutive measurements;
  # "Weight_OR" is taken to be the outlier-cleaned column returned by Janney2016.f
  lookForwardAndRemove.df <-
    lookForwardAndRemove.f(
      DF = WindowsAndOutliers.df,
      id = id,
      measures = "Weight_OR",
      tmeasures = tmeasures,
      wtchng_thresh = wtchng_thresh
    )

  if (excludeSubject) {
    # persons with at least one measurement set to NA in `output`
    flagged.df <- lookForwardAndRemove.df %>%
      filter(is.na(output)) %>%
      select(all_of(id)) %>%
      distinct()

    # drop every row belonging to a flagged person
    excluded.df <- lookForwardAndRemove.df %>%
      anti_join(flagged.df, by = id)

    return(excluded.df)
  } else {
    return(lookForwardAndRemove.df)
  }
}
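The difference between the two settings of excludeSubject can be illustrated on a small made-up result. The sketch below uses base R only; the data frame cleaned is hypothetical and stands in for the output of the look-forward step, where NA in output marks a measurement that failed a check.

```r
# Hypothetical output of the look-forward step: NA in `output` marks a
# measurement that failed the outlier or weight-change check
cleaned <- data.frame(
  PatientICN = c(1, 1, 1, 2, 2, 3),
  output     = c(NA, 320, 310, 180, 175, 240)
)

# excludeSubject = FALSE: keep everyone, flagged measurements stay as NA
measurement_level <- cleaned

# excludeSubject = TRUE: drop every row of any person with a flagged measurement
flagged_ids  <- unique(cleaned$PatientICN[is.na(cleaned$output)])
person_level <- cleaned[!(cleaned$PatientICN %in% flagged_ids), ]

print(nrow(measurement_level))
print(unique(person_level$PatientICN))
```

With excludeSubject = TRUE, patient 1 loses all three measurements because one of them was flagged, which is the person-level exclusion described by Goodrich et al.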
Algorithm in SAS Code
Example in R
The algorithm is designed to allow multiple time points and windows and thus includes, as special cases:
- Hoerster et al. 2014
- Littman et al. 2015
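How the time points and windows interact can be sketched for a single person in base R. This is only an illustration under stated assumptions: Janney2016.f is not shown here, so the rule used below (take the measurement closest to each target time, provided it lies within the window, boundaries included) is an assumption, and the helper pick_closest() and the data are hypothetical.

```r
# Hypothetical measurements for one person, in days since startPoint
days_since_start <- c(-12, 5, 150, 200, 300, 371)
weight           <- c(251, 250, 240, 238, 236, 230)

t       <- c(0, 182, 365)   # target time points, in days
windows <- c(30, 60, 60)    # half-width of the window around each target

# Hypothetical helper: for one target, return the weight measured closest to
# it among those inside the window (ASSUMED rule), or NA if none qualifies
pick_closest <- function(target, halfwidth) {
  dist      <- abs(days_since_start - target)
  in_window <- which(dist <= halfwidth)
  if (length(in_window) == 0) return(NA_real_)
  weight[in_window[which.min(dist[in_window])]]
}

selected <- mapply(pick_closest, t, windows)
print(selected)
```

Changing the lengths of t and windows together adds or removes collection points, which is what makes other study designs expressible with the same function.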
An example without excluding people due to outliers (excludeSubject is left at its default, FALSE):
Goodrich2016.df <- Goodrich2016.f(
  DF,
  id         = "PatientICN",
  measures   = "Weight",
  tmeasures  = "WeightDateTime",
  startPoint = "VisitDateTime"
)
Displaying a vignette of 16 selected patients with at least one weight observation removed.
Descriptive statistics by group
group   vars        n    mean     sd  median  trimmed    mad  min     max   range  skew  kurtosis    se
Input      1  1175995  207.82   48.6   202.3   204.62  44.18    0  1486.2  1486.2  0.98       5.6  0.04
Output     1   199786  206.13  45.42     201   203.17  41.51   80     500     420   0.8      1.42   0.1
The left boxplot shows the raw data from the sample of 2016 PCP visit subjects, while the right boxplot shows the output of Goodrich2016.f().