Janney 2016
Excerpt from published methods
Baseline, 6- and 12-mo body weight measures were retrieved for each MOVE! veteran from the VHA patient care databases. Baseline weight was measured within 30 d of MOVE! enrollment. The closest weight within a 60-d window of the follow-up target date (180 d for 6 mo and 365 d for 12 mo) was selected/defined as weight for 6- and 12-mo follow-ups. Weight was coded missing if not available at baseline, 6 mo, and/or 12 mo. Outliers were defined as baseline weight less than 91 lb or greater than 600 lb; 6- or 12-mo weight less than 72 lb or greater than 650 lb; weight change from baseline greater than 100 lb …
Translation in pseudocode
DEFINE time t_ij for person i IN 1:I, j weights IN 1:J {0mo., 6mo., 12mo.}
FOR i IN 1:I
weight_i1 := weight @ t_i1 +/- 30 days (baseline)
IF (weight_i1 < 91 lbs. OR weight_i1 > 600 lbs.)
weight_i1 := NA
FOR j IN 2:J
weight_ij := weight @ t_ij +/- 60 days
IF (weight_ij < 72 lbs. OR weight_ij > 650 lbs.)
weight_ij := NA
END FOR
END FOR
Algorithm in R Code
Though the Pseudocode is written with for loops, this is not a proper R programming technique since it’s very slow, especially given the amount of data this project involves. There are ways to speed up our code but we can try those options later.
#-------------------------------- set windows --------------------------------#
#' @title Windows.f Measurment Cleaning Algorithm
#' @param DF object of class `data.frame`, containing `id` and `measures`
#' @param id string corresponding to the name of the column of patient identifiers in `df`
#' @param measures string corresponding to the name of the column of measures in `df`, e.g., numeric weight data if using to clean weight data.
#' @param tmeasures string corresponding to the name of the column of measure dates and/or times in `df`
#' @param startPoint string corresponding to the name of the column in `df` holding the time at which subsequent measurement dates will be assessed, should be the same for each person. Eg., if t = 0 (t[1]) corresponds to an index visit held by the variable 'VisitDate', then `startPoint` should be set to 'VisitDate'
#' @param t numeric vector of time points to collect measurements, eg. `c(0, 182.5, 365)` for measure collection at t = 0, t = 180 (6 months from t = 0), and t = 365 (1 year from t = 0). Default is `c(0, 182.5, 365)` according to Janney et al. 2016
#' @param windows numeric vector of measurement collection windows to use around each time point in t. Eg. Janney et al. 2016 use `c(30, 60, 60)` for t of `c(0, 182.5, 365)`, implying that the closest measurement t = 0 will be collected 30 days prior to and 30 days post startPoint. Subsequent measurements will be collected 60 days prior to and 60 days post t0+182.5 days, and t0+365 days.
windows.f <- function(DF,
id,
measures,
tmeasures,
startPoint,
t = c(0, 182, 365),
windows = c(30, 60, 60)) {
if (!require(dplyr)) install.packages("dplyr")
if (!require(data.table)) install.packages("data.table")
if (!require(rlang)) install.packages("rlang")
tryCatch(
if (class(DF[[tmeasures]])[1] != class(DF[[startPoint]])[1]) {
stop(
print(
paste0("date type of tmeasures (",
class(DF[[tmeasures]]),
") != date type of startPoint (",
class(DF[[startPoint]])[1],
")"
)
)
)
}
)
tryCatch(
if (class(t) != "numeric") {
stop(
print("t parameter must be a numeric vector")
)
}
)
tryCatch(
if (class(windows) != "numeric") {
stop(
print("windows parameter must be a numeric vector")
)
}
)
tmeasures <- rlang::sym(tmeasures)
startPoint <- rlang::sym(startPoint)
DF <- DF %>%
mutate(
time = as.numeric(
difftime(
!!tmeasures, !!startPoint,
tz = "utc", units = "days"
)
)
)
# convert to data.table
DT <- data.table::as.data.table(DF)
setkeyv(DT, id)
# loop through each time point in `t`, place into list
meas_tn <- vector("list", length(t)) # set empty list
for (i in 1:length(t)) {
# find measurement closest to t[i]
x <- DT[DT[, .I[abs(t[i] - time) == min(abs(t[i] - time))],
by = eval(id)]$V1]
# apply time outlier window/filter
x <- x[time >= (t[i] - windows[i]) & time <= (t[i] + windows[i]), ]
# reconcile duplicates
x <- x[x[, .I[1], by = eval(id)]$V1] # take the first
# tidy up
meas_tn[[i]] <- x %>%
as.data.frame() %>%
mutate(measureTime = paste0("t_", t[i]))
}
id <- rlang::sym(id)
# return result
do.call(rbind, meas_tn) %>%
arrange(!!id, !!tmeasures)
}
#------------------------------ Remove Outliers ------------------------------#
#' @title outliers.f Measurment Cleaning Algorithm - removing outliers. To be applied following the result of a call to windows.f()
#' @param DF object of class data.frame, containing id and weights
#' @param measures string corresponding to the name of the column of measures in df, e.g., numeric weight data if using to clean weight data.
#' @param t numeric vector of time points to collect measurements, eg. c(0, 182.5, 365) for measure collection at t = 0, t = 180 (6 months from t = 0), and t = 365 (1 year from t = 0). Default is c(0, 182.5, 365) according to Janney et al. 2016
#' @param outliers object of type list with numeric inputs corresponding to the upper and lower bound for each time entry in parameter `t`. Default is list(LB = c(91, 72, 72), UB = c(600, 650, 650)) for t = c(0, 182.56, 365), differing between baseline and subsequent measurement collection dates
outliers.f <- function(DF,
measures,
t = c(0, 182.5, 365),
outliers = list(LB = c(91, 72, 72),
UB = c(600, 650, 650))) {
if (!require(dplyr)) install.packages("dplyr")
if (!require(data.table)) install.packages("data.table")
tryCatch(
if (class(DF[[measures]]) != "numeric") {
stop(
print("weight data must be a numeric vector")
)
}
)
tryCatch(
if (class(outliers) != "list" & !is.null(outliers)) {
stop(
print("outliers must be placed into a list object")
)
}
)
bounds.df <- data.frame(
t = paste0("t_", t),
LB = outliers[[1]],
UB = outliers[[2]],
stringsAsFactors = FALSE
)
a <- rlang::sym(measures)
meas_updated <- rlang::quo_name(paste0(measures, "_OR"))
DF <- DF %>%
left_join(bounds.df, by = c("measureTime" = "t")) %>%
mutate(!!meas_updated := ifelse(!!a < LB | !!a > UB, NA, !!a))
DF
}
#------------------------- Windows + Outlier Removal -------------------------#
#' @title Janney2016.f Weight Measurment Cleaning Algorithm
#' @param DF object of class data.frame, containing id and weights
#' @param id string corresponding to the name of the column of patient IDs in `DF`
#' @param measures string corresponding to the name of the column of measures in `DF`, e.g., numeric weight data if using to clean weight data
#' @param tmeasures string corresponding to the name of the column of measure dates and/or times in `DF`
#' @param startPoint string corresponding to the name of the column in `DF` holding the time at which subsequent measurement dates will be assessed, should be the same for each person. Eg., if t = 0 (t[1]) corresponds to an index visit held by the variable 'VisitDate', then startPoint should be set to 'VisitDate'
#' @param t numeric vector of time points to collect measurements, eg. c(0, 182.5, 365) for measure collection at t = 0, t = 180 (6 months from t = 0), and t = 365 (1 year from t = 0). Default is c(0, 182.5, 365) according to Janney et al. 2016
#' @param windows numeric vector of measurement collection windows to use around each time point in t. Eg. Janney et al. 2016 use c(30, 60, 60) for t of c(0, 182.5, 365), implying that the closest measurement t = 0 will be collected 30 days prior to and 30 days post startPoint. Subsequent measurements will be collected 60 days prior to and 60 days post t0+182.5 days, and t0+365 days
#' @param outliers optional. object of type list with numeric inputs corresponding to the upper and lower bound for each time entry in parameter `t`. Default is list(LB = c(91, 72, 72), UB = c(600, 650, 650)) for t = c(0, 182.56, 365), differing between baseline and subsequent measurment collection dates. If not specified then only the subsetting and window functions will be applied.
Janney2016.f <- function(DF,
id,
measures,
tmeasures,
startPoint,
t = c(0, 182.5, 365),
windows = c(30, 60, 60),
outliers = list(LB = c(91, 72, 72),
UB = c(600, 650, 650))) {
windowsApplied.df <- windows.f(DF = DF,
id = id,
measures = measures,
tmeasures = tmeasures,
startPoint = startPoint,
t = t,
windows = windows)
outliersRemoved.df <- outliers.f(DF = windowsApplied.df,
measures = measures,
t = t,
outliers = outliers)
# return fully cleaned data
outliersRemoved.df
}
Algorithm in SAS Code
/******************************** Set Windows *******************************/
/*
title: Windows: Weight Measurment Cleaning Algorithm;
param: df: table containing id and weights;
param: id: string corresponding to the name of the column of patient IDs in df
param: measures: string corresponding to the name of the column of measures in
df, e.g., numeric weight data if using to clean weight data.
param: tmeasures: string corresponding to the name of the column of measure
dates and/or times in df
param: startPoint: string corresponding to the name of the column in df holding
the time at which subsequent measurement dates will be assessed,
should be the same for each person. Eg., if t = 0 corresponds
to an index visit held by the variable 'VisitDate', then
startPoint should be set to 'VisitDate'
param: TWindows: table with time points to collect data and windows
around each window.
Columns should be:
column: t: numeric vector of time points to collect measurements,
e.g., c(0, 182.5, 365) for measure collection at t = 0,
t = 180 (6 months from t = 0), and t = 365
(1 year from t = 0). Default is c(0, 182.5, 365) according
to Janney et al. 2016
column: windows: numeric vector of measurement collection windows to use
around each time point in t. E.g., Janney et al. 2016
use (30, 60, 60) for t of (0, 182.5, 365), implying
that the closest measurement t = 0 will be collected
30 days prior to and 30 days post startPoint.
Subsequent measurements will be collected 60 days prior
to and 60 days post t0 + 182.5 days, and t0 + 365 days
*/
%MACRO windows(df = ,
id = ,
measures = ,
tmeasures = ,
startpoint = ,
TWindows = );
DATA dt;
SET &df;
time = (&tmeasures - &startPoint) / 86400;
RUN;
PROC SQL NOPRINT;
SELECT t INTO :timePoints SEPARATED BY ' '
FROM TWindows;
CREATE TABLE t1 AS
SELECT a.*,
b.windows,
%scan(&timePoints, 1, ' ') AS t FORMAT BEST.
FROM dt AS a, (
SELECT *
FROM TWindows
WHERE t = %scan(&timePoints, 1, ' ')
) AS b
GROUP BY &id
HAVING ABS(b.t - a.time) eq MIN(ABS(b.t - a.time))
ORDER BY &id, &tmeasures;
CREATE TABLE t2 AS
SELECT a.*,
b.windows,
%scan(&timePoints, 2, ' ') AS t FORMAT BEST.
FROM dt AS a, (
SELECT *
FROM TWindows
WHERE t = %scan(&timePoints, 2, ' ')
) AS b
GROUP BY &id
HAVING ABS(b.t - a.time) eq MIN(ABS(b.t - a.time))
ORDER BY &id, &tmeasures;
CREATE TABLE t3 AS
SELECT a.*,
b.windows,
%scan(&timePoints, 3, ' ') AS t FORMAT BEST.
FROM dt AS a, (
SELECT *
FROM TWindows
WHERE t = %scan(&timePoints, 3, ' ')
) AS b
GROUP BY &id
HAVING ABS(b.t - a.time) eq MIN(ABS(b.t - a.time))
ORDER BY &id, &tmeasures;
QUIT;
DATA WindowsSet;
SET t1 t2 t3;
IF time >= (t - windows) AND time <= (time + windows);
RUN;
PROC SORT DATA = WindowsSet; by PatientICN t; RUN;
PROC DATASETS nolist; DELETE T1 T2 T3 Dt; QUIT;
%MEND windows;
/****************************** Remove Outliers *****************************/
/*
title: Outliers: Measurment Cleaning Algorithm - to be applied following a call
to %WINDOWS
param: df: table containing id and weights
param: measures: string corresponding to the name of the column of measures in
df, e.g., numeric weight data if using to clean weight data.
param: TOutliers: table with time points to collect data and outliers for each
time point.
Columns should be:
column: t: numeric vector of time points to collect measurements,
e.g., c(0, 182.5, 365) for measure collection at t = 0,
t = 180 (6 months from t = 0), and t = 365
(1 year from t = 0). Default is c(0, 182.5, 365) according
to Janney et al. 2016
column: windows: numeric vector of measurement collection windows to use
around each time point in t. E.g., Janney et al. 2016
use (30, 60, 60) for t of (0, 182.5, 365), implying
that the closest measurement t = 0 will be collected
30 days prior to and 30 days post startPoint.
Subsequent measurements will be collected 60 days prior
to and 60 days post t0 + 182.5 days, and t0 + 365 days
column: upper: upper bound of plausible measurements, one for each time
point t.
column: lower: lower bound of plausible measurements.
*/
%MACRO outliers(df = , measures = , TOutliers = );
PROC SQL;
CREATE TABLE OutliersRemoved AS
SELECT a.*, b.lower, b.upper
FROM &df AS a JOIN &TOutliers AS b
ON a.t = b.t
WHERE &measures >= lower AND &measures <= upper;
QUIT;
%MEND outliers;
/*********************** Set Windows + Remove Outliers **********************/
/*
title: Janney2016 Weight Measurment Cleaning Algorithm;
param: df: table containing id and weights;
param: id: string corresponding to the name of the column of patient IDs in df
param: measures: string corresponding to the name of the column of measures in
df, e.g., numeric weight data if using to clean weight data.
param: tmeasures: string corresponding to the name of the column of measure
dates and/or times in df
param: startPoint: string corresponding to the name of the column in df holding
the time at which subsequent measurement dates will be assessed,
should be the same for each person. Eg., if t = 0 corresponds
to an index visit held by the variable 'VisitDate', then
startPoint should be set to 'VisitDate'
param: table: table with time points to collect data, windows around each time
point, and lower and upper bounds for outliers at each time.
Columns should be:
column: t: numeric vector of time points to collect measurements,
e.g., c(0, 182.5, 365) for measure collection at t = 0,
t = 180 (6 months from t = 0), and t = 365
(1 year from t = 0). Default is c(0, 182.5, 365) according
to Janney et al. 2016
column: windows: numeric vector of measurement collection windows to use
around each time point in t. E.g., Janney et al. 2016
use (30, 60, 60) for t of (0, 182.5, 365), implying
that the closest measurement t = 0 will be collected
30 days prior to and 30 days post startPoint.
Subsequent measurements will be collected 60 days prior
to and 60 days post t0 + 182.5 days, and t0 + 365 days
column: upper: upper bound of plausible measurements, one for each time
point t.
column: lower: lower bound of plausible measurements.
*/
%MACRO Janney2016(df = ,
id = ,
measures = ,
tmeasures = ,
startPoint = ,
table = );
%windows(df = &df.,
id = &id.,
measures = &measures.,
tmeasures = &tmeasures.,
startPoint = &startPoint.,
TWindows = &table.);
%outliers(df = WindowsSet,
measures = &measures.,
TOutliers = &table.);
%MEND WindowCleaner;
Example in R
The way I designed this algorithm allows for multiple time points and windows and thus, includes as special cases,
- Baker et al. 2015
- Braun et al. 2016
- Janney et al. 2016
- Garvin et al. 2015
- Garvin, Marion 2015
- Garvin, Hardy 2013
- Grabarczyk et al. 2017
- Huizinga et al 2010
- Kahwati et al. 2011
- Romanova et al. 2013
- Xiao et al. 2017
Each of these differing either in their outlier bounds or selection of time points:
Algorithm/Publication | Time Points | Windows | Outliers |
---|---|---|---|
Baker et al. 2015 | Arbitrary Visits | +/- 14 days | N/A |
Braun et al. 2016 | 0mo., 12mo. | \(t_12\): +/- 60 days | N/A |
Janney et al. 2016 | 0mo., 6mo., 12mo. | \(t_0\): +/- 30 days | (91, 600) |
\(t_{6mo., 12mo.}\): +/- 60 days | (72, 650) | ||
Garvin et al. 2015 | Collect All | N/A | N/A |
Garvin & Marion 2015 | Collect All | N/A | N/A |
Garvin & Hardy 2016 | 0wk., 2wk., 4wk., 26wk., 52wk., 104wk. | N/A | N/A |
Grabarczyk et al. 2017 | 0wk., 12wk., 24wk., 36wk. | N/A | N/A |
Huizinga et al. 2010 | 0mo., 12mo. | \(t_12\): +/- 90 days | (50, 700) |
Kahwati et al. 2011 | 0mo., 6mo. | \(t_0\): +/- 30 days | N/A |
\(t_{6mo.}\): +/- 60 days | N/A | ||
Romanova et al. 2013 | -1yr, 0, 0.25yr, 0.5yr, 1yr, 2yr, 3yr | \(t_{-1yr, 1yr, 2yr, 3yr}\): +/- 3mo. | N/A |
\(t_{0.25, 0.5}\): +/- 1mo. | N/A | ||
Xiao et al. 2017 | -1yr, 0yr, 0.25yr, 2yr | \(t_0\): +/- 14 days | N/A |
\(t_{-1yr, 2yr}\): +/- 90 days | N/A | ||
\(t_{0.25yr}\): +/- 30 days | N/A |
Run algorithm on our sample of weights weightSamples. For simplicity, we will subset the data to include only patients from 2016.
# subset data for sample
DF <- weightSamples %>%
filter(SampleYear == "2016" & !is.na(Weight)) %>%
distinct(PatientICN, WeightDate, Weight, .keep_all = TRUE)
# function call
janney2016.df <- Janney2016.f(
DF = DF,
id = "PatientICN",
measures = "Weight",
tmeasures = "WeightDateTime",
startPoint = "VisitDateTime"
)
Displaying a Vignette of 16 selected patients with at least 1 weight observation removed.
Distribution of Weight Measures between Raw & Algorithm-Processed values
Descriptive statistics by group
group: Input
vars n mean sd median trimmed mad min max range skew
X1 1 1175995 207.82 48.6 202.3 204.62 44.18 0 1486.2 1486.2 0.98
kurtosis se
X1 5.6 0.04
------------------------------------------------------------
group: Output
vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 199830 206.15 45.46 201 203.17 41.51 78.4 546 467.6 0.81 1.49
se
X1 0.1
Left boxplot is raw data from the sample of 2016, PCP visit subjects while the right boxplot describes the output from running windowCleaner.f()