Littman 2012
Excerpt from published methods
For weight, height, and body mass index, we first removed biologically implausible values (weight <75 lb or >600 lb, height <49 in or >94 in, and body mass index >80 kg/m ). Next, we applied algorithms to identify measures that were plausible but appeared to be erroneous on the basis of a review of all recorded weights and heights during the relevant time period. After reviewing records that had large standard deviations (SDs) (explained in more detail below), we used the algorithm that follows to exclude values that were likely erroneous while keeping values that were plausible. We excluded any weight measurements that met the following 2 criteria: 1) the difference between the mean weight and weight in question was greater than the SD and 2) the SD was greater than 10% of the mean. For example, 1 participant’s weight in pounds was recorded as 300 and 160 lb, both measured on December 7, 2005, 310 lb measured on June 12, 2006, 276 lb measured on August 8, 2006, 291 lb measured on August 15, 2006, and 291 lb measured on September 13, 2007, resulting in mean (SD) of 271.3 (55.7) lb. The weight of 160 lb recorded on December 7, 2005, was considered erroneous and dropped because the difference between the index weight and mean weight was greater than the SD (271 − 160 = 113.3 lb) and the SD was greater than 10% of the mean of all weights ([55.7/271.3] × 100 = 20.5%).
Translation in Pseudocode
DEFINE person i IN 1:I, j date IN 1:J, weight k IN 1:K (additional recorded weight, same day)
FOR i IN 1:I
FOR j IN 1:J
FOR k IN 1:K
weight_ijk := weight for ith person on jth date and kth weight
END FOR
END FOR
IF (weight_ijk < 75 lbs OR weight_ijk > 600 lbs)
weight_ijk := NA
MEAN_i := MEAN({weight_i}_jk)
SD_i := SD({weight_i}_jk)
IF (ABS(MEAN_i - weight_ijk) > SD_i & SD_i > 0.10 * MEAN_i)
weight_ijk := NA
END FOR
Algorithm in R Code
This algorithm below assumes all weight data has already been collected, in contrast with the pseudocode.
#' @title Littman 2012 Measurment Cleaning Algorithm
#' @param DF object of class data.frame, containing `id` and `measures`
#' @param id string corresponding to the name of the column of patient identifiers in `df`
#' @param measures string corresponding to the name of the column of measurements in `df`
#' @param tmeasures string corresponding to the name of the column of measurement collection dates or times in `df`. If `tmeasures` is a date object, there may be more than one weight on the same day, if it precise datetime object, there may not be more than one weight on the same day
#' @param outliers object of type `list` with numeric inputs corresponding to the upper and lower bound for each time entry. Default is `list(LB = c(75), UB = c(600))`
#' @param SDthreshold numeric scalar to be multiplied by the `meanMeasures` per `id`. E.g., from Littman 2012, "...We excluded any weight measurements that met the following 2 criteria: 1) the difference between the mean weight and weight in question was greater than the SD and 2) the SD was greater than 10% of the mean...." implies `SDthreshold`= 0.10
#' @return input data.frame with additional columns: `InputMeasurement`, the original weight data; `OutputMeasurement`, algorithm output; `meanWeight`, mean(weights) per `id`; `SDWeight`, SD(weights) per `id`; `SD_threshold_`, Mean(weights) * `SDthreshold`
Littman2012.f <- function(DF,
id,
measures,
tmeasures,
outliers = list(LB = c(75), UB = c(600)),
SDthreshold = 0.10) {
if (!require(dplyr)) install.packages("dplyr")
if (!require(data.table)) install.packages("data.table")
tryCatch(
if (!is.numeric(DF[[measures]])) {
stop(
print("measure data must be a numeric vector")
)
}
)
tryCatch(
if (!is.list(outliers)) {
stop(
print("outliers must be placed into a list object")
)
}
)
# Compute the mean and sd for each persons weight, if more than 1 value
# if not more than 1 value, mean is finite, sd is undefined
# convert to data.table
DT <- data.table::as.data.table(DF)
setkeyv(DT, id)
# first set outliers to NA
DT[,
measures_aug_ := ifelse(get(measures) < outliers[[1]]
| get(measures) > outliers[[2]],
NA,
get(measures))
]
# calc mean of measures per group
DT[, meanMeasures := mean(measures_aug_, na.rm = TRUE), by = id]
# calc SD of weight per group
DT[, SDMeasures := sd(measures_aug_, na.rm = TRUE), by = id]
# calc SD threshold
DT[, SD_threshold_ := SDthreshold * meanMeasures, by = id]
# exclude any measurements that meet the following 2 criteria:
# 1) the difference between the meanMeasures and measures_aug_ in
# question is greater than the SDMeasures
# AND
# 2) the SDMeasures was greater than SD_threshold_ of the mean
DT[, cond1 := ifelse(abs(measures_aug_ - meanMeasures) > SDMeasures, T, F)]
DT[, cond2 := ifelse(SDMeasures > SD_threshold_, T, F)]
DT[, measures_aug_ := ifelse((cond1 & cond2), NA, measures_aug_)]
# return augmented measurements
DF <- as.data.frame(DT)
names(DF)[names(DF) == measures] <- "InputMeasurement"
names(DF)[names(DF) == "measures_aug_"] <- "OutputMeasurement"
DF
}
Algorithm in SAS Code
/*
title: Littman 2012 Weight Measurment Algorithm
param: df: table containing id and weights
param: id: string corresponding to the name of the column of patient IDs in df
param: weights: string corresponding to the name of the column of weights in df
param: tweights: string corresponding to the name of the column of weight dates
or times in df. If tweights is a date object, there may be more than one
weight on the same day, if it precise datetime object, there may not be
more than one weight on the same day.
param: outliers: numeric inputs corresponding to the upper and lower bound
param: SDthreshold: numeric scalar to be multiplied by the `meanWeight`
per `id`. E.g., from Littman 2012, "...We excluded any weight
measurements that met the following 2 criteria: 1) the difference
between the mean weight and weight in question was greater than the SD
and 2) the SD was greater than 10% of the mean...." implies
`SDthreshold`= 0.10
*/
%MACRO Littman2012(df = ,
id = ,
measures = ,
tmeasures = ,
outliers = ,
SDthreshold = );
* Step 1: set outliers to NA;
DATA dt;
SET &df;
IF &measures < %scan(&outliers, 1, ' ')
OR &measures > %scan(&outliers, 2, ' ')
THEN measures_aug_ = .;
ELSE measures_aug_ = &measures;
RUN;
* Step 2: compute mean, sd, SD threshold of measures per group id;
PROC MEANS DATA = dt NOPRINT NWAY;
CLASS &id;
VAR measures_aug_;
OUTPUT OUT = dt2 (DROP = _TYPE_ _FREQ_)
MEAN = meanMeasures
STDDEV = SDMeasures;
RUN;
PROC SQL;
CREATE TABLE descByGroup AS
SELECT a.*,
b.meanMeasures,
b.SDMeasures,
(&SDthreshold * b.meanMeasures) AS SD_threshold_
FROM dt AS a INNER JOIN dt2 AS b ON a.&id = b.&id;
QUIT;
/*
Step 3: exclude any measurements that meet the following 2
criteria: 1) the difference between the meanMeasures
and measures_aug_ in question is greater than the SDMeasures
AND 2) the SDMeasures is greater than the SD_threshold_
*/
DATA outputData;
SET descByGroup;
IF (ABS(measures_aug_ - meanMeasures) > SDMeasures)
THEN cond1 = 1;
ELSE cond1 = 0;
IF (SDMeasures > SD_threshold_)
THEN cond2 = 1;
ELSE cond2 = 0;
IF (cond1 = 1 AND cond2 = 1)
THEN measures_aug_ = .;
RENAME measures_aug_ = outputMeasurement;
RUN;
%MEND Littman2012;
Example in R
Using a small data frame with the example given in the supplemental materials to Littman 2012:
PatientICN | Weight | WeightDate |
---|---|---|
1010528308 | 300 | 2005-12-07 |
1010528308 | 160 | 2005-12-07 |
1010528308 | 310 | 2006-06-12 |
1010528308 | 276 | 2006-08-08 |
1010528308 | 291 | 2006-08-15 |
1010528308 | 291 | 2007-09-01 |
Then applying the algorithm with the default settings
PatientICN | InputMeasurement | WeightDate | OutputMeasurement | meanMeasures | SDMeasures | SD_threshold_ | cond1 | cond2 |
---|---|---|---|---|---|---|---|---|
1010528308 | 300 | 2005-12-07 | 300 | 271.33 | 55.69 | 27.13 | FALSE | TRUE |
1010528308 | 160 | 2005-12-07 | NA | 271.33 | 55.69 | 27.13 | TRUE | TRUE |
1010528308 | 310 | 2006-06-12 | 310 | 271.33 | 55.69 | 27.13 | FALSE | TRUE |
1010528308 | 276 | 2006-08-08 | 276 | 271.33 | 55.69 | 27.13 | FALSE | TRUE |
1010528308 | 291 | 2006-08-15 | 291 | 271.33 | 55.69 | 27.13 | FALSE | TRUE |
1010528308 | 291 | 2007-09-01 | 291 | 271.33 | 55.69 | 27.13 | FALSE | TRUE |
Next example will run on the sample
Littman2012.df <- Littman2012.f(DF,
id = "PatientICN",
measures = "Weight",
tmeasures = "WeightDateTime")
Displaying a Vignette of 16 selected patients with at least 1 weight observation removed.
Distribution of raw weight data versus algorithm processed data
Descriptive statistics by group
group: Input
vars n mean sd median trimmed mad min max range skew
X1 1 1175995 207.82 48.6 202.3 204.62 44.18 0 1486.2 1486.2 0.98
kurtosis se
X1 5.6 0.04
------------------------------------------------------------
group: Output
vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 1161661 207.97 48.14 202.6 204.77 44.03 75.2 546 470.8 0.81 1.41
se
X1 0.04
Left boxplot is raw data from the sample of 2016, PCP visit subjects while the right boxplot describes the output from running Littman2012.f()