## Buta 2018

Probably the simplest, and probably the silliest, method.

### Excerpt from published methods

For each participant, we used VA electronic health record (EHR) data to obtain all BMI measurements available from the date of their first medical visit through September 30, 2010. Because we were interested in BMI change over time, we excluded participants who had no or only one BMI measurement available (260,193 participants left) … BMI was computed based on height and weight data routinely collected and recorded in EMR records during VA clinical visits. We removed from analyses a small percentage (0.03%) of biologically implausible BMI values (BMI<11 or BMI>70).

The fundamental aspect of this algorithm is the removal of veterans from the cohort if they have $$\leq 1$$ BMI measurement available.

Arterburn et al. 2013 use the same algorithm, but only for a single time point.
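As a minimal sketch of the exclusion rule, the base R snippet below applies it to a toy data frame. The IDs and BMI values are made up for illustration, and filtering implausible values before counting measurements is an assumption about ordering rather than something the published methods spell out.

```r
# Toy data: hypothetical IDs and BMI values, not taken from the study
toy <- data.frame(
  PatientICN = c("A", "A", "A", "B", "C", "C", "D", "D"),
  BMI        = c(27.1, 27.8, 28.0, 31.2, 9.5, 24.3, NA, 22.0)
)

# Drop missing and biologically implausible values (BMI < 11 or BMI > 70)
plausible <- toy[!is.na(toy$BMI) & toy$BMI >= 11 & toy$BMI <= 70, ]

# Count the remaining measurements per person
k <- table(plausible$PatientICN)

# Keep only people with more than one remaining measurement
keep_ids <- names(k)[k > 1]
cleaned  <- plausible[plausible$PatientICN %in% keep_ids, ]
```

In this toy cohort only person A survives: B has a single measurement, while C and D each drop to one measurement once the implausible and missing values are removed.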

### Translation in Pseudocode

DEFINE time t_ij for person i IN 1:I, BMI measurement j IN 1:J
FOR i IN 1:I
FOR j IN 1:J
BMI_ij := BMI @ t_ij
END FOR
Let k = Number of Non-missing j BMI for person i
IF (k <= 1)
EXCLUDE person i
END IF
END FOR

### Algorithm in R Code

#' @title Buta et al. 2018 Measurement Cleaning Algorithm
#' @param DF object of class data.frame, containing id and measures
#' @param id string corresponding to the name of the column of patient identifiers in DF
#' @param measures string corresponding to the name of the column of measurements in DF
#' @param tmeasures string corresponding to the name of the column of measurement collection dates or times in DF. If tmeasures is a date object, there may be more than one weight on the same day; if it is a precise datetime object, there may not be more than one weight on the same day
#' @param outliers numeric vector corresponding to the lower and upper bound of measure for each time entry. Default is c(11, 70) for BMI measurements according to "Buta et al. 2018".
Buta2018.f <- function(DF,
id,
measures,
tmeasures,
outliers = c(11, 70)) {

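# data.table must be installed; fail early with a clear message if it is not
if (!requireNamespace("data.table", quietly = TRUE)) {
stop("package 'data.table' is required; install it with install.packages(\"data.table\")")
}

# validate inputs
if (!is.numeric(DF[[measures]])) {
stop("measure data must be a numeric vector")
}

if (!is.numeric(outliers) || length(outliers) != 2) {
stop("outliers must be a numeric vector of length 2: c(lower, upper)")
}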

# remove NA measures
DF <- DF[!is.na(DF[[measures]]), ]

# Remove Outlier Measures
DF <- DF[DF[[measures]] >= outliers[1] & DF[[measures]] <= outliers[2], ]

# convert to data.table
DT <- data.table::as.data.table(DF)
data.table::setkeyv(DT, id)

# Set Order
data.table::setorderv(DT, c(id, tmeasures), c(1, 1))

# count measurements per person (k); people with k <= 1 are removed below
DT[, k := .N, by = c(id)]
DT <- DT[k > 1, -c("k")]
DT
}

### Algorithm in SAS Code

#TODO

### Example in R

This algorithm needs BMI, so height values must be added first. The consensus is to use the mode of height for each person to calculate BMI.
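The code in this example relies on a `ModeOfHeight` table whose construction is not shown in this section. A minimal base R sketch of how such a table might be built is given below. The column names `PatientICN`, `Height`, and `freq` mirror those used in this section, while the toy records, the `heights`, `weights`, and `DF_toy` objects, and the tie-breaking behaviour are assumptions made for illustration. Weight is assumed to be in pounds and height in inches, which is what the factor 703 in the BMI formula implies.

```r
# Hypothetical height records in inches (illustrative values only)
heights <- data.frame(
  PatientICN = c("A", "A", "A", "B", "B", "C"),
  Height     = c(70, 70, 71, 64, 64, 68)
)

# Count how often each distinct height occurs per person
height_counts <- aggregate(
  freq ~ PatientICN + Height,
  data = transform(heights, freq = 1L),
  FUN  = sum
)

# Keep the most frequent height per person; ties are resolved arbitrarily here
height_counts <- height_counts[order(height_counts$PatientICN, -height_counts$freq), ]
ModeOfHeight  <- height_counts[!duplicated(height_counts$PatientICN), ]

# Hypothetical weights in pounds, joined to the modal height to compute BMI
weights <- data.frame(
  PatientICN = c("A", "B", "C"),
  Weight     = c(180, 150, 200)
)
DF_toy <- merge(weights, ModeOfHeight[, c("PatientICN", "Height")],
                by = "PatientICN", all.x = TRUE)
DF_toy$BMI <- 703 * DF_toy$Weight / DF_toy$Height^2
```

Using the modal height rather than the height recorded at each visit keeps a single mistyped height from distorting one BMI value, at the cost of ignoring genuine height change over time.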

# Attach Height and compute BMI
DF <- DF %>%
left_join(
ModeOfHeight %>%
select(-freq),
by = "PatientICN"
) %>%
mutate(BMI = 703 * Weight / (Height ^ 2))
Buta2018.df <- Buta2018.f(DF,
id = "PatientICN",
measures = "BMI",
tmeasures = "WeightDateTime")

Distribution of Weight Measurements between Raw and Algorithm-Processed Values


Descriptive statistics by group

| group | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Input | 1 | 1175995 | 207.82 | 48.6 | 202.3 | 204.62 | 44.18 | 0 | 1486.2 | 1486.2 | 0.98 | 5.6 | 0.04 |
| Output | 1 | 1131996 | 207.91 | 48.29 | 202.6 | 204.73 | 44.18 | 60 | 540 | 480 | 0.8 | 1.38 | 0.05 |

We won’t show a vignette for Buta et al. 2018, as it simply excludes people, keeping only those with more than one measurement.

The left boxplot shows the raw data from 2016 PCP visit subjects, while the right boxplot shows the output from running Buta2018.f().