Maciejewski 2016
Excerpt from published methods
Among the [cases], approximately 3.5% of weights were measured on the same day. If the standard deviation of the same-day weights was less than or equal to 2 lb, then the mean was taken. Otherwise, the standard deviation of each same-day weight with prior/post weight measurements was calculated and the same-day weight leading to the smallest standard deviation was retained. After sorting weight measures by date for each individual, rolling standard deviations were calculated using consecutive groups of three weight measures for each individual. The first group consisted of weight measures 1-3, the second 2-4, and so forth. The first and last groups were evaluated separately because the first (and last) weight measure could only be included in one group and the second (and next to last) could only be included in two groups. If the first (or last) two groups’ standard deviations were greater than 35 lb, the second (or next to last) weight measure was deleted. If the first (or last) standard deviation was greater than 35 lb, then paired standard deviations were calculated for each pair within the first (or last) group. If two of the paired standard deviations were greater than 45 lb and the remaining was less than 10 lb, the offending weight measure was deleted. After these deletions, the weight measures were reassembled in date order for each individual and rolling standard deviations were recalculated and assigned to the central weight measure of the group. Clusters of high standard deviations, indicating a potential outlier, were identified by flagging consecutive standard deviations greater than 10 lb. For each cluster of high standard deviations of three or more the interior weight measures were deleted keeping only the first and last measures of the cluster. Approximately 1.2% of weight measurements were identified as outliers and were deleted. Standard deviation cutoffs were determined via iterative trial and error driven by clinical plausibility of the specific measure rather than a standard rule and with guidance from clinical practitioners familiar with context of surgery and expected outcomes. Before the same-day and outlier cleaning, the cohort … had 89,757 measurements; after these cleaning steps, 85,556 (95.3%) weight measurements remained.
There is an extra step not mentioned in the methods but detailed elsewhere: all weights must first be between 50 and 700 lb (inclusive); then different weights recorded on the same day are handled as described above.
Batch et al. (2018) used this same algorithm in their paper Outcome by Gender in the Veterans Health Administration Motivating Overweight/Obese Veterans Everywhere Weight Management Program.
Translation in pseudocode
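Restated from the published methods and the pre-processing note above:
1. Keep only weight values between 50 and 700 lb (inclusive); drop missing values.
2. For each patient and calendar date with more than one recorded weight: if the standard deviation of the same-day weights is less than or equal to 2 lb, replace them with their mean; otherwise keep the single same-day weight that yields the smallest standard deviation when combined with the prior/post measurements.
3. Sort each patient's remaining weights by date and compute rolling standard deviations over consecutive groups of three measures (measures 1-3, 2-4, and so on), assigning each SD to the central measure of its group.
4. Handle the ends of each series separately: if the first (or last) two rolling SDs are both greater than 35 lb, delete the second (or next-to-last) measure; if only the first (or last) rolling SD is greater than 35 lb, compute the pairwise SDs within that group of three and delete the offending measure when two of the pairwise SDs are greater than 45 lb and the remaining one is less than 10 lb.
5. Reassemble the remaining weights in date order, recompute the rolling SDs, and flag runs of consecutive high SDs (greater than 10 lb in the published methods; the %dolag macro below flags SDs greater than 20 lb). For each cluster of three or more consecutive high SDs, delete the interior measures, keeping only the first and last measures of the cluster.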
Algorithm in R Code
#' @title Maciejewski 2016 Weight Measurement Algorithm
#' @param DF object of class data.frame, containing id, weights, and timepoints
#' @param id string corresponding to the name of the column of patient IDs in DF
#' @param weights string corresponding to the name of the column of weights in DF
#' @param timepoints string corresponding to the name of the column of measurement dates in DF
Maciejewski2016.f <- function(DF, id, weights, timepoints) {
# TODO: skip this? Why reinvent the wheel.
}
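Pending a decision on whether a full R port is worth the effort, the pre-processing (steps 1-2 in the outline above: the 50-700 lb bounds filter plus the same-day rule) can be sketched as follows. This is a rough sketch only, assuming a data.frame with patient-ID, date (Date class), and numeric weight columns; the helper name, its defaults, and the nearest-prior/post handling are illustrative and not part of the original code.
# Sketch of the pre-processing only (bounds filter + same-day rule); this is
# NOT the full Maciejewski algorithm, and column names are illustrative.
cleanSameDay.f <- function(df, id = "PatientICN", dates = "WeightDate",
                           weights = "Weight") {
  w  <- df[[weights]]
  df <- df[!is.na(w) & w >= 50 & w <= 700, ]            # keep 50-700 lb, inclusive
  df <- df[order(df[[id]], df[[dates]]), ]
  per_patient <- lapply(split(df, df[[id]]), function(p) {
    for (d in unique(p[[dates]][duplicated(p[[dates]])])) {
      same <- which(p[[dates]] == d)
      if (sd(p[[weights]][same]) <= 2) {
        keep <- same[1]
        p[[weights]][keep] <- mean(p[[weights]][same])   # SD <= 2 lb: keep the mean
      } else {
        # otherwise keep the same-day value with the smallest SD when combined
        # with the nearest prior and post measurements (one reading of the rule)
        prior <- tail(p[[weights]][p[[dates]] < d], 1)
        post  <- head(p[[weights]][p[[dates]] > d], 1)
        if (length(c(prior, post)) == 0) {
          keep <- same[1]
        } else {
          sds  <- vapply(p[[weights]][same],
                         function(x) sd(c(x, prior, post)), numeric(1))
          keep <- same[which.min(sds)]
        }
      }
      p <- p[-setdiff(same, keep), ]
    }
    p
  })
  do.call(rbind, per_patient)
}
The rolling-SD, end-of-series, and cluster rules (steps 3-5) are implemented only in the SAS macros below.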
Algorithm in SAS Code
In an e-mail dated 9/17/2018 12:43 PM
…We came up with this idea of looking at rolling standard deviations when needing to clean weight data from bariatric surgery patients that had huge amounts of variability over time. For the macros, you need to run the code that defines both the %dolag and the %prepost macros but you will only ever need to explicitly use the %prepost macro. (The %prepost macro calls the %dolag macro.) [Redacted] modified the weight cleaning macros from their original versions to accept a few parameters instead of relying on the input data set having variables with certain names. And, note that the fixed numerical values were what we have found to work well in our bariatric surgery cohorts, they may need to be refined for other types of cohorts. In the LMHO study ([Redacted]’s current bariatric surgery study), [Redacted] executed the weight cleaning algorithm macros by running the following SAS code (after first defining the macros in my SAS session): %prepost(DSN = matches, IDVar = PERSON_ID, DateVar = WeightRecordedDate, AnalysisVar = Weight); The code to plot the weight trajectories was run in the same SAS session as (but after the successful execution of) the weight cleaning macros. This is because the DATA Steps that create the plotting data set rely on some data sets that the weight cleaning macros create internally…
/********************** pre-processing ***************************************
two steps before applying the below algorithm:
1. weights must be between 50 and 700 lbs.
2. Same day weights must be processed according to the published methods:
"...If the standard deviation of the same-day weights was less than or
equal to 2 lb, then the mean was taken. Otherwise, the standard
deviation of each same-day weight with prior/post weight measurements
was calculated and the same-day weight leading to the smallest standard
deviation was retained."
*****************************************************************************/
/***************************** SameDayMeasures() *****************************
* `&DSN.` is just the name of the data set that contains the data to be
cleaned, in "long" format (i.e., one record per date per patient).
* `&IDVar.` is the name of the variable that uniquely identifies each
patient in the data set (e.g., ScrSSN, PatientICN, PERSON_ID, StudyID).
* `&DateVar.` is the name of the variable that contains the date on which
the value of &ANALYSISVAR. was recorded, i.e., the date of the weight
measurement record. NB: All of the following code was used with data
sets where there was only ever one weight measurement record per
calendar date per patient.
* `&AnalysisVar.` is the name of the numeric variable in &DSN. that these
macros are supposed to "clean". In this case, it should be the name of
the variable that contains weight measurement values.
* `&clearSpace.` if the value is "TRUE" then all intermediate data sets
are removed from the WORK library. Passing any other string retains the
intermediate data sets. Default is "TRUE".
*****************************************************************************/
%MACRO SameDayMeasures(DSN = ,
IDVar = ,
DateVar = ,
AnalysisVar = ,
clearSpace = "TRUE");
/* Remove outliers and missing values first */
DATA df_step1;
SET &DSN;
IF &AnalysisVar < 50 OR &AnalysisVar > 700 THEN &AnalysisVar = .;
IF &AnalysisVar = . THEN DELETE;
KEEP &IDVar &DateVar &AnalysisVar;
RUN;
/* Identify same day weights */
PROC SQL;
CREATE TABLE df_step2 AS
SELECT DISTINCT *
FROM df_step1;
CREATE TABLE SameDayAnalysisVar AS
SELECT a.&IDVar,
a.&DateVar,
a.&AnalysisVar,
b.Cnt
FROM df_step2 AS a
LEFT JOIN (
SELECT &IDVar, &DateVar, COUNT(*) AS Cnt
FROM df_step2
GROUP BY &IDVar, &DateVar
HAVING COUNT(*) > 1
) AS b
ON a.&IDVar = b.&IDVar
AND
a.&DateVar = b.&DateVar
WHERE Cnt ne .
ORDER BY &IDVar, &DateVar;
QUIT;
* enumerate AnalysisVar by IDVar and DateVar;
DATA SameDayAnalysisVar;
SET SameDayAnalysisVar;
BY &IDVar &DateVar;
IF first.&IDVar OR first.&DateVar THEN Count = 0;
Count + 1;
DROP Cnt;
RUN;
PROC MEANS DATA = SameDayAnalysisVar NOPRINT NWAY;
CLASS &IDVar &DateVar;
VAR &AnalysisVar;
OUTPUT OUT = AnalysisVarStats (DROP = _TYPE_ _FREQ_)
MEAN = Mean_AnalysisVar
STD = SD_AnalysisVar;
RUN;
* AnalysisVar2SD: of the same-day measures,
those with SD <= 2, we keep the mean;
* AnalysisVarFlagged: of the same-day measures,
those with SD > 2, must be processed further;
DATA AnalysisVar2SD (DROP = SD_AnalysisVar)
AnalysisVarFlagged;
SET AnalysisVarStats;
IF SD_AnalysisVar <= 2 THEN OUTPUT AnalysisVar2SD;
ELSE OUTPUT AnalysisVarFlagged;
RUN;
PROC SORT DATA = AnalysisVar2SD; BY &IDVar &DateVar; RUN;
PROC SORT DATA = df_step2; BY &IDVar &DateVar; RUN;
DATA df_step3;
MERGE df_step2 AnalysisVar2SD;
BY &IDVar &DateVar;
IF Mean_AnalysisVar ne . THEN &AnalysisVar = Mean_AnalysisVar;
DROP Mean_AnalysisVar;
RUN;
PROC SQL;
CREATE TABLE MeanFixed AS
SELECT DISTINCT *
FROM df_step3
ORDER BY &IDVar, &DateVar;
QUIT;
/*
Calculate the standard deviation of each same-day weight with prior/post
weight measurements, retain the same-day weight leading to the smallest
standard deviation
*/
PROC SQL;
CREATE TABLE AnalysisVarFlagged2 AS
SELECT a.*
FROM MeanFixed AS a
RIGHT JOIN (
SELECT DISTINCT &IDVar
FROM AnalysisVarFlagged
) AS b
ON a.&IDVar = b.&IDVar;
QUIT;
PROC SORT DATA = AnalysisVarFlagged2; BY &IDVar &DateVar; RUN;
DATA AnalysisVarFlagged2;
SET AnalysisVarFlagged2;
BY &IDVar &DateVar;
IF first.&IDVar OR first.&DateVar THEN Cnt = 0;
Cnt + 1;
RUN;
PROC TRANSPOSE DATA = AnalysisVarFlagged2
OUT = AnalysisVarWide (KEEP = &IDVar &DateVar &AnalysisVar:)
PREFIX = &AnalysisVar;
BY &IDVar &DateVar;
ID Cnt;
VAR &AnalysisVar;
RUN;
PROC SQL noprint;
CREATE TABLE AnalysisVarFlagged3 AS
SELECT a.*,
(a.&DateVar - b.&DateVar) AS Diff
FROM AnalysisVarWide AS a
LEFT JOIN AnalysisVarFlagged AS b
ON a.&IDVar = b.&IDVar
ORDER BY a.&IDVar , a.&DateVar;
SELECT name
INTO :list SEPARATED BY ' '
FROM dictionary.columns
WHERE LIBNAME = 'WORK' AND MEMNAME = 'ANALYSISVARFLAGGED3';
CREATE TABLE backward AS
SELECT &IDVar,
%SCAN(&list, 3, ' ') AS Before,
Diff
FROM AnalysisVarFlagged3
WHERE Diff < 0
ORDER BY &IDVar, Diff DESC;
CREATE TABLE forward AS
SELECT &IDVar,
%SCAN(&list, 3, ' ') AS After,
Diff
FROM AnalysisVarFlagged3
WHERE Diff > 0
ORDER BY &IDVar, Diff;
CREATE TABLE MultipleAnalysisVars AS
SELECT &IDVar,
&DateVar,
%SCAN(&list, 3, ' '),
%SCAN(&list, 4, ' '),
%SCAN(&list, 5, ' '),
%SCAN(&list, 6, ' ')
FROM AnalysisVarFlagged3
where Diff = 0;
QUIT;
DATA backward;
SET backward;
BY &IDVar;
IF first.&IDVar;
RUN;
DATA forward;
SET forward;
BY &IDVar;
IF first.&IDVar;
RUN;
PROC SQL;
CREATE TABLE SameDayAnalysisVars AS
SELECT a.*,
b.before,
c.after
FROM MultipleAnalysisVars AS a
LEFT JOIN backward AS b
ON a.&IDVar = b.&IDVar
LEFT JOIN forward AS c
ON a.&IDVar = c.&IDVar;
QUIT;
* find min(SD) of AnalysisVar{i} vs. Before and After measures;
DATA AnalysisVar1 (KEEP = &IDVar &DateVar %SCAN(&list, 3, ' '))
AnalysisVar2 (KEEP = &IDVar &DateVar %SCAN(&list, 4, ' '))
AnalysisVar3 (KEEP = &IDVar &DateVar %SCAN(&list, 5, ' '))
AnalysisVar4 (KEEP = &IDVar &DateVar %SCAN(&list, 6, ' '))
dump;
SET SameDayAnalysisVars;
SD1 = STD(of %SCAN(&list, 3, ' '), Before, After);
SD2 = STD(of %SCAN(&list, 4, ' '), Before, After);
IF %SCAN(&list, 5, ' ') ne . THEN
DO;
SD3 = STD(of %SCAN(&list, 5, ' '), Before, After);
END;
IF %SCAN(&list, 6, ' ') ne . THEN
DO;
SD4 = STD(of %SCAN(&list, 6, ' '), Before, After);
END;
MinAnalysisVar = MIN(of SD1-SD4);
IF SD1 = MinAnalysisVar THEN OUTPUT AnalysisVar1;
ELSE IF SD2 = MinAnalysisVar THEN OUTPUT AnalysisVar2;
ELSE IF SD3 = MinAnalysisVar THEN OUTPUT AnalysisVar3;
ELSE IF SD4 = MinAnalysisVar THEN OUTPUT AnalysisVar4;
ELSE OUTPUT dump;
RUN;
* join back to phase 1 AnalysisVar;
PROC SQL;
CREATE TABLE Phase1Output AS
SELECT DISTINCT a.&IDVar,
a.&Datevar,
CASE WHEN b.%SCAN(&list, 3, ' ') NE . THEN b.%SCAN(&list, 3, ' ')
WHEN c.%SCAN(&list, 4, ' ') NE . THEN c.%SCAN(&list, 4, ' ')
WHEN d.%SCAN(&list, 5, ' ') NE . THEN d.%SCAN(&list, 5, ' ')
WHEN e.%SCAN(&list, 6, ' ') NE . THEN e.%SCAN(&list, 6, ' ')
ELSE a.&AnalysisVar
END AS AnalysisVarFixed
FROM MeanFixed AS a
LEFT JOIN AnalysisVar1 AS b
ON a.&IDVar = b.&IDVar
AND
a.&DateVar = b.&DateVar
LEFT JOIN AnalysisVar2 AS c
ON a.&IDVar = c.&IDVar
AND
a.&DateVar = c.&DateVar
LEFT JOIN AnalysisVar3 AS d
ON a.&IDVar = d.&IDVar
AND
a.&DateVar = d.&DateVar
LEFT JOIN AnalysisVar4 AS e
ON a.&IDVar = e.&IDVar
AND
a.&DateVar = e.&DateVar;
QUIT;
/****** clear workspace if clearSpace = 'TRUE' ******/
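/* NOTE: `sets` is only a dummy driver table; the PROC SQL below selects
   from it so the CASE expression can build &deleteStatement., which is
   either the conditional PROC DELETE step or a blank statement. */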
DATA sets;
SET sashelp.class;
RUN;
PROC SQL noprint;
SELECT CASE WHEN &clearSpace = 'TRUE'
THEN 'PROC DELETE
DATA = MeanFixed
df_step1
df_step2
df_step3
AnalysisVar2SD
AnalysisVarFlagged
AnalysisVarFlagged2
AnalysisVarFlagged3
AnalysisVarStats
AnalysisVarWide
SameDayAnalysisVar
Backward
Forward
MultipleAnalysisVars
SameDayAnalysisVars
AnalysisVar1
AnalysisVar2
AnalysisVar3
AnalysisVar4
Dump;
RUN;'
ELSE ' ' END
INTO :deleteStatement
FROM sets;
QUIT;
&deleteStatement;
%MEND;
/********************************** dolag() **********************************
* `&DSN.` is just the name of the data set that contains the data to be
cleaned, in "long" format (i.e., one record per date per patient).
* `&IDVAR.` is the name of the variable that uniquely identifies each
patient in the data set (e.g., ScrSSN, PatientICN, PERSON_ID, StudyID).
* `&DATEVAR.` is the name of the variable that contains the date on which
the value of &ANALYSISVAR. was recorded, i.e., the date of the weight
measurement record. NB: All of the following code was used with data
sets where there was only ever one weight measurement record per
calendar date per patient.
* `&ANALYSISVAR.` is the name of the numeric variable in &DSN. that these
macros are supposed to "clean". In this case, it should be the name of
the variable that contains weight measurement values.
*****************************************************************************/
%macro dolag(DSN = , IDVar = , DateVar = , AnalysisVar = );
%LET DSN = &DSN.;
%LET IDVar = &IDVAR.;
%LET DateVar = &DATEVAR.;
%LET AnalysisVar = &ANALYSISVAR.;
* get lagged values, calculate standard deviations, flag SD > 20;
DATA &DSN.;
SYSECHO "DATA Step: Get lagged values, calculate SDs, flag SD > 20";
SET &DSN.;
BY &IDVAR. &DATEVAR.;
before = lag2(&ANALYSISVAR.);
center = lag(&ANALYSISVAR.);
after = &ANALYSISVAR.;
lagid1 = lag(&IDVAR.);
lagid2 = lag2(&IDVAR.);
lagdt = lag(&DATEVAR.);
IF (lagid2 = &IDVAR.) THEN stdev = std(before, center, after);
IF (stdev > 20) THEN highsd = 1;
KEEP &IDVAR. &DATEVAR. before center after lag: stdev highsd;
RUN;
/* save last observation for each individual */
DATA &DSN.l(rename = (a = after b = before c = center));
SYSECHO "DATA Step: Save last observation for each &IDVAR.";
SET &DSN.;
BY &IDVAR. &DATEVAR.;
IF last.&IDVAR.;
b = center;
c = after;
a = .;
stdev = .;
KEEP &IDVAR. &DATEVAR. b c a stdev;
RUN;
/* reset values when person_id changes */
DATA &DSN.;
SYSECHO "DATA Step: Reset values for each new &IDVAR.";
SET &DSN.;
BY &IDVAR. &DATEVAR.;
DROP &DATEVAR.;
IF (lagid2 NE &IDVAR.) THEN before = .;
IF (lagid1 NE &IDVAR.) THEN center = .;
RUN;
/* move each SD to center of measures */
DATA &DSN.;
SYSECHO "DATA Step: Move each SD to center of measures";
SET &DSN.;
IF (lagid1 = &IDVAR.) THEN &DATEVAR. = lagdt;
DROP lag:;
* IF (NOT missing(&DATEVAR.));
RUN;
DATA &DSN.;
SYSECHO "DATA Step: Keep only non-missing &DATEVAR.";
SET &DSN.;
IF (&DATEVAR. NE .);
RUN;
PROC SORT DATA = &DSN.;
SYSECHO "PROC SORT: By &IDVAR. &DATEVAR.";
BY &IDVAR. &DATEVAR.;
RUN;
/* tack on last observation for each individual and flag first and last
observations for each individual */
DATA &DSN.;
SYSECHO "DATA Step: Add last obs, flag first and last obs per &IDVAR.";
SET &DSN. &DSN.l;
BY &IDVAR. &DATEVAR.;
RUN;
%mend dolag;
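For orientation, the core of %dolag() — a rolling standard deviation over each run of three consecutive measures, assigned to the central measure — corresponds roughly to the following R sketch (assuming one patient's weights already sorted by date; the function name is illustrative):
# Rough R analogue of the rolling-SD step in %dolag(): for one patient's
# weights sorted by date, compute the SD of each run of three consecutive
# measures and assign it to the central measure of the run.
rollingSD3 <- function(w) {
  out <- rep(NA_real_, length(w))
  if (length(w) >= 3) {
    for (i in 2:(length(w) - 1)) out[i] <- sd(w[(i - 1):(i + 1)])
  }
  out
}

# The 250 lb reading stands out, so every rolling SD involving it is large:
w <- c(201, 203, 250, 202, 204)
round(rollingSD3(w), 1)
#> [1]   NA 27.7 27.4 27.2   NA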
/********************************* prepost() *********************************
* `&DSN.` is just the name of the data set that contains the data to be
cleaned, in "long" format (i.e., one record per date per patient).
* `&IDVAR.` is the name of the variable that uniquely identifies each
patient in the data set (e.g., ScrSSN, PatientICN, PERSON_ID, StudyID).
* `&DATEVAR.` is the name of the variable that contains the date on which
the value of &ANALYSISVAR. was recorded, i.e., the date of the weight
measurement record. NB: All of the following code was used with data
sets where there was only ever one weight measurement record per
calendar date per patient.
* `&ANALYSISVAR.` is the name of the numeric variable in &DSN. that these
macros are supposed to "clean". In this case, it should be the name of
the variable that contains weight measurement values.
******************************************************************************/
%macro prepost(DSN = , IDVar = , DateVar = , AnalysisVar = );
%LET DSN = &DSN.;
%LET IDVar = &IDVAR.;
%LET DateVar = &DATEVAR.;
%LET AnalysisVar = &ANALYSISVAR.;
PROC SORT DATA = &DSN.;
SYSECHO "PROC SORT: By &IDVAR. &DATEVAR.";
BY &IDVAR. &DATEVAR.;
RUN;
%dolag(DSN = &DSN.,
IDVar = &IDVAR.,
DateVar = &DATEVAR.,
AnalysisVar = &ANALYSISVAR.);
/* number position within each individual's measures from beginning */
DATA &DSN.;
SYSECHO "DATA Step: Determine chronological order of records";
SET &DSN.;
BY &IDVAR. &DATEVAR.;
RETAIN fctr 0;
IF (first.&IDVAR.) THEN fctr = 0;
fctr + 1;
RUN;
/* number position within each individual's measures from end */
PROC SORT data = &DSN.;
SYSECHO "PROC SORT: By &IDVAR. DESCENDING &DATEVAR.";
BY &IDVAR. descending &DATEVAR.;
RUN;
DATA &DSN.;
SYSECHO "DATA Step: Determine reverse chronological order of records";
SET &DSN.;
BY &IDVAR. descending &DATEVAR.;
RETAIN lctr 0;
IF (first.&IDVAR.) THEN lctr = 0;
lctr + 1;
RUN;
*Re-sort with dates in ascending order;
PROC SORT DATA = &DSN.;
SYSECHO "PROC SORT: By &IDVAR. &DATEVAR.";
BY &IDVAR. &DATEVAR.;
RUN;
/* find those with highsd in first three observations */
PROC FREQ DATA = &DSN.;
SYSECHO "PROC FREQ: Identify &IDVAR.s with high SD in first 3 obs";
TABLES &IDVAR. / NOPRINT OUT = skipfirst;
WHERE ((fctr IN (2 3 4)) AND (highsd = 1));
RUN;
DATA skipfirst;
SYSECHO "DATA Step: Keep &IDVAR.s with high SD in all first 3 obs";
SET skipfirst;
IF count = 3;
KEEP &IDVAR.;
RUN;
/* allow those with highsd in first three observations to skip first obs processing */
DATA first&DSN.;
SYSECHO "DATA Step: Exclude &IDVAR.s from first obs processing";
MERGE &DSN. skipfirst (IN = ins);
BY &IDVAR.;
IF NOT (ins);
RUN;
/* select sd > 35 when in first two sd calculations */
DATA first&DSN.;
SYSECHO "DATA Step: Find SD > 35 in first 2 SD calculations";
SET first&DSN.;
BY &IDVAR. &DATEVAR.;
IF ((fctr IN (2 3)) AND (stdev > 35));
RUN;
DATA first&DSN.;
SYSECHO "DATA Step: Flag ""first 3"" obs for deletion";
SET first&DSN.;
BY &IDVAR. &DATEVAR.;
KEEP &IDVAR. del;
/* if first two sds > 35 mark second obs for deletion */
IF not(first.&IDVAR. AND last.&IDVAR.) THEN DO;
del = 2;
IF (last.&IDVAR.) THEN output;
END;
ELSE DO;
sd_after = std(before, center);
sd_before = std(center, after);
sd_center = std(before, after);
/*
if only the first sd > 35, check the sd of each pair within the group of
3: flag the value whose two pairwise sds are >= 45 when the remaining
pairwise sd is < 10
*/
IF (MIN(sd_after, sd_before, sd_center) NE sd_after) THEN DO;
IF ((sd_before < sd_center) AND
(sd_before < 10) AND
(sd_center ge 45) AND
(sd_after ge 45))
THEN del = 1;
ELSE DO;
IF ((sd_center < 10) AND
(sd_before ge 45) AND
(sd_after ge 45))
THEN del = 2;
END;
END;
IF (del IN (1 2)) THEN OUTPUT;
END;
RUN;
/* find those with highsd in last three observations */
PROC FREQ DATA = &DSN.;
SYSECHO "PROC FREQ: Identify &IDVAR.s with high SD in last 3 obs";
TABLES &IDVAR. / NOPRINT OUT = skiplast;
WHERE ((lctr IN (2 3 4)) AND (highsd = 1));
RUN;
DATA skiplast;
SET skiplast;
IF count = 3;
KEEP &IDVAR.;
RUN;
/* allow those with highsd in last three observations to skip last obs processing */
DATA last&DSN.;
MERGE &DSN. skiplast(in = ins);
BY &IDVAR.;
IF not(ins);
RUN;
DATA last&DSN.;
SET last&DSN.;
BY &IDVAR. &DATEVAR.;
IF ((lctr IN (2 3)) AND (stdev > 35));
RUN;
DATA last&DSN.;
SET last&DSN.;
BY &IDVAR. &DATEVAR.;
KEEP &IDVAR. del;
/* if last two sds > 35 mark second from last obs for deletion */
IF not(first.&IDVAR. AND last.&IDVAR.) THEN DO;
del = 2;
IF (last.&IDVAR.) THEN OUTPUT;
END;
ELSE DO;
sd_after = std(before, center);
sd_before = std(center, after);
sd_center = std(before, after);
/*
if only the last sd > 35, check the sd of each pair within the group of
3: flag the value whose two pairwise sds are >= 45 when the remaining
pairwise sd is < 10
*/
IF (min(sd_after, sd_before, sd_center) NE sd_before) THEN DO;
IF ((sd_after < sd_center) AND
(sd_after < 10) AND
(sd_center ge 45) AND
(sd_before ge 45))
THEN del = 1;
ELSE DO;
IF ((sd_center < 10) AND
(sd_after ge 45) AND
(sd_before ge 45))
THEN del = 2;
END;
END;
IF (del IN (1 2)) THEN OUTPUT;
END;
RUN;
/* eliminate measures flagged for deletion */
DATA &DSN.;
MERGE &DSN. first&DSN.(in = inf) last&DSN.(in = inl);
BY &IDVAR.;
KEEP &IDVAR. &DATEVAR. center;
IF ((inf) AND (del = fctr)) THEN DELETE;
IF ((inl) AND (del = lctr)) THEN DELETE;
RENAME center = &ANALYSISVAR.;
RUN;
PROC SORT DATA = &DSN.;
BY &IDVAR. &DATEVAR.;
RUN;
%dolag(DSN = &DSN.,
IDVar = &IDVAR.,
DateVar = &DATEVAR.,
AnalysisVar = &ANALYSISVAR.);
/* number blocks of consecutive high sd */
DATA &DSN.;
SET &DSN.;
BY &IDVAR. &DATEVAR.;
RETAIN highctr 0;
IF (first.&IDVAR.) THEN highctr = 0;
IF highsd = 1 THEN highctr + 1;
ELSE highctr = 0;
RUN;
%MEND prepost;
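As a cross-check on the first/last-group logic, here is a rough R restatement of the published rule for the start of a weight series (the end of a series is the mirror image). It follows the 35/45/10 lb thresholds from the methods text rather than translating the macro line for line; the function name and return convention are illustrative:
# Rough R restatement of the published first-group rule. Returns the index of
# the measure to drop (1, 2, or 3), or NA if nothing is dropped.
firstGroupRule <- function(w) {
  if (length(w) < 3) return(NA_integer_)
  sd1 <- sd(w[1:3])                                     # SD of measures 1-3
  sd2 <- if (length(w) >= 4) sd(w[2:4]) else NA_real_   # SD of measures 2-4
  if (!is.na(sd2) && sd1 > 35 && sd2 > 35) return(2L)   # both high: drop measure 2
  if (sd1 > 35) {
    # only the first SD is high: look at pairwise SDs within measures 1-3
    p12 <- sd(w[c(1, 2)]); p13 <- sd(w[c(1, 3)]); p23 <- sd(w[c(2, 3)])
    # two pairwise SDs > 45 lb and the remaining one < 10 lb identify the offender
    if (p12 > 45 && p13 > 45 && p23 < 10) return(1L)
    if (p12 > 45 && p23 > 45 && p13 < 10) return(2L)
    if (p13 > 45 && p23 > 45 && p12 < 10) return(3L)
  }
  NA_integer_
}

firstGroupRule(c(300, 180, 182, 181))   # the 300 lb value is the offender
#> [1] 1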
Example in SAS & R
/******************************* Set Library ********************************/
ods html close; ods html;
libname DCEP "X:\Damschroder-NCP PEI\6. Aim 2 Weight Algorithms\CDW Data";
/**************************** Import Sample Data ****************************/
* data stored in SQL DB;
LIBNAME samp OLEDB
INIT_STRING = "Provider = SQLNCLI11;
Integrated Security = SSPI;
Persist Security Info = True;
Initial Catalog = NCP_DCEP;
Data Source = vhacdwrb03.vha.med.va.gov"
Schema = samp;
/* examine data
%dim(Samp.heightSamples);
* nrow = 94146, ncol = 5;
%dim(Samp.weightSamples);
* nrow = 279604, ncol = 8;
*/
/******************************* Examine Data *******************************/
PROC UNIVARIATE DATA = Samp.heightSamples;
VAR height;
RUN;
PROC UNIVARIATE DATA = Samp.weightSamples;
VAR weight;
RUN;
PROC FREQ DATA = Samp.weightSamples;
TABLES (Sta3n Sta6a DivisionName) * SampleYear
(Sta3n Sta6a DivisionName) * Source
/ nocum missing;
RUN;
/***************************** Macro invocation *****************************/
/*** PCP Source ***/
DATA PCP; SET Samp.weightSamples; WHERE Source = 'PCP'; RUN; * nrow = 225363;
%LET DSN = PCP;
%LET IDVar = PatientICN;
%LET DateVar = WeightDate;
%LET AnalysisVar = Weight;
/***** Step 1: Remove missing values and outliers, fix same day weights *****/
%SameDayMeasures(DSN = &DSN.,
IDVar = &IDVar.,
DateVar = &DateVar.,
AnalysisVar = &AnalysisVar.,
clearSpace = 'TRUE');
* nrow_before = 225363, ncol = 15;
* nrow_after = 219958, ncol = 3;
* reduction by 5405 (2.4%);
/************ Step 2: Rolling SD by cluster and person algorithm ************/
/*
* internal function for prepost();
* ran here as test;
%dolag(DSN = MOVEweights,
IDVar = PatientICN,
DateVar = WeightDateTime,
AnalysisVar = Weight);
*/
%LET DSN = Phase1Output;
%LET AnalysisVar = AnalysisVarFixed;
%prepost(DSN = &DSN.,
IDVar = &IDVar.,
DateVar = &DateVar.,
AnalysisVar = &AnalysisVar.);
* Augments &DSN. and creates &DSN.l;
* nrow_before = 219958, ncol = 3;
* nrow_after = 219871, ncol = 8;
* reduction by 87 (.04%);
DATA PCPcleaned; SET Phase1Output; RUN; * nrow = 219871, ncol = 8;
/*** MOVE! Source ***/
DATA MOVE; SET Samp.weightSamples; WHERE Source = 'MOVE'; RUN; * nrow = 54241;
/***** Step 1: Remove missing values and outliers, fix same day weights *****/
%LET DSN = MOVE;
%LET AnalysisVar = Weight;
%SameDayMeasures(DSN = &DSN.,
IDVar = &IDVar.,
DateVar = &DateVar.,
AnalysisVar = &AnalysisVar.,
clearSpace = 'TRUE');
* nrow_before = 54241, ncol = 15;
* nrow_after = 52752, ncol = 3;
* reduction by 1489 (2.7%);
/************ Step 2: Rolling SD by cluster and person algorithm ************/
%LET DSN = Phase1Output;
%LET AnalysisVar = AnalysisVarFixed;
%prepost(DSN = &DSN.,
IDVar = &IDVar.,
DateVar = &DateVar.,
AnalysisVar = &AnalysisVar.);
* Augments &DSN. and creates &DSN.l;
* nrow_before = 52752, ncol = 3;
* nrow_after = 52744, ncol = 8;
* reduction by 8 (.015%);
DATA MOVEcleaned; SET Phase1Output; RUN; * nrow = 52744, ncol = 8;
/*************************** Post-Macro Mess 'Round *************************/
/* enumerate clusters with high SD */
DATA MOVEcleaned;
SET MOVEcleaned;
BY &IDVAR. &DATEVAR.;
RETAIN cluster 0;
lagctr = lag(highctr);
IF (first.&IDVAR.) THEN cluster = 0;
ELSE DO;
IF ((lagctr = 0) AND (highctr = 1)) THEN cluster + 1;
END;
RUN;
* nrow = 52744, ncol = 10;
/* count number of high sd in each cluster */
PROC MEANS DATA = MOVEcleaned NOPRINT;
CLASS &IDVAR. cluster;
VAR highctr;
WAYS 2;
OUTPUT MAX = maxctr OUT = clusters (DROP = _TYPE_ _FREQ_);
RUN;
* nrow = 2352, ncol = 3;
/* sort for merging */
PROC SORT DATA = MOVEcleaned;
BY &IDVAR. cluster highctr;
RUN;
/* merge max number of each cluster to mark interior measures for deletion */
DATA MOVEcleaned;
MERGE MOVEcleaned clusters (IN = inc);
BY &IDVAR. cluster;
IF inc THEN DO;
IF 1 < highctr < maxctr THEN DELETE;
END;
RUN;
* nrow = 52692, ncol = 11;
/* save permanent
* whole sample;
DATA Dcep.weightSamples; SET Samp.weightSamples; RUN;
* PCP sample;
DATA Dcep.PCPcleaned; SET PCPcleaned; RUN;
* MOVE sample;
DATA Dcep.MOVEcleaned; SET MOVEcleaned; RUN;
*/
The SAS macro (%prepost()) augments the input data and removes the “central” measurement (the middle weight of a rolling group of three, ordered by date/time) when it is flagged for having a high standard deviation, plus some other magical happenings I haven’t quite figured out yet…
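The final cluster rule itself — keep only the first and last measures of any run of three or more consecutive high rolling SDs — is easier to see in a small R sketch (assuming a logical high-SD flag ordered by date within one patient; the helper is illustrative, not a translation of the DATA steps above):
# Given a logical vector flagging high rolling SDs for one patient's weights
# (ordered by date), return the indices to delete: the interior members of
# every run of three or more consecutive TRUEs.
clusterInteriorIdx <- function(highsd) {
  r      <- rle(highsd)
  ends   <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  drop   <- integer(0)
  for (k in which(r$values & r$lengths >= 3)) {
    drop <- c(drop, (starts[k] + 1):(ends[k] - 1))   # interior of the run
  }
  drop
}

highsd <- c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
clusterInteriorIdx(highsd)   # measures 3 and 4 are interior to the 4-long run
#> [1] 3 4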
Summary Statistics
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 1.7 3.0 4.0 5.0 259.4 387911
There were 15,665 clusters with a high standard deviation identified. In total, 2,311,755 observations remained after processing (118,584 fewer than the 2,430,339 input values).
Distribution of raw weight data versus algorithm processed data
Descriptive statistics by group
group: Input
  n        mean    sd     median  trimmed  mad    min   max      range    skew  kurtosis  se
  2430339  205.02  48.16  199.2   201.68   43.29  0     2423.35  2423.35  1.30  14.06     0.03

group: Output
  n        mean    sd     median  trimmed  mad    min   max      range    skew  kurtosis  se
  2311755  205.45  47.30  199.9   202.11   42.85  53.9  632.5    578.6    0.88  1.73      0.03
In this sample, the Maciejewski algorithm removed 5.08% of the weight values. For the cohort described in Maciejewski et al. (2016), the published cleaning steps reduced their sample by \(\approx 5\)% (89,757 → 85,556 measurements).
Displaying a vignette of 16 selected patients with at least 1 weight observation removed.
The left boxplot shows the raw data from the sample of 2016 PCP-visit subjects, while the right boxplot shows the output from running %SameDayMeasures() and %prepost() in SAS.