5.2 Methods
Using data from the REDS-II Donor Iron Status Evaluation (RISE) study [21], we trained multiclass prediction models to predict the risk of three iron-related adverse outcomes at a subsequent donation attempt based on the time until the donor returned. The adverse outcomes were hemoglobin deferral and completing a donation with either low or absent iron stores. We used the top-performing prediction model to generate individual risk profiles for each donor’s likelihood of iron-related adverse donation outcomes at their next visit as a function of how long until they return, and we developed a simple algorithm that used this risk profile and pre-specified risk thresholds to assign a tailored minimum IDI. We estimated the performance of tailored IDIs as compared to alternative interventions such as longer uniform IDIs and daily supplemental iron using simulation, and we developed and assessed simpler alternatives to fully tailored IDIs that achieved some of the benefits while being less complex and potentially easier to implement.
5.2.1 Data preprocessing and formatting
The RISE dataset contains data on all visits to a blood center for 2,425 donors over a 2-year period. Data elements include past donation history, biometrics for each visit, and questionnaire responses from a baseline and a final visit regarding demographics, diet, supplemental iron consumption, and female reproductive health. We used 42 features from an index donation together with the time until the donor returned to predict the outcome of a follow-up donation attempt. We assumed that donor characteristics measured at the baseline visit such as diet, vitamin use, smoking, and female reproductive health would not change significantly over the study period, and we used them to predict outcomes following subsequent donations by the same donor. We recoded or imputed missing values for some fields; Supplemental Table 10.1 contains these details for all features used for prediction. We also included a composite dietary iron consumption score that was generated for each donor in the RISE dataset as part of another secondary analysis [143]. Two additional biomarkers, ferritin and STfR, were not measured at all follow-up visits and are not routinely collected in most blood centers. We did not include them as features for prediction in the primary model but assessed their impact on predictive performance in a secondary analysis.
To generate the model development dataset, we considered donations with at least 150 mL of red blood cell loss as potential index donations, which included whole blood donations, apheresis red blood cell donations, and some donations that were classified as ‘quantity not sufficient’. From the index donations we excluded double red blood cell donations and donations missing both fingerstick hemoglobin and hematocrit. If follow-up visits were recorded after a potential index donation, we generated labels with the time until the follow-up visit (in days) and its outcome. For all index donations followed by a visit with significant iron loss, defined as a loss of at least 55 mL of red blood cells, we generated a label for the index donation based on the first such follow-up visit. We also generated labels for each index donation based on any follow-up visits that did not result in significant iron loss (i.e., visits resulting in a deferral or apheresis donations of platelets or plasma with <55 mL of red blood cell loss) if they occurred before any follow-up visit with significant iron loss. For each index donation \(i\), the outcome of its follow-up visit (\(z_i\)) was classified as a hemoglobin deferral (\(z_i=1\)) if one was recorded; as a low iron donation (\(z_i=2\)) if pre-donation ferritin was \(\geq12\) ng/mL and \(<20\) ng/mL for women or \(<30\) ng/mL for men; as an absent iron donation (\(z_i=3\)) if pre-donation ferritin was \(<12\) ng/mL; and as a ‘no adverse outcome’ donation otherwise (\(z_i=0\)). We excluded follow-up donations without ferritin measurements from the model development dataset.
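The outcome classification above can be written as a short function. The following is a minimal sketch, not the code used in the study: the function name `label_outcome`, the argument names, and the `"F"`/`"M"` sex coding are assumptions made for illustration, and the cutoffs are expressed in whatever units the ferritin values are recorded in.

```python
from typing import Optional

# Ferritin cutoffs as stated in the text (same units as the ferritin values).
ABSENT_CUTOFF = 12.0
LOW_CUTOFF = {"F": 20.0, "M": 30.0}  # hypothetical sex coding


def label_outcome(hgb_deferral: bool, ferritin: Optional[float], sex: str) -> Optional[int]:
    """Return the label z_i for one follow-up visit, or None if it cannot be labeled."""
    if hgb_deferral:
        return 1  # hemoglobin deferral
    if ferritin is None:
        return None  # no ferritin: excluded from the model development dataset
    if ferritin < ABSENT_CUTOFF:
        return 3  # absent iron donation
    if ferritin < LOW_CUTOFF[sex]:
        return 2  # low iron donation
    return 0  # no adverse outcome
```

For example, a completed donation with a ferritin value of 25 would be labeled 0 for a woman and 2 for a man, because the low iron cutoff differs by sex.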
We also generated a more representative dataset for policy simulation, which we refer to as the first return dataset. For this dataset, we labeled index donations based on the first follow-up visit that resulted in either a hemoglobin deferral or a collection of at least 55 mL of red blood cells, ignoring follow-up visits that resulted in platelet or plasmapheresis donations or deferrals unrelated to hemoglobin. When the follow-up visit resulted in a donation for which ferritin was not measured, we labeled the outcome as \(z_i=-1\).
5.2.2 Prediction model development, assessment, and calibration
We evaluated several candidate model types: gradient boosted machines, random forest, regression trees, and generalized linear models with elastic net regularization (with and without all second order interaction terms). For each model type we evaluated multiple hyperparameter settings as described in Supplemental Table 10.2. We implemented a nested cross validation procedure with resampling to minimize bias in model selection and assessment [144]. In this process, we generated 15 model assessment partitions, which consisted of 3 resamples of 5 equal-sized partitions of the entire dataset that were generated with stratified sampling to ensure the distribution of outcomes was balanced across partitions. For each model assessment partition, we defined all data not included in the partition as the corresponding model tuning set. Within each of the 15 tuning sets, we assessed all candidate model configurations (model type and hyperparameter setting) using 5-fold cross validation. We selected as the top-performing model configuration the one that produced the highest average pairwise AUC [145] across the 5 cross validation folds, averaged over all 15 tuning sets (assessing a total of 75 realizations of each candidate model configuration). For an unbiased assessment of the selected model configuration, we developed 15 additional realizations of the top model configuration on the entirety of each tuning set and generated predictions on each corresponding held-out model assessment partition, measuring the average one-vs-all AUC and misclassification rate. Separately, we generated 30 estimates of variable importance for the top-performing model configuration using 3 resamples of 10-fold cross validation across the entire dataset and characterized the mean and variation of the importance of each feature.
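The outer layer of the nested procedure (3 resamples of 5 stratified partitions, giving 15 assessment partitions with their tuning sets) can be sketched as follows. This is an illustrative sketch using only the Python standard library; the function names `stratified_folds` and `assessment_partitions`, the seed, and the round-robin stratification scheme are assumptions and do not describe the software used in the study.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, rng):
    """Split indices into n_folds folds with outcome labels balanced across folds."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    folds = [[] for _ in range(n_folds)]
    offset = 0  # carries over between labels so fold sizes stay even
    for lab in sorted(by_label):
        members = by_label[lab]
        rng.shuffle(members)
        for j, idx in enumerate(members):
            folds[(offset + j) % n_folds].append(idx)
        offset += len(members)
    return folds


def assessment_partitions(labels, n_resamples=3, n_folds=5, seed=0):
    """Return (tuning_indices, held_out_indices) pairs, one per assessment partition."""
    rng = random.Random(seed)
    everything = set(range(len(labels)))
    partitions = []
    for _ in range(n_resamples):
        for fold in stratified_folds(labels, n_folds, rng):
            tuning = sorted(everything - set(fold))
            partitions.append((tuning, sorted(fold)))
    return partitions
```

Under these assumptions, the inner layer would apply `stratified_folds` again to the labels of each tuning set with `n_folds=5`, which yields the 15 × 5 = 75 realizations of each candidate configuration mentioned above.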
In a secondary analysis, we assessed the impact of measuring ferritin and STfR on the ability to predict iron-related adverse outcomes at follow-up donation attempts. For this, we used the subset of the model development dataset for which ferritin and STfR were measured at the index visit and performed the above model selection and assessment procedure twice: once with ferritin, STfR, and derived measures included as features for prediction and once without.
To generate the final model, we retrained the selected model configuration on the entire model development dataset and then calibrated the predicted probabilities to the first return dataset. For this, we estimated the distribution of outcomes for follow-up visits in the first return dataset by assuming that the distribution of absent iron, low iron, and ‘no adverse outcome’ donations among follow-up donations at which ferritin was not measured would be the same as among those at which ferritin was measured. Mathematically, we first totaled each follow-up outcome as \(n^{(k)}\), where \(k=-1, 0, 1, 2, 3\) correspond to the outcomes described above. We then calculated \(\tilde{n}^{(k)}\), an estimate of what the totals would have been if ferritin had been measured for all follow-up donations, as \(\tilde{n}^{(1)} = n^{(1)}\) (hemoglobin deferral) and \(\tilde{n}^{(l)} = n^{(l)}+n^{(-1)}\frac{n^{(l)}}{n^{(0)}+n^{(2)}+n^{(3)}}\) for \(l=0,2,3\) (completed donations). We then used our top model to generate the unnormalized probability vector \([\hat{q}_i^{(0)}, \hat{q}_i^{(1)}, \hat{q}_i^{(2)}, \hat{q}_i^{(3)}]\) for each index donation \(i\), where \(\hat{q}_i^{(k)} = \sigma (a^{(k)} z_i^{(k)} + b^{(k)})\) and \(z_i^{(k)}\) is the uncalibrated model score for outcome \(k\) (distinct from the outcome label \(z_i\)). We computed weights \(w^{(k)}\) for the unnormalized probability of each outcome \(\hat{q}_i^{(k)}\) by solving the system of equations \(\sum_{i=1}^I \left[ w^{(k)}\hat{q}_i^{(k)}/\sum_{\tilde{k}=0}^{3} w^{(\tilde{k})}\hat{q}_i^{(\tilde{k})} \right] = \tilde{n}^{(k)}\) for \(k=0,1,2,3\). The final calibrated model used parameters \(a^{(k)}\), \(b^{(k)}\), and \(w^{(k)}\) together with the uncalibrated scores \(z_i^{(k)}\) to produce the estimated likelihood of each outcome at a follow-up donation as \(\tilde{q}_i^{(k)}=w^{(k)}\hat{q}_i^{(k)}/\sum_{\tilde{k}=0}^{3} w^{(\tilde{k})}\hat{q}_i^{(\tilde{k})}\). This ensured that the expected distribution of predicted outcomes for the first return dataset would correspond to our estimated totals \(\tilde{n}^{(k)}\).
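The two calibration steps, imputing the totals \(\tilde{n}^{(k)}\) and finding weights \(w^{(k)}\) that reproduce them, can be illustrated with a small sketch. The text does not state how the system of equations was solved; the multiplicative fixed-point update in `solve_weights` is an assumed solver chosen for simplicity, and all function names are hypothetical. Outcome order in each vector is \(k=0,1,2,3\).

```python
def impute_totals(n):
    """n maps outcome k in {-1, 0, 1, 2, 3} to its observed count; returns n-tilde."""
    measured = n[0] + n[2] + n[3]
    tilde = {1: float(n[1])}  # hemoglobin deferrals are fully observed
    for l in (0, 2, 3):
        # Spread the unmeasured donations proportionally over completed outcomes.
        tilde[l] = n[l] + n[-1] * n[l] / measured
    return tilde


def calibrate(q_hat, w):
    """Weight one unnormalized probability vector and renormalize it to sum to 1."""
    scaled = [wk * qk for wk, qk in zip(w, q_hat)]
    total = sum(scaled)
    return [s / total for s in scaled]


def solve_weights(q_hats, targets, max_iter=10000, tol=1e-10):
    """Find weights so the calibrated probabilities sum to the targets (assumed solver)."""
    w = [1.0] * len(targets)
    for _ in range(max_iter):
        expected = [0.0] * len(targets)
        for q_hat in q_hats:
            for k, p in enumerate(calibrate(q_hat, w)):
                expected[k] += p
        if max(abs(e - t) for e, t in zip(expected, targets)) < tol:
            break
        # Scale each weight by the ratio of target total to current expected total.
        w = [wk * t / e for wk, t, e in zip(w, targets, expected)]
    return w
```

Because the calibrated probabilities are renormalized, the weights are determined only up to a common scale factor, so any rescaling of the returned vector gives the same calibrated predictions.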
5.2.3 Risk profile development and analysis
For each index donation in the first return dataset, we generated a risk profile by predicting the likelihood of each outcome at the donor’s next donation attempt using the calibrated model while varying the time until their follow-up visit from 56 to 250 days. We generated graphical representations of how each donor’s estimated risk developed from 56 to 250 days post-donation. To analyze the risk trajectories, we calculated two metrics for each donor, the risk of an adverse outcome on day 56 and the change in risk from day 56 to day 250, and characterized the distribution of these two metrics for donations in the first return dataset.
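A risk profile and its two summary metrics can be sketched as below. The `predict` argument is a hypothetical stand-in for the calibrated model: it is assumed to take a donor's features and a return time in days and to return the probabilities of outcomes \(k=0,1,2,3\) in that order. Treating "risk of an adverse outcome" as the sum over \(k=1,2,3\) is also an assumption of this sketch.

```python
def risk_profile(predict, features, first_day=56, last_day=250):
    """Map each candidate return day to the predicted outcome probabilities."""
    return {t: predict(features, t) for t in range(first_day, last_day + 1)}


def adverse_risk(q):
    """Total probability of any adverse outcome (k = 1, 2, 3)."""
    return sum(q[1:])


def trajectory_metrics(profile):
    """Return (risk on the first day, change in risk from first to last day)."""
    first_day, last_day = min(profile), max(profile)
    start = adverse_risk(profile[first_day])
    return start, adverse_risk(profile[last_day]) - start
```

A negative second metric indicates that the donor's estimated risk falls as the interval lengthens, which is the pattern a longer minimum IDI is meant to exploit.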
5.2.4 Personalized IDI decision rule
We developed a simple decision algorithm to prescribe a tailored IDI following each index donation \(i\) based on pre-specified risk thresholds \(h_k\) for adverse outcomes \(k=1,2,3\). The algorithm identifies the minimum interval \(t_i^* \geq 56\) such that the estimated risk for each adverse outcome \(k\) is at or below its corresponding threshold \(h_k\) on post-donation day \(t_i^*\) and on all days \(t_i>t_i^*\) up to post-donation day 250 (i.e., \(t_i^*\) is the smallest value such that \(\tilde{q}_{t_i}^{(k)} \leq h_k\) for all \(k=1,2,3\) and all \(t_i^* \leq t_i \leq 250\)).
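The decision rule can be sketched by scanning the risk profile backward from day 250. In this illustrative sketch, `profile[t]` is assumed to hold the calibrated probabilities for outcomes \(k=0,1,2,3\) on post-donation day \(t\), and `thresholds` maps each adverse outcome \(k\) to \(h_k\). Returning `None` when the thresholds are not met on day 250 is an assumption, since the text does not describe that case.

```python
def tailored_idi(profile, thresholds, first_day=56, last_day=250):
    """Return the smallest day t* such that all risks stay within thresholds on t*..last_day."""
    t_star = None
    # Walk backward so that the first violation ends the run of acceptable days.
    for t in range(last_day, first_day - 1, -1):
        if all(profile[t][k] <= h for k, h in thresholds.items()):
            t_star = t
        else:
            break
    return t_star
```

Scanning backward enforces the requirement that risk stays within the thresholds on every later day, so a donor whose risk dips below a threshold early and then rises again is assigned the later crossing.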
5.2.5 Policy simulation
We used the first return dataset to estimate the impact of tailored IDIs with various combinations of risk thresholds. We varied each threshold \(h_k\) for \(k=1,2,3\) from 0.1 to 0.9 in increments of 0.1 and evaluated all possible combinations except for those that assigned a higher risk threshold for absent iron donations than for low iron donations. We assumed donors whose tailored IDI exceeded their actual return time would exhibit the same delay from when they were first eligible to return to when they actually returned as they did in the dataset (i.e., we subtracted 56 from their actual return interval and added this duration to their tailored minimum IDI). If a donor’s actual return interval exceeded their prescribed minimum IDI, then we assumed there would be no change to their return behavior. We computed the expected rate of each outcome \(k\) for each index donation \(i\) as \(\gamma^{(k)}_i = \tilde{q}_{t^r_i}^{(k)} /t^r_i\), where \(t^r_i\) is the estimated return time. We then estimated the rate of each outcome per 100 donor-years across the entire cohort as \(36525(\sum_{i=1}^I \gamma^{(k)}_i)/I\), where the coefficient 36525 was derived by multiplying the average number of days in a year (365.25) by 100. We also simulated three alternative non-tailored minimum IDI policies based on current policies in Canada (56 days for men and 84 days for women), Australia (84 days for all donors), and the United Kingdom (84 days for women, 122 days for men).
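The return-time adjustment and the rate calculation can be sketched as two small functions. The function names are hypothetical, and treating a return interval exactly equal to the prescribed IDI as "no change" is an assumption, since the text only describes the strictly greater and strictly smaller cases.

```python
def simulated_return_time(actual_return, tailored_idi, baseline_idi=56):
    """Shift the donor's observed delay past eligibility onto the tailored IDI."""
    if tailored_idi > actual_return:
        return tailored_idi + (actual_return - baseline_idi)
    return actual_return  # donor already waited long enough; behavior unchanged


def rate_per_100_donor_years(probs, return_times):
    """probs[i] is the calibrated risk of one outcome at return_times[i] (in days)."""
    gammas = [q / t for q, t in zip(probs, return_times)]
    return 36525 * sum(gammas) / len(gammas)
```

For instance, a donor who returned on day 70 under a 56-day minimum waited 14 days past eligibility, so a tailored IDI of 100 days gives a simulated return on day 114.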