## 9.6 Impacts of Data Processing on Modeling

The amount of preprocessing could be considered a tuning parameter. If so, then the goal would be to select the best model and the appropriate amount of signal processing. The strategy here is to use the small-scale reactor data, with their corresponding resampled performance estimates, to choose the best combination of analytical methods. Once we come up with one or two candidate combinations of preprocessing and modeling, the models built with the small-scale reactor spectra are used to predict the analogous data from the large-scale reactors. In other words, the small bioreactors are the training data while the large bioreactors are the test data. Hopefully, a model built on one data set will be applicable to the data from the production-sized reactors.

The training data contain 15 small bioreactors, each with 14 daily measurements. While the number of experimental units is small, there are still several reasonable options for cross-validation. The first option one could consider in this setting would be leave-one-bioreactor-out cross-validation. This approach would place the data from 14 small bioreactors in the analysis set and use the data from one small bioreactor for the assessment set. Another approach would be to use grouped *V*-fold cross-validation with the bioreactor as the experimental unit. A natural choice for *V* in this setting would be 5; each fold would place 12 bioreactors in the analysis set and 3 in the assessment set. The table below shows an example of such an allocation for these data:

Resample | Held-out Bioreactors |
---|---|
1 | 5, 9, and 13 |
2 | 4, 6, and 11 |
3 | 3, 7, and 15 |
4 | 1, 8, and 10 |
5 | 2, 12, and 14 |
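The allocation in the table can be mimicked with a few lines of code. The following is a minimal sketch in Python, not the code used to produce the table: the helper `grouped_folds`, the seed, and the row layout are illustrative assumptions, and a different seed will hold out different bioreactors than those listed above.

```python
import random

def grouped_folds(groups, v, seed):
    """Assign each unique group (here, a bioreactor) to exactly one of v folds."""
    units = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(units)
    # Deal the shuffled units round-robin so fold sizes differ by at most one.
    return [set(units[i::v]) for i in range(v)]

# 15 small bioreactors x 14 daily measurements = 210 rows (hypothetical layout).
rows = [(reactor, day) for reactor in range(1, 16) for day in range(1, 15)]
reactor_of_row = [reactor for reactor, _ in rows]

folds = grouped_folds(reactor_of_row, v=5, seed=1)

splits = []
for held_out in folds:
    # All 14 days of a held-out bioreactor go to the assessment set together.
    analysis = [i for i, r in enumerate(reactor_of_row) if r not in held_out]
    assessment = [i for i, r in enumerate(reactor_of_row) if r in held_out]
    splits.append((analysis, assessment))

for k, (analysis, assessment) in enumerate(splits, start=1):
    print(k, sorted(folds[k - 1]), len(analysis), len(assessment))
```

Because folds are formed from bioreactors rather than rows, no bioreactor ever contributes measurements to both the analysis and assessment sets of the same resample.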

In the case of leave-one-bioreactor-out and grouped *V*-fold cross-validation, each sample is predicted exactly once. When the number of experimental units is small, the impact of any one unit on model tuning and cross-validation performance increases. Just one unusual unit has the potential to alter the optimal tuning parameter selection as well as the estimate of model predictive performance. Averaging performance across more cross-validation replicates, known as repeated cross-validation, is an effective way to dampen the effect of an unusual unit. But how many repeats are necessary? Generally, 5 repeats of grouped *V*-fold cross-validation are sufficient, but the number can depend on the computational burden that the problem presents as well as on the sample size. For smaller data sets, increasing the number of repeats will produce more precise performance estimates. Alternatively, for larger data sets, the number of repeats may need to be less than 5 to be computationally feasible. For these data, 5 repeats of grouped 5-fold cross-validation were performed.
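To make the bookkeeping of repeated grouped cross-validation concrete, the sketch below generates 5 repeats of grouped 5-fold resamples for 15 bioreactors. It is an illustration only: the function name `repeated_grouped_folds` and the seed are assumptions, and no model is fit here.

```python
import random
from collections import Counter

def repeated_grouped_folds(units, v, repeats, seed):
    """Return (repeat, fold, held_out_units) for each resample."""
    rng = random.Random(seed)
    resamples = []
    for rep in range(1, repeats + 1):
        shuffled = list(units)
        rng.shuffle(shuffled)  # a fresh random allocation for every repeat
        for fold in range(v):
            resamples.append((rep, fold + 1, sorted(shuffled[fold::v])))
    return resamples

bioreactors = list(range(1, 16))
resamples = repeated_grouped_folds(bioreactors, v=5, repeats=5, seed=2024)

# Count how often each bioreactor lands in an assessment set.
held_out_counts = Counter(r for _, _, held in resamples for r in held)

print(len(resamples))                 # 25 resamples in total
print(set(held_out_counts.values()))  # {5}: each bioreactor is held out once per repeat
```

The resampled performance estimate would then be the average of the 25 assessment-set RMSE values, so an unusual bioreactor influences only 5 of the 25 assessment sets, each time alongside different companions.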

When profile data are in their raw form, the choice of modeling techniques is limited. Modeling techniques that incorporate simultaneous dimension reduction and prediction, like principal component regression (PCR) and partial least squares (PLS), are often chosen. Partial least squares, in particular, is a very popular modeling technique for this type of data because it condenses the predictor information into a smaller region of the predictor space that is optimally related to the response. However, PLS and PCR are only effective when the relationship between the predictors and the response follows a straight line or plane. These methods are not optimal when the underlying relationship between predictors and response is non-linear.

Neural networks and support vector machines cannot directly handle profile data. But their ability to uncover non-linear relationships between predictors and a response makes them very desirable modeling techniques. The preprocessing techniques presented earlier in this chapter will enable these techniques to be applied to profile data.

Tree-based methods can also tolerate the highly correlated nature of profile data. The primary drawback of using these techniques is that the variable importance calculations may be misled due to the high correlations among predictors. Also, if the trends in the data are truly linear, these models will have to work harder to approximate linear patterns.

In this section, linear, non-linear, and tree-based models will be trained. Specifically, we will explore the performance of PLS, Cubist, radial basis function SVMs, and feed-forward neural networks. Each of these models will first be trained on the small-scale analysis set without any profile preprocessing. Then each model will be trained after each step in the preprocessing sequence of baseline correction, standardization, smoothing, and first-order derivatives. Model-specific preprocessing steps will also be applied within each stage of the profile preprocessing sequence. For example, centering and scaling each predictor is beneficial for PLS. While the profile preprocessing steps significantly reduce correlation among predictors, some high correlation still remains. Therefore, removal of highly correlated predictors will be an additional step for the SVM and neural network models.
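The idea of treating the amount of preprocessing as a sequence of nested stages can be sketched as follows. This is a deliberately simplified stand-in and not the implementation used for the analyses: the straight-line baseline, the per-spectrum centering and scaling, the moving-average smoother, the window width, and the synthetic spectrum are all assumptions made for illustration. Only first-order differencing is taken directly from the text.

```python
import math
from statistics import mean, pstdev

def baseline_correct(spectrum):
    """Subtract a straight line through the first and last points (simplified)."""
    n = len(spectrum)
    first, last = spectrum[0], spectrum[-1]
    return [x - (first + (last - first) * i / (n - 1)) for i, x in enumerate(spectrum)]

def standardize(spectrum):
    """Center and scale a single spectrum by its own mean and standard deviation."""
    m, s = mean(spectrum), pstdev(spectrum)
    return [(x - m) / s for x in spectrum]

def smooth(spectrum, window=5):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(spectrum)):
        lo, hi = max(0, i - half), min(len(spectrum), i + half + 1)
        out.append(mean(spectrum[lo:hi]))
    return out

def first_derivative(spectrum):
    """First-order differences between adjacent wavelengths."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

def preprocess(spectrum, steps):
    """Apply the steps in order; an empty list leaves the spectrum raw."""
    for step in steps:
        spectrum = step(spectrum)
    return spectrum

# Synthetic spectrum (hypothetical): one peak sitting on a sloped baseline.
raw = [0.02 * i + math.exp(-((i - 50) ** 2) / 50.0) for i in range(100)]

# Each stage adds one more step to the previous stage.
pipelines = {
    "none": [],
    "baseline": [baseline_correct],
    "standardized": [baseline_correct, standardize],
    "smoothed": [baseline_correct, standardize, smooth],
    "derivative": [baseline_correct, standardize, smooth, first_derivative],
}
processed = {name: preprocess(raw, steps) for name, steps in pipelines.items()}
```

In a resampling workflow, each entry of `pipelines` would be paired with each model, and the resulting combinations would be compared using the resampled RMSE. Model-specific steps, such as removing highly correlated predictors, would be appended after the profile steps.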

The results from repeated cross-validation across models and profile preprocessing steps are presented in Figure 9.11. This figure highlights several important results. First, overall average model performance on the assessment sets dramatically improves for SVM and neural network models with profile preprocessing. In the case of neural networks, the cross-validated RMSE for the raw data was 5.34. After profile preprocessing, the cross-validated RMSE dropped to 3.41. Profile preprocessing provides some overall improvement in predictive ability for PLS and Cubist models. But what is more noticeable for these models is the reduction in the variation of RMSE. The standard deviation of the performance in the assessment sets for PLS with no profile preprocessing was 3.08 and was reduced to 2.6 after profile preprocessing.

Based on these results, PLS using derivative features appears to be the best combination. Support vector machines with derivatives also appear promising, although the Cubist models showed good results prior to computing the derivatives. To go forward to the large-scale test set, the derivative-based PLS model and Cubist models with smoothing will be evaluated. Figure 9.12 shows the PLS results in more detail, where the resampled RMSE estimates are shown against the number of retained components for each of the preprocessing methods. Performance is equivalent across many of the preprocessing methods, but the use of derivatives clearly had a positive effect on the results. Not only were the RMSE values smaller, but only a small number of components was required to optimize performance. Again, this is most likely driven by the reduction in correlation between the wavelength features after differencing was used.

Taking the two candidate models to the large-scale bioreactor test set, Figure 9.13 shows a plot of the observed and predicted values, colored by day, for each preprocessing method. Clearly, up until standardization, the models appreciably under-predicted the glucose values (especially on the initial days). Even when the model fits are best, the initial few days of the reaction appear to have the most noise in prediction. Numerically, the Cubist model had slightly smaller RMSE values (2.07 for Cubist and 2.14 for PLS). However, given the simplicity of the PLS model, this approach may be preferred.

The modeling illustration was based on the knowledge that the experimental unit was the bioreactor rather than the individual day within the bioreactor. As discussed earlier, it is important to have a firm understanding of the experimental unit in order to use an appropriate cross-validation scheme. For these data, we have seen that the daily measurements within a bioreactor are more highly correlated than measurements between bioreactors. Cross-validation based on daily measurements would then likely lead to better apparent hold-out performance. But this performance would be *misleading* since, in this setting, new data would be based on entirely new bioreactors. To illustrate this, repeated cross-validation was also performed on the small-scale data with the days inappropriately treated as the experimental unit. The hold-out performance comparison is provided in Figure 9.14. Across all models and all profile preprocessing steps, the hold-out RMSE values are artificially lower when day is used as the experimental unit than when bioreactor is used as the unit. Ignoring or being unaware that bioreactor was the experimental unit would likely lead one to be overly optimistic about the predictive performance of a model.
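The mechanism behind this optimism can be seen without fitting any model. The sketch below, which uses a hypothetical row layout, seed, and helper names, measures how often an assessment-set row comes from a bioreactor that also contributes rows to the corresponding analysis set under each splitting scheme.

```python
import random

# 15 bioreactors x 14 days, each row identified by (bioreactor, day).
rows = [(reactor, day) for reactor in range(1, 16) for day in range(1, 15)]

def leakage(assessment_rows, analysis_rows):
    """Fraction of assessment rows whose bioreactor also appears in the analysis set."""
    analysis_reactors = {reactor for reactor, _ in analysis_rows}
    shared = [row for row in assessment_rows if row[0] in analysis_reactors]
    return len(shared) / len(assessment_rows)

def mean_leakage(folds):
    """Average leakage over folds, each fold being a list of assessment rows."""
    values = []
    for fold in folds:
        held = set(fold)
        analysis = [row for row in rows if row not in held]
        values.append(leakage(fold, analysis))
    return sum(values) / len(values)

rng = random.Random(7)

# Day as the unit: individual rows are shuffled into 5 folds.
shuffled_rows = rows[:]
rng.shuffle(shuffled_rows)
row_folds = [shuffled_rows[i::5] for i in range(5)]

# Bioreactor as the unit: whole bioreactors are shuffled into 5 folds.
reactors = list(range(1, 16))
rng.shuffle(reactors)
reactor_folds = [set(reactors[i::5]) for i in range(5)]
grouped_folds = [[row for row in rows if row[0] in held] for held in reactor_folds]

row_leak = mean_leakage(row_folds)
grouped_leak = mean_leakage(grouped_folds)
print(f"day as unit:        {row_leak:.2f}")
print(f"bioreactor as unit: {grouped_leak:.2f}")
```

When day is the unit, a model has typically already seen other days from the same bioreactor that it is asked to predict, which is not the situation faced when predicting entirely new bioreactors.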