9.4 Reducing Other Noise

The next step in processing profile data is to reduce extraneous noise. Noise appears in two forms for these data. First, the amplitudes vary greatly from spectrum to spectrum across bioreactors, which is likely due to measurement system variation rather than to the types and amounts of molecules within a sample. What is more indicative of the sample contents are the relative peak amplitudes across spectra. To place the spectra on similar scales, the baseline-adjusted intensity values are standardized within each spectrum such that the overall mean of a spectrum is zero and the standard deviation is 1. The spectroscopy literature calls this transformation the standard normal variate (SNV). This approach puts the spectra on a common scale so that differences between samples reflect relative peak amplitudes rather than overall amplitude. One must be careful, however, when using the mean and standard deviation to standardize data since each of these summary statistics can be affected by one (or a few) extreme, influential values. To prevent a small minority of points from affecting the mean and standard deviation, these statistics can be computed after excluding the most extreme values. This approach is called trimming, and it provides more robust estimates of the center and spread that are typical of the vast majority of the data.
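The standardization described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than the code used for the analysis: the function name `snv`, the `trim` argument, and the choice to trim symmetrically by quantile are all assumptions made for the example.

```python
import numpy as np

def snv(intensities, trim=0.0):
    """Standard normal variate for a single spectrum (hypothetical helper).

    `trim` is the fraction of values dropped from EACH tail before the
    center and spread are estimated; trim=0.0 gives the ordinary SNV.
    """
    x = np.asarray(intensities, dtype=float)
    if trim > 0:
        # Estimate the statistics only from values between the two quantiles.
        lo, hi = np.quantile(x, [trim, 1 - trim])
        core = x[(x >= lo) & (x <= hi)]
    else:
        core = x
    center = core.mean()
    spread = core.std(ddof=1)  # sample standard deviation
    # Every value is rescaled, including those excluded from the estimates.
    return (x - center) / spread

# A toy spectrum with one extreme value.
spectrum = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
plain = snv(spectrum)               # mean and SD pulled toward the outlier
trimmed = snv(spectrum, trim=0.2)   # statistics computed from 2, 3, 4 only
```

With trimming, the extreme value is still standardized, but it no longer influences the estimated center and spread that are applied to the rest of the spectrum.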

Figure 9.6(a) compares the baseline-corrected profiles of the spectra with the lowest and highest variation across all days and small-scale bioreactors. This figure demonstrates that the amplitudes of the profiles can vary greatly. After standardizing (b), the profiles become more directly comparable, which allows more subtle changes in the spectra due to sample content to be identified as signal related to the response.

Figure 9.6: (a) The baseline-corrected intensities for the most and least variable spectra. (b) The spectra after standardizing each to have a mean of 0 and a standard deviation of 1.

A second source of noise is apparent in the intensity measurements for each wavelength within a spectrum. This is visible in the jagged nature of the profile illustrated in either the “Original” or “Corrected” panels of Figure 9.5. This type of noise can be reduced through several different approaches, such as smoothing splines and moving averages. To compute the moving average of size \(k\) at a point \(p\), the \(k-1\) previous values and the current value are averaged together, and this average replaces the current value. The more points that are included in the moving average, the smoother the curve becomes.
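The moving average just described can be sketched as follows. This is an illustrative implementation, not the one used to produce the figures; the function name `trailing_moving_average` is hypothetical, and leaving the first \(k-1\) positions as missing values (because they lack a full window of previous points) is an assumption of this example.

```python
import numpy as np

def trailing_moving_average(values, k):
    """Replace each point with the mean of itself and the k-1 previous points.

    The first k-1 positions do not have a full window and are returned
    as NaN (an assumption made for this sketch).
    """
    x = np.asarray(values, dtype=float)
    if k < 1 or k > x.size:
        raise ValueError("k must be between 1 and the number of values")
    out = np.full(x.shape, np.nan)
    # 'valid' mode returns one average per complete window of k points.
    out[k - 1:] = np.convolve(x, np.ones(k) / k, mode="valid")
    return out

# A short jagged profile smoothed with a window of 3 points.
profile = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = trailing_moving_average(profile, k=3)
```

Because the window only looks backward, the smoothed curve lags the original slightly; a centered window is a common alternative when that shift matters.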

To see the impact of applying a moving average more distinctly, a smaller region of the wavelengths will be examined. Figure 9.7 focuses on wavelengths 950 through 1200 for the first day of the first small-scale bioreactor and displays the standardized spectrum as well as the moving averages of 5, 15, and 50. The standardized spectrum is quite jagged in this region. As the number of wavelengths in the moving average increases, the profile becomes smoother. But what is the best number of wavelengths to choose for the calculation? The trick is to find a value that represents the general structure of the peaks and valleys of the profile without removing them. Here, a moving average of 5 removes most of the jagged nature of the standardized profile but still traces the original profile closely. On the other hand, a value of 50 no longer represents the major peaks and valleys of the original profile. A value of 15 for these data seems to be a reasonable number to retain the overall structure.

It is important to point out that the appropriate number of points to consider for a moving average calculation will be different for each type of data and application. Visual inspection of the impact of different numbers of points in the calculation, like that displayed in Figure 9.7, is a good way to identify an appropriate number for the problem of interest.

Figure 9.7: The moving average of lengths 5, 15, and 50 applied to the first day of the first small-scale bioreactor for wavelengths 950 through 1200.