In a continuation of this series of quick reports, which aim to bring awareness to those who seek to make use of either the New York Times or Johns Hopkins GitHub repositories, I am going to present a more challenging set of data artifacts that you will come across and have to deal with. Again, because the intent here is simply to bring awareness and not to provide a tutorial about modeling, I will not cover how artifacts such as the ones described here generally result in a less than optimal model, and, importantly, one that you may not even know is not particularly good if you are evaluating it against the wrong data! The first case I examined was Hamilton County, Ohio, home county of the National Football League’s (NFL) Cincinnati Bengals, which had a small number of artifacts tied to major U.S. holidays along with a single instance of a day where no cases were recorded followed by a data dump. In this second case, I focus on Wayne County, Michigan, home of the NFL’s Detroit Lions. This particular series has a multitude of artifacts that need to be dealt with, and admittedly dealing with them is a bit more complex than in the initial case.
Three specific types of artifacts are addressed here. The first, Sundays followed by a data dump, is one of the more pernicious artifacts because, if it remains uncorrected, it will embed a seasonality within the series that is illusory. This is followed by a more complex holiday artifact that includes an additional day of no recording prior to several holidays, which leads to a larger data dump than in the instance covered in the first of these reports. Finally, there are some instances of a day-of-the-week artifact where no cases are recorded, again followed by the usual data dump. All three of these artifacts come with additional conditional problems that I cover below, and they certainly make correction a more challenging undertaking. In each instance, to make the artifacts easier to identify, I have colored the associated figures with the 0 data points of interest in red, the data dumps of interest in green, and all other points in blue.
Diving straight into this first artifact, you can see that the problem occurs repeatedly each Sunday starting on September 6, 2020 and has remained this way up until the present day. It is also fairly obvious that each Sunday is followed by a data dump that is roughly twice the magnitude of the day before the Sunday or of the following Tuesday. I do not want to speculate about why, during the course of a pandemic, a particular county decided to take a break from recording, but it is a problem from the perspective of the data scientist/statistician, and so it must be dealt with. Again, as in Case #1, there is no interpolation function in any existing R or Python package that can immediately address this specific artifact in a way that reflects the underlying process generating the data dump, so it requires a programmatic solution.
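For readers who want to flag this pattern themselves, a minimal sketch of the check is below, assuming a pandas Series of daily new-case counts indexed by date; the `cases` name and the 1.5x threshold are my own illustrative choices, not anything prescribed by the NYT or JHU repositories.

```python
import pandas as pd

def flag_sunday_artifacts(cases: pd.Series) -> pd.DataFrame:
    """Flag Sundays with a 0 count that are followed by an unusually large dump."""
    # Sunday is day-of-week 6 in pandas
    sunday_zero = (cases.index.dayofweek == 6) & (cases == 0)
    next_day = cases.shift(-1)   # the count reported the following day (the dump)
    prior_day = cases.shift(1)   # the count reported the day before the Sunday
    return pd.DataFrame({
        "cases": cases,
        "sunday_zero": sunday_zero,
        # a dump roughly twice the surrounding days is the telltale sign;
        # the 1.5x threshold is an illustrative choice, not a fixed rule
        "likely_dump_follows": sunday_zero & (next_day > 1.5 * prior_day),
    })
```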
Here the most sensible approach is to create a conditional statement for all Sundays where there is a 0 count, then divide the recorded case count from the following Monday, forward filling the Sunday with half the cases and leaving the other half with the Monday. It is worth taking a moment to note the relative position of these former artifacts in the overall series after correction, and also how this changed both the 7-day simple moving average and the associated filter (±95% CIs), both of which have less volatility than prior to correction. There is an exception to this solution. Can you find it? If you can, do not fret; I circle back to it towards the end of this report. If you cannot, do not worry; I will identify it for you and explain how I correct for it before the end of the report.
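To make the approach concrete, here is a minimal sketch of that Sunday correction under the same assumptions as above (a hypothetical `cases` Series of daily counts); it is illustrative, not the exact code used to produce the figures.

```python
import pandas as pd

def correct_sunday_dumps(cases: pd.Series) -> pd.Series:
    """Split each Monday dump in half, forward filling the preceding zero Sunday."""
    corrected = cases.astype(float).copy()
    for date in corrected.index:
        monday = date + pd.Timedelta(days=1)
        if (date.dayofweek == 6              # a Sunday...
                and corrected[date] == 0     # ...with nothing recorded...
                and monday in corrected.index):
            half = corrected[monday] / 2     # ...so split the following dump
            corrected[date] = half
            corrected[monday] = half
    return corrected
    # note: the single two-day exception discussed later in the report
    # needs to be handled separately
```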
The next artifact that must be addressed here is admittedly a bit more of a challenge. Why? Well, if you take a moment to examine this series you will see a new artifact pattern that we have not yet come across. This pattern is one in which consecutive days, the day before a holiday and the holiday itself, have 0 recorded cases, when the counts prior to these days suggest this simply cannot be the case. Importantly, you will note that for the first time in this series of short reports we have an unequal number of red to green points. The reason is that we have instances where the data dump is a function of multiple days. More specifically, have a look at December 24, 25, and 26, 2020, as well as December 31, 2020, January 1, and January 2, 2021. You may notice if you examine the figure closely (they are interactive for this precise purpose) that rather than a data dump that is approximately twice the volume of the previously recorded day, as has been the case up until now, we have instances where the volume is roughly three times the volume of the previously recorded day.
Again, we have no choice here but to address this artifact programmatically, but in this instance it is more complex. A series of conditional statements related not just to holidays, but to the day prior to holidays and the days after holidays, is required. We must identify all instances where a holiday is preceded by another day with 0 recorded cases and then divide the cases recorded for the data dump into thirds, forward filling each third into the correct position in the series and leaving the remaining third with the day on which the data dump occurred. Of course, this must be done bearing in mind that for other holidays, for example Thanksgiving in this series, we do not see this same pattern. This other holiday pattern must be addressed using the same interpolation strategy as before: dividing the day after each holiday, the data dump day, in half, forward filling, and leaving the remaining half with the data dump day. The corrected series now has far less volatility around the holiday period starting around Thanksgiving and finishing after New Year’s.
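A sketch of these holiday corrections, under the same hypothetical `cases` Series, might look like the following; the specific holiday dates would be supplied by whoever applies it, and the function simply encodes the thirds-versus-halves logic described above.

```python
import pandas as pd

def correct_holiday_dumps(cases: pd.Series, holidays) -> pd.Series:
    """Apply the thirds / halves holiday corrections described above."""
    corrected = cases.astype(float).copy()
    one_day = pd.Timedelta(days=1)
    for holiday in pd.to_datetime(holidays):
        day_before, dump_day = holiday - one_day, holiday + one_day
        if not all(d in corrected.index for d in (day_before, holiday, dump_day)):
            continue
        if corrected[holiday] != 0:
            continue                              # nothing to correct
        if corrected[day_before] == 0:
            # two consecutive zero days (e.g. Dec 24 and 25): split the dump into thirds
            third = corrected[dump_day] / 3
            corrected[day_before] = third
            corrected[holiday] = third
            corrected[dump_day] = third
        else:
            # a single zero day (e.g. Thanksgiving here): split the dump in half
            half = corrected[dump_day] / 2
            corrected[holiday] = half
            corrected[dump_day] = half
    return corrected

# illustrative usage for the holidays visible in this series:
# corrected = correct_holiday_dumps(cases, ["2020-11-26", "2020-12-25", "2021-01-01"])
```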
The last artifact is difficult to explain or understand, as it is spread across seemingly random days of the week with no real pattern. Perhaps this is a function of the person responsible for recording for Wayne County, Michigan being sick or off from work. Regardless of why the artifact is nested within the data, it must be addressed and corrected for. There are three of these instances in this series: one occurring on a Wednesday, another on a Friday, and one on a Saturday. Have a look!
Again, there is no choice here other than to address these artifacts programmatically, but unfortunately one of them is particularly problematic. More specifically, the first two weekday instances occur on a Wednesday and a Friday and are straightforward; the Saturday, not so much. Remember, at the outset I mentioned that there was one set of days that could not actually be addressed with the interpolation approach adopted to correct for Sundays. Well, here is the exception. If you have a moment, return to the very first figure in this report and have a look at Monday, October 18 (it is green). You will likely note that it appears to have a greater magnitude relative to the other green points, which of course are positioned relative to their own respective red points. This is because, you guessed it, this particular Monday experienced a two-day data dump. That is, both Saturday, October 16 and Sunday, October 17 have 0 recorded instances, which means that the same corrective strategy employed for the two-day holiday artifact must be adopted here. In other words, the value for Monday, October 18 must be split into thirds and forward filled for both the Saturday and the Sunday, leaving a third remaining for that Monday, the data dump day. Have a closer look at the corrected values for October 16, 17, and 18 now, as well as the other corrected days in the series. Done.
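One way to handle this exception without special-casing every date is to generalize the split: find each run of consecutive 0 days, treat the first non-zero day that follows as the data dump, and spread the dump evenly across the run and the dump day (halves for one zero day, thirds for two, and so on). The sketch below assumes the same hypothetical `cases` Series and, unlike the corrections above, is not gated on weekdays or holidays, so in practice it would be combined with the conditional checks already described.

```python
import pandas as pd

def spread_dumps_over_zero_runs(cases: pd.Series) -> pd.Series:
    """Spread each dump evenly across the preceding run of zero days and itself."""
    corrected = cases.astype(float).copy()
    i = 0
    while i < len(corrected):
        if corrected.iloc[i] == 0:
            start = i
            while i < len(corrected) and corrected.iloc[i] == 0:
                i += 1                          # walk to the end of the zero run
            if i < len(corrected):              # the first non-zero day is the dump
                n_days = (i - start) + 1        # zero days plus the dump day
                share = corrected.iloc[i] / n_days
                corrected.iloc[start:i + 1] = share
        else:
            i += 1
    return corrected
```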
In the previous case report I closed by highlighting, in a very general way, what this meant with respect to modeling, cross-validation, and hyperparameter tuning, and provided a visualization of the difference between the uncorrected and corrected actuals. Here I do the same, providing the ground truth that all data scientists and statisticians desire for developing more accurate models to explain a process or predict what might be expected in the future. Below I have provided a visualization of both the uncorrected and corrected series, so readers can see the tremendous volatility in the uncorrected data when evaluated against the corrected one. In addition, and because it is of importance, I have also provided a different look at the uncorrected versus corrected view of these data by plotting the uncorrected filter in red (±95% CIs) against the corrected filter in blue (±95% CIs). To keep the y-axis looking reasonable here, I have simply eliminated the portion of the uncorrected filter that extends well below 0.
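For readers who want to reproduce a comparison along these lines, a minimal plotting sketch is below, assuming `uncorrected` and `corrected` are the hypothetical daily-count Series used in the earlier sketches; it compares 7-day moving averages rather than recreating the exact filter (±95% CIs) shown in the figures.

```python
import matplotlib.pyplot as plt

# `uncorrected` and `corrected` are the hypothetical daily-count Series from above
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(uncorrected.index, uncorrected.rolling(7).mean(),
        color="red", label="Uncorrected (7-day MA)")
ax.plot(corrected.index, corrected.rolling(7).mean(),
        color="blue", label="Corrected (7-day MA)")
ax.set_ylim(bottom=0)              # keep the y-axis from dipping below 0
ax.set_ylabel("Daily reported cases")
ax.set_title("Wayne County, MI: uncorrected vs. corrected daily cases")
ax.legend()
plt.show()
```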
The variation you will find in both of these final plots is compelling evidence against the use of uncorrected COVID-19 incidence and mortality data in any model you may have already developed, are currently developing, or are considering developing moving forward. Unfortunately, there are simply too many artifacts nested within these data that require attention. In the third installment of this series of reports I present some new, and even more challenging, artifacts that require these conditional interpolation algorithms to be tuned further. Hopefully I will find the time over the next few days to share that with you.