Chapter 2 Operations and Data Type Abstraction

2.1 Data Operations

The datasets came from a variety of sources, so our first task was to bring them into a form that could be aggregated at the state and national levels. The Medicare Part D data already contained summary data at both levels. Other datasets were provided on a year-by-year basis, so we assembled them into a single dataset and wrote functions to calculate and display results at the national level. We then filtered the data to include only records relevant to our subject, opioid prescriptions. The functions are called from operations in the application's user interface and use the R library dplyr to filter, select, group, summarise and arrange the dataset, returning only the information related to the options the user selects in the application, such as state and variable name.
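The filter, group, summarise and arrange steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the dplyr code used in the application; the field names (`state`, `year`, `variable`, `value`), the function name and the sample values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative only.
records = [
    {"state": "Ohio", "year": 2016, "variable": "claims", "value": 10},
    {"state": "Ohio", "year": 2017, "variable": "claims", "value": 12},
    {"state": "Utah", "year": 2017, "variable": "claims", "value": 4},
    {"state": "Ohio", "year": 2017, "variable": "cost", "value": 99},
]


def summarise_by_year(rows, state, variable):
    """Keep one state and one variable, then total the values per year."""
    totals = defaultdict(int)
    for row in rows:
        # filter: keep only the options selected in the interface
        if row["state"] == state and row["variable"] == variable:
            # group by year and summarise with a sum
            totals[row["year"]] += row["value"]
    # arrange: return (year, total) pairs in ascending year order
    return sorted(totals.items())


result = summarise_by_year(records, "Ohio", "claims")
```

Keeping the selection logic in one function means each control in the interface only has to pass its current values as arguments.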

2.2 Data Type Abstraction

The data we acquired and derived through our data operations was prepared for display as a map visualization that gives a national-level picture while also allowing users to explore individual states and see how different variables affect the state of interest. Because our research involves comparing states with one another, we normalized the results by population, dividing each state's value by the population for the selected year and multiplying the result by 100,000. The application indicates how the values are represented so that users are aware of the scale at which they are displayed.
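The population scaling described above amounts to a rate per 100,000 people. A minimal sketch in Python (not the application's R code) is shown below; the function name is hypothetical, the sample numbers are invented for illustration, and the population is assumed to be the state's own population for the selected year.

```python
def rate_per_100k(count, population):
    """Scale a raw count to a rate per 100,000 people."""
    if population <= 0:
        # Assumption: a missing or zero population is treated as an error
        # rather than silently producing an infinite or undefined rate.
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Illustrative values only: 50 events in a population of 1,000,000.
rate = rate_per_100k(50, 1_000_000)
```

Expressing every state on the same per-100,000 scale is what makes a large state and a small state comparable on one map.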

We used a variety of sources to get information about opioids. The Medicare Part D Prescriber data was a very large file because it listed records for every practice and drug prescribed for the entire country. To improve performance, we decided to use the summary tables provided by CMS at the state and national levels. We cleaned the data by renaming the fields to match the CMS data dictionary and kept only the records whose flags classified the drug as either an Opioid or a Long-Acting Opioid. This significantly improved the load time of the data in R, because the majority of the records were non-opioid records.
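The cleaning step above, renaming fields and keeping only opioid-flagged records, might look like the following sketch. It is written in Python rather than the R used in the application, and the column names, the `Y`/`N` flag values and the sample rows are assumptions made for illustration; the actual CMS files and data dictionary may differ.

```python
import csv
import io

# Hypothetical mapping from raw column names to data-dictionary names.
RENAMES = {
    "drug": "drug_name",
    "op": "opioid_flag",
    "la": "long_acting_opioid_flag",
}

# Stand-in for a CMS summary file; the contents are invented.
raw = io.StringIO(
    "drug,op,la\n"
    "DrugA,Y,N\n"
    "DrugB,N,N\n"
    "DrugC,Y,Y\n"
)


def load_opioid_rows(handle):
    """Rename columns, then keep rows flagged as opioid or long-acting opioid."""
    kept = []
    for row in csv.DictReader(handle):
        # Rename each field, leaving unmapped names unchanged.
        row = {RENAMES.get(key, key): value for key, value in row.items()}
        # Assumption: the flags are encoded as the strings "Y" and "N".
        if row["opioid_flag"] == "Y" or row["long_acting_opioid_flag"] == "Y":
            kept.append(row)
    return kept


opioid_rows = load_opioid_rows(raw)
```

Filtering while reading, instead of loading everything first, is one way to keep non-opioid records out of memory entirely.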

Because the tool is designed for researchers, we wanted to provide different characteristics related to opioid deaths to give a quantitative view of which groups are most affected by opioid use. Based on the available data, we chose race (Opioid Overdose Deaths by Race/Ethnicity 2017), age group (Opioid Overdose Deaths by Age Group 2018) and type of opioid (Opioid Overdose Deaths by Type of Opioid 2018). The race data included variables classified as White Non-Hispanic, Black Non-Hispanic and Hispanic. The age data divided ages into the following groups: 0-24, 25-34, 35-44, 45-54 and 55+. The deaths by type of opioid were divided into four groups: Natural and Semisynthetic (e.g. Oxycodone, Hydrocodone), Synthetic opioids (e.g. Fentanyl, Tramadol), Methadone and Heroin. Some datasets did not have sufficient data for a particular state or time period, so when the total was known we created an “unknown” variable by subtracting the sum of the other fields from the total. We added it to our visualization to make the user aware of results that cannot be confirmed and to help them interpret the data correctly.
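The "unknown" variable described above is the reported total minus the sum of the categories that were reported. The following Python sketch illustrates the arithmetic; it is not the application's R code. The function name and the sample numbers are hypothetical, and representing missing values as `None` and clamping negative results to zero are assumptions made here, not details taken from the source.

```python
def unknown_count(total, known_counts):
    """Return the part of the total not explained by the reported categories."""
    if total is None:
        # Without a known total the unknown share cannot be derived.
        return None
    # Assumption: suppressed or missing categories are represented as None.
    known = sum(count for count in known_counts if count is not None)
    # Assumption: clamp at zero in case the categories exceed the total.
    return max(total - known, 0)


# Illustrative values only: total of 100 with one category missing.
unknown = unknown_count(100, [60, 25, None])
```

Showing this remainder as its own category keeps the displayed groups adding up to the reported total.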

We used shapefiles for the spatial data, at the state level and, for our analysis of prescription rates, at the county level. The prescription rate data was provided by the Centers for Disease Control and Prevention (U.S. Opioid Prescribing Rate Maps 2017). The data was not published as a single dataset covering all years, so we assembled the dataset manually. We also found county-level data that compared 2010 and 2015 and classified the prescription rate in each county as increasing, decreasing or stable (Opioid Prescriptions 2010 + 2015 2017).
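Assembling per-year tables into one dataset, as described above, can be sketched as stacking the tables and tagging each row with its year. This Python example is illustrative only and is not the procedure actually followed, which was manual; the function name, field names, county names and rates are hypothetical.

```python
def combine_years(tables_by_year):
    """Stack per-year tables into one list, adding a 'year' field to each row."""
    combined = []
    for year in sorted(tables_by_year):
        for row in tables_by_year[year]:
            # Copy the row so the original per-year tables are left unmodified.
            combined.append({**row, "year": year})
    return combined


# Hypothetical per-year tables; all values are invented.
tables = {
    2017: [{"county": "A", "rate": 40.0}],
    2016: [{"county": "A", "rate": 45.0}, {"county": "B", "rate": 30.0}],
}
all_years = combine_years(tables)
```

With the year stored as an ordinary field, a single filter can select any year for display on the map.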

References