Chapter 2 Operations and Data Type Abstraction

2.1 Data Operations

The datasets we used came from a variety of sources, so the first focus of our operations was to shape them into data that can be aggregated at the state and national levels. The Medicare Part D data already contained summary data at the national and state levels. Other datasets were provided on a year-by-year basis, so we assembled them into a single dataset and wrote functions to calculate and display results at the national level. We further filtered the data to include only records relevant to our subject matter, opioid prescriptions. We created functions that are called from operations in the application's user interface and used the R library dplyr to filter, select, group, summarise and arrange our dataset, returning only the information related to the options selected by the user in the application, such as state and variable name.
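The filter, group, summarise and arrange steps described above can be illustrated with a minimal sketch. It is written in plain Python rather than R, and it is not the application's actual code: the record layout, the field names (state, year, variable, value) and the function name summarise_selection are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
records = [
    {"state": "OH", "year": 2017, "variable": "opioid_claims", "value": 100},
    {"state": "OH", "year": 2016, "variable": "opioid_claims", "value": 120},
    {"state": "OH", "year": 2016, "variable": "total_claims", "value": 900},
    {"state": "WV", "year": 2016, "variable": "opioid_claims", "value": 80},
]


def summarise_selection(rows, state, variable):
    """Return yearly totals for the state and variable chosen by the user."""
    totals = defaultdict(int)
    for row in rows:
        # Filter step: keep only rows matching the user's selection.
        if row["state"] == state and row["variable"] == variable:
            # Group-by-year and summarise step: accumulate a sum per year.
            totals[row["year"]] += row["value"]
    # Arrange step: return the totals sorted by year, ascending.
    return [{"year": year, "total": total} for year, total in sorted(totals.items())]


result = summarise_selection(records, "OH", "opioid_claims")
```

In this sketch, a user-interface handler would pass the selected state and variable name to summarise_selection and hand the sorted result to the plotting code.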

2.2 Data Type Abstraction

The data we acquired and derived from our data operations was prepared for display as a map visualization that gives a representation at the national level while also allowing users to explore data at the state level and see how different variables affect the state of interest to the researcher. Because our research involves comparing states at the national level, we normalized the results by dividing each state's value by its population for the selected year and multiplying the result by 100,000. The application indicates how the displayed values are represented so that users are aware of their scale.
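The population adjustment amounts to a rate per 100,000 residents. The following Python sketch shows the arithmetic under that reading; the function name rate_per_100k and the example figures are hypothetical and are not taken from the datasets.

```python
def rate_per_100k(count, population):
    """Scale a raw state count to a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing; the result equals (count / population) * 100,000.
    return count * 100_000 / population


# Hypothetical example: 50 events in a state of 2,000,000 residents.
example_rate = rate_per_100k(50, 2_000_000)
```

With these made-up inputs the rate is 2.5 per 100,000, which puts a small state and a large state on the same scale for comparison.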

We used a variety of sources to get information about opioids. The original data that we looked at for the project was related to prescribers: the Medicare Part D Prescriber data for 2016 from the Centers for Medicare and Medicaid Services (CMS). The raw data was a very large file because it listed records for every practice and drug prescribed. We decided to use the summary tables provided by CMS at the state and national levels instead, because they contain aggregated amounts. I cleaned the data by changing the field names to match those in the CMS data dictionary and by including only records where the flags classifying a drug as an Opioid or Long-Acting Opioid were set to ‘Y’. This significantly improved performance when loading the data, because the majority of the records were non-opioid records.
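A minimal Python sketch of this cleaning step follows. The raw and renamed column names, the RENAME mapping and the function name clean_rows are hypothetical stand-ins rather than the real CMS field names, and the sketch assumes a record is kept when either flag is ‘Y’.

```python
# Hypothetical mapping from raw column names to data-dictionary names.
RENAME = {
    "OPIOID_FLG": "opioid_drug_flag",
    "LA_OPIOID_FLG": "long_acting_opioid_drug_flag",
}


def clean_rows(rows, rename=RENAME):
    """Rename fields, then keep rows flagged as opioid or long-acting opioid."""
    cleaned = []
    for row in rows:
        # Rename known columns; leave all other columns unchanged.
        renamed = {rename.get(key, key): value for key, value in row.items()}
        # Assumption: either flag set to 'Y' qualifies the record.
        is_opioid = renamed.get("opioid_drug_flag") == "Y"
        is_long_acting = renamed.get("long_acting_opioid_drug_flag") == "Y"
        if is_opioid or is_long_acting:
            cleaned.append(renamed)
    return cleaned


# Hypothetical rows with placeholder drug names.
raw_rows = [
    {"drug": "drug_a", "OPIOID_FLG": "Y", "LA_OPIOID_FLG": "N"},
    {"drug": "drug_b", "OPIOID_FLG": "N", "LA_OPIOID_FLG": "N"},
    {"drug": "drug_c", "OPIOID_FLG": "Y", "LA_OPIOID_FLG": "Y"},
]
opioid_rows = clean_rows(raw_rows)
```

Dropping the non-opioid rows once, before the application loads the data, is what keeps the loaded table small.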

Because the tool is designed for researchers, we wanted to provide different characteristics related to opioid deaths to give a quantitative view of which groups are most affected by opioid use. Based on the data we found, we chose to use race (Opioid Overdose Deaths by Race/Ethnicity 2017), age group (Opioid Overdose Deaths by Age Group 2018) and type of opioid (Opioid Overdose Deaths by Type of Opioid 2018). Some datasets did not have sufficient data for a particular state or time period, so when the total was known we created an “unknown” variable by subtracting the sum of the other variables from the total. We added it to our visualization to make the user aware of the results that cannot be confirmed and to help them interpret the data correctly.
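The derivation of the “unknown” variable can be sketched in Python as follows. The function name add_unknown, the category labels and the counts are hypothetical. Treating a missing category as None and clamping the remainder at zero are assumptions of this sketch, not documented behavior of the application.

```python
def add_unknown(total, known_counts):
    """Add an 'unknown' category equal to total minus the reported categories."""
    result = dict(known_counts)
    if total is None:
        # Without a known total there is nothing to subtract from.
        return result
    # Sum only the categories that actually have a reported value.
    reported = sum(value for value in known_counts.values() if value is not None)
    # Assumption: clamp at zero in case reported categories exceed the total.
    result["unknown"] = max(total - reported, 0)
    return result


# Hypothetical example: a total of 100 deaths with one category not reported.
by_group = add_unknown(100, {"group_a": 60, "group_b": 25, "group_c": None})
```

Here the 15 deaths not attributed to a reported category appear under "unknown", so the categories shown to the user still account for the known total.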

We used shapefiles for the spatial data, consisting of data at the state level and also data at the county level, for our analysis of prescription rates. The prescription rate data was provided by the Centers for Disease Control and Prevention Opioid Prescribing Rate Maps (U.S. Opioid Prescribing Rate Maps 2017). The data was not combined into one dataset for all the years, so we put the dataset together manually. We also found county-level data that compared 2010 and 2015 and determined whether the prescription rate was increasing, decreasing or stable (Opioid Prescriptions 2010 + 2015 2017).
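The yearly tables were combined by hand, but the same step could be scripted. The Python sketch below shows one way to do it under stated assumptions: each year's table is a list of records, and the function name combine_years, the field names and the values are hypothetical.

```python
def combine_years(tables_by_year):
    """Stack per-year tables into one list, tagging each row with its year."""
    combined = []
    for year in sorted(tables_by_year):
        for row in tables_by_year[year]:
            # Copy the row and add the year so the source table stays traceable.
            combined.append({**row, "year": year})
    return combined


# Hypothetical per-year tables with made-up values.
yearly_tables = {
    2016: [{"state": "OH", "rate": 1.0}],
    2015: [{"state": "OH", "rate": 2.0}, {"state": "WV", "rate": 3.0}],
}
all_years = combine_years(yearly_tables)
```

Carrying the year as an explicit column lets the combined dataset be filtered by the year selected in the application.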

References