# Chapter 4 Methods

As highlighted in the previous chapter, 315,016 unique physicians received payments from pharmaceutical companies across the 2015-2020 study period. Each study year contains over 1 million data points, each representing a unique payment. The first few rows of the Open Payments dataset are shown below:

    # A tibble: 6 × 98
      clean_zip Covered_Recipient_Type     Physician_Profi… Physician_First… Physician_Last_… Recipient_Prima… Recipient_Prima…
      <chr>     <chr>                                 <dbl> <chr>            <chr>            <chr>            <chr>
    1 00602     Covered Recipient Physici…           519418 Alberto          Ramos Mendez     137 Calle Colon  --
    2 00602     Covered Recipient Physici…           479421 GUILLERMO        ALVAREZ          CALLE COLON # 6  SUITE 1
    3 00602     Covered Recipient Physici…           456660 NAVID            POUR-AHMADI      416 ROAD KM 8.7  <NA>
    4 00602     Covered Recipient Physici…           479421 GUILLERMO        ALVAREZ          CALLE COLON # 6  SUITE 1
    5 00602     Covered Recipient Physici…           479421 GUILLERMO        ALVAREZ          CALLE COLON # 6  SUITE 1
    6 00602     Covered Recipient Physici…          6247325 KEVIN            CHAPARRO         PR 4416 KM 52 B… <NA>
    # … with 91 more variables: Recipient_City <chr>, Recipient_State <chr>, Recipient_Zip_Code <chr>,
    #   Recipient_Country <chr>, Recipient_Province <lgl>, Recipient_Postal_Code <lgl>, Physician_Primary_Type <chr>,
    #   Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name <chr>,
    #   Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID <dbl>,
    #   Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name <chr>, …

Merging the Open Payments data with the ZIP code data allows for direct analysis of how payments to primary care physicians relate to payment characteristics, geospatial patterns, and temporal trends.

## 4.1 Payment Characteristic

As noted previously, the general payments category is subdivided into categories describing the nature of the payment. This study asks which payment type has the highest or lowest aggregate amount and average payment. This analysis provides context for the manner in which these transactions take place.

Further, each payment lists the ZIP code at which the receiving physician practices. The zip_code_db dataset contains information on the population density of each ZIP code. Population density is a better reflection of the population served by these physicians than raw population, as ZIP codes are drawn to represent relatively equal raw population values.

For each year, population density is split into 5 categories by quintile (cut points at every 20th percentile): extremely sparse, sparse, normal, dense, and extremely dense. Transforming population density into these categories allows payments to areas of different density to be compared within each year, and changes to be compared across the study period. Here, again, this study asks which population density category has the highest or lowest aggregate amount and average payment. One-way ANOVA tests are then performed on each study year to investigate differences between the density categories.
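The quintile split can be sketched as follows. This is a minimal Python illustration (the study itself was carried out in R); the function name, the example densities, and the choice to place a value equal to a cut point in the lower category are all assumptions made for the example, not details taken from the study.

```python
from bisect import bisect_left
from statistics import quantiles

LABELS = ["extremely sparse", "sparse", "normal", "dense", "extremely dense"]

def density_categories(densities):
    """Assign each density value to a quintile-based category for one year."""
    # Four cut points at the 20th, 40th, 60th and 80th percentiles.
    cuts = quantiles(densities, n=5, method="inclusive")
    # bisect_left places a value equal to a cut point in the lower category.
    return [LABELS[bisect_left(cuts, d)] for d in densities]

# Hypothetical densities (people per square mile) for ten ZIP codes.
example = [12, 45, 80, 150, 400, 900, 2100, 5200, 11000, 27000]
categories = density_categories(example)
```

Because the cut points are recomputed from each year's data, a ZIP code can change category between years even if its own density is unchanged.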

## 4.2 Geospatial

The identification of payment data by ZIP code allows for both granular and high-level analysis of geospatial trends. This study identifies which ZIP codes have the highest aggregate payment amount for each year of the study period, and then compares the number of payments, number of unique physicians, average payment amount, and average amount per physician. The same analysis is repeated at the state level. Both levels of data are visualized using the sf package [citation 2].
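A minimal Python sketch of the ZIP-level summary is given here (the study itself used R). The record layout of (ZIP code, physician ID, amount), the function name, and the example values are hypothetical; the same grouping applies at the state level by swapping the key.

```python
from collections import defaultdict

def summarize_by_zip(payments):
    """Summarize (zip_code, physician_id, amount) records per ZIP code."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    physicians = defaultdict(set)
    for zip_code, physician_id, amount in payments:
        totals[zip_code] += amount
        counts[zip_code] += 1
        physicians[zip_code].add(physician_id)
    summary = {
        z: {
            "total": totals[z],
            "n_payments": counts[z],
            "n_physicians": len(physicians[z]),
            "avg_payment": totals[z] / counts[z],
            "avg_per_physician": totals[z] / len(physicians[z]),
        }
        for z in totals
    }
    # Rank ZIP codes from highest to lowest aggregate amount.
    ranked = sorted(summary.items(), key=lambda kv: kv[1]["total"], reverse=True)
    return dict(ranked)

# Hypothetical records for one study year.
records = [
    ("00602", "A", 20.0),
    ("00602", "A", 10.0),
    ("00602", "B", 30.0),
    ("10001", "C", 15.0),
]
zip_summary = summarize_by_zip(records)
```

Note that the average payment and the average amount per physician differ whenever a physician receives more than one payment, which is why both are reported.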

However, the state-level analysis still misses out on potential higher-level trends between different areas of the country. Thus, states are further grouped into 4 regions of the country: Northeast, Midwest, South, and West. The breakdown of the states into these regions comes from the official US Census Bureau regions, and they are [citation 1]:

• Northeast: Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont, New Jersey, New York, Pennsylvania
• Midwest: Indiana, Illinois, Michigan, Ohio, Wisconsin, Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, South Dakota
• South: Delaware, DC, Florida, Georgia, Maryland, North Carolina, South Carolina, Virginia, West Virginia, Alabama, Kentucky, Mississippi, Tennessee, Arkansas, Louisiana, Oklahoma, Texas, Puerto Rico
• West: Arizona, Colorado, Idaho, New Mexico, Montana, Utah, Nevada, Wyoming, Alaska, California, Hawaii, Oregon, Washington

As with the categories for population density, grouping states into these regions allows payments to different areas of the country to be compared within each year, and changes to be compared across the study period. These regions are compared on the basis of yearly aggregate payments and average payment value. One-way ANOVA tests are performed on each year to generate p-values for evaluating significant differences between regions.
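The ANOVA comparison across groups, used for regions here and likewise for density categories and quarters, rests on the ratio of between-group to within-group variation. The following Python sketch computes that F statistic by hand to show what the test measures (the study itself used R). The function name and the payment values are hypothetical, and the p-value step, which takes the upper tail of the F distribution at these degrees of freedom, is left out.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of group name -> values."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = fmean(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = fmean(values)
        # Between-group: group size times squared offset from the grand mean.
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        # Within-group: spread of each value around its own group mean.
        ss_within += sum((x - group_mean) ** 2 for x in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical payment amounts for one study year, grouped by region.
by_region = {
    "Northeast": [10.0, 12.0, 14.0],
    "Midwest": [20.0, 22.0, 24.0],
    "South": [30.0, 32.0, 34.0],
    "West": [40.0, 42.0, 44.0],
}
f_stat, df_between, df_within = one_way_anova_f(by_region)
```

A large F statistic means the region means differ by more than the spread within regions would suggest; with four regions the between-group degrees of freedom are always three.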

## 4.3 Temporal

The temporal analysis is twofold. First, each of the above trends is observed for each year of the study period and compared to the others. In this way, this study analyzes how payment characteristics and geospatial patterns change from 2015 to 2020.

Second, further analysis is performed on the distribution of payments during the year. Each payment observation has a date reflecting when the transaction occurred between the physician and the pharmaceutical industry entity. To avoid the sample size challenges associated with comparing individual days of the year, dates are grouped into quarters of the year. The dates for each quarter are:

• Q1: 1 January - 31 March
• Q2: 1 April - 30 June
• Q3: 1 July - 30 September
• Q4: 1 October - 31 December

By comparing across quarters of the year, this study looks at high-level trends in the distribution of payments throughout the year. Quarters are compared on the basis of aggregate payments for each year of the study and average payment value. One-way ANOVA tests are performed on each year to generate p-values for evaluating significant differences between quarters within each year of the study.
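Grouping payment dates into the quarters listed above can be sketched in Python as follows (the study itself used R). The function names and the example payments are hypothetical.

```python
from collections import defaultdict
from datetime import date

def quarter_of(d):
    """Map a date to its calendar quarter, 1 through 4."""
    return (d.month - 1) // 3 + 1

def summarize_by_quarter(payments):
    """Return {(year, quarter): (aggregate, average)} for (date, amount) pairs."""
    buckets = defaultdict(list)
    for d, amount in payments:
        buckets[(d.year, quarter_of(d))].append(amount)
    return {key: (sum(v), sum(v) / len(v)) for key, v in sorted(buckets.items())}

# Hypothetical payments around the quarter boundaries.
payments = [
    (date(2015, 3, 31), 10.0),
    (date(2015, 4, 1), 20.0),
    (date(2015, 6, 30), 40.0),
    (date(2020, 12, 31), 5.0),
]
quarterly = summarize_by_quarter(payments)
```

Keying on both year and quarter keeps the within-year comparison separate for each study year, as the analysis requires.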

## 4.4 Models

Further analysis of the payment characteristic, geospatial, and temporal relationships is performed via multiple linear regression models on the payments in each year of the study period, using the lm() function from the R stats package. Each model predicts payments (inflation-adjusted to 2020 dollars) from the receiving physician’s specialty, region, ZIP code density category, the nature of the payment, and the quarter of the year in which the payment was made.

The residuals for each of these models demonstrated non-linearity due to significant right skew in the response, so the payment values were log-transformed. The final models used for interpretation therefore predict log payment values for a given year from the receiving physician’s specialty, region, ZIP code density category, the nature of the payment, and the quarter of the year in which the payment was made.
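To illustrate how a log-transformed response with categorical predictors is fitted and read, the following Python sketch solves the least-squares problem directly. It is not the study's lm() call: it uses only two of the five predictors, treatment-codes each factor against its alphabetically first level, assumes a full-rank design matrix, and runs on hypothetical data constructed so that the answer is known in advance.

```python
import math

def dummy_code(rows, factors):
    """Build a design matrix with an intercept and treatment-coded dummies."""
    columns = ["(Intercept)"]
    levels = {}
    for f in factors:
        # The first level in sorted order acts as the reference category.
        levels[f] = sorted({r[f] for r in rows})
        columns += [f"{f}={lvl}" for lvl in levels[f][1:]]
    X = [[1.0] + [1.0 if r[f] == lvl else 0.0
                  for f in factors for lvl in levels[f][1:]]
         for r in rows]
    return columns, X

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]

# Hypothetical payments: South doubles and Q2 triples a 10-dollar baseline.
rows = [
    {"region": "Midwest", "quarter": "Q1", "payment": 10.0},
    {"region": "Midwest", "quarter": "Q2", "payment": 30.0},
    {"region": "South", "quarter": "Q1", "payment": 20.0},
    {"region": "South", "quarter": "Q2", "payment": 60.0},
]
columns, X = dummy_code(rows, ["region", "quarter"])
log_y = [math.log(r["payment"]) for r in rows]
coefs = ols(X, log_y)
# Exponentiating a coefficient gives a multiplicative effect on payment.
effects = {name: math.exp(b) for name, b in zip(columns, coefs)}
```

Because the response is on the log scale, each exponentiated coefficient is a ratio relative to the reference category rather than a dollar difference; in this constructed example the ratios for South and Q2 come out at about 2 and 3.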

Moreover, this study performed a temporal analysis of how payments changed each quarter during the study period of 2015-2020. For each quarter, payments were aggregated to reflect the average payment amount transferred to primary care physicians during that time period. The payment-quarter pairs were then turned into a time-series object for modeling purposes.

This study used the Holt-Winters method to model this time-series object. The Holt-Winters method accounts for three features of the data:

• A typical value - the average payment
• A slope - how these payment values change over the study period
• A cyclical pattern - the relationship between quarter and the overall trend

Because this data has an established cyclical pattern, the repeating quarters of the year, the Holt-Winters method is well suited to modeling it. To evaluate the accuracy of these models, this study uses the Mean Absolute Percentage Error (MAPE). A common error metric for evaluating time-series models, the MAPE is the mean of the absolute differences between the actual and fitted values, each expressed as a percentage of the actual value.
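The three Holt-Winters components and the MAPE can be sketched in Python as follows (the study itself used R). This is an illustration under stated assumptions rather than the fitted model: it assumes additive seasonality, fixes the smoothing parameters by hand instead of estimating them, uses a simple initialisation from the first two years, and runs on a hypothetical series of 24 quarterly averages.

```python
def holt_winters_additive(series, period, alpha, beta, gamma):
    """Return one-step-ahead fitted values for series[period:]."""
    first = series[:period]
    second = series[period:2 * period]
    # Slope: average per-step change between the first two cycles.
    trend = (sum(second) - sum(first)) / period ** 2
    # Level: first-cycle mean, moved forward to the end of that cycle.
    mid = sum(first) / period
    level = mid + trend * (period - 1) / 2
    # Seasonal offsets: first-cycle values with the trend line removed.
    seasonal = [x - (mid + trend * (i - (period - 1) / 2))
                for i, x in enumerate(first)]
    fitted = []
    for t in range(period, len(series)):
        s = seasonal[t % period]
        fitted.append(level + trend + s)
        new_level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (series[t] - new_level) + (1 - gamma) * s
        level = new_level
    return fitted

def mape(actual, fitted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, fitted)) / len(actual)

# Hypothetical quarterly averages for 6 years: linear trend plus a fixed
# quarterly pattern, with no noise.
pattern = [5.0, -3.0, -4.0, 2.0]
series = [100.0 + 2.0 * t + pattern[t % 4] for t in range(24)]
fitted = holt_winters_additive(series, period=4, alpha=0.3, beta=0.1, gamma=0.2)
error = mape(series[4:], fitted)
```

On this noise-free series the fitted values track the data essentially exactly, so the MAPE is close to zero; real payment data would produce a larger error, and the smoothing parameters would normally be chosen to minimize it.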