Set-up & Descriptive Stats

Loading Packages:

library(readr)      # reading data
library(ggplot2)    # plotting
library(MASS)       # negative binomial models (glm.nb)
library(psych)      # descriptive statistics
library(dplyr)      # data manipulation
library(tidyr)      # reshaping data
library(scales)     # percent formatting for axis labels
library(tidyverse)
library(forcats)    # factor handling
library(readxl)     # importing the Excel file
library(sandwich)   # robust (sandwich) standard errors
library(msm)        # deltamethod() for IRR standard errors
library(AER)        # dispersiontest()

Import dataset.

MAT <- read_excel("~/Dropbox/Research/Basia.MAT.Media/MAT.xlsx")
View(MAT)

Summary Statistics

summary(MAT$source)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   3.000   2.657   4.000   4.000

More Descriptive Statistics

I decided to treat year as a numeric variable rather than a categorical variable. If you would like to compare each year against a base year, I can add an analysis that treats year as a categorical variable.

MAT <- transform(MAT, year = as.numeric(year))  # assign back so the conversion persists

Next, create a categorical variable for source. Codes: NYT = 1, INYT = 2, USA = 3, Fox = 4.

MAT$source <- factor(MAT$source, levels = c(1, 2, 3, 4), labels = c("NYT", "INYT", "USA", "Fox"))

We decided to create a new variable called “ttl” by combining naltr, bup, metho, and nalox. We combined them because each individual outcome had too many 0 values in the dataset.
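
The construction of ttl isn’t shown above; a minimal sketch of how it could be checked, assuming ttl is simply the row-wise sum of the four columns (as the description implies):

# sketch: verify that ttl is the row-wise sum of the four medication columns
ttl_check <- MAT$naltr + MAT$bup + MAT$metho + MAT$nalox
all(ttl_check == MAT$ttl)   # should be TRUE if ttl was built this way
nrow(MAT)                   # number of articles (105, per the text)
sum(MAT$ttl)                # total number of mentions (124, per the text)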

Total number of articles was 105.

sum_ttl = sum(MAT$ttl)

Total number of mentions was 124.

Naltrexone Descriptive Statistics

Naltrexone had 8 mentions, making up about 6% of all mentions (8/124).

sum_naltr = sum(MAT$naltr)
sum_naltr/sum_ttl
## [1] 0.06451613

Five of the 8 were in the NYT, 1 was in USA, and 2 were in Fox.

aggregate(naltr ~ source, data=MAT, sum)
##   source naltr
## 1    NYT     5
## 2   INYT     0
## 3    USA     1
## 4    Fox     2

The first mentions of naltrexone came in 2014 (n = 2), followed by 2 in 2016 and 4 in 2017.

aggregate(naltr ~ year, data=MAT, sum)
##    year naltr
## 1  2005     0
## 2  2006     0
## 3  2007     0
## 4  2008     0
## 5  2010     0
## 6  2011     0
## 7  2012     0
## 8  2013     0
## 9  2014     2
## 10 2015     0
## 11 2016     2
## 12 2017     4
## 13 2018     0

Bup Descriptive Statistics

Bup appeared in 26 articles and accounted for about 21% of total mentions.

sum_bup = sum(MAT$bup)
sum_bup/sum_ttl
## [1] 0.2096774

Of these 26 articles, 10 were in the NYT, 3 were in INYT, 3 were in USA, and 10 were in Fox.

aggregate(bup ~ source, data=MAT, sum)
##   source bup
## 1    NYT  10
## 2   INYT   3
## 3    USA   3
## 4    Fox  10

See the table below for the frequency of articles by year for bup. In 2016 & 2017 we saw a spike in news coverage.

aggregate(bup ~ year, data=MAT, sum)
##    year bup
## 1  2005   1
## 2  2006   0
## 3  2007   0
## 4  2008   1
## 5  2010   0
## 6  2011   0
## 7  2012   0
## 8  2013   2
## 9  2014   2
## 10 2015   1
## 11 2016   8
## 12 2017   9
## 13 2018   2

Metho Descriptive Statistics

Metho appeared in 31 articles and accounted for 25% of total mentions.

sum_metho = sum(MAT$metho)
sum_metho/sum_ttl
## [1] 0.25

Of these 31 articles, 15 were in the NYT, 1 was in INYT, 10 were in USA, and 5 were in Fox.

aggregate(metho ~ source, data=MAT, sum)
##   source metho
## 1    NYT    15
## 2   INYT     1
## 3    USA    10
## 4    Fox     5

See the table below for the frequency of articles by year for metho. We don’t see the same spike in 2016-2017 that we saw with bup; the maximum is actually in 2006 (n = 7).

aggregate(metho ~ year, data=MAT, sum)
##    year metho
## 1  2005     4
## 2  2006     7
## 3  2007     1
## 4  2008     1
## 5  2010     0
## 6  2011     2
## 7  2012     2
## 8  2013     2
## 9  2014     0
## 10 2015     1
## 11 2016     4
## 12 2017     4
## 13 2018     3

Nalox Descriptive Statistics

Nalox appeared in 59 articles and accounted for about 48% of total mentions, making it the most frequently mentioned/discussed of the four medications.

sum_nalox = sum(MAT$nalox)
sum_nalox/sum_ttl
## [1] 0.4758065

Of these 59 articles, 7 were in the NYT, 17 were in INYT, 8 were in USA, and 27 were in Fox.

aggregate(nalox ~ source, data=MAT, sum)
##   source nalox
## 1    NYT     7
## 2   INYT    17
## 3    USA     8
## 4    Fox    27

See the table below for the frequency of articles by year for nalox. The spike in mentions started in 2014 (earlier than for bup), and the trend is mostly upward through 2017; in 2018, however, there was only 1 mention, compared with 16 in 2017. Only 1 mention in 2018 seems odd, so we should go back and check the data to make sure nothing is off: check with LexisNexis that the coverage of these newspapers didn’t change, and rule out some kind of “user error” on our end.

aggregate(nalox ~ year, data=MAT, sum)
##    year nalox
## 1  2005     0
## 2  2006     0
## 3  2007     2
## 4  2008     1
## 5  2010     1
## 6  2011     3
## 7  2012     0
## 8  2013     2
## 9  2014    10
## 10 2015    12
## 11 2016    11
## 12 2017    16
## 13 2018     1

Ttl Descriptive Statistics

The total number of mentions is 124. Below is a frequency table of the number of mentions in each newspaper: NYT has a total of 37 mentions, INYT 21 mentions, USA 22 mentions, and Fox 44 mentions.

aggregate(ttl ~ source, data=MAT, sum)
##   source ttl
## 1    NYT  37
## 2   INYT  21
## 3    USA  22
## 4    Fox  44

It looks like there was a very large jump from 2013 to 2014, and again from 2015 through 2017, with a maximum of 33 mentions in 2017.

year_ttl <- aggregate(ttl ~ year, data=MAT, sum)
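
To confirm the peak year directly from the collapsed yearly totals, a quick check (sketch):

year_ttl[which.max(year_ttl$ttl), ]   # year with the largest total number of mentions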

Number of 0s

The following is the percentage of 0s in each DV

100 * length(which(MAT$bup == 0)) / length(MAT$bup)
## [1] 75.2381
100 * length(which(MAT$naltr == 0)) / length(MAT$naltr)
## [1] 92.38095
100 * length(which(MAT$metho == 0)) / length(MAT$metho)
## [1] 70.47619
100 * length(which(MAT$nalox == 0)) / length(MAT$nalox)
## [1] 43.80952
100 * length(which(MAT$ttl == 0)) / length(MAT$ttl)
## [1] 0.952381

75% of bup values are 0’s, 92% of naltr, 70% of metho, 44% of nalox, and about 1% of ttl.
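
The same percentages can be computed in one pass; a compact alternative (sketch):

# percentage of zeros in each count variable, computed in one call
sapply(MAT[c("naltr", "bup", "metho", "nalox", "ttl")],
       function(x) 100 * mean(x == 0))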

Then I had an idea: the article level doesn’t give us much additional information. Why not collapse by year?

new_MAT <- aggregate(ttl ~ source + year + bup + naltr + metho + nalox, data = MAT, sum)
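
For readability, the same aggregation could also be written with dplyr (loaded above). This is just an equivalent sketch for comparison, not a change to the analysis; new_MAT_check is a throwaway name:

new_MAT_check <- MAT %>%
  group_by(source, year, bup, naltr, metho, nalox) %>%   # same grouping as the aggregate() call
  summarise(ttl = sum(ttl), .groups = "drop") %>%
  as.data.frame()                                        # should match new_MAT up to row order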

Models

Based on the high number of 0’s, here are the possible equations for the regression models:

M1: ttl = year + source
M2: ttl = year
M3: ttl = source

“Poisson regression – Poisson regression is often used for modeling count data. Poisson regression has a number of extensions useful for count models.

Negative binomial regression – Negative binomial regression can be used for over-dispersed count data, that is when the conditional variance exceeds the conditional mean. It can be considered as a generalization of Poisson regression since it has the same mean structure as Poisson regression and it has an extra parameter to model the over-dispersion. If the conditional distribution of the outcome variable is over-dispersed, the confidence intervals for Negative binomial regression are likely to be narrower as compared to those from a Poisson regression.

Zero-inflated regression model – Zero-inflated models attempt to account for excess zeros. In other words, two kinds of zeros are thought to exist in the data, “true zeros” and “excess zeros”. Zero-inflated models estimate two equations simultaneously, one for the count model and one for the excess zeros.” (Source: https://stats.idre.ucla.edu/r/dae/poisson-regression/)

M1: ttl = year + source

f <- ggplot(new_MAT, aes(year, ttl, fill=source))
f + geom_bar(stat="identity")

M2: ttl = year

p <- ggplot(year_ttl, aes(year, ttl))
# A basic scatter plot
p + geom_point(size = 4)

M3: ttl = source

f <- ggplot(new_MAT, aes(source, ttl))
f + geom_bar(stat="identity")

Converting year from a character variable to a numeric variable.

new_MAT$year <- as.numeric(as.character(new_MAT$year))
sapply(new_MAT, class)
##    source      year       bup     naltr     metho     nalox       ttl 
##  "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"

Model 1 (M1: ttl = year + source), fit as a Poisson regression:

m2<- glm(new_MAT$ttl ~ new_MAT$year + new_MAT$source, family=poisson)
summary(m2)
## 
## Call:
## glm(formula = new_MAT$ttl ~ new_MAT$year + new_MAT$source, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6015  -0.7548  -0.2805   0.3558   2.9252  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -117.73459   51.80459  -2.273   0.0230 *
## new_MAT$year          0.05877    0.02571   2.286   0.0223 *
## new_MAT$sourceINYT    0.20398    0.27968   0.729   0.4658  
## new_MAT$sourceUSA    -0.13863    0.27117  -0.511   0.6092  
## new_MAT$sourceFox     0.41145    0.22475   1.831   0.0671 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 62.037  on 58  degrees of freedom
## Residual deviance: 51.093  on 54  degrees of freedom
## AIC: 206.11
## 
## Number of Fisher Scoring iterations: 5

The coefficient for year is 0.06. This means that a one-unit (one-year) increase in year increases the expected log count of ttl by 0.06. It is statistically significant (p = 0.022). Source is not statistically significant. I will convert the coefficients into incidence rate ratios (IRRs), but before we do that, let’s make sure the model is a good fit.

Is the model a good fit?

#chisquare
with(m2, cbind(res.deviance = deviance, df = df.residual,
  p = pchisq(deviance, df.residual, lower.tail=FALSE))) 
##      res.deviance df         p
## [1,]     51.09251 54 0.5872439

Goodness of Fit “We can use the residual deviance to perform a goodness of fit test for the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model where the predicted values are identical to the observed. Therefore, if the residual difference is small enough, the goodness of fit test will not be significant, indicating that the model fits the data. Deviance residuals are approximately normally distributed if the model is specified correctly.” (https://stats.idre.ucla.edu/r/dae/poisson-regression/)

For our model:

- Are the residuals normally distributed? The median deviance residual is -0.2805, so the residuals are slightly skewed.
- Model fit? The goodness-of-fit chi-squared test is not statistically significant (p = 0.59), so we can conclude that the model fits reasonably well. (Do we need to test between treatment groups?)

So what does the model tell us? Let’s convert the coefficients to IRRs, and also make the standard errors more robust. “Cameron and Trivedi (2009) recommended using robust standard errors for the parameter estimates to control for mild violation of the distribution assumption that the variance equals the mean.” (https://stats.idre.ucla.edu/r/dae/poisson-regression/)

Is there overdispersion? If so, we need to use a quasi-poisson or a negative binomial…

“In this case, our residual deviance is 3000 for 397 degrees of freedom. The rule of thumb is that the ratio of deviance to df should be 1, but it is 7.6, indicating severe overdispersion.” (https://biometry.github.io/APES/LectureNotes/2016-JAGS/Overdispersion/OverdispersionJAGS.pdf)

Residual deviance is 51.093 on 54 degrees of freedom, a ratio of about 0.95. The rule of thumb is that this ratio should be near 1, and we are very close, so there is likely no overdispersion.
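
This ratio can be computed directly from the fitted model (sketch):

deviance(m2) / df.residual(m2)   # about 0.95; values well above 1 would suggest overdispersion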

We can test this using some packages:

dispersiontest(m2) #AER package
## 
##  Overdispersion test
## 
## data:  m2
## z = -0.14384, p-value = 0.5572
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   0.966982

The estimated dispersion is 0.97, which is less than 1, and the test is not significant (p = 0.56). Overdispersion is a concern when the dispersion statistic is well above 1, so we don’t need to worry about it here.
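
Since there is no evidence of overdispersion we stay with the Poisson model, but for reference, the negative binomial alternative described earlier could be fit like this (a sketch, not part of the main analysis):

# negative binomial version of M1 (glm.nb from MASS, already loaded);
# with no overdispersion its estimates should be close to the Poisson fit
m_nb <- glm.nb(ttl ~ year + source, data = new_MAT)
summary(m_nb)
# for excess zeros, pscl::zeroinfl() would be the usual alternative (not run here)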

options(scipen = 999) #disabling scientific notations in R
#robust standard errors
cov.msd4 <- vcovHC(m2, type="HC0")
std.err4 <- sqrt(diag(cov.msd4))
r.est4 <- cbind(Estimate = coef(m2),
                "Robust SE" = std.err4,
                "Pr(>|z|)" = 2 * pnorm(abs(coef(m2) / std.err4), lower.tail = FALSE),
                LL = coef(m2) - 1.96 * std.err4,
                UL = coef(m2) + 1.96 * std.err4)

#IRR: delta-method SEs for the exponentiated coefficients (one term per coefficient)
spd <- deltamethod(list(~ exp(x1), ~ exp(x2), ~ exp(x3), ~ exp(x4), ~ exp(x5)),
                   coef(m2), cov.msd4)

## exponentiate old estimates dropping the p values
rexp.est <- exp(r.est4[, -3])
## note: this column shows exp(robust SE); the delta-method SEs for the IRRs are in spd
rexp.est[, "Robust SE"]
##                 (Intercept)                new_MAT$year 
## 55040345650436448256.000000                    1.022814 
##          new_MAT$sourceINYT           new_MAT$sourceUSA 
##                    1.298860                    1.210051 
##           new_MAT$sourceFox 
##                    1.308729
rexp.est
##                                                                        Estimate
## (Intercept)        0.0000000000000000000000000000000000000000000000000007387842
## new_MAT$year       1.0605282397762305635069424170069396495819091796875000000000
## new_MAT$sourceINYT 1.2262690674049470107576098598656244575977325439453125000000
## new_MAT$sourceUSA  0.8705485671699375593846070842118933796882629394531250000000
## new_MAT$sourceFox  1.5090066907772659021702565951272845268249511718750000000000
##                                      Robust SE
## (Intercept)        55040345650436448256.000000
## new_MAT$year                          1.022814
## new_MAT$sourceINYT                    1.298860
## new_MAT$sourceUSA                     1.210051
## new_MAT$sourceFox                     1.308729
##                                                                                                                   LL
## (Intercept)        0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000150239
## new_MAT$year       1.01466116904920911068188615899998694658279418945312500000000000000000000000000000000000000000000
## new_MAT$sourceINYT 0.73451942787196899953272577477036975324153900146484375000000000000000000000000000000000000000000
## new_MAT$sourceUSA  0.59909802256260424879741322001791559159755706787109375000000000000000000000000000000000000000000
## new_MAT$sourceFox  0.89056503992503621702780947089195251464843750000000000000000000000000000000000000000000000000000
##                                       UL
## (Intercept)        0.0000000000003632894
## new_MAT$year       1.1084687003611186640
## new_MAT$sourceINYT 2.0472376476559426273
## new_MAT$sourceUSA  1.2649930049175508451
## new_MAT$sourceFox  2.5569173398073554715

We said that year is the only significant predictor, so let’s interpret just that coefficient. The IRR for year is 1.06: for every one-year increase, the expected number of ttl mentions increases by about 6% (robust 95% CI: roughly 1.5% to 11%).
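
Equivalently, the percent change per year can be read directly off the exponentiated year coefficient (sketch):

100 * (exp(coef(m2)["new_MAT$year"]) - 1)   # about a 6% increase in expected mentions per year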

Do we need an offset?

“Count data often have an exposure variable, which indicates the number of times the event could have happened. This variable should be incorporated into a Poisson model with the use of the offset option.” (https://stats.idre.ucla.edu/r/dae/poisson-regression/)

No, because the analysis above is based on the reformed (collapsed) dataset, which has only n = 59 observations.
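
For reference, if we did have an exposure variable (say, a hypothetical n_articles giving the total number of articles published per source-year, which does not exist in this dataset), it would enter the model as an offset like this (sketch only, not run):

# hypothetical: model mentions per article published, using an offset
# glm(ttl ~ year + source + offset(log(n_articles)), family = poisson, data = new_MAT)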

Now, that is a small dataset. So, what happens if we run the same analysis, but with the original dataset?

Does the test have sufficient power given that it has a small sample size?

n=59

“For Poisson glms, statistical power depends just as much on the size of the counts as on the number of count observations. Larger counts provide more precision, in fact the coefficient of variation of an observation is the inverse of the squareroot expected count size.

There is no minimum sample size, except that the number of observations should be at least as large as the number of coefficients in the linear predictor. The Poission distribution has no dispersion parameter, so there is no need to estimate variances separately to the linear predictor and, hence, no need for any residual degrees of freedom.

Note that each observation should correspond to a unique combination of predictor variables. In a Poisson glm, multiple count observations with the same predictor variables will be summed to make a single observation. The minimum sample size requirement therefore is that the number of unique predictor variable combinations should be at least as large as the number of coefficients. If this condition is not satisfied then the linear predictor will not be estimable."

(Source: https://stats.stackexchange.com/questions/310706/poisson-generalized-linear-model-glm-sample-size)

Original Data

Converting year from a character variable to a numeric variable.

MAT$year <- as.numeric(as.character(MAT$year))

I’m unsure whether we need an offset here; since we aren’t modeling rates or proportions, I would err on the side of saying we don’t need one.

m3<- glm(MAT$ttl ~ MAT$year + MAT$source, family=poisson)
summary(m3)
## 
## Call:
## glm(formula = MAT$ttl ~ MAT$year + MAT$source, family = poisson)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.54034  -0.16371  -0.10540  -0.01714   1.76616  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -55.90361   52.18818  -1.071    0.284
## MAT$year         0.02789    0.02590   1.077    0.282
## MAT$sourceINYT  -0.18024    0.27766  -0.649    0.516
## MAT$sourceUSA    0.01431    0.27120   0.053    0.958
## MAT$sourceFox   -0.19242    0.22314  -0.862    0.389
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 20.116  on 104  degrees of freedom
## Residual deviance: 17.743  on 100  degrees of freedom
## AIC: 246.73
## 
## Number of Fisher Scoring iterations: 4

Year is no longer significant with the original dataset.

CONCLUSION

Stick with the modified (collapsed) dataset.