38 Day 36 (July 28)
38.1 Announcements
- Please fill out teaching evaluations!
- I will email you your final grade before I submit anything
38.2 Data fusion
Data fusion is a term used to describe a statistical model that uses more than one source of data.
- A common situation is when we have response variables (i.e., \(\mathbf{y}\)) from different studies or experiments.
- Examples:
- Age vs. height
- Abundance vs. presence/absence
- Other names include: data integration, data reconciliation, and multi-data source models
Common approaches for modeling data from different sources or of different quality
- Simple pooling
- Don’t destroy/degrade good data!
- Avoid non-invertible transformations of the response! (A short simulated illustration follows this list.)
- Example: Fletcher et al. (2019)
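As a minimal illustration of the point about non-invertible transformations (simulated data, not from Fletcher et al. 2019): binning an exactly measured response destroys information, because the original values cannot be recovered from the bins.

# Simulated example: binning exact percent cover into two categories
set.seed(2357)
y <- round(runif(10, 0, 100))        # "exact" percent cover
z <- ifelse(y > 50, ">50%", "<50%")  # degraded ("cheap") version
rbind(exact = y, cheap = z)          # many exact values map to the same bin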
General model building strategy
- For each data source, pick an appropriate distribution for the response.
- Select mathematical models that will be used to model the expected value (or other parameters such as the variance) of the distributions. The mathematical models may be the same for each distribution or may only share parameters.
- Construct the likelihood function for each data source/statistical model.
- Ideally the data sources are independent and, therefore, the likelihood functions for each data source/statistical model can be multiplied together (see the sketch after this list).
- Proceed with estimation.
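Sketching the last two steps: if \(S\) independent data sources \(\mathbf{y}_1,\ldots,\mathbf{y}_S\) share the parameter vector \(\boldsymbol{\theta}\), the joint likelihood is the product of the source-specific likelihoods,
\[
\mathcal{L}(\boldsymbol{\theta};\mathbf{y}_1,\ldots,\mathbf{y}_S)=\prod_{s=1}^{S}\mathcal{L}_s(\boldsymbol{\theta};\mathbf{y}_s),
\]
and estimation (e.g., maximum likelihood) proceeds using this joint likelihood.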
Example: Konza percent grass cover
- Exact cover data
<- "https://www.dropbox.com/scl/fi/g633b8ebwidskngmjf0vg/konza_grass_exact.csv?rlkey=y2ddh50ucyp92hwfwmi6lr2lj&dl=1" url <- read.csv(url) df.grass.exact head(df.grass.exact)
##   percgrass    elev
## 1        13 420.860
## 2        45 399.306
## 3        72 412.792
## 4         9 422.606
## 5        21 381.152
## 6        37 397.789
plot(df.grass.exact$elev,df.grass.exact$percgrass,xlab="Elevation",ylab="Percent grass")
m1 <- lm(percgrass~elev,df.grass.exact)
summary(m1)
##
## Call:
## lm(formula = percgrass ~ elev, data = df.grass.exact)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -22.745 -10.850  -2.845   7.794  50.374
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.86463   14.04673   0.916    0.360
## elev         0.02325    0.03416   0.681    0.496
##
## Residual standard error: 14.84 on 398 degrees of freedom
## Multiple R-squared:  0.001163, Adjusted R-squared:  -0.001347
## F-statistic: 0.4633 on 1 and 398 DF,  p-value: 0.4965
confint(m1)
##                    2.5 %      97.5 %
## (Intercept) -14.75042337 40.47968843
## elev         -0.04390688  0.09041252
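- Note that, using the exact data alone, the estimated elevation effect is small and not statistically distinguishable from zero (p \(\approx\) 0.50).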
- Cheap cover data
<- "https://www.dropbox.com/scl/fi/gqigo9t23xlhhzuouta9f/konza_grass_cheap.csv?rlkey=jxbqo03fywky9g1lco336d17m&dl=1" url <- read.csv(url) df.grass.cheap head(df.grass.cheap)
##   percgrass    elev
## 1      <50% 430.380
## 2      <50% 430.380
## 3      <50% 430.380
## 4      <50% 430.380
## 5      <50% 425.015
## 6      <50% 425.015
- Model formulation (covered on the whiteboard; a reconstruction is sketched below)
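A sketch of one formulation, reconstructed from the implementation below rather than taken verbatim from the whiteboard: model the exact cover data as
\[
y_i \sim \text{N}(\mathbf{x}_i'\boldsymbol{\beta},\sigma^2)
\]
and the cheap (binned) cover data as
\[
z_j \sim \text{Bernoulli}(p_j), \qquad p_j = \text{P}(\tilde{y}_j > 50) = \Phi\!\left(\frac{\mathbf{x}_j'\boldsymbol{\beta}-50}{\sigma}\right),
\]
where \(z_j = 1\) if cover is recorded as \(>50\%\), \(\tilde{y}_j\) is the unobserved exact cover underlying the \(j\)th cheap observation, \(\Phi\) is the standard normal CDF, and \(\boldsymbol{\beta}\) and \(\sigma^2\) are shared by both data sources.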
- Model implementation
# Summary function for optim output (get maximum likelihood estimates,
# standard errors, and 95% CIs)
summary.opt <- function(est){
  round(data.frame(MLE = est$par,
                   SE = diag(solve(est$hessian))^0.5,
                   lower.CI = est$par - 1.96*diag(solve(est$hessian))^0.5,
                   upper.CI = est$par + 1.96*diag(solve(est$hessian))^0.5), 3)
}

# Prepare data
X.y <- model.matrix(~elev, df.grass.exact)
y <- df.grass.exact$percgrass
X.z <- model.matrix(~elev, df.grass.cheap)
z <- ifelse(df.grass.cheap$percgrass == ">50%", 1, 0)

# Negative log-likelihood function; the variance is optimized on the
# log scale so that it stays positive
nll <- function(par){
  sigma2 <- exp(par[1])
  beta <- par[-1]
  -(sum(dnorm(y, X.y%*%beta, sqrt(sigma2), log=TRUE)) +
    sum(dbinom(z, 1, pnorm(X.z%*%beta, 50, sqrt(sigma2)), log=TRUE)))
}

# Obtain maximum likelihood estimates for beta and sigma2
est <- optim(c(log(200), 0, 0), fn=nll, method="Nelder-Mead", hessian=TRUE)
summary.opt(est)[2:3,]
##      MLE    SE lower.CI upper.CI
## 2 10.450 8.696   -6.595   27.494
## 3  0.031 0.021   -0.010    0.072
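- The [2:3,] subset keeps the rows for the intercept and the elevation slope (\(\beta_0\) and \(\beta_1\)); the first row, the log-variance parameter, is dropped.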
# Compare 95% CI length for beta_1 from exact data only to data fusion
(confint(m1)[2,2] - confint(m1)[2,1])/(summary.opt(est)[3,4] - summary.opt(est)[3,3])
## [1] 1.638041
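- That is, the 95% CI for the elevation coefficient from the exact data alone is about 1.64 times as long as the CI from the data fusion model; incorporating the cheap data shrinks the interval to roughly 61% of its exact-data-only length.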
Summary
- There are multiple ways to collect data on the same underlying phenomenon.
- Some ways of collecting data are “better” than others, but there are many criteria for deciding what counts as best.
- Even if the data are collected using a standardized protocol, it is a (testable) assumption that the error follows the same distribution for each observation.
- The ability to build and combine statistical models for multiple types of data is becoming increasingly valuable.
- Many ad hoc approaches exist (e.g., “correction factors” or “calibration”)
- Easy-to-use software is unlikely because of the large number of possible model combinations
- Properties of such models are only beginning to be studied (e.g., optimal design)