확장모델

추가 군집 구조

Cross-classified models

하나의 군집 요인을 넘어 추가적인 분산 소스가 있는 경우가 종종 생길 것이다. 예를들어, 특정 그림들을 보여주고 각 개인이 여러번 시행하는 시각인지 실험을 생각해보자. 데이터가 다음과 같을 것이다.

이 경우, 관측값이 사람과 이미지 내로 군집이 되지만, 사람과 이미지는 서로 포함관계가 아니고 - 모든 참가자들은 10개 이미지 모두를 보았다. 이러한 경우는 일반적으로 crossed random effect 가 있다라고 하는데, non-nested 라는 의미이다. 다음에서 보게 될 상황에서 여러개의 소스의 분산을 보게 될 것이다.

예: 학생 학업성취도

예시를 위해 학생의 학업성취도 점수를 살펴볼 것이다. 의존성 소스는 같은 초등학교를 가거나, 중학교 진학한 학생에 기인한다. 하지만, 이 예에서, 한 초등학교(primary school)를 간다는 것은 반드시 특정한 중학교(secondary school)를 가는 것은 아니다. 반복측정값이 없고, 각 학생을 딱 한번만 측정했다는 것을 주목하라. 데이터를 빠르게 본 것인데, 더 자세한 내용은 appendix 을 확인하라.

load('data/pupils.RData')

우리의 mixed model 에서, 학업성취도에 대한 sex, 낮음에서 높음까지의 여섯 단계 변수인 사회경제적 지위, ses 효과를 살펴볼 것이다. 성취도 범위는 약 4 에서 10 까지이고, 평균은 6.3, 표준편차는 0.9 이다. 초등학교와 중학교의 클러스터링 고려할 것이다. 지금은 그룹 인자가 두 개¹⁷라는 것만 제외하면, lme4 문법으로 추가 구조를 넣는 것은 앞에서 본 것 처럼 매우 쉽다.

pupils_crossed = lmer(
  achievement ~ sex + ses 
  + (1|primary_school_id) + (1|secondary_school_id),
  data = pupils
)
## 
## summary(pupils_crossed, correlation = FALSE)

term	value	se	lower_2.5	upper_97.5
Intercept	5.924	0.123	5.684	6.164
sexfemale	0.261	0.046	0.171	0.350
ses2	0.132	0.118	-0.098	0.362
ses3	0.098	0.110	-0.118	0.314
ses4	0.298	0.105	0.093	0.503
ses5	0.354	0.101	0.156	0.551
seshighest	0.616	0.110	0.401	0.832

fixed effect 는 학업성취도에 있어 여자인 것이 양의 효과를 가지고, 일반적으로, 가장 낮은 사회경제적지위 범주에 비해 상위 범주가 양의 효과를 가진다는 것을 보여준다.

group	effect	variance	sd	var_prop
primary_school_id	Intercept	0.17	0.42	0.24
secondary_school_id	Intercept	0.07	0.26	0.09
Residual		0.47	0.69	0.66

분산 컴포넌트를 보면, 초등학교와 중학교가 전체 변동의 약 34% 를 기여하고 있음을 알 수 있다. 학교에 기인한 변동의 대부분은 초등학교에서 온다.

랜덤효과를 조사하면, 두가지 세트의 효과를 가진다는 것을 볼 수 있다 - 초등학교에 해당하는 50 개와 중학교에 해당하는 30 개 이다. 두 개다 학생-specific 예측에 포함된다.

glimpse(ranef(pupils_crossed))

List of 2
 $ primary_school_id  :'data.frame':    50 obs. of  1 variable:
  ..$ (Intercept): num [1:50] -0.327 0.183 0.52 0.474 0.253 ...
  ..- attr(*, "postVar")= num [1, 1, 1:50] 0.0247 0.0225 0.0244 0.0215 0.0285 ...
 $ secondary_school_id:'data.frame':    30 obs. of  1 variable:
  ..$ (Intercept): num [1:30] -0.41069 0.08247 -0.00589 -0.06162 0.08481 ...
  ..- attr(*, "postVar")= num [1, 1, 1:30] 0.0155 0.014 0.0159 0.0131 0.0142 ...
 - attr(*, "class")= chr "ranef.mer"

merTools 을 사용하여 시각적으로 살펴보자.

우리는 원하는대로 모델을 확장할 수 있다는 것을 주목하라. 예를 들어, 학생레벨 특징에 대해 랜덤 기울기를 할 수도 있다.

계층구조

cross-classified 모델을 살펴보았고 이제 계층적 클러스터 구조화하는 것으로 나아갈 수 있다. 이 경우, 클러스터들이 다른 클러스터 내부에 포함되고, 다른 클러스터도 다른 클러스터 내부에 포함되는 경우가 있다. 쉬운 예는 행정시 내부의 행정동과, 행정’도’ 내부의 행정’시’들이다.

예: 간호사와 스트레스

실례를 위해 간호사 데이터셋을 사용할 것이다. 교육프로그램 (treatment) 이 간호사의 스트레스 (1-7 점 점수) 에 주는 효과에 관심이 있다. 이 시나리오에서, 간호사는 병동 내에 포함되어있는데, 병동은 병원에 포함되어 있기 때문에 병동과 병원에 해당하는 랜덤효과를 가질 것이다. 더 자세한 내용은 appendix 를 참고하라.

load('data/nurses.RData')

이 모델의 처치 (treatment) 효과와 간호사, 병동, 병원 수준 각각 하나를 가지는 공변량을 측정한다. fixed effect 에 관해서는 다른 표준 회귀와 같이 이론/탐색법이 제안하는대로 공변량을 추가한다. 이러한 랜덤효과를 구현하는 것은 cross-classified 접근법과 크게 다르지 않지만 문법에 약간 다른 점이 있다.

nurses_hierarchical = lmer(
  stress ~ age + sex + experience + treatment + wardtype + hospsize
  + (1 | hospital) + (1 | hospital:ward),
  data = nurses
)
## 
## # same thing!
## nurses_hierarchical = lmer(
##   stress ~ age  + sex + experience + treatment + wardtype + hospsize 
##   + (1|hospital/ward), 
##   data = nurses
## )
## 
## summary(nurses_hierarchical, correlation = F)

term	value	se	lower_2.5	upper_97.5
Intercept	5.380	0.185	5.018	5.742
age	0.022	0.002	0.018	0.026
sexFemale	-0.453	0.035	-0.522	-0.385
experience	-0.062	0.004	-0.070	-0.053
treatmentTraining	-0.700	0.120	-0.935	-0.465
wardtypespecial care	0.051	0.120	-0.184	0.286
hospsizemedium	0.489	0.202	0.094	0.884
hospsizelarge	0.902	0.275	0.363	1.440

fixed 효과에 관해서는, 통계적인 효과가 없는 것은 병동유형(wardtype)이다.¹⁸

group	effect	variance	sd	var_prop
hospital:ward	Intercept	0.337	0.580	0.500
hospital	Intercept	0.119	0.345	0.177
Residual		0.217	0.466	0.323

랜덤효과에 관해서는, 병동 병동마다 변동성이 있고, 병원도 그렇다. 스트레스는 7 점 스케일이기 때문에 병동 병동마다 평균 약 0.5 점 정도 튈것으로 생각되는데, 이는 내 의견에는 꽤 극적이다. 시각적으로 살펴본다.

Crossed vs. nested

다음은 병동을 (병원 내) nested 로 처리하는것 대비 crossed 랜덤효과로 처리하는것의 차이점을 보여준다. 뭐가 다른 점으로 보이는가?

nurses_hierarchical = lmer(
  stress ~ age  + sex + experience + treatment + wardtype + hospsize 
  + (1|hospital) + (1|hospital:wardid), 
  data = nurses
)

nurses_crossed = lmer(
  stress ~ age  + sex + experience + treatment + wardtype + hospsize 
  + (1|hospital) + (1|wardid),
  data = nurses
)

group	effect	variance	sd	var_prop
hospital:ward	Intercept	0.337	0.580	0.500
hospital	Intercept	0.119	0.345	0.177
Residual		0.217	0.466	0.323

group	effect	variance	sd	var_prop
wardid	Intercept	0.337	0.580	0.500
hospital	Intercept	0.119	0.345	0.177
Residual		0.217	0.466	0.323

없다? 좋다, 당신이 이상한 것이 아니다. 다음은 lme4 text, section 2.2.1.1, 의 글인데, 시간을 써서 읽을 가치가 있다.

혼합효과 모델과 다중 계층 수준을 헷갈리는 것은 모델을 정의할 때 수준에 보장되지 않는 강조를 낳고, 상당한 혼란을 야기한다. non-nested 요인과 관련된 랜덤 효과를 갖는 모델을 정의하는 것은 완벽하게 정당하다. nested 요인에 관해서만 랜덤효과를 정의하는 것을 강조하는 이유는 이러한 경우가 실생활에서 자주 일어나고 모델의 파라미터를 추정하는 계산법이 nested 요인에 쉽게 적용할 수 있기 때문이다.

이는 lme4 패키지에서 사용한 방법의 경우는 아니다. 사실 nested 요인에 랜덤효과를 주는 모델은 특별한 것이 없다. 랜덤효과가 다중 요인과 연관될때, 요인들이 nested sequence 를 형성하는지, 부분적으로 crossed 인지, 완벽히 crossed 와 상관없이 같은 계산법이 사용된다.

첫 예에서 그룹변수로 ward 가 아닌 wardid 를 사용했다는 것을 눈치챘는가? 모든 병동이 유일하지만, ward 열은 column labels them with an arbitrary sequence starting with 1 부터 시작하는 임의의 sequence 로 라벨링한다. 이것이 자연스러워 보일 수 있지만, hospital 1 의 ward 1 은 hospital 2 의 ward 1 과 같은 병동이 아니기 때문에 이들에게 같은 라벨을 주는 것은 좋은 방법이 아닐것이다. wardid 열은 적절하게 고유한 값을 사용하여 병동들을 구분한다 (예. 11, 12).

이 변수를 crossed random effect 로 사용한다면 어떤 일이 벌어질까?

nurses_crossed_bad_data = lmer(
  stress ~ age  + sex + experience + treatment + wardtype + hospsize 
  + (1|hospital) + (1|ward), 
  data = nurses
)

group	effect	variance	sd	var_prop
hospital	Intercept	0.196	0.442	0.297
ward	Intercept	0.000	0.000	0.000
Residual		0.463	0.681	0.703

아마 우리가 원하는 결과가 절대 아니다. ward 분산은 이미 treatment 와 type 에 의해 캡쳐됐다. 하지만, 보았듯이, 적절한 문법을 사용하거나, 고유 클러스터를 고유한 식별자를 갖도록 하는, 데이터내에서의 적절한 라벨링을 하면 피할 수 있다.

이 논의 와 lme4 개발자 중 한명이 쓴 FAQ 를 읽으라. CSCAR 의 Josh Errickson 도 관심있는 행렬 시각화에 관한 글을 썻는데, 다음 섹션의 시각화의 동기가 되었다.

자 이제 되었다. lme4, 더 일반적으로 mixed models, crossed vs. nested 는 단순히 (data) 에 관한 생각을 어떻게 하느냐에 달려있다.¹⁹

잔차 구조

잔차 공분산/상관관계 구조와 관련하여 더 구체적인 값을 측정할 필요가 있는 경우가 있다. longitudinal 세팅에서 특별히 그러한데, 이 세팅에서는 관측값이 시간상 가까울 수록 먼 것보다 강한 상관관계가 있거나 분산이 시간에 따라 변한다고 생각한다. 이 모델은 어떻게 생겼을까?

우선, 우리 타겟 변수의 전체 관측값의 분산/상관 행렬과, 이러한 관측값에서 dependency 를 어떻게 표현하는 것이 좋을지를 생각해보자. 우리 GPA 데이터와 모델링 상황에 대해 첫 5 명의 시각화를 볼 것이다.

각 사람은 6개의 관측값이 있다는 것을 기억하라. GPA 에 관한 우리의 랜덤 절편 (only) 모델의 결과를 보여준다.

각 블락은 한 개인 내 관측값들에 해당하는 공분산 행렬을 나타낸다. 해당 사람 내에서 대각선에 분산들이 있고 대각선바깥에는 공분산들이 있다. 모든 데이터를 고려할 때, 한 사람의 관측값들은 다른 사람과 공분산이 없다는 것을 알 수 있다 (회색). 더 나아가서, 한 사람 내 공분산은 상수값이고, 분산도 또한 상수 값이다. 이 값들은 어디서 왔을까?

group	effect	variance	sd	var_prop
student	Intercept	0.064	0.252	0.523
Residual		0.058	0.241	0.477

이 모델에서 두개의 분산 소스가 있고, 잔차 관측값 수준 분산과 사람에 관련된 것이라는 것을 기억하라. 이 두 분산이 합쳐져서, 우리 공변량으로 설명하지 않은 총 잔차 분산을 제공한다. 우리의 경우, 약 0.12 이며 우리 대각선에 표현된 값이다. 대각이 아닌 곳은 학생에 관한 분산인데, 다른 시각으로는 class 내 상관관계로 해석할 수 있다. (총 분산으로 나누는 것은 이를 상관계수 지표로 바꿀 수 있다.)

더 일반적으로, 또 우리 추정 분산의 이전 개념을 참고하여, (한 클러스터에 대한) 공분산 행렬을 다음과 같이 볼 수 있다.

\[\Sigma = \left[ \begin{array}{ccc} \color{orange}{\sigma^2 + \tau^2} & \tau^2 & \tau^2 & \tau^2 & \tau^2 & \tau^2 \\ \tau^2 & \color{orange}{\sigma^2 + \tau^2} & \tau^2 & \tau^2 & \tau^2 & \tau^2 \\ \tau^2 & \tau^2 & \color{orange}{\sigma^2 + \tau^2} & \tau^2 & \tau^2 & \tau^2 \\ \tau^2 & \tau^2 & \tau^2 & \color{orange}{\sigma^2 + \tau^2} & \tau^2 & \tau^2\\ \tau^2 & \tau^2 & \tau^2 & \tau^2 & \color{orange}{\sigma^2 + \tau^2} & \tau^2 \\ \tau^2 & \tau^2 & \tau^2 & \tau^2 & \tau^2 & \color{orange}{\sigma^2 + \tau^2} \\ \end{array}\right]\]

compound symmetry 공분산 구조를 나타낸다. 대부분의 mixed model 세팅에서 기본값인데, 앞에서 시각적으로 본 것과 같은 것이다. 이제 다른 종류의 공분산 구조들도 생각해보자.

간단하게 살펴보기 위해 개인에 대한 다음의 모형과 세개의 시점을 살펴보자.

\[\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\]

정규 분포를 따르는 \(y\) 세 관측값이 있다. 평균 \(\mu\) 는 표준 회귀에서처럼 공변량들의 함수이다.

\[\mu = b_0 + b_1\cdot \mathrm{time} + b_2\cdot x_1 ...\] 마지막에 있는 \(\epsilon\) 을 플롯하는 대신, 세 점 모두에 대해 전체 잔차 분산/공분산 구조를 정의하고 싶다.

표준 선형 회귀 모델 중 가장 간단한 세팅에서는 상수 분산과 공분산이 없다.

\[\Sigma = \left[ \begin{array}{ccc} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \\ \end{array}\right]\]

다음으로, 동일분산 가정을 풀고, 각 분산을 분리해서 추정할 수 있다. 이와 같은 이종분산 경우에 예를 들어 시간에 따라 분산이 오르락 내리락 할 수 있다.

\[\Sigma = \left[ \begin{array}{ccc} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \\ \end{array}\right]\]

우리가 사실 공분산/correlation 을 추정하고 싶다고 해보자. correlation 표현으로 바꿀 것이지만, 이 경우에도 분산이 상수거나 따로 추정된다고 가정해도 된다. 다음과 같은 것을 얻는데, \(\rho\) 는 관측값 사이의 잔차 correlation 을 나타낸다.

\[\Sigma = \sigma^2 \left[ \begin{array}{ccc} 1 & \rho_1 & \rho_2 \\ \rho_1 & 1 & \rho_3 \\ \rho_2 & \rho_3 & 1 \\ \end{array}\right]\]

이 경우 모든 시간 포인트 쌍에 대해 (상수 분산과) 다른 correlation 을 추정하게 된다. 이를 unstructured, 혹은 단순히 ‘symmetric’ correlation 구조라고 부른다.

mixed model 의 특수형태인 반복수행 ANOVA를 알고 있다면, 흔한 가정은 sphericity, 혹은 compound symmetry의 느슨한 형태, 즉 모든 correlation 이 같은 값을 가지고 (\(\rho_1=\rho_2=\rho_3\)), 모든 분산이 같다는 것이 생각날 것이다.

자주 사용되는 다른 correlation 구조(시간 기준 세팅에서)는 lag order 1 인 autocorrelation 잔차 구조이다. 한 시점 떨어진 잔차가 어떤 값(\(\rho\))으로 correlate 되어 있고, 두 시점 떨어진 관측값은 \(\rho^2\) correlate 되어 있고, 이런 식이다. 그렇기 때문에 우리는 \(\rho\) 만 추정하면 되고 나머지는 자동으로 결정된다. 네개의 시점에는 다음과 같이 된다.

\[\Sigma = \sigma^2 \left[ \begin{array}{cccc} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \\ \end{array}\right]\]

\(\rho\) 가 .5 로 추정이 된다면, 다음과 같이 될 것이다.

\[\Sigma = \sigma^2 \left[ \begin{array}{cccc} 1 & .5 & .25 & .06 \\ .5 & 1 & .5 & .25 \\ .25 & .5 & 1 & .5 \\ .06 & .25 & .5 & 1 \\ \end{array}\right]\]

제일 중요한 점은, 시점상 더 멀리 떨어져 있는 점들은 덜 correlate 되어 있다는 것을 가정한다는 것이다.

잠재적으로 고려해야하는 많은 패턴과 가능성들이 있다는 것과, 이것들이 반복 측정 시나리오에만 국한되지 않다는 점을 알아야 한다. 예를 들어, 지리적으로 가까운 유닛이 더 correlation 되는 공간 구조를 나타낼 수도 있다. 언급했듯이 각 시점마다 다른 분산을 갖게 할 수도 있다²⁰. 이 예부터 살펴보자.

이종분산

다른 많은 mixed model 패키지들은 잔차 공분산 구조를 모델링해 주지만²¹, lme4는 적어도 직관적인 방법은 아니다. 사실 기본 R 설치하면 함께 오는 두 개의 패키지(mgcv, nlme)는 공분산구조 모델링을 해 준다.

nlme 패키지는 아주 다르지는 않지만 다른 랜덤효과 specification 을 할 것이다. 또한, 이종분산을 추정하기 위해 weights 인수를 추가하여 사용할 것이다. 아래는 각 학기의 시점에서 고유 추정값을 갖도록 한다.

library(nlme)

heterovar_res = lme(
  gpa ~ occasion,
  data = gpa,
  random = ~ 1 | student,
  weights = varIdent(form = ~ 1 | occasion)
)

## summary(heterovar_res)

term	value	se	z	p_value	lower_2.5	upper_97.5
Intercept	2.599	0.026	99.002	0	2.547	2.650
occasion	0.106	0.004	26.317	0	0.098	0.114

group	effect	variance	sd	var_prop
student	Intercept	0.094	0.306	0.404
Residual		0.138	0.372	0.596

결과가 이전과 같게 나왔다. 별로 재미없는 부분이다. 이 예에서 우리가 관심있는 값, 즉 학기마다의 분산에 대해, nlme 는 출력값이 추정을 위한 형태이지, 보고를 위한 형태가 아니므로, 처음에 이해하는데 쉽지 않게 만든다. 분산들은 첫번째 분산 추정값인 랜덤효과 부분의 보고된 잔차분산에 상대하여 스케일된다. 또한 이 값들은 분산 스케일이 아닌 표준편차스케일이다. 기본 출력디스플레이에서 이 경우 시간에 따라 분산이 감소하는 것을 볼 수 있지만, 실제 값은 제공되지 않는다.

summary(heterovar_res$modelStruct)

Random effects:
 Formula: ~1 | student
        (Intercept) Residual
StdDev:   0.8232544        1

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | occasion 
 Parameter estimates:
        0         1         2         3         4         5 
1.0000000 0.8261186 0.6272415 0.4311126 0.3484013 0.4324628

상대 값이라도 나쁘지 않다고 생각하지만, 우리가 원하는 것은 실제 추정값들이다. 잔차 표준편차를 사용하여 이 값들을 스케일 한 뒤, 제곱하여 분산 스케일을 하여 실제추정값을 얻는 방법은 다음과 같다.

(c(1.0000000, coef(heterovar_res$modelStruct$varStruct, unconstrained=F))*heterovar_res$sigma)^2

                    1          2          3          4          5 
0.13815037 0.09428374 0.05435276 0.02567636 0.01676917 0.02583744

야호. 앞으로 같은 작업을 하려면 매번 이 것들을 찾아봐야하지만, 모델을 인풋으로 하는 당신만의 함수를 만들 수 있다. 내 함수가 정보를 추출하는 방법이다.

mixedup::extract_het_var(heterovar_res, scale = 'var')

     X0    X1    X2    X3    X4    X5
1 0.138 0.094 0.054 0.026 0.017 0.026

최근 생긴 glmmTMB 을 사용하는 방법도 눈여겨 볼 만하다. lme4 스타일과 출력을 유지할 수 있게 해 준다.

library(glmmTMB)

heterovar_res2 = glmmTMB(
  gpa ~ occasion + (1|student) + diag(0 + occas |student), 
  data = gpa
)

summary(heterovar_res2)

 Family: gaussian  ( identity )
Formula:          gpa ~ occasion + (1 | student) + diag(0 + occas | student)
Data: gpa

     AIC      BIC   logLik deviance df.resid 
   261.1    312.0   -120.5    241.1     1190 

Random effects:

Conditional model:
 Groups    Name                   Variance Std.Dev. Corr                     
 student   (Intercept)            0.093123 0.30516                           
 student.1 occasyear 1 semester 1 0.129833 0.36032                           
           occasyear 1 semester 2 0.086087 0.29341  0.00                     
           occasyear 2 semester 1 0.046240 0.21503  0.00 0.00                
           occasyear 2 semester 2 0.017615 0.13272  0.00 0.00 0.00           
           occasyear 3 semester 1 0.008709 0.09332  0.00 0.00 0.00 0.00      
           occasyear 3 semester 2 0.017730 0.13316  0.00 0.00 0.00 0.00 0.00 
 Residual                         0.008065 0.08980                           
Number of obs: 1200, groups:  student, 200

Dispersion estimate for gaussian family (sigma^2): 0.00806 

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 2.598788   0.026201   99.19   <2e-16 ***
occasion    0.106141   0.004034   26.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

각 시점에서 디스플레이된 분산들은 잔차 분산과 conflated 되지 않았다는 것을 주목하라. nlme 와 비교하기 위해 이러한 추정값에 잔차분산을 추가하라. 모든 mixed model 패키지와 같이, VarCorr 객체로 부터 분산을 유용한 형태로 얻으려면 꽤 힘들다. 다음은 어떻게 하는지, 또 nlme 의 결과와 비교를 보여준다.

vc_glmmtmb = VarCorr(heterovar_res2)
vc_glmmtmb = attr(vc_glmmtmb$cond$student.1, 'stddev')^2 + sigma(heterovar_res2)^2

내 함수의 출력이다:

mixedup::extract_het_var(heterovar_res2, scale = 'var')

      group occasyear.1.semester.1 occasyear.1.semester.2 occasyear.2.semester.1 occasyear.2.semester.2 occasyear.3.semester.1 occasyear.3.semester.2
1 student.1                  0.138                  0.094                  0.054                  0.026                  0.017                  0.026

어떤 경우가 되었든, 다 조합하면 같은 결과를 얻는다.

	year 1 sem 1	year 1 sem 2	year 2 sem 1	year 2 sem 2	year 3 sem 1	year 3 sem 2
glmmTMB	0.138	0.094	0.054	0.026	0.017	0.026
nlme	0.138	0.094	0.054	0.026	0.017	0.026

Autocorrelation

다음 예는 같은 기초 모델이지만 앞에서 기술한 autocorrelation 구조를 가진다.

weights 인수와 했던 것과 비슷하게 nlme 에서 내장 corAR1 함수와 correlation 인수를 사용한다.

library(nlme)

corr_res = lme(
  gpa ~ occasion,
  data = gpa,
  random = ~ 1 | student,
  correlation = corAR1(form = ~ occasion)
)

## summary(corr_res)

term	value	se	z	p_value	lower_2.5	upper_97.5
Intercept	2.597	0.023	113.146	0	2.552	2.642
occasion	0.107	0.005	20.297	0	0.097	0.118

group	effect	variance	sd	var_prop
student	Intercept	0.046	0.215	0.381
Residual		0.075	0.273	0.619

우선, 학기에 해당하는 fixed 효과는 전과 같다. 분산추정값이 fixed effects 분산 (즉 표준오차)와 함께 약간 바뀌었다. 주된 점은 nlme 출력에서 autocorrelation 을 나타내는 새로운 파라미터 Phi가 생겼다는 것인데, 0.418 값을 갖는다. 이는 시간상 옆에 있는 관측값들에 대해 잔차 사이에 correlation 이 최소한 존재하지만, 멀리 떨어질 수록 빠르게 없어진다는 것을 보여준다.

glmmTMB 도 이러한 구조를 분석할 수 있다. 이 specification 에 학기에 대해 factor form 이 필요하고, 다른 랜덤효과 처럼 모델 공식에 들어간다는 것을 주목하라. supplemental section 에 자세한 내용이 있다.

corr_res_tmb = glmmTMB(
  gpa ~ occasion +  ar1(0 + occas | student) + (1 | student),
  data = gpa
)

일반화 선형 Mixed Model

표준 선형모델을 확장하여 일반화 선형 모델을 만들었듯이, (선형) mixed model 을 일반화 선형 mixed model 로 일반화할 수 있다. 더 나아가서, 다른 패키지들이 여러 반응 분포를 허용하는 것 같이 우리도 exponential family 에만 국한되는 것은 아니다.

우리는 mixed model 세팅에서 로지스틱 회귀를 살펴볼 것이다. Speed dating 데이터셋을 사용할 것이다. 스피드 데이팅 이벤트에서 각 참가자들은 다른 참가자들과 열 번의 짧은 데이트 (4분)를 하도록 실험이 할당했다. 각 데이트에서 각 사람은 다른 사람을 여섯 속성 (attractive, sincere, intelligent, fun, ambitious, shared interests)에 대해 10-점 스케일로 평가하고 상대를 다시 볼 건지 적도록 했다.

목표 변수는 참석자들이 상대를 다시 데이트할 것인지 아닌지 (decision) 이다. 간단하게 해서, 설명변수는 참석자들의 성별(sex), 파트너가 같은 인종인지 (samerace), 그리고 참석자가 상대방의 매력 (attractive), 진실성 (sincere), 지능 (intelligence) 점수에만 국한할 것이다. 후자는 평균 0 과 표준편차 .5 으로 스케일되는데, 이는 binary 공변량과 공평하게 만든다. (_sc)²².

load('data/speed_dating.RData')

sd_model = glmer(
  decision ~ sex + samerace + attractive_sc + sincere_sc + intelligent_sc
  + (1 | iid),
  data   = speed_dating,
  family = binomial
)

summary(sd_model, correlation = FALSE)

term	value	se	z	p_value	lower_2.5	upper_97.5
Intercept	-0.743	0.121	-6.130	0.00	-0.981	-0.506
sexMale	0.156	0.164	0.954	0.34	-0.165	0.478
sameraceYes	0.314	0.075	4.192	0.00	0.167	0.460
attractive_sc	1.957	0.058	33.560	0.00	1.842	2.071
sincere_sc	0.311	0.054	5.747	0.00	0.205	0.417
intelligent_sc	0.444	0.054	8.232	0.00	0.338	0.550

fixed effect 결과들은 각 attributes 에 관해 예상한 대로인데, 매력이 특히 매우 강한 효과를 갖는다. 또한, 상대가 같은 인종인 것은 양의 효과가 있는데 반해 성별은 통계적으로 유의하지 않았다. 원한다면 로지스틱 회귀에서와 같이 계수에 exponent 함수를 취하여 오즈비를 구할 수 있다.

group	effect	variance	sd
iid	Intercept	2.708	1.645

분산요소에 대해 residual 분산이 없는 것을 주목하라. 이는 반응변수의 정규 분표로 모델링하는 것이 아니기 때문에, 추정해야 할 \(\sigma\) 가 없다. 하지만, 결과는 사람마다 변동성이 꽤 있다는 것을 이야기한다.

Exercises for Extensions

Sociometric data

In the following data, kids are put into different groups and rate each other in terms of how much they would like to share some activity with the others. We have identifying variables for the person doing the rating (sender), the person being rated (receiver), what group they are in, as well as age and sex for both sender and receiver, as well as group size.

To run a mixed model, we will have three sources of structure to consider:

senders (within group)
receivers (within group)
group

First, load the sociometric data.

load('data/sociometric.RData')

To run the model, we will proceed with the following modeling steps. For each, make sure you are creating a separate model object for each model run.

Model 1: No covariates, only sender and receiver random effects. Note that even though we don’t add group yet, still use the nesting approach to specify the effects (e.g. 1|group:receiver)
Model 2: No covariates, add group random effect
Model 3: Add all covariates: agesend/rec, sexsend/rec, and grsize (group size)
Model 4: In order to examine sex matching effects, add an interaction of the sex variables to the model sexsend:sexrec.
Compare models with AIC (see the note about model comparison), e.g. AIC(model1). A lower value would indicate the model is preferred.

Patents

Do a Poisson mixed effect model using the patent data. Model the number of citations (ncit) based on whether there was opposition (opposition) and if it was for the biotechnology/pharmaceutical industry (biopharm). Use year as a random effect to account for unspecified economic conditions.

load('data/patents.RData')

Interestingly, one can model overdispersion in a Poisson model by specifying an random intercept for each observation (subject in the data). In other words, no specific clustering or grouped structure is necessary, but we can use the random effect approach to get at the extra variance.

I don’t show the formal model here as we did before, but this is why depicting mixed models solely as ‘multilevel’ becomes a bit problematic in my opinion. In the standard mixed model notation it’s straightforward though, you just add an additional random effect term, just as we do in the actual model syntax.↩︎
Setting aside our discussion to take a turn regarding regression modeling more generally, this is a good example of ‘surprising’ effects not being so surprising when you consider them more closely. Take a look at the effect of experience. More experience means less stress, this is probably not surprising. Now look at the age effect. It’s positive! But wouldn’t older nurses have more experience? What’s going on here? When interpreting experience, it is with age held constant, thus more experience helps with lowering stress no matter what your age. With age, we’re holding experience constant. If experience doesn’t matter, being older is affiliated with more stress, which might be expected given the type of very busy and high pressure work often being done (the mean age is 43). A good way to better understand this specifically is to look at predicted values when age is young, middle, and older vs. experience levels at low, middle, and high experience, possibly explicitly including the interaction of the two in the model. Also note that if you take experience out of the model, the age effect is negative, which is expected, as it captures experience also.↩︎
Just a reminder, it does matter if you label your data in a less than optimal fashion. For example, if in the nesting situation you start your id variable at 1 for each nested group, then you have to use the nested notation in lme4, otherwise, e.g. it won’t know that id = 1 in group 1 is different from id 1 in group 2. In our hospital example, this would be akin to using ward instead of wardid as we did. Again though, this wouldn’t be an issue if one practices good data habits. Note also the : syntax. In other modeling contexts in R this denotes an interaction, and that is no different here. In some contexts, typically due to experimental designs, one would want to explore random effects of the sort 1|A, 1|B and 1|A:B. However, this is relatively rare.↩︎
One reason to do so would be that you expect variability to decrease over time, e.g. due to experience. You might also allow that variance to be different due to some other grouping factor entirely (e.g. due to treatment group membership).↩︎
This feature request has been made by its users for over a decade at this point- it’s not gonna happen. The issue is that the way lmer works by default employs a method that won’t allow it (this is why it is faster and better performing than other packages). Unfortunately the common response to this issue is ‘use nlme.’ However many other packages work with lme4 rather than nlme, and if you aren’t going to use lme4 for mixed models you might as well go Bayesian with rstanarm or brms instead of nlme. I would even prefer mgcv to nlme (though it can use nlme under the hood) because of the other capabilities it provides, and the objects created are easier to work with in my opinion.↩︎
Note that for a balanced binary variable, the mean p=.5 and standard deviation is sqrt(p*(1-p)) = .5↩︎