## 14.3 Confirmatory Factor Analysis

Confirmatory Factor Analysis (CFA) is a popular SEM method in which one specifies how observed variables relate to assumed latent variables (Thompson 2004). CFA is often used to evaluate the psychometric properties of questionnaires or other assessments. It allows researchers to determine if the variables they assess indeed measure one or more latent variables, and how these latent variables relate to each other. For frequently used questionnaires, there are often many studies which report the correlations between the different items of the questionnaires. Such data can be used for meta-analytic SEM, which allows us to evaluate which latent variable (or factor) structure is the most appropriate based on all available evidence.

In this example, let us assume that we want to confirm the latent factor structure of a questionnaire for sleep problems. The questionnaire is assumed to measure two distinct latent variables, which together characterize sleep problems: insomnia and lassitude (in fact Koffel and Watson (2009) argue that sleep complaints do indeed follow this structure). For our questionnaire, we have created a fictitious dataset, dat.cfa (based on the Digman97 data in the metaSEM package) which contains 14 independent studies which report the correlations between the different symptoms of sleep complaints that our questionnaire measures: sleep quality, sleep latency, sleep efficiency, daytime dysfunction and hypersomnia (i.e., sleeping too much; the dataset can be downloaded here). We assume that the first three symptoms are related, because they all measure insomnia as a latent variable, whereas daytime dysfunction and hypersomnia are related because they are symptoms of the lassitude factor.

The proposed structure represented as a graphical model looks like this (please note that the labels are somewhat idiosyncratic to make identifying the relevant components of the model easier later on):

### 14.3.1 Setting up our data

Let us first have a look at the data we want to use for model fitting. The dat.cfa dataset I will use here has a special structure: it is a so-called list, containing (1) another list of matrices and (2) a vector. A list is a very versatile R object, which allows us to bind together different objects in one single big object. Lists can be accessed like data.frames by using the $ operator. By using the names() function, we can see the names of the objects in the list.

names(dat.cfa)

## [1] "data" "n"

We see that the list contains two elements: our data, and the n (sample size) of each study. The data object is itself a list, so we can again use the names() function to see its contents, or display single elements through the $ operator.
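To make this structure more concrete, here is a small self-contained sketch that builds a miniature version of such a list. Note that the two correlation matrices, their values, and the study names are made up for illustration only; dat.cfa itself contains 14 matrices.

```r
symptoms <- c("Quality", "Latency", "Efficiency", "DTDysf", "HypSomnia")

# Two made-up 5x5 correlation matrices standing in for individual studies
cor1 <- diag(5); cor1[lower.tri(cor1)] <- 0.3; cor1[upper.tri(cor1)] <- 0.3
cor2 <- diag(5); cor2[lower.tri(cor2)] <- 0.4; cor2[upper.tri(cor2)] <- 0.4
dimnames(cor1) <- dimnames(cor2) <- list(symptoms, symptoms)

# Bind the matrices and the (made-up) sample sizes together in one list
dat.demo <- list(data = list("Study 1" = cor1, "Study 2" = cor2),
                 n = c(220, 315))

names(dat.demo)           # "data" "n"
names(dat.demo$data)      # "Study 1" "Study 2"
dat.demo$data$`Study 1`   # correlation matrix of the first study
```

This is exactly the shape that tssem1() expects: a list of correlation matrices, plus a vector of sample sizes in the same study order.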

### 14.3.2 Model Fitting

#### 14.3.2.1 Stage 1

At the first stage, we pool the correlation matrices of all studies. To do this, we use the tssem1() function, providing our list of correlation matrices, the sample sizes, the method ("REM" for a random-effects model), and RE.type = "Diag", meaning that we assume the random effects to be independent of each other. We save the result as cfa1 and inspect it using summary().

cfa1 <- tssem1(dat.cfa$data, dat.cfa$n, method="REM", RE.type = "Diag")

summary(cfa1)

## Call:
## meta(y = ES, v = acovR, RE.constraints = Diag(paste0(RE.startvalues,
##     "*Tau2_", 1:no.es, "_", 1:no.es)), RE.lbound = RE.lbound,
##     I2 = I2, model.name = model.name, suppressWarnings = TRUE,
##     silent = silent, run = run)
##
## 95% confidence intervals: z statistic approximation
## Coefficients:
##               Estimate  Std.Error      lbound      ubound z value  Pr(>|z|)
## Intercept1  0.38971904 0.05429257  0.28330757  0.49613052  7.1781 7.068e-13 ***
## Intercept2  0.43265876 0.04145136  0.35141559  0.51390194 10.4377 < 2.2e-16 ***
## Intercept3  0.04945626 0.06071079 -0.06953470  0.16844722  0.8146   0.41529
## Intercept4  0.09603706 0.04478711  0.00825593  0.18381818  2.1443   0.03201 *
## Intercept5  0.42724237 0.03911621  0.35057601  0.50390873 10.9224 < 2.2e-16 ***
## Intercept6  0.11929310 0.04106204  0.03881299  0.19977321  2.9052   0.00367 **
## Intercept7  0.19292421 0.04757961  0.09966988  0.28617853  4.0548 5.018e-05 ***
## Intercept8  0.22690159 0.03240893  0.16338126  0.29042192  7.0012 2.538e-12 ***
## Intercept9  0.18105563 0.04258855  0.09758361  0.26452765  4.2513 2.126e-05 ***
## Intercept10 0.43614968 0.03205961  0.37331400  0.49898536 13.6043 < 2.2e-16 ***
## Tau2_1_1    0.03648373 0.01513055  0.00682838  0.06613907  2.4113   0.01590 *
## Tau2_2_2    0.01963097 0.00859789  0.00277942  0.03648253  2.2832   0.02242 *
## Tau2_3_3    0.04571437 0.01952285  0.00745030  0.08397845  2.3416   0.01920 *
## Tau2_4_4    0.02236122 0.00995083  0.00285794  0.04186449  2.2472   0.02463 *
## Tau2_5_5    0.01729073 0.00796404  0.00168149  0.03289996  2.1711   0.02992 *
## Tau2_6_6    0.01815482 0.00895896  0.00059557  0.03571407  2.0264   0.04272 *
## Tau2_7_7    0.02604880 0.01130265  0.00389601  0.04820160  2.3047   0.02119 *
## Tau2_8_8    0.00988761 0.00513713 -0.00018098  0.01995620  1.9247   0.05426 .
## Tau2_9_9    0.01988243 0.00895052  0.00233973  0.03742514  2.2214   0.02633 *
## Tau2_10_10  0.01049222 0.00501578  0.00066148  0.02032297  2.0918   0.03645 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Q statistic on the homogeneity of effect sizes: 1220.333
## Degrees of freedom of the Q statistic: 130
## P value of the Q statistic: 0
##
## Heterogeneity indices (based on the estimated Tau2):
##                               Estimate
## Intercept1: I2 (Q statistic)    0.9326
## Intercept2: I2 (Q statistic)    0.8872
## Intercept3: I2 (Q statistic)    0.9325
## Intercept4: I2 (Q statistic)    0.8703
## Intercept5: I2 (Q statistic)    0.8797
## Intercept6: I2 (Q statistic)    0.8480
## Intercept7: I2 (Q statistic)    0.8887
## Intercept8: I2 (Q statistic)    0.7669
## Intercept9: I2 (Q statistic)    0.8590
## Intercept10: I2 (Q statistic)   0.8193
##
## Number of studies (or clusters): 14
## Number of observed statistics: 140
## Number of estimated parameters: 20
## Degrees of freedom: 120
## -2 log likelihood: -112.4196
## OpenMx status1: 0 ("0" or "1": The optimization is considered fine.
## Other values may indicate problems.)

A look at OpenMx status1 shows that the model estimates are trustworthy. To make the results more easily digestible, we can extract the fixed effects (our estimated pooled correlations) using the coef() function. We then make a symmetric matrix out of the coefficients using vec2symMat(), and add the dimension names for easier interpretation.

# Extract the fixed coefficients (correlations)
fixed.coefs <- coef(cfa1, "fixed")

# Make a symmetric matrix
fc.mat <- vec2symMat(fixed.coefs, diag = FALSE)

# Label rows and columns
dimnames(fc.mat)[[1]] <- dimnames(fc.mat)[[2]] <- c("Quality", "Latency", "Efficiency", "DTDysf", "HypSomnia")

fc.mat

##               Quality   Latency Efficiency     DTDysf  HypSomnia
## Quality    1.00000000 0.3897190  0.4326588 0.04945626 0.09603706
## Latency    0.38971904 1.0000000  0.4272424 0.11929310 0.19292421
## Efficiency 0.43265876 0.4272424  1.0000000 0.22690159 0.18105563
## DTDysf     0.04945626 0.1192931  0.2269016 1.00000000 0.43614968
## HypSomnia  0.09603706 0.1929242  0.1810556 0.43614968 1.00000000

We can now see the pooled correlation matrix for our variables.
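To see what vec2symMat() is doing under the hood, here is a base R sketch; vec_to_sym() is a hypothetical helper written purely for illustration, not a metaSEM function. It fills the lower triangle of an identity matrix column by column, which matches the Intercept1 to Intercept10 order of the pooled coefficients, and then mirrors it into the upper triangle:

```r
# Hypothetical stand-in for metaSEM::vec2symMat(..., diag = FALSE)
vec_to_sym <- function(v, k) {
  m <- diag(k)
  m[lower.tri(m)] <- v                  # fill lower triangle (column-major)
  m[upper.tri(m)] <- t(m)[upper.tri(m)] # mirror into the upper triangle
  m
}

# The ten pooled correlations (rounded) from the stage 1 output
coefs.demo <- c(0.3897, 0.4327, 0.0495, 0.0960, 0.4272,
                0.1193, 0.1929, 0.2269, 0.1811, 0.4361)

fc.demo <- vec_to_sym(coefs.demo, k = 5)
dimnames(fc.demo) <- rep(list(c("Quality", "Latency", "Efficiency",
                                "DTDysf", "HypSomnia")), 2)

fc.demo["Quality", "Latency"]   # 0.3897, as in fc.mat above
```

Because the diagonal of a correlation matrix is always 1, only the 10 unique off-diagonal elements of the 5x5 matrix need to be estimated and stored, which is why diag = FALSE is used.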
Looking back at the model output, we also see that all correlation coefficients are significant ($$p<0.05$$), except one: the correlation between sleep quality and daytime dysfunction. From the perspective of our model, this makes sense, because we expect these two variables to load on different factors. We also find that the $$I^2$$ values for the different estimates are very large ($$76\%-93\%$$). We may therefore also want to examine whether a model fitted in different subclusters of studies reduces the amount of heterogeneity we find, and whether this translates to different SEM fits at stage 2.

#### 14.3.2.2 Stage 2

After pooling the correlation matrices, it is now time to determine if our proposed factor model does indeed fit the data. To specify our model, we have to resort to the RAM formulation this time, and specify the $$\mathbf{A}$$, $$\mathbf{S}$$ and $$\mathbf{F}$$ matrices. In the RAM formulation, these three matrices together generate the model-implied correlation matrix:

$$\mathbf{\Sigma} = \mathbf{F}(\mathbf{I}-\mathbf{A})^{-1}\mathbf{S}(\mathbf{I}-\mathbf{A})^{-\top}\mathbf{F}^{\top}$$

where $$\mathbf{I}$$ is an identity matrix. To fill the fields, it is often easier to construct an empty matrix first. In the rows and columns, the matrices will not only contain the observed variables, but also the latent variables we want to estimate, f_Insomnia and f_Lassitude. Here is how we can create a zero matrix as a blueprint:

# Create vector of column/row names
dims <- c("Quality", "Latency", "Efficiency",
          "DTDysf", "HypSomnia", "f_Insomnia", "f_Lassitude")

# Create 7x7 matrix of zeros
mat <- matrix(rep(0, 7*7), nrow = 7, ncol = 7)

# Label the rows and columns
dimnames(mat)[[1]] <- dimnames(mat)[[2]] <- dims

mat

##             Quality Latency Efficiency DTDysf HypSomnia f_Insomnia
## Quality           0       0          0      0         0          0
## Latency           0       0          0      0         0          0
## Efficiency        0       0          0      0         0          0
## DTDysf            0       0          0      0         0          0
## HypSomnia         0       0          0      0         0          0
## f_Insomnia        0       0          0      0         0          0
## f_Lassitude       0       0          0      0         0          0
##             f_Lassitude
## Quality               0
## Latency               0
## Efficiency            0
## DTDysf                0
## HypSomnia             0
## f_Insomnia            0
## f_Lassitude           0

$~$

$$\mathbf{A}$$ Matrix

In the $$\mathbf{A}$$ matrix, we specify the asymmetrical (i.e. single-headed) arrows in our model.
The logic, again, is that each entry represents an arrow pointing from the column variable to the row variable; the field where column and row meet contains the value of that arrow. All other fields, which do not represent arrows, are filled with 0. We specify that an arrow has to be estimated by providing a character string. This character string starts with a starting value for the optimization procedure (usually somewhere between $$0.1$$ and $$0.3$$), followed by *. After the * symbol, we specify a label for the value. If two fields in the $$\mathbf{A}$$ matrix have the same label, this means that we assume that the fields have the same value. We assume a starting value of $$0.3$$ for all estimated arrows, and label the fields according to the graph from above. We can do this like so:

A <- matrix(c(0, 0, 0, 0, 0, "0.3*Ins_Q", 0,
              0, 0, 0, 0, 0, "0.3*Ins_L", 0,
              0, 0, 0, 0, 0, "0.3*Ins_E", 0,
              0, 0, 0, 0, 0, 0, "0.3*Las_D",
              0, 0, 0, 0, 0, 0, "0.3*Las_H",
              0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0),
            nrow = 7, ncol = 7, byrow = TRUE)

# Label columns and rows
dimnames(A)[[1]] <- dimnames(A)[[2]] <- dims

A

##             Quality Latency Efficiency DTDysf HypSomnia f_Insomnia
## Quality     "0"     "0"     "0"        "0"    "0"       "0.3*Ins_Q"
## Latency     "0"     "0"     "0"        "0"    "0"       "0.3*Ins_L"
## Efficiency  "0"     "0"     "0"        "0"    "0"       "0.3*Ins_E"
## DTDysf      "0"     "0"     "0"        "0"    "0"       "0"
## HypSomnia   "0"     "0"     "0"        "0"    "0"       "0"
## f_Insomnia  "0"     "0"     "0"        "0"    "0"       "0"
## f_Lassitude "0"     "0"     "0"        "0"    "0"       "0"
##             f_Lassitude
## Quality     "0"
## Latency     "0"
## Efficiency  "0"
## DTDysf      "0.3*Las_D"
## HypSomnia   "0.3*Las_H"
## f_Insomnia  "0"
## f_Lassitude "0"

Looks good so far. The last step is to plug the A matrix into the as.mxMatrix() function to make it usable for the later steps.

A <- as.mxMatrix(A)

$~$

$$\mathbf{S}$$ Matrix

In the $$\mathbf{S}$$ matrix, we specify the variances we want to estimate. We want to estimate the variance of all observed variables, as well as the correlation between our latent variables.
We set the correlation of our latent variables with themselves to be $$1$$. We use a starting value of $$0.2$$ for the variances of the observed variables, and $$0.3$$ for the correlations. Here is how we construct the matrix:

# Make a diagonal matrix for the variances
Vars <- Diag(c("0.2*var_Q", "0.2*var_L", "0.2*var_E", "0.2*var_D", "0.2*var_H"))

# Make the matrix for the latent variables
Cors <- matrix(c(1, "0.3*cor_InsLas",
                 "0.3*cor_InsLas", 1), nrow = 2, ncol = 2)

# Combine
S <- bdiagMat(list(Vars, Cors))

# Label columns and rows
dimnames(S)[[1]] <- dimnames(S)[[2]] <- dims

S

##             Quality     Latency     Efficiency  DTDysf      HypSomnia
## Quality     "0.2*var_Q" "0"         "0"         "0"         "0"
## Latency     "0"         "0.2*var_L" "0"         "0"         "0"
## Efficiency  "0"         "0"         "0.2*var_E" "0"         "0"
## DTDysf      "0"         "0"         "0"         "0.2*var_D" "0"
## HypSomnia   "0"         "0"         "0"         "0"         "0.2*var_H"
## f_Insomnia  "0"         "0"         "0"         "0"         "0"
## f_Lassitude "0"         "0"         "0"         "0"         "0"
##             f_Insomnia       f_Lassitude
## Quality     "0"              "0"
## Latency     "0"              "0"
## Efficiency  "0"              "0"
## DTDysf      "0"              "0"
## HypSomnia   "0"              "0"
## f_Insomnia  "1"              "0.3*cor_InsLas"
## f_Lassitude "0.3*cor_InsLas" "1"

And again, we transform the matrix using as.mxMatrix().

S <- as.mxMatrix(S)

$~$

$$\mathbf{F}$$ Matrix

The $$\mathbf{F}$$ matrix is quite easily specified: in the diagonal elements of the observed variables, we fill in 1; elsewhere, we specify 0. We then only keep the rows in which at least one element is not zero (i.e., the rows of the observed variables).

# Construct diagonal matrix
F <- Diag(c(1, 1, 1, 1, 1, 0, 0))

# Only select non-null rows
F <- F[1:5,]

# Specify row and column labels
dimnames(F)[[1]] <- dims[1:5]
dimnames(F)[[2]] <- dims

F

##            Quality Latency Efficiency DTDysf HypSomnia f_Insomnia
## Quality          1       0          0      0         0          0
## Latency          0       1          0      0         0          0
## Efficiency       0       0          1      0         0          0
## DTDysf           0       0          0      1         0          0
## HypSomnia        0       0          0      0         1          0
##            f_Lassitude
## Quality              0
## Latency              0
## Efficiency           0
## DTDysf               0
## HypSomnia            0

F <- as.mxMatrix(F)

$~$

Model Fitting

Now, it is time to fit our proposed model to the data.
To do this, we use the tssem2() function. We only have to provide the stage 1 model cfa1 and the three matrices, and specify diag.constraints = FALSE, because we are not fitting a mediation model. I save the resulting object as cfa2 and then access it using summary().

cfa2 <- tssem2(cfa1,
               Amatrix = A,
               Smatrix = S,
               Fmatrix = F,
               diag.constraints = FALSE)

summary(cfa2)

##
## Call:
## wls(Cov = pooledS, aCov = aCov, n = tssem1.obj$total.n, Amatrix = Amatrix,
##     Smatrix = Smatrix, Fmatrix = Fmatrix, diag.constraints = diag.constraints,
##     cor.analysis = cor.analysis, intervals.type = intervals.type,
##     mx.algebras = mx.algebras, model.name = model.name, suppressWarnings = suppressWarnings,
##     silent = silent, run = run)
##
## 95% confidence intervals: z statistic approximation
## Coefficients:
##            Estimate Std.Error   lbound   ubound z value
## Las_D      0.679955  0.075723 0.531541 0.828370  8.9795
## Ins_E      0.760455  0.061964 0.639009 0.881901 12.2726
## Las_H      0.641842  0.072459 0.499825 0.783859  8.8580
## Ins_L      0.590630  0.052649 0.487439 0.693821 11.2182
## Ins_Q      0.569435  0.052425 0.466684 0.672187 10.8619
## cor_InsLas 0.377691  0.047402 0.284785 0.470596  7.9679
##                         Pr(>|z|)
## Las_D      < 0.00000000000000022 ***
## Ins_E      < 0.00000000000000022 ***
## Las_H      < 0.00000000000000022 ***
## Ins_L      < 0.00000000000000022 ***
## Ins_Q      < 0.00000000000000022 ***
## cor_InsLas  0.000000000000001554 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Goodness-of-fit indices:
##                                                Value
## Sample size                                4496.0000
## Chi-square of target model                    7.8204
## DF of target model                            4.0000
## p value of target model                       0.0984
## Number of constraints imposed on "Smatrix"    0.0000
## Chi-square of independence model            501.6766
## DF of independence model                     10.0000
## RMSEA                                         0.0146
## RMSEA lower 95% CI                            0.0000
## RMSEA upper 95% CI                            0.0297
## SRMR                                          0.0436
## TLI                                           0.9806
## CFI                                           0.9922
## AIC                                          -0.1796
## BIC                                         -25.8234
## OpenMx status1: 0 ("0" or "1": The optimization is considered fine.
## Other values indicate problems.)

We again see that the OpenMx status1 is 0, meaning that the optimization worked fine. We get the estimates for the different paths between the latent variables and the observed symptoms, such as $$0.68$$ for $$Lassitude \rightarrow Daytime~Dysfunction$$ (Las_D). The important part, however, is to check how well the assumed model fits our data. For this, we can have a look at the goodness-of-fit indices. The test of the target model is $$\chi^{2}_{4} = 7.82$$ ($$n=4496$$), which is not significant ($$p=0.098$$), meaning that we cannot reject the hypothesis that our model fits the data. Furthermore, the Root Mean Square Error of Approximation (RMSEA; Steiger and Lind 1980) value is $$RMSEA=0.0146$$. As a rule of thumb, a model can be considered to fit the data well when the $$RMSEA$$ is smaller than $$0.05$$, while values above $$0.10$$ indicate poor fit. Together with the high CFI ($$0.99$$) and the low SRMR ($$0.04$$), the indices therefore consistently suggest that our proposed two-factor structure fits the data well. Given the substantial between-study heterogeneity we found at stage 1, it may nevertheless be interesting to conduct a subgroup analysis, in which the model is fitted to subgroups of studies which share some kind of characteristic (e.g. age group, country of origin, etc.), to see if there are subclusters in which the model fits better than in others.
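To see what the fitted model implies, we can carry out the RAM matrix algebra $$\mathbf{F}(\mathbf{I}-\mathbf{A})^{-1}\mathbf{S}(\mathbf{I}-\mathbf{A})^{-\top}\mathbf{F}^{\top}$$ by hand. The following base R sketch is for illustration only: it plugs the path estimates from the summary(cfa2) output into numeric copies of the matrices (named A.num, S.num and F.num here so as not to overwrite the mxMatrix objects), and sets the residual variances to $$1-\lambda^2$$ so that the implied diagonal equals 1.

```r
dims <- c("Quality", "Latency", "Efficiency", "DTDysf", "HypSomnia",
          "f_Insomnia", "f_Lassitude")

# Numeric A matrix holding the estimated loadings from summary(cfa2)
A.num <- matrix(0, 7, 7, dimnames = list(dims, dims))
A.num["Quality",    "f_Insomnia"]  <- 0.569435  # Ins_Q
A.num["Latency",    "f_Insomnia"]  <- 0.590630  # Ins_L
A.num["Efficiency", "f_Insomnia"]  <- 0.760455  # Ins_E
A.num["DTDysf",     "f_Lassitude"] <- 0.679955  # Las_D
A.num["HypSomnia",  "f_Lassitude"] <- 0.641842  # Las_H

# Numeric S matrix: residual variances 1 - lambda^2, factor correlation
S.num <- matrix(0, 7, 7, dimnames = list(dims, dims))
diag(S.num)[1:5] <- 1 - rowSums(A.num[1:5, 6:7]^2)
S.num[6, 6] <- S.num[7, 7] <- 1
S.num[6, 7] <- S.num[7, 6] <- 0.377691          # cor_InsLas

# Numeric F matrix: select the five observed variables
F.num <- cbind(diag(5), matrix(0, 5, 2))
dimnames(F.num) <- list(dims[1:5], dims)

# Model-implied correlation matrix via the RAM formula
IA <- solve(diag(7) - A.num)
implied <- F.num %*% IA %*% S.num %*% t(IA) %*% t(F.num)

round(implied, 3)
# e.g. the implied Quality-Latency correlation is 0.569 * 0.591 = 0.336,
# close to the pooled stage 1 estimate of 0.390
```

Comparing this implied matrix to the pooled correlation matrix from stage 1 shows where the residual misfit lies; this is exactly the discrepancy that the chi-square test and SRMR summarize.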

Please be aware that a common problem in SEM studies is that researchers often focus only on their one proposed model, and on whether it fits the data well. If the model is found to fit the data closely, it is then often directly assumed that the data prove the underlying structure or theory. This is problematic, because more than one model can achieve a good fit for the same data. It is therefore necessary to also check alternative model hypotheses and structures. If an alternative model fits the data well too, it becomes less clear if our proposed structure really is the “correct” one.
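As an illustration, an obvious alternative hypothesis for our questionnaire is a single-factor model, in which all five symptoms load on one common sleep problems factor. Below is a sketch of how its matrices could be specified; the factor name f_Sleep and the labels S_Q to S_H are made up for this example. The model could then be fitted with tssem2() in the same way as before, and compared to the two-factor model via its fit indices.

```r
# Hypothetical one-factor alternative: all symptoms load on a single factor
dims1 <- c("Quality", "Latency", "Efficiency", "DTDysf", "HypSomnia", "f_Sleep")

# A matrix: one loading (arrow) per observed variable
A1 <- matrix(0, 6, 6, dimnames = list(dims1, dims1))
A1[1:5, "f_Sleep"] <- paste0("0.3*", c("S_Q", "S_L", "S_E", "S_D", "S_H"))

# S matrix: residual variances for the symptoms, variance 1 for the factor
S1 <- matrix(0, 6, 6, dimnames = list(dims1, dims1))
diag(S1) <- c("0.2*var_Q", "0.2*var_L", "0.2*var_E",
              "0.2*var_D", "0.2*var_H", "1")

# F matrix: select the five observed variables
F1 <- cbind(diag(5), 0)
dimnames(F1) <- list(dims1[1:5], dims1)

# The alternative model could then be fitted and compared like this:
# cfa.alt <- tssem2(cfa1, Amatrix = as.mxMatrix(A1),
#                   Smatrix = as.mxMatrix(S1),
#                   Fmatrix = as.mxMatrix(F1),
#                   diag.constraints = FALSE)
# summary(cfa.alt)   # compare chi-square, RMSEA, AIC/BIC with cfa2
```

If the one-factor model fit the data about as well as the two-factor model, the claim that insomnia and lassitude are distinct latent dimensions would be considerably weakened.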

#### 14.3.2.3 Plotting the Model

After we fit the model, metaSEM makes it quite easy for us to visualize it graphically. To do this, we first have to install and load the semPlot package (Epskamp 2019).

library(semPlot)

To plot the model, we have to convert it into a format that semPlot can use. We can use the meta2semPlot() function to do this.

cfa.plot <- meta2semPlot(cfa2)

We can then use the semPaths() function in semPlot to generate the plot. This function has many parameters, which you can access by typing ?semPaths into the console and hitting Enter. Here is what my code looks like, and the resulting plot:

# Create Plot labels (left to right, bottom to top)
labels <- c("Sleep\nQuality", "Sleep\nLatency", "Sleep\nEfficiency",
            "Daytime\nDysfunction", "Hyper-\nsomnia",
            "Insomnia", "Lassitude")

# Plot
semPaths(cfa.plot,
         whatLabels = "est",
         edge.color = "black",
         nodeLabels = labels)

### References

Thompson, Bruce. 2004. Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications. American Psychological Association.

Koffel, Erin, and David Watson. 2009. “The Two-Factor Structure of Sleep Complaints and Its Relation to Depression and Anxiety.” Journal of Abnormal Psychology 118 (1). American Psychological Association: 183.

Steiger, James H, and John C Lind. 1980. “Statistically Based Tests for the Number of Common Factors.” Paper presented at the Annual Meeting of the Psychometric Society, Iowa City, IA.

Epskamp, Sacha. 2019. SemPlot: Path Diagrams and Visual Analysis of Various Sem Packages’ Output. https://CRAN.R-project.org/package=semPlot.