Factor Analysis
Factor analysis (FA) assumes that there are a number of underlying (latent) factors affecting the observed scores on items/tests. In other words, the traits underlying a test might be multidimensional.
When a new measure is established, researchers often make assumptions about the traits that each item/subtest measures. If the items indeed load on the factors in the way proposed, the test is said to have high factorial validity.
Factor analysis can be divided into two types:
Exploratory factor analysis (EFA): a method for exploring the underlying structure of a set of observed variables; it is a crucial step in the scale development process. For example, how many factors are needed to sufficiently explain the observed score variance? What is the relationship between each item and each factor?
Confirmatory factor analysis (CFA): a method for verifying a hypothesized factor structure. In CFA, theory specifies which factor(s) each item should load on; the remaining loadings are restricted to 0. You then check whether the hypothesized model fits the data well.
7.5 Exploratory Factor Analysis (EFA)
If 2 items/tests have high observed-score correlations, they might be related to (or load on) some common factor. To get a general idea about the correlations between items/subtests, we can use the cor() function in R to obtain a correlation matrix.
For example, the scores.txt file provides (fictitious) scores of 100 subjects on 8 tests.
scores <- read.table("https://raw.githubusercontent.com/sunbeomk/PSYC490/main/scores.txt")
head(scores)
##   test.1 test.2 test.3 test.4 test.5 test.6 test.7 test.8
## 1      5      4      3      5      4      4      3      3
## 2      5      6      4      5      4      5      4      4
## 3      4      3      3      4      4      3      3      4
## 4      4      4      4      3      5      4      4      5
## 5      4      5      2      4      3      4      2      3
## 6      5      5      5      5      6      5      5      6
We can obtain the score correlations between tests with the cor() function.
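The exact call is not shown in the text; a call like the following (rounding to two decimals only for readability) would reproduce the matrix below:
round(cor(scores), 2)  # correlation matrix of the eight test scores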
##        test.1 test.2 test.3 test.4 test.5 test.6 test.7 test.8
## test.1   1.00   0.74   0.01   0.79  -0.03   0.79   0.01   0.04
## test.2   0.74   1.00  -0.02   0.74  -0.03   0.82  -0.03  -0.01
## test.3   0.01  -0.02   1.00   0.00   0.83  -0.07   0.80   0.81
## test.4   0.79   0.74   0.00   1.00   0.00   0.75   0.04   0.00
## test.5  -0.03  -0.03   0.83   0.00   1.00  -0.08   0.82   0.77
## test.6   0.79   0.82  -0.07   0.75  -0.08   1.00  -0.06  -0.01
## test.7   0.01  -0.03   0.80   0.04   0.82  -0.06   1.00   0.80
## test.8   0.04  -0.01   0.81   0.00   0.77  -0.01   0.80   1.00
The correlation matrix suggests that:
- tests \(1, 2, 4, 6\) are highly correlated, and
- tests \(3, 5, 7, 8\) are also highly correlated.
- The test correlations across these two groups are almost all zero.
This suggests that there may be \(2\) underlying factors: tests \(1\), \(2\), \(4\), and \(6\) may load on one, and tests \(3\), \(5\), \(7\), and \(8\) on the other.
7.5.1 The factanal() function
To conduct FA, we can use R’s built-in factanal() function.
Let’s first try an EFA with 1 factor using the scores data. Here we assume there is only one underlying factor.
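The code chunk that produced the output below is not shown; a plausible reconstruction (the object name fa1 is an assumption) is:
fa1 <- factanal(x = scores, factors = 1)  # EFA with a single factor
class(fa1)                                # the result is a "factanal" object
fa1                                       # print the fitted model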
## [1] "factanal"
##
## Call:
## factanal(x = scores, factors = 1)
##
## Uniquenesses:
## test.1 test.2 test.3 test.4
## 1.000 0.999 0.174 1.000
## test.5 test.6 test.7 test.8
## 0.188 0.995 0.201 0.226
##
## Loadings:
## Factor1
## test.1
## test.2
## test.3 0.909
## test.4
## test.5 0.901
## test.6
## test.7 0.894
## test.8 0.880
##
## Factor1
## SS loadings 3.216
## Proportion Var 0.402
##
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 327.47 on 20 degrees of freedom.
## The p-value is 1.92e-57
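Individual loadings and the variance summaries can also be extracted from the fitted object directly. A minimal sketch, assuming the fit was stored in fa1 as above:
fa1$loadings[1]               # loading of test.1 on the single factor
fa1$loadings[1:8]             # all eight loadings as a numeric vector
sum(fa1$loadings[1:8]^2)      # SS loadings
sum(fa1$loadings[1:8]^2) / 8  # proportion of variance explained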
## [1] 0.003100294
## [1] 0.003100294 -0.028298495
## [3] 0.908886177 0.006756035
## [5] 0.901239123 -0.067765586
## [7] 0.893647841 0.879585184
## [1] 3.216031
## [1] 0.4020039
There are a few things we can see from the outputs:
- Uniquenesses: each indicator (test)’s uniqueness is the proportion of its variance left unexplained by the common factor.
- Loadings: you can see that some loadings are blank. In fact, loadings smaller than .1 in absolute value are omitted (so that a sparse structure is displayed more clearly). Tests 3, 5, 7, and 8 are the only ones with nontrivial loadings on the single factor.
- Looking at Proportion Var: about 40.2% of the variance in the data is explained by this single-factor model.
To choose the number of factors, you can gradually increase the number of factors and observe the change in the proportion of variance explained.
Now, rerun factanal(), but this time assuming there are 2 underlying factors:
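As before, the code chunk itself is not shown; a plausible call, storing the fit in a hypothetical object fa2, is:
fa2 <- factanal(x = scores, factors = 2)  # EFA with two factors
fa2                                       # print the fitted model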
##
## Call:
## factanal(x = scores, factors = 2)
##
## Uniquenesses:
## test.1 test.2 test.3 test.4
## 0.230 0.227 0.174 0.279
## test.5 test.6 test.7 test.8
## 0.187 0.166 0.201 0.226
##
## Loadings:
##        Factor1 Factor2
## test.1          0.877
## test.2          0.879
## test.3  0.909
## test.4          0.849
## test.5  0.901
## test.6          0.910
## test.7  0.894
## test.8  0.879
##
##                Factor1 Factor2
## SS loadings      3.217   3.092
## Proportion Var   0.402   0.387
## Cumulative Var   0.402   0.789
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 15.18 on 13 degrees of freedom.
## The p-value is 0.296
With 2 factors, the proportion of explained variance (Cumulative Var) increased to \(78.9\%\).
7.5.2 EFA with Covariance Matrix
If the complete response data are not available and we only have the covariance (or correlation) matrix of the indicators, we can still conduct factor analysis using that matrix as the input.
The cov() function creates a covariance matrix from a data frame.
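A sketch of the computation (the object name cov_mat matches the call shown later in the output; the rounding is only for display):
cov_mat <- cov(scores)  # covariance matrix of the eight tests
round(cov_mat, 2)       # print, rounded to two decimals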
##        test.1 test.2 test.3 test.4 test.5 test.6 test.7 test.8
## test.1   0.90   0.70   0.01   0.68  -0.02   0.66   0.01   0.04
## test.2   0.70   0.97  -0.01   0.67  -0.03   0.71  -0.02  -0.01
## test.3   0.01  -0.01   0.82   0.00   0.71  -0.06   0.63   0.68
## test.4   0.68   0.67   0.00   0.83   0.00   0.60   0.03   0.00
## test.5  -0.02  -0.03   0.71   0.00   0.90  -0.07   0.67   0.68
## test.6   0.66   0.71  -0.06   0.60  -0.07   0.77  -0.05  -0.01
## test.7   0.01  -0.02   0.63   0.03   0.67  -0.05   0.76   0.64
## test.8   0.04  -0.01   0.68   0.00   0.68  -0.01   0.64   0.86
We can use the covariance matrix as the input to the factanal() function. This time, however, we need to specify two things. First, pass the matrix through the covmat = argument so that factanal() knows the input is a covariance matrix rather than raw data. Second, supply n.obs =, the sample size that was used to compute the covariance matrix.
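The corresponding call, as reflected in the Call line of the output below, is:
factanal(factors = 2, covmat = cov_mat, n.obs = nrow(scores))  # EFA from a covariance matrix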
##
## Call:
## factanal(factors = 2, covmat = cov_mat, n.obs = nrow(scores))
##
## Uniquenesses:
## test.1 test.2 test.3 test.4
## 0.230 0.227 0.174 0.279
## test.5 test.6 test.7 test.8
## 0.187 0.166 0.201 0.226
##
## Loadings:
##        Factor1 Factor2
## test.1          0.877
## test.2          0.879
## test.3  0.909
## test.4          0.849
## test.5  0.901
## test.6          0.910
## test.7  0.894
## test.8  0.879
##
##                Factor1 Factor2
## SS loadings      3.217   3.092
## Proportion Var   0.402   0.387
## Cumulative Var   0.402   0.789
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 15.18 on 13 degrees of freedom.
## The p-value is 0.296
7.5.3 Factor scores
In addition to the loading structure, you may also want to know each individual’s factor scores. For example, what is examinee 1’s score on factor 1? On factor 2?
These can be obtained from the factanal() output if you additionally specify scores = 'regression'.
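A sketch of the call (the object name fa2_scores is an assumption; head() shows the first six examinees, matching the output below):
fa2_scores <- factanal(scores, factors = 2, scores = "regression")  # request regression factor scores
head(fa2_scores$scores)                                             # factor scores of the first six examinees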
## Factor1 Factor2
## 1 -0.88097000 0.03016465
## 2 -0.04549636 0.98800767
## 3 -0.62996212 -1.09674458
## 4 0.49201233 -0.61045599
## 5 -1.80701705 -0.21758291
## 6 1.63726192 0.79152799
Each individual’s indicator/subtest scores are then modeled as weighted sums of their factor scores, where the weights are determined by the factor loadings (plus a unique part for each indicator).
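In the two-factor case, a sketch of this relationship for the score on test \(j\) is \(x_j = \lambda_{j1} F_1 + \lambda_{j2} F_2 + u_j\), where \(\lambda_{j1}\) and \(\lambda_{j2}\) are test \(j\)’s loadings, \(F_1\) and \(F_2\) are the factor scores, and \(u_j\) is the unique part of test \(j\).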
7.5.4 Rotational Indeterminacy
There exist infinitely many loading solutions that fit the data equally well in EFA. This is called rotational indeterminacy. We can therefore rotate the factors so that the loadings are as close as possible to a desired structure.
There are two types of rotation methods:
Orthogonal: The underlying factors after rotation will be uncorrelated.
Oblique: The underlying factors after rotation can be correlated.
The default rotation in the factanal() function is "varimax", which is one of the orthogonal rotation methods. Let’s check the correlation between the factor scores obtained from an orthogonal rotation.
varimax <- factanal(scores,
factors = 2,
rotation="varimax",
scores="regression")
cor(varimax$scores)
##              Factor1      Factor2
## Factor1  1.000000000 -0.001527159
## Factor2 -0.001527159  1.000000000
One of the oblique rotation methods is "promax" rotation, which allows the factors to be correlated with each other.
promax <- factanal(scores,
factors = 2,
rotation = "promax",
scores = "regression")
cor(promax$scores)
## Factor1 Factor2
## Factor1 1.00000000 0.01563374
## Factor2 0.01563374 1.00000000
7.5.5 The fa() function in the psych package
The psych package provides another useful function for factor analysis. You can fit an EFA model with the fa() function. For example, the following code chunk fits a 2-factor model with "varimax" rotation.
library(psych)
output2 <- fa(scores, # input data
nfactors = 2, # number of factors
rotate = "varimax", # rotation
scores = "regression") # factor score estimation
output2$loadings # factor loadings
##
## Loadings:
##          MR1    MR2
## test.1         0.882
## test.2         0.872
## test.3  0.909
## test.4         0.854
## test.5  0.900
## test.6         0.907
## test.7  0.896
## test.8  0.879
##
## MR1 MR2
## SS loadings 3.217 3.093
## Proportion Var 0.402 0.387
## Cumulative Var 0.402 0.789
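Other quantities can be pulled out of the fitted object. A sketch that matches the output below, assuming the fitted object output2 from the chunk above (component names as documented in psych, though they may differ slightly across versions):
output2$uniquenesses                        # proportion of variance unique to each test
output2$communality                         # proportion of variance shared with the factors
output2$communality + output2$uniquenesses  # sums to (approximately) 1 for each test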
## test.1 test.2 test.3
## 0.2214045 0.2375525 0.1730854
## test.4 test.5 test.6
## 0.2706205 0.1907075 0.1720624
## test.7 test.8
## 0.1974814 0.2267084
## test.1 test.2 test.3
## 0.7785955 0.7624475 0.8269146
## test.4 test.5 test.6
## 0.7293795 0.8092925 0.8279376
## test.7 test.8
## 0.8025186 0.7732916
## test.1 test.2 test.3
## 1.0000000 1.0000000 1.0000000
## test.4 test.5 test.6
## 0.9999999 1.0000000 0.9999999
## test.7 test.8
## 1.0000000 1.0000000
7.6 Confirmatory Factor Analysis (CFA)
We can use CFA to test hypotheses about the structure of the latent factors.
We will use the lavaan package in R to fit a CFA model to the scores dataset. lavaan is a package for fitting a variety of latent variable models, such as confirmatory factor analysis and structural equation models. Let’s install the lavaan package.
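A minimal sketch (installation is needed only once; loading is needed in every session):
install.packages("lavaan")  # install once
library(lavaan)             # load the package for this session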
Before fitting the model, change the variable names in the scores data.
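The renaming code is not shown in the text; one plausible way to do it (the exact call is an assumption) is:
names(scores)                      # original variable names
names(scores) <- paste0("x", 1:8)  # rename test.1, ..., test.8 to x1, ..., x8
names(scores)                      # new variable names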
## [1] "test.1" "test.2" "test.3"
## [4] "test.4" "test.5" "test.6"
## [7] "test.7" "test.8"
## [1] "x1" "x2" "x3" "x4" "x5"
## [6] "x6" "x7" "x8"
Fitting a CFA model in lavaan requires a special model syntax. Let’s say we fit a 2-factor CFA model where the first factor f1 is measured by x1, x2, x4, and x6, and the second factor f2 is measured by x3, x5, x7, and x8.
model <- '
f1 =~ x1 + x2 + x4 + x6
f2 =~ x3 + x5 + x7 + x8
'
fit <- cfa(model = model, data = scores)
class(fit)
## [1] "lavaan"
## attr(,"package")
## [1] "lavaan"
In the model, =~ indicates that the latent factor on the left-hand side is measured by the observed indicators listed on the right-hand side. One can also specify the covariance between two variables (observed or latent) using ~~. For identifiability, the first loading of each factor is fixed at \(1\) by default.
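The output below can be produced with summary(); a sketch, storing the summary first so that its class can be inspected as in the output (the object name fit_summary is an assumption):
fit_summary <- summary(fit)  # summarize the fitted lavaan model
class(fit_summary)           # a "lavaan.summary" list
fit_summary                  # print the full summary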
## [1] "lavaan.summary"
## [2] "list"
## lavaan 0.6.16 ended normally after 31 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 17
##
## Number of observations 100
##
## Model Test User Model:
##
## Test statistic 18.461
## Degrees of freedom 19
## P-value (Chi-square) 0.492
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   f1 =~
##     x1                1.000
##     x2                1.044    0.086   12.178    0.000
##     x4                0.929    0.082   11.372    0.000
##     x6                0.959    0.074   13.006    0.000
##   f2 =~
##     x3                1.000
##     x5                1.039    0.074   14.113    0.000
##     x7                0.947    0.068   13.833    0.000
##     x8                0.993    0.074   13.334    0.000
##
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   f1 ~~
##     f2               -0.019    0.072   -0.261    0.794
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.208    0.039    5.336    0.000
##    .x2                0.218    0.041    5.269    0.000
##    .x4                0.230    0.040    5.733    0.000
##    .x6                0.129    0.029    4.503    0.000
##    .x3                0.141    0.029    4.928    0.000
##    .x5                0.167    0.033    5.124    0.000
##    .x7                0.151    0.029    5.288    0.000
##    .x8                0.192    0.035    5.537    0.000
##     f1                0.686    0.125    5.464    0.000
##     f2                0.669    0.115    5.835    0.000
##    lhs op rhs exo         est         se          z       pvalue
## 1   f1 =~  x1   0  1.00000000 0.00000000         NA           NA
## 2   f1 =~  x2   0  1.04383789 0.08571524 12.1779725 0.000000e+00
## 3   f1 =~  x4   0  0.92940237 0.08172641 11.3721184 0.000000e+00
## 4   f1 =~  x6   0  0.95915104 0.07374465 13.0063821 0.000000e+00
## 5   f2 =~  x3   0  1.00000000 0.00000000         NA           NA
## 6   f2 =~  x5   0  1.03850718 0.07358280 14.1134511 0.000000e+00
## 7   f2 =~  x7   0  0.94689409 0.06845160 13.8330457 0.000000e+00
## 8   f2 =~  x8   0  0.99291302 0.07446644 13.3336985 0.000000e+00
## 9   x1 ~~  x1   0  0.20750205 0.03888987  5.3356330 9.521182e-08
## 10  x2 ~~  x2   0  0.21807407 0.04139139  5.2685852 1.374792e-07
## 11  x4 ~~  x4   0  0.22968811 0.04006638  5.7326890 9.885080e-09
## 12  x6 ~~  x6   0  0.12926998 0.02870499  4.5033979 6.687552e-06
## 13  x3 ~~  x3   0  0.14102341 0.02861807  4.9277744 8.317158e-07
## 14  x5 ~~  x5   0  0.16717930 0.03262978  5.1235183 2.998865e-07
## 15  x7 ~~  x7   0  0.15143132 0.02863529  5.2882767 1.234741e-07
## 16  x8 ~~  x8   0  0.19247750 0.03476103  5.5371636 3.074093e-08
## 17  f1 ~~  f1   0  0.68559785 0.12546986  5.4642433 4.648857e-08
## 18  f2 ~~  f2   0  0.66937669 0.11471032  5.8353661 5.367258e-09
## 19  f1 ~~  f2   0 -0.01886384 0.07222424 -0.2611844 7.939503e-01