Factor Analysis
Factor analysis (FA) assumes that there are a number of underlying (latent) factors affecting the observed scores on items/tests. In other words, the traits underlying a test might be multidimensional.
When a new measure is established, researchers often make assumptions about the traits that each item/subtest measures. If the items indeed load on the factors in the way proposed, the test is said to have high factorial validity.
Factor analysis can be divided into two types:
Exploratory factor analysis (EFA): a method for exploring the underlying structure of a set of observed variables; it is a crucial step in the scale development process. For example, how many factors are needed to sufficiently explain the observed score variance? What is the relationship between each item and each factor?
Confirmatory factor analysis (CFA): a method for verifying a hypothesized factor structure. In CFA, theory specifies which factor(s) each item should load on; the remaining loadings are restricted to 0. You then check whether the hypothesized model fits the data well.
7.5 Exploratory Factor Analysis (EFA)
If 2 items/tests have high observed-score correlations, they might be related to (or load on) some common factor. To get a general idea about the correlations between items/subtests, we can use the cor() function in R to obtain a correlation matrix.
For example, the scores.txt file provides (fictitious) scores of 100 subjects on 8 tests.
scores <- read.table("https://raw.githubusercontent.com/sunbeomk/PSYC490/main/scores.txt")
head(scores)
##   test.1 test.2 test.3 test.4 test.5 test.6 test.7 test.8
## 1      5      4      3      5      4      4      3      3
## 2      5      6      4      5      4      5      4      4
## 3      4      3      3      4      4      3      3      4
## 4      4      4      4      3      5      4      4      5
## 5      4      5      2      4      3      4      2      3
## 6      5      5      5      5      6      5      5      6
We can obtain the score correlations between tests with the cor() function.
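The exact call is not shown in the text; a call like the following (rounding to two decimals only for readability) would reproduce the matrix below:
round(cor(scores), 2)  # correlation matrix of the eight test scores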
##        test.1 test.2 test.3 test.4 test.5 test.6 test.7 test.8
## test.1   1.00   0.74   0.01   0.79  -0.03   0.79   0.01   0.04
## test.2   0.74   1.00  -0.02   0.74  -0.03   0.82  -0.03  -0.01
## test.3   0.01  -0.02   1.00   0.00   0.83  -0.07   0.80   0.81
## test.4   0.79   0.74   0.00   1.00   0.00   0.75   0.04   0.00
## test.5  -0.03  -0.03   0.83   0.00   1.00  -0.08   0.82   0.77
## test.6   0.79   0.82  -0.07   0.75  -0.08   1.00  -0.06  -0.01
## test.7   0.01  -0.03   0.80   0.04   0.82  -0.06   1.00   0.80
## test.8   0.04  -0.01   0.81   0.00   0.77  -0.01   0.80   1.00
The correlation matrix suggests that:
- tests \(1, 2, 4, 6\) are highly correlated, and
- tests \(3, 5, 7, 8\) are also highly correlated.
- The test correlations across these two groups are almost all zero.
This suggests that there may be \(2\) underlying factors: tests \(1\), \(2\), \(4\), and \(6\) may load on one, and tests \(3\), \(5\), \(7\), and \(8\) on the other.
7.5.1 The factanal() function
To conduct FA, we can use R’s built-in factanal() function.
Let’s first try an EFA with 1 factor using the scores data. Here we assume there is only one underlying factor.
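The code chunk that produced the output below is not shown; a plausible reconstruction (the object name fa1 is an assumption) is:
fa1 <- factanal(x = scores, factors = 1)  # EFA with a single factor
class(fa1)                                # the result is a "factanal" object
fa1                                       # print the fitted model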
## [1] "factanal"
##
## Call:
## factanal(x = scores, factors = 1)
##
## Uniquenesses:
## test.1 test.2 test.3 test.4
## 1.000 0.999 0.174 1.000
## test.5 test.6 test.7 test.8
## 0.188 0.995 0.201 0.226
##
## Loadings:
## Factor1
## test.1
## test.2
## test.3 0.909
## test.4
## test.5 0.901
## test.6
## test.7 0.894
## test.8 0.880
##
## Factor1
## SS loadings 3.216
## Proportion Var 0.402
##
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 327.47 on 20 degrees of freedom.
## The p-value is 1.92e-57
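Individual loadings and the variance summaries can also be extracted from the fitted object directly. A minimal sketch, assuming the fit was stored in fa1 as above:
fa1$loadings[1]               # loading of test.1 on the single factor
fa1$loadings[1:8]             # all eight loadings as a numeric vector
sum(fa1$loadings[1:8]^2)      # SS loadings
sum(fa1$loadings[1:8]^2) / 8  # proportion of variance explained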
## [1] 0.003100294
## [1] 0.003100294 -0.028298495
## [3] 0.908886177 0.006756035
## [5] 0.901239123 -0.067765586
## [7] 0.893647841 0.879585184
## [1] 3.216031
## [1] 0.4020039
There are a few things we can see from the outputs:
- Uniquenesses: each indicator (test)’s uniqueness is the proportion of its variance left unexplained by the common factor.
- Loadings: you can see that some loadings are blank. In fact, loadings smaller than .1 in absolute value are omitted (so that a sparse structure is displayed more clearly). Tests 3, 5, 7, and 8 are the only ones with nontrivial loadings on the single factor.
- Looking at Proportion Var: about 40.2% of the variance in the data is explained by this single-factor model.
To choose the number of factors, you can gradually increase the number of factors and observe the change in the proportion of variance explained.
Now, rerun factanal(), but this time assuming there are 2 underlying factors:
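As before, the code chunk itself is not shown; a plausible call, storing the fit in a hypothetical object fa2, is:
fa2 <- factanal(x = scores, factors = 2)  # EFA with two factors
fa2                                       # print the fitted model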
##
## Call:
## factanal(x = scores, factors = 2)
##
## Uniquenesses:
## test.1 test.2 test.3 test.4
## 0.230 0.227 0.174 0.279
## test.5 test.6 test.7 test.8
## 0.187 0.166 0.201 0.226
##
## Loadings:
##        Factor1 Factor2
## test.1          0.877
## test.2          0.879
## test.3  0.909
## test.4          0.849
## test.5  0.901
## test.6          0.910
## test.7  0.894
## test.8  0.879
##
##                Factor1 Factor2
## SS loadings      3.217   3.092
## Proportion Var   0.402   0.387
## Cumulative Var   0.402   0.789
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 15.18 on 13 degrees of freedom.
## The p-value is 0.296
With 2 factors, the proportion of explained variance (Cumulative Var) increased to \(78.9\%\).
7.5.2 EFA with Covariance Matrix
If the complete response data are not available and we only have the covariance (or correlation) matrix of the indicators, we can still conduct factor analysis using that matrix as the input.
The cov() function creates a covariance matrix from a data frame.
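A sketch of the computation (the object name cov_mat matches the call shown later in the output; the rounding is only for display):
cov_mat <- cov(scores)  # covariance matrix of the eight tests
round(cov_mat, 2)       # print, rounded to two decimals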
##        test.1 test.2 test.3 test.4 test.5 test.6 test.7 test.8
## test.1   0.90   0.70   0.01   0.68  -0.02   0.66   0.01   0.04
## test.2   0.70   0.97  -0.01   0.67  -0.03   0.71  -0.02  -0.01
## test.3   0.01  -0.01   0.82   0.00   0.71  -0.06   0.63   0.68
## test.4   0.68   0.67   0.00   0.83   0.00   0.60   0.03   0.00
## test.5  -0.02  -0.03   0.71   0.00   0.90  -0.07   0.67   0.68
## test.6   0.66   0.71  -0.06   0.60  -0.07   0.77  -0.05  -0.01
## test.7   0.01  -0.02   0.63   0.03   0.67  -0.05   0.76   0.64
## test.8   0.04  -0.01   0.68   0.00   0.68  -0.01   0.64   0.86
We can use the covariance matrix as the input to the factanal() function. This time, however, we need to specify two things. First, pass the matrix through the covmat = argument so that factanal() knows the input is a covariance matrix rather than raw data. Second, supply n.obs =, the sample size that was used to compute the covariance matrix.
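The corresponding call, as reflected in the Call line of the output below, is:
factanal(factors = 2, covmat = cov_mat, n.obs = nrow(scores))  # EFA from a covariance matrix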
##
## Call:
## factanal(factors = 2, covmat = cov_mat, n.obs = nrow(scores))
##
## Uniquenesses:
## test.1 test.2 test.3 test.4
## 0.230 0.227 0.174 0.279
## test.5 test.6 test.7 test.8
## 0.187 0.166 0.201 0.226
##
## Loadings:
##        Factor1 Factor2
## test.1          0.877
## test.2          0.879
## test.3  0.909
## test.4          0.849
## test.5  0.901
## test.6          0.910
## test.7  0.894
## test.8  0.879
##
##                Factor1 Factor2
## SS loadings      3.217   3.092
## Proportion Var   0.402   0.387
## Cumulative Var   0.402   0.789
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 15.18 on 13 degrees of freedom.
## The p-value is 0.296
7.5.3 Factor scores
In addition to the loading structure, you may also want to know each individual’s factor scores. For example, what is examinee 1’s score on factor 1? On factor 2?
These can be obtained from the factanal() output if you additionally specify scores = 'regression'.
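A sketch of the call (the object name fa2_scores is an assumption; head() shows the first six examinees, matching the output below):
fa2_scores <- factanal(scores, factors = 2, scores = "regression")  # request regression factor scores
head(fa2_scores$scores)                                             # factor scores of the first six examinees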
## Factor1 Factor2
## 1 -0.88097000 0.03016465
## 2 -0.04549636 0.98800767
## 3 -0.62996212 -1.09674458
## 4 0.49201233 -0.61045599
## 5 -1.80701705 -0.21758291
## 6 1.63726192 0.79152799
Each individual’s indicator/subtest scores are then modeled as weighted sums of their factor scores, where the weights are determined by the factor loadings (plus a unique part for each indicator).
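In the two-factor case, a sketch of this relationship for the score on test \(j\) is \(x_j = \lambda_{j1} F_1 + \lambda_{j2} F_2 + u_j\), where \(\lambda_{j1}\) and \(\lambda_{j2}\) are test \(j\)’s loadings, \(F_1\) and \(F_2\) are the factor scores, and \(u_j\) is the unique part of test \(j\).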
7.5.4 Rotational Indeterminacy
There exist infinitely many loading solutions that fit the data equally well in EFA. This is called rotational indeterminacy. We can therefore rotate the factors so that the loadings are as close as possible to a desired structure.
There are two types of rotation methods:
Orthogonal: The underlying factors after rotation will be uncorrelated.
Oblique: The underlying factors after rotation can be correlated.
The default rotation in the factanal() function is "varimax", which is one of the orthogonal rotation methods. Let’s check the correlation between the factor scores obtained from an orthogonal rotation.
varimax <- factanal(scores,
factors = 2,
rotation="varimax",
scores="regression")
cor(varimax$scores)
##              Factor1      Factor2
## Factor1  1.000000000 -0.001527159
## Factor2 -0.001527159  1.000000000
One of the oblique rotation methods is "promax" rotation, which allows the factors to be correlated with each other.
promax <- factanal(scores,
factors = 2,
rotation = "promax",
scores = "regression")
cor(promax$scores)
## Factor1 Factor2
## Factor1 1.00000000 0.01563374
## Factor2 0.01563374 1.00000000
7.5.5 The fa() function in the psych package
The psych package provides another useful function for factor analysis. You can fit an EFA model with the fa() function. For example, the following code chunk fits a 2-factor model with "varimax" rotation.
library(psych)
output2 <- fa(scores, # input data
nfactors = 2, # number of factors
rotate = "varimax", # rotation
scores = "regression") # factor score estimation
output2$loadings # factor loadings
##
## Loadings:
##          MR1    MR2
## test.1         0.882
## test.2         0.872
## test.3  0.909
## test.4         0.854
## test.5  0.900
## test.6         0.907
## test.7  0.896
## test.8  0.879
##
## MR1 MR2
## SS loadings 3.217 3.093
## Proportion Var 0.402 0.387
## Cumulative Var 0.402 0.789
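Other quantities can be pulled out of the fitted object. A sketch that matches the output below, assuming the fitted object output2 from the chunk above (component names as documented in psych, though they may differ slightly across versions):
output2$uniquenesses                        # proportion of variance unique to each test
output2$communality                         # proportion of variance shared with the factors
output2$communality + output2$uniquenesses  # sums to (approximately) 1 for each test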
## test.1 test.2 test.3
## 0.2214045 0.2375525 0.1730854
## test.4 test.5 test.6
## 0.2706205 0.1907075 0.1720624
## test.7 test.8
## 0.1974814 0.2267084
## test.1 test.2 test.3
## 0.7785955 0.7624475 0.8269146
## test.4 test.5 test.6
## 0.7293795 0.8092925 0.8279376
## test.7 test.8
## 0.8025186 0.7732916
## test.1 test.2 test.3
## 1.0000000 1.0000000 1.0000000
## test.4 test.5 test.6
## 0.9999999 1.0000000 0.9999999
## test.7 test.8
## 1.0000000 1.0000000
7.6 Confirmatory Factor Analysis (CFA)
We can use CFA to test hypotheses about the structure of the latent factors.
We will use the lavaan package in R to fit a CFA model to the scores dataset. lavaan is a package for fitting a variety of latent variable models, such as confirmatory factor analysis and structural equation models. Let’s install the lavaan package.
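A minimal sketch (installation is needed only once; loading is needed in every session):
install.packages("lavaan")  # install once
library(lavaan)             # load the package for this session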
Before fitting the model, change the variable names in the scores data.
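The renaming code is not shown in the text; one plausible way to do it (the exact call is an assumption) is:
names(scores)                      # original variable names
names(scores) <- paste0("x", 1:8)  # rename test.1, ..., test.8 to x1, ..., x8
names(scores)                      # new variable names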
## [1] "test.1" "test.2" "test.3"
## [4] "test.4" "test.5" "test.6"
## [7] "test.7" "test.8"
## [1] "x1" "x2" "x3" "x4" "x5"
## [6] "x6" "x7" "x8"
Fitting a CFA model in lavaan requires a special model syntax. Let’s say we fit a 2-factor CFA model where the first factor f1 is measured by x1, x2, x4, and x6, and the second factor f2 is measured by x3, x5, x7, and x8.
model <- '
f1 =~ x1 + x2 + x4 + x6
f2 =~ x3 + x5 + x7 + x8
'
fit <- cfa(model = model, data = scores)
class(fit)
## [1] "lavaan"
## attr(,"package")
## [1] "lavaan"
In the model, =~ indicates that the latent factor on the left-hand side is measured by the observed indicators listed on the right-hand side. One can also specify the covariance between two variables (observed or latent) using ~~. For identifiability, the first loading of each factor is fixed at \(1\) by default.
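The output below can be produced with summary(); a sketch, storing the summary first so that its class can be inspected as in the output (the object name fit_summary is an assumption):
fit_summary <- summary(fit)  # summarize the fitted lavaan model
class(fit_summary)           # a "lavaan.summary" list
fit_summary                  # print the full summary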
## [1] "lavaan.summary"
## [2] "list"
## lavaan 0.6.16 ended normally after 31 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 17
##
## Number of observations 100
##
## Model Test User Model:
##
## Test statistic 18.461
## Degrees of freedom 19
## P-value (Chi-square) 0.492
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   f1 =~
##     x1                1.000
##     x2                1.044    0.086   12.178    0.000
##     x4                0.929    0.082   11.372    0.000
##     x6                0.959    0.074   13.006    0.000
##   f2 =~
##     x3                1.000
##     x5                1.039    0.074   14.113    0.000
##     x7                0.947    0.068   13.833    0.000
##     x8                0.993    0.074   13.334    0.000
##
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   f1 ~~
##     f2               -0.019    0.072   -0.261    0.794
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.208    0.039    5.336    0.000
##    .x2                0.218    0.041    5.269    0.000
##    .x4                0.230    0.040    5.733    0.000
##    .x6                0.129    0.029    4.503    0.000
##    .x3                0.141    0.029    4.928    0.000
##    .x5                0.167    0.033    5.124    0.000
##    .x7                0.151    0.029    5.288    0.000
##    .x8                0.192    0.035    5.537    0.000
##     f1                0.686    0.125    5.464    0.000
##     f2                0.669    0.115    5.835    0.000
##    lhs op rhs exo         est         se          z       pvalue
## 1   f1 =~  x1   0  1.00000000 0.00000000         NA           NA
## 2   f1 =~  x2   0  1.04383789 0.08571524 12.1779725 0.000000e+00
## 3   f1 =~  x4   0  0.92940237 0.08172641 11.3721184 0.000000e+00
## 4   f1 =~  x6   0  0.95915104 0.07374465 13.0063821 0.000000e+00
## 5   f2 =~  x3   0  1.00000000 0.00000000         NA           NA
## 6   f2 =~  x5   0  1.03850718 0.07358280 14.1134511 0.000000e+00
## 7   f2 =~  x7   0  0.94689409 0.06845160 13.8330457 0.000000e+00
## 8   f2 =~  x8   0  0.99291302 0.07446644 13.3336985 0.000000e+00
## 9   x1 ~~  x1   0  0.20750205 0.03888987  5.3356330 9.521182e-08
## 10  x2 ~~  x2   0  0.21807407 0.04139139  5.2685852 1.374792e-07
## 11  x4 ~~  x4   0  0.22968811 0.04006638  5.7326890 9.885080e-09
## 12  x6 ~~  x6   0  0.12926998 0.02870499  4.5033979 6.687552e-06
## 13  x3 ~~  x3   0  0.14102341 0.02861807  4.9277744 8.317158e-07
## 14  x5 ~~  x5   0  0.16717930 0.03262978  5.1235183 2.998865e-07
## 15  x7 ~~  x7   0  0.15143132 0.02863529  5.2882767 1.234741e-07
## 16  x8 ~~  x8   0  0.19247750 0.03476103  5.5371636 3.074093e-08
## 17  f1 ~~  f1   0  0.68559785 0.12546986  5.4642433 4.648857e-08
## 18  f2 ~~  f2   0  0.66937669 0.11471032  5.8353661 5.367258e-09
## 19  f1 ~~  f2   0 -0.01886384 0.07222424 -0.2611844 7.939503e-01