Exploratory Factor Analysis

Exploratory Methods

Prof. Dr. Armin Eichinger

TH Deggendorf

21.11.2024

Introduction

General

Exploratory statistical technique
Goal: classify variables into mutually independent groups according to their correlations
To do this, an artificial variable (= factor) is determined that correlates as highly as possible with the observed variables

General

Once the influence of the first factor is removed from the correlations among the variables, only the correlation that this factor cannot explain remains
A further factor is then extracted to explain this residual correlation
Result: mutually independent factors that explain the relationships among the variables
The higher the correlations among the variables, the fewer factors are needed
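
The extract-then-remove logic described above can be sketched in a few lines of R. This is only an illustration under assumptions: the correlation matrix `R` is made up, and the first factor is taken from an eigendecomposition (principal-component style) rather than from the minres method used by `fa()` in the examples.

```r
# Hypothetical correlation matrix of four variables (illustrative values only)
R <- matrix(c(1.0, 0.6, 0.5, 0.4,
              0.6, 1.0, 0.5, 0.4,
              0.5, 0.5, 1.0, 0.3,
              0.4, 0.4, 0.3, 1.0), nrow = 4, byrow = TRUE)

e <- eigen(R)

# Loadings on the first factor: first eigenvector scaled by sqrt(eigenvalue)
a1 <- e$vectors[, 1] * sqrt(e$values[1])

# Correlations reproduced by the first factor alone
R_hat <- tcrossprod(a1)

# Residual correlations: what the first factor cannot explain
R_res <- R - R_hat

# Compare the average absolute off-diagonal correlation before and after
off <- upper.tri(R)
mean(abs(R[off]))
mean(abs(R_res[off]))
```

A second factor would then be extracted from `R_res` in the same way.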

Motivation

Data reduction without major loss of information
Generating hypotheses about the structure of the attributes
Determining the dimensionality of complex attributes
Checking (uni-)dimensionality

Procedure

Variable selection and correlation matrix
Determining the number of factors (parallel analysis)
Extracting the factors
Interpreting the factors (usually after rotation)
Determining the factor scores
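
The five steps can be walked through with base R alone. A minimal sketch, assuming simulated data (`dat`, `f1`, `f2` are hypothetical) and using `factanal()` (maximum likelihood) in place of the `psych::fa()` call used in the examples; the number of factors is simply fixed at 2 here.

```r
set.seed(1)
n <- 300

# Two hypothetical, independent latent factors
f1 <- rnorm(n)
f2 <- rnorm(n)

# Six observed variables: x1-x3 driven by f1, x4-x6 driven by f2, plus noise
dat <- data.frame(
  x1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  x2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  x3 = 0.6 * f1 + rnorm(n, sd = 0.8),
  x4 = 0.8 * f2 + rnorm(n, sd = 0.6),
  x5 = 0.7 * f2 + rnorm(n, sd = 0.7),
  x6 = 0.6 * f2 + rnorm(n, sd = 0.8)
)

# Step 1: correlation matrix
R <- cor(dat)

# Step 2: number of factors (fixed by assumption in this sketch)
k <- 2

# Steps 3 to 5: extraction, varimax rotation, regression factor scores
fit <- factanal(dat, factors = k, rotation = "varimax", scores = "regression")

# Step 4: inspect the rotated loadings, hiding small values
print(fit$loadings, cutoff = 0.3)

# Step 5: factor scores, one row per observation
head(fit$scores)
```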

Key terms

Factor loading
Correlation between a variable and a factor
Communality
Sum of a variable's squared factor loadings across all factors; the extent to which the factors capture the variance of that variable
Eigenvalue
How much of the total variance of all variables a given factor captures
Factor scores
How strongly the attributes combined in a factor are expressed in an object: the object's position on the factor
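
These definitions translate directly into row and column sums of the loading matrix. A small sketch with a made-up loading matrix `L` (all values are hypothetical); note that for rotated solutions the column sums are usually reported as "SS loadings" rather than eigenvalues.

```r
# Hypothetical loading matrix: 4 variables (rows) x 2 factors (columns)
L <- matrix(c(0.9, 0.1,
              0.8, 0.2,
              0.1, 0.7,
              0.2, 0.6), ncol = 2, byrow = TRUE,
            dimnames = list(paste0("v", 1:4), c("F1", "F2")))

# Communality: sum of squared loadings per variable (row)
h2 <- rowSums(L^2)

# Uniqueness: the share of variance the factors do not capture
u2 <- 1 - h2

# Variance captured per factor: sum of squared loadings per column
ss_loadings <- colSums(L^2)

# Share of total variance (each standardized variable contributes 1)
prop_var <- ss_loadings / nrow(L)

h2            # 0.82 0.68 0.50 0.40
ss_loadings   # 1.5 0.9
prop_var      # 0.375 0.225
```

The same arithmetic appears in the `fa()` output of the regions example: the SS loadings of 3.25 divided by 6 variables give the reported Proportion Var of 0.54.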

Implementation

Suitability of the correlation matrix

Bartlett test: tests the null hypothesis that the variables come from an uncorrelated population
Measure of Sampling Adequacy (MSA), per variable (for those interested: diagonal element of the anti-image correlation matrix; \(MSA_i = \frac{\sum_{j \neq i} r_{ij}^2}{\sum_{j \neq i} r_{ij}^2 + \sum_{j \neq i} q_{ij}^2}\), where \(r_{ij}\) are the correlations and \(q_{ij}\) the partial correlations)
Kaiser-Meyer-Olkin criterion: the same measure for the matrix as a whole
Guide to interpreting the values:

MSA	Rating
MSA ≥ 0.9	marvelous
MSA ≥ 0.8	meritorious
MSA ≥ 0.7	middling
MSA ≥ 0.6	mediocre
MSA ≥ 0.5	miserable
MSA < 0.5	unacceptable
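
Both suitability checks can be computed by hand from a correlation matrix. The following is a sketch under assumptions, not the `psych` implementation: the matrix `R` and the sample size `n` are made up, `bartlett_sketch()` and `msa_sketch()` are hypothetical helper names, and the Bartlett statistic uses the usual chi-square approximation \(-(n - 1 - (2p + 5)/6)\,\ln|R|\) with \(p(p-1)/2\) degrees of freedom.

```r
# Hypothetical correlation matrix of four variables and hypothetical sample size
R <- matrix(c(1.0, 0.6, 0.5, 0.4,
              0.6, 1.0, 0.5, 0.4,
              0.5, 0.5, 1.0, 0.3,
              0.4, 0.4, 0.3, 1.0), nrow = 4, byrow = TRUE)
n <- 100

# Bartlett's test of sphericity (H0: the population correlation matrix is
# the identity matrix, i.e. the variables are uncorrelated)
bartlett_sketch <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# MSA per variable and overall KMO, following the formula given above
msa_sketch <- function(R) {
  # Partial correlations q_ij from the scaled inverse correlation matrix
  Q <- -cov2cor(solve(R))
  diag(Q) <- 0
  R0 <- R
  diag(R0) <- 0
  r2 <- rowSums(R0^2)  # sum over j != i of r_ij^2
  q2 <- rowSums(Q^2)   # sum over j != i of q_ij^2
  list(MSA = r2 / (r2 + q2),
       KMO = sum(r2) / (sum(r2) + sum(q2)))
}

bt <- bartlett_sketch(R, n)
ms <- msa_sketch(R)
bt$df   # 6 for p = 4 variables
ms$MSA
ms$KMO
```

The degrees-of-freedom formula agrees with the example outputs: 6 variables give df = 15, and 20 items give df = 190.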

Number of factors

[so-so…] Kaiser(-Guttman) criterion: only factors with an eigenvalue greater than 1 are retained
[so-so…] Scree test: retain the factors before the elbow (kink) in the eigenvalue plot
[yes!] Parallel analysis: retain only the factors whose eigenvalues lie to the left of/above the intersection with the line simulated from random data
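
The idea behind parallel analysis can be reproduced in base R. This is a simplified sketch, assuming simulated data and a hypothetical helper `parallel_sketch()`: it compares the eigenvalues of the full correlation matrix with the 95th percentile of eigenvalues from random normal data, whereas `fa.parallel(..., fa = "fa")` works with factor eigenvalues and may therefore give different counts. The Kaiser count is returned for comparison.

```r
set.seed(42)

# Simulated data with two hypothetical latent factors (three indicators each)
n  <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
noise <- function() rnorm(n, sd = 0.6)
dat <- data.frame(x1 = 0.8 * f1 + noise(), x2 = 0.8 * f1 + noise(),
                  x3 = 0.8 * f1 + noise(), x4 = 0.8 * f2 + noise(),
                  x5 = 0.8 * f2 + noise(), x6 = 0.8 * f2 + noise())

parallel_sketch <- function(dat, n_sim = 200) {
  n <- nrow(dat)
  p <- ncol(dat)
  # Observed eigenvalues of the correlation matrix
  observed <- eigen(cor(dat), only.values = TRUE)$values
  # Eigenvalues of n_sim random, uncorrelated data sets of the same size
  simulated <- replicate(
    n_sim,
    eigen(cor(matrix(rnorm(n * p), nrow = n)), only.values = TRUE)$values
  )
  # Reference line: 95th percentile of the simulated eigenvalues per position
  reference <- apply(simulated, 1, quantile, probs = 0.95)
  # Keep the leading eigenvalues that lie above the reference line
  first_below <- which(observed <= reference)[1]
  n_parallel <- if (is.na(first_below)) p else first_below - 1
  list(observed = observed, reference = reference,
       n_parallel = n_parallel, n_kaiser = sum(observed > 1))
}

res <- parallel_sketch(dat)
res$n_parallel  # number of factors suggested by the parallel comparison
res$n_kaiser    # number of eigenvalues greater than 1
```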

Rotation

Performed to make the factors easier to interpret
Varimax: orthogonal rotation
Under the varimax criterion, the factors are rotated orthogonally so that the variance of the squared loadings within each factor is maximized
Non-orthogonal (oblique) rotation methods also exist but are not covered here
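
What an orthogonal varimax rotation does to a loading matrix can be checked with `stats::varimax()`. A minimal sketch with made-up unrotated loadings `A`; Kaiser normalization is switched off (`normalize = FALSE`) so that the raw criterion named above is the one being maximized, and `crit()` is a hypothetical helper that evaluates it.

```r
# Hypothetical unrotated loadings: every variable loads on both factors
A <- matrix(c(0.7,  0.5,
              0.7,  0.4,
              0.6, -0.5,
              0.6, -0.6), ncol = 2, byrow = TRUE,
            dimnames = list(paste0("v", 1:4), c("F1", "F2")))

rot <- varimax(A, normalize = FALSE)
B   <- unclass(rot$loadings)  # rotated loadings as a plain matrix
Tm  <- rot$rotmat             # rotation matrix

# Varimax criterion: variance of the squared loadings, summed over factors
crit <- function(L) sum(apply(L^2, 2, var))

round(B, 2)        # each variable now loads mainly on one factor
crossprod(Tm)      # approximately the identity: the rotation is orthogonal
rowSums(A^2)       # communalities before rotation
rowSums(B^2)       # communalities after rotation (unchanged)
c(before = crit(A), after = crit(B))
```

Because the rotation is orthogonal, it changes how the explained variance is distributed across the factors but not the communality of any variable.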

Example: Regions 🗺️

Data

For 12 regions we have the following variables:

Population density (inhabitants per km²): ED
Gross domestic product (GDP) per capita (DM): BIP
Share of the workforce employed in agriculture (percent): EL
GDP growth rate (over the last ten years): WBIP
Crude birth rate (live births per 1,000 inhabitants): GEB
Net migration (in-migration minus out-migration per 1,000 inhabitants): WS

Question: Is there a more parsimonious structure underlying these variables?

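library(psych)  # provides cortest.bartlett(), KMO(), fa.parallel(), fa()
# Semicolon-separated file with decimal commas
regionen_data <- read.csv("./data/regionen.csv", sep = ";", dec = ",")
head(regionen_data)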

     ED   BIP   EL WBIP  GEB   WS
1 212.4 20116  9.8 53.0  8.4 -0.7
2 623.7 24966  3.4 73.1  6.1  3.4
3  93.1 19324 23.6 47.9 12.3 -1.9
4 236.8 23113  8.7 66.8  8.7  2.0
5 412.0 23076  8.9 46.9  8.0 -3.1
6 566.7 24516  6.1 44.3  8.6 -3.0

# Bartlett test
cortest.bartlett(regionen_data)

$chisq
[1] 48.88061

$p.value
[1] 1.831945e-05

$df
[1] 15

# Correlation matrix
cor.plot(regionen_data)

Analysis: Number of Factors

Number of factors: parallel analysis

# Parallel analysis
fa.parallel(regionen_data, fa="fa")

Parallel analysis suggests that the number of factors =  1  and the number of components =  NA

kmo_result <- KMO(regionen_data)
kmo_result

Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = regionen_data)
Overall MSA =  0.69
MSA for each item = 
  ED  BIP   EL WBIP  GEB   WS 
0.80 0.76 0.77 0.48 0.75 0.48

Analysis: EFA (code only)

# How many factors? (cf. parallel analysis; 2 are extracted here), varimax rotation
efa_result <- fa(regionen_data, nfactors = 2, rotate = "varimax")

print(efa_result, digits=2, cut=0.3, sort=TRUE)

# Compute factor scores if needed
factor_scores <- factor.scores(regionen_data,f=efa_result) 
head(factor_scores$scores)

# Diagram of the factor loadings
fa.diagram(efa_result)

# Further plots

# Set axis limits
xlim = c(-2, 2)
ylim = c(-1.5, 1.5)

# Regions in factor space (factor scores)
plot(factor_scores$scores, xlim=xlim,ylim=ylim)
text(factor_scores$scores, labels = c(1:12), cex = 0.9, pos = 1, font = 1, col = "black")

# Set axis limits
xlim = c(-1, 1)
ylim = c(-1, 1.5)

# Variables in factor space (loadings)
plot(efa_result$loadings, xlim=xlim,ylim=ylim)
text(efa_result$loadings, labels = colnames(regionen_data), cex = 0.9, pos = 1, font = 1, col = "black")

Analysis: EFA (code + output)

# How many factors? (cf. parallel analysis; 2 are extracted here), varimax rotation
efa_result <- fa(regionen_data, nfactors = 2, rotate = "varimax")

print(efa_result, digits=2, cut=0.3, sort=TRUE)

Factor Analysis using method =  minres
Call: fa(r = regionen_data, nfactors = 2, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
     item   MR1   MR2   h2     u2 com
ED      1 -0.95       0.92  0.080 1.1
BIP     2 -0.92       0.86  0.142 1.0
EL      3  0.92       0.86  0.139 1.0
GEB     5  0.78       0.70  0.299 1.3
WS      6        0.99 1.01 -0.009 1.1
WBIP    4        0.83 0.69  0.309 1.0

                       MR1  MR2
SS loadings           3.25 1.79
Proportion Var        0.54 0.30
Cumulative Var        0.54 0.84
Proportion Explained  0.64 0.36
Cumulative Proportion 0.64 1.00

Mean item complexity =  1.1
Test of the hypothesis that 2 factors are sufficient.

df null model =  15  with the objective function =  5.99 with Chi Square =  48.88
df of  the model are 4  and the objective function was  0.25 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.04 

The harmonic n.obs is  12 with the empirical chi square  0.12  with prob <  1 
The total n.obs was  12  with Likelihood Chi Square =  1.72  with prob <  0.79 

Tucker Lewis Index of factoring reliability =  1.331
RMSEA index =  0  and the 90 % confidence intervals are  0 0.296
BIC =  -8.22
Fit based upon off diagonal values = 1

# Compute factor scores if needed
factor_scores <- factor.scores(regionen_data,f=efa_result) 
head(factor_scores$scores)

              MR1        MR2
[1,]  0.517827928 -0.6374872
[2,] -1.660317949  1.1297825
[3,]  1.897558319 -1.1397679
[4,]  0.009766179  0.5512555
[5,] -0.327046953 -1.2442510
[6,] -1.021926139 -1.1920421

# Diagram of the factor loadings
fa.diagram(efa_result)

# Further plots

# Set axis limits
xlim = c(-2, 2)
ylim = c(-1.5, 1.5)

# Regions in factor space (factor scores)
plot(factor_scores$scores, xlim=xlim,ylim=ylim)
text(factor_scores$scores, labels = c(1:12), cex = 0.9, pos = 1, font = 1, col = "black")

# Set axis limits
xlim = c(-1, 1)
ylim = c(-1, 1.5)

# Variables in factor space (loadings)
plot(efa_result$loadings, xlim=xlim,ylim=ylim)
text(efa_result$loadings, labels = colnames(regionen_data), cex = 0.9, pos = 1, font = 1, col = "black")

Example: Big Five 🖐

Data

480 students at FU Berlin
Personality questionnaire measuring the Big Five
Goal of the analysis: checking the dimensionality of a scale
Suitability for FA: Bartlett test
A look at the correlations

# Read in the data
bigfive.items <- read.csv("./data/bigfive_items.csv")

# Show the first rows of the data set
head(bigfive.items)

  bf01 bf02 bf03 bf04 bf05 bf06 bf07 bf08 bf09 bf10 bf11 bf12 bf13 bf14 bf15
1    4    2    4    4    4    4    2    4    4    2    5    5    5    4    3
2    5    4    4    4    4    5    4    3    4    3    4    3    5    3    3
3    3    2    2    4    4    4    1    3    4    5    5    5    3    3    3
4    5    5    3    4    5    4    3    4    5    5    3    4    5    4    5
5    3    4    5    5    3    3    4    5    3    5    5    5    3    3    4
6    4    4    1    3    4    4    2    4    5    3    2    1    3    4    5
  bf16 bf17 bf18 bf19 bf20
1    4    5    4    2    3
2    5    4    4    3    4
3    3    2    3    5    3
4    5    5    5    4    3
5    3    3    3    4    4
6    3    2    3    1    3

# Bartlett test
cortest.bartlett(bigfive.items)

$chisq
[1] 3516.047

$p.value
[1] 0

$df
[1] 190

# Correlation matrix
cor.plot(bigfive.items)

Analysis: Number of Factors

# Parallel analysis
fa.parallel(bigfive.items,fa="fa")

Parallel analysis suggests that the number of factors =  5  and the number of components =  NA

kmo_result <- KMO(bigfive.items)
kmo_result

Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = bigfive.items)
Overall MSA =  0.76
MSA for each item = 
bf01 bf02 bf03 bf04 bf05 bf06 bf07 bf08 bf09 bf10 bf11 bf12 bf13 bf14 bf15 bf16 
0.79 0.82 0.74 0.67 0.66 0.76 0.71 0.81 0.66 0.67 0.88 0.72 0.74 0.82 0.78 0.73 
bf17 bf18 bf19 bf20 
0.84 0.83 0.78 0.84

Analysis: EFA

# 5 factors (cf. parallel analysis), varimax rotation
efa_result <- fa(bigfive.items, nfactors = 5, rotate = "varimax")

#print(efa_result)    # hard to read
print(efa_result, digits=2, cut=0.3, sort=TRUE)

Factor Analysis using method =  minres
Call: fa(r = bigfive.items, nfactors = 5, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
     item   MR1   MR3   MR5   MR2   MR4   h2   u2 com
bf07    7  0.87                         0.77 0.23 1.0
bf03    3  0.85                         0.75 0.25 1.1
bf20   20  0.72                         0.54 0.46 1.1
bf11   11  0.54                         0.35 0.65 1.4
bf09    9        0.87                   0.78 0.22 1.0
bf05    5        0.84                   0.72 0.28 1.1
bf18   18        0.56                   0.38 0.62 1.4
bf14   14        0.53                   0.32 0.68 1.3
bf08    8              0.69             0.52 0.48 1.1
bf15   15              0.66             0.44 0.56 1.0
bf17   17              0.66             0.50 0.50 1.3
bf02    2              0.62             0.48 0.52 1.5
bf13   13                    0.79       0.63 0.37 1.0
bf16   16                    0.66       0.48 0.52 1.2
bf01    1                    0.64       0.44 0.56 1.1
bf06    6                    0.59       0.43 0.57 1.5
bf04    4                          0.75 0.61 0.39 1.2
bf10   10                          0.74 0.57 0.43 1.1
bf12   12                          0.54 0.36 0.64 1.5
bf19   19                          0.53 0.30 0.70 1.1

                       MR1  MR3  MR5  MR2  MR4
SS loadings           2.40 2.14 2.05 2.03 1.75
Proportion Var        0.12 0.11 0.10 0.10 0.09
Cumulative Var        0.12 0.23 0.33 0.43 0.52
Proportion Explained  0.23 0.21 0.20 0.20 0.17
Cumulative Proportion 0.23 0.44 0.64 0.83 1.00

Mean item complexity =  1.2
Test of the hypothesis that 5 factors are sufficient.

df null model =  190  with the objective function =  7.46 with Chi Square =  3516.05
df of  the model are 100  and the objective function was  0.58 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.04 

The harmonic n.obs is  480 with the empirical chi square  149.48  with prob <  0.00099 
The total n.obs was  480  with Likelihood Chi Square =  272.57  with prob <  6.4e-18 

Tucker Lewis Index of factoring reliability =  0.901
RMSEA index =  0.06  and the 90 % confidence intervals are  0.052 0.069
BIC =  -344.81
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                   MR1  MR3  MR5  MR2  MR4
Correlation of (regression) scores with factors   0.94 0.93 0.88 0.89 0.89
Multiple R square of scores with factors          0.88 0.86 0.78 0.80 0.78
Minimum correlation of possible factor scores     0.76 0.73 0.56 0.60 0.57

# Compute factor scores if needed
factor_scores <- factor.scores(bigfive.items,f=efa_result) 
head(factor_scores$scores)

            MR1        MR3        MR5        MR2        MR4
[1,] -0.2942431  0.5655061 -0.8039519  0.9097265 -0.5314245
[2,]  0.6996562  0.1140782 -1.0950517  1.7576682 -0.3637073
[3,] -1.3632562  0.2396799 -1.8892588 -0.8847961  1.2190297
[4,] -0.7614742  2.0042014  0.5966607  1.2421843  0.6402666
[5,]  1.4406285 -1.3067569 -0.1530218 -1.0222403  1.3640817
[6,] -1.8992144  1.1886144 -0.1339070 -0.8794378 -1.4975414

# Diagram of the factor loadings
fa.diagram(efa_result)

Item Analysis

library(dplyr)  # provides select()
# Select the conscientiousness items
gewiss <- select(bigfive.items, bf03, bf07, bf11, bf20)
alpha(gewiss)


Reliability analysis   
Call: alpha(x = gewiss)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
      0.84      0.84    0.81      0.56 5.2 0.012  3.5 0.83     0.55

    95% confidence boundaries 
         lower alpha upper
Feldt     0.82  0.84  0.86
Duhachek  0.82  0.84  0.86

 Reliability if an item is dropped:
     raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
bf03      0.76      0.75    0.69      0.51 3.1    0.019 0.0168  0.45
bf07      0.75      0.75    0.68      0.50 3.1    0.019 0.0091  0.49
bf11      0.86      0.86    0.82      0.67 6.2    0.011 0.0064  0.65
bf20      0.80      0.80    0.76      0.57 3.9    0.016 0.0295  0.49

 Item statistics 
       n raw.r std.r r.cor r.drop mean  sd
bf03 480  0.88  0.87  0.84   0.76  3.4 1.0
bf07 480  0.88  0.87  0.85   0.77  3.3 1.1
bf11 480  0.70  0.72  0.55   0.51  4.0 0.9
bf20 480  0.82  0.82  0.72   0.67  3.2 1.0

Non missing response frequency for each item
        1    2    3    4    5 miss
bf03 0.03 0.16 0.35 0.32 0.14    0
bf07 0.05 0.19 0.33 0.30 0.12    0
bf11 0.01 0.05 0.19 0.40 0.35    0
bf20 0.04 0.21 0.32 0.32 0.10    0
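
The two alpha values in this output follow from simple formulas. A sketch, assuming the standard definitions: raw alpha from the item covariance matrix, standardized alpha from the number of items and the average inter-item correlation. `cronbach_alpha()`, `std_alpha()` and the toy data `toy` are hypothetical and only serve as illustration.

```r
# Raw Cronbach's alpha from the covariance matrix of the items
cronbach_alpha <- function(items) {
  C <- cov(items)
  k <- ncol(items)
  # 1 - (sum of item variances) / (variance of the sum score), scaled by k/(k-1)
  k / (k - 1) * (1 - sum(diag(C)) / sum(C))
}

# Standardized alpha from the number of items k and the mean correlation r_bar
std_alpha <- function(k, r_bar) k * r_bar / (1 + (k - 1) * r_bar)

# Toy data: two items that differ only by a constant, so alpha is exactly 1
toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(2, 3, 4, 5, 6))
cronbach_alpha(toy)

# Plugging in the values reported above: 4 items, average_r = 0.56
std_alpha(4, 0.56)   # about 0.84, matching std.alpha
```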

… Finally

Recommendations

… for an FA done by the book. If we use an EFA merely as an exploratory tool, these recommendations can be relaxed accordingly.

Metric (interval-scaled) variables
The number of cases should be at least three times the number of variables
For a stable solution: n > 250 (Bühner, 2021)
Interpret only factor loadings > 0.5
Number of factors: parallel analysis, possibly the Kaiser criterion (unsuitable when there are many variables)
Rotation: varimax

Recommendations

Bortz & Schuster (2010):
- A factor can be interpreted if four variables load above 0.60 on it
- In general: ten variables with loadings above 0.40
Hair et al. (2010):

Factor Loading	Sample size
0.30	350
0.35	250
0.40	200
0.45	150
0.50	120
0.55	100
0.60	85
0.65	70
0.70	60
0.75	50
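
The table can be used as a lookup: for a given sample size, which is the smallest loading that still meets the guideline? A minimal sketch; `hair_table` and `min_loading_for_n()` are hypothetical names, and reading each row as "minimum sample size for this loading" is an assumption about how the table is meant to be applied.

```r
# The table from above as a data frame
hair_table <- data.frame(
  loading = c(0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75),
  min_n   = c(350, 250, 200, 150, 120, 100, 85, 70, 60, 50)
)

# Smallest tabulated loading whose required sample size is met by n;
# NA if n is below the smallest tabulated sample size
min_loading_for_n <- function(n) {
  ok <- hair_table$loading[hair_table$min_n <= n]
  if (length(ok) == 0) NA_real_ else min(ok)
}

min_loading_for_n(480)  # 0.30 (sample size of the Big Five example)
min_loading_for_n(100)  # 0.55
min_loading_for_n(12)   # NA   (sample size of the regions example)
```

With only 12 regions, the first example falls below every row of the table, which fits the remark that such an EFA serves as an exploratory tool only.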