```r
n <- seq(100, 1000, 100)
plot(log(n), log((n + 2) / 24))
```
3 Path analysis
An easy and convenient way to represent the relationships among a number of variables is a path diagram; we have seen many of them in past chapters. A path diagram can be viewed as a hypothesized, theory-based model specifying the structure among the variables of interest. We collect data to test whether our sample supports the proposed model. Basically, path analysis is the analysis of these “paths”.
3.1 Path diagram

- Rectangle: observed or manifest variable
- Circle: latent variable (e.g. error, factor)
- Single-headed arrow: linear relationship between two variables; it starts from an independent variable and ends at a dependent variable.
- Double-headed arrow: variance of a variable or covariance between two variables.
3.2 What is the difference between multiple regression and path analysis
- In path analysis, a variable can be both a dependent variable and an independent variable at the same time (e.g. a mediator); in multiple regression, the dependent variable cannot also be an independent variable.
- Multiple regression is more restricted in the types of hypotheses it can test. It can only test whether an independent variable effectively predicts the dependent variable while controlling for the other predictors, whereas path analysis can do more than that, e.g. test a mediation mechanism.
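The mediator idea can be sketched with two ordinary regressions in R. This is a minimal sketch on simulated data; the coefficients 0.5, 0.4, 0.3 and the variable names are made up for illustration:

```r
# A mediator M is the dependent variable in one regression and an
# independent variable in the other -- a structure a single multiple
# regression cannot express.
set.seed(1)
n <- 500
X <- rnorm(n)
M <- 0.5 * X + rnorm(n)               # X -> M
Y <- 0.4 * M + 0.3 * X + rnorm(n)     # M -> Y, plus a direct X -> Y

a <- coef(lm(M ~ X))["X"]             # path X -> M
b <- coef(lm(Y ~ M + X))["M"]         # path M -> Y, controlling for X
indirect <- unname(a * b)             # mediated effect, near 0.5 * 0.4
```

Here `indirect` estimates the product of the two causal paths, a quantity that no single regression of Y on X and M reports by itself.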
3.3 Effects Decomposition
When doing path analysis, we impose a theoretical structure upon our variables and derive the effects implied by that structure.
3.3.1 Four types of effects
- Causal effect (most important)
- Direct effect: a direct effect is represented by a single causal arrow from an independent variable to a dependent variable, i.e. $X \rightarrow Y$: the change in $X$ will “cause” a change in $Y$;
- Indirect effect: an indirect effect is represented by two or more spliced causal arrows between an independent variable and a dependent variable, i.e. $X \rightarrow M \rightarrow Y$: the causal effect of $X$ on $Y$ is transmitted via $M$;
- Non-causal effect
- Undecomposed effect: when there is a causal effect between an independent variable and a dependent variable, that causal effect accounts for part of the covariance/correlation between the two variables; the rest of their covariance/correlation is the undecomposed effect, because it cannot be decomposed clearly. This usually happens when another independent variable in the same regression model is correlated with the independent variable in question, so we do not know which independent variable should be responsible for the leftover covariance/correlation. I.e., $X_1$ and $X_2$ are two correlated predictors of $Y$: a change in $X_1$ can spread into $X_2$ (because they are correlated), and the change in $X_2$ leads to a change in $Y$;
- Spurious covariance/correlation: the part of the covariance/correlation between two dependent variables that comes from their sharing the same independent variable (or correlated independent variables). I.e., $X$ is a predictor of both $Y_1$ and $Y_2$: a change in $X$ leads to changes in both $Y_1$ and $Y_2$, making $Y_1$ and $Y_2$ seemingly correlated.
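With standardized variables, these decompositions reduce to simple arithmetic on the path coefficients. A minimal numeric sketch (all coefficients are made up for illustration):

```r
# X1 and X2 are correlated predictors of Y: the implied cor(X1, Y)
# splits into a direct effect and an undecomposed part.
r12 <- 0.3                       # correlation between X1 and X2
b1  <- 0.5                       # direct effect X1 -> Y
b2  <- 0.4                       # direct effect X2 -> Y

direct       <- b1               # single arrow X1 -> Y
undecomposed <- r12 * b2         # leftover routed through the correlated X2
r_x1_y <- direct + undecomposed  # implied cor(X1, Y) = 0.62

# X predicts both Y1 and Y2: their implied correlation is purely spurious.
c1 <- 0.7; c2 <- 0.6             # X -> Y1 and X -> Y2
spurious <- c1 * c2              # implied cor(Y1, Y2) = 0.42
```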
3.3.2 Effects decomposition via tracing the path diagram
Let’s look at an example. We have four variables:
- SES: social economic status
- IQ: intelligence quotient
- nACH: need for achievement
- GPA: grade-point average
We specify the relationships among these four variables as

The corresponding regressions are
In the following figures, red represents direct effects, blue represents indirect effects, green represents undecomposed effects, and grey represents spurious covariance/correlation.
- The decomposition of

- The decomposition of

- The decomposition of

- The decomposition of

- The decomposition of

By standardizing all variables, the variance terms become 1 and disappear, the covariance terms become correlations, and the regression coefficients become standardized coefficients; we then have

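As a sanity check on the tracing rules, here is a simulation sketch (the coefficients are made up, and this is not the SES/IQ/nACH/GPA data): with standardized variables, the correlation between an independent and a dependent variable should equal the sum of the traced paths.

```r
set.seed(2)
n <- 2e5
X <- rnorm(n)
M <- 0.6 * X + sqrt(1 - 0.6^2) * rnorm(n)    # standardized mediator
# residual variance chosen so Var(Y) = 1:
# 1 - (0.3^2 + 0.4^2 + 2 * 0.3 * 0.4 * 0.6) = 0.606
Y <- 0.3 * X + 0.4 * M + sqrt(0.606) * rnorm(n)

implied  <- 0.3 + 0.4 * 0.6                  # direct + indirect = 0.54
observed <- cor(X, Y)                        # close to 0.54
```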
3.3.3 Homework
ex3.11
- Decompose
;
3.4 Modeling process
After specifying a theoretical structure, the next step of a SEM analysis (including path analysis) is modeling, i.e. collecting data and evaluating the model. When dealing with multiple variables, it is very likely that we have a group of competing models and do not know which one is best, so we have to perform model evaluation in roughly two steps:
- model estimation: obtain the optimal estimates of the unknown parameters and the model-data fit for each model;
- model comparison: compare the model-data fit to find the best model.
3.4.1 Model estimation
3.4.1.1 Four basic matrices in SEM

At population level, when studying a set of mutually correlated variables, the covariance matrix of these variables is
At sample level, the sample estimator of
If we specified a wrong model, then the discrepancy between
3.4.1.2 Discrepancy function
The purpose of model estimation is to find the best sample estimates for the unknown parameters in a given model. The “best” is achieved by minimizing the discrepancy using the sample. For example (this example is taken from the lecture notes of the SEM class taught by Professor Zhang Zhiyong at the University of Notre Dame), we have two random variables
We assume the true relationship between
where the unknown parameters are
then
When
and
It is clear that for this given model,
Note that,
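Discrepancy minimization can be sketched in R for a two-variable model $x_2 = b x_1 + e$. This is a minimal sketch, not the lecture's exact example: the data are simulated, the parameter names $(\phi, b, \psi)$ are my own, and a simple least-squares discrepancy is used:

```r
set.seed(3)
x1 <- rnorm(300)
x2 <- 0.8 * x1 + rnorm(300, sd = 0.5)
S  <- cov(cbind(x1, x2))                  # sample covariance matrix

# model-implied covariance matrix, theta = (phi, b, psi)
implied <- function(theta) {
  phi <- theta[1]; b <- theta[2]; psi <- theta[3]
  matrix(c(phi, b * phi, b * phi, b^2 * phi + psi), 2, 2)
}
# discrepancy between S and Sigma(theta)
F_disc <- function(theta) sum((S - implied(theta))^2)

fit <- optim(c(1, 0, 1), F_disc)          # minimize the discrepancy
fit$par                                   # (phi, b, psi); b near 0.8
```

Because this model is just-identified, the minimized discrepancy is essentially zero and the estimate of $b$ recovers the slope used to generate the data.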
3.4.1.3 Common estimation methods (skimming)
- Ordinary least squares (OLS)
where
- Multivariate normal distribution maximum likelihood (NML)
NML is usually the default estimation method in most SEM software. Note that,
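The normal-theory ML discrepancy has a closed form, $F_{ML} = \log|\Sigma(\theta)| + \mathrm{tr}(S\Sigma(\theta)^{-1}) - \log|S| - p$, which equals 0 when the model reproduces $S$ exactly. A minimal sketch (the 2-variable $S$ below is made up):

```r
# normal-theory ML discrepancy between a sample covariance matrix S
# and a model-implied covariance matrix Sigma
F_ml <- function(S, Sigma) {
  p <- nrow(S)
  log(det(Sigma)) + sum(diag(S %*% solve(Sigma))) - log(det(S)) - p
}

S <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
F_ml(S, S)        # ~0: Sigma reproduces S perfectly
F_ml(S, diag(2))  # > 0: a model forcing the covariance to 0 fits worse
```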
3.4.2 Model fit
After fitting a specified model to data, we answer the question “how good is our model?” by assessing model-data fit. Model fit indices abound; most of them are directly based on the likelihood ratio test (LRT).
3.4.2.1 The best model and the worst model
In SEM, the best model we can fit to data is the saturated model, in which all parameters are freely estimated, so the model reproduces the sample covariance matrix exactly.

The worst model we can fit to data is the baseline model/null model. In this model, all covariances among the variables are fixed to 0, and only the variances are estimated.
3.4.2.2 Incremental measure of fit
In SEM, incremental measures of fit quantify the improvement of the specified model over the baseline model.
For CFI and TLI:
- $\geq .95$ indicates good model-data fit
- $\geq .90$ indicates acceptable model-data fit
A confidence interval can be computed for the RMSEA. Its formula is based on the non-central $\chi^2$ distribution.
- $\leq .05$ indicates a close fit of the model
- $\leq .08$ indicates a reasonable model
- $> .10$ indicates a bad model
Note that CFI, TLI and RMSEA treat a
The SRMR is an absolute measure of fit and is defined as the standardized difference between the observed correlations and the predicted correlations. It is a positively biased measure, and that bias is greater for small $N$ and low-$df$ models.
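As a sketch of how the chi-square-based indices above are computed (all chi-square values, degrees of freedom, and the sample size below are made up):

```r
chi_m <- 12.3; df_m <- 8     # hypothetical target model
chi_b <- 240;  df_b <- 15    # hypothetical baseline model
n <- 300                     # hypothetical sample size

# CFI and TLI: improvement over the baseline model
cfi <- 1 - max(chi_m - df_m, 0) / max(chi_b - df_b, chi_m - df_m, 0)
tli <- ((chi_b / df_b) - (chi_m / df_m)) / ((chi_b / df_b) - 1)

# RMSEA: misfit per degree of freedom, adjusted for sample size
rmsea <- sqrt(max(chi_m - df_m, 0) / (df_m * (n - 1)))

c(CFI = cfi, TLI = tli, RMSEA = rmsea)
```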
3.4.2.3 Comparative measure of fit
Following are three commonly used comparative measures of fit: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the sample-size adjusted BIC (SABIC).
Lower values of AIC indicate a better fit and so the model with the lowest AIC is the best fitting model. There are somewhat different formulas given for the AIC in the literature, but those differences are not really meaningful as it is the difference in AIC that really matters. The AIC makes the researcher pay a penalty of two for every parameter that is estimated.
The BIC increases the penalty as the sample size increases; it places a high value on parsimony. The SABIC, like the BIC, penalizes added parameters based on sample size, but not as heavily as the BIC.
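The three criteria differ only in the per-parameter penalty: 2 for the AIC, $\log(n)$ for the BIC, and $\log((n+2)/24)$ for the SABIC. A minimal sketch (the log-likelihood, parameter count, and sample size are made up):

```r
# information criteria from a model's log-likelihood (logLik),
# number of free parameters (k), and sample size (n)
ic <- function(logLik, k, n) {
  c(AIC   = -2 * logLik + 2 * k,
    BIC   = -2 * logLik + k * log(n),
    SABIC = -2 * logLik + k * log((n + 2) / 24))
}
ic(logLik = -512.4, k = 10, n = 300)   # lower is better
```

With these values the SABIC falls between the AIC and the BIC, reflecting its intermediate penalty.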
3.4.3 Model comparison
In SEM, there are two types of model comparison:
- nested model comparison,
- unnested model comparison,
Nested model comparison is usually conducted using the LRT-based $\chi^2$ difference test.
A model can be seen as a special case of another model by imposing constraints (forcing parameters to be 0) on it. If the model fit of the more complex model is good, constraints can then be imposed to test the resulting simpler model.
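A sketch of the resulting $\chi^2$ difference test (all chi-square values and degrees of freedom below are made up):

```r
chi_simple  <- 29.5; df_simple  <- 12   # constrained (nested) model
chi_complex <- 21.2; df_complex <- 9    # more complex model

d_chi <- chi_simple - chi_complex       # difference in chi-square: 8.3
d_df  <- df_simple - df_complex         # difference in df: 3
p_val <- pchisq(d_chi, d_df, lower.tail = FALSE)
p_val  # if p < .05, the constraints significantly worsen the fit
```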
Unnested model comparison is usually conducted using fit indices (e.g., a 1-factor model vs. a 2-factor model).
It is usually recommended to report multiple fit indices when comparing models (nested or non-nested) so that we have more information. The problem is that fit indices can disagree with each other, and we do not know which one is right.
Lai’s paper
3.4.4 JARS Reporting Standards
Some recommendations related to what we just learned:
- State the software (including version) used in the analysis. Also state the estimation method used and justify its use (i.e., whether its assumptions are supported by the data; if a method assuming multivariate normality was used, report statistics that measure univariate or multivariate skewness and kurtosis to support the assumption of normal distributions; otherwise, state the strategy used to address nonnormality, such as use of a different estimation method that does not assume normality or use of normalizing transformations of the scores).
- Disclose any default criteria in the software, such as the maximum number of iterations or level of tolerance, that were adjusted in order to achieve a converged and admissible solution.
- Report fit statistics or indices about global (omnibus) fit interpreted using criteria justified by citation of most recent evidence-based recommendations for all models to be interpreted.
- State the strategy or criteria used to select one model over another if alternative models were compared. Report results of difference tests for comparisons between alternative models.
- Indicate whether one or more interpreted models was a product of respecification. If so, then describe the method used to search for misspecified parameters. State which parameters were fixed or freed to produce the interpreted model. Also provide a theoretical or conceptual rationale for parameters that were fixed or freed after specification searching.
- Report both unstandardized and standardized estimates for all estimated parameters.