6 Model fit

Fit can be assessed at several levels (item fit, person fit, overall model fit), and many statistics exist for each. No single fit measure is agreed to be best, so software typically reports several. IRTPRO, for example, reports the following goodness-of-fit measures:

  • The values of \(-2\) loglikelihood, Akaike Information Criterion (AIC) (Akaike, 1974) and the Bayesian Information Criterion (BIC) (Schwarz, 1978) for model comparison.

  • (In some cases) the overall likelihood ratio test against the general multinomial alternative.

  • For some models, the M2 statistic (Maydeu-Olivares & Joe, 2005, 2006).

  • LD indexes (Chen & Thissen, 1997) to detect local dependence, that is, violations of the local independence assumption.

  • \(S-X^2\) item fit statistic (Orlando & Thissen, 2000, 2003).
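As a reminder of how the first entry in this list is used, AIC and BIC both add a complexity penalty to \(-2\) loglikelihood, and lower values indicate the preferred model. The sketch below uses the standard definitions; the function names and the numbers in the usage lines are hypothetical and serve only as illustration.

```python
import math

def aic(loglik, n_params):
    """Akaike Information Criterion: -2 loglikelihood + 2 * number of parameters."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: -2 loglikelihood + ln(N) * number of parameters."""
    return -2.0 * loglik + math.log(n_obs) * n_params

# Hypothetical comparison of two models fitted to the same 500 examinees
# (loglikelihoods and parameter counts are made up for illustration).
model_a = {"loglik": -1000.0, "n_params": 10}
model_b = {"loglik": -990.0, "n_params": 20}
for name, m in (("A", model_a), ("B", model_b)):
    print(name, aic(m["loglik"], m["n_params"]), bic(m["loglik"], m["n_params"], 500))
```

Because the BIC penalty grows with the sample size, it tends to favor the more parsimonious model more strongly than AIC does in large samples.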

Below we focus only on Orlando and Thissen’s item fit statistic.

As a general principle, fit statistics look at residuals. For a given IRT model, the item and ability parameters are estimated. This allows computing model-predicted scores (expected scores). The predicted results are then compared to the observed results. The residuals are the differences between the observed and the expected scores. If a model fits well, the residuals are expected to be small.

Yen (1981) suggested an item fit statistic known as \(Q_1\). It is based on ordering the examinees by their \(\theta\) estimates and then dividing them into 10 subgroups of approximately equal size. The formula is

\[\begin{equation} Q_{1i} = \sum_{k=1}^{10} \frac{N_k(O_{ik}-E_{ik})^2}{E_{ik}(1-E_{ik})}, \tag{6.1} \end{equation}\]

where

  • \(O_{ik}\) is the observed proportion of examinees in subgroup \(k\) who answered item \(i\) correctly;

  • \(E_{ik}\) is the expected (from the model) proportion of examinees in subgroup \(k\) who should answer item \(i\) correctly, computed at the subgroup’s mean \(\theta\) value;

  • \(N_k\) is the number of examinees in subgroup \(k\).
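The computation in Equation (6.1) can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a 2PL model with known item parameters `a` and `b`, and the function names `irf_2pl` and `yen_q1` are hypothetical.

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function (assumed model): P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def yen_q1(responses, thetas, a, b, n_groups=10):
    """Yen's Q1 for a single item.

    responses: 0/1 answers to the item, one per examinee
    thetas:    estimated abilities, in the same order
    Requires at least n_groups examinees so that no subgroup is empty.
    """
    # Order examinees by their theta estimates.
    order = sorted(range(len(thetas)), key=lambda j: thetas[j])
    n = len(order)
    q1 = 0.0
    for k in range(n_groups):
        # Slice boundaries chosen so that subgroup sizes differ by at most one.
        group = order[k * n // n_groups:(k + 1) * n // n_groups]
        n_k = len(group)
        o_ik = sum(responses[j] for j in group) / n_k    # observed proportion correct
        mean_theta = sum(thetas[j] for j in group) / n_k
        e_ik = irf_2pl(mean_theta, a, b)                 # expected proportion at the mean theta
        q1 += n_k * (o_ik - e_ik) ** 2 / (e_ik * (1.0 - e_ik))
    return q1
```

For example, `yen_q1([0, 1, 1, 1], [-1.0, -1.0, 1.0, 1.0], a=1.0, b=0.0, n_groups=2)` forms two subgroups of two examinees each and sums their two contributions.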

Yen’s \(Q_1\) is very similar to Bock’s (1972) \(\chi^2\) index (Equation 9.3 in Embretson & Reise, 2000), except that Bock’s statistic did not rely on a fixed number of subgroups and the \(E_{ik}\) values were computed at the median (instead of the mean) \(\theta\) value for subgroup \(k\).

One problem with the \(Q_1\) index is that the subgroups are formed from the estimated \(\theta\) values, so the observed proportions in each subgroup are themselves model dependent. This is undesirable: the observed data against which the model is checked should not depend on the model being checked. Moreover, the number of subgroups and the cutoff values between subgroups are arbitrary and may affect the results.

Orlando and Thissen (2000) proposed grouping examinees based on the observed data only (thus, not on \(\theta\)). The formula is given by

\[\begin{equation} S-X_i^2 = \sum_{k=1}^{I-1} \frac{N_k(O_{ik}-E_{ik})^2}{E_{ik}(1-E_{ik})}. \tag{6.2} \end{equation}\]

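Now the groups are based on the observed number-correct score, that is, the total number of correct answers of each examinee: subgroup \(k\) consists of the \(N_k\) examinees who answered exactly \(k\) items correctly. The sum runs over \(k = 1, \ldots, I-1\) because scores of 0 and \(I\) carry no information about item fit (for those scores the observed and expected proportions are necessarily both 0 and both 1, respectively). The observed proportions \(O_{ik}\) are straightforwardly computed. The expected proportions \(E_{ik}\) are more difficult to compute (see Orlando & Thissen, 2000, for the formulas). Under the null hypothesis that the model fits, the statistic is approximately \(\chi^2\) distributed with \((I-1-m)\) degrees of freedom (\(I=\) number of items, \(m=\) number of estimated parameters of item \(i\)). As usual in \(\chi^2\) tests, cells with low expected frequencies need to be collapsed (and the degrees of freedom adjusted accordingly).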
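To make the computation of \(E_{ik}\) more concrete, the sketch below follows our reading of Orlando and Thissen (2000): the expected proportion is the model-implied probability of answering item \(i\) correctly given a number-correct score of \(k\), obtained by integrating over the ability distribution, with the score likelihoods built up recursively (the Lord-Wingersky recursion). Several choices here are assumptions made only for illustration: a 2PL model, a standard normal ability density, a simple equally spaced quadrature grid, and all function names. Collapsing of sparse cells is not implemented.

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function (assumed model): P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def score_likelihoods(probs):
    """Lord-Wingersky recursion: P(number-correct = k | theta), k = 0..len(probs)."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)   # this item answered incorrectly
            new[k + 1] += mass * p       # this item answered correctly
        dist = new
    return dist

def expected_proportions(item, a, b, n_quad=61):
    """E_ik for k = 1..I-1, assuming a standard normal ability density."""
    n_items = len(a)
    nodes = [-6.0 + 12.0 * q / (n_quad - 1) for q in range(n_quad)]
    # Unnormalized normal weights: the constant cancels in the ratio below.
    weights = [math.exp(-0.5 * t * t) for t in nodes]
    num = [0.0] * (n_items + 1)
    den = [0.0] * (n_items + 1)
    for t, w in zip(nodes, weights):
        probs = [irf_2pl(t, a[j], b[j]) for j in range(n_items)]
        full = score_likelihoods(probs)                           # all items
        rest = score_likelihoods(probs[:item] + probs[item + 1:]) # all items except `item`
        for k in range(1, n_items):
            # Item correct AND k-1 correct among the remaining items.
            num[k] += probs[item] * rest[k - 1] * w
            den[k] += full[k] * w
    return {k: num[k] / den[k] for k in range(1, n_items)}

def s_x2(data, item, a, b):
    """S-X^2 for one item; data is a list of 0/1 response vectors (one per examinee)."""
    n_items = len(a)
    expected = expected_proportions(item, a, b)
    stat = 0.0
    for k in range(1, n_items):
        group = [row for row in data if sum(row) == k]
        if not group:
            continue  # empty score group; a real analysis would collapse sparse cells
        n_k = len(group)
        o_ik = sum(row[item] for row in group) / n_k
        e_ik = expected[k]
        stat += n_k * (o_ik - e_ik) ** 2 / (e_ik * (1.0 - e_ik))
    return stat
```

A quick sanity check on this sketch: when all items have identical parameters they are exchangeable, so the expected proportion for score group \(k\) must equal \(k/I\) for every item.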