Exercise 15 DIF analysis of polytomous questionnaire items using ordinal logistic regression

Data file likertGHQ28sex.txt
R package lordif, psych

15.1 Objectives

The objective of this exercise is to screen polytomous test items for Differential Item Functioning (DIF) using ordinal logistic regression. Item DIF is present when people with the same trait level but from different groups have different probabilities of selecting the item response categories. In this example, we will screen for DIF with respect to gender.

15.2 Worked Example - Screening GHQ Somatic Symptoms items for gender DIF

To complete this exercise, you need to repeat the analysis from a worked example below, and answer some questions. To practice further, you may repeat the analyses for the remaining GHQ-28 subscales.

Data for this exercise come from the Medical Research Council National Survey of Health and Development (NSHD), also known as the British 1946 birth cohort study. We considered these data in Exercise 13.

To remind you, the data pertain to a wave of interviewing undertaken in 1999, when the participants were aged 53. At that point, N = 2,901 participants (1,422 men and 1,479 women) completed the 28-item version of General Health Questionnaire (GHQ-28). The GHQ-28 is designed to measure 4 facets of mental/emotional distress - somatic symptoms (items 1-7), anxiety/insomnia (items 8-14), social dysfunction (items 15-21) and severe depression (items 22-28).

The focus of our analysis here will be the Somatic Symptoms scale, measured by 7 items. For each item, respondents need to think how they felt in the past 2 weeks:

Have you recently:
1) been feeling perfectly well and in good health?
2) been feeling in need of a good tonic?
3) been feeling run down and out of sorts?  
4) felt that you are ill?
5) been getting any pains in your head?
6) been getting a feeling of tightness or pressure in your head?
7) been having hot or cold spells?

Item responses are coded so that the score of 1 indicates good health or lack of distress; and the score of 4 indicates poor health or a lot of distress. Therefore, high scores on the questionnaire indicate emotional distress.

Step 1. Opening and examining the data

If you completed Exercise 13, the easiest thing to do is to open the project you created back then and continue working with it.

If you need to start from scratch, download the data file likertGHQ28sex.txt into a new folder and create a project associated with it. As the file here is in the tab-delimited format (.txt), use function read.table() to import data into the data frame we will call GHQ.

GHQ <- read.table(file="likertghq28sex.txt", header=TRUE)

The object GHQ should appear on the Environment tab. Click on it and the data will be displayed on a separate tab. You should see 33 variables, beginning with the 7 items for Somatic Symptoms scale, SOMAT_1, SOMAT_2, … SOMAT_7, and followed by items for the remaining subscales. They are followed by participant sex (variable sexm0f1 coded male = 0 and female = 1). The last 4 variables in the file are sum scores (sums of relevant 7 items) for each of the 4 subscales. There are NO missing data.

Step 2. Preparing the grouping variable, exploring group differences

Next, you need to prepare the grouping variable for DIF analyses. The variable sexm0f1 coded as male = 0 and female = 1. ATTENTION: this means that female is the focal group (group which will be the focus of analysis, and will be compared to the reference group, male). Give value labels to the sex variable as follows:

GHQ$sexm0f1 <- factor(GHQ$sexm0f1,
                  levels = c(0,1),
                  labels = c("male", "female"))

Run the command head(GHQ$sexf0m1) to check that the correct labels for sexm0f1 indeed appeared in the data frame.

Next, let’s obtain and examine the descriptive statistics for items and subscale scores by sex. The easiest way to do this is to use the function describeBy(x, group, ...) from the psych package. The function provides descriptive statistics for all variables in the data frame x by grouping variable group.

library(psych)

describeBy(GHQ[ ,c("SOMAT_1","SOMAT_2","SOMAT_3","SOMAT_4",
                   "SOMAT_5","SOMAT_6","SOMAT_7","SUM_SOM")], group=GHQ$sexm0f1)
## 
##  Descriptive statistics by group 
## group: male
##         vars    n  mean   sd median trimmed  mad min max range skew kurtosis
## SOMAT_1    1 1422  2.05 0.49      2    2.03 0.00   1   4     3 0.85     3.68
## SOMAT_2    2 1422  1.74 0.75      2    1.64 1.48   1   4     3 0.80     0.25
## SOMAT_3    3 1422  1.71 0.73      2    1.61 1.48   1   4     3 0.77     0.06
## SOMAT_4    4 1422  1.44 0.68      1    1.31 0.00   1   4     3 1.47     1.70
## SOMAT_5    5 1422  1.24 0.50      1    1.14 0.00   1   4     3 2.01     3.64
## SOMAT_6    6 1422  1.20 0.49      1    1.08 0.00   1   4     3 2.41     5.26
## SOMAT_7    7 1422  1.20 0.52      1    1.07 0.00   1   4     3 2.79     7.85
## SUM_SOM    8 1422 10.58 2.82     10   10.15 2.97   7  26    19 1.49     2.71
##           se
## SOMAT_1 0.01
## SOMAT_2 0.02
## SOMAT_3 0.02
## SOMAT_4 0.02
## SOMAT_5 0.01
## SOMAT_6 0.01
## SOMAT_7 0.01
## SUM_SOM 0.07
## ------------------------------------------------------------ 
## group: female
##         vars    n  mean   sd median trimmed  mad min max range skew kurtosis
## SOMAT_1    1 1479  2.09 0.59      2    2.08 0.00   1   4     3 0.58     1.49
## SOMAT_2    2 1479  1.87 0.79      2    1.80 1.48   1   4     3 0.63    -0.11
## SOMAT_3    3 1479  1.86 0.81      2    1.79 1.48   1   4     3 0.62    -0.26
## SOMAT_4    4 1479  1.49 0.74      1    1.34 0.00   1   4     3 1.44     1.40
## SOMAT_5    5 1479  1.39 0.62      1    1.28 0.00   1   4     3 1.46     1.52
## SOMAT_6    6 1479  1.30 0.59      1    1.17 0.00   1   4     3 1.95     3.13
## SOMAT_7    7 1479  1.77 0.81      2    1.67 1.48   1   4     3 0.81    -0.01
## SUM_SOM    8 1479 11.77 3.45     11   11.34 2.97   7  27    20 1.18     1.56
##           se
## SOMAT_1 0.02
## SOMAT_2 0.02
## SOMAT_3 0.02
## SOMAT_4 0.02
## SOMAT_5 0.02
## SOMAT_6 0.02
## SOMAT_7 0.02
## SUM_SOM 0.09

QUESTION 1. Who has the higher average score on item SOMAT_7 (“been having hot or cold spells”) – males or females? Who scored higher on the Somatic Symptoms (SUM_SOM) on average – males or females? Interpret the means for SOMAT_7 in the light of the SUM_SOM means.

Now you are ready for DIF analyses.

Step 3. Running basic DIF analysis for item SOMAT_7

We will use package lordif to run the DIF analysis. This package has great functions that automate the use of ordinal logistic regression to detect DIF.

Function rundif() is the basic function for detecting polytomous DIF items. It performs an ordinal (common odds-ratio) logistic regression DIF analysis by comparing the endorsement levels for each response category of given items with the matching variable - the variable reflecting the scale score - for individuals from the focal and referent groups. The function has the following format:

rundif(item, resp, theta, gr, criterion, pseudo.R2, R2.change, ...)

where item is one item or a collection of items which we examine for DIF; resp is a data frame containing item responses; theta is a conditioning (matching) variable; gr is a variable identifying groups; criterion is the criterion for flagging DIF items (i.e., “CHISQR”, “R2”, or “BETA”); pseudo.R2 is a pseudo R-squared measure (i.e., “McFadden”, “Nagelkerke”, or “CoxSnell”), and R2.change is the R-squared change for the pseudo R-squared criterion.

We will examine item SOMAT_7 for DIF, so we write item="SOMAT_7". Responses to this item are contained in the 7th column of GHQ dataframe, so we write resp=GHQ[,7] or resp=GHQ[,"SOMAT_7"]. We will use the sum score of Somatic Symptoms items, SUM_SOM, as the matching (conditioning) variable, so we write theta=GHQ$SUM_SOM. If we did not have that sum score already in the dataframe, we could easily compute it as the sum of all item scores using function rowSums(GHQ[,1:7]) (this would work fine because all items in GHQ are coded appropriately to represent distress and there are no missing data). Our grouping variable is sexm0f1, so we write gr=GHQ$sexm0f1.

Finally, we need to decide what criterion we will use for detecting DIF. As our sample size is very large, the statistical significance will be almost guaranteed even for very small effects. Therefore, we will rely on effect size measures, pseudo R-squared (because R-squared is, strictly speaking, is not defined for categorical data) to detect sizable differences between the focal and referent groups. We will use the Nagelkerke R-squared as it the measure well described and used in DIF analysis (unlike other R-squared options, it can theoretically reach 1), and we use 0.035 as the increments in R-squared corresponding to small, medium and large effects.

library(lordif)

# run basic DIF procedure for item SOMAT_7 
DIF7 <- rundif(item="SOMAT_7", resp=GHQ["SOMAT_7"], theta=rowSums(GHQ[,1:7]), 
              gr=GHQ$sexm0f1, 
              criterion = "R2", pseudo.R2 = "Nagelkerke", R2.change = 0.035)
# output DIF results
print(DIF7)
## $stats
##      item ncat chi12 chi13 chi23 beta12 pseudo12.McFadden pseudo13.McFadden
## 1 SOMAT_7    4     0     0     0 0.0052            0.0829            0.0875
##   pseudo23.McFadden pseudo12.Nagelkerke pseudo13.Nagelkerke pseudo23.Nagelkerke
## 1            0.0046              0.1241              0.1305              0.0063
##   pseudo12.CoxSnell pseudo13.CoxSnell pseudo23.CoxSnell df12 df13 df23
## 1            0.1046              0.11            0.0053    1    2    1
## 
## $flag
## [1] TRUE

Now examine the output carefully. You can see:

  • ncat - number of response categories;

  • chi12- p-values for the chi-square change from the baseline model (Model 1) with only trait score accounting for all differences in items scores to the Uniform DIF model (Model 2) where the grouping variable also accounts for differences;

  • chi13- p-values for the chi-square change from the baseline model (Model 1) with only trait score accounting for all differences in items scores to the Non-Uniform DIF model (Model 3) where the the grouping variable AND interaction between trait score and grouping variable also account for differences;

  • chi23- p-values for the chi-square change from the Uniform DIF model (Model 2) to the Non-Uniform DIF model (Model 3).

The next important values are pseudo12.Nagelkerke, pseudo13.Nagelkerke, and pseudo23.Nagelkerke, describing the effect sizes corresponding to differences between Model 1 and 2, Model 1 and 3 and Model 2 and 3 as described above. Effects of added predictors tested with the chi-square statistic may be significant, but they may be trivial in size. To judge whether the DIF effects are consequential, effect sizes need to be computed and evaluated. Nagelkerke R Square is recommended for judging the effect size of logistic regression models. These are interpreted together with the chi-squared results. Fundamentally, significant chi12 AND substantial pseudo12 indicates Uniform DIF; significant chi13 AND substantial pseudo13 indicates either Uniform DIF or Non-uniform DIF or both; and significant chi23 AND substantial pseudo23 indicates Non-uniform DIF.

You can use the following decision rules (Jodoin and Gierl, 2001) to judge whether DIF is present, and its size:

DIF classification Conditions
Large DIF Chi-square significant and Nagelkerke R2 change ≥ 0.07
Moderate DIF Chi-square significant and Nagelkerke R2 change between 0.035 and 0.07
Negligible DIF Chi-square insignificant or Nagelkerke R2 change < 0.035

QUESTION 2. Looking at the chi-square and Nagelkerke pseudo R-squared in the output, is EITHER Uniform or Non-uniform DIF* present for item SOMAT_7? Report the respective increment for Nagelkerke R2, and your conclusions about DIF effect size. Check your conclusions with the output, section $flag, where any DIF effects are flagged.

QUESTION 3. Looking at the chi-square and Nagelkerke pseudo R-squared in the output, is Uniform DIF present for item SOMAT_7? Report the respective increment for Nagelkerke R2, and your conclusions about DIF effect size.

QUESTION 4. Looking at the chi-square and Nagelkerke pseudo R-squared in the output, is Non-uniform DIF present for item SOMAT_7? Report the respective increment for Nagelkerke R2, and your conclusions about DIF effect size.

You can also run the above basic procedure for all 7 items simultaneously calling

# run basic DIF procedure for all Somatic Symptom items

DIF <- rundif(item=c("SOMAT_1","SOMAT_2","SOMAT_3","SOMAT_4",
                     "SOMAT_5","SOMAT_6","SOMAT_7"),
       resp=GHQ[,1:7], theta=rowSums(GHQ[,1:7]), gr=GHQ$sexm0f1,
       criterion = "R2", pseudo.R2 = "Nagelkerke", R2.change = 0.035)

QUESTION 5. Obtain print(DIF) output and examine it. Can you see any more DIF items?

Step 4. DIF analyses with purification

The above analyses used the scale score SUM_SOM as the matching (conditioning) variable. This score is made up of the items scores. But what if we have some DIF items in the scale - surely then, the total score will be “contaminated” by DIF? Indeed, there are methods that remove the effects of DIF items from the matching (conditioning) score. Package lordif offers one of such “purification” methods, whereby items flagged for DIF are removed from the “anchor” set of common items from which the trait score is estimated. This function uses IRT theta estimates (not the simple sum score) as the matching/conditioning variable. The graded response model (GRM) is used for IRT trait estimation as default. The procedure runs iteratively until the same set of items is flagged over two consecutive iterations, unless anchor items are specified.

The lordif function has the following format:

lordif(resp.data, group, selection = NULL, criterion = c("Chisqr", "R2", "Beta"), pseudo.R2 = c("McFadden", "Nagelkerke", "CoxSnell"), R2.change = 0.02, anchor = NULL, ...)

pureDIF <- lordif(resp.data=GHQ[,1:7], group=GHQ$sexm0f1, 
                  criterion = "R2", pseudo.R2 = "Nagelkerke", R2.change = 0.035)
print(pureDIF)
## Call:
## lordif(resp.data = GHQ[, 1:7], group = GHQ$sexm0f1, criterion = "R2", 
##     pseudo.R2 = "Nagelkerke", R2.change = 0.035)
## 
##   Number of DIF groups: 2 
## 
##   Number of items flagged for DIF: 1 of 7 
## 
##   Items flagged: 7 
## 
##   Number of iterations for purification: 2 of 10 
## 
##   Detection criterion: R2 
## 
##   Threshold: R-square change >= 0.035 
## 
##   item ncat pseudo12.Nagelkerke pseudo13.Nagelkerke pseudo23.Nagelkerke
## 1    1    4              0.0011              0.0039              0.0028
## 2    2    4              0.0000              0.0001              0.0001
## 3    3    4              0.0000              0.0001              0.0000
## 4    4    4              0.0035              0.0041              0.0005
## 5    5    3              0.0105              0.0107              0.0002
## 6    6    3              0.0027              0.0031              0.0004
## 7    7    4              0.1732              0.1856              0.0124

The output obtained by calling print() makes it clear that after iterative purification, there is only one item flagged as DIF, and it is item number 7 (SOMAT_7). So we obtain the same results as without purification. However, the effect sizes obtained are slightly different, presumably we used the IRT estimated trait scores rather than sum scores as the matching variable.

QUESTION 6. Report the DIF effect sizes for SOMAT_7 based on the DIF procedure with purification.

Step 5. Describing and interpreting the DIF effects

QUESTION 7. Given the item text and the characteristics of the sample, try to interpret the DIF effect that you found. Who have higher expected score on item SOMAT_7 after controlling for the overall Somatic Symptoms score – males or females? Can you interpret / explain this finding substantively?

Finally, let’s plot the results visually. Package lordif has its own function plot.lordif(), which plots diagnostic graphs for items flagged for DIF. It has the following format:

plot.lordif(x, labels = c("Reference", "Focal"), ...)

In our case, males are the reference group, females is the focal group, therefore we call:

# graphics.off() is used to keep the outputs in the Plot tab
plot.lordif(pureDIF, labels=c("Male","Female"), graphics.off())

You can use the arrows to navigate between the plots in the Plot tab. The first output is “Item True Score Functions”. Here, you see the expected item score for men (black line) and women (dashed red line) for each value of the IRT “true score” (or theta score, in the standardized metric). The large discrepancy across all theta values shows uniform DIF. The “Item Response Functions” output provides the probabilities of endorsing each response category for men (black line) and women (dashed red line) for each value of the theta score. You can see that the biggest difference between sexes is observed for the first response category (“not at all”). It is the presence of hot spells and not their amount that seems to make the biggest difference between sexes.

The next group of plots show the “Test Characteristic Curve” or TCC. The TCCs differ by about 0.5 points at most. That means that for the same true Somatic Symptom score, women are expected to have the sum score of at most 0.5 points higher than men. This difference is most pronounced for average true scores.

Step 6. Saving your work

After you finished this exercise, save your R script by pressing the Save icon. Give the script a meaningful name, for example “GHQ-28 ordinal DIF”.

When closing the project by pressing File / Close project, make sure you select Save when prompted to save your ‘Workspace image’ (with extension .RData).

15.3 Solutions

Q1. Females have higher mean SOMAT_7 item score (1.77) than males (1.20). Females also have higher SUM_SOM (mean=11.77) than males (mean=10.58). Knowing this, the differences in the item means are actually expected, as they may be explained by the difference in Somatic Symptom levels. The question is whether the differences in responses are fully explained by the differences in trait levels.

Q2. chi13for SOMAT_7 is highly significant with the p-value reported as 0.000 (p<0.001). Corresponding pseudo13.Nagelkerke is 0.1305, which is larger than 0.07 (cut-off for large effect size), therefore SOMAT_7 demonstrates LARGE DIF (either uniform or non-uniform or both). This is confirmed by the $flag equal TRUE for SOMAT_7.

Q3. chi12for SOMAT_7 is highly significant with the p-value reported as 0.000 (p<0.001). Corresponding pseudo12.Nagelkerke is 0.1241, which is larger than 0.07, therefore SOMAT_7 demonstrates LARGE Uniform DIF.

Q4. chi23for SOMAT_7 is highly significant with the p-value reported as 0.000 (p<0.001). Corresponding pseudo23.Nagelkerke is 0.0063, which is smaller than 0.035, therefore any Non-uniform DIF effects are trivial; which means thare are NO Non-uniform DIF.

Q5. There are no more DIF items apart from SOMAT_7.

print(DIF)
## $stats
##      item ncat chi12  chi13  chi23 beta12 pseudo12.McFadden pseudo13.McFadden
## 1 SOMAT_1    4 0.000 0.0000 0.3373 0.0360            0.0080            0.0082
## 2 SOMAT_2    4 0.000 0.0000 0.0112 0.0241            0.0037            0.0047
## 3 SOMAT_3    4 0.000 0.0000 0.0218 0.0248            0.0039            0.0048
## 4 SOMAT_4    4 0.000 0.0000 0.0459 0.0652            0.0163            0.0170
## 5 SOMAT_5    4 0.068 0.1671 0.6181 0.0096            0.0008            0.0009
## 6 SOMAT_6    4 0.302 0.5136 0.6052 0.0069            0.0003            0.0004
## 7 SOMAT_7    4 0.000 0.0000 0.0000 0.0052            0.0829            0.0875
##   pseudo23.McFadden pseudo12.Nagelkerke pseudo13.Nagelkerke pseudo23.Nagelkerke
## 1            0.0002              0.0095              0.0097              0.0002
## 2            0.0010              0.0040              0.0051              0.0011
## 3            0.0008              0.0035              0.0042              0.0007
## 4            0.0008              0.0171              0.0179              0.0008
## 5            0.0001              0.0011              0.0011              0.0001
## 6            0.0001              0.0004              0.0005              0.0001
## 7            0.0046              0.1241              0.1305              0.0063
##   pseudo12.CoxSnell pseudo13.CoxSnell pseudo23.CoxSnell df12 df13 df23
## 1            0.0075            0.0077            0.0002    1    2    1
## 2            0.0036            0.0045            0.0010    1    2    1
## 3            0.0031            0.0037            0.0006    1    2    1
## 4            0.0143            0.0149            0.0007    1    2    1
## 5            0.0008            0.0009            0.0001    1    2    1
## 6            0.0003            0.0003            0.0001    1    2    1
## 7            0.1046            0.1100            0.0053    1    2    1
## 
## $flag
## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Q6. pseudo12.Nagelkerke= 0.1732, which indicates large Uniform DIF. pseudo23.Nagelkerke= 0.0124, which indicates negligible non-Uniform DIF.

Q7. We found LARGE Uniform DIF for item SOMAT_7. We also found that females have higher expected score on the item given the same scale score SUM_SOM as males (see Q1). Females agree more with the symptom “having hot and cold spells” than males with the same level of Somatic symptoms. This is probably because many women in this sample are going through menopause (remember, the participants were aged 53 in this wave of testing), and experience hot flashes that men do not experience.