Exercise 14 DIF analysis of dichotomous questionnaire items using logistic regression

Data file EPQ_N_demo.txt
R package fmsb

14.1 Objectives

The objective of this exercise is to learn how to screen dichotomous test items for Differential Item Functioning (DIF) using logistic regression. Item DIF is present when people of the same ability (or people with the same trait level) but from different groups have different probabilities of passing/endorsing the item. In this example, we will screen for DIF with respect to gender.

14.2 Worked Example - Screening EPQ Neuroticism items for gender DIF

To complete this exercise, you need to repeat the analysis from the worked example below and answer some questions.

This exercise makes use of the data we considered in Exercises 6, 8 and 12. These data come from a large cross-cultural study (Barrett, Petrides, Eysenck & Eysenck, 1998), with N = 1,381 participants who completed the Eysenck Personality Questionnaire (EPQ). The focus of our analysis here will be the Neuroticism/Anxiety (N) scale, measured by 23 items with only two response options - either “YES” or “NO”, for example:

N_3     Does your mood often go up and down?
N_7     Do you ever feel "just miserable" for no reason?
N_12    Do you often worry about things you should not have done or said?
etc.

You can find the full list of EPQ Neuroticism items in Exercise 6. Please note that all items indicate “Neuroticism” rather than “Emotional Stability” (i.e. there are no counter-indicative items).

Step 1. Opening and examining the data

If you have already worked with this data set in previous exercises, the simplest thing to do is to continue working within the project created back then. In RStudio, select File / Open Project and navigate to the folder and the project you created. You should see the data frame EPQ appearing in your Environment tab, together with other objects you created and saved.

If you have not completed previous exercises or have not saved your work, or simply want to start from scratch, download the data file EPQ_N_demo.txt into a new folder and follow instructions from Exercise 6 on creating a project and importing the data.

EPQ <- read.delim(file="EPQ_N_demo.txt")

The object EPQ should appear on the Environment tab. Click on it and the data will be displayed on a separate tab. As you can see, there are 26 variables in this data frame, beginning with participant id, age and sex (0 = female; 1 = male). These demographic variables are followed by 23 item responses, which are either 0 (for “NO”) or 1 (for “YES”). There are also a few missing responses, marked with NA.

Step 2. Creating the trait (matching) variable, and the grouping variable

Any DIF analysis begins with creating a variable that represents the trait score on which test takers will be matched. (Remember that DIF is a difference in the probability of endorsing an item between people with the same trait score; we therefore need to compute the trait score so that we can control for it in the analyses.)

You should already know how to compute the sum score when some item responses are missing. We did this in Exercise 1. You use the base R function rowMeans() to compute the average item score (omitting NA values from the calculation with na.rm=TRUE), and then multiply the result by 23 (the number of items in the Neuroticism scale). This essentially replaces any missing responses with that individual’s mean item score.

Noting that the item responses are located in columns 4 to 26, compute the Neuroticism trait score (call it Nscore), and append it to the dataframe as a new variable:

EPQ$Nscore <- rowMeans(EPQ[ ,4:26], na.rm=TRUE)*23

Next, we need to prepare the grouping variable for DIF analyses. The variable sex is coded as 0 = female; 1 = male. ATTENTION: this means that male is the focal group (the group which will be the focus of analysis and which will be compared to the reference group - here, female). To make the DIF analyses easy to interpret, give value labels to this variable as follows:

EPQ$sex <- factor(EPQ$sex,
                  levels = c(0,1),
                  labels = c("female", "male"))

Run the command head(EPQ) again to check that the new variable Nscore and the correct labels for sex have indeed appeared in the data frame.

Next, let’s obtain and examine the item means, and the means of Nscore, by sex. An easy way to do this is to apply the base R function colMeans() separately to the females and the males in the sample:

colMeans(EPQ[EPQ$sex=="female",4:27], na.rm=TRUE)
##        N_3        N_7       N_12       N_15       N_19       N_23       N_27 
##  0.6800459  0.6486797  0.8339061  0.2980437  0.7019563  0.5189873  0.5137931 
##       N_31       N_34       N_38       N_41       N_47       N_54       N_58 
##  0.3839080  0.6877153  0.5977011  0.3487833  0.5986239  0.3390805  0.6953036 
##       N_62       N_66       N_68       N_72       N_75       N_77       N_80 
##  0.2935780  0.6160920  0.4954128  0.6923077  0.4170507  0.4602992  0.7238205 
##       N_84       N_88     Nscore 
##  0.8513825  0.9033372 13.3015915
colMeans(EPQ[EPQ$sex=="male",4:27], na.rm=TRUE)
##        N_3        N_7       N_12       N_15       N_19       N_23       N_27 
##  0.5660750  0.3661417  0.7322835  0.2727273  0.3641732  0.5059055  0.4043393 
##       N_31       N_34       N_38       N_41       N_47       N_54       N_58 
##  0.2500000  0.4477318  0.4192913  0.2366864  0.5551181  0.2696850  0.5944882 
##       N_62       N_66       N_68       N_72       N_75       N_77       N_80 
##  0.2381890  0.5255906  0.3333333  0.4665354  0.2603550  0.3333333  0.4110672 
##       N_84       N_88     Nscore 
##  0.7381890  0.8698225 10.1637778
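
If you prefer a more compact comparison of the average Nscore by sex, the base R aggregate() function gives the same means in a single call (an optional cross-check; output not shown here):

# Nscore means by sex - should match the Nscore values in the output above
aggregate(Nscore ~ sex, data=EPQ, FUN=mean)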

QUESTION 1. Who endorses item N_19 in a greater proportion – males or females? (Hint. Remember that for binary items coded 0/1, the item mean is also the proportion of endorsement.) Who scores higher on the Neuroticism scale (Nscore) on average – males or females? Interpret the means for N_19 in the light of the Nscore means.

Now you are ready for DIF analyses.

Step 3. Specifying logistic regression models

Now, let’s run DIF analyses for item N_19 (Are your feelings easily hurt?). We will create three logistic regression models. The first, the Baseline model, will include the total Neuroticism score as the only predictor of N_19. Because this item was designed to measure Neuroticism, Nscore should positively and significantly predict responses to this item.

By adding sex as another predictor in the second model, we will check for uniform DIF (main effect of sex). If sex adds significantly (in terms of chi-square value) and substantially (in terms of Nagelkerke R2) over and above Nscore, males and females have different odds of saying “YES” to N_19, given their Neuroticism score. This means that uniform DIF is present.

By adding Nscore by sex interaction as another predictor in the third model, we will check for non-uniform DIF. If the interaction term adds significantly (in terms of chi-square value) and substantially (in terms of Nagelkerke R2) over and above Nscore and sex, non-uniform DIF is present.

We will use the R base function glm() (stands for ‘generalized linear model’) to specify the three logistic regression models:

# Baseline model 
Baseline <- glm(N_19 ~ Nscore, data=EPQ, family=binomial(link="logit"))
# Uniform DIF model 
dif.U <- glm(N_19 ~ Nscore + sex, data=EPQ, family=binomial(link="logit"))
# Non-Uniform DIF model
dif.NU <- glm(N_19 ~ Nscore + sex + Nscore:sex, data=EPQ, family=binomial(link="logit"))

You can see that the model syntax above is very simple. We are basically saying that “N_19 is regressed on (~) Nscore”; that “N_19 is regressed on (~) Nscore and sex”; or that “N_19 is regressed on (~) Nscore, sex and the Nscore by sex interaction” (Nscore:sex). We ask the function to perform logistic regression (family = binomial(link="logit")). And of course, we pass the dataset (data = EPQ) where all the variables can be found.

Type the models one by one into your script, and run them. Objects Baseline, dif.U and dif.NU should appear in your Environment. Next, you will obtain and interpret various outputs generated from the results of these models.
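
Optionally, you can visualise what uniform DIF would look like for this item by plotting the probabilities of endorsing N_19 predicted by the dif.U model for females and males across the range of Nscore. Below is a minimal sketch using the base R predict() and plot() functions (the object names new.data, females and males are ours, introduced purely for illustration):

# Fitted probability of endorsing N_19 for each sex, across the Nscore range
new.data <- expand.grid(Nscore = 0:23, sex = c("female", "male"))
new.data$prob <- predict(dif.U, newdata = new.data, type = "response")

females <- new.data[new.data$sex == "female", ]
males   <- new.data[new.data$sex == "male", ]

# Two curves of the same shape, shifted apart, would indicate uniform DIF
plot(females$Nscore, females$prob, type = "l", ylim = c(0, 1),
     xlab = "Nscore", ylab = "P(endorsing N_19)")
lines(males$Nscore, males$prob, lty = 2)
legend("topleft", legend = c("female", "male"), lty = 1:2)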

Step 4. Testing for significance of uniform and non-uniform effects of sex

To test if the main effect of sex or the interaction between sex and Nscore added significantly to the Baseline model, use the base R anova() function. It analyses not only variance components (the ANOVA as most students know it), but also deviance components (what is minimized in logistic regression). The chi-square statistic is used to test the significance of contributions of each added predictor, in the order in which they appear in the regression equation. To get this breakdown, apply the anova() function to the final model, dif.NU.

anova(dif.NU, test= "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: N_19
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                        1376     1875.8              
## Nscore      1   622.32      1375     1253.5 < 2.2e-16 ***
## sex         1    57.42      1374     1196.1 3.513e-14 ***
## Nscore:sex  1     1.21      1373     1194.9    0.2717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Each row shows the contribution of each added predictor to the model. The Deviance column shows the change in deviance (which serves as the chi-square statistic) produced by adding each predictor in turn, starting from the NULL model with just the intercept and no predictors, then adding Nscore, then sex, and finally Nscore:sex. The Df column (first column) shows the degrees of freedom for each added predictor, which is 1 degree of freedom every time. The column Pr(>Chi) gives the p-value for each predictor’s contribution.

Examine the output and try to judge whether the predictors added at each step contributed significantly to the prediction of N_19.
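
If you prefer to see each increment as a direct comparison of two nested models, anova() also accepts two fitted models at a time; the resulting chi-square values match the corresponding rows of the sequential table above (a sketch; output not shown):

# Uniform DIF: does adding sex improve on the Baseline model?
anova(Baseline, dif.U, test="Chisq")
# Non-uniform DIF: does adding the interaction improve on dif.U?
anova(dif.U, dif.NU, test="Chisq")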

QUESTION 2. Is the Baseline model (with Nscore as the only predictor) significant? Try to report the chi-square statistic for this model using the template below: Baseline – NULL: diff.chi-square (df= __ , N=__ ) = _______, p = _______.

QUESTION 3. Does sex add significantly over and above Nscore? What is the increment in chi-square compared to the Baseline model? What does this mean in terms of testing for Uniform DIF? dif.U – Baseline: diff.chi-square (df= __ , N=__ ) = _______, p = _______.

QUESTION 4. Does Nscore by sex interaction add significantly over and above Nscore and sex? What is the increment in chi-square compared to the dif.U model? What does this mean in terms of testing for Non-Uniform DIF? dif.NU – dif.U: diff.chi-square (df= __ , N=__ ) = _______, p = _______.

Step 5. Evaluating effect sizes for uniform and non-uniform effects of sex

Effects of added predictors may be significant, but they may be trivial in size. To judge whether the differences between groups while controlling for the latent trait (DIF) are consequential, effect sizes need to be computed and evaluated. Nagelkerke R Square is recommended for judging the effect size of logistic regression models.

To obtain Nagelkerke R2, we will use the package fmsb. Install this package on your computer and load it into memory. Then apply the function NagelkerkeR2() to each of the three models you produced earlier.

library(fmsb)

NagelkerkeR2(Baseline)
## $N
## [1] 1377
## 
## $R2
## [1] 0.4887726
NagelkerkeR2(dif.U)
## $N
## [1] 1377
## 
## $R2
## [1] 0.5237134
NagelkerkeR2(dif.NU)
## $N
## [1] 1377
## 
## $R2
## [1] 0.524433

Look at the output and note that the function returns two values – the sample size on which the calculation was based ($N), and the actual R2 ($R2).

QUESTION 5. Report the Nagelkerke R2 for each model below: Baseline: Nagelkerke R2 = _______ dif.U: Nagelkerke R2 = _______ dif.NU: Nagelkerke R2 = _______

Finally, let’s compute the increments in Nagelkerke R2 for dif.U compared to Baseline, and dif.NU compared to dif.U. We will refer directly to the $R2 values of the models:

# compare model dif.U against Baseline - Uniform DIF effect size
NagelkerkeR2(dif.U)$R2 - NagelkerkeR2(Baseline)$R2
## [1] 0.03494082
# compare model dif.NU against dif.U - Non-Uniform DIF effect size
NagelkerkeR2(dif.NU)$R2 - NagelkerkeR2(dif.U)$R2
## [1] 0.0007195194

Now, refer to the following decision rules to judge whether DIF is present or not:

Large DIF: Chi-square significant and Nagelkerke R2 change ≥ 0.07
Moderate DIF: Chi-square significant and Nagelkerke R2 change between 0.035 and 0.07
Negligible DIF: Chi-square not significant or Nagelkerke R2 change < 0.035
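
If you would like to automate this judgement, here is a minimal helper implementing the decision rules above (the function classify.DIF is ours, written purely for illustration; it is not part of fmsb or any other package):

# Apply the DIF decision rules to a chi-square p-value and a Nagelkerke R2 change
classify.DIF <- function(p.value, R2.change) {
  if (p.value >= .05 || R2.change < 0.035) {
    "Negligible DIF"
  } else if (R2.change < 0.07) {
    "Moderate DIF"
  } else {
    "Large DIF"
  }
}

# Example with made-up values
classify.DIF(p.value = 0.001, R2.change = 0.05)   # "Moderate DIF"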

QUESTION 6. What are the increments for Nagelkerke R2, and what do you conclude about Differential Item Functioning?

Step 6. Describing the effects: regression coefficients

Finally, obtain the regression coefficients of the final model by running summary(dif.NU). You will see the sign of the effects and their significance. However, remember that B coefficients in logistic regression are on the log-odds scale and therefore are not directly interpretable. Instead, request the exp(B) values and interpret them as odds ratios.

exp(coef(dif.NU))
##    (Intercept)         Nscore        sexmale Nscore:sexmale 
##     0.06050411     1.35852112     0.20692613     1.04251259

Now, try to report the size of the individual variables’ effects.

QUESTION 7. Write sentences describing the odds ratios, in terms of the change in the DV accompanying increases in the IV. Report significant effects only. Use the templates below.

A one-point increase in Nscore was associated with a _______ times _____________________(increase or decrease) in the odds of endorsing item N_19.

Male sex was associated with a _______ times ________________(increase or decrease) in the odds of endorsing item N_19.

QUESTION 8. Finally, try to interpret any moderate or large DIF effects that you found (ignore negligible DIF). Who has the higher expected probability of endorsing item N_19 – males or females? Can you interpret / explain this finding substantively?

Step 7. Saving your work

After you have finished this exercise, save your R script by pressing the Save icon. Give the script a meaningful name, for example “EPQ_N logistic regression”.

When closing the project by pressing File / Close project, make sure you select Save when prompted to save your ‘Workspace image’ (with extension .RData).

14.3 Solutions

Q1. Females endorse item N_19 more frequently (0.70, or 70% of them endorse it) than males (0.36, or 36%). Females also have a higher Nscore on average (mean = 13.30) than males (mean = 10.16). Given this, the difference in item endorsement rates is actually expected, as it may be explained by the difference in Neuroticism levels. The question is whether the differences in responses are fully explained by the trait level.

Q2. The Baseline model predicts the item response significantly (as we would expect); chi-square (1, N = 1377)=622.32, p < .001.
[Hint. To determine the sample size on which the regression model was run, you can look at the ‘Resid. Df’ column. The NULL model always has N-1 residual degrees of freedom, so we can use the Resid. Df entry for the NULL model (1376) to calculate N = 1377. This is smaller than the total sample size for EPQ (1381) because a few responses on N_19 were missing, and these cases were deleted listwise by the regression procedure.]
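
A quick way to confirm this sample size in code is the base R nobs() function, which returns the number of observations actually used in fitting a model:

nobs(Baseline)   # should return 1377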

Q3. Sex adds significantly to prediction. The increment is diff.chi-square (1, N = 1377) =57.42, p < .001. This means that Uniform DIF might be present (to judge its effect size, we will need to look at the Nagelkerke R2).

Q4. The Nscore by sex interaction does not add significantly to prediction. The increment is diff.chi-square (1, N = 1377) = 1.21, p = .272. This means that there is no Non-Uniform DIF, regardless of the effect size.

Q5. Baseline: Nagelkerke R2 = 0.4888; dif.U: Nagelkerke R2 = 0.5237; dif.NU: Nagelkerke R2 = 0.5244

Q6. The Nagelkerke R2 increment from the Baseline model to the dif.U model is 0.035 (0.0349 before rounding). According to the DIF classification rules, this just qualifies as moderate DIF, because the effect was significant (see Q3) and the effect size is essentially at the cut-off for moderate DIF.

The Nagelkerke R2 increment from the dif.U model to the dif.NU model is 0.0007. According to the DIF classification rules, this DIF is negligible (i.e. there is no Non-Uniform DIF), because the effect was not significant (see Q4) and the effect size is tiny.

Q7. A one-point increase in Nscore was associated with a 1.359 times increase in the odds of endorsing the item. Male sex was associated with a 0.207 times decrease in the odds of endorsing the item. [NOTE that values above 1 are associated with an increase, and below 1 with a decrease in odds].

Q8. We found moderate Uniform DIF for item N_19. We also found that females have higher odds of endorsing the item than males with the same Nscore (because males have lower odds – see Q7). It appears that females admit to their “feelings being easily hurt” (see the text of N_19) more readily than males at the same Neuroticism level. This might be because expressing one’s feelings is more socially acceptable for females.