Exercise 5 Reliability analysis of polytomous questionnaire data

Data file SDQ.RData
R package psych

5.1 Objectives

The purpose of this exercise is to learn how to estimate the test score reliability by different methods: test-retest, split-half and “internal consistency” (Cronbach’s alpha). You will also learn how to judge whether test items contribute to measurement.

5.2 Study of a community sample using the Strength and Difficulties Questionnaire (SDQ)

In this exercise, we will again work with the self-rated version of the Strengths and Difficulties Questionnaire (SDQ), a brief behavioural screening questionnaire for children and adolescents aged 3-16 years. The data set, SDQ.RData, which was used in Exercises 1 and 2, contains responses of N=228 pupils from the same school to the SDQ. The SDQ was administered twice: the first time when the children had just started secondary school (Year 7), and again one year later (Year 8).

To remind you, the SDQ measures 5 facets, with 5 items each:

Facet                 Items
Emotional Symptoms    somatic, worries, unhappy, clingy, afraid
Conduct Problems      tantrum, obeys*, fights, lies, steals
Hyperactivity         restles, fidgety, distrac, reflect*, attends*
Peer Problems         loner, friend*, popular*, bullied, oldbest
Pro-social            consid, shares, caring, kind, helpout

NOTE that the 5 items marked with asterisks (*) in the above table represent behaviours counter-indicative of the scales they are intended to measure, so that higher scale scores correspond to lower item scores.

Every item response is coded according to the following response options: 0 = “Not true”; 1 = “Somewhat true”; 2 = “Certainly true”.

There are some missing responses. In particular, some pupils have no data for the second measurement, possibly because they were absent on the day of testing or moved to a different secondary school.
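
If you want to check the extent of missing data yourself once the project is open (Step 1 below), here is a quick sketch using base R:

colSums(is.na(SDQ))      # number of missing values per variable
sum(complete.cases(SDQ)) # number of pupils with complete data on all variables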

5.3 Worked Example - Estimating reliability for SDQ Conduct Problems

To complete this exercise, you need to work through the example below, which comprises a reliability analysis of the Conduct Problems facet. Once you feel confident, you can complete the exercise for the remaining SDQ facets.

Step 1. Creating/Opening project

If you completed Exercise 1, you should have already downloaded the file SDQ.RData and created a project associated with it. Please find and open that project now. In that project, you should have already computed the scale scores for Conduct Problems (Time 1 and Time 2) and reverse-coded the counter-indicative items. Both of these steps are essential for running the reliability analysis. If you have not done so, please go to Exercise 1 and follow the instructions to complete these steps.

Step 2. Test-retest reliability

When test scores on two occasions are available, we can estimate test reliability by computing the correlation between them (correlate the scale score at Time 1 with the scale score at Time 2) using the base R function cor.test().

Remind yourself of the variables in the SDQ data frame (which should be available when you open the project from Exercise 1) by calling names(SDQ). Among other variables, you should have S_conduct and S_conduct2, representing the test scores for Conduct problems at Time 1 and Time 2, respectively, which you computed then. Correlate them to obtain the test-retest reliability.

cor.test(SDQ$S_conduct, SDQ$S_conduct2)
## 
##  Pearson's product-moment correlation
## 
## data:  SDQ$S_conduct and SDQ$S_conduct2
## t = 7.895, df = 170, p-value = 3.421e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3992761 0.6195783
## sample estimates:
##       cor 
## 0.5179645

QUESTION 1. What is the test-retest reliability for the SDQ Conduct Problem scale? Try to interpret this result.

Step 3. Internal consistency reliability (coefficient alpha)

To compute the coefficient alpha, you will need to again refer to the items indicating Conduct Problems.

It is important that, before submitting them to analysis, the items are appropriately coded - that is, any counter-indicative items should be reverse coded using the function reverse.code() from package psych. This is because alpha is computed from the items’ average covariance (“raw” alpha) or average correlation (“standardized” alpha). If any items correlate negatively with the rest (which counter-indicative items should), these negative correlations will cancel out the positive correlations, and alpha will be spuriously low (or even negative). This would obviously be wrong, because we are estimating the reliability of the test score, and when we actually computed the test score we reversed the negatively keyed items. We should do the same when computing alpha.

Fortunately, you should have already prepared the correctly coded item scores in Exercise 1. They should be stored in the object R_conduct. If you have not done this, please go back to Exercise 1 and apply the function reverse.code() to the 5 items measuring Conduct Problems to create R_conduct.

library(psych)

items_conduct <- c("tantrum","obeys","fights","lies","steals")
R_conduct <- reverse.code(keys=c(1,-1,1,1,1), SDQ[items_conduct])
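
Before running alpha(), it may help to see where the “raw” alpha comes from. A minimal sketch computing it directly from the item covariance matrix (pairwise-complete covariances are assumed here, so the value may differ slightly from the one alpha() reports):

C <- cov(R_conduct, use="pairwise.complete.obs")   # 5 x 5 item covariance matrix
k <- ncol(C)
# raw alpha: (k/(k-1)) * (1 - sum of item variances / variance of the sum score)
(k / (k - 1)) * (1 - sum(diag(C)) / sum(C))        # approximately 0.72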

Now all that is left to do is to run the function alpha() from package psych on the correctly coded set R_conduct. There are various arguments you can control in this function, and most defaults are fine, but I suggest you set cumulative=TRUE (the default is FALSE). This will ensure that statistics in the output are given for the sum score (the “cumulative” of the item scores) rather than for the average score (the default). We computed the sum score for Conduct Problems, so we want the output to match the score we computed.

alpha(R_conduct, cumulative=TRUE)
## 
## Reliability analysis   
## Call: alpha(x = R_conduct, cumulative = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd median_r
##       0.72      0.73     0.7      0.35 2.7 0.028  2.1 2.1     0.33
## 
##  lower alpha upper     95% confidence boundaries
## 0.66 0.72 0.77 
## 
##  Reliability if an item is dropped:
##         raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## tantrum      0.62      0.65    0.59      0.31 1.8    0.041 0.0100  0.28
## obeys-       0.65      0.66    0.61      0.33 2.0    0.035 0.0078  0.33
## fights       0.67      0.66    0.61      0.33 2.0    0.034 0.0094  0.30
## lies         0.70      0.71    0.66      0.38 2.5    0.031 0.0086  0.38
## steals       0.71      0.72    0.68      0.39 2.6    0.031 0.0096  0.43
## 
##  Item statistics 
##           n raw.r std.r r.cor r.drop mean   sd
## tantrum 226  0.79  0.76  0.68   0.59 0.57 0.72
## obeys-  228  0.71  0.72  0.64   0.52 0.58 0.60
## fights  228  0.66  0.73  0.64   0.52 0.19 0.44
## lies    226  0.70  0.64  0.49   0.43 0.54 0.72
## steals  227  0.57  0.62  0.45   0.38 0.19 0.49
## 
## Non missing response frequency for each item
##            0    1    2 miss
## tantrum 0.56 0.31 0.13 0.01
## obeys-  0.48 0.46 0.06 0.00
## fights  0.82 0.16 0.02 0.00
## lies    0.59 0.27 0.14 0.01
## steals  0.86 0.10 0.04 0.00

QUESTION 2. What is the alpha (raw_alpha is the most appropriate statistic to report for reliability of the raw scale score) for the Conduct Problems scale? Try to interpret the size of alpha bearing in mind the definition of reliability as “the proportion of variance in the observed score due to true score”.

Now examine the output in more detail. There are other useful statistics printed on the same line as raw_alpha. Note the average_r, which is the average correlation between the 5 items of this facet; the std.alpha is computed from this average correlation.

Other useful stats are mean (the mean of the sum score) and sd (the standard deviation of the sum score). It is very convenient that these are calculated by the function alpha(), because you can get them even without computing the Conduct Problems scale score! If you wish, check them against the actual sum score stats using describe(SDQ$S_conduct). This is why I suggested setting cumulative=TRUE - so that you get stats for the sum score, not the average score.
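
As a quick check of both points, you can reproduce std.alpha from average_r by hand (the standardized alpha is the Spearman-Brown formula applied to the average inter-item correlation) and compare the mean and sd with the sum score you computed in Exercise 1. A minimal sketch:

k <- 5; r_bar <- 0.35               # number of items and average_r from the output above
k * r_bar / (1 + (k - 1) * r_bar)   # = 0.729, i.e. the std.alpha of 0.73
describe(SDQ$S_conduct)             # mean and sd should match those printed by alpha()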

Now examine the output “Reliability if an item is dropped”. The first column gives the expected “raw_alpha” for a 4-item scale without that particular item (i.e. if the item were dropped). This is useful for seeing whether the item makes a good contribution to the measurement provided by the scale. If this expected alpha is lower than the actual reported alpha, the item improves the test score reliability. If it is higher than the actual alpha, the item actually reduces the score reliability. You may wonder how this is possible, since adding items is supposed to increase reliability. Essentially, such an item contributes more noise (to the error variance) than signal (to the true score variance).
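
You can verify any row of this table directly by re-running alpha() on the four remaining items. A sketch, dropping steals (the 5th column of R_conduct):

alpha(R_conduct[, -5], cumulative=TRUE)   # raw_alpha should be close to 0.71, as in the 'steals' row above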

QUESTION 3. Judging by the “Reliability if an item is dropped” output, do all of the items contribute positively to the test score reliability? Which item provides the biggest contribution?

Now examine the “Item statistics” output. Two statistics I want to draw your attention to are:

raw.r - The correlation of each item with the total score. This value is always inflated because the item is correlated with the scale in which it is already included!

r.drop - The correlation of this item with the scale WITHOUT this item (with the scale compiled from the remaining items). This is a more realistic indicator than raw.r of how closely each item is associated with the scale.

Both raw.r and r.drop should be POSITIVE. If these values are negative for any item, you must check whether the item was coded appropriately - for example, whether all counter-indicative items were reverse coded. To help you, the output marks all reverse-coded items with a “-” sign.
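
If you are curious what a coding problem looks like, you can (deliberately) run alpha() on the raw, un-reversed Conduct Problems items. With obeys left un-reversed, its r.drop comes out negative, and psych typically warns that some items are negatively correlated with the total scale. This is only a check, not an analysis to report:

alpha(SDQ[items_conduct], cumulative=TRUE)   # deliberately mis-coded: expect a warning about 'obeys'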

QUESTION 4. Which item has the highest correlation with the remaining items (“r.drop” value)? Look up this item’s text in the SDQ data frame.

Step 4. Split-half reliability

Finally, we will request the split-half reliability coefficients for this scale. I said “coefficients” rather than “coefficient” because there are lots of ways in which the test can be split into two halves, each giving a slightly different estimate of reliability.

You will use the function splitHalf() from package psych on the appropriately coded item set R_conduct. I suggest you set use="complete" to make sure that only complete cases (without missing data) are used, so that the same sample is used in every splitting of the test. We should also set covar=TRUE, to base all estimates on item raw scores and covariances rather than on correlations (the default in splitHalf()). This makes our estimates comparable with the “raw_alpha” we obtained from alpha().

splitHalf(R_conduct, use="complete", covar=TRUE)
## Split half reliabilities  
## Call: splitHalf(r = R_conduct, covar = TRUE, use = "complete")
## 
## Maximum split half reliability (lambda 4) =  0.76
## Guttman lambda 6                          =  0.7
## Average split half reliability            =  0.72
## Guttman lambda 3 (alpha)                  =  0.72
## Guttman lambda 2                          =  0.73
## Minimum split half reliability  (beta)    =  0.65
## Average interitem covariance =  0.12  with median =  0.12

The function prints estimates of reliability based on different splittings of the test. For short tests like the Conduct Problems scale, the function actually performs all possible splits, and calculates and prints their maximum (“Maximum split half reliability (lambda 4)”), minimum (“Minimum split half reliability (beta)”), and average (“Average split half reliability”). You can see them in the output, and see that the estimates actually vary quite a bit, from 0.65 to 0.76 depending on the way in which the test was split.

Lee Cronbach showed in his famous 1951 paper that, theoretically, coefficient alpha is “the mean of all split-half coefficients resulting from different splittings of a test”. But of course, it is much easier to compute alpha from its formula than to run all possible splittings and estimate the split-half coefficients for each of them (and correct these coefficients for unequal test lengths, because with 5 items there will always be a 3-item half and a 2-item half). So, in the output you will see both the “Average split half reliability” = 0.72 and “alpha” = 0.72 (also known as “Guttman lambda 3”). In this case they are exactly the same, which is lucky considering the amount of calculation and correction involved in averaging all split-half coefficients.
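
For reference, the classical “stepping-up” of a split-half correlation to a full-length reliability is the Spearman-Brown formula. A minimal illustration for two equal halves with a hypothetical half-score correlation of 0.60 (the unequal 3-item/2-item splits here need a further length correction, which is exactly why computing alpha from its formula is so much more convenient):

r_half <- 0.60                 # hypothetical correlation between two half-test scores
2 * r_half / (1 + r_half)      # Spearman-Brown: reliability of the full-length test = 0.75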

You can also refer back to the “raw_alpha” result we obtained with the function alpha(): it is the same value as the alpha reported by splitHalf().

This completes the Worked Example.

QUESTION 5. Repeat the steps in the Worked Example for the Hyperactivity facet. Compute the test-retest reliability, and the alpha and split-half reliabilities (for Time 1 only).
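
Here is a sketch of the Hyperactivity analysis to get you started. The item names and the reverse-keyed items (reflect and attends) come from the facet table in Section 5.2; the score names S_hyper and S_hyper2 are only assumptions - substitute whatever names you gave the Time 1 and Time 2 Hyperactivity scores in Exercise 1.

items_hyper <- c("restles","fidgety","distrac","reflect","attends")
R_hyper <- reverse.code(keys=c(1,1,1,-1,-1), SDQ[items_hyper])   # reverse reflect and attends

cor.test(SDQ$S_hyper, SDQ$S_hyper2)              # test-retest (assumed score names)
alpha(R_hyper, cumulative=TRUE)                  # coefficient alpha
splitHalf(R_hyper, use="complete", covar=TRUE)   # split-half reliabilities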

Step 5. Saving your work

It is important that you save all new objects created, because you will need some of them again in Exercise 7. When closing the project, make sure you save the entire workspace, as well as your script.

save(SDQ, file="SDQ_saved.RData")

5.4 Further practice - Reliabilities of the remaining SDQ facets

If you want to practice further, you can pick any of the remaining SDQ facets.

Use the table below to enter your results (2 decimal points is fine).

Facet                Test-retest   Alpha   Split-half (ave)
Emotional Symptoms
Conduct Problems        0.52        0.72        0.72
Hyperactivity
Peer Problems
Pro-social

NOTE. Don’t be surprised if the average split-half coefficient does not always equal alpha.

Based on the analyses of scales that you have completed, try to answer the following questions:

QUESTION 6. Which method for estimating reliability gives the lowest/highest estimate? Why?

QUESTION 7. Which method do you think is best for estimating the precision of measurement in this study?

5.5 Solutions

Q1. The correlation of 0.518 is positive and “large” in terms of effect size, but not very impressive as an estimate of reliability, because it suggests that only about 52% of the variance in the test score is due to the true score, with the rest due to error. It suggests that either Conduct Problems is not a very stable construct, or it is not measured accurately by the SDQ. It is impossible to say which is the case from this one correlation.

Q2. The estimated (raw) alpha is 0.72, which suggests that approximately 72% of the variance in the Conduct Problems score is due to the true score. This is an acceptable level of reliability for a short screening questionnaire.

Q3. The “Reliability if an item is dropped” output shows that every item contributes positively to measurement, because the reliability would fall below the current 0.72 if any of them were dropped. The fall would be largest for the item tantrum (alpha would drop to 0.62), so this item makes the biggest contribution to the reliability of this scale.

Q4. The item tantrum has the highest correlation with the remaining items (“r.drop” = 0.59). This item reads: “I get very angry and often lose my temper”. It is typical that the items with the highest item-total correlations are also those contributing most to alpha.

Q5.

Facet                Test-retest   Alpha   Split-half (ave)
Emotional Symptoms      0.49        0.74        0.70
Conduct Problems        0.52        0.72        0.72
Hyperactivity           0.65        0.76        0.75
Peer Problems           0.51        0.53        0.51
Pro-social              0.53        0.65        0.64

Q6. The test-retest method provides the lowest estimates, which is not surprising considering that the interval between the two testing sessions was one year. Particularly low is the test-retest correlation for Emotional Symptoms, while its internal consistency is higher, which indicates that emotional symptoms are more transient at this age than, for example, Hyperactivity.

Q7. The substantial differences between the test-retest and alpha estimates for all but one scale (Peer Problems) suggest that the test-retest method likely underestimates the reliability, due to instability of the measured constructs over such a long interval (one year). Alpha and split-half coefficients are therefore more appropriate as estimates of reliability here. Alpha is preferable to individual split-half coefficients, since the latter vary widely depending on how the test is split.

For Peer Problems, both test-retest and alpha give similar results, so the low test-retest cannot be interpreted as necessarily low stability - it may be that the construct is relatively stable but not measured very accurately at each time point.