25.1 Inter-rater reliability methods
These methods quantify the degree of agreement among the choices made by two or more independent judges (raters).
The examples below use the irr package. Other packages are:

- vcd for visualization
- DescTools
25.1.1 Percent Agreement
\[ \text{Percent agreement} = \frac{\text{number of agreements}}{\text{total number of cases}} \times 100 \]
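The output below uses the diagnoses data that ships with the irr package (30 psychiatric patients classified by 6 raters). A minimal sketch of the calls that would produce it, assuming irr is installed:

# load the irr package (also loads its lpSolve dependency)
library(irr)
# psychiatric diagnoses of 30 patients rated by 6 raters
data(diagnoses)
# inspect the first 10 subjects
head(diagnoses, 10)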
## Loading required package: lpSolve
## rater1 rater2 rater3
## 1 4. Neurosis 4. Neurosis 4. Neurosis
## 2 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
## 3 2. Personality Disorder 3. Schizophrenia 3. Schizophrenia
## 4 5. Other 5. Other 5. Other
## 5 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
## 6 1. Depression 1. Depression 3. Schizophrenia
## 7 3. Schizophrenia 3. Schizophrenia 3. Schizophrenia
## 8 1. Depression 1. Depression 3. Schizophrenia
## 9 1. Depression 1. Depression 4. Neurosis
## 10 5. Other 5. Other 5. Other
## rater4 rater5 rater6
## 1 4. Neurosis 4. Neurosis 4. Neurosis
## 2 5. Other 5. Other 5. Other
## 3 3. Schizophrenia 3. Schizophrenia 5. Other
## 4 5. Other 5. Other 5. Other
## 5 4. Neurosis 4. Neurosis 4. Neurosis
## 6 3. Schizophrenia 3. Schizophrenia 3. Schizophrenia
## 7 3. Schizophrenia 5. Other 5. Other
## 8 3. Schizophrenia 3. Schizophrenia 4. Neurosis
## 9 4. Neurosis 4. Neurosis 4. Neurosis
## 10 5. Other 5. Other 5. Other
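The percent agreement itself can be computed with agree() from irr, which by default counts only exact matches (tolerance = 0), consistent with the output below:

# percent agreement across all 6 raters, exact matches only
agree(diagnoses)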
## Percentage agreement (Tolerance=0)
##
## Subjects = 30
## Raters = 6
## %-agree = 16.7
25.1.2 Cohen’s Kappa
\[ \kappa = \frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e} \]
where

- \(p_o\) = relative observed agreement among raters
- \(p_e\) = hypothetical probability of chance agreement

Cohen's kappa (Cohen 1960) measures strict agreement between raters (disagreements receive no partial credit) and is appropriate for two nominal or two ordinal variables.
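To make the formula concrete, the sketch below computes \(p_o\) and \(p_e\) by hand for rater1 and rater2 of the diagnoses data (assumed loaded from irr) and plugs them into the formula above; the result should match the unweighted kappa2() output further below.

# cross-tabulate the two raters (both factors share the same 5 levels)
tab <- table(diagnoses$rater1, diagnoses$rater2)
# observed agreement: proportion of subjects on the diagonal
p_o <- sum(diag(tab)) / sum(tab)
# chance agreement: sum of products of the marginal proportions
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
# Cohen's kappa
(p_o - p_e) / (1 - p_e)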
Based on the guidelines of Landis and Koch (1977), kappa can be interpreted as follows:
| Kappa | Interpretation |
|---|---|
| 0.01 – 0.20 | slight agreement |
| 0.21 – 0.40 | fair agreement |
| 0.41 – 0.60 | moderate agreement |
| 0.61 – 0.80 | substantial agreement |
| 0.81 – 1.00 | almost perfect or perfect agreement |
# Unweighted kappa for 2 nominal or 2 ordinal categorical variables
kappa2(diagnoses[, c("rater1", "rater2")], weight = "unweighted") # strict agreement only; no partial credit for near-misses
## Cohen's Kappa for 2 Raters (Weights: unweighted)
##
## Subjects = 30
## Raters = 2
## Kappa = 0.651
##
## z = 7
## p-value = 2.63e-12
# Weighted kappa for ordinal scales (allows partial agreement)
kappa2(diagnoses[, c("rater1", "rater2")], weight = "equal") # linear weights of the differences
## Cohen's Kappa for 2 Raters (Weights: equal)
##
## Subjects = 30
## Raters = 2
## Kappa = 0.633
##
## z = 5.43
## p-value = 5.52e-08
kappa2(diagnoses[, c("rater1", "rater2")], weight = "squared") # squared weights of the differences
## Cohen's Kappa for 2 Raters (Weights: squared)
##
## Subjects = 30
## Raters = 2
## Kappa = 0.655
##
## z = 3.91
## p-value = 9.37e-05
A p-value less than 0.05 means that the raters agree more than would be expected by chance.
References
Cohen, Jacob. 1960. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement 20 (1): 37–46. https://doi.org/10.1177/001316446002000104.
Landis, J. Richard, and Gary G. Koch. 1977. “The Measurement of Observer Agreement for Categorical Data.” Biometrics 33 (1): 159. https://doi.org/10.2307/2529310.