Exercise 4 Thurstonian scaling

Data file JobFeatures.txt
R package psych

4.1 Objectives

The purpose of this exercise is to learn how to perform Thurstonian scaling of preferential choice data. The objective of Thurstonian scaling is to estimate means of psychological values (or utilities) of the objects under comparison. That is, given the observed rank orders of the objects, we want to know what utility distributions could give rise to such orderings. To estimate unobserved utilities from observed preferences, we need to make assumptions about the distributions of utilities. In this analysis, normal distributions with equal variance are assumed (so-called Thurstone Case V scaling).
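
To make the model concrete, here is a minimal simulation sketch of Case V (the feature names and mean values are illustrative, not estimates from any data): each participant draws a utility for every object from a normal distribution with an object-specific mean and a common variance, and the observed ranking simply orders those draws.

set.seed(1)                                  # for reproducibility
means <- c(Development = 1.0, Autonomy = 0.5, Competition = 0.0)  # illustrative utility means
utilities <- rnorm(length(means), mean = means, sd = 1)           # equal variances (Case V)
names(utilities) <- names(means)
rank(-utilities)                             # rank 1 = highest utility = most preferred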

4.2 Study of preferences for job features

The data used in this exercise were collected as part of a study of work motivation by Dr Ilke Inceoglu. In this study, N=1079 participants were asked to rank order the following nine job features according to the extent to which they would want them in their ideal job:

1.  Supportive Environment (Support)
2.  Challenging Work (Challenge)
3.  Career Progression (Career)
4.  Ethical Job (Ethics)
5.  Job Control, Personal Impact (Autonomy)
6.  Development (Development)
7.  Social Interaction (Interaction)
8.  Competitive Environment (Competition)
9.  Pleasant environment and Safety (Safety)

A rank of 1 given to a job feature means that this feature was the most important for that participant; a rank of 9 means that it was the least important.

4.3 Worked Example - Thurstonian scaling of ideal job features

To complete this exercise, you need to repeat the analysis from the worked example below.

Step 1. Reading and examining data

To read the data file JobFeatures.txt into RStudio, I will use read.delim(), a basic R function for reading text files (in this case, a tab-delimited file). I will read the data into a new object (data frame), which I will call (arbitrarily) JobFeatures.

JobFeatures <- read.delim("JobFeatures.txt")

You should see a new object JobFeatures appear on the Environment tab (usually the top-right RStudio panel). This tab shows any objects currently in the workspace. According to the description in the Environment tab, the data set contains 1079 obs. of 9 variables.
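
You can also check this from the Console with the basic functions dim() and str(), which report the same information:

dim(JobFeatures)  # number of rows and columns
str(JobFeatures)  # compact overview of all 9 variables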

You can click on the JobFeatures object. This will run the View(JobFeatures) command in the Console, which in turn will open the data frame in its own tab named JobFeatures. Examine the data by scrolling down and across. The data set contains 1079 rows (cases, observations) on 9 variables: Support, Challenge, Career, etc. You can obtain the names of all variables in the data frame JobFeatures by using the basic function names():

names(JobFeatures)
## [1] "Support"     "Challenge"   "Career"      "Ethics"      "Autonomy"   
## [6] "Development" "Interaction" "Competition" "Safety"

Each row represents one participant’s ranking of the 9 job features. The function head() will display the first few participants with their rankings (and tail() will display the last few):

head(JobFeatures)
##   Support Challenge Career Ethics Autonomy Development Interaction Competition
## 1       8         3      4      5        2           6           1           7
## 2       7         5      1      6        2           8           3           9
## 3       5         8      1      9        6           2           3           4
## 4       7         6      8      9        3           4           2           5
## 5       1         4      8      3        9           2           6           7
## 6       6         1      3      7        8           5           2           4
##   Safety
## 1      9
## 2      4
## 3      7
## 4      1
## 5      5
## 6      9

Step 2. Performing Thurstonian scaling

Before doing the analysis, load the psych package to enable its features.

library(psych)

To perform Thurstonian scaling, we will use the function thurstone(). For details about how this function works, run the following code:

help("thurstone")

You will get the full documentation displayed in the Help tab. As described in the documentation, this function has the following general form:

thurstone(x, ranks = FALSE, digits = 2)

Each component in the parentheses is called an “argument”, so the thurstone() function has three arguments. The first argument (x) is the data to be scaled; it must be supplied because there is no default value. The second argument specifies the type of data you supply: either actual ranks, or percentages of preferences summarised in a square matrix. By default, the summarised percentages are assumed (ranks=FALSE). The third argument is the number of digits (decimal places) reported for the goodness of fit index (the default is 2). The second and third arguments are optional because they have default values. If you are happy with the defaults, you do not even need to include these arguments in your syntax.
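
For example, digits = 2 merely restates the default, so the two calls below are equivalent (ranks = TRUE is needed for our data, as explained next):

thurstone(JobFeatures, ranks = TRUE)
thurstone(x = JobFeatures, ranks = TRUE, digits = 2)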

Since our data are raw rankings, we need to set ranks=TRUE. Type and run the following code from your Script. It may take a little while, but you will eventually get the results in your Console!

thurstone(JobFeatures, ranks=TRUE)
## Thurstonian scale (case 5) scale values 
## Call: thurstone(x = JobFeatures, ranks = TRUE)
## [1] 0.97 0.93 0.91 0.92 0.60 1.04 0.63 0.00 0.23
## 
##  Goodness of fit of model   1

Let’s interpret this output. First, the title of the analysis and the full function call are printed. Next, the estimated population means of the psychological values (or utilities) for the 9 job features are printed (starting with [1], which simply indicates the position of the first value in the output row). To make sense of the reported means, you have to match them to the nine features. This is easy, as the means are reported in the order of the variables in the data frame.

Every mean reflects how valuable or desirable, on average, the particular job feature is. A high mean indicates that participants value this job feature highly in relation to the other features. Because preferences are always relative, however, it is impossible to uniquely identify all the means (for an explanation, see McDonald (1999), chapter 18, “Pair Comparison Model”). Therefore, one of the means has to be fixed to some arbitrary value. It is customary to fix the mean of the least preferred feature to 0; all the remaining means are then positive.
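
To see why only differences between the means matter, note that the model predicts each pairwise preference from the difference between two mean utilities, so adding the same constant to all means leaves every prediction unchanged. A quick illustration using the two extreme means reported above (the shift of 5 is arbitrary):

pnorm(1.04 - 0.00)              # predicted preference for Development over Competition
pnorm((1.04 + 5) - (0.00 + 5))  # identical: shifting every mean by a constant changes nothing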

QUESTION 1. Try to interpret the reported means. Which job feature was least wanted? What was the utility mean of this feature? What was the most wanted feature, and what was its utility mean? Looking at the mean values, how would you interpret the relative values of Autonomy and Interaction? What else can you say about the relative utility values of the job features?

Step 3. Calculating pairwise preference proportions

Now, let’s run the same function but assign its results to a new object called (arbitrarily) scaling. Type and execute the following command:

scaling <- thurstone(JobFeatures, ranks = TRUE)

This time, there is no output, and it looks as if nothing happened. However, the same analysis was performed; its results are now stored in scaling rather than printed out. To see what is stored in scaling, call the function ls(), which will list all of the object’s constituents:

ls(scaling)
## [1] "Call"     "choice"   "GF"       "residual" "scale"

You can check what is stored inside any of these constituent parts by referring to it by its full name - starting with scaling followed by the $ sign, for example:

scaling$scale
## [1] 0.97 0.93 0.91 0.92 0.60 1.04 0.63 0.00 0.23

This will print out the 9 utility means, which we already examined.
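
To avoid matching the means to the features by eye, you can attach the variable names and sort the means (an optional convenience step):

sort(setNames(scaling$scale, names(JobFeatures)))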

scaling$choice
##            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
##  [1,] 0.5000000 0.4735867 0.4745134 0.4745134 0.3632994 0.5078777 0.3605190
##  [2,] 0.5264133 0.5000000 0.4670992 0.4949027 0.3753475 0.5301205 0.3623726
##  [3,] 0.5254866 0.5329008 0.5000000 0.4986098 0.3753475 0.5236330 0.3901761
##  [4,] 0.5254866 0.5050973 0.5013902 0.5000000 0.3632994 0.5180723 0.3781279
##  [5,] 0.6367006 0.6246525 0.6246525 0.6367006 0.5000000 0.6737720 0.5050973
##  [6,] 0.4921223 0.4698795 0.4763670 0.4819277 0.3262280 0.5000000 0.2928638
##  [7,] 0.6394810 0.6376274 0.6098239 0.6218721 0.4949027 0.7071362 0.5000000
##  [8,] 0.7961075 0.8341057 0.8127896 0.8090825 0.7126969 0.8526413 0.7775718
##  [9,] 0.7747915 0.7173309 0.7405005 0.7432808 0.6598703 0.8044486 0.6950880
##            [,8]      [,9]
##  [1,] 0.2038925 0.2252085
##  [2,] 0.1658943 0.2826691
##  [3,] 0.1872104 0.2594995
##  [4,] 0.1909175 0.2567192
##  [5,] 0.2873031 0.3401297
##  [6,] 0.1473587 0.1955514
##  [7,] 0.2224282 0.3049120
##  [8,] 0.5000000 0.6135310
##  [9,] 0.3864690 0.5000000

This will print a 9x9 matrix containing the proportions of participants in the sample who preferred the feature in the column to the feature in the row. This is a summary of the rank preferences that we did not have at the beginning, but R conveniently calculated it for us. In the “choice” matrix, the rows and columns follow the order of the variables in the original file.
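
Because the rows and columns follow the order of the variables, you can label the matrix to make it easier to read (an optional step):

dimnames(scaling$choice) <- list(names(JobFeatures), names(JobFeatures))
round(scaling$choice, 2)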

Let’s examine the “choice” matrix more carefully. Look for the entry in row [8] and column [6]. This value, 0.8526413, represents the proportion of participants who preferred the 6th feature, Development, to the 8th feature, Competition, and it is the largest value in the “choice” matrix:

max(scaling$choice)
## [1] 0.8526413

Of all pairs of features, this one shows the most decisive preference for one feature over the other.
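
You can also verify this value directly from the raw rankings: a participant preferred Development to Competition whenever they gave Development the lower (better) rank. Assuming the column-over-row convention described above, the sample proportion should match the matrix entry:

mean(JobFeatures$Development < JobFeatures$Competition)  # should equal scaling$choice[8, 6]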

QUESTION 2. How does the above result for choices of the 6th feature, Development, over the 8th feature, Competition, correspond to the estimated utility means?

Step 4. Assessing model fit

Now, let’s ask for model residuals:

scaling$residual
##               [,1]          [,2]         [,3]          [,4]         [,5]
##  [1,]  0.000000000  0.0133479521  0.001732662  0.0077151838 -0.005872352
##  [2,] -0.013347952  0.0000000000  0.022201895  0.0003878924 -0.005625238
##  [3,] -0.001732662 -0.0222018946  0.000000000  0.0073806440  0.004543318
##  [4,] -0.007715184 -0.0003878924 -0.007380644  0.0000000000  0.010887738
##  [5,]  0.005872352  0.0056252384 -0.004543318 -0.0108877381  0.000000000
##  [6,] -0.020341170 -0.0111159497 -0.028230409 -0.0278455199  0.005140288
##  [7,] -0.008084655 -0.0186493603 -0.001106985 -0.0074004791 -0.006785811
##  [8,]  0.036657121 -0.0096726558  0.004628757  0.0122844463  0.012984302
##  [9,] -0.007039854  0.0403013911  0.008671076  0.0106467596 -0.017008932
##               [,6]         [,7]         [,8]         [,9]
##  [1,]  0.020341170  0.008084655 -0.036657121  0.007039854
##  [2,]  0.011115950  0.018649360  0.009672656 -0.040301391
##  [3,]  0.028230409  0.001106985 -0.004628757 -0.008671076
##  [4,]  0.027845520  0.007400479 -0.012284446 -0.010646760
##  [5,] -0.005140288  0.006785811 -0.012984302  0.017008932
##  [6,]  0.000000000  0.049380030  0.002756144  0.015651117
##  [7,] -0.049380030  0.000000000  0.042051948  0.041174300
##  [8,] -0.002756144 -0.042051948  0.000000000 -0.021145626
##  [9,] -0.015651117 -0.041174300  0.021145626  0.000000000

This will print a 9x9 matrix containing the differences between the expected proportions (the proportions preferring the feature in the column to the feature in the row that would be expected if utilities were normally distributed around the means scaled as above) and the observed proportions (the choice matrix). Residuals are a direct way of measuring whether a model (in this case, the model of unobserved utilities that Thurstone proposed) “fits” the observed data. Small residuals (near zero) indicate small discrepancies between the observed choices and the choices predicted by the model, which means that the model we adopted is rather good.
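
As a rough check on this description, you can rebuild the expected proportions from the scale values yourself. This is only a sketch: it assumes the Case V prediction pnorm(column mean - row mean), and the match will be approximate because the reported scale values are rounded:

model <- pnorm(outer(scaling$scale, scaling$scale, function(mu_row, mu_col) mu_col - mu_row))
round(model - scaling$choice, 3)  # should be close to scaling$residual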

Finally, let’s ask for a Goodness of Fit index:

scaling$GF
## [1] 0.9987548

According to the documentation for the thurstone() function, GF is calculated as:

1 - sum(squared residuals) / sum(squared observed proportions)

If the residuals are close to zero, then the ratio of the sum of their squares to the sum of squared observed proportions will also be close to zero. Therefore, the goodness of fit index of a well-fitting model should be close to 1.
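
You can verify this from the stored matrices, reading the formula as one minus the ratio of the sum of squared residuals to the sum of squared observed proportions:

1 - sum(scaling$residual^2) / sum(scaling$choice^2)  # should reproduce scaling$GF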

The residuals in our analysis are all very small, which indicates a close correspondence between the observed choices (proportions of preference for one feature over another) and the choices predicted by the model. The small residuals are reflected in the GF index, which is very close to 1. Overall, Thurstone’s model fits the job-features data well.

4.4 Solutions

Q1. The smallest mean (0.00) corresponds to the 8th feature, Competition, and the highest mean (1.04) corresponds to the 6th feature, Development. This means that a Competitive environment was least wanted, and opportunities for personal Development were most wanted by people in their ideal job. The other features were scaled somewhere between these two: Safety had a low mean (0.23), barely higher than the 0 for Competition, whereas Support, Challenge, Career and Ethics had similarly high means (around 0.9). Autonomy and Interaction had similar moderate means, around 0.6.

Q2. The most decisive preference in terms of the proportion of people choosing one feature over the other must correspond to the largest difference between the utilities (the 6th feature, Development, must have a much higher mean utility than the 8th feature, Competition). This is indeed in line with the results for the utility means, where the Development mean was the highest at 1.04 and the Competition mean was the lowest at 0.
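
As a numerical check of this link, the predicted proportion preferring Development to Competition under the scaled model is the normal probability of the difference between their means, which is close to the observed choice proportion:

pnorm(1.04 - 0.00)  # about 0.85, close to the observed 0.8526413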