Week 4 Model fitting

This week, we will fit model to real data and estimate parameters. To do so, we need three elements:
1. Data: The first experiment of Ratcliff and Rouder (1998). This can be downloaded from the R package “rtdists”.
2. Model: Drift-Diffusion model, which is programmed in the R package “rtdists”. The diffusion model usually requires at least five parameters (v, a, t0, z, s). To simplify the model fitting and parameter interpretation, we will focus on three free parameters and keep other parameters fixed.
3. Deviance function: The function is needed to “optimize” the parameter values and quantify discrepancy between model predictions and empirical data.

The first experiment of Ratcliff and Rouder (1998) required subjects to decide whether the overall brightness of pixel arrays displayed on a computer monitor was “high” or “low”.In addition, two main manipulations were implemented:
Strength: Proportion of white pixels (i.e., white pixel↑, the strength of brightness ↑)
Instruction: Speedy decision or Accurate decision.

If you want to know even more detail about the experiment and model, please check the links below:
- https://cran.r-project.org/web/packages/rtdists/rtdists.pdf
- http://pcl.missouri.edu/sites/default/files/ratcliff.rouder.1998.pdf

The code today is adapted from Mikhail Spektor’s, Laura Fontanesi’s (https://github.com/laurafontanesi/rlssm_R_workshop) and Farrell and Lewandowsky, 2018.

4.1 Packages and Data

4.1.1 Check and Download packages

As usual, setup your environment first.
- Change working directory to the ~/Desktop/CMcourse
- Open a R script and save it as week4 in the current folder.
- Write few comments on the first few rows of your R script, for instance,

# Week 4: Fitting models
# First and Last name
# date

To start the new project, clear working environment with rm(list=ls()).

# clear working environment
rm(list=ls())

Check if you have installed packages

# Check if you have "rtdists" package
require(rtdists)
require(ggplot2)

If the warning message shows “there is no package called rtdists”, then download rtdists package that contains few sequential-sampling models. Otherwise, you can skip this step. Two methods for package installation in R: Method 1:

install.packages("rtdists") # Installation might require restarting R
install.packages("ggplot2") # Installation might require restarting R

Method 2: Go to Tools -> Install Packages… -> select Repository (CRAN) in the Install from -> Specify the package name (e.g., rtdists) in Package and then click Install
- Important! Open packages that we need for the project at beginning of the script.

# Open rtdists package
library(rtdists)
library(ggplot2)

Once you received a package, open it and read the manual carefully. In particular, you need to know what inputs are required and what outputs are produced.

4.1.2 Download and preprocess Data

Before we start fitting the model to a dataset, it is important to understand data structure and model-free results. Here are the reasons:
1. data structure tells us which rows/columns represent variables of interest and how variables are stored (e.g., RT in ms or in seconds).
2. model-free results tell us how data looks like in general.

4.1.3 Load data

# Load data from the "rtdists" package
data(rr98)
# Double-click rr98 in your Environment panel (right-top window)
# In the dataset, there are 3 participants, who completed ~ 4000 trials in each condition (speed/accuracy).

4.1.4 Preprocess data

Sometimes we should reorganize the original dataset, so that we can efficiently and correctly use the value in the data for model fitting.

# Reorganize data
rr98 <- rr98[!rr98$outlier,]  # remove outliers, as in original paper
rr98[,'RT'] <- rr98$rt
rr98[,'accuracy'] = 0 # add a new column labeled as accuracy, where incorrect responses are coded as 0
rr98[rr98$correct, 'accuracy'] = 1 # and correct responses are coded as 1
rr98$choice = rr98$response
rr98$response = as.character(rr98$response)
rr98[rr98$correct == 1, "response"] = "upper"
rr98[rr98$correct == 0, "response"] = "lower"
head(rr98)

4.2 Descriptive statistic anlaysis (Model-free analysis)

Before the modeling exercise, check your data and see if you can already find something interesting.
In the last few weeks, we computed aggregate data for a group (week1) or for only one subject (week2 and week3). What if we have multiple subjects and multiple conditions? In this case, you might need aggregate function and summarize here we would like to compute mean rt and mean accuracy for each subject regardless of manipulations.

# calculate mean RT and mean accuracy for each subject
agg.id = aggregate(cbind(rr98$RT,rr98$accuracy), # targeted dependent variables
                by = list(rr98$id), # group by Independent variable of interest
                FUN = mean)
colnames(agg.id) <- c("SubjID", "Mean RT", "Mean accuracy")
agg.id

# Next, calculate mean RT and mean accuracy for each subject AND each condition
agg.id_instruct = aggregate(cbind(rr98$RT,rr98$accuracy), # targeted dependent variables
                by = list(rr98$id, rr98$instruction), # group by Independent variables of interest
                FUN = mean)
colnames(agg.id_instruct) <- c("SubjID", "Instruction", "Mean RT", "Mean accuracy")
agg.id_instruct

4.3 Fitting

4.3.1 Likelihood funciton (week5 for details)

# Create a deviance function to quantify the difference between predicted and actual data given model and parameter values
# Here, we compute a deviance as log likelihood: how likely the set of parameter values is true (or what is the probability of observing this RT given data and model)?
# Note: The ddiffusion model is the model, which requires rt, response and DDM-related parameters
diffusion_deviance <- function(a, v, t0, z, s, dat){
  log_lik <- sum(log(ddiffusion(rt=dat[dat["accuracy"] == 1, "RT"],  # dat
                                response = "upper", a=a, v=v, t0=t0, z=z, s=s)))
  log_lik <- log_lik +
    sum(log(ddiffusion(rt=dat[dat["accuracy"] == 0, "RT"],
                       response="lower", a=a, v=v, t0=t0, z=z, s=s)))
  return(-2*log_lik) # negative log likelihood (smaller is better)
}

# Always good to shorten the input. Here parameter values are stored in "pars". 
diffusion_deviance2 <- function(pars, dat)
{
  log_lik <- sum(log(ddiffusion(rt=dat[dat["accuracy"] == 1, "rt"],
                                response="upper",  v=pars[1], a=pars[2], t0=pars[3], z=pars[2]/2, s= 1)))
  log_lik <- log_lik +
    sum(log(ddiffusion(rt=dat[dat["accuracy"] == 0, "rt"],
                       response="lower",  v=pars[1], a=pars[2], t0=pars[3], z=pars[2]/2, s= 1)))
  if (any((-2*log_lik == 0))) return(1e6)
  return(-2*log_lik)
}

# Try random numbers for required parameters. Please note that the order of parameter is important. Computer uses these parameter values by order as you determined in the function. 
driftrate <- 0.05
threshold <- 1.2
nonDMtime <- 0.1

diffusion_deviance2(pars = c(driftrate, threshold, nonDMtime), dat=rr98)

4.3.2 Parameter estimation

So the issue here is, we cannot try different sets of parameter values until we find the one with lowest discrepancy.
To solve this issue, we have to determine the method of searching a set of parameters when we ask computer to help us. Specifically, computer needs to know:
1. starting parameter values
2. the boundary for each parameter
3. optimization method

We will use non-linear minimization with boundaries function nlminb to search the “optimal” set of parameter values.

# set up starting parameter values
starting_values <- c(runif(n=1, min=-5,  max=5),
                     runif(n=1, min=0.5,  max=5),
                     0.1)
names(starting_values) <- c("v","a","t0")
starting_values

# set up lower and upper boundary for each parameter
lb <-c(-5,0.5,0.1) # Lower Boundary for each parameter. Order is important. Here, drift rate, threshold and non-decision time
ub <-c(5,5,0.5) # Upper Boundary for each parameter. Order is important. Here, drift rate, threshold and non-decision time

# decide the optimization method. Here we selected nlminb
fit_result <- nlminb(start=starting_values, 
                     objective=diffusion_deviance2,
                     lower=lb,# the lower boundary for each parameter. Order is important
                     upper=ub, # the upper boundary for each parameter. Order is important
                     dat=rr98)
print(fit_result$par) # report the estimated parameters in the Console window

4.3.3 Parameter estimation for multiple subjects

So far, we estimated the parameters regardless of subjects and manipulations. To quantify individual differences in information processing, we can separately estimate parameters for each subject.

# Initialization
n<-0

# Pre-allocation (i.e., create an empty data frame and name each column)
results<-setNames(data.frame(matrix(ncol = 7, nrow = 0)), c("name", "Mean RT (ms)","Mean Accuracy","v", "a","t0 (ms)", "nll"))

# Main for loop to estimate parameters for each subject
for (pp in unique(rr98$id)) {
  print(pp) # display the current subject in the Console window, so that you know the current progress
  
  data_pp = rr98[rr98$id == pp,]
  
  fit <- nlminb(start=starting_values, 
                       objective=diffusion_deviance2,
                       lower=lb,# the lower boundary for each parameter. Order is important
                       upper=ub, # the upper boundary for each parameter. Order is important
                       dat=data_pp)
  deviance <- fit[["objective"]]
  params <- fit[["par"]]
  
  # store result for the first to the last subject
    n <- n+1 #update the current number of subject
  results[n,] <- c(pp,                      #subject's name
                   round(agg.id[n,2]*1000), #Mean RT (ms)
                   agg.id[n,3],             #Mean Accuracy
                   round(params[1],2),      #drift rate
                   round(params[2],2),      #threshold 
                   round(params[3]*1000),   #non-decision time (ms)
                   deviance)                #negative log likelihood
}
results

4.4 Exercise

Task: Separately perform the parameter estimation for each subject and each instruction condition (i.e., speed and accuracy). How to do it? What do you observe?
Note: The results might be different from the original paper (Ratcliff and Rouder, 1998). The main reason is they have different assumption about drift rate (i.e., drift rates are intensity-dependent).

4.5 Answers

# set up starting parameter values
starting_values <- c(runif(n=1, min=-5,  max=5),
                     runif(n=1, min=0.5,  max=5),
                     0.1)
names(starting_values) <- c("v","a","t0")
starting_values

# set up lower and upper boundary for each parameter
lb <-c(-5,0.5,0.1) # Lower Boundary for each parameter. Order is important. Here, drift rate, threshold and non-decision time
ub <-c(5,5,0.5) # Upper Boundary for each parameter. Order is important. Here, drift rate, threshold and non-decision time
n<-0
results<-setNames(data.frame(matrix(ncol = 8, nrow = 0)), c("name", "Instruction", "Mean RT (ms)","Mean Accuracy","v", "a","t0 (ms)", "nll"))

for (pp in unique(rr98$id)) {
  for (ins in unique(rr98$instruction)) {
    print(pp)
    n <- n+1
    data_pp = rr98[rr98$id == pp & rr98$instruction == ins,]
    
    fit <- nlminb(start=starting_values, 
                  objective=diffusion_deviance2,
                  lower=lb,# the lower boundary for each parameter. Order is important
                  upper=ub, # the upper boundary for each parameter. Order is important
                  dat=data_pp)
    deviance <- fit[["objective"]]
    params <- fit[["par"]]
  results[n,] <- c(pp,
                   ins,
                   round(agg.id_instruct[agg.id_instruct$SubjID == pp & agg.id_instruct$Instruction == ins,3]*1000),
                   agg.id_instruct[agg.id_instruct$SubjID == pp & agg.id_instruct$Instruction == ins,4],
                   round(params[1],2),
                   round(params[2],2),
                   round(params[3]*1000), 
                   deviance)
  }
}

results