Week 4 Model fitting
This week, we will fit model to real data and estimate parameters. To do so, we need three elements:
1. Data: The first experiment of Ratcliff and Rouder (1998). This can be downloaded from the R package “rtdists”.
2. Model: Drift-Diffusion model, which is programmed in the R package “rtdists”. The diffusion model usually requires at least five parameters (v, a, t0, z, s). To simplify the model fitting and parameter interpretation, we will focus on three free parameters and keep other parameters fixed.
3. Deviance function: The function is needed to “optimize” the parameter values and quantify discrepancy between model predictions and empirical data.
The first experiment of Ratcliff and Rouder (1998) required subjects to decide whether the overall brightness of pixel arrays displayed on a computer monitor was “high” or “low”.In addition, two main manipulations were implemented:
Strength: Proportion of white pixels (i.e., white pixel↑, the strength of brightness ↑)
Instruction: Speedy decision or Accurate decision.
If you want to know even more detail about the experiment and model, please check the links below:
- https://cran.r-project.org/web/packages/rtdists/rtdists.pdf
- http://pcl.missouri.edu/sites/default/files/ratcliff.rouder.1998.pdf
The code today is adapted from Mikhail Spektor’s, Laura Fontanesi’s (https://github.com/laurafontanesi/rlssm_R_workshop) and Farrell and Lewandowsky, 2018.
4.1 Packages and Data
4.1.1 Check and Download packages
As usual, setup your environment first.
- Change working directory to the ~/Desktop/CMcourse
- Open a R script and save it as week4 in the current folder.
- Write few comments on the first few rows of your R script, for instance,
# Week 4: Fitting models
# First and Last name
# date
- To start the new project, clear working environment with
rm(list=ls())
.
# clear working environment
rm(list=ls())
- Check if you have installed packages
# Check if you have "rtdists" package
require(rtdists)
require(ggplot2)
If the warning message shows “there is no package called rtdists”, then download rtdists package that contains few sequential-sampling models. Otherwise, you can skip this step. Two methods for package installation in R: Method 1:
install.packages("rtdists") # Installation might require restarting R
install.packages("ggplot2") # Installation might require restarting R
Method 2:
Go to Tools -> Install Packages… -> select Repository (CRAN) in the Install from -> Specify the package name (e.g., rtdists
) in Package and then click Install
- Important! Open packages that we need for the project at beginning of the script.
# Open rtdists package
library(rtdists)
library(ggplot2)
Once you received a package, open it and read the manual carefully. In particular, you need to know what inputs are required and what outputs are produced.
4.1.2 Download and preprocess Data
Before we start fitting the model to a dataset, it is important to understand data structure and model-free results. Here are the reasons:
1. data structure tells us which rows/columns represent variables of interest and how variables are stored (e.g., RT in ms or in seconds).
2. model-free results tell us how data looks like in general.
4.1.3 Load data
# Load data from the "rtdists" package
data(rr98)
# Double-click rr98 in your Environment panel (right-top window)
# In the dataset, there are 3 participants, who completed ~ 4000 trials in each condition (speed/accuracy).
4.1.4 Preprocess data
Sometimes we should reorganize the original dataset, so that we can efficiently and correctly use the value in the data for model fitting.
# Reorganize data
<- rr98[!rr98$outlier,] # remove outliers, as in original paper
rr98 'RT'] <- rr98$rt
rr98[,'accuracy'] = 0 # add a new column labeled as accuracy, where incorrect responses are coded as 0
rr98[,$correct, 'accuracy'] = 1 # and correct responses are coded as 1
rr98[rr98$choice = rr98$response
rr98$response = as.character(rr98$response)
rr98$correct == 1, "response"] = "upper"
rr98[rr98$correct == 0, "response"] = "lower"
rr98[rr98head(rr98)
4.2 Descriptive statistic anlaysis (Model-free analysis)
Before the modeling exercise, check your data and see if you can already find something interesting.
In the last few weeks, we computed aggregate data for a group (week1) or for only one subject (week2 and week3). What if we have multiple subjects and multiple conditions? In this case, you might need aggregate
function and summarize here we would like to compute mean rt and mean accuracy for each subject regardless of manipulations.
# calculate mean RT and mean accuracy for each subject
= aggregate(cbind(rr98$RT,rr98$accuracy), # targeted dependent variables
agg.id by = list(rr98$id), # group by Independent variable of interest
FUN = mean)
colnames(agg.id) <- c("SubjID", "Mean RT", "Mean accuracy")
agg.id
# Next, calculate mean RT and mean accuracy for each subject AND each condition
= aggregate(cbind(rr98$RT,rr98$accuracy), # targeted dependent variables
agg.id_instruct by = list(rr98$id, rr98$instruction), # group by Independent variables of interest
FUN = mean)
colnames(agg.id_instruct) <- c("SubjID", "Instruction", "Mean RT", "Mean accuracy")
agg.id_instruct
4.3 Fitting
4.3.1 Likelihood funciton (week5 for details)
# Create a deviance function to quantify the difference between predicted and actual data given model and parameter values
# Here, we compute a deviance as log likelihood: how likely the set of parameter values is true (or what is the probability of observing this RT given data and model)?
# Note: The ddiffusion model is the model, which requires rt, response and DDM-related parameters
<- function(a, v, t0, z, s, dat){
diffusion_deviance <- sum(log(ddiffusion(rt=dat[dat["accuracy"] == 1, "RT"], # dat
log_lik response = "upper", a=a, v=v, t0=t0, z=z, s=s)))
<- log_lik +
log_lik sum(log(ddiffusion(rt=dat[dat["accuracy"] == 0, "RT"],
response="lower", a=a, v=v, t0=t0, z=z, s=s)))
return(-2*log_lik) # negative log likelihood (smaller is better)
}
# Always good to shorten the input. Here parameter values are stored in "pars".
<- function(pars, dat)
diffusion_deviance2
{<- sum(log(ddiffusion(rt=dat[dat["accuracy"] == 1, "rt"],
log_lik response="upper", v=pars[1], a=pars[2], t0=pars[3], z=pars[2]/2, s= 1)))
<- log_lik +
log_lik sum(log(ddiffusion(rt=dat[dat["accuracy"] == 0, "rt"],
response="lower", v=pars[1], a=pars[2], t0=pars[3], z=pars[2]/2, s= 1)))
if (any((-2*log_lik == 0))) return(1e6)
return(-2*log_lik)
}
# Try random numbers for required parameters. Please note that the order of parameter is important. Computer uses these parameter values by order as you determined in the function.
<- 0.05
driftrate <- 1.2
threshold <- 0.1
nonDMtime
diffusion_deviance2(pars = c(driftrate, threshold, nonDMtime), dat=rr98)
4.3.2 Parameter estimation
So the issue here is, we cannot try different sets of parameter values until we find the one with lowest discrepancy.
To solve this issue, we have to determine the method of searching a set of parameters when we ask computer to help us. Specifically, computer needs to know:
1. starting parameter values
2. the boundary for each parameter
3. optimization method
We will use non-linear minimization with boundaries function nlminb
to search the “optimal” set of parameter values.
# set up starting parameter values
<- c(runif(n=1, min=-5, max=5),
starting_values runif(n=1, min=0.5, max=5),
0.1)
names(starting_values) <- c("v","a","t0")
starting_values
# set up lower and upper boundary for each parameter
<-c(-5,0.5,0.1) # Lower Boundary for each parameter. Order is important. Here, drift rate, threshold and non-decision time
lb <-c(5,5,0.5) # Upper Boundary for each parameter. Order is important. Here, drift rate, threshold and non-decision time ub
# decide the optimization method. Here we selected nlminb
<- nlminb(start=starting_values,
fit_result objective=diffusion_deviance2,
lower=lb,# the lower boundary for each parameter. Order is important
upper=ub, # the upper boundary for each parameter. Order is important
dat=rr98)
print(fit_result$par) # report the estimated parameters in the Console window
4.3.3 Parameter estimation for multiple subjects
So far, we estimated the parameters regardless of subjects and manipulations. To quantify individual differences in information processing, we can separately estimate parameters for each subject.
# Initialization
<-0
n
# Pre-allocation (i.e., create an empty data frame and name each column)
<-setNames(data.frame(matrix(ncol = 7, nrow = 0)), c("name", "Mean RT (ms)","Mean Accuracy","v", "a","t0 (ms)", "nll"))
results
# Main for loop to estimate parameters for each subject
for (pp in unique(rr98$id)) {
print(pp) # display the current subject in the Console window, so that you know the current progress
= rr98[rr98$id == pp,]
data_pp
<- nlminb(start=starting_values,
fit objective=diffusion_deviance2,
lower=lb,# the lower boundary for each parameter. Order is important
upper=ub, # the upper boundary for each parameter. Order is important
dat=data_pp)
<- fit[["objective"]]
deviance <- fit[["par"]]
params
# store result for the first to the last subject
<- n+1 #update the current number of subject
n <- c(pp, #subject's name
results[n,] round(agg.id[n,2]*1000), #Mean RT (ms)
3], #Mean Accuracy
agg.id[n,round(params[1],2), #drift rate
round(params[2],2), #threshold
round(params[3]*1000), #non-decision time (ms)
#negative log likelihood
deviance)
} results
4.4 Exercise
Task: Separately perform the parameter estimation for each subject and each instruction condition (i.e., speed and accuracy). How to do it? What do you observe? Note: The results might be different from the original paper (Ratcliff and Rouder, 1998). The main reason is they have different assumption about drift rate (i.e., drift rates are intensity-dependent).
4.5 Answers
# set up starting parameter values
<- c(runif(n=1, min=-5, max=5),
starting_values runif(n=1, min=0.5, max=5),
0.1)
names(starting_values) <- c("v","a","t0")
starting_values
# set up lower and upper boundary for each parameter
<-c(-5,0.5,0.1) # Lower Boundary for each parameter. Order is important. Here, drift rate, threshold and non-decision time
lb <-c(5,5,0.5) # Upper Boundary for each parameter. Order is important. Here, drift rate, threshold and non-decision time
ub <-0
n<-setNames(data.frame(matrix(ncol = 8, nrow = 0)), c("name", "Instruction", "Mean RT (ms)","Mean Accuracy","v", "a","t0 (ms)", "nll"))
results
for (pp in unique(rr98$id)) {
for (ins in unique(rr98$instruction)) {
print(pp)
<- n+1
n = rr98[rr98$id == pp & rr98$instruction == ins,]
data_pp
<- nlminb(start=starting_values,
fit objective=diffusion_deviance2,
lower=lb,# the lower boundary for each parameter. Order is important
upper=ub, # the upper boundary for each parameter. Order is important
dat=data_pp)
<- fit[["objective"]]
deviance <- fit[["par"]]
params <- c(pp,
results[n,]
ins,round(agg.id_instruct[agg.id_instruct$SubjID == pp & agg.id_instruct$Instruction == ins,3]*1000),
$SubjID == pp & agg.id_instruct$Instruction == ins,4],
agg.id_instruct[agg.id_instructround(params[1],2),
round(params[2],2),
round(params[3]*1000),
deviance)
}
}
results