Chapter 10 EMC2 Functions
In the sections below, we explain the core EMC2 functions and how they are used to design a model.
10.1 Design
10.1.1 make_design
To design a model in EMC2, we use the make_design
function:
make_design(Flist,Ffactors,Rlevels,model,Clist=NULL,matchfun=NULL,constants=NULL,Fcovariates=NULL,Ffunctions=NULL,report_p_vector=TRUE)
The make_design
function takes in several key arguments that are needed to create a design and apply this to the model.
- Flist; Formula list. Linear models for each parameter type. For example, if we wanted to show an effect of SAT on parameter
b
, we would specify;b ~ SAT
. - Ffactors; Formula factors. This includes the subjects factor and all factors used in design formulas. For example we would need to include subjects and SAT for our SAT example;
subjects = subjects, SAT = 1:2
. HereSAT = 1:2
because there are two levels for this factor. - Rlevels; Response factor levels. This specifies the levels for the responses. Here it is easiest to use numeric factor levels such as;
Rlevels = 1:2
when there are two responses (let’s sayleft = 1
andright = 2
). - model; specify the type of EMC2 model you wish to use. These include
probit
,SDTgaussian
,rdm
,normal
,lnr
,lba
andddm
(however see Models in EMC2 to see all model types and parameterisations). - Clist; Contrast list. If not specified for a factor default contr.treatment.
Clist <- list(SAT=contr.helmert)
. You can also specify your own custom contrasts (more later).
- matchfun; A function for match scoring. Match scoring specifies whether a response was correct or not. It matches response with stimulus. For example, if stimulus was a left stimulus, a left response would be correct. Therefore we specify a matching function such as
matchfun=function(d)d$stim==d$lR
wherelR
is the latent response created in EMC2. - constants; In many models, we need to set constants for a model scale. For example, in SDT, we need to fix a parameter for one d’ such that the other d’s are relative to this (and they don’t move freely along the scale). Here we would specify
mean=0
. - Fcovariates; Factor Covariates. Covariate measures which may be included in the model and/or mapped to factors.
- Ffunctions; Factor Functions. Functions to specify specific parameterisations, or functions of parameters.
- report_p_vector; if true, the vector of parameters to be estimated will be returned. this is the same as doing
sampled_p_vector
.
10.1.2 design_model
Now that we have a design, we want to be able to use this for data generation and (more importantly) for model sampling. To do this, we need to link the data with the design. The design_model
function does this. Using this function will output a dadm
object.
design_model(data,design,model=NULL,prior = NULL,add_acc=TRUE,rt_resolution=0.001,verbose=TRUE,compress=TRUE,rt_check=TRUE)
Arguments
- data
; specify the data to be used (this can be generated data using the model as we show later).
- design
; specify the design that you made using make_design
.
- model=NULL
; if NULL, this will use the model from make_design
.
- prior = NULL
; If another prior is used, specify here. We show more on this topic later.
- rt_resolution=0.001
; when using models with rt, this function specifies the response time resolution (in seconds). Here, default is 0.001, which means that response times will be rounded at the fourth decimal place. This is useful for compression as response times generally have some measurement error, and by rounding at a specific resolution, we can make the likelihood estimation more efficient without sacrificing accuracy (up to a certain point, ~ 0.02)
- verbose=TRUE
; Specifies whether or not text output is given.
- compress=TRUE
; Specifies whether or not to compress the likelihood calculation. If true, any cells that have the same values (i.e., conditions, responses and rt’s) will be assigned the same likelihood to avoid calculating it many times. In RT models, the resolution impacts on compression as lower resolution will lead to higher compression (more similar rt’s)
- rt_check=TRUE
; checks for censoring and truncation. These are explained later.
10.1.3 Other design functions
map_p(p,dadm)
; Applying this function to a dadm and returns matrix of mapped parameters. This can help you to see how parameter values map to corresponding cells of the design.sampled_p_vector(design,model=NULL)
; Makes an empty parameter vector using the design that corresponds to the model. We find this function incredibly useful when initially designing our models. The first thing we do after doingdesign_model
is callsampled_p_vector
to make sure the parameters to be estimated match our expectations.get_design(samples)
; If a samples object is loaded in,get_design
will print out the design from the samples object.
10.2 Mapping
Similar to design functions, mapping functions are used when setting up the model to map parameters to factors of the design. These are mainly used when looking to see if the design functions worked, or when investigating sampled parameters.
get_pars(p_vector,dadm)
; Transform parameter vector (according to model transformations used), map this to the design, and transform mapped parameters.add_constants(p,constants)
; Augments parameter matrix or parameter vector with constant parameters (also used in data).map_mcmc(mcmc,design,model)
; Maps vector or matrix (usually mcmc object) of sampled parameters to native model parameterization.mapped_par(p_vector,design,model=NULL,digits=3,remove_subjects=TRUE)
; Shows augmented data and corresponding mapped parameters.mapped_name_list <- function(design,model,save_design=FALSE)
; makes a list, with entries for each parameter type, of names for mapped parameters or with unique design columns.mapped_designs <- function(samples,remove_subjects=TRUE)
; Shows augmented data and corresponding mapped parameterget_map <- function(samples,add_design=TRUE)
10.3 Simulation and Recovery
Simulation and recovery are important parts of the modelling process. Simulation refers to generating data out of a model, given a set of ‘generating’ parameters. Recovery refers to investigating whether the sampled values of a fitted model are close to the true parameter values that generated the data. If this is not the case, we may have a problem in model specification, or the model may be intractable/unidentifiable. It may also mean that there is not enough information present to inform reliable parameter estimates. We can think of simulation and recovery as a powerful exercise that acts as the modeler’s power test. EMC2 has several functions for assessing model recovery.
make_data(p_vector,design,model=NULL,trials=NULL,data=NULL,expand=1,mapped_p=FALSE,LT=NULL,UT=NULL,LC=NULL,UC=NULL)
p_vector
; parameter vector with specified values for each parameter of the model. This functions as the group level parameter vector.design
; A design as specified above.trials=NULL
; Number of trials to simulate per condition.data=NULL
; If data is included, the generation will follow the same format as the data.expand=1
; Multiplies the simulation. If trials are specified, expand will make the data x times larger for each subject. If data is specified, expand will generate x times the amount of data (in the same structure).mapped_p=FALSE
; For mapped parameters.LT=NULL,UT=NULL,LC=NULL,UC=NULL
; Parameters for truncation and censoring if true.plot_alpha_recovery(tabs,layout=c(2,3),do_ci=TRUE,ci_col="grey",cap=.05,do_rmse=TRUE,rmse_pos="topleft",rmse_digits=3,do_coverage=TRUE,coverage_pos="bottomright",coverage_digits=1)
; function used to plot the recovery of subject-wise parameter recovery.profile_pmwg(pname,p,p_min,p_max,dadm,n_point=100,main="")
; used to make profile plots. Profile plots are useful for checking the model is functioning as expected (within a range), where the highest likelihood should be located at the generating value and values either side show should a decrease (inverted U shape).
10.4 Data Plots
Before sampling, one may also wish to interrogate the data, to ensure there are no unexpected effects present. For each model type, we have designed functions to easily evaluate data.
plot_defective_density(data,subject=NULL,factors=NULL,layout=NULL,mfcol=TRUE,xlim=NULL,bw="nrd0",adjust=1,correct_fun=NULL,rt="top",accuracy="topright")
; plots the defective density of response times across conditions. This means that response time distributions across response types are plotted on the same figure which is then extrapolated out over all factors.plot_roc(data,zROC=FALSE,qfun=NULL,main="",lim=NULL)
; Plots the receiver operator curve (ROC; false alarms:hits) and zROC (z-scored ROC) over conditions.
10.5 Sampling
Now that we have built our design and dadm object, we need to create a sampling object. We do this using the make_samplers
function:
make_samplers <- function(data_list,design_list,model_list=NULL,type=c("standard","diagonal","blocked","factor","factorRegression","single")[1],n_chains=3,rt_resolution=0.02,prior_list = NULL,par_groups=NULL,n_factors=NULL,constraintMat = NULL,covariates=NULL)
data_list
; The data to be used in the model.design_list
; The design we made previously.model_list=NULL
; The model, if not specified in the design.type=c("standard","diagonal","blocked","factor","factorRegression","single")
; The hyper level (group) model.standard
uses a multivariate normal group level distribution;diagonal
uses independent normal distributions;blocked
specifies the standard approach with a blocked covariance structure;factor
uses a factor decomposition of the standard models’ covariance matrix;factorRegression
includes a factor regressor from covariates on the factor model;single
uses a single subject sampling scheme.n_chains=3
; The number of chains to run (3 is generally good).rt_resolution=0.02
; when using models with rt, this function specifies the response time resolution (in seconds). Here, default is 0.001, which means that response times will be rounded at the fourth decimal place. This is useful for compression as response times generally have some measurement error, and by rounding at a specific resolution, we can make the likelihood estimation more efficient without sacrificing accuracy (up to a certain point, ~ 0.02)prior_list = NULL
; Used to specify custom priors if they are required. These are generally the mean and mean variance at the group level.n_factors=NULL
; number of factors to use in the factor model.constraintMat = NULL
; used to specify the constraint when using a factor model.covariates=NULL
; used to specify which covariates are to be used.
Having made the design and checked the model and data, the next step is sampling (fitting/estimating the model). For this, there are several main functions available. Here, we only show the key functions for sampling, which are broken down into three main stages.
10.5.1 run_burn
Used for the burn-in stage, run_burn takes the pmwg::run_stage
for burn-in and adds automatic functions for convergence and efficient sampling, leading to an “automatic burn-in” stage. The burn-in stage takes particles from a mix of two proposal distributions - one centered on the individuals current particle (with group variance scaled by epsilon), and one centered on the group (with group variance).
run_burn(samplers,ndiscard=100,nstart=300,particles=NA, particle_factor = 50, start_mu = NULL, start_var = NULL,mix = NULL,verbose=TRUE,verbose_run_stage=FALSE,max_gd_trys=100,max_gd=1.2,thorough=TRUE,mapped=FALSE, step_size = NA,min_es=NULL,min_iter=NULL,max_iter=NULL,epsilon = NULL, epsilon_upper_bound=15, p_accept=0.7,cores_per_chain=1,cores_for_chains=NULL)
samplers
; an object created fromdesign_model
that includes the model, data and a design.n_discard=100
; the number of initial samples to discard.n_start=300
; the number of burn-in samples (after discarding). Here, with 100 discarded and 300 start, we would do 400 iterations and discard the first 300.particles=NA
; Number of particles per iteration. If NA,particle_factor
will be used.particle_factor=100
; The number of particles used per iteration, whereparticles = sqrt(n_pars)*particle_factor
.start_mu = NULL
;start_var = NULL
;mix = NULL
;verbose=TRUE
;verbose_run_stage=FALSE
max_gd_trys=100
; Gelman-Diag is used to check convergence of the chains in burn-in. This is the “auto” part to ensure all the chains have converged around the posterior. This argument specifies the number of times to try and achieve a specified Gelman-Diag criterion.max_gd=1.2
; The maximum Gelman-Diag criteria value for any of the measures used.thorough=TRUE
; if TRUE, Gelman-Diag computed across many metrics. If FALSE, Gelman-Diag only computed across the multivariate variance.mapped=FALSE
;step_size = NA
;min_es=NULL
min_iter=NULL
; Minimum number of iterations to do in burn-in. Here, if Gelman-Diag criteria is reached before min iterations, there will be extra iterations.max_iter=NULL
; Maximum number of iterations for burn-in.epsilon = NULL
;epsilon_upper_bound=15
;p_accept=0.7
; The proportional acceptance rate desired. Forp_accept = 0.7
the sampler will aim to obtain a new particle on 70% of iterations by adaptively changing the epsilon value for each subject, such that subjects with a high proportion of new particles will have a larger epsilon, consequently increasing the variance of the proposal distribution and leading to lower chances of a new particle (and vice-versa for a subject with a proportion of new particles).cores_per_chain=1
; The number of cores per chain. By default, 3 chains are run.cores_for_chains=NULL
; The number of chains to run across cores.
10.5.2 run_adapt
Used for the adaptation stage where an efficient proposal distribution is attempted to be created. This uses the same proposal distributions as in burn-in. run_adapt takes the pmwg::run_stage
for adaptation and adds automatic functions for convergence and efficient sampling, including specifying the minimum number of unique values and effective samples, as well as giving the user more control over the number of attempts, and frequency, at creating the efficient proposal distribution.
run_adapt <- function(samplers, max_trys=25, epsilon=NULL, particles=NA,particle_factor=40, p_accept=.7, cores_per_chain=1, cores_for_chains=NA, mix=NULL, n_cores_conditional=1, min_es=NULL, min_unique=200, step_size=25, thin=NULL,verbose=FALSE, verbose_run_stage = F)
samplers
; an object created fromdesign_model
that includes the model, data and a design.max_trys=25
; the maximum number of attempts made to construct the conditional efficient proposal distribution used in sampling.epsilon=NULL
;particles=NA
; Number of particles per iteration. If NA,particle_factor
will be used.particle_factor=40
; The number of particles used per iteration, whereparticles = sqrt(n_pars)*particle_factor
.p_accept=0.7
; The proportional acceptance rate desired. Forp_accept = 0.7
the sampler will aim to obtain a new particle on 70% of iterations by adaptively changing the epsilon value for each subject, such that subjects with a high proportion of new particles will have a larger epsilon, consequently increasing the variance of the proposal distribution and leading to lower chances of a new particle (and vice-versa for a subject with a proportion of new particles).cores_per_chain=1
; The number of cores per chain. By default, 3 chains are run.cores_for_chains=NULL
; The number of chains to run across cores.mix=NULL
;n_cores_conditional
;min_es=NULL
; The minimum number of effective samples for the adapt stage.min_unique=200
; The minimum number of unique particles per subject.step_size=25
; How often the conditional efficient distribution is attempted to be created.thin=NULL
; Thins the sampled object. Ifthin=5
, every 5th iteration will be kept.verbose=FALSE
;verbose_run_stage=FALSE
;
10.5.3 run_sample
Used for the sampling stage where the particles are drawn from (mainly) the efficient proposal distribution created in adaptation, as well as the other proposal distributions used in burn-in and adaptation. run_sample takes the pmwg::run_stage
for sampling and adds automatic functions for efficient sampling, including specifying the minimum number of unique values, effective sampling and any thinning to be done. run_sample is used to sample efficiently from the posterior mode to collect iter
number of samples.
run_sample <- function(samplers, iter=NA, verbose=FALSE, epsilon=NULL, particles=NA, particle_factor=25, p_accept=.7, cores_per_chain=1, cores_for_chains=NA, mix=NULL, n_cores_conditional=1, step_size=50, thin=NULL, verbose_run_stage=FALSE)
samplers
; an object created fromdesign_model
that includes the model, data and a design.iter=NA
;verbose=FALSE
;epsilon=NULL
;particles=NA
; Number of particles per iteration. If NA,particle_factor
will be used.particle_factor=25
; The number of particles used per iteration, whereparticles = sqrt(n_pars)*particle_factor
.p_accept=0.7
; The proportional acceptance rate desired. Forp_accept = 0.7
the sampler will aim to obtain a new particle on 70% of iterations by adaptively changing the epsilon value for each subject, such that subjects with a high proportion of new particles will have a larger epsilon, consequently increasing the variance of the proposal distribution and leading to lower chances of a new particle (and vice-versa for a subject with a proportion of new particles).cores_per_chain=1
; The number of cores per chain. By default, 3 chains are run.cores_for_chains=NULL
; The number of chains to run across cores.mix=NULL
;n_cores_conditional=1
;step_size=50
; How often the conditional efficient distribution is attempted to be created.thin=NULL
; Thins the sampled object. Ifthin=5
, every 5th iteration will be kept.verbose_run_stage=FALSE
;
10.5.4 run_emc
run_emc <- function(file_name,nsample=1000, ...)
We can use run_emc
when we are confident that our model will run. This does all steps (burn, adapt, sample) with default settings. This acts as a combined auto fitting function, getting and saving samples to disk. Note; samples must be first object in file_name.
10.5.5 run_chains
run_chains(samplers,iter=c(300,0,0),verbose=TRUE,verbose_run_stage=FALSE,mix=NULL,max_adapt_trys=10,particles=NA,particle_factor=100, p_accept= 0.7,cores_per_chain=1,cores_for_chains=NULL,min_unique=200,epsilon_upper_bound=2,epsilon = NULL, start_mu = NULL, start_var = NULL,pdist_update_n=50)
; used similar to pmwg::run_stage
however, runs a specified number of chains. This is used by the auto functions above.
run_gd(samplers,iter=NA,max_trys=100,verbose=FALSE,burn=FALSE,max_gd=1.1,thorough=TRUE, natural=FALSE,epsilon = NULL,pdist_update_n=50, particles=NA,particle_factor=100, p_accept=NULL,cores_per_chain=1,cores_for_chains=NA,mix=NULL,min_unique=200,epsilon_upper_bound=2,min_es=NULL,min_iter=NULL,max_iter=NULL)
; used to check the Gelman-Diag statistic. This is used by the run_burn
function.
10.6 Checks & Chain Statistics
Following sampling (and generally following burn-in), it is imperative to check the sampled chains. PMwG uses fewer chains than STAN or DE-MCMC - and are theoretically not required - however, we implement these to avoid instances of the sampler getting stuck in local maxima (i.e., regions of the parameter space that are locally ‘good’ but are not the globally optimal solution).
check_run(samples,pdf_name="check_run.pdf",interactive=TRUE,filter="sample",subfilter=0,layout=c(3,4),width=NULL,height=NULL)
; This function runs through a number of posterior checks on a sampled object and outputs these to a pdf. These checks relate to checking sampling behaviour and parameter chains.
chain_n(samplers)
; returns the length of stages for each chain. This is useful when checking whether chains ran for too long or not long enough.
check_adapt(samplers,verbose=TRUE)
; checks whether chains were adapted (in the adapt stage).
es_pmwg(pmwg_mcmc,selection="alpha",summary_alpha=mean,print_summary=TRUE,sort_print=TRUE,filter="burn",thin=1,subfilter=NULL)
; used to calculate the number of effective samples from the sampling.
gd_pmwg(pmwg_mcmc,return_summary=FALSE,print_summary=TRUE,digits_print=2,sort_print=TRUE,autoburnin=FALSE,transform=TRUE,selection="alpha",filter="sample",thin=1,subfilter=NULL,natural=FALSE,mapped=FALSE)
; returns R hat statistic. Prints multivariate summary and returns each participant unless multivariate as matrix unless !return_summary. Used to calculate the Gelman-Diag statistic across chains. Here we aim for values less than 1.1, however, in run_gd
we can aim for more strict values.
iat_pmwg(pmwg_mcmc,print_summary=TRUE,digits_print=2,sort_print=TRUE,summary_alpha=mean,selection="alpha",filter="sample",thin=1,subfilter=NULL)
; Integrated autocorrelation time. Prints multivariate summary and returns each participant unless multivariate as matrix unless !return_summary.
10.7 Chain Plotting
plot_chains(pmwg_mcmc,layout=NA,subject=NA,ylim=NULL,selection="alpha",filter="burn",thin=1,subfilter=NULL,plot_acf=FALSE,acf_chain=1, verbose=TRUE)
; plots the parameter estimates across chains per subject. Can also plot log-likelihood values. Useful for checking convergence.
plot_density(pmwg_mcmc,layout=c(2,3),selection="alpha",filter="burn",thin=1,subfilter=NULL,mapped=FALSE,plot_prior=TRUE,n_prior=1e3,xlim=NULL,ylim=NULL,show_chains=FALSE,do_plot=TRUE,subject=NA,pars=NULL,probs=c(.025,.5,.975),bw = "nrd0", adjust = 1)
; Plots the density for parameter estimates across chains. If show_chains superimposes destinies for each chain on same plot invisibly returns tables of true and 95% CIs (for all chains combined no matter what show_chains is).
10.8 Sample Stats
These functions are intended to assist with model inference, where we can check model comparison or make inference from estimated parameters. NOTE: the information criteria methods are not recommended, but can be used as a “quick check”. Instead, we recommend using IS^2, which returns marginal likelihood estimates for models given data.
p_test(x,y=NULL,p_name,mapped=FALSE,c_vector=NULL,x_name=NULL,y_name=NULL,mu=0,alternative = c("less", "greater")[1],probs = c(0.025,.5,.975),digits=2,p_digits=3,print_table=TRUE,x_filter="sample",x_selection="alpha",x_subfilter=0,y_filter="sample",y_selection="alpha",y_subfilter=0)
; used to test for differences/effects. p_test
is similar to the brms::t_test
function, where the user can specify a single parameter to test against 0 (i.e., presence of an effect), or can compare two parameters (difference; for example when cell coding is used).
10.9 Model Comparison
We use these functions below for typical (and fast) model comparison measures (i.e., information criterion methods), as well as more robust comparison measures (i.e., marginal likelihood estimates).
10.9.1 Information Criterions
pmwg_IC(samplers,filter="burn",subfilter=0,use_best_fit=FALSE,print_summary=FALSE,digits=0,subject=NULL)
; Information criterion values for deviance information criterion (DIC), Bayesian Predictive Information Criterion (BPIC), effective parameters, mean deviance, and deviance of mean.
compare_ICs(sList,filter="burn",subfilter=0,use_best_fit=FALSE,print_summary=FALSE,digits=0,digits_p=3,subject=NULL)
; Used to compare information criteria across models.
compare_IC <- function(sList,filter="sample",subfilter=0,use_best_fit=TRUE, print_summary=TRUE,digits=0,digits_p=3,subject=NULL)
10.9.2 Marginal Likelihood Estimation
Whilst IC methods have their advantages, we advise to use more robust model comparison metrics that account for the prior as well as the fit (rather than ad-hoc penalty for complexity). For this, we have integrated importance sampling squared (IS^2) functionality into EMC2. We include a chapter on Model Comparison with IS2 later on. The functions for IS^2 are;
run_IS2 <- function(samples, filter = "sample", subfilter = 0, IS_samples = 1000, stepsize_particles = 500, max_particles = 5000, n_cores = 1, df = 5)
std_error_IS2 <- function(IS_samples, n_bootstrap = 50000)
compare_MLL <- function(mll,nboot=100000,digits=2,print_summary=TRUE)
10.10 Posterior Prediction
Following model sampling, it is important to check the predictive qualities of the estimated model. In Bayesian model sampling, the descriptive adequacy of the model can be checked using posterior prediction, where estimated posterior values are taken and “plugged in” to the model to return synthetic data. We can then compare the synthetic data to the observed data to check the quality of fit. This can enable us to evaluate whether any inference is compromised by descriptive inadequacy.
post_predict(samples,hyper=FALSE,n_post=100,expand=1,filter="sample",subfilter=1,thin=1,n_cores=1,use_par=c("random","mean","median")[1])
; generates posterior predictive data given a sampled object.
- hyper=FALSE
; generate data at the group level only.
- n_post=100
; number of posterior predictive samples.
- expand=1
; if the data is expanded (i.e. > 1), the number of trials generated will be multiplied by this value for each posterior sample.
- filter="sample"
; focus on the sampling stage for posterior generation
- thin=1
; how many samples to thin the sampled object by
- use_par=c("random","mean","median")
; which method to use to sample posterior predictive values.
10.11 Fit Plots
Following posterior data prediction, we should plot the observed data against the predicted data. Here, we show common methods of plotting descriptive adequacy, which are model dependent.
plot_fit(data,pp,subject=NULL,factors=NULL,stat=NULL,stat_name="",ci=c(.025,.5,.975),do_plot=TRUE,xlim=NULL,ylim=NULL,layout=NULL,mfcol=TRUE,probs=c(1:99)/100,data_lwd=2,fit_lwd=1,qp_cex=1,q_points=c(.1,.3,.5,.7,.9),pqp_cex=.5,lpos="topleft",matchfun=NULL,signalFactor="S",zROC=FALSE,qfun=NULL,lim=NULL,rocfit_cex=.5)
; plots the observed data against predicted data, with various arguments relevant for different models.
10.12 Other
get_sigma(samps,filter="samples",thin=1,subfilter=0)
; Returns the variance-covariance matrix (for iterations). We often use cov2cor
to turn this to a correlation matrix, and corrplot
for plotting correlations
merge_samples(samples)
thin_chains(samplers,thin=5)
as_Mcmc(sampler,filter=stages,thin=1,subfilter=0,selection=c("alpha","mu","variance","covariance","correlation","LL","epsilon")[1])
parameters_data_frame(samples,filter="sample",thin=1,subfilter=0,mapped=FALSE,include_constants=FALSE, selection=c("alpha","mu","variance","covariance","correlation","LL","epsilon")[2])
; extracts and stacks chains into a matrix.
subject_names(samples)