Chapter 10 EMC2 Functions

In the sections below, we explain the core EMC2 functions and how they are used to design a model.

10.1 Design

10.1.1 make_design

To design a model in EMC2, we use the make_design function:

make_design(Flist,Ffactors,Rlevels,model,Clist=NULL,matchfun=NULL,constants=NULL,Fcovariates=NULL,Ffunctions=NULL,report_p_vector=TRUE)

The make_design function takes in several key arguments that are needed to create a design and apply this to the model.

  • Flist; Formula list. Linear models for each parameter type. For example, if we wanted to show an effect of SAT on parameter b, we would specify; b ~ SAT.
  • Ffactors; Formula factors. This includes the subjects factor and all factors used in design formulas. For example we would need to include subjects and SAT for our SAT example; subjects = subjects, SAT = 1:2. Here SAT = 1:2 because there are two levels for this factor.
  • Rlevels; Response factor levels. This specifies the levels for the responses. Here it is easiest to use numeric factor levels such as; Rlevels = 1:2 when there are two responses (let’s say left = 1 and right = 2).
  • model; specify the type of EMC2 model you wish to use. These include probit, SDTgaussian, rdm, normal, lnr, lba and ddm (however see Models in EMC2 to see all model types and parameterisations).
  • Clist; Contrast list. If not specified for a factor default contr.treatment. Clist <- list(SAT=contr.helmert). You can also specify your own custom contrasts (more later).
  • matchfun; A function for match scoring. Match scoring specifies whether a response was correct or not. It matches response with stimulus. For example, if stimulus was a left stimulus, a left response would be correct. Therefore we specify a matching function such as matchfun=function(d)d$stim==d$lR where lR is the latent response created in EMC2.
  • constants; In many models, we need to set constants for a model scale. For example, in SDT, we need to fix a parameter for one d’ such that the other d’s are relative to this (and they don’t move freely along the scale). Here we would specify mean=0.
  • Fcovariates; Factor Covariates. Covariate measures which may be included in the model and/or mapped to factors.
  • Ffunctions; Factor Functions. Functions to specify specific parameterisations, or functions of parameters.
  • report_p_vector; if true, the vector of parameters to be estimated will be returned. this is the same as doing sampled_p_vector.

10.1.2 design_model

Now that we have a design, we want to be able to use this for data generation and (more importantly) for model sampling. To do this, we need to link the data with the design. The design_model function does this. Using this function will output a dadm object.

design_model(data,design,model=NULL,prior = NULL,add_acc=TRUE,rt_resolution=0.001,verbose=TRUE,compress=TRUE,rt_check=TRUE)

Arguments - data; specify the data to be used (this can be generated data using the model as we show later). - design; specify the design that you made using make_design. - model=NULL; if NULL, this will use the model from make_design. - prior = NULL; If another prior is used, specify here. We show more on this topic later. - rt_resolution=0.001; when using models with rt, this function specifies the response time resolution (in seconds). Here, default is 0.001, which means that response times will be rounded at the fourth decimal place. This is useful for compression as response times generally have some measurement error, and by rounding at a specific resolution, we can make the likelihood estimation more efficient without sacrificing accuracy (up to a certain point, ~ 0.02) - verbose=TRUE; Specifies whether or not text output is given. - compress=TRUE; Specifies whether or not to compress the likelihood calculation. If true, any cells that have the same values (i.e., conditions, responses and rt’s) will be assigned the same likelihood to avoid calculating it many times. In RT models, the resolution impacts on compression as lower resolution will lead to higher compression (more similar rt’s) - rt_check=TRUE; checks for censoring and truncation. These are explained later.

10.1.3 Other design functions

  • map_p(p,dadm); Applying this function to a dadm and returns matrix of mapped parameters. This can help you to see how parameter values map to corresponding cells of the design.

  • sampled_p_vector(design,model=NULL); Makes an empty parameter vector using the design that corresponds to the model. We find this function incredibly useful when initially designing our models. The first thing we do after doing design_model is call sampled_p_vector to make sure the parameters to be estimated match our expectations.

  • get_design(samples); If a samples object is loaded in, get_design will print out the design from the samples object.

10.2 Mapping

Similar to design functions, mapping functions are used when setting up the model to map parameters to factors of the design. These are mainly used when looking to see if the design functions worked, or when investigating sampled parameters.

  • get_pars(p_vector,dadm); Transform parameter vector (according to model transformations used), map this to the design, and transform mapped parameters.

  • add_constants(p,constants); Augments parameter matrix or parameter vector with constant parameters (also used in data).

  • map_mcmc(mcmc,design,model); Maps vector or matrix (usually mcmc object) of sampled parameters to native model parameterization.

  • mapped_par(p_vector,design,model=NULL,digits=3,remove_subjects=TRUE); Shows augmented data and corresponding mapped parameters.

  • mapped_name_list <- function(design,model,save_design=FALSE); makes a list, with entries for each parameter type, of names for mapped parameters or with unique design columns.

  • mapped_designs <- function(samples,remove_subjects=TRUE); Shows augmented data and corresponding mapped parameter

  • get_map <- function(samples,add_design=TRUE)

10.3 Simulation and Recovery

Simulation and recovery are important parts of the modelling process. Simulation refers to generating data out of a model, given a set of ‘generating’ parameters. Recovery refers to investigating whether the sampled values of a fitted model are close to the true parameter values that generated the data. If this is not the case, we may have a problem in model specification, or the model may be intractable/unidentifiable. It may also mean that there is not enough information present to inform reliable parameter estimates. We can think of simulation and recovery as a powerful exercise that acts as the modeler’s power test. EMC2 has several functions for assessing model recovery.

  • make_data(p_vector,design,model=NULL,trials=NULL,data=NULL,expand=1,mapped_p=FALSE,LT=NULL,UT=NULL,LC=NULL,UC=NULL)

  • p_vector; parameter vector with specified values for each parameter of the model. This functions as the group level parameter vector.

  • design; A design as specified above.

  • trials=NULL; Number of trials to simulate per condition.

  • data=NULL; If data is included, the generation will follow the same format as the data.

  • expand=1; Multiplies the simulation. If trials are specified, expand will make the data x times larger for each subject. If data is specified, expand will generate x times the amount of data (in the same structure).

  • mapped_p=FALSE; For mapped parameters.

  • LT=NULL,UT=NULL,LC=NULL,UC=NULL; Parameters for truncation and censoring if true.

  • plot_alpha_recovery(tabs,layout=c(2,3),do_ci=TRUE,ci_col="grey",cap=.05,do_rmse=TRUE,rmse_pos="topleft",rmse_digits=3,do_coverage=TRUE,coverage_pos="bottomright",coverage_digits=1); function used to plot the recovery of subject-wise parameter recovery.

  • profile_pmwg(pname,p,p_min,p_max,dadm,n_point=100,main=""); used to make profile plots. Profile plots are useful for checking the model is functioning as expected (within a range), where the highest likelihood should be located at the generating value and values either side show should a decrease (inverted U shape).

10.4 Data Plots

Before sampling, one may also wish to interrogate the data, to ensure there are no unexpected effects present. For each model type, we have designed functions to easily evaluate data.

  • plot_defective_density(data,subject=NULL,factors=NULL,layout=NULL,mfcol=TRUE,xlim=NULL,bw="nrd0",adjust=1,correct_fun=NULL,rt="top",accuracy="topright"); plots the defective density of response times across conditions. This means that response time distributions across response types are plotted on the same figure which is then extrapolated out over all factors.

  • plot_roc(data,zROC=FALSE,qfun=NULL,main="",lim=NULL); Plots the receiver operator curve (ROC; false alarms:hits) and zROC (z-scored ROC) over conditions.

10.5 Sampling

Now that we have built our design and dadm object, we need to create a sampling object. We do this using the make_samplers function:

make_samplers <- function(data_list,design_list,model_list=NULL,type=c("standard","diagonal","blocked","factor","factorRegression","single")[1],n_chains=3,rt_resolution=0.02,prior_list = NULL,par_groups=NULL,n_factors=NULL,constraintMat = NULL,covariates=NULL)

  • data_list; The data to be used in the model.
  • design_list; The design we made previously.
  • model_list=NULL; The model, if not specified in the design.
  • type=c("standard","diagonal","blocked","factor","factorRegression","single"); The hyper level (group) model. standard uses a multivariate normal group level distribution; diagonal uses independent normal distributions; blocked specifies the standard approach with a blocked covariance structure; factor uses a factor decomposition of the standard models’ covariance matrix; factorRegression includes a factor regressor from covariates on the factor model; single uses a single subject sampling scheme.
  • n_chains=3; The number of chains to run (3 is generally good).
  • rt_resolution=0.02; when using models with rt, this function specifies the response time resolution (in seconds). Here, default is 0.001, which means that response times will be rounded at the fourth decimal place. This is useful for compression as response times generally have some measurement error, and by rounding at a specific resolution, we can make the likelihood estimation more efficient without sacrificing accuracy (up to a certain point, ~ 0.02)
  • prior_list = NULL; Used to specify custom priors if they are required. These are generally the mean and mean variance at the group level.
  • n_factors=NULL; number of factors to use in the factor model.
  • constraintMat = NULL; used to specify the constraint when using a factor model.
  • covariates=NULL; used to specify which covariates are to be used.

Having made the design and checked the model and data, the next step is sampling (fitting/estimating the model). For this, there are several main functions available. Here, we only show the key functions for sampling, which are broken down into three main stages.

10.5.1 run_burn

Used for the burn-in stage, run_burn takes the pmwg::run_stage for burn-in and adds automatic functions for convergence and efficient sampling, leading to an “automatic burn-in” stage. The burn-in stage takes particles from a mix of two proposal distributions - one centered on the individuals current particle (with group variance scaled by epsilon), and one centered on the group (with group variance).

run_burn(samplers,ndiscard=100,nstart=300,particles=NA, particle_factor = 50, start_mu = NULL, start_var = NULL,mix = NULL,verbose=TRUE,verbose_run_stage=FALSE,max_gd_trys=100,max_gd=1.2,thorough=TRUE,mapped=FALSE, step_size = NA,min_es=NULL,min_iter=NULL,max_iter=NULL,epsilon = NULL, epsilon_upper_bound=15, p_accept=0.7,cores_per_chain=1,cores_for_chains=NULL)

  • samplers; an object created from design_model that includes the model, data and a design.
  • n_discard=100; the number of initial samples to discard.
  • n_start=300; the number of burn-in samples (after discarding). Here, with 100 discarded and 300 start, we would do 400 iterations and discard the first 300.
  • particles=NA; Number of particles per iteration. If NA, particle_factor will be used.
  • particle_factor=100; The number of particles used per iteration, where particles = sqrt(n_pars)*particle_factor.
  • start_mu = NULL;
  • start_var = NULL;
  • mix = NULL;
  • verbose=TRUE;
  • verbose_run_stage=FALSE
  • max_gd_trys=100; Gelman-Diag is used to check convergence of the chains in burn-in. This is the “auto” part to ensure all the chains have converged around the posterior. This argument specifies the number of times to try and achieve a specified Gelman-Diag criterion.
  • max_gd=1.2; The maximum Gelman-Diag criteria value for any of the measures used.
  • thorough=TRUE; if TRUE, Gelman-Diag computed across many metrics. If FALSE, Gelman-Diag only computed across the multivariate variance.
  • mapped=FALSE;
  • step_size = NA;
  • min_es=NULL
  • min_iter=NULL; Minimum number of iterations to do in burn-in. Here, if Gelman-Diag criteria is reached before min iterations, there will be extra iterations.
  • max_iter=NULL; Maximum number of iterations for burn-in.
  • epsilon = NULL;
  • epsilon_upper_bound=15;
  • p_accept=0.7; The proportional acceptance rate desired. For p_accept = 0.7 the sampler will aim to obtain a new particle on 70% of iterations by adaptively changing the epsilon value for each subject, such that subjects with a high proportion of new particles will have a larger epsilon, consequently increasing the variance of the proposal distribution and leading to lower chances of a new particle (and vice-versa for a subject with a proportion of new particles).
  • cores_per_chain=1; The number of cores per chain. By default, 3 chains are run.
  • cores_for_chains=NULL; The number of chains to run across cores.

10.5.2 run_adapt

Used for the adaptation stage where an efficient proposal distribution is attempted to be created. This uses the same proposal distributions as in burn-in. run_adapt takes the pmwg::run_stage for adaptation and adds automatic functions for convergence and efficient sampling, including specifying the minimum number of unique values and effective samples, as well as giving the user more control over the number of attempts, and frequency, at creating the efficient proposal distribution.

run_adapt <- function(samplers, max_trys=25, epsilon=NULL, particles=NA,particle_factor=40, p_accept=.7, cores_per_chain=1, cores_for_chains=NA, mix=NULL, n_cores_conditional=1, min_es=NULL, min_unique=200, step_size=25, thin=NULL,verbose=FALSE, verbose_run_stage = F)

  • samplers; an object created from design_model that includes the model, data and a design.
  • max_trys=25; the maximum number of attempts made to construct the conditional efficient proposal distribution used in sampling.
  • epsilon=NULL;
  • particles=NA; Number of particles per iteration. If NA, particle_factor will be used.
  • particle_factor=40; The number of particles used per iteration, where particles = sqrt(n_pars)*particle_factor.
  • p_accept=0.7; The proportional acceptance rate desired. For p_accept = 0.7 the sampler will aim to obtain a new particle on 70% of iterations by adaptively changing the epsilon value for each subject, such that subjects with a high proportion of new particles will have a larger epsilon, consequently increasing the variance of the proposal distribution and leading to lower chances of a new particle (and vice-versa for a subject with a proportion of new particles).
  • cores_per_chain=1; The number of cores per chain. By default, 3 chains are run.
  • cores_for_chains=NULL; The number of chains to run across cores.
  • mix=NULL;
  • n_cores_conditional;
  • min_es=NULL; The minimum number of effective samples for the adapt stage.
  • min_unique=200; The minimum number of unique particles per subject.
  • step_size=25; How often the conditional efficient distribution is attempted to be created.
  • thin=NULL; Thins the sampled object. If thin=5, every 5th iteration will be kept.
  • verbose=FALSE;
  • verbose_run_stage=FALSE;

10.5.3 run_sample

Used for the sampling stage where the particles are drawn from (mainly) the efficient proposal distribution created in adaptation, as well as the other proposal distributions used in burn-in and adaptation. run_sample takes the pmwg::run_stage for sampling and adds automatic functions for efficient sampling, including specifying the minimum number of unique values, effective sampling and any thinning to be done. run_sample is used to sample efficiently from the posterior mode to collect iter number of samples.

run_sample <- function(samplers, iter=NA, verbose=FALSE, epsilon=NULL, particles=NA, particle_factor=25, p_accept=.7, cores_per_chain=1, cores_for_chains=NA, mix=NULL, n_cores_conditional=1, step_size=50, thin=NULL, verbose_run_stage=FALSE)

  • samplers; an object created from design_model that includes the model, data and a design.
  • iter=NA;
  • verbose=FALSE;
  • epsilon=NULL;
  • particles=NA; Number of particles per iteration. If NA, particle_factor will be used.
  • particle_factor=25; The number of particles used per iteration, where particles = sqrt(n_pars)*particle_factor.
  • p_accept=0.7; The proportional acceptance rate desired. For p_accept = 0.7 the sampler will aim to obtain a new particle on 70% of iterations by adaptively changing the epsilon value for each subject, such that subjects with a high proportion of new particles will have a larger epsilon, consequently increasing the variance of the proposal distribution and leading to lower chances of a new particle (and vice-versa for a subject with a proportion of new particles).
  • cores_per_chain=1; The number of cores per chain. By default, 3 chains are run.
  • cores_for_chains=NULL; The number of chains to run across cores.
  • mix=NULL;
  • n_cores_conditional=1;
  • step_size=50; How often the conditional efficient distribution is attempted to be created.
  • thin=NULL; Thins the sampled object. If thin=5, every 5th iteration will be kept.
  • verbose_run_stage=FALSE;

10.5.4 run_emc

run_emc <- function(file_name,nsample=1000, ...)

We can use run_emc when we are confident that our model will run. This does all steps (burn, adapt, sample) with default settings. This acts as a combined auto fitting function, getting and saving samples to disk. Note; samples must be first object in file_name.

10.5.5 run_chains

run_chains(samplers,iter=c(300,0,0),verbose=TRUE,verbose_run_stage=FALSE,mix=NULL,max_adapt_trys=10,particles=NA,particle_factor=100, p_accept= 0.7,cores_per_chain=1,cores_for_chains=NULL,min_unique=200,epsilon_upper_bound=2,epsilon = NULL, start_mu = NULL, start_var = NULL,pdist_update_n=50); used similar to pmwg::run_stage however, runs a specified number of chains. This is used by the auto functions above.

run_gd(samplers,iter=NA,max_trys=100,verbose=FALSE,burn=FALSE,max_gd=1.1,thorough=TRUE, natural=FALSE,epsilon = NULL,pdist_update_n=50, particles=NA,particle_factor=100, p_accept=NULL,cores_per_chain=1,cores_for_chains=NA,mix=NULL,min_unique=200,epsilon_upper_bound=2,min_es=NULL,min_iter=NULL,max_iter=NULL); used to check the Gelman-Diag statistic. This is used by the run_burn function.

10.6 Checks & Chain Statistics

Following sampling (and generally following burn-in), it is imperative to check the sampled chains. PMwG uses fewer chains than STAN or DE-MCMC - and are theoretically not required - however, we implement these to avoid instances of the sampler getting stuck in local maxima (i.e., regions of the parameter space that are locally ‘good’ but are not the globally optimal solution).

check_run(samples,pdf_name="check_run.pdf",interactive=TRUE,filter="sample",subfilter=0,layout=c(3,4),width=NULL,height=NULL); This function runs through a number of posterior checks on a sampled object and outputs these to a pdf. These checks relate to checking sampling behaviour and parameter chains.

chain_n(samplers); returns the length of stages for each chain. This is useful when checking whether chains ran for too long or not long enough.

check_adapt(samplers,verbose=TRUE); checks whether chains were adapted (in the adapt stage).

es_pmwg(pmwg_mcmc,selection="alpha",summary_alpha=mean,print_summary=TRUE,sort_print=TRUE,filter="burn",thin=1,subfilter=NULL); used to calculate the number of effective samples from the sampling.

gd_pmwg(pmwg_mcmc,return_summary=FALSE,print_summary=TRUE,digits_print=2,sort_print=TRUE,autoburnin=FALSE,transform=TRUE,selection="alpha",filter="sample",thin=1,subfilter=NULL,natural=FALSE,mapped=FALSE); returns R hat statistic. Prints multivariate summary and returns each participant unless multivariate as matrix unless !return_summary. Used to calculate the Gelman-Diag statistic across chains. Here we aim for values less than 1.1, however, in run_gd we can aim for more strict values.

iat_pmwg(pmwg_mcmc,print_summary=TRUE,digits_print=2,sort_print=TRUE,summary_alpha=mean,selection="alpha",filter="sample",thin=1,subfilter=NULL); Integrated autocorrelation time. Prints multivariate summary and returns each participant unless multivariate as matrix unless !return_summary.

10.7 Chain Plotting

plot_chains(pmwg_mcmc,layout=NA,subject=NA,ylim=NULL,selection="alpha",filter="burn",thin=1,subfilter=NULL,plot_acf=FALSE,acf_chain=1, verbose=TRUE); plots the parameter estimates across chains per subject. Can also plot log-likelihood values. Useful for checking convergence.

plot_density(pmwg_mcmc,layout=c(2,3),selection="alpha",filter="burn",thin=1,subfilter=NULL,mapped=FALSE,plot_prior=TRUE,n_prior=1e3,xlim=NULL,ylim=NULL,show_chains=FALSE,do_plot=TRUE,subject=NA,pars=NULL,probs=c(.025,.5,.975),bw = "nrd0", adjust = 1); Plots the density for parameter estimates across chains. If show_chains superimposes destinies for each chain on same plot invisibly returns tables of true and 95% CIs (for all chains combined no matter what show_chains is).

10.8 Sample Stats

These functions are intended to assist with model inference, where we can check model comparison or make inference from estimated parameters. NOTE: the information criteria methods are not recommended, but can be used as a “quick check”. Instead, we recommend using IS^2, which returns marginal likelihood estimates for models given data.

p_test(x,y=NULL,p_name,mapped=FALSE,c_vector=NULL,x_name=NULL,y_name=NULL,mu=0,alternative = c("less", "greater")[1],probs = c(0.025,.5,.975),digits=2,p_digits=3,print_table=TRUE,x_filter="sample",x_selection="alpha",x_subfilter=0,y_filter="sample",y_selection="alpha",y_subfilter=0); used to test for differences/effects. p_test is similar to the brms::t_test function, where the user can specify a single parameter to test against 0 (i.e., presence of an effect), or can compare two parameters (difference; for example when cell coding is used).

10.9 Model Comparison

We use these functions below for typical (and fast) model comparison measures (i.e., information criterion methods), as well as more robust comparison measures (i.e., marginal likelihood estimates).

10.9.1 Information Criterions

pmwg_IC(samplers,filter="burn",subfilter=0,use_best_fit=FALSE,print_summary=FALSE,digits=0,subject=NULL); Information criterion values for deviance information criterion (DIC), Bayesian Predictive Information Criterion (BPIC), effective parameters, mean deviance, and deviance of mean.

compare_ICs(sList,filter="burn",subfilter=0,use_best_fit=FALSE,print_summary=FALSE,digits=0,digits_p=3,subject=NULL); Used to compare information criteria across models.

compare_IC <- function(sList,filter="sample",subfilter=0,use_best_fit=TRUE, print_summary=TRUE,digits=0,digits_p=3,subject=NULL)

10.9.2 Marginal Likelihood Estimation

Whilst IC methods have their advantages, we advise to use more robust model comparison metrics that account for the prior as well as the fit (rather than ad-hoc penalty for complexity). For this, we have integrated importance sampling squared (IS^2) functionality into EMC2. We include a chapter on Model Comparison with IS2 later on. The functions for IS^2 are;

run_IS2 <- function(samples, filter = "sample", subfilter = 0, IS_samples = 1000, stepsize_particles = 500, max_particles = 5000, n_cores = 1, df = 5)

std_error_IS2 <- function(IS_samples, n_bootstrap = 50000)

compare_MLL <- function(mll,nboot=100000,digits=2,print_summary=TRUE)

10.10 Posterior Prediction

Following model sampling, it is important to check the predictive qualities of the estimated model. In Bayesian model sampling, the descriptive adequacy of the model can be checked using posterior prediction, where estimated posterior values are taken and “plugged in” to the model to return synthetic data. We can then compare the synthetic data to the observed data to check the quality of fit. This can enable us to evaluate whether any inference is compromised by descriptive inadequacy.

post_predict(samples,hyper=FALSE,n_post=100,expand=1,filter="sample",subfilter=1,thin=1,n_cores=1,use_par=c("random","mean","median")[1]); generates posterior predictive data given a sampled object. - hyper=FALSE; generate data at the group level only. - n_post=100; number of posterior predictive samples. - expand=1; if the data is expanded (i.e. > 1), the number of trials generated will be multiplied by this value for each posterior sample. - filter="sample"; focus on the sampling stage for posterior generation - thin=1; how many samples to thin the sampled object by - use_par=c("random","mean","median"); which method to use to sample posterior predictive values.

10.11 Fit Plots

Following posterior data prediction, we should plot the observed data against the predicted data. Here, we show common methods of plotting descriptive adequacy, which are model dependent.

plot_fit(data,pp,subject=NULL,factors=NULL,stat=NULL,stat_name="",ci=c(.025,.5,.975),do_plot=TRUE,xlim=NULL,ylim=NULL,layout=NULL,mfcol=TRUE,probs=c(1:99)/100,data_lwd=2,fit_lwd=1,qp_cex=1,q_points=c(.1,.3,.5,.7,.9),pqp_cex=.5,lpos="topleft",matchfun=NULL,signalFactor="S",zROC=FALSE,qfun=NULL,lim=NULL,rocfit_cex=.5); plots the observed data against predicted data, with various arguments relevant for different models.

10.12 Other

get_sigma(samps,filter="samples",thin=1,subfilter=0); Returns the variance-covariance matrix (for iterations). We often use cov2cor to turn this to a correlation matrix, and corrplot for plotting correlations

merge_samples(samples)

thin_chains(samplers,thin=5)

as_Mcmc(sampler,filter=stages,thin=1,subfilter=0,selection=c("alpha","mu","variance","covariance","correlation","LL","epsilon")[1])

parameters_data_frame(samples,filter="sample",thin=1,subfilter=0,mapped=FALSE,include_constants=FALSE, selection=c("alpha","mu","variance","covariance","correlation","LL","epsilon")[2]); extracts and stacks chains into a matrix.

subject_names(samples)