General methods
The analytical method was broadly similar across the three studies. Below, we present the commonalities in the statistical analysis and in the power analysis. Several R packages from the ‘tidyverse’ (Wickham et al., 2019) were used.
Covariates
Several covariates—or nuisance variables—were included in each study to allow a rigorous analysis of the effects of interest (Sassenhagen & Alday, 2016). Unlike the effects of interest, these covariates were not critical to our research question (i.e., the interplay between language-based and vision-based information). They comprised participant-specific variables (e.g., attentional control), lexical variables (e.g., word frequency) and word concreteness. The covariates are distinguished from the effects of interest in the results table(s) in each study. The three kinds of covariates included were as follows.
Participant-specific covariates were measures akin to general cognition, and were included because some studies have found that the effect of vocabulary size was moderated by general cognition variables such as processing speed (Ratcliff et al., 2010; Yap et al., 2012). Similarly, research has demonstrated the role of attentional control (Hutchison et al., 2014; Yap et al., 2017), and authors have noted the desirability of including such covariates in models (James et al., 2018; Pexman & Yap, 2018). Therefore, we included in the analyses an individual measure of ‘general cognition’, where available. These measures were available in the first two studies, and they indexed task performance abilities that were distinct from vocabulary knowledge. We refer to them by their more specific names in each study.7 In Study 2.1, the measure used was attentional control (Hutchison et al., 2013). In Study 2.2, it was information uptake (Pexman & Yap, 2018). In Study 2.3, such a covariate was not used, as it was not available in the data set of Balota et al. (2007).
Lexical covariates were selected in every study out of the same five variables, which had been used as covariates in Wingfield and Connell (2022b; also see Petilli et al., 2021). They comprised: number of letters (i.e., orthographic length), word frequency, number of syllables (the latter two from Balota et al., 2007), orthographic Levenshtein distance (Yarkoni et al., 2008) and phonological Levenshtein distance (Suárez et al., 2011; Yap & Balota, 2009). A selection among these candidates was necessary because some of them were highly intercorrelated—i.e., \(r\) > .70 (Dormann et al., 2013; Harrison et al., 2018). The correlations and the selection models are available in Appendix A.
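For illustration, the collinearity screening could be carried out as in the following sketch, in which the data frame 'words' and the variable names are hypothetical placeholders for the item-level data.

lexical_vars <- c('number_of_letters', 'word_frequency', 'number_of_syllables',
                  'orthographic_Levenshtein_distance', 'phonological_Levenshtein_distance')
lexical_cors <- cor(words[, lexical_vars], use = 'pairwise.complete.obs')   # correlation matrix
which(abs(lexical_cors) > .70 & upper.tri(lexical_cors), arr.ind = TRUE)    # flag collinear pairs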
Word concreteness was included due to the pervasive effect of this variable across lexical and semantic tasks (Brysbaert et al., 2014; Connell & Lynott, 2012; Pexman & Yap, 2018), and due to the sizable correlations (\(r\) > .30) between word concreteness and some other predictors, such as visual strength (see correlation figures in each study). Furthermore, the role of word concreteness has been contested, with some research suggesting that its effect stems from perceptual simulation (Connell & Lynott, 2012) versus other research suggesting that the effect is amodal (Bottini et al., 2021). In passing, we will bring our results to bear on the role of word concreteness.
Data preprocessing and statistical analysis
In the three studies, the statistical analysis was designed to investigate the contribution of each effect of interest. The following preprocessing steps were applied. First, incorrect responses were removed. Second, nonword trials were removed (only necessary in Studies 2.1 and 2.3). Third, responses that were too fast or too slow were removed, applying the same thresholds that had been used in each of the original studies. That is, in Study 2.1, we removed responses faster than 200 ms or slower than 3,000 ms (Hutchison et al., 2013). In Study 2.2, we removed responses faster than 250 ms or slower than 3,000 ms (Pexman & Yap, 2018). In Study 2.3, we removed responses faster than 200 ms or slower than 4,000 ms (Balota et al., 2007). Next, the dependent variable—response time (RT)—was \(z\)-scored around each participant’s mean to curb the influence of each participant’s baseline speed (Balota et al., 2007; Kumar et al., 2020; Lim et al., 2020; Pexman et al., 2017; Pexman & Yap, 2018; Yap et al., 2012, 2017). This was important because the size of experimental effects is known to increase with longer RTs (Faust et al., 1999). Next, categorical predictors were recoded numerically (Brauer & Curtin, 2018). Specifically, participants’ gender was recoded as follows: Female = 0.5, X = 0, Male = -0.5. The SOAs in Study 2.1 were recoded as follows: 200 ms = -0.5; 1,200 ms = 0.5. Next, the data sets were trimmed by removing rows that lacked values on any variable, and by removing RTs that were more than 3 standard deviations (SD) away from the mean (M). The nesting factors applied in the trimming are specified in each study. Finally, all predictors were \(z\)-scored, resulting in M ≈ 0 and SD ≈ 1 (values not exact, as the variables were not normally distributed). More specifically, between-item predictors—i.e., word-level variables (e.g., language-based information) and task-level variables (e.g., SOA)—were \(z\)-scored around each participant’s own mean (Brauer & Curtin, 2018).
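As an illustration of these steps, the sketch below shows a minimal preprocessing pipeline using 'dplyr' and 'tidyr'. The data frame 'dat' and all column names are hypothetical placeholders, the response-time thresholds are those of Study 2.1, and the final \(z\)-scoring is shown at the grand-mean level for brevity (the nesting applied in the actual analyses differed by study).

library(dplyr)
library(tidyr)

dat_clean <- dat %>%
  filter(accuracy == 1) %>%                        # remove incorrect responses
  filter(stimulus_type != 'nonword') %>%           # remove nonword trials (Studies 2.1 and 2.3)
  filter(RT >= 200, RT <= 3000) %>%                # Study 2.1 thresholds; other studies differ
  group_by(Participant) %>%
  mutate(RT_z = as.numeric(scale(RT))) %>%         # z-score RT around each participant's mean
  ungroup() %>%
  mutate(gender_score = case_when(gender == 'Female' ~  0.5,
                                  gender == 'X'      ~  0,
                                  gender == 'Male'   ~ -0.5)) %>%
  drop_na() %>%                                    # remove rows lacking values on any variable
  filter(abs(RT_z) <= 3) %>%                       # trim RTs beyond 3 SD (nesting varies by study)
  mutate(across(c(word_frequency, word_concreteness, visual_strength, vocabulary_size),
                ~ as.numeric(scale(.x)),           # z-score the predictors (grand-mean shown here)
                .names = '{.col}_z'))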
Random effects
Regarding random effects, participants and stimuli were crossed in the three studies: each participant was presented with a subset of the stimuli, and, conversely, each word was presented to a subset of the participants. Therefore, linear mixed-effects models were implemented. These models included a maximal random-effects structure, with by-participant and by-item random intercepts, and the appropriate random slopes for all effects of interest (Barr et al., 2013; Brauer & Curtin, 2018; Singmann & Kellen, 2019). Random effects—especially random slopes—constrain the analysis by absorbing their share of the variance, which then cannot be taken up by the fixed effects. In the semantic priming study, the items were prime–target pairs, whereas in the semantic decision and lexical decision studies, the items were individual words. In the case of interactions, random slopes were included only when the interacting variables varied within the same unit (Brauer & Curtin, 2018)—e.g., an interaction of two variables varying within participants (only present in Study 2.1). Where required due to convergence warnings, random slopes for covariates were removed, following Remedy 11 from Brauer and Curtin (2018). Whereas Brauer and Curtin (2018) contemplate removing random slopes for covariates only when the covariates do not interact with any effects of interest, we removed random slopes for covariates even when they interacted with effects of interest, because those interactions were themselves covariates.
To avoid an inflation of the Type I error rate—i.e., false positives—, the random slopes for the effects of interest (as indicated in each study) were never removed (see Table 17 in Brauer & Curtin, 2018; for an example of this approach, see Diaz et al., 2021). This approach arguably provides better protection against false positives (Barr et al., 2013; Brauer & Curtin, 2018; Singmann & Kellen, 2019) than the practice of removing random slopes when they do not significantly improve the fit (Baayen et al., 2008; Bates et al., 2015; e.g., Bernabeu et al., 2017; Pexman & Yap, 2018; but also see Matuschek et al., 2017).
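As a minimal sketch of this random-effects structure, the formula below uses hypothetical variable names: the word-level predictor receives a by-participant random slope, the participant-level predictor receives a by-item random slope, and the effects of interest keep their slopes throughout.

rt_formula <-
  RT_z ~ vocabulary_size_z * visual_strength_z +   # effects of interest (illustrative)
    word_frequency_z + word_concreteness_z +       # covariates
    (1 + visual_strength_z | Participant) +        # word-level predictor varies within participants
    (1 + vocabulary_size_z | Word)                 # participant-level predictor varies within words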
Frequentist analysis
\(P\) values were calculated using the Kenward-Roger approximation for degrees of freedom (Luke, 2017) in the R package ‘lmerTest’, Version 3.1-3 (Kuznetsova et al., 2017). The latter package in turn used ‘lme4’, Version 1.1-26 (Bates et al., 2015; Bates et al., 2021). To facilitate the convergence of the models, the maximum number of iterations was set to 1 million. Diagnostics regarding convergence and normality are provided in Appendix B. Those effects that are non-significant or very small are best interpreted by considering the confidence intervals and the credible intervals (Cumming, 2014).
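A hedged sketch of this setup is shown below, reusing the placeholder formula and data from the preceding sketches; the exact optimiser and control settings used in each study may have differed.

library(lmerTest)  # loads 'lme4' and provides p values for fixed effects

fit <- lmer(
  rt_formula, data = dat_clean,
  control = lmerControl(optimizer = 'bobyqa',
                        optCtrl = list(maxfun = 1e6))  # raise the iteration cap to 1 million
)
summary(fit, ddf = 'Kenward-Roger')  # p values with Kenward-Roger degrees of freedom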
The R package ‘GGally’ (Schloerke et al., 2021) was used to create correlation plots, whereas the package ‘sjPlot’ (Lüdecke, 2021) was used for interaction plots.
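For instance, under the same placeholder names as above, such plots could be produced as follows (a sketch; the exact plotting functions and options used in each study may have differed).

library(GGally)
library(sjPlot)

ggcorr(dat_clean[, c('word_frequency_z', 'word_concreteness_z', 'visual_strength_z')],
       label = TRUE, label_round = 2)  # correlation plot of selected predictors
plot_model(fit, type = 'int')          # interaction plot(s) from the fitted model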
Bayesian analysis
A Bayesian analysis was performed to complement the estimates that had been obtained in the frequentist analysis. Whereas the goal of the frequentist analysis had been hypothesis testing, for which \(p\) values were used, the goal of the Bayesian analysis was parameter estimation. Accordingly, we estimated the posterior distribution of every effect, without calculating Bayes factors (for other examples of the same estimation approach, see Milek et al., 2018; Pregla et al., 2021; Rodríguez-Ferreiro et al., 2020; for comparisons between estimation and hypothesis testing, see Cumming, 2014; Kruschke & Liddell, 2018; Rouder et al., 2018; Schmalz et al., 2021; Tendeiro & Kiers, 2019, in press; van Ravenzwaaij & Wagenmakers, 2021). In the estimation approach, the estimates are interpreted by considering the position of their credible intervals in relation to the expected effect size. That is, the closer an interval is to an effect size of 0, the smaller the effect of that predictor. For instance, an interval that is symmetrically centred on 0 indicates a very small effect, whereas—in comparison—an interval that does not include 0 at all indicates a far larger effect.
This analysis served two purposes: first, to clarify the interpretation of the smaller effects—which were identified as unreliable in the power analyses—, and second, to complement the estimates obtained in the frequentist analysis. The latter purpose was pertinent because the frequentist models presented convergence warnings—even though it must be noted that a previous study found that frequentist and Bayesian estimates were similar despite convergence warnings in the frequentist analysis (Rodríguez-Ferreiro et al., 2020). Furthermore, the complementary analysis was pertinent because the frequentist models presented residual errors that deviated from normality—even though mixed-effects models are fairly robust to such deviations (Knief & Forstmeier, 2021; Schielzeth et al., 2020). Given these precedents, we expected to find broadly similar estimates in the frequentist and the Bayesian analyses. Across studies, each frequentist model has a Bayesian counterpart, with the exception of the secondary analysis performed in Study 2.1 (semantic priming) that included vision-based similarity as a predictor. The R package ‘brms’, Version 2.17.0, was used for the Bayesian analysis (Bürkner, 2018; Bürkner et al., 2022).
Priors
The priors were established by inspecting the effect sizes obtained in previous studies as well as the effect sizes obtained in our frequentist analyses of the present data (reported in Studies 2.1, 2.2 and 2.3 below). In the first regard, the previous studies that were considered were selected because the experimental paradigms, variables and analytical procedures they had used were similar to those used in our current studies. Specifically, regarding paradigms, we sought studies that implemented: (I) semantic priming with a lexical decision task—as in Study 2.1—, (II) semantic decision—as in Study 2.2—, or (III) lexical decision—as in Study 2.3. Regarding analytical procedures, we sought studies in which both the dependent and the independent variables were \(z\)-scored. We found two studies that broadly matched these criteria: Lim et al. (2020) (see Table 5 therein) and Pexman and Yap (2018) (see Tables 6 and 7 therein). Out of these studies, Pexman and Yap (2018) contained the variables that were most similar to ours, which included vocabulary size (labelled ‘NAART’) and word frequency.
Based both on these studies and on the frequentist analyses reported below, a range of effect sizes was identified that spanned between β = -0.30 and β = 0.30. This range was centred around 0 as the variables were \(z\)-scored. The bounds of this range were determined by the largest effects, which appeared in Pexman and Yap (2018). Pexman and Yap conducted a semantic decision study, and split the data set into abstract and concrete words. The two largest effects they found were—first—a word concreteness effect in the concrete-words analysis of β = -0.41, and—second—a word concreteness effect in the abstract-words analysis of β = 0.20. Unlike Pexman and Yap, we did not split the data set into abstract and concrete words, but analysed these sets together. Therefore, we averaged the absolute values of these two estimates ((0.41 + 0.20) / 2 ≈ 0.30), obtaining a range between β = -0.30 and β = 0.30.
In the results of Lim et al. (2020) and Pexman and Yap (2018), and in our frequentist results, some effects consistently presented a negative polarity (i.e., leading to shorter response times), whereas other effects were consistently positive. We incorporated the direction of effects into the priors only for large effects that had presented a consistent direction (either positive or negative) in previous studies and in our frequentist analyses in the present studies. These criteria were met by the following variables: word frequency—with a negative direction, as higher word frequency leads to shorter RTs (Brysbaert et al., 2016; Brysbaert et al., 2018; Lim et al., 2020; Mendes & Undorf, 2021; Pexman & Yap, 2018)—, number of letters and number of syllables—both with positive directions (Barton et al., 2014; Beyersmann et al., 2020; Pexman & Yap, 2018)—, and orthographic Levenshtein distance—with a positive direction (Cerni et al., 2016; Dijkstra et al., 2019; Kim et al., 2018; Yarkoni et al., 2008). We did not incorporate information about the direction of the word concreteness effect, as this effect can follow different directions in abstract and concrete words (Brysbaert et al., 2014; Pexman & Yap, 2018), and we analysed both sets of words together. In conclusion, the four predictors that had directional priors were all covariates. All the other predictors had priors centred on 0. Last, as a methodological matter, it is noteworthy that most psycholinguistic studies applying Bayesian analysis have not incorporated any directional information in their priors (e.g., Pregla et al., 2021; Rodríguez-Ferreiro et al., 2020; Stone et al., 2020; cf. Stone et al., 2021).
Prior distributions and prior predictive checks
The choice of priors can influence the results in consequential ways. To assess the extent of this influence, prior sensitivity analyses have been recommended. These analyses are performed by comparing the effect of more and less strict priors—or, in other words, priors varying in their degree of informativeness. The degree of variation is adjusted through the standard deviation, and the means are not varied (Lee & Wagenmakers, 2014; Schoot et al., 2021; Stone et al., 2020).
In this way, we compared the results obtained using ‘informative’ priors (SD = 0.1), ‘weakly-informative’ priors (SD = 0.2) and ‘diffuse’ priors (SD = 0.3). These standard deviations were chosen so that around 95% of values in the informative priors would fall within our initial range of effect sizes that spanned from -0.30 to 0.30. All priors are illustrated in Figure 1. These priors resembled others from previous psycholinguistic studies (Pregla et al., 2021; Stone et al., 2020; Stone et al., 2021). For instance, Stone et al. (2020) used the following priors: \(Normal\)(0, 0.1), \(Normal\)(0, 0.3) and \(Normal\)(0, 1). The range of standard deviations we used—i.e., 0.1, 0.2 and 0.3—was narrower than those of previous studies because our dependent variable and our predictors were \(z\)-scored, resulting in small estimates and small SDs (see Lim et al., 2020; Pexman & Yap, 2018). These priors were used on the fixed effects and on the standard deviation parameters of the fixed effects. For the correlations among the random effects, an LKJ(2) prior was used (Lewandowski et al., 2009). This is a ‘regularising’ prior, as it assumes that high correlations among random effects are rare (also used in Rodríguez-Ferreiro et al., 2020; Stone et al., 2020; Stone et al., 2021; Vasishth, Nicenboim, et al., 2018).
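For illustration, the weakly-informative set (SD = 0.2) could be specified in ‘brms’ as in the sketch below; the coefficient names are placeholders, and the means of the directional priors shown are assumptions rather than the exact values used.

library(brms)

priors_weak <- c(
  set_prior('normal(0, 0.2)', class = 'b'),    # all fixed effects, centred on 0 by default
  set_prior('normal(0, 0.2)', class = 'sd'),   # SDs of the random effects (bounded at 0, so half-normal)
  set_prior('lkj(2)', class = 'cor'),          # regularising prior on random-effect correlations
  # Directional prior for a covariate with a consistently negative effect (placeholder mean);
  # analogous positive-direction priors would be set for number of letters, number of
  # syllables and orthographic Levenshtein distance.
  set_prior('normal(-0.1, 0.2)', class = 'b', coef = 'word_frequency_z')
)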
source('bayesian_priors/bayesian_priors.R', local = TRUE)  # create the prior plots

knitr::include_graphics(
  paste0(
    getwd(), '/',  # build an absolute path to circumvent illegal characters in the file path
    'bayesian_priors/plots/bayesian_priors.pdf'
  )
)
The adequacy of each of these priors was assessed by performing prior predictive checks, in which we compared the observed data to the predictions of the model (Schoot et al., 2021). Furthermore, in these checks we also tested the adequacy of two model-wide distributions: the traditional Gaussian distribution (default in most analyses) and an exponentially modified Gaussian—dubbed ‘ex-Gaussian’—distribution (Matzke & Wagenmakers, 2009). The ex-Gaussian distribution was considered because the residual errors of the frequentist models were not normally distributed (Lo & Andrews, 2015), and because this distribution was found to be more appropriate than the Gaussian one in a previous, related study (see supplementary materials of Rodríguez-Ferreiro et al., 2020). The ex-Gaussian distribution had an identity link function, which preserves the interpretability of the coefficients, as opposed to a transformation applied directly to the dependent variable (Lo & Andrews, 2015). The results of these prior predictive checks revealed that the priors were adequate, and that the ex-Gaussian distribution was more appropriate than the Gaussian one (see Appendix C), converging with Rodríguez-Ferreiro et al. (2020). Therefore, the ex-Gaussian distribution was used in the final models.
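A minimal sketch of such a prior predictive check follows, reusing the placeholder formula, data and priors from the earlier sketches; the numbers of chains and iterations shown are arbitrary.

prior_fit <- brm(
  rt_formula, data = dat_clean,
  family = exgaussian(link = 'identity'),  # identity link keeps the coefficients interpretable
  prior = priors_weak,
  sample_prior = 'only',                   # draw from the priors alone, ignoring the data
  chains = 4, iter = 2000, seed = 1
)
pp_check(prior_fit, ndraws = 100)          # compare prior predictions with the observed RTs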
Prior sensitivity analysis
In the main analysis, the informative, weakly-informative and diffuse priors were used in separate models. In other words, in each model, all priors had the same degree of informativeness (as done in Pregla et al., 2021; Rodríguez-Ferreiro et al., 2020; Stone et al., 2020; Stone et al., 2021). In this way, a prior sensitivity analysis was performed to acknowledge the likely influence of the priors on the posterior distributions—that is, on the results (Lee & Wagenmakers, 2014; Schoot et al., 2021; Stone et al., 2020).
Posterior distributions
Posterior predictive checks were performed to assess the consistency between the observed data and new data predicted by the posterior distributions (Schoot et al., 2021). These checks are available in Appendix C.
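Assuming 'fit_bayes' denotes one of the final fitted Bayesian models (fitted as in the sketch above, but with the data informing the posterior), such a check could be run as follows.

pp_check(fit_bayes, ndraws = 100)  # overlay the observed RT distribution and posterior predictions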
Convergence
When convergence was not reached in a model, as indicated by \(\widehat R\) > 1.01 (Schoot et al., 2021; Vehtari et al., 2021), the number of iterations was increased and the random slopes for covariates were removed (Brauer & Curtin, 2018). The resulting random effects in these models were largely the same as those present in the frequentist models. The only exception concerned the models of the lexical decision study. In the frequentist model for the latter study, the random slopes for covariates were removed due to convergence warnings, whereas in the Bayesian analysis, these random slopes did not have to be removed as the models converged, thanks to the large number of iterations that were run. In the lexical decision study, it was possible to run a larger number of iterations than in the two other studies, as the lexical decision data set had fewer observations, resulting in faster running.
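A hedged sketch of this convergence check is shown below, reusing the placeholder 'fit_bayes' from above; the refitted iteration counts are illustrative.

rhats <- brms::rhat(fit_bayes)                                # one R-hat value per parameter
if (any(rhats > 1.01, na.rm = TRUE)) {
  fit_bayes <- update(fit_bayes, iter = 8000, warmup = 4000)  # increase the number of iterations
}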
The Bayesian models in the semantic decision study could not be made to converge, and the final results of these models were not valid. Therefore, those estimates are not shown in the main text, but are available in Appendix E.
Statistical power analysis
Power curves based on Monte Carlo simulations were computed for most of the effects of interest using the R package ‘simr’, Version 1.0.5 (Green & MacLeod, 2016). Obtaining power curves for a range of effects in each study allows a comprehensive assessment of the plausibility of the power estimated for each effect.
In each study, the item-level sample size—i.e., the number of words—was not modified. Therefore, to plan the sample size for future studies, these results must be considered under the assumptions that the future study would apply a statistical method similar to ours—namely, a mixed-effects model with random intercepts and slopes—and that the analysis would encompass at least as many stimuli as the corresponding study (numbers detailed in each study below). In the power simulations, \(p\) values were calculated using the Satterthwaite approximation for degrees of freedom (Luke, 2017).
Monte Carlo simulations consist of running the statistical model a large number of times, under slight, random variations of the dependent variable (Green & MacLeod, 2016; for a comparable approach, see Loken & Gelman, 2017). The power to detect each effect of interest is calculated by dividing the number of times that the effect is significant by the total number of simulations run. For instance, if an effect is significant on 85 simulations out of 100, the power for that effect is 85% (Kumle et al., 2021). The sample sizes tested in the semantic priming study ranged from 50 to 800 participants, whereas those tested in the semantic decision and lexical decision studies ranged from 50 to 2,000 participants. These sample sizes were unequally spaced to limit the computational requirements. They comprised the following: 50, 100, 200, 300, 400, 500, 600, 700, 800, 1,200, 1,600 and 2,000 participants.8 The variance of the results decreases as more simulations are run. In each of our three studies, 200 simulations (as in Brysbaert & Stevens, 2018) were run for each effect of interest and for each sample size under consideration. Thus, for a power curve examining the power for an effect across 12 sample sizes, 2,400 simulations were run.
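A sketch of one such power curve is shown below, using the placeholder names from the earlier sketches; the test specification assumes the Satterthwaite-based test option in 'simr'.

library(simr)

fit_ext <- extend(fit, along = 'Participant', n = 2000)  # enlarge the data to the largest sample size
power_curve <- powerCurve(
  fit_ext,
  test   = fixed('visual_strength_z', 'sa'),             # effect of interest; Satterthwaite-based test
  along  = 'Participant',
  breaks = c(50, 100, 200, 300, 400, 500, 600, 700, 800, 1200, 1600, 2000),
  nsim   = 200
)
print(power_curve)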
Power analyses require setting an effect size for each effect. Often, it is difficult to determine the effect size, as the relevant previous research is usually limited in amount and scope, and affected by bias (Albers & Lakens, 2018; Gelman & Carlin, 2014; Kumle et al., 2021). In some power analyses, the original effect sizes from previous studies have been adopted without any modification (e.g., Pacini & Barnard, 2021; Villalonga et al., 2021). In contrast, some authors have opted to reduce the previous effect sizes to account for two intervening factors. First, publication bias and insufficient statistical power cause published effect sizes to be inflated (Brysbaert, 2019; Loken & Gelman, 2017; Open Science Collaboration, 2015; Vasishth, Mertzen, et al., 2018; Vasishth & Gelman, 2021). Second, over the course of the research, a variety of circumstances could create differences between the planned study and the studies that were used in the power analysis. Some of these differences could be foreseeable—for instance, if they are due to a limitation in the literature available for the power analysis—, whereas other differences might be unforeseeable and could go unnoticed (Barsalou, 2019; Noah et al., 2018). Reducing the effect size in the power analysis increases the sample size required for the planned study (Brysbaert & Stevens, 2018; Green & MacLeod, 2016; Hoenig & Heisey, 2001). The reduced effect size—sometimes dubbed the smallest effect size of interest—is often set with a degree of arbitrariness. In previous studies, Fleur et al. (2020) applied a reduction of 1/8 (i.e., 12.5%), whereas Kumle et al. (2021) applied a 15% reduction. In the present study, a reduction of 20% was applied to every effect in the power analysis. Compared with the power analyses reviewed in this paragraph, this reduction leads to a more conservative (i.e., larger) estimate of the required sample sizes. However, after considering the precedents of small samples and publication bias reviewed above, a 20% reduction is arguably a reasonable safeguard. Indeed, a posteriori, the results of our power analyses suggested that the 20% reduction had not been excessive, as some of the effects examined were detectable with small sample sizes.
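Before running the simulations sketched above, the 20% reduction could be applied by scaling the fitted coefficient directly (the coefficient name is again a placeholder).

fixef(fit_ext)['visual_strength_z'] <- 0.8 * fixef(fit_ext)['visual_strength_z']  # 20% smaller effect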
Both the primary analysis and the power analysis were performed in R (R Core Team, 2021). Version 4.0.2 was used for the frequentist analysis, Version 4.1.0 was used for the Bayesian analysis, and Version 4.1.2 was used for fast operations such as data preprocessing and plotting. Given the complexity of these analyses, all the statistical and the power analyses were run on the High-End Computing facility at Lancaster University.9
7. The general cognition measures could also be dubbed general or fluid intelligence, but we think that cognition is more appropriate in our present context.
8. For the semantic priming study, the remaining sample sizes up to 2,000 participants have not finished running yet. Upon finishing, they will be reported in this manuscript.
9. Information about this facility is available at https://answers.lancaster.ac.uk/display/ISS/High+End+Computing+%28HEC%29+help. Even though analysis jobs were run in parallel, some of the statistical analyses took four months to complete (specifically, one month for the final model to run, which was delayed due to three reasons: limited availability of machines, occasional cancellations of jobs to allow maintenance work on the machines, and lack of convergence of the models). Furthermore, the power analysis for the semantic priming study took six months (specifically, two months of running, with delays due to the limited availability of machines and occasional cancellations of jobs).