# Chapter 4 Data analysis

## 4.1 Necessary conditions ‘in kind’: The ‘empty space’

Necessary condition hypotheses are qualitative statements like ‘\(X\) is necessary for \(Y\).’ The statement indicates that it is impossible to have \(Y\) without \(X\). Consequently, if the statement holds, the space corresponding to having \(Y\) but not \(X\) in an \(XY\) plot or contingency table has no observations and is thus empty. To calculate the size of the empty space, NCA draws a border line (ceiling line) between the empty space without observations and the full space with observations. The size of the empty space relative to the full space is the effect size and indicates the constraint that \(X\) poses on \(Y\). When the effect size is large (e.g., d > 0.1), and when the empty space is unlikely to be a random result of unrelated \(X\) and \(Y\) (e.g., p < 0.05), the researcher may conclude that there is empirical evidence for the hypothesis. Thus, after having evaluated *theoretical support*, the *effect size* and the *p value*, the hypothesis ‘\(X\) is necessary for \(Y\)’ may be supported. This hypothesis is formulated in a qualitative way and describes the necessary condition *in kind*. The effect size and its p value can be produced with the `NCA` software using the `nca_analysis` function.
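The idea of the effect size as the empty space relative to the full space can be sketched in a few lines of Python. This sketch is not the `NCA` software. It assumes a simple non-decreasing step-function ceiling (the highest \(Y\) observed up to a given \(X\)) and the empirical scope (observed minimum and maximum of \(X\) and \(Y\)); the function name `effect_size_step_ceiling` and the toy data are hypothetical.

```python
def effect_size_step_ceiling(xs, ys):
    """Empty space above a step-function ceiling, divided by the scope.

    Assumption: the ceiling at a given X is the highest Y observed among
    cases with that X or a lower X (a non-decreasing step function).
    """
    points = sorted(zip(xs, ys))
    x_min, x_max = points[0][0], points[-1][0]
    y_min, y_max = min(ys), max(ys)
    scope = (x_max - x_min) * (y_max - y_min)
    if scope == 0:
        raise ValueError("X and Y must both vary to define a scope")

    empty = 0.0
    ceiling = float("-inf")
    i, n = 0, len(points)
    while i < n:
        x = points[i][0]
        # Absorb all cases that share this X value into the ceiling.
        while i < n and points[i][0] == x:
            ceiling = max(ceiling, points[i][1])
            i += 1
        next_x = points[i][0] if i < n else x_max
        # Area between this ceiling step and the top of the scope.
        empty += (next_x - x) * (y_max - ceiling)
    return empty / scope


# Toy data (hypothetical): high Y only occurs together with high X.
x = [0, 1, 2, 3, 4]
y = [1, 0, 2, 1, 4]
d = effect_size_step_ceiling(x, y)
print(round(d, 3))
```

The sketch only covers the effect size. Judging whether the empty space could be a random result of unrelated \(X\) and \(Y\) requires the statistical test that the software provides.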

## 4.2 Necessary conditions ‘in degree’: The bottleneck table

For additional insights, the necessity relationship can be formulated in a quantitative way: the necessary condition *in degree*: “*level* of \(X\) is necessary for *level* of \(Y\).” The bottleneck table is a helpful tool for evaluating necessary conditions in degree. The bottleneck table is the tabular representation of the ceiling line. The first column is the outcome \(Y\) and the next columns are the necessary conditions. The values in the table are levels of \(X\) and \(Y\) corresponding to the ceiling line.
By reading the bottleneck table row by row from left to right, one can evaluate which threshold levels of the conditions \(X\) are necessary for a particular level of \(Y\). The bottleneck table includes only conditions that are supposed to be necessary conditions in kind. Conditions that are not supposed to be necessary (e.g., because the effect size is too small or the p value too large) are usually excluded from the table. The bottleneck table can be produced with the `NCA` software using the argument `bottlenecks = 'TRUE'` in the `nca_output` function.

Figure 4.1 shows the software output of a particular bottleneck table (Dul, Hauff, et al., 2021). It includes two personality traits of sales persons that were identified as necessary conditions for *Sales performance* (\(Y\)): *Ambition* (\(X_1\)) and *Sociability* (\(X_2\)). They were identified as such because there is theoretical support, the effect sizes are relatively large (d > 0.10), and the p values are relatively low (p < 0.05).

The example shows that up to level 40 of Sales performance, Ambition and Sociability are *not* necessary (NN). For levels 50 and 60 of Sales performance only Ambition is necessary, and for higher levels of Sales performance both personality traits are necessary. For example, for level 70 of Sales performance, Ambition must be at least 29.4 and Sociability at least 23.2. If a case (a sales person) has a level of a condition that is lower than the threshold value, this person cannot achieve the corresponding level of Sales performance. The condition is a bottleneck. For the highest level of Sales performance, the required threshold levels of Ambition and Sociability are 66.8 and 98.4, respectively.
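Reading a row of the bottleneck table amounts to comparing a case's condition levels with the thresholds in that row. The following Python sketch illustrates this with the three rows quoted above; the dictionary layout, the function name `bottlenecks_for` and the example person are hypothetical and not part of the `NCA` software.

```python
# Threshold levels quoted in the text for Figure 4.1 (percentage of range).
# None stands for "NN" (not necessary).
BOTTLENECK_ROWS = {
    40: {"Ambition": None, "Sociability": None},
    70: {"Ambition": 29.4, "Sociability": 23.2},
    100: {"Ambition": 66.8, "Sociability": 98.4},
}


def bottlenecks_for(case, y_level, rows=BOTTLENECK_ROWS):
    """Return the conditions that keep `case` from reaching `y_level`."""
    thresholds = rows[y_level]
    return [
        name
        for name, required in thresholds.items()
        if required is not None and case[name] < required
    ]


# Hypothetical sales person (not taken from the dataset).
person = {"Ambition": 45.0, "Sociability": 20.0}
for level in (40, 70, 100):
    print(level, bottlenecks_for(person, level))
```

An empty list means that no condition blocks the case at that outcome level; it does not mean that the case will reach the level, because the conditions are necessary but not sufficient.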

### 4.2.1 Levels expressed as percentage of range

Figure 4.1 is an example of a default bottleneck table produced by the `NCA` software. This table has 11 rows with levels of \(X\) and \(Y\) expressed as ‘percentage.range.’ The *range* is the maximum level minus the minimum level. A level of 0% of the range corresponds to the minimum level, 100% to the maximum level, and 50% to the middle level between these extremes. In the first row, the \(Y\) level is 0% of the \(Y\) range, and in the eleventh row it is 100%. In the default bottleneck table, \(Y\) and each \(X\) are expressed as percentages of their own respective ranges.
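The conversion between a percentage of the range and an actual value is a linear rescaling. The Python sketch below shows it for the 11 default rows, assuming an outcome that ranges from 0 to 5.5 (the range reported for Sales performance later in this chapter); the function names are hypothetical.

```python
def from_percentage_range(pct, minimum, maximum):
    """Actual value that corresponds to `pct` percent of the range."""
    return minimum + pct / 100 * (maximum - minimum)


def to_percentage_range(value, minimum, maximum):
    """Position of `value` within the range, as a percentage."""
    return 100 * (value - minimum) / (maximum - minimum)


# The 11 default rows: 0%, 10%, ..., 100% of the range of Y.
levels = [from_percentage_range(p, 0, 5.5) for p in range(0, 101, 10)]
print(levels)
```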

### 4.2.2 Levels expressed as actual values

The levels of \(X\) and \(Y\) in the bottleneck table can also be expressed as actual values. Actual values are the values of \(X\) and \(Y\) as they are in the original dataset (and in the scatter plot). This way of expressing the levels can help to compare the bottleneck table results with the original data and with the scatter plot results. Expressing the levels of \(X\) with actual values can be done with the argument `bottleneck.x = 'actual'` in the `nca_analysis` function of the `NCA` software, and the actual values of \(Y\) can be shown in the bottleneck table with the argument `bottleneck.y = 'actual'`.

One practical use of a bottleneck table with actual values is illustrated with the example of sales performance. Figure 4.2 combines the bottleneck table with the scatter plots.

The actual values of Sales performance range from 0 to 5.5, and those of both conditions from 0 to 100. The left side of Figure 4.2 shows that up to level 2 of Sales performance, none of the conditions is necessary. The scatter plots show that cases with a very low level of Ambition or Sociability can still achieve level 2 of Sales performance. However, for level 4 of Sales performance Ambition must be at least 33 and Sociability at least 29. The scatter plots show that several cases have not reached these threshold levels of the conditions. This means that these cases do not reach level 4 of Sales performance. The scatter plots also show that cases exist with (much) higher levels of the conditions than these threshold levels that nevertheless do not reach level 4 of Sales performance: Ambition and Sociability are necessary, but not sufficient for Sales performance.

The right side of Figure 4.2 shows a particular case with Sales performance level 4, Ambition level 65 and Sociability level 59. For this person Sociability is the bottleneck for moving from level 4 to level 5 of Sales performance. Level 5 of Sales performance requires level 55 of Ambition, which this person has already achieved. However, the person’s level of Sociability is 59, which is below the threshold level that is required for level 5 of Sales performance. To reach this level of Sales performance, this person must increase the level of Sociability (e.g., by training). Improving Ambition without improving Sociability has no effect and is a waste of effort. For other persons, the individual situation might be different (e.g., Ambition but not Sociability is the bottleneck, both are bottlenecks, or none is a bottleneck). This type of bottleneck analysis allows the researcher to better understand the bottlenecks of individual cases and how to act on individual cases.

### 4.2.3 Levels expressed as percentiles

The levels of \(X\) and \(Y\) can also be expressed as percentiles. This can be helpful for selecting ‘important’ necessary conditions when acting on a group of cases. A percentile is a score below which a certain percentage of scores fall. For example, the 90th percentile of \(X\) is a score of \(X\) that is so high that 90% of the observed \(X\)-scores fall below it. Similarly, the 5th percentile of \(X\) is a score of \(X\) that is so low that only 5% of the observed \(X\)-scores fall below it.
The bottleneck table can be expressed with percentiles by using the argument `bottleneck.x = 'percentile'` and/or `bottleneck.y = 'percentile'` in the `nca_analysis` function of the `NCA` software. Figure 4.3 shows the example bottleneck table with the level of \(Y\) expressed as actual values, and the levels of the \(X\)s as percentiles.

A percentile level of \(X\) corresponds to the percentage of cases that are *unable* to reach the threshold level of \(X\) and thus the corresponding level of \(Y\) in the same row. If in a certain row (e.g., the row with \(Y\) = 3) the percentile of \(X\) is small (e.g., 4% for Ambition), only a few cases (4% of 108 = 4) were not able to reach the required level of \(X\) for the corresponding level of \(Y\). If the percentile of \(X\) is large (e.g., in the row with \(Y\) = 5.5), many cases were not able to reach the required level of \(X\) for the corresponding level of \(Y\). Therefore, the percentile for \(X\) is an indication of the ‘importance’ of the necessary condition: how many cases were unable to reach the required level of the necessary condition for a particular level of the outcome. When several necessary conditions exist, this information may be relevant for prioritizing collective action on a group of cases: focus on those bottleneck conditions with high percentile values of \(X\).
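The link between a percentile and the number of cases that fall short of a threshold can be made concrete with a small Python sketch. The function names and the list of scores are hypothetical; only the figures 4% and 108 cases come from the example above.

```python
def percent_below(values, threshold):
    """Percentage of observed values that fall below `threshold`."""
    below = sum(1 for v in values if v < threshold)
    return 100 * below / len(values)


def cases_unable(percentile, n_cases):
    """Approximate number of cases below the threshold, rounded to whole cases."""
    return round(percentile / 100 * n_cases)


# Hypothetical X-scores of five cases and a threshold level of 30.
scores = [10, 20, 30, 40, 50]
print(percent_below(scores, 30))

# The example from the text: a percentile of 4% with 108 cases.
print(cases_unable(4, 108))
```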

Further details of using and interpreting the bottleneck table in this specific example can be seen in this video.

### 4.2.4 Levels expressed as percentage of maximum

The final way of expressing the levels of \(X\) and \(Y\) is less commonly used. It is possible to express levels of \(X\) and \(Y\) as a percentage of their maximum levels. The bottleneck table can be expressed with percentage of maximum by using the argument `bottleneck.x = 'percentage.max'` or `bottleneck.y = 'percentage.max'` in the `nca_analysis` function of the `NCA` software.
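The difference between a percentage of the maximum and a percentage of the range only shows when the minimum level is not zero. A minimal Python sketch, with hypothetical function names and a hypothetical condition observed between 20 and 100:

```python
def to_percentage_max(value, maximum):
    """Level as a percentage of the maximum level."""
    return 100 * value / maximum


def to_percentage_range(value, minimum, maximum):
    """Level as a percentage of the range (maximum minus minimum)."""
    return 100 * (value - minimum) / (maximum - minimum)


# Hypothetical condition observed between 20 and 100, evaluated at level 60.
print(to_percentage_max(60, 100))        # relative to the maximum only
print(to_percentage_range(60, 20, 100))  # relative to the range
```

When the minimum is 0, as for the conditions in the sales performance example, both expressions give the same number.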

### 4.2.5 NN and NA in the bottleneck table

An NN (Not Necessary) in the bottleneck table means that \(X\) is not necessary for \(Y\) for the particular level of \(Y\). With any value of \(X\) it is possible to achieve the particular level of \(Y\).

An NA (Not Applicable) in the bottleneck table is a warning that it is not possible to compute a value for \(X\). There are two possible reasons for it, the first occurring more often than the second:

1. The maximum possible value of the outcome according to the ceiling line is lower than the actually observed maximum value, so that no value of the condition within the scope corresponds to the particular level of \(Y\). This can happen for example when the CR ceiling (which is a trend line) runs at \(X\) = \(X_{max}\) (which is the right vertical line of the scope in the scatter plot) under the line \(Y\) = \(Y_{max}\) (which is the upper horizontal line of the scope). If this happens the researcher can either explain why this NA appears in the bottleneck table, or can change NA into the highest observed level of \(X\). The latter can be done with the argument `cutoff = 1` in the `nca_analysis` function.
2. In a bottleneck table with multiple conditions, one case determines the \(Y_{max}\) value and that case has a missing value for the condition with the NA (but not for another condition). When all cases are complete (have no missing values) or when at least one case exists that has a complete observation (\(X\), \(Y_{max}\)) for the given condition, the NA will not appear. The action to be taken is either to explain why this NA appears, or to delete the incomplete case from the bottleneck table analysis (and accept that the \(Y_{max}\) in the bottleneck table does not correspond to the actually observed \(Y_{max}\)).
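The first reason for an NA, and the meaning of NN, can be illustrated by inverting a straight ceiling line for a given level of \(Y\). The Python sketch below is a simplification under stated assumptions, not the logic of the `NCA` software: it assumes an increasing linear ceiling \(Y_c = a + bX\), reports NN when the required \(X\) does not exceed the observed minimum, and reports NA when it exceeds the observed maximum. The function name, the ceiling coefficients and the scope are hypothetical. The `cutoff` parameter only mimics the NA replacement described above; how the software treats NN under that option is not modelled.

```python
def required_x(y_level, intercept, slope, x_min, x_max, cutoff=0):
    """Invert the linear ceiling Y_c = intercept + slope * X for one Y level."""
    if slope <= 0:
        raise ValueError("the sketch assumes an increasing ceiling line")
    x_c = (y_level - intercept) / slope
    if x_c <= x_min:
        return "NN"  # any observed X allows this level of Y
    if x_c > x_max:
        # The ceiling stays below this Y level within the scope.
        return x_max if cutoff == 1 else "NA"
    return x_c


# Hypothetical scope: X from 0 to 10, Y from 0 to 8.
# At X = 10 the ceiling reaches 2 + 0.5 * 10 = 7, which is below Y_max = 8.
for y in (1, 4, 8):
    print(y, required_x(y, intercept=2, slope=0.5, x_min=0, x_max=10))
```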

## 4.3 Combining NCA with regression analysis

### 4.3.1 Introduction

Regression is the mother of all data analyses in the social sciences. It was invented more than 100 years ago when Francis Galton (Galton, 1886) quantified the pattern in the scores of parental height and child height (see Figure 4.4 showing the original graph).

In Figure 4.5 Galton’s data are shown in two \(XY\) scatter plots.

Galton drew lines through the middle of the data for describing the average trend between Parental height and Child height: the regression line (Figure 4.5 A). For example, with a Parent height of 175 cm, the estimated average Child height is about 170 cm. Galton could also have drawn a line on top of the data for describing the necessity of Parent height for Child height: the ceiling line (Figure 4.5 B). For example, with a Parent height of 175 cm, the estimated maximum possible Child height is about 195 cm. But Galton did not draw a ceiling line, and the social sciences have adopted the average trend line as the basis for many data analysis approaches. Regression analysis has developed over the years and many variants exist. The main variant is Ordinary Least Squares (OLS) regression. It is used for example in Simple Linear Regression (SLR), Multiple Linear Regression (MLR), path analysis, variance-based Structural Equation Modeling (SEM) and Partial Least Squares Structural Equation Modeling (PLS-SEM). In this section I compare OLS regression with NCA.

### 4.3.2 Logic and theory

OLS regression uses additive, average effect logic. The regression line (Figure 4.5 A) predicts the average \(Y\) for a given \(X\). Because the cases are scattered, values of \(Y\) that are higher or lower than the average value of \(Y\) are also possible for a given \(X\). With one \(X\) (Simple Linear Regression), \(Y\) is predicted by the regression equation:

\[\begin{equation} \tag{4.1} Y = β_0 + β_1 X + ɛ(X) \end{equation}\]

where \(β_0\) is the intercept of the regression line, \(β_1\) is the slope of the regression line, and \(ɛ(X)\) is the error term representing the scatter around the regression line for a given \(X\). The slope of the regression line (regression coefficient) is estimated by minimizing the squared vertical distances between the observed \(Y\)-values and the regression line (‘least squares’). The error term includes the effect of all other factors that can contribute to the outcome \(Y\).

For the parent-child data, the regression equation is \(Y = 57.5 + 0.64 X + ɛ(X)\). OLS regression assumes that on average \(ɛ(X) = 0\). Thus, when \(X\) (Parent height) is 175 cm, the estimated average Child height is about 170 cm. In contrast NCA’s C-LP ceiling line is defined by \(Y_c = -129 + 1.85 X\). Thus, when \(X\) (Parent height) is 175 cm, the estimated maximum possible Child height is about 195 cm. Normally, in NCA the ceiling line is interpreted inversely (e.g., in the bottleneck table): \(X_c = (Y_c + 129)/1.85\) indicating, while assuming a non-decreasing ceiling line, that a minimum level of \(X = X_c\) is necessary (but not sufficient) for a desired level of \(Y =Y_c\). When parents wish to have a child of 200 cm it is necessary (but not sufficient) that their Parent height is at least about 177 cm.
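The arithmetic behind these statements can be reproduced with a short Python sketch that uses the rounded coefficients quoted above. The function names are hypothetical.

```python
def average_child_height(parent_cm):
    """OLS regression line: estimated average Y for a given X."""
    return 57.5 + 0.64 * parent_cm


def max_child_height(parent_cm):
    """C-LP ceiling line: estimated maximum possible Y for a given X."""
    return -129 + 1.85 * parent_cm


def min_parent_height(child_cm):
    """Ceiling line read inversely: minimum X that is necessary for a desired Y."""
    return (child_cm + 129) / 1.85


print(average_child_height(175))
print(max_child_height(175))
print(min_parent_height(200))
```

With these rounded coefficients the inverse reading for a child height of 200 cm lies between 177 and 178 cm, in line with the approximate value given in the text.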

To allow for doing statistical tests with OLS, it is usually assumed that the error term for a given \(X\) is normally distributed (with average value 0): cases close to the regression line for the given \(X\) are more likely than cases far from the regression line. The normal distribution is unbounded, hence very high or very low values of \(Y\) are possible, though not likely. This implies that any high value of \(Y\) is possible. Even without the assumption of the normal distribution of the error term, a fundamental assumption of OLS is that the \(Y\) value is unbounded (Berry, 1993). Thus, very large child heights (e.g., 300 cm) are theoretically possible in OLS, but unlikely. This assumption contradicts NCA’s logic in which \(X\) and \(Y\) are presumed bounded. \(X\) puts a limit on \(Y\) and thus there is a border represented by the ceiling line. The limits can be empirically observed in the sample (e.g., the height of the tallest person observed in the sample is 205 cm) for defining NCA’s empirical scope, or can be theoretically defined (e.g., the height of the tallest person ever observed, 272 cm) for defining NCA’s theoretical scope.

Additivity is another part of regression logic. It is assumed that the terms of the regression equation are added. Next to \(X\), the error term is always added in the regression equation. Possibly also other \(X\)’s or combination of \(X\)’s are added in the regression equation (multiple regression, see below). This means that the terms that make up the equation can compensate for each other. For example, when \(X\) is low, \(Y\) can still be achieved when other terms (error term or other \(X\)’s) give a higher contribution to \(Y\). The additive logic implies that for achieving a certain level of \(Y\), no \(X\) is necessary. This additive logic contradicts NCA’s logic that \(X\) is necessary: \(Y\) cannot be achieved when the necessary factor does not have the right level, and this absence of \(X\) cannot be compensated by other factors.

Results of a regression analysis are usually interpreted in terms of sufficiency. A common sufficiency-type hypothesis is ‘\(X\) increases \(Y\)’ or ‘\(X\) has a positive effect on \(Y\).’ Such a hypothesis can be tested with regression analysis. The hypothesis is considered to be supported if the regression coefficient is positive. Often, it is then suggested that \(X\) is sufficient to produce an increase of the outcome \(Y\). The results also suggest that a given \(X\) is not necessary for producing the outcome \(Y\) because other factors in the regression model (other \(X\)’s and the error term) can compensate for the absence or a low level of \(X\).

### 4.3.3 Data analysis

Most regression models include more than one \(X\). The black box of the error term is opened and other \(X\)’s are added to the regression equation (Multiple Linear Regression - MLR), for example:

\[\begin{equation} \tag{4.2} Y = β_0 + β_1 X_1 + β_2 X_2 + \epsilon_X \end{equation}\]

where \(β_1\) and \(β_2\) are the regression coefficients. By adding more factors that contribute to \(Y\) into the equation, \(Y\) is predicted for given combinations of \(X\)’s, and a larger part of the scatter is explained. R\(^2\) is the proportion of variance explained by a regression model and can have values between 0 and 1. By adding more factors, usually more variance is explained, resulting in higher values of R\(^2\).

Another reason to add more factors is to avoid ‘omitted variable bias.’ This bias is the result of not including factors that correlate with \(X\) and \(Y\), which results in biased estimates of the regression coefficients. Hence, the common standard of regression is not the simple OLS regression with one factor, but multiple regression with many factors including control variables to reduce omitted variable bias. By adding more relevant factors, the prediction of \(Y\) becomes better and the risk of omitted variable bias is reduced. Adding factors in the equation is not just adding new factors (\(X\)). Some factors may be combined, such as squaring a factor (\(X^2\)) to represent a non-linear effect of \(X\) on \(Y\), or taking the product of two factors \((X_1 * X_2)\) to represent the interaction between these factors. Such combinations of factors are added as separate terms into the regression equation. Also, other regression-based approaches such as SEM and PLS-SEM include many factors. In SEM models factors are ‘latent variables’ of the measurement model of the SEM approach.

A study by Bouquet & Birkinshaw (2008) is an example of the prediction of an average outcome using MLR by adding many terms, including combined factors, in the regression equation. This highly cited article in the *Academy of Management Journal* studies multinational enterprises (MNEs) to predict how subsidiary companies gain attention from their headquarters (\(Y\)). They use a multiple regression model with 25 terms (\(X\)’s and combinations of \(X\)’s) and an ‘error’ term \(ɛ\). With the regression model the average outcome (average attention) for a group of cases (or for the theoretical ‘average case’) for given values of the terms can be estimated. The error term represents all unknown factors that have a positive or negative effect on the outcome but are not included in the model, assuming that the average effect of the error term is zero. \(β_0\) is a constant and the other \(β_i\)’s are the regression coefficients of the terms, indicating how strongly the term is related to the outcome (when all other terms are constant). The regression model is:

\[\begin{aligned}
Attention = {} & β_0 \\
& + β_1\ Subsidiary\ size \\
& + β_2\ Subsidiary\ age \\
& + β_3\ (Subsidiary\ age)^2 \\
& + β_4\ Subsidiary\ autonomy \\
& + β_5\ Subsidiary\ performance \\
& + β_6\ (Subsidiary\ performance)^2 \\
& + β_7\ Subsidiary\ functional\ scope \\
& + β_8\ Subsidiary\ market\ scope \\
& + β_9\ Geographic\ area\ structure \\
& + β_{10}\ Matrix\ structure \\
& + β_{11}\ Geographic\ scope \\
& + β_{12}\ Headquarter\ AsiaPacific\ parentage \\
& + β_{13}\ Headquarter\ NorthAmerican\ parentage \\
& + β_{14}\ Headquarter\ subsidiary\ cultural\ distance \\
& + β_{15}\ Presence\ of\ MNEs\ in\ local\ market \\
& + β_{16}\ Local\ market\ size \\
& + β_{17}\ Subsidiary\ strength\ within\ MNE\ Network \\
& + β_{18}\ Subsidiary\ initiative\ taking \\
& + β_{19}\ Subsidiary\ profile\ building \\
& + β_{20}\ Headquarter\ subsidiary\ geographic\ distance \\
& + β_{21}\ (Initiative\ taking * Geographic\ distance) \\
& + β_{22}\ (Profile\ building * Geographic\ distance) \\
& + β_{23}\ Subsidiary\ downstream\ competence \\
& + β_{24}\ (Initiative\ taking * Downstream\ competence) \\
& + β_{25}\ (Profile\ building * Downstream\ competence) \\
& + \epsilon
\end{aligned}\]

The 25 terms of single and combined factors in the regression equation explain 27% of the variance (R\(^2\) = 0.27). Thus, the error term (representing the factors that are not included) represents the other 73% (unexplained variance). Single terms predict only a small part of the outcome. For example, ‘subsidiary initiative taking’ (term 18) contributes 2% to the explained variance.

The example shows that adding more factors makes the model more complex and less understandable and therefore less useful in practice. The contrast with NCA is large. NCA can have a model with only one factor that perfectly explains the absence of a certain level of an outcome when the factor is not present at the right level for that outcome. Whereas regression models must include factors that correlate with other factors and with the outcome to avoid biased estimation of the regression coefficient, NCA’s effect size for a necessary factor is not influenced by the absence or presence of other factors in the model. This is illustrated with an example about the effect of a sales person’s personality on sales performance using data of 108 cases (sales representatives from a large US food manufacturer) obtained with the Hogan Personality Inventory (HPI) personality assessment tool for predicting organizational performance (Hogan & Hogan, 2007). Details of the example are in Dul, Hauff, et al. (2021). The statistical descriptives of the data (mean, standard deviation, correlation) are shown in Figure 4.6 A. Ambition and Sociability are correlated with \(Y\) as well as with each other. Hence, if one of them is omitted from the model the regression results may be biased.

The omission of one variable is shown in Figure 4.6 B, middle column. The full model (Model 1) includes all four personality factors. The regression results show that Ambition and Sociability have positive average effects on Sales performance (regression coefficients 0.13 and 0.16, respectively), and Learning approach has a negative average effect on Sales performance (regression coefficient -0.11). Interpersonal sensitivity has virtually no average effect on Sales performance (regression coefficient 0.01). The p values for Ambition and Sociability are relatively small. Model 2 has only three factors because Sociability is omitted. The regression results show that the regression coefficients of all three remaining factors have changed. The regression coefficient for Ambition has increased to 0.18, and the regression coefficients of the other two factors show minor differences (because these factors are less correlated with the omitted variable). Hence, in a regression model that is not correctly specified, because it leaves out factors that correlate both with the included factors and with the outcome, the regression coefficients of the included factors may be biased (omitted variable bias). However, the results of the NCA analysis do not change when a variable is omitted (Figure 4.6 B, right column). This means that an NCA model does not suffer from omitted variable bias.
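The mechanism of omitted variable bias can be reproduced with a tiny, fully artificial dataset. The Python sketch below is not based on the sales performance data: it assumes an outcome that is exactly \(Y = X_1 + X_2\) with correlated \(X_1\) and \(X_2\), and it computes the OLS coefficients in closed form from centered sums of squares and cross-products. All names and numbers are hypothetical.

```python
def centered(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def slope_one_factor(x, y):
    """OLS slope of Y on a single X."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)


def slopes_two_factors(x1, x2, y):
    """OLS slopes of Y on X1 and X2 (normal equations on centered data)."""
    a, b, c = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Artificial data: X1 and X2 are positively correlated, and Y = X1 + X2.
x1 = [0, 1, 2, 3]
x2 = [0, 1, 1, 2]
y = [u + v for u, v in zip(x1, x2)]

full = slopes_two_factors(x1, x2, y)   # both factors included
reduced = slope_one_factor(x1, y)      # X2 omitted
print(full, reduced)
```

In the full model both coefficients equal the values used to construct the data, whereas the coefficient of \(X_1\) is inflated once the correlated \(X_2\) is left out. A ceiling line for \(X_1\), by contrast, is computed from the (\(X_1\), \(Y\)) pairs only, so it is the same whether or not \(X_2\) is part of the model.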

### 4.3.4 How to combine NCA and regression

This example illustrates that regression and NCA are fundamentally different and complementary. A regression analysis can be added to an NCA study to evaluate the average effect of the identified necessary condition on the outcome. However, the researcher must then include all relevant factors, also those that are not expected to be necessary, to avoid omitted variable bias, and must obtain measurement scores for these factors. When NCA is added to a regression study not much extra effort is required. If a theoretical argument is available for a factor being necessary, any factor that is included in a regression model (independent variables, moderators, mediators) can also be treated as a potential necessary condition that can be tested with NCA. This could be systematically done:

- For all potential necessary conditions.
- For those factors that provide a surprising result in the regression analysis (e.g., in terms of direction of the regression coefficient) to better understand the result.
- For those factors that show no or a limited effect in the regression analysis (small regression coefficient) to check whether such factors, although ‘unimportant’ on average, may still be necessary for a certain outcome.
- For those factors that have a large effect in the regression analysis (large regression coefficient) to check whether a factor that is ‘important’ on average may also be necessary or not.

When adding NCA to a regression analysis more insight about the effect of \(X\) on \(Y\) can be obtained.

### 4.3.5 What is the same in NCA and regression?

I showed that regression has several characteristics that are fundamentally different from the characteristics of NCA. Regression is about average trends, uses additive logic, assumes unbounded \(Y\) values, is prone to omitted variable bias, needs control variables, and is used for testing sufficiency-type hypotheses, whereas NCA is about necessity logic, assumes bounded \(X\) and \(Y\), is immune to omitted variable bias, does not need control variables, and is used for testing necessity hypotheses.

However, NCA and regression also share several characteristics. Both NCA and regression are variance-based approaches and use linear algebra (although NCA can also be applied with the set theory approach with Boolean algebra; see the supplement on NCA and QCA). Both methods need good (reliable and valid) data without measurement error, although NCA may be more prone to measurement error. For statistical generalization from sample to population both methods need a probability sample that is representative of the population, and larger samples usually give more reliable estimates of the population parameters, although NCA can handle small sample sizes. Additionally, for generalization of the findings of a study both methods need replications with different samples; a one-shot study is not conclusive. Neither method can make strong causal interpretations when observational data are used; then at least theoretical support is also needed. When null hypothesis testing is used in either method, such tests and the corresponding p values have strong limitations and are prone to misinterpretation; a low p value only indicates that the data are unlikely to be the result of randomness and is not proof of the specific alternative hypothesis of interest (average effect, or necessity effect).

When a researcher uses NCA or OLS, these common fundamental limitations should be acknowledged. When NCA and OLS are used in combination the fundamental differences between the methods should be acknowledged. It is important to stress that one method is not better than the other. NCA and OLS are different and address different research questions. To ensure theory-method fit, OLS is the preferred method when the researcher is interested in an average effect of \(X\) on \(Y\), and NCA is the preferred method when the researcher is interested in the necessity effect of \(X\) on \(Y\).

## 4.4 Combining NCA with QCA

### 4.4.1 Introduction to QCA

#### 4.4.1.1 History of QCA

Qualitative Comparative Analysis (QCA) has its roots in political science and sociology, and was developed by Charles Ragin (Ragin, 1987, 2000, 2008). QCA has steadily evolved and been used over the years, and currently many types of QCA approaches exist. A common interpretation of QCA is described by Schneider & Wagemann (2012), which I follow in this book.

#### 4.4.1.2 Logic and theory of QCA

Set theory is at the core of QCA. It means that relations between sets, rather than relations between variables, are studied. A case can be part of a set or not part of the set. For example, the Netherlands is a case (of all countries) that is ‘in the set’ of rich countries, and Ethiopia is a case that is ‘out of the set’ of rich countries. Set membership scores (rather than variable scores) are linked to a case. Regarding the set of rich countries, the Netherlands has a set membership score of 1 and Ethiopia of 0. In the original version of QCA the set membership scores could only be 0 or 1. This version of QCA is called crisp-set QCA (csQCA). Later, fuzzy-set QCA (fsQCA) was also developed, in which the membership scores can also have values between 0 and 1. For example, Croatia could be allocated a set membership score of 0.7, indicating that it is ‘more in the set’ than ‘out of the set’ of rich countries.

Suppose that one set is the set of rich countries (X), and another set is the set of countries with happy people (‘happy countries,’ Y). QCA uses Boolean (binary) algebra and expresses the relationship between condition X and outcome Y in terms of how the presence or absence of X is related to the presence or absence of Y. More specifically, the relations are expressed in terms of sufficiency and necessity. For example, the presence of X (being a country that is part of the set of rich countries) could be theoretically stated as sufficient for the presence of Y (being a country that is part of the set of happy countries). All rich countries are happy countries. No rich country is not a happy country. The set of rich countries is a subset of the set of happy countries: set X is a subset of set Y. Alternatively, in another theory it could be stated that the presence of X (being a country that is part of the set of rich countries) is necessary for the presence of Y (being a country that is part of the set of happy countries). All happy countries are rich countries. No happy country is not a rich country. The set of happy countries is a subset of the set of rich countries: set X is a superset of set Y.

QCA’s main interest is about sufficiency. QCA assumes that a configuration of single conditions produces the outcome. For example, the condition of being in the set of rich countries (\(X_1\)) AND the condition of being in the set of democratic countries (\(X_2\)) is sufficient for the outcome of being in the set of happy countries (\(Y\)). QCA’s Boolean logic statement for this sufficiency relationship is expressed as follows:

\[\begin{equation} \tag{4.3} X_1*X_2 → Y \end{equation}\]

where the symbol ‘\(*\)’ means the logical ‘AND,’ and the symbol ‘\(→\)’ means ‘is sufficient for.’
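For crisp sets, the subset reading of sufficiency and necessity can be checked directly with set operations. The Python sketch below uses invented set memberships purely for illustration; apart from the Netherlands being in the set of rich countries, nothing here is taken from data, and the placeholder names and function names are hypothetical.

```python
# Hypothetical crisp sets of countries (illustration only).
rich = {"Netherlands", "Country A", "Country B"}
democratic = {"Netherlands", "Country A", "Country C"}
happy = {"Netherlands", "Country A", "Country D"}


def is_sufficient(x_set, y_set):
    """X is sufficient for Y when every member of X is a member of Y."""
    return x_set <= y_set


def is_necessary(x_set, y_set):
    """X is necessary for Y when every member of Y is a member of X."""
    return y_set <= x_set


# The logical AND of two conditions is the intersection of their sets.
rich_and_democratic = rich & democratic

print(is_sufficient(rich, happy))                 # rich alone
print(is_sufficient(rich_and_democratic, happy))  # X1 * X2 -> Y
print(is_necessary(rich, happy))
```

In this invented example neither single condition is sufficient or necessary on its own, but the configuration of both conditions is sufficient for the outcome.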

Furthermore, QCA assumes that several alternative configurations may exist that can produce the outcome, known as ‘equifinality.’ This is expressed in the following example:

\[\begin{equation} \tag{4.4} X_1*X_2 + X_2*X_3*X_4 → Y \end{equation}\]

where the symbol ‘\(+\)’ means the logical ‘OR.’

It is also possible that the absence of a condition is part of a configuration. This is shown in the following example:

\[\begin{equation} \tag{4.5} X_1*X_2 + X_2*{\sim}X_3*X_4 → Y \end{equation}\]

where the symbol ‘\(\sim\)’ means ‘absence of.’
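The left-hand side of statement (4.5) can be evaluated for a single case once the logical operators are given a numerical meaning. The Python sketch below uses the operators that are standard for fuzzy sets (minimum for AND, maximum for OR, and one minus the score for absence), which also work for crisp scores of 0 and 1. These operators are not spelled out in the text above, and the function names and membership scores are hypothetical.

```python
def AND(*scores):
    """Logical AND of set membership scores: the minimum."""
    return min(scores)


def OR(*scores):
    """Logical OR of set membership scores: the maximum."""
    return max(scores)


def NOT(score):
    """Absence of a condition: one minus the membership score."""
    return 1 - score


def configuration_4_5(x1, x2, x3, x4):
    """Membership in X1*X2 + X2*~X3*X4, the left-hand side of (4.5)."""
    return OR(AND(x1, x2), AND(x2, NOT(x3), x4))


# Crisp case: only the second configuration is present.
print(configuration_4_5(0, 1, 0, 1))
# Fuzzy case with hypothetical membership scores.
print(configuration_4_5(0.2, 0.9, 0.3, 0.8))
```

For a sufficiency statement to hold for a case, its membership in the outcome set should be at least as high as its membership in the configuration.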

Single conditions in a configuration that is sufficient for the outcome are called INUS conditions (Mackie, 1965). An INUS condition is an ‘Insufficient but Non-redundant (i.e., Necessary) part of an Unnecessary but Sufficient condition.’ In this expression, the words ‘part’ and ‘condition’ are somewhat confusing because ‘part’ refers to the single condition and ‘condition’ refers to the configuration that consists of single conditions. Insufficient refers to the fact that a part (single condition) is not itself sufficient for the outcome. Non-redundant refers to the necessity of the part (single condition) for the configuration being sufficient for the outcome. Unnecessary refers to the possibility that other configurations can also be sufficient for the outcome. Sufficient refers to the fact that the configuration is sufficient for the outcome. Although a single condition may be locally necessary for the configuration to be sufficient for the outcome, it is not globally necessary for the outcome because the single condition may be absent in other sufficient configurations. INUS conditions are thus usually not necessary conditions for the outcome (the latter are the conditions that NCA considers).

Note that in the above generic logical statements about relations between sets, see (4.3), (4.4), (4.5), \(X\) and \(Y\) can only be absent or present (Boolean algebra), even though the individual members of the sets can have fuzzy scores. Both csQCA and fsQCA use logical statements where the condition and the outcome can only be absent (0) or present (1). In fsQCA absence means set membership scores < 0.5 and presence means set membership scores > 0.5.
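The difference between an INUS condition and a necessary condition for the outcome can be checked by enumerating all crisp combinations of the four conditions in expression (4.5). The following is a minimal sketch, not part of any QCA or NCA software. It assumes that the two configurations in (4.5) are the only routes to the outcome, so that \(Y\) is present exactly when the expression is true. The function names are hypothetical.

```python
from itertools import product

def outcome(x1, x2, x3, x4):
    """Expression (4.5): X1*X2 + X2*~X3*X4, with * = AND, + = OR, ~ = absence."""
    return (x1 and x2) or (x2 and (not x3) and x4)

# All 16 crisp combinations of presence (True) and absence (False).
rows = list(product([False, True], repeat=4))

def is_necessary(index):
    """Necessary: the outcome never occurs when the condition is absent."""
    return all(not outcome(*row) for row in rows if not row[index])

def is_sufficient(index):
    """Sufficient: the outcome always occurs when the condition is present."""
    return all(outcome(*row) for row in rows if row[index])

necessary = {f"X{i + 1}": is_necessary(i) for i in range(4)}
sufficient = {f"X{i + 1}": is_sufficient(i) for i in range(4)}

print(necessary)
print(sufficient)
```

Under this assumption no single condition is sufficient on its own. Only \(X_2\), which appears in both configurations, is necessary for the outcome; \(X_1\) is a non-redundant part of the first configuration but is not necessary for the outcome, because the second configuration can produce the outcome without it.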

#### 4.4.1.3 Data and data analysis of QCA

Particularly in large N studies and in the business and management field, the starting point of the QCA data analysis is to transform variable scores into set membership scores in a ‘mechanistic’ way (data driven). The transformation process is called ‘calibration.’ Calibration can be based on the distribution of the data, the measurement scale, or expert knowledge. The goal of calibration is to get scores of 0 or 1 (csQCA) or between 0 and 1 (fsQCA) to represent the extent to which the case belongs to the set (set membership score). In fsQCA mechanistic transformation is usually done with the logistic transformation function. The selection of this function is somewhat arbitrary (it is built into popular QCA software) and moves the variable scores to the extremes (0 and 1) in comparison to just standardization of the data: low values move to 0 and high values move to 1. When no substantive reason exists for the logistic transformation, I have proposed (Dul, 2016b) to use a standard transformation. This transformation keeps the distribution of the data intact. The reason for my proposal is that moving the scores to the extremes implies that cases in the \(XY\) scatter plot with low to middle values of \(X\) move to the left and cases with middle to high values of \(Y\) move upwards. As a result, the upper left corner is more filled with cases. Consequently, potential empty spaces (indicating necessity) may not be identifiable. With the standard transformation the cases stay where they are; an empty space in a corner of the \(XY\) plot with the original data stays empty. The standard transformation is an alternative to an arbitrary transformation: it just changes variable scores into set membership scores, without affecting the distribution of the data. A calibration evaluation tool to check the effect of calibration on the necessity effect size is available at https://r.erim.eur.nl/r-apps/qca/.

QCA performs two separate analyses with calibrated data: a necessity analysis for identifying (single) necessary conditions, and a sufficiency analysis (‘truth table’ analysis) for identifying sufficient configurations. In this book I focus on the necessary condition analysis of single necessary conditions, which precedes the sufficiency analysis. In csQCA the necessity analysis is similar to a dichotomous necessary condition analysis of NCA with the contingency table approach when \(X\) and \(Y\) are dichotomous set membership scores that can only be present (in the set) or absent (not in the set). By visual inspection of the contingency table, a necessary condition ‘in kind’ can be identified when the upper left cell is empty (Figure 4.7), using set membership scores 0 and 1.
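The effect of the calibration choice described above can be illustrated with a small sketch. It rests on two assumptions that the text does not spell out: the logistic transformation is written here in the common ‘direct method’ form, in which three anchors (full non-membership, crossover, full membership) are mapped to log odds of -3, 0 and +3, and the standard transformation is taken to be a linear min-max rescaling. The function names and the raw scores are hypothetical.

```python
import math

def calibrate_logistic(x, non_member, crossover, full_member):
    """Assumed direct-method form: anchors map to log odds -3, 0 and +3."""
    if x >= crossover:
        log_odds = 3.0 * (x - crossover) / (full_member - crossover)
    else:
        log_odds = 3.0 * (x - crossover) / (crossover - non_member)
    return 1.0 / (1.0 + math.exp(-log_odds))

def calibrate_linear(x, lowest, highest):
    """Assumed standard transformation: min-max rescaling to [0, 1]."""
    return (x - lowest) / (highest - lowest)

# Hypothetical raw variable scores, evenly spaced.
raw = [10, 20, 30, 40, 50, 60, 70, 80, 90]

logistic = [calibrate_logistic(v, 10, 50, 90) for v in raw]
linear = [calibrate_linear(v, min(raw), max(raw)) for v in raw]

for v, a, b in zip(raw, logistic, linear):
    print(f"raw={v:3d}  logistic={a:.3f}  linear={b:.3f}")
```

In this sketch the evenly spaced raw scores stay evenly spaced under the linear rescaling, whereas the logistic function pushes the below-crossover score of 30 down and the above-crossover score of 70 up relative to the linear result. That shift is what can fill a previously empty upper left corner of the \(XY\) plot.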

For fuzzy set membership scores the necessity analyses of fsQCA and NCA differ. In fsQCA a diagonal is drawn in the \(XY\) scatter plot (Figure 4.8A, with data from Rohlfing & Schneider (2013); see also Vis & Dul (2018)). For necessity, there can be no cases above the diagonal. Necessity consistency is a measure of the extent to which cases are not above the diagonal, which can range from 0 to 1. When some cases are present in the ‘empty’ zone above the diagonal, fsQCA considers these cases as ‘deviant cases.’ FsQCA accepts some deviant cases as long as the necessity consistency level, which is computed from the total vertical distances of the deviant cases to the diagonal, is not smaller than a certain threshold, usually 0.9. When the necessity consistency is large enough, fsQCA makes a qualitative (‘in kind’) statement about the necessity of \(X\) for \(Y\): ‘\(X\) is necessary for \(Y\),’ e.g., the presence of \(X\) (membership score > 0.5) is necessary for the presence of \(Y\) (membership score > 0.5).
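The necessity consistency can be sketched in a few lines. The sketch uses the commonly cited formula, the sum of \(\min(x_i, y_i)\) divided by the sum of \(y_i\), and shows that it equals one minus the total vertical distance of the cases above the diagonal divided by the sum of \(y_i\). The membership scores are hypothetical and the function names are illustrative only.

```python
def necessity_consistency(x, y):
    """Sum of min(x_i, y_i) divided by the sum of y_i."""
    return sum(min(a, b) for a, b in zip(x, y)) / sum(y)

def necessity_consistency_from_distances(x, y):
    """Same quantity, written via vertical distances above the diagonal."""
    excess = sum(max(0.0, b - a) for a, b in zip(x, y))
    return 1.0 - excess / sum(y)

# Hypothetical fuzzy set membership scores for five cases.
x = [0.2, 0.4, 0.6, 0.8, 1.0]
y = [0.5, 0.3, 0.9, 0.7, 0.6]

consistency = necessity_consistency(x, y)
print(round(consistency, 3))
print(consistency >= 0.9)  # threshold mentioned in the text
```

Two of the five hypothetical cases lie above the diagonal, each by 0.3, so the consistency falls below the usual 0.9 threshold and fsQCA would not call \(X\) necessary for \(Y\) here.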

### 4.4.2 The differences between NCA and QCA

The major differences between NCA and QCA are summarized in Figure 4.9 (see also Dul (2016a) and Vis & Dul (2018)) and discussed below.

#### 4.4.2.1 Logic and theory

NCA and QCA are only the same in a very specific situation of ‘in kind’ necessity: A single \(X\) is necessary for \(Y\), and \(X\) and \(Y\) are dichotomous set membership scores (0 and 1). Then the analyses of NCA and QCA are exactly the same. However, NCA normally uses variable scores, but can also use set membership scores when NCA is applied in combination with QCA (see below). In addition to the ‘in kind’ necessity that both methods share, NCA also formulates ‘in degree’ necessity. QCA also formulates ‘in kind’ necessity of ‘OR’ combinations of conditions, as well as ‘in kind’ sufficiency of configurations of conditions. The main interest of NCA is the necessity ‘in degree’ of single factors that enable the outcome, whereas the main interest of QCA is the sufficiency ‘in kind’ of (alternative) configurations of conditions.

#### 4.4.2.2 Data and data analysis

Regarding research strategy, most NCA studies are observational studies (both small N and large N), although experiments are also possible. Most QCA studies are small N observational studies, although increasingly large N studies are also employed with QCA, in particular in the business and management area. The experimental research strategy is rare (if not absent) in QCA.

Regarding case selection/sampling, purposive sampling is the main case selection strategy in QCA. It is also possible in small N NCA studies. For large N studies, sampling strategies such as those used in regression analysis (preferably probability sampling) are used both in NCA and QCA.

Regarding measurement, NCA uses valid and reliable variable scores unless it is used in combination with QCA, in which case NCA uses calibrated set membership scores. QCA uses calibrated set membership scores and cannot use variable scores. In QCA, data with variable scores may be used as input for the ‘calibration’ process to transform variable scores into set membership scores.

Regarding data analysis, in fsQCA a necessary condition is assumed to exist if the area above the diagonal reference line in an \(XY\) scatter plot is virtually empty (see Figure 4.8 A). In contrast, NCA uses the ceiling line as the reference line (see Figure 4.8 B) for evaluating the necessity of \(X\) for \(Y\) (with possibly some cases above the ceiling line; accuracy below 100%). In situations where fsQCA observes ‘deviant cases,’ NCA includes these cases in the analysis by ‘moving’ the reference line from the diagonal position to the boundary between the zone with cases and the zone without cases. NCA considers cases around the ceiling line (and usually above the diagonal) as ‘best practice’ cases rather than ‘deviant’ cases. These cases are able to reach a high level of outcome (e.g., an output that is desired) for a relatively low level of condition (e.g., an input that requires effort).
For deciding about necessity, NCA evaluates the size of the ‘empty’ zone as a fraction of the total zone (empty plus full zone); this ratio is called the necessity effect size. If the effect size is greater than zero (an empty zone is present), and if according to NCA’s statistical test this is unlikely to be a random result of unrelated \(X\) and \(Y\), NCA identifies a necessary condition ‘in kind’ that can be formulated as: ‘\(X\) is necessary for \(Y\),’ indicating that for at least a part of the range of \(X\) and the range of \(Y\) a certain level of \(X\) is necessary for a certain level of \(Y\). Additionally, NCA can quantitatively formulate necessary conditions ‘in degree’ by using the ceiling line: ‘level \(X_c\) of \(X\) is necessary for level \(Y_c\) of \(Y\).’ The ceiling line represents all combinations of \(X\) and \(Y\) where \(X\) is necessary for \(Y\). Although fsQCA’s diagonal reference line also allows for making quantitative necessary condition statements, e.g. \(X\) > 0.3 is necessary for \(Y\) = 0.3, fsQCA does not make such statements.

When the ceiling line coincides with the diagonal (corresponding to the situation that fsQCA considers), the statement ‘\(X\) is necessary for \(Y\)’ applies to all \(X\)-levels [0,1] and all \(Y\)-levels [0,1], and the results of the qualitative necessity analysis of fsQCA and NCA are the same. When the ceiling line is positioned above the diagonal, ‘\(X\) is necessary for \(Y\)’ only applies to a specific range of \(X\) and a specific range of \(Y\). Outside these ranges \(X\) is not necessary for \(Y\) (‘necessity inefficiency’). Then the results of the qualitative necessity analysis of fsQCA and NCA can be different. Normally, NCA identifies more necessary conditions than fsQCA, mostly because fsQCA uses the diagonal as the reference line. In the example of Figure 4.8, NCA identifies that \(X\) is necessary for \(Y\) because there is an empty zone above the ceiling line. However, fsQCA would conclude that \(X\) is not necessary for \(Y\), because the necessity consistency level is below 0.9.

FsQCA’s necessity analysis can be considered as a special case of NCA: an NCA analysis with discrete or continuous fuzzy set membership scores for \(X\) and \(Y\), a ceiling line that is diagonal, an allowance of a specific number of cases in the empty zone given by the necessity consistency threshold, and the formulation of a qualitative ‘in kind’ necessity statement.
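The ratio of the empty zone to the total zone can be sketched for a step-shaped ceiling. This is a simplified illustration, not the implementation in the `NCA` software: it assumes a non-decreasing step ceiling through the upper left cases (in the spirit of CE-FDH), takes the scope from the observed minimum and maximum of \(X\) and \(Y\), and omits the statistical test. The data and the function name are hypothetical.

```python
def step_ceiling_effect_size(x, y):
    """Empty area above a non-decreasing step ceiling, divided by the scope."""
    points = sorted(zip(x, y))
    x_min, x_max = points[0][0], points[-1][0]
    y_min, y_max = min(y), max(y)
    scope = (x_max - x_min) * (y_max - y_min)

    empty = 0.0
    ceiling = float("-inf")
    for i, (xi, yi) in enumerate(points[:-1]):
        # The ceiling at xi is the highest outcome observed at or left of xi.
        ceiling = max(ceiling, yi)
        next_x = points[i + 1][0]
        # Area between the ceiling and the top of the scope up to the next case.
        empty += (y_max - ceiling) * (next_x - xi)
    return empty / scope

# Hypothetical membership scores for five cases.
x = [0.2, 0.4, 0.6, 0.8, 1.0]
y = [0.5, 0.3, 0.9, 0.7, 0.6]

effect_size = step_ceiling_effect_size(x, y)
print(round(effect_size, 2))
```

For these hypothetical cases about one third of the scope is empty, even though two cases lie above the diagonal. This is the kind of situation in which the ceiling-line approach and the diagonal-based approach can reach different conclusions.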

### 4.4.3 Recommendation for combining NCA and QCA

Although in most NCA applications variable scores are used for quantifying condition \(X\) and outcome \(Y\), NCA can also employ set membership scores for the conditions and the outcome, which allows NCA and QCA to be combined. The other way around is not possible: combining QCA with a regular NCA with variable scores is not possible because by definition QCA is a set theoretic approach that does not use variable scores. How can NCA with membership scores complement QCA? For answering this question, first another question should be raised: how does QCA integrate an identified necessary condition in kind with the results of the sufficiency analysis, the identified sufficient configurations? By definition the necessary condition must be part of all sufficient configurations, otherwise the configuration cannot produce the outcome.

However, within the QCA community five different views exist about how to integrate necessary conditions in sufficient configurations. In the first view, only sufficient configurations that include the necessary conditions are considered as a result. Hence, all selected configurations have the necessary condition. In the second view, the truth table analysis to find the sufficient configurations is done without the necessary conditions, and afterwards the necessary conditions are added to the configurations. This also ensures that all configurations have the necessary conditions. In the third view, configurations that do not include the necessary condition are excluded from the truth table before this table is further analysed to find the sufficient configurations. This ‘ESA’ approach (Schneider & Wagemann, 2012) also ensures that all configurations have the necessary conditions. In the fourth view, sufficient configurations are analysed without worrying about necessary conditions. Afterwards, the necessary conditions are discussed separately. In the fifth view, a separate necessity analysis is not done, or necessity is ignored.

All views have been employed in QCA; hence no consensus exists yet. An additional complexity of integrating necessity with sufficient configurations is that NCA produces necessary conditions in degree, rather than QCA’s necessary conditions and sufficient configurations in kind. The conditions that are part of the sufficient configurations can only be absent or present. Given these complexities, I suggest a combination of the second and the fourth view:

Perform the NCA analysis ‘in degree’ before QCA’s sufficiency analysis.

Integrate a part of the results of NCA’s necessity analysis into QCA’s sufficient configurations, namely the conditions (‘in degree’) that are necessary for outcome > 0.5.

Discuss the full results of NCA’s necessity analysis afterwards.

In particular it could be discussed that specific levels of necessary membership scores found by NCA must be present in each sufficient configuration found by QCA. If that membership in degree is not in place, the configuration will not produce the outcome.

### 4.4.4 Examples

I discuss two examples of integrating NCA in QCA according to this recommendation. The first example is a study by Emmenegger (2011) about the effects of six conditions: S = state-society relationships, C = non-market coordination, L = strength labour movement, R = denomination, P = strength religious parties and V = veto points on JSR = job security regulation in Western European countries. He performed an fsQCA analysis with a necessity analysis and a sufficiency analysis. His necessity analysis showed that no condition was necessary for job security regulation (necessity consistency of each condition was < 0.9).

However, the NCA analysis in degree with the six conditions and using the CE-FDH ceiling line (Figure 4.10) shows the following effect sizes and corresponding p values (Figure 4.11).

A condition could be considered necessary when the effect size has a small p value (e.g. p < 0.05). Hence, the NCA analysis shows that a certain strength of the labour movement (L), a certain level of denomination (R), and a certain strength of religious parties (P) are necessary for high levels of job security regulation (JSR). From Figure 4.10 it can be observed that the following conditions are necessary for JSR > 0.5:

L > 0.29 is necessary for JSR > 0.5 (presence of JSR)

R > 0.20 is necessary for JSR > 0.5 (presence of JSR)

P > 0.20 is necessary for JSR > 0.5 (presence of JSR)
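Threshold statements of this kind are read from the ceiling line. The sketch below shows the idea for a non-decreasing step ceiling: the required level of a condition for a target outcome is the lowest observed condition score at which the ceiling reaches that target. The data are hypothetical and are not Emmenegger’s scores, and the function name is illustrative only.

```python
def required_condition_level(x, y, y_target):
    """Lowest observed x at which a step ceiling reaches y_target."""
    ceiling = float("-inf")
    for xi, yi in sorted(zip(x, y)):
        # The ceiling at xi is the highest outcome observed at or left of xi.
        ceiling = max(ceiling, yi)
        if ceiling >= y_target:
            return xi
    return None  # the target lies above every observed outcome

# Hypothetical membership scores of one condition and the outcome.
x = [0.10, 0.29, 0.45, 0.70, 0.90]
y = [0.20, 0.60, 0.50, 0.95, 0.80]

print(required_condition_level(x, y, 0.5))
print(required_condition_level(x, y, 0.9))
```

With these hypothetical scores, no case with a condition score below 0.29 reaches an outcome of 0.5, and no case below 0.70 reaches 0.9. Repeating the lookup for several target levels of the outcome gives the rows of a bottleneck table for that condition.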

Although in QCA’s binary logic these small necessary membership scores of L, R, P (all < 0.5) would be framed as ‘absence’ of the condition, in NCA these membership scores are considered small, yet must be present for having the outcome. Thus, according to NCA the low level of membership scores must be present, otherwise the sufficient configurations identified by QCA will not produce the outcome. Emmenegger (2011) identified four possible sufficient configurations for the outcome JSR:

\(S*R*{\sim}V\) (presence of S AND presence of R AND absence of V)

\(S*L*R*P\) (presence of S AND presence of L AND presence of R AND presence of P)

\(S*C*R*P\) (presence of S AND presence of C AND presence of R AND presence of P)

\(C*L*P*{\sim}V\) (presence of C AND presence of L AND presence of P AND absence of V)

This combination of solutions can be expressed by the following logical expression:

\[S*R*{\sim}V + S*L*R*P + S*C*R*P + C*L*P*{\sim}V → JSR\]

The presence of a condition or of the outcome means that the membership score is > 0.5. The absence of a condition means that the membership score is < 0.5. A common way to summarize the results is shown in Figure 4.12.

The NCA necessity results can be combined with the QCA sufficiency results as shown in Figure 4.13.

Small full square symbols (▪) are added to sufficient configurations (according to QCA) to ensure that the minimum required necessity membership score (according to NCA) is fulfilled. This symbol is inspired by Greckhamer (2016), who added a large full square symbol (▪) to the solutions table to indicate the presence of a necessary condition (membership score > 0.5) according to QCA’s necessity analysis.

The NCA result that L > 0.29 is necessary for JSR > 0.5 is already satisfied in the QCA sufficient configurations 2 and 4, but not in configurations 1 and 3. In these latter configurations the requirement L > 0.29 is added to the configuration (▪). Similarly, R > 0.20 is added to configuration 4, and P > 0.20 is added to configuration 1. Without adding these requirements to the configuration, the configuration cannot produce the outcome. Only configuration 2 includes all three necessary conditions according to NCA, without a need for adding them. If the results of NCA were ignored, configurations 1, 3 and 4 would not produce the outcome (or only by chance, if the required minimum levels of the ignored NCA necessary conditions happened to be present). Additionally, the NCA results can show what levels of the condition would be necessary for a higher level of the outcome than a membership score > 0.5. This can be observed in Figure 4.10. For example, for a membership score of JSR of 0.9, it is necessary to have membership scores of S = 1, L > 0.45, R = 1, P > 0.2, and ~V > 0.35. I therefore recommend presenting the NCA necessity results together with the QCA sufficient configurations results as in Figure 4.13, and additionally discussing the full results of NCA for deeper understanding of the sufficient configurations.
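The combination of the two sets of results can be sketched for a single case. The sketch assumes the usual fuzzy-set operators (AND as the minimum, absence as one minus the membership score) and uses the three NCA thresholds for JSR > 0.5 stated above. The case and the function names are hypothetical.

```python
# Minimum membership scores necessary for JSR > 0.5, as stated in the text.
NCA_MINIMUM = {"L": 0.29, "R": 0.20, "P": 0.20}

def configuration_memberships(case):
    """Fuzzy membership of a case in each of the four sufficient configurations."""
    s, c, l, r, p, v = (case[k] for k in "SCLRPV")
    return {
        1: min(s, r, 1 - v),     # S*R*~V
        2: min(s, l, r, p),      # S*L*R*P
        3: min(s, c, r, p),      # S*C*R*P
        4: min(c, l, p, 1 - v),  # C*L*P*~V
    }

def unmet_necessary_conditions(case):
    """Conditions whose membership score does not exceed the NCA minimum."""
    return [k for k, minimum in NCA_MINIMUM.items() if case[k] <= minimum]

# Hypothetical case: strongly in configuration 1, but with a very low L.
case = {"S": 0.9, "C": 0.3, "L": 0.1, "R": 0.8, "P": 0.6, "V": 0.2}

memberships = configuration_memberships(case)
unmet = unmet_necessary_conditions(case)
print(memberships)
print(unmet)
```

The hypothetical case is a member of configuration 1 (membership above 0.5), yet its score on L is below the required 0.29. Following the recommendation, such a case would not be expected to show JSR > 0.5, which is why the requirement on L is added to configuration 1.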

The second example is from a study of Skarmeas et al. (2014) that I discuss in my book (Dul, 2020, pp. 77-83). This study is about the effect of four organizational motives (Egoistic-driven motives, absence of Value-driven motives, Strategy-driven motives, Stakeholder-driven motives) on customer scepticism about an organization’s Corporate Social Responsibility (CSR) initiative. The results of NCA’s necessity analysis with raw scores, and with calibrated set membership scores are shown in Figure 4.14 and Figure 4.15, respectively.

NCA with raw variable scores shows that Absence of Value-driven motives and Strategy-driven motives could be considered necessary for Scepticism given the medium effect sizes and the small p values (p < 0.05). Also, the NCA with calibrated set membership scores shows that these two conditions have low p values; however, their effect sizes are small (0.04 and 0.02, respectively). This means that these necessary conditions may be statistically significant but may not be practically significant: nearly all cases reached the required level of necessity. Also, Egoistic-driven motives are statistically, but not practically significant. NCA with raw variable scores (the conventional NCA approach) can be used in combination with regression analysis, as regression analysis uses raw variable scores (see the supplement ‘Comparing NCA with regression’). NCA with calibrated set scores (set membership scores) can be used in combination with QCA, because QCA uses calibrated set membership scores (this supplement).

Figure 4.16 shows the two sufficient configurations according to the QCA analysis of Skarmeas et al. (2014).

In each configuration the necessity of Egoistic-driven motives and the Absence of Value-driven motives are ensured in the configuration. However, the necessity of Strategy-driven motives is not ensured in Sufficient configuration 1. Therefore, the minimum required level of Strategy-driven motives according to NCA (0.01) is added to ensure that the configuration is able to produce the outcome. However, the practical meaning of this addition is very limited because the necessity effect size is small. It is added here for illustration of the recommendation.

### 4.4.5 Logical misinterpretations when combining NCA and QCA

When NCA is combined with QCA, sometimes a logical misinterpretation is made about the role of necessary conditions in sufficient configurations. Although a factor that is a necessary condition for the outcome *must* be part of each sufficient configuration, the opposite is not true: a factor that is part of a sufficient configuration is *not* automatically a necessary condition for the outcome. The latter misinterpretation has been introduced in the tourism and hospitality field (Dul, 2022b) and can be found for example in studies by Farmaki et al. (2022), Pappas & Farmaki (2022), Pappas & Glyptou (2021b), Pappas & Glyptou (2021a), Pappas (2021), and Farmaki & Pappas (2022). The misinterpretation was also exported to other fields (e.g., Mostafiz et al., 2021).

This misinterpretation may be caused by mixing necessary conditions for the outcome (as analysed with NCA) with necessary parts of a sufficient configuration (INUS conditions). An INUS condition is an *Insufficient but Necessary part of an Unnecessary but Sufficient configuration*. INUS conditions are the necessary elements of a configuration to make that configuration sufficient. NCA captures necessary conditions for the outcome, not INUS conditions, see also Dul, Vis, et al. (2021).

### References

*Understanding regression assumptions* (Vol. 92). Sage. https://us.sagepub.com/en-us/nam/book/understanding-regression-assumptions

*Academy of Management Journal*, *51*(3), 577–601. https://www.jstor.org/stable/20159527

*Journal of Business Research*, *69*(4), 1516–1523. https://doi.org/10.1016/j.jbusres.2015.10.134

*Organizational Research Methods*, *19*(1), 10–52. https://doi.org/10.1177/1094428115584005

*Tourism Management*, 104616. https://doi.org/10.1016/j.tourman.2022.104616

*Handbook of research methods for marketing management*. Edward Elgar Publishing. https://www.e-elgar.com/shop/gbp/handbook-of-research-methods-for-marketing-management-9781788976947.html

*Sociological Methods & Research*, *50*(2), 926–936. https://doi.org/10.1177%2F0049124118799383

*European Journal of Political Research*, *50*(3), 336–364. https://doi.org/10.1111/j.1475-6765.2010.01933.x

*International Journal of Contemporary Hospitality Management*, *34*, 1012–1036. https://doi.org/10.1108/IJCHM-07-2021-0859

*Tourism Management*, *91*, 104526. https://doi.org/10.1016/j.tourman.2022.104526

*The Journal of the Anthropological Institute of Great Britain and Ireland*, *15*, 246–263. https://www.jstor.org/stable/2841583

*Strategic Management Journal*, *37*(4), 793–815. https://doi.org/10.1002/smj.2370

*The Hogan Personality Inventory manual (3rd ed.)*. Hogan Assessment Systems, Tulsa, OK.

*American Philosophical Quarterly*, *2*(4), 245–264. https://www.jstor.org/stable/20009173

*Knowledge Management Research & Practice*, 1–15. https://doi.org/10.1080/14778238.2021.1919573

*Tourism Management*, *84*, 104287. https://doi.org/10.1016/j.tourman.2021.104287

*Current Issues in Tourism*, 1–17. https://doi.org/10.1080/13683500.2022.2056004

*International Journal of Hospitality Management*, *93*, 102767. https://doi.org/10.1016/j.ijhm.2020.102767

*International Journal of Contemporary Hospitality Management*, *33*, 2932–2949. https://doi.org/10.1108/IJCHM-09-2020-1046

*The comparative method: Moving beyond qualitative and quantitative strategies*. JSTOR. https://www.jstor.org/stable/10.1525/j.ctt1pnx57?turn_away=true

*Fuzzy-set social science*. University of Chicago Press. https://press.uchicago.edu/ucp/books/book/chicago/F/bo3635786.html

*Redesigning social inquiry: Fuzzy sets and beyond*. University of Chicago Press. https://press.uchicago.edu/ucp/books/book/chicago/R/bo5973952.html

*Political Research Quarterly*, 220–235. https://www.jstor.org/stable/23563606

*Set-theoretic methods for the social sciences: A guide to qualitative comparative analysis*. Cambridge University Press. https://www.cambridge.org/nl/academic/subjects/social-science-research-methods/qualitative-methods/set-theoretic-methods-social-sciences-guide-qualitative-comparative-analysis?format=PB

*Journal of Business Research*, *67*(9), 1796–1805. https://doi.org/10.1016/j.jbusres.2013.12.010

*Sociological Methods & Research*, *47*(4), 872–899. https://doi.org/10.1177/0049124115626179