11.2 Theoretical Foundations of Missing Data

11.2.1 Definition and Classification of Missing Data

Missing data refers to the absence of values for some variables in a dataset. The mechanisms underlying missingness significantly impact the validity of statistical analyses and the choice of handling methods. These mechanisms are classified into three categories:

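  1. Missing Completely at Random (MCAR): Missingness is unrelated to both observed and unobserved data.

  2. Missing at Random (MAR): Missingness depends only on observed data, not on the missing values themselves.

  3. Missing Not at Random (MNAR): Missingness depends on unobserved variables or the missing values themselves.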

11.2.1.1 Missing Completely at Random (MCAR)

MCAR occurs when the probability of missingness is entirely random and unrelated to either observed or unobserved variables. Under this mechanism, missing data do not introduce bias in parameter estimates when ignored, although statistical efficiency is reduced due to the smaller sample size.

Mathematical Definition: The missingness is independent of all data, both observed and unobserved:

\[ P(Y_{\text{missing}} | Y, X) = P(Y_{\text{missing}}) \]

Characteristics of MCAR:

  • Missingness is completely unrelated to both observed and unobserved data.
  • Analyses remain unbiased even if missing data are ignored, though they may lack efficiency due to reduced sample size.
  • The missing data points represent a random subset of the overall data.

Examples:

  • A sensor randomly fails at specific time points, unrelated to environmental or operational conditions.
  • Survey participants randomly omit responses to certain questions without any systematic pattern.
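The claim that ignoring MCAR data costs efficiency but not accuracy can be checked with a small simulation. The sketch below is illustrative only: the sensor distribution, the 30% missingness rate, and all variable names are assumptions, not part of any dataset discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensor readings; the distribution is an illustrative assumption.
n = 100_000
y = rng.normal(loc=20.0, scale=5.0, size=n)

# MCAR: every reading is dropped with the same probability (30%),
# independently of y and of anything else.
is_missing = rng.random(n) < 0.30
y_observed = y[~is_missing]

full_mean = y.mean()
complete_case_mean = y_observed.mean()
print(f"full-data mean:     {full_mean:.3f}")
print(f"complete-case mean: {complete_case_mean:.3f}")
print(f"fraction retained:  {y_observed.size / n:.3f}")
```

The two means should agree up to sampling noise, while the retained fraction shows the loss of sample size that reduces efficiency.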

Methods for Testing MCAR:

  1. Little’s MCAR Test: A formal statistical test to assess whether data are MCAR. A significant result suggests deviation from MCAR.

  2. Mean Comparison Tests:

    • T-tests or similar approaches compare cases with and without missing values on the other, observed variables. Significant differences are evidence against MCAR, because they show that missingness is related to observed data.
    • Failure to reject the null hypothesis of no difference does not confirm MCAR but suggests consistency with the MCAR assumption.
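A mean comparison check of this kind can be sketched with a Welch t-test from SciPy (`scipy.stats.ttest_ind` with `equal_var=False`). The data, variable names, and the two missingness scenarios below are hypothetical and exist only to show the mechanics; only the missingness indicator of the incomplete variable is needed, so the incomplete variable itself is not simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical data: `age` is fully observed; another variable has gaps.
age = rng.normal(45.0, 12.0, size=n)

# Scenario 1: missingness unrelated to age (consistent with MCAR).
miss_mcar = rng.random(n) < 0.25
# Scenario 2: older respondents skip the question more often (not MCAR).
miss_dep = rng.random(n) < 1.0 / (1.0 + np.exp(-(age - 45.0) / 6.0))

def compare_groups(x, miss):
    """Welch t-test of x between rows with and without a missing value."""
    return stats.ttest_ind(x[miss], x[~miss], equal_var=False)

res_mcar = compare_groups(age, miss_mcar)
res_dep = compare_groups(age, miss_dep)
print(f"MCAR scenario:      t = {res_mcar.statistic:.2f}, p = {res_mcar.pvalue:.3f}")
print(f"dependent scenario: t = {res_dep.statistic:.2f}, p = {res_dep.pvalue:.3g}")
```

A small p-value in the second scenario flags a departure from MCAR. A large p-value in the first is merely consistent with MCAR, since the test only looks at one observed variable.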

Handling MCAR:

Because missingness under MCAR introduces no bias, such data can be handled using the following techniques:

  1. Complete Case Analysis (Listwise Deletion):
    • Analyses are performed only on cases with complete data. While unbiased under MCAR, this method reduces sample size and efficiency.
  2. Universal Singular Value Thresholding (USVT):
    • This technique is effective for MCAR data recovery but can only recover the mean structure, not the entire true distribution (Chatterjee 2015).
  3. SoftImpute:
    • A matrix completion method useful for some missing data problems but less effective when missingness is not MCAR (Hastie et al. 2015).
  4. Synthetic Nearest Neighbor Imputation:
    • A robust method for imputing missing data. While primarily designed for MCAR, it can also handle certain cases of missing not at random (MNAR) (Agarwal et al. 2023). Available on GitHub: syntheticNN.
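To make the matrix-completion idea behind USVT more concrete, here is a minimal NumPy sketch. It assumes entries bounded in [-1, 1] and follows the commonly stated recipe (zero-fill, singular value decomposition, hard threshold, rescale by the observed fraction). The function name `usvt`, the parameter `eta`, and the threshold form `(2 + eta) * sqrt(max(m, n) * p_hat)` are assumptions of this sketch and should be checked against Chatterjee (2015) before any serious use.

```python
import numpy as np

def usvt(y_obs, mask, eta=0.01):
    """Sketch of USVT for a matrix with entries in [-1, 1].

    y_obs : array; values where mask is False are ignored
    mask  : boolean array, True where the entry was observed
    """
    m, n = y_obs.shape
    filled = np.where(mask, y_obs, 0.0)          # zero-fill missing entries
    p_hat = mask.mean()                          # estimated observation rate
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    threshold = (2.0 + eta) * np.sqrt(max(m, n) * p_hat)
    keep = s >= threshold                        # retain large singular values
    low_rank = (u[:, keep] * s[keep]) @ vt[keep, :]
    return np.clip(low_rank / p_hat, -1.0, 1.0)  # rescale and clip to range

rng = np.random.default_rng(2)
a = rng.uniform(-1.0, 1.0, size=200)
b = rng.uniform(-1.0, 1.0, size=200)
truth = np.outer(a, b)                           # rank-1 mean structure
mask = rng.random(truth.shape) < 0.5             # MCAR: each entry kept w.p. 0.5

estimate = usvt(truth, mask)
mse_usvt = np.mean((estimate - truth) ** 2)
mse_zero_fill = np.mean((np.where(mask, truth, 0.0) - truth) ** 2)
print(f"MSE, USVT:      {mse_usvt:.4f}")
print(f"MSE, zero fill: {mse_zero_fill:.4f}")
```

The comparison against plain zero-filling is only a sanity check on simulated low-rank data; it says nothing about performance when missingness is not MCAR.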

Notes:

  • The “missingness” on one variable can be correlated with the “missingness” on another variable without violating the MCAR assumption.
  • Absence of evidence for bias (e.g., failing to reject a t-test) does not confirm that the data are MCAR.

11.2.1.2 Missing at Random (MAR)

Missing at Random (MAR) occurs when missingness depends on observed variables but not the missing values themselves. This mechanism assumes that observed data provide sufficient information to explain the missingness. In other words, there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data.

Mathematical Definition:

The probability of missingness is conditional only on observed data:

\[ P(Y_{\text{missing}} | Y, X) = P(Y_{\text{missing}} | X) \]

This implies that whether an observation is missing is unrelated to the missing values themselves but is related to the observed values of other variables.

Characteristics of MAR:

  • Missingness is systematically related to observed variables.
  • The propensity for a data point to be missing is not related to the missing data but is related to some of the observed data.
  • Analyses must account for observed data to mitigate bias.

Examples:

  • Women are less likely to disclose their weight, but their gender is recorded. In this case, weight is MAR.
  • Missing income data is correlated with education, which is observed. For example, individuals with higher education levels might be less likely to reveal their income.
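The weight example can be turned into a short simulation showing both the bias of a naive analysis under MAR and how conditioning on the observed variable removes it. All numbers (group means, missingness rates) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical population; all numbers are illustrative assumptions.
female = rng.random(n) < 0.5
weight = np.where(female, rng.normal(65.0, 10.0, n), rng.normal(80.0, 12.0, n))

# MAR: the chance of a missing weight depends only on the recorded gender.
p_missing = np.where(female, 0.6, 0.2)
observed = rng.random(n) >= p_missing

true_mean = weight.mean()
complete_case_mean = weight[observed].mean()

# Adjust using the observed variable: average the within-gender means of
# the observed weights, weighted by each gender's share of the full sample.
adjusted_mean = sum(
    (female == g).mean() * weight[observed & (female == g)].mean()
    for g in (True, False)
)

print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"gender-adjusted:    {adjusted_mean:.2f}")
```

The complete-case mean is pulled toward the group that responds more often, while the gender-adjusted mean should land close to the true mean.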

Challenges in MAR:

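  • MAR is a weaker (less restrictive) assumption than Missing Completely at Random (MCAR): every MCAR mechanism is also MAR, but not the reverse.
  • It is impossible to directly test for MAR, because doing so would require the missing values themselves. Evidence for MAR relies on domain expertise and indirect statistical checks rather than direct tests.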

Handling MAR:

Common methods for handling MAR include:

  • Multiple Imputation by Chained Equations (MICE): Iteratively imputes missing values based on observed data.

  • Maximum Likelihood Estimation: Estimates model parameters directly while accounting for MAR assumptions.

  • Regression-Based Imputation: Predicts missing values using observed covariates.

These methods assume that observed variables fully explain the missingness. Effective handling of MAR requires careful modeling and often domain-specific knowledge to validate the assumptions underlying the analysis.
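As a minimal illustration of regression-based imputation under MAR, the sketch below fits a linear model on the complete cases and fills in fitted values for the missing outcomes. The data-generating model, coefficients, and variable names are assumptions chosen for the example, and the code uses plain NumPy rather than any particular imputation library.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000

# Hypothetical data: x is always observed, y is sometimes missing.
x = rng.normal(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, n)

# MAR: the probability that y is missing depends on x only.
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))

# Fit y ~ 1 + x by least squares on the complete cases.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design[~missing], y[~missing], rcond=None)

# Replace each missing y with its fitted value.
y_imputed = y.copy()
y_imputed[missing] = design[missing] @ coef

true_mean = y.mean()
complete_case_mean = y[~missing].mean()
imputed_mean = y_imputed.mean()
print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"imputed mean:       {imputed_mean:.2f}")
```

Deterministic single imputation like this recovers the mean here but understates variability, which is one reason multiple imputation draws several completed datasets with added noise instead.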

11.2.1.3 Missing Not at Random (MNAR)

Missing Not at Random (MNAR) is the most complex missing data mechanism. Here, missingness depends on unobserved variables or the values of the missing data themselves. This makes MNAR particularly challenging, as ignoring this dependency introduces significant bias in analyses.

Mathematical Definition:

The probability of missingness depends on the missing values:

\[ P(Y_{\text{missing}} | Y, X) \neq P(Y_{\text{missing}} | X) \]

Characteristics of MNAR:

  • Missingness cannot be fully explained by observed data.
  • The cause of missingness is directly related to the unobserved values.
  • Ignoring MNAR introduces significant bias in parameter estimates, often leading to invalid conclusions.

Examples:

  • High-income individuals are less likely to disclose their income, so missingness depends on the very value that goes unrecorded.
  • Patients with severe symptoms drop out of a clinical study, leaving their health outcomes unrecorded.
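The income example can be simulated to show why MNAR is harder than MAR: even after conditioning on an observed covariate, the respondents remain unrepresentative. The covariate (education), the coefficients, and the response model below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Hypothetical data: education (observed) partly drives income.
education = rng.normal(0.0, 1.0, n)
log_income = 10.5 + 0.4 * education + rng.normal(0.0, 0.5, n)

# MNAR: the chance of nonresponse rises with income itself.
z = (log_income - log_income.mean()) / log_income.std()
responded = rng.random(n) >= 1.0 / (1.0 + np.exp(-1.5 * z))

true_mean = log_income.mean()
complete_case_mean = log_income[responded].mean()

# Conditioning on the observed covariate does not remove the bias:
# within a narrow education band, respondents still have lower income.
band = np.abs(education) < 0.1
band_true = log_income[band].mean()
band_observed = log_income[band & responded].mean()

print(f"overall: true = {true_mean:.3f}, respondents = {complete_case_mean:.3f}")
print(f"in band: true = {band_true:.3f}, respondents = {band_observed:.3f}")
```

Both comparisons should show respondents with lower average log income than the full group, which is the pattern that MAR-based adjustments cannot correct.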

Challenges in MNAR:

  • MNAR is the most difficult missingness mechanism to address because the missing data mechanism must be explicitly modeled.
  • Identifying MNAR often requires domain knowledge and auxiliary information beyond the observed dataset.

Handling MNAR:

MNAR requires explicit modeling of the missingness mechanism. Common approaches include:

  • Heckman Selection Models: These models explicitly account for the selection process leading to missing data, adjusting for potential bias (Heckman 1976).

  • Instrumental Variables: Variables predictive of missingness but unrelated to the outcome can be used to mitigate bias (Sun et al. 2018; Tchetgen Tchetgen and Wirth 2017).

  • Pattern-Mixture Models: These models separate the data into groups (patterns) based on missingness and model each group separately. They are particularly useful when the relationship between missingness and missing values is complex.

  • Sensitivity Analysis: Examines how conclusions change under different assumptions about the missing data mechanism.

  • Use of Auxiliary Data

    Auxiliary data refers to external data sources or variables that can help explain the missingness mechanism.

    • Surrogate Variables: Adding variables that correlate with missing data can improve imputation accuracy and mitigate the MNAR challenge.

    • Linking External Datasets: Merging datasets from different sources can provide additional context or predictors for missingness.

    • Applications in Business: In marketing, customer demographics or transaction histories often serve as auxiliary data to predict missing responses in surveys.

Additionally, data collection strategies, such as follow-up surveys or targeted sampling, can help mitigate MNAR effects by collecting information that directly addresses the missingness mechanism. However, such approaches can be resource-intensive and require careful planning.
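The sensitivity-analysis approach listed above can be sketched with a simple delta adjustment: missing values are assumed to differ from observed ones by a fixed offset, and the estimate is recomputed over a range of offsets. The offset grid, the simulated outcome, and the function name `mean_under_delta` are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000

# Hypothetical outcome with MNAR dropout: higher values go missing more often.
y = rng.normal(50.0, 10.0, n)
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-(y - 50.0) / 10.0))
observed_mean = y[~missing].mean()

def mean_under_delta(delta):
    """Estimate the mean if missing values average `delta` above observed ones."""
    imputed = np.where(missing, observed_mean + delta, y)
    return imputed.mean()

# delta = 0 reproduces the complete-case mean; other values are
# analyst-chosen departures from it.
deltas = [0.0, 2.0, 4.0, 6.0, 8.0]
estimates = {d: mean_under_delta(d) for d in deltas}
for d, est in estimates.items():
    print(f"delta = {d:>4.1f} -> estimated mean = {est:.2f}")
```

In practice the true offset is unknown, so the useful output is the range of estimates: if the substantive conclusion holds across all plausible offsets, it is robust to this form of MNAR.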

11.2.2 Missing Data Mechanisms

| Mechanism | Missingness Depends On | Implications | Examples |
|---|---|---|---|
| MCAR | Neither observed nor missing data | No bias; simplest to handle; decreases efficiency due to data loss. | Random sensor failure. |
| MAR | Observed data only | Requires observed data to explain missingness; common assumption in imputation methods. | Gender-based missingness of weight. |
| MNAR | Missing data itself or unobserved variables | Requires explicit modeling of the missingness mechanism; significant bias if ignored. | High-income individuals not disclosing income. |

11.2.3 Relationship Between Mechanisms and Ignorability

The concept of ignorability is central to determining whether the missingness process must be explicitly modeled. Ignorability impacts the choice of methods for handling missing data and whether the missing data mechanism can be safely disregarded or must be explicitly accounted for.

11.2.3.1 Ignorable Missing Data

Missing data is ignorable under the following conditions:

  1. The missing data mechanism is MAR or MCAR.
  2. The parameters governing the missing data process are distinct from the parameters of interest in the analysis, so that knowing one set carries no information about the other.

In cases of ignorable missing data, there is no need to model the missingness mechanism explicitly unless you aim to improve the efficiency or precision of parameter estimates. Common imputation techniques, such as multiple imputation or maximum likelihood estimation, rely on the assumption of ignorability to produce unbiased parameter estimates.
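One way to see why the two conditions suffice is to write out the likelihood of what is actually observed. The notation in this sketch is introduced here for illustration and is not used elsewhere in the section: \(R\) is the missingness indicator, \(Y_{\text{obs}}\) and \(Y_{\text{mis}}\) are the observed and missing parts of the data, \(\theta\) denotes the parameters of interest, and \(\psi\) the parameters of the missingness process.

\[ L(\theta, \psi) = \int f(Y_{\text{obs}}, Y_{\text{mis}} \mid \theta) \, P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi) \, dY_{\text{mis}} \]

Under MAR (and therefore also under MCAR), the missingness probability does not depend on \(Y_{\text{mis}}\), so it can be moved outside the integral:

\[ L(\theta, \psi) = P(R \mid Y_{\text{obs}}, \psi) \int f(Y_{\text{obs}}, Y_{\text{mis}} \mid \theta) \, dY_{\text{mis}} = P(R \mid Y_{\text{obs}}, \psi) \, f(Y_{\text{obs}} \mid \theta) \]

If \(\theta\) and \(\psi\) are distinct, the first factor is a constant with respect to \(\theta\), and likelihood-based inference about \(\theta\) can proceed from \(f(Y_{\text{obs}} \mid \theta)\) alone. When missingness depends on \(Y_{\text{mis}}\), this factorization fails, which is why MNAR calls for a model of the mechanism.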

Practical Considerations for Ignorable Missingness

Even though ignorable mechanisms simplify analysis, researchers must rigorously assess whether the missingness mechanism meets the MAR or MCAR criteria. Violations of these assumptions lead to biased results even when they go unnoticed.

For example: A survey on income may assume MAR if missingness is associated with respondent age (observed variable) but not income itself (unobserved variable). However, if income directly influences nonresponse, the assumption of MAR is violated.


11.2.3.2 Non-Ignorable Missing Data

Missing data is non-ignorable when:

  1. The missingness mechanism depends on the values of the missing data themselves or on unobserved variables.
  2. The missing data mechanism is related to the parameters of interest, resulting in bias if the mechanism is not modeled explicitly.

This type of missingness (i.e., Missing Not at Random, MNAR) requires modeling the missing data mechanism directly to produce unbiased estimates.

Characteristics of Non-Ignorable Missingness

  • Dependence on Missing Values: The likelihood of missingness is associated with the missing values themselves.
    • Example: In a study on health, individuals with more severe conditions are more likely to drop out, leading to an underrepresentation of the sickest individuals in the data.
  • Bias in Complete Case Analysis: Analyses based solely on complete cases can lead to substantial bias.
    • Example: In income surveys, if wealthier individuals are less likely to report their income, the estimated mean income will be systematically lower than the true population mean.
  • Need for Explicit Modeling: To address MNAR, the analyst must model the missing data mechanism. This often involves specifying relationships between observed data, missing data, and the missingness process itself.

11.2.3.3 Implications of Non-Ignorable Missingness

Non-ignorable mechanisms are often associated with sensitive or personal data:

  • Examples:

    • Individuals with lower education levels may omit their education information.

    • Participants with controversial or stigmatized health conditions might opt out of surveys entirely.

  • Impact on Policy and Decision-Making:

    • Biases introduced by MNAR can have serious consequences for policymaking, such as underestimating the prevalence of poverty or mischaracterizing population health needs.

By explicitly addressing non-ignorable missingness, researchers can mitigate biases and ensure that findings accurately reflect the underlying population.


References

Agarwal, Anish, Munther Dahleh, Devavrat Shah, and Dennis Shen. 2023. “Causal Matrix Completion.” In The Thirty Sixth Annual Conference on Learning Theory, 3821–26. PMLR.
Chatterjee, Sourav. 2015. “Matrix Estimation by Universal Singular Value Thresholding.”
Hastie, Trevor, Rahul Mazumder, Jason D Lee, and Reza Zadeh. 2015. “Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.” The Journal of Machine Learning Research 16 (1): 3367–3402.
Heckman, James J. 1976. “The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models.” In Annals of Economic and Social Measurement, Volume 5, Number 4, 475–92. NBER.
Sun, BaoLuo, Lan Liu, Wang Miao, Kathleen Wirth, James Robins, and Eric J Tchetgen Tchetgen. 2018. “Semiparametric Estimation with Data Missing Not at Random Using an Instrumental Variable.” Statistica Sinica 28 (4): 1965.
Tchetgen Tchetgen, Eric J, and Kathleen E Wirth. 2017. “A General Instrumental Variable Framework for Regression Analysis with Outcome Missing Not at Random.” Biometrics 73 (4): 1123–31.