11.1 Assumptions

11.1.1 Missing Completely at Random (MCAR)

Missing Completely at Random, MCAR, means there is no relationship between the missingness of the data and any values, observed or missing. Those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.

The probability of missing data on a variable is unrelated to the value of that variable or to the values of any other variables in the data set.
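As a sketch in assumed notation, writing "Y missing" for the event that Y is not observed and X for the other variables in the data set, this statement reads:

\[ P(Y_{missing}|Y,X)= P(Y_{missing}) \]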

Note: the “missingness” on Y can be correlated with the “missingness” on X. To check the assumption, we can compare the values of other variables between observations with missing data and observations without missing data. If a t-test rejects the null hypothesis of no mean difference, there is evidence that the data are not MCAR. Failing to reject, however, does not allow us to conclude that the data are MCAR.

  • The propensity for a data point to be missing is completely random.
  • There’s no relationship between whether a data point is missing and any values in the data set, missing or observed.
  • The missing data are just a random subset of the data.
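The t-test comparison of a fully observed variable between rows with and without missing data can be sketched as follows. This is a minimal illustration, not a definitive test for MCAR: the variables (`age`, income), the missingness probabilities, and the helper names `welch_t_test` and `simulate` are all hypothetical, and the p-value relies on a normal approximation that is only reasonable for large groups.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t_test(group_a, group_b):
    """Return (t statistic, approximate two-sided p-value) for a mean difference.

    Assumption: the p-value uses a normal approximation to the t distribution,
    which is only reasonable when both groups are large.
    """
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    t_stat = (mean(group_a) - mean(group_b)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


def simulate(mechanism, n=2000, seed=1):
    """Simulate age (always observed) for rows whose income is missing or observed.

    Returns (ages of rows with missing income, ages of rows with observed income).
    """
    rng = random.Random(seed)
    age_missing, age_observed = [], []
    for _ in range(n):
        age = rng.gauss(40, 10)
        if mechanism == "mcar":
            p_miss = 0.3                       # same probability for everyone
        else:
            p_miss = 0.5 if age > 40 else 0.1  # missingness depends on observed age
        (age_missing if rng.random() < p_miss else age_observed).append(age)
    return age_missing, age_observed


for mechanism in ("mcar", "depends_on_age"):
    t_stat, p_value = welch_t_test(*simulate(mechanism))
    print(f"{mechanism}: t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value in the second scenario is evidence against MCAR. A large p-value in the first scenario is merely consistent with MCAR and does not prove it.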

Matrix completion methods used in this setting include:

  • Universal singular value thresholding (Chatterjee 2015) (can only recover the mean, not the whole true distribution).

  • SoftImpute (Hastie et al. 2015) (does not work well under “limited” missing not at random).

  • Synthetic nearest neighbor (Agarwal et al. 2023) (still works reasonably well under missing not at random). Available on GitHub: syntheticNN

11.1.2 Missing at Random (MAR)

Missing at Random, MAR, means there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data. Whether an observation is missing has nothing to do with the missing values, but it does have to do with the values of an individual’s observed variables. So, for example, if men are more likely to tell you their weight than women, weight is MAR.

MAR is a weaker assumption than MCAR.

\[ P(Y_{missing}|Y,X)= P(Y_{missing}|X) \]

The probability of Y being missing given Y and X is equal to the probability of Y being missing given X. However, the MAR condition cannot be verified from the observed data, because doing so would require knowing the missing values.

  • The propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data. In other words, there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data.

    • For example, if men are more likely to tell you their weight than women, weight is MAR.
  • MAR requires that the cause of the missing data is unrelated to the missing values but may be related to the observed values of other variables.

  • MAR means that missingness is related to observed values on other variables. For example, missing income data may be unrelated to the actual income values but related to education: perhaps people with more education are less likely to reveal their income than those with less education.
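The weight example can be illustrated with a small simulation. This is a sketch under stated assumptions: the weight distributions and the reporting probabilities (0.9 for men, 0.5 for women) are hypothetical numbers chosen only for illustration. Because missingness depends only on sex, which is always observed, respondents within each sex are a random subset, and reweighting by sex recovers the overall mean.

```python
import random
from statistics import mean

rng = random.Random(0)
people = []
for _ in range(20000):
    sex = rng.choice(["M", "F"])
    # Hypothetical weight distributions (kg) for illustration only.
    weight = rng.gauss(85, 10) if sex == "M" else rng.gauss(70, 10)
    # MAR: the chance of reporting depends only on sex, which is always observed.
    reported = rng.random() < (0.9 if sex == "M" else 0.5)
    people.append((sex, weight, reported))

true_mean = mean(w for _, w, _ in people)
complete_case_mean = mean(w for _, w, r in people if r)

# Within each sex, respondents are a random subset, so their mean is unbiased.
# Weighting those means by each sex's share of the full sample corrects the bias.
adjusted_mean = 0.0
for s in ("M", "F"):
    share = sum(1 for sex, _, _ in people if sex == s) / len(people)
    adjusted_mean += share * mean(w for sex, w, r in people if sex == s and r)

print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"sex-adjusted mean:  {adjusted_mean:.2f}")
```

The complete-case mean is pulled toward the men's average because men are overrepresented among respondents, while the sex-adjusted mean stays close to the true mean.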

11.1.3 Ignorable

The missing data mechanism is ignorable when

  1. The data are MAR.
  2. The parameters in the function of the missing data process are unrelated to the parameters (of interest) that need to be estimated.

In this case, you do not need to model the missing data mechanism to obtain valid estimates. Modeling it may still improve the efficiency of your parameter estimates, but only if you are very rigorous about how the mechanism is specified.
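A sketch of why these two conditions allow the mechanism to be ignored, in notation that is assumed here rather than taken from the text: let \(R\) be the missingness indicator, \(Y_{obs}\) and \(Y_{mis}\) the observed and missing parts of the data, \(\theta\) the parameters of interest, and \(\phi\) the parameters of the missing data process. The likelihood of what is actually observed is

\[ L(\theta,\phi)= \int f(Y_{obs},Y_{mis}|\theta)\, P(R|Y_{obs},Y_{mis},\phi)\, dY_{mis} \]

Under MAR, \(P(R|Y_{obs},Y_{mis},\phi)= P(R|Y_{obs},\phi)\), so this term moves outside the integral:

\[ L(\theta,\phi)= P(R|Y_{obs},\phi)\, f(Y_{obs}|\theta) \]

When \(\theta\) and \(\phi\) are unrelated, the first factor carries no information about \(\theta\), so likelihood-based inference on \(\theta\) can rely on \(f(Y_{obs}|\theta)\) alone.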

11.1.4 Nonignorable

Missing Not at Random, MNAR, means there is a relationship between the propensity of a value to be missing and the value itself.
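As a sketch in assumed notation, with Y the variable subject to missingness and X the other observed variables, MNAR means that Y still matters for missingness after conditioning on X:

\[ P(Y_{missing}|Y,X) \neq P(Y_{missing}|X) \]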

Example: people with the lowest education are the ones missing on education, or the sickest people are the most likely to drop out of the study.

MNAR is called Nonignorable because the missing data mechanism itself has to be modeled as you deal with the missing data. You have to include some model for why the data are missing and what the likely values are.

Hence, in the nonignorable case, the data are not MAR, and estimates of your parameters of interest will be biased if you do not model the missing data mechanism. One of the most widely used approaches for nonignorable missing data is the sample selection approach of James J. Heckman (1976).

  • Another name: Missing Not at Random (MNAR). There is a relationship between the propensity of a value to be missing and the value itself.

    • For example, people with low education will be less likely to report it.
  • We need to model why the data are missing and what the likely values are.

  • The missing data mechanism is related to the missing values.

  • It commonly occurs when people do not want to reveal something very personal or unpopular about themselves.

  • Complete case analysis can give highly biased results for nonignorable missing data. If proportionally more low- and moderate-income individuals are left in the sample because high-income people are missing, an estimate of the mean income will be lower than the actual population mean.
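The income example can be illustrated with a short simulation. This is a sketch under stated assumptions: the log-normal income distribution and the response rule `p_report` are hypothetical and chosen only to make reporting less likely as income rises. It shows the downward bias of the complete-case mean and does not attempt to correct it.

```python
import math
import random
from statistics import mean

rng = random.Random(42)
# Hypothetical right-skewed incomes (log-normal), in thousands.
incomes = [rng.lognormvariate(math.log(50), 0.5) for _ in range(20000)]


def p_report(income):
    """Hypothetical MNAR response rule: reporting gets less likely as income rises."""
    return 1 / (1 + math.exp((income - 80) / 15))


# Keep only the incomes that are reported (the complete cases).
reported = [x for x in incomes if rng.random() < p_report(x)]

print(f"true mean income:          {mean(incomes):.1f}")
print(f"complete-case mean income: {mean(reported):.1f}")
print(f"share reporting:           {len(reported) / len(incomes):.2f}")
```

Because the probability of reporting depends on income itself, no adjustment based only on observed variables removes this bias. That is why the missing data mechanism has to be modeled.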

One can use an instrument that predicts the nonresponse process in the outcome variable but is unrelated to the outcome in the population to correct for this missingness (although you still have to use complete cases) (B. Sun et al. 2018; Tchetgen Tchetgen and Wirth 2017).

References

Agarwal, Anish, Munther Dahleh, Devavrat Shah, and Dennis Shen. 2023. “Causal Matrix Completion.” In The Thirty Sixth Annual Conference on Learning Theory, 3821–26. PMLR.
Chatterjee, Sourav. 2015. “Matrix Estimation by Universal Singular Value Thresholding.”
Hastie, Trevor, Rahul Mazumder, Jason D Lee, and Reza Zadeh. 2015. “Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.” The Journal of Machine Learning Research 16 (1): 3367–3402.
Heckman, James J. 1976. “The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models.” In Annals of Economic and Social Measurement, Volume 5, Number 4, 475–92. NBER.
Sun, BaoLuo, Lan Liu, Wang Miao, Kathleen Wirth, James Robins, and Eric J Tchetgen Tchetgen. 2018. “Semiparametric Estimation with Data Missing Not at Random Using an Instrumental Variable.” Statistica Sinica 28 (4): 1965.
Tchetgen Tchetgen, Eric J, and Kathleen E Wirth. 2017. “A General Instrumental Variable Framework for Regression Analysis with Outcome Missing Not at Random.” Biometrics 73 (4): 1123–31.