Chapter 5 Econometrics and Large-scale Data

5.1 Implications for Hypothesis Testing

Policy researchers making frequentist inferences beware: when performing hypothesis tests on large-scale data, relying on statistical significance alone to draw meaningful conclusions will often fail you. Whenever the true parameter is not exactly equal to the null hypothesis value, the estimate will become statistically significant as the sample size (n) grows without bound, with p-values far smaller than the thresholds commonly used in social science (e.g. $p < 0.01$, often marked with two asterisks in regression tables). Note that, if the assumptions of a test hold and the true parameter is exactly equal to the null value, then a large sample size will not by itself lead to a rejection of the null hypothesis.

In practice, however, we can easily extract statistically significant parameter estimates from any data set large enough to require Spark. In reality, a true parameter value is very unlikely to be exactly equal to a specified null; consequently, an extremely large sample size increases the magnitude of the test statistic and drives the p-value far below standard thresholds of statistical significance.⁴ So, a parameter estimate that is arbitrarily close to the null value, and that is not statistically significant at a standard sample size (e.g. n = 15,000), may become statistically significant if the number of observations drawn from the same population increases to several hundred million. While the estimate would be statistically significant, it would be so close in value to the null that the result may be of little substantive interest.
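The effect of sample size on the p-value can be sketched numerically. The example below is a minimal illustration, not a reproduction of any analysis in this chapter: it assumes a two-sided z-test with test statistic $z = (\hat{\theta} - \theta_0) / (s / \sqrt{n})$, and the estimate (0.01), standard deviation (1.0) and sample sizes are hypothetical values chosen only to show the pattern.

```python
import math


def two_sided_p_value(estimate, null_value, std_dev, n):
    """Return (z, p) for a two-sided z-test of H0: parameter == null_value."""
    standard_error = std_dev / math.sqrt(n)
    z = (estimate - null_value) / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical inputs: the same tiny effect evaluated at growing sample sizes.
for n in (15_000, 1_000_000, 300_000_000):
    z, p = two_sided_p_value(estimate=0.01, null_value=0.0, std_dev=1.0, n=n)
    # For the largest n the p-value may print as 0 because of floating-point underflow.
    print(f"n = {n:>11,d}   z = {z:8.2f}   p = {p:.3g}")
```

With these assumed values the estimate is not significant at conventional levels when n = 15,000, yet the same estimate is overwhelmingly significant at several hundred million observations, even though its magnitude has not changed.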

If arbitrarily small p-values are abundant, p-values that fall below traditional cutoffs for significance cannot serve as compelling evidence in building an empirical argument,⁵ and policy researchers must employ alternative heuristics for making policy inferences in this context.

5.1.1 Statistical v. Practical Significance

One conceptual shift that many researchers will need to make when working with massive data is that inference is no longer primarily about detecting some effect, but is instead focused on examining the data for evidence of the absence of an effect. In particular, the abundance of arbitrarily small p-values in massive data analysis requires that researchers give greater weight to whether an estimate is practically significant than to whether it is statistically significant (since most estimates are likely to be statistically significant).

One of the most important indicators of practical significance is effect size (e.g. a correlation, a regression coefficient or a mean difference across groups) and, in conjunction, its confidence interval. Remember that the width of a confidence interval is determined by the critical value of the t-distribution, the standard deviation estimate and the sample size. Reporting the effect size of a parameter through confidence intervals:

Provides a more precise description of the magnitude and direction of an effect than reporting only the parameter estimate, and
Conveys how precisely the parameter has been estimated, rather than only whether a null hypothesis is rejected (or not rejected).
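For the simple case of a sample mean, the interval just described can be written compactly. The notation here is ours, introduced for illustration: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size and $t_{1-\alpha/2,\,n-1}$ the critical value of the t-distribution for a $(1-\alpha)$ confidence level.

$$
\bar{x} \;\pm\; t_{1-\alpha/2,\,n-1} \, \frac{s}{\sqrt{n}}
$$

The width of this interval is $2\, t_{1-\alpha/2,\,n-1}\, s / \sqrt{n}$, so for a fixed $s$ it shrinks roughly in proportion to $1/\sqrt{n}$.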

Although a large sample size can make confidence intervals extremely narrow, using statements of confidence to examine the importance of a parameter estimate is a useful strategy when implementing a hypothesis test that is very sensitive (e.g. the simple $H_0$ : $\theta = 0$ hypothesis test). Note that this approach requires researchers to make some a priori decision about what constitutes a practically significant value in the context of their research question.
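One way to put this strategy into practice is to compare the confidence interval with a threshold fixed in advance. The sketch below is one possible approach under stated assumptions, not a prescribed procedure: it uses a normal approximation to the t critical value (reasonable only for large n), and the function names, the three labels and the threshold of 0.05 are all hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(estimate, std_dev, n, confidence=0.95):
    """Normal-approximation confidence interval for a mean (assumes large n)."""
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = critical * std_dev / math.sqrt(n)
    return estimate - half_width, estimate + half_width


def classify_effect(interval, threshold):
    """Compare an interval with an a priori practical-significance threshold.

    `threshold` is the smallest absolute effect the researcher regards as
    practically important; it must be chosen before looking at the data.
    """
    lower, upper = interval
    if -threshold < lower and upper < threshold:
        return "practically negligible"
    if lower >= threshold or upper <= -threshold:
        return "practically significant"
    return "inconclusive"


# Hypothetical inputs: a tiny effect estimated from a very large sample.
interval = mean_confidence_interval(estimate=0.01, std_dev=1.0, n=300_000_000)
print(interval, classify_effect(interval, threshold=0.05))
```

With these assumed values the interval excludes zero, so the estimate is statistically significant, but it lies entirely inside the band of effects deemed too small to matter.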

5.2 Computational Implications

In addition to the theoretical implications of working with massive data discussed above, analyzing data with Spark also has computational implications that policy researchers should be aware of. Spark fits linear models using iterative optimization methods that are well-suited for large-scale and distributed computation.

Specifically, these methods use iterative algorithms to minimize the cost function of the linear model. Currently, the SparkR operation spark.glm uses only iteratively reweighted least squares (IRLS) to fit a linear model. This “solver” can fit a model with a maximum of 4096 features. PySpark similarly fits linear models using IRLS with the GeneralizedLinearRegression operation, which likewise supports models with up to 4096 features. If needed, there are PySpark operations that allow researchers to fit linear models with more than 4096 features: LinearRegression and LogisticRegression fit linear models using the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm. L-BFGS approximates a local minimum of the cost function of a model. However, as long as the cost function of the model is convex, any local minimum is also the global minimum, so the algorithm cannot settle on an inferior solution. Remember that the cost function for a linear regression model is a convex quadratic function. Therefore, L-BFGS will consistently compute parameter estimates for standard social science models.
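The choice between the two PySpark estimators can be sketched as follows. This is a minimal illustration rather than a recommended workflow: the helper names `choose_estimator` and `build_estimator` and the column names are hypothetical, the 4096 limit is taken from the text above, and constructing the estimators assumes PySpark is installed and a Spark session is already running.

```python
IRLS_MAX_FEATURES = 4096  # feature limit stated above for the IRLS solver


def choose_estimator(num_features):
    """Return the name of the PySpark estimator suited to the feature count."""
    if num_features <= IRLS_MAX_FEATURES:
        return "GeneralizedLinearRegression"
    return "LinearRegression"


def build_estimator(num_features, features_col="features", label_col="label"):
    """Construct (but do not fit) a linear model estimator.

    Requires PySpark and an active Spark session; the import is done lazily
    so that choose_estimator can be used without a Spark installation.
    """
    from pyspark.ml.regression import (
        GeneralizedLinearRegression,
        LinearRegression,
    )

    if choose_estimator(num_features) == "GeneralizedLinearRegression":
        # A Gaussian family with identity link is ordinary linear regression,
        # which this estimator fits with IRLS.
        return GeneralizedLinearRegression(
            family="gaussian",
            link="identity",
            featuresCol=features_col,
            labelCol=label_col,
        )
    # Request L-BFGS explicitly rather than relying on the default solver.
    return LinearRegression(
        solver="l-bfgs",
        featuresCol=features_col,
        labelCol=label_col,
    )
```

A fitted model would then be obtained with something like `build_estimator(5000).fit(train_df)`, where `train_df` stands for a hypothetical Spark DataFrame whose predictors have already been assembled into a single vector column.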

⁴ In the appendix, we provide a sketch of a proof for the claim that some mean parameter estimate of a random variable will become trivially statistically significant as n goes to infinity (if the true value is not exactly equal to the null). We can similarly show that differences in mean parameter estimates, regression coefficient estimates and linear combinations of coefficient estimates behave this way under unbounded n.
⁵ While small p-values are not sufficient for making meaningful inferences, they are necessary for making inferences with statistical confidence. If a hypothesis test results in a small test statistic and, therefore, a large p-value, then one or more of the following holds: the sample is still not large enough to estimate the parameter accurately, the parameter has been estimated with too little precision, or the true parameter is arbitrarily close to the null.