6.15 Generalization / overfitting

As discussed in Section 5.26, it is important to limit the number of predictors in a model to avoid overfitting and ensure generalizability. In a logistic regression with \(n\) observations, the rule of thumb is to have no more than \(n_{min}/15\) predictors, where \(n_{min}\) is the number of observations in the less common outcome level (Babyak 2004; Harrell 2015, pp. 72–73). Because \(n_{min} = n \times p_{min}\), where \(p_{min}\) is the proportion of observations with the less common outcome level, the rule can also be expressed in terms of the sample proportion: have no more than \(n \times p_{min}/15\) predictors. Seen the other way around, if you are designing a study and plan to include \(K\) predictors, you need at least \(15 \times K / p_{min}\) observations, where \(p_{min}\) is now your best guess for the smaller of the two population prevalences. As with any rule of thumb, this is meant as guidance; there is no requirement that it be strictly applied.

For example, suppose your outcome is “occurrence of disease within one year post-exposure”. The outcome levels are “disease” and “no disease”. If you have a sample of size \(n = 300\) and \(23\%\) of the sample developed the disease (so \(77\%\) did not and the less prevalent outcome level is “disease”), you should include no more than \(n \times p_{min}/15 = 300 \times 0.23/15 = 4.6\) predictors in a logistic regression model (since this is a rule of thumb, you can round up to \(5\)).
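The arithmetic in this example can be checked with a few lines of code. The following is a minimal sketch in Python, not part of the original text. The function name `max_predictors` and its argument names are hypothetical and chosen only for illustration. The default of 15 observations per predictor is the rule-of-thumb value quoted above.

```python
def max_predictors(n, p_min, per_predictor=15):
    """Rule-of-thumb upper limit on the number of predictors.

    n             : total sample size
    p_min         : proportion of observations in the less common outcome level
    per_predictor : observations in the less common level required per
                    predictor (15 is the rule-of-thumb value in the text)

    Hypothetical helper written for illustration only.
    """
    if not 0 < p_min <= 0.5:
        raise ValueError("p_min must be in (0, 0.5], the smaller of the two proportions")
    n_min = n * p_min  # observations in the less common outcome level
    return n_min / per_predictor


# Example from the text: n = 300, with 23% in the less common level ("disease")
limit = max_predictors(300, 0.23)
print(round(limit, 1))  # 4.6
```

The function returns the unrounded limit so that the decision about rounding is left to the analyst, in keeping with the guidance that the rule need not be strictly applied.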

Suppose instead you were designing a study in which you expect \(65\%\) of the individuals to experience the outcome and you would like to include 5 predictors. In this case, the prevalences are \(65\%\) and \(35\%\), so the lower prevalence is that of not experiencing the outcome. To ensure generalizability, you need at least \(15 \times K / p_{min} = 15 \times 5 / 0.35 = 214.29\) observations in your sample (round up to \(215\)).
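The study-design calculation can be sketched the same way. The Python snippet below is illustrative only. The function name `min_sample_size` and its argument names are hypothetical, and the small rounding step before taking the ceiling is an added assumption to guard against floating-point noise, not something stated in the rule itself.

```python
import math


def min_sample_size(k, p_min, per_predictor=15):
    """Rule-of-thumb minimum sample size for k predictors.

    k             : planned number of predictors
    p_min         : best guess for the smaller of the two outcome prevalences
    per_predictor : observations in the less common level required per
                    predictor (15 is the rule-of-thumb value in the text)

    Hypothetical helper written for illustration only.
    """
    if not 0 < p_min <= 0.5:
        raise ValueError("p_min must be in (0, 0.5], the smaller of the two prevalences")
    raw = per_predictor * k / p_min
    # Round away tiny floating-point error before taking the ceiling, so that
    # a value that is exactly an integer is not pushed up by one.
    return math.ceil(round(raw, 9))


# Example from the text: 5 predictors, prevalences of 65% and 35%
print(min_sample_size(5, 0.35))  # 215
```

Note that the argument is the smaller prevalence (\(0.35\)), not the expected prevalence of the outcome itself (\(0.65\)).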

NOTE: A sample size large enough to satisfy this rule of thumb for generalizability may or may not provide enough power to test a hypothesis.

References

Babyak, Michael A. 2004. “What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models.” Psychosomatic Medicine 66 (3): 411–21.
Harrell, Frank E, Jr. 2015. Regression Modeling Strategies. 2nd ed. Switzerland: Springer International Publishing.