6.15 Generalization / overfitting

As discussed in Section 5.27, it is important to limit the number of predictors in a model to avoid overfitting and ensure generalizability. In a logistic regression with n observations, the rule of thumb is to have no more than nmin/15 predictors, where nmin is the number of observations in the less common outcome level (Babyak 2004; Harrell 2015, 72–73). Because nmin = n×pmin, where pmin is the sample proportion of observations in the less common outcome level, the same limit can be written as n×pmin/15 predictors. Seen the other way around, if you are designing a study and plan to include K predictors, you need at least 15×K/pmin observations, where pmin is now your best guess for the smaller of the two population prevalences. As with any rule of thumb, this is meant as guidance; there is no requirement that it be strictly applied.

For example, suppose your outcome is “occurrence of disease within one year post-exposure”. The outcome levels are “disease” and “no disease”. If you have a sample of size n=300 and 23% of the sample developed the disease (so 77% did not and the less prevalent outcome level is “disease”), then nmin=300×0.23=69 and you should include no more than n×pmin/15=300×0.23/15=4.6 predictors in a logistic regression model (since this is only a rule of thumb, you can round up to 5).
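The arithmetic in this example can be checked with a few lines of code. The following is only a minimal sketch in Python; the function name max_predictors and its arguments are hypothetical and chosen for illustration, not taken from any library. It assumes the 15-observations-per-predictor rule of thumb stated above.

```python
def max_predictors(n, p_outcome, per_predictor=15):
    """Rule-of-thumb upper limit on the number of predictors
    in a logistic regression (hypothetical helper).

    n             -- total sample size
    p_outcome     -- sample proportion in either outcome level
    per_predictor -- observations in the less common outcome level
                     needed per predictor (15 in the rule of thumb)
    """
    if n <= 0 or not 0 < p_outcome < 1:
        raise ValueError("n must be positive and p_outcome strictly between 0 and 1")
    # Use the less common outcome level, whichever one was supplied.
    p_min = min(p_outcome, 1 - p_outcome)
    n_min = n * p_min  # observations in the less common level
    return n_min / per_predictor


# Example from the text: n = 300, 23% developed the disease.
print(round(max_predictors(300, 0.23), 1))  # 4.6
```

The function returns the unrounded limit so that the decision about rounding is left to the analyst, in keeping with the guidance that the rule need not be strictly applied.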

Suppose instead you were designing a study in which you expect 65% of the individuals to experience the outcome and you would like to include five predictors. In this case, the prevalences are 65% and 35%, so the lower prevalence is that of not experiencing the outcome. To ensure generalizability, you need at least 15×K/pmin=15×5/0.35=214.29 observations in your sample (round up to 215).
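The study-design calculation can be sketched in the same way. Again, this is an illustrative Python sketch under the rule of thumb above; the function name min_sample_size and its arguments are hypothetical, not part of any library.

```python
import math


def min_sample_size(k, p_outcome, per_predictor=15):
    """Rule-of-thumb minimum sample size for a logistic regression
    with k predictors (hypothetical helper).

    k             -- number of predictors you plan to include
    p_outcome     -- anticipated prevalence of either outcome level
    per_predictor -- observations in the less common outcome level
                     needed per predictor (15 in the rule of thumb)
    """
    if k <= 0 or not 0 < p_outcome < 1:
        raise ValueError("k must be positive and p_outcome strictly between 0 and 1")
    # The rule is driven by the smaller of the two prevalences.
    p_min = min(p_outcome, 1 - p_outcome)
    # Round up, since a fraction of an observation cannot be sampled.
    return math.ceil(per_predictor * k / p_min)


# Example from the text: 65% expected to experience the outcome, K = 5.
print(min_sample_size(5, 0.65))  # 215
```

Because the result depends on a guessed prevalence, it may be worth rerunning the calculation over a range of plausible values of p_outcome to see how sensitive the required sample size is to that guess.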

NOTE: A sample size sufficient to ensure generalizability may or may not be sufficient for the purpose of having enough power to test a hypothesis.

References

Babyak, Michael A. 2004. “What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models.” Psychosomatic Medicine 66 (3): 411–21.
Harrell, Frank E., Jr. 2015. Regression Modeling Strategies. 2nd ed. Switzerland: Springer International Publishing.