# Notes for Predictive Modeling

*MSc in Big Data Analytics at Carlos III University of Madrid*

*Last updated: 2021-04-29, v5.8.6*

# Preface

### Welcome

Welcome to the notes for *Predictive Modeling* for the academic year 2020/2021. The course is part of the MSc in Big Data Analytics from Carlos III University of Madrid.

The course is designed to have, roughly, **one session per main topic** in the syllabus. The schedule is tight due to time constraints, which will inevitably make the treatment of certain methods somewhat superficial. Nevertheless, the course will hopefully give you a respectable panoramic view of the different *statistical* methods available for predictive modeling. A broad view of the syllabus and its planning is:

- Introduction (first session)
- Linear models I (first/second session)
- Linear models II (second/third session)
- Linear models III (third/fourth session)
- Generalized linear models (fifth/sixth session)
- Nonparametric regression (sixth/seventh session)

Some logistics for the development of the course follow:

- **Office hours** are Tuesdays from 17:00 to 18:00, online. Making use of them is the fastest way for me to clarify your doubts.
- **Questions and comments** during lectures are most welcome, particularly if these are clarifications, comments, or alternative perspectives that may help the rest of the class. So just go ahead and fire away!
- Detailed **course evaluation** guidelines can be found in the Aula Global. Recall that participation in lessons is positively evaluated.

### Main references and credits

Several great reference books have been used for preparing these notes. The following list presents the books that have been consulted:

- Chacón and Duong (2018) (Section 6.1.4)
- DasGupta (2008) (Section 3.5.2)
- Durbán (2017) (Section 5.2.2)
- Fan and Gijbels (1996) (Sections 6.2, 6.2.3, and 6.2.4)
- Hastie, Tibshirani, and Friedman (2009) (Section 4.1)
- James et al. (2013) (Sections 2.2 – 2.7, 3.1, 3.5, 3.6.3, and 4.1)
- Kuhn and Johnson (2013) (Section 1.2)
- Li and Racine (2007) (Section 6.3)
- Loader (1999) (Section 6.5)
- McCullagh and Nelder (1983) (Sections 5.2 – 5.6)
- Peña (2002) (Sections 2.2 – 2.7, 3.5, and 5.2.1)
- Seber and Lee (2003) (Section 4.2)
- Seber (1984) (Section 4.3)
- Wand and Jones (1995) (Sections 6.1.2, 6.1.3, and 6.2.4)
- Wasserman (2004) (Section 6.5)
- Wasserman (2006) (Section 6.2.4)
- Wood (2006) (Sections 5.2.2 and 5.7)

These notes are possible due to the existence of the incredible pieces of software by Xie (2016), Xie (2020), Allaire et al. (2020), Xie and Allaire (2020), and R Core Team (2020). Also, certain hacks to improve the design layout have been possible due to the outstanding work of Úcar (2018). The icons used in the notes were designed by madebyoliver, freepik, and roundicons from Flaticon.

Last but not least, the notes have benefited from contributions from the following people:

- Ainara Apezteguía García (fixed a mathematical typo)
- Katherine Botz (performed a thorough proofreading of the course materials, fixing a large number of typos)
- Marcos José Castillo Estévez (fixed two typos)
- Luis Cerdán Pedraza (performed an outstanding proofreading of the course materials fixing more than fifty typos, style issues, and mathematical typos)
- Frederik Chettouh (fixed a mathematical typo and two bugs)
- Gulnur Demir (fixed two typos)
- Andrés Escalante Ariza (fixed a mathematical typo)
- José Ángel Fernández (fixed several typos)
- Trinidad González Berzal (fixed a mathematical typo)
- Andrés Modet Álamo (performed an excellent review of the course materials detecting and fixing more than thirty typos, mostly mathematical, and four bugs)
- Santiago Palmero Muñoz (fixed a mathematical typo and a bug)
- Federico Petraccaro (fixed three mathematical typos)
- Enrique Ramírez Díaz (fixed a mathematical typo)
- Pavel Razgovorov (fixed a mathematical typo)
- Cristina Rodríguez Beltrán (fixed a typo and two bugs)
- Manuel Rodríguez Ramírez (fixed two typos)
- Celia Romero González (fixed a typo)
- Leonardo Stincone (fixed a mathematical typo and a bug)

### Contributions

Contributions, reporting of typos, and feedback on the notes are very welcome. Just send an email to edgarcia@est-econ.uc3m.es and give me a good reason for writing your name in the list of contributors!

### License

All the material in these notes is licensed under the **Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License** (CC BY-NC-ND 4.0). You may not use this material except in compliance with the aforementioned license. The human-readable summary of the license states that:

**You are free to**:

- *Share* – Copy and redistribute the material in any medium or format.

**Under the following terms**:

- *Attribution* – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- *NonCommercial* – You may not use the material for commercial purposes.
- *NoDerivatives* – If you remix, transform, or build upon the material, you may not distribute the modified material.

### Citation

You may use the following BibTeX entry when citing these notes:

```
@book{Garcia-Portugues2021,
title = {Notes for Predictive Modeling},
author = {Garc\'ia-Portugu\'es, E.},
year = {2021},
note = {Version 5.8.6. ISBN 978-84-09-29679-8},
url = {https://bookdown.org/egarpor/PM-UC3M/}
}
```

You may also want to use the following template:

García-Portugués, E. (2021). *Notes for Predictive Modeling*. Version 5.8.6. ISBN 978-84-09-29679-8. Available at https://bookdown.org/egarpor/PM-UC3M/.

### References

Allaire, J. J., Y. Xie, J. McPherson, J. Luraschi, K. Ushey, A. Atkins, H. Wickham, J. Cheng, W. Chang, and R. Iannone. 2020. *rmarkdown: Dynamic Documents for R*. https://github.com/rstudio/rmarkdown.

Chacón, J. E., and T. Duong. 2018. *Multivariate Kernel Smoothing and Its Applications*. Vol. 160. Monographs on Statistics and Applied Probability. Boca Raton: CRC Press. https://doi.org/10.1201/9780429485572.

DasGupta, A. 2008. *Asymptotic Theory of Statistics and Probability*. Springer Texts in Statistics. New York: Springer. https://doi.org/10.1007/978-0-387-75971-5.

Durbán, M. 2017. *Modelización Estadística*. Lecture notes.

Fan, J., and I. Gijbels. 1996. *Local Polynomial Modelling and Its Applications*. Vol. 66. Monographs on Statistics and Applied Probability. London: Chapman & Hall. https://doi.org/10.2307/2670134.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. *The Elements of Statistical Learning*. Springer Series in Statistics. New York: Springer. https://doi.org/10.1007/978-0-387-84858-7.

James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. *An Introduction to Statistical Learning*. Vol. 103. Springer Texts in Statistics. New York: Springer. https://doi.org/10.1007/978-1-4614-7138-7.

Kuhn, M., and K. Johnson. 2013. *Applied Predictive Modeling*. New York: Springer. https://doi.org/10.1007/978-1-4614-6849-3.

Li, Q., and J. S. Racine. 2007. *Nonparametric Econometrics*. Princeton: Princeton University Press. https://press.princeton.edu/books/hardcover/9780691121611/nonparametric-econometrics.

Loader, C. 1999. *Local Regression and Likelihood*. Statistics and Computing. New York: Springer. https://doi.org/10.2307/1270956.

McCullagh, P., and J. A. Nelder. 1983. *Generalized Linear Models*. Monographs on Statistics and Applied Probability. London: Chapman & Hall. https://doi.org/10.1007/978-1-4899-3244-0.

Peña, D. 2002. *Regresión y Diseño de Experimentos*. Madrid: Alianza Editorial. https://www.alianzaeditorial.es/libro/manuales/regresion-y-diseno-de-experimentos-daniel-pena-9788420693897/.

R Core Team. 2020. *R: A Language and Environment for Statistical Computing*. Vienna. https://www.R-project.org/.

Seber, G. A. F. 1984. *Multivariate Observations*. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. New York: John Wiley & Sons. https://doi.org/10.1002/9780470316641.

Seber, G. A. F., and A. J. Lee. 2003. *Linear Regression Analysis*. Wiley Series in Probability and Statistics. Hoboken: Wiley-Interscience. https://doi.org/10.1002/9780471722199.

Úcar, I. 2018. “Energy Efficiency in Wireless Communications for Mobile User Devices.” PhD thesis, Universidad Carlos III de Madrid. https://enchufa2.github.io/thesis/.

Wand, M. P., and M. C. Jones. 1995. *Kernel Smoothing*. Vol. 60. Monographs on Statistics and Applied Probability. London: Chapman & Hall. https://doi.org/10.1007/978-1-4899-4493-1.

Wasserman, L. 2004. *All of Statistics*. Springer Texts in Statistics. New York: Springer-Verlag. https://doi.org/10.1007/978-0-387-21736-9.

Wasserman, L. 2006. *All of Nonparametric Statistics*. Springer Texts in Statistics. New York: Springer-Verlag. https://doi.org/10.1007/0-387-30623-4.

Wood, S. N. 2006. *Generalized Additive Models*. Texts in Statistical Science Series. Boca Raton: Chapman & Hall/CRC. https://doi.org/10.1201/9781420010404.

Xie, Y. 2016. *Bookdown: Authoring Books and Technical Documents with R Markdown*. The R Series. Boca Raton: CRC Press. https://bookdown.org/yihui/bookdown/.

Xie, Y. 2020. *knitr: A General-Purpose Package for Dynamic Report Generation in R*. https://CRAN.R-project.org/package=knitr.

Xie, Y., and J. J. Allaire. 2020. *tufte: Tufte’s Styles for R Markdown Documents*. https://CRAN.R-project.org/package=tufte.