Regression Models
MSc in Statistics for Data Science, Carlos III University of Madrid
Chapter 1 Introduction
These notes contain both the theory and practice for the statistical models presented in the course.
Regression analysis is the most common statistical modeling approach used in data analysis, and it is the basis for more advanced statistical and machine learning models.
In this course, you will receive foundational knowledge of the most widely used tools in regression analysis. You will learn the basics of linear regression, logistic regression, Poisson regression, generalized linear models and generalized additive models.
Throughout this course, you will be exposed to the fundamental concepts of regression analysis and to many data examples using the R statistical software. By the end of the course, you will therefore also be familiar with implementing regression models in R and with interpreting the results of those implementations.
The structure of the course is as follows:
- Linear Models
- Generalized Linear Models
- Generalized Additive Models
- An important part of the theoretical results of the course relies on matrix algebra and on basic concepts about probability distributions; a review of these results can be found here.
Also, a good reference book for matrix algebra is:
Matrix Algebra from a Statistician’s Perspective. D.A. Harville, Springer, 2008
- The software used in the course is the statistical language R, or its well-known environment RStudio. Basic prior knowledge of R is assumed.
Academic Integrity
Academic integrity is the pursuit of scholarly activity free from fraud and deception and is an educational objective of this institution. Academic dishonesty includes, but is not limited to, cheating, plagiarizing, fabricating of information or citations, facilitating acts of academic dishonesty by others, submitting work of another person or work previously used without informing the instructor, or tampering with the academic work of other students. All University policies regarding academic integrity apply to this course.
1.1 Required packages for Chapters 1-4
We will use several packages that are not included in the base R distribution (note: list to be updated). These can be installed as:
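The package list itself is not reproduced in this section, so the following is only a minimal sketch of the usual installation pattern. The vector `pkgs` and its two entries are placeholders chosen for illustration (both ship with R, so nothing is downloaded when the sketch runs); they are not the course's actual list.

```r
# Placeholder names only: both ship with R, so nothing is downloaded here.
# Replace them with the packages the course actually requires.
pkgs <- c("stats", "utils")

# Check which packages can be loaded from the current library paths
is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!is_available]

# install.packages() needs a CRAN mirror and network access,
# so it is called only when something is actually missing
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

Checking availability first avoids reinstalling packages that are already present each time the notes are run.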
1.2 Statistical Modeling versus Machine Learning
Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, 16(3), 199-231.
- What is the difference between mathematical and statistical modeling?
- What is the difference between statistical modeling and machine learning?
1.3 What is Regression Modeling?
- A set of statistical tools for modeling associations among variables, with aims such as:
- Prediction of new observations
- System explanation
- Variable screening
- Parameter estimation
- Regression versus Correlation
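- Correlation: the relationship is not directional; interest is in how two variables are mutually associated. The usual (Pearson) correlation coefficient captures only linear relationships
- Regression: the relationship is directional; interest is in how some variables (the responses) respond to others (the predictors)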
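The points above can be illustrated with a short R sketch. It uses the `cars` data set that ships with R (speed and stopping distance of 50 cars); the choice of data set, the object names, and the new speed value of 21 are assumptions made for this example only.

```r
# Regression is directional: dist is the response, speed the predictor
fit_dist_on_speed <- lm(dist ~ speed, data = cars)
coef(fit_dist_on_speed)                       # parameter estimation
predict(fit_dist_on_speed,
        newdata = data.frame(speed = 21))     # prediction of a new observation

# Swapping the roles of the variables gives a different fitted line
fit_speed_on_dist <- lm(speed ~ dist, data = cars)
slope_yx <- unname(coef(fit_dist_on_speed)["speed"])
slope_xy <- unname(coef(fit_speed_on_dist)["dist"])

# Correlation is symmetric: the order of the variables does not matter
r <- cor(cars$speed, cars$dist)
all.equal(r, cor(cars$dist, cars$speed))      # TRUE

# The two slopes differ, but their product equals the squared correlation
all.equal(slope_yx * slope_xy, r^2)           # TRUE

# Pearson correlation captures only linear association:
# x^2 is an exact function of x, yet the correlation is zero
x <- seq(-3, 3, by = 0.5)
cor(x, x^2)
```

The two regressions answer different questions (how distance responds to speed, and how speed responds to distance), whereas the correlation coefficient is a single symmetric summary of both.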
1.4 Why is it called “Regression”?
In one study, Galton (1886) collected data on the heights of parents and their adult children. He calculated the average of the two parents’ heights (which he called the “mid-parent height”) and divided the families into groups based on the range of these mid-parent heights.
Based on these results, Galton concluded that as the heights of the parents deviated from the average height (that is, as they became taller or shorter than the average adult), their children tended to be less extreme in height. That is, the heights of the children regressed to the average height of an adult.
Later, the term regression became associated with the statistical analysis that we now call regression. But it was just by chance that Galton’s original results using a fitted line happened to show a regression of heights. If his study had shown increasing deviation of children’s heights from the average compared to their parents, perhaps we’d be calling it “progression” instead.
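Galton's observation can be reproduced with simulated data. The sketch below does not use Galton's measurements: the mean `mu`, standard deviation `sigma`, and parent-child correlation `rho` are illustrative assumptions, and heights are drawn from a normal model with the same mean and spread in both generations.

```r
set.seed(1)
n     <- 1000
mu    <- 170   # assumed common mean height (cm), illustrative only
sigma <- 7     # assumed common standard deviation (cm), illustrative only
rho   <- 0.5   # assumed parent-child correlation, illustrative only

# Simulated mid-parent heights
midparent <- rnorm(n, mean = mu, sd = sigma)

# Child heights with the same mean and sd, correlated with mid-parent height
child <- mu + rho * (midparent - mu) +
  rnorm(n, mean = 0, sd = sigma * sqrt(1 - rho^2))

# The fitted slope is close to rho, hence below 1
fit   <- lm(child ~ midparent)
slope <- unname(coef(fit)["midparent"])
slope

# Families with tall parents: the children are taller than average,
# but on average less extreme than their parents
tall <- midparent > mu + sigma
parent_excess <- mean(midparent[tall]) - mu
child_excess  <- mean(child[tall]) - mu
c(parent = parent_excess, child = child_excess)
```

Because both generations share the same spread in this setup, the slope equals the correlation, so any correlation below 1 produces the "regression" towards the average that Galton described.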