4 Introductory Data Analysis
4.1 Exploratory Data Analysis
So far, we’ve learned about how to manipulate our data and how to graph our outputs. Both of these are critically important parts of what’s known as exploratory data analysis - or EDA. When you’re starting with a new dataset, you won’t always immediately know what trends and patterns might be there to discover. The idea at this stage isn’t to find out what’s causing any trends in the data, to identify any significant results you might have, or to get publishable figures and tables - the point is to understand exactly what it is that you’re dealing with.
This unit gives examples of what EDA might look like with a sample dataset. But there's no prescribed set of steps to go through while working on EDA - you should feel free to come up with as many hypotheses as you can, and spend time analyzing each of them individually. You might find something surprising!
Speaking of surprises, I really enjoy this quote from Nate Silver, founder and editor in chief of FiveThirtyEight:
You ideally want to find yourself surprised by the data some of the time — just not too often. If you never come up with a result that surprises you, it generally means that you didn’t spend a lot of time actually looking at the data; instead, you just imparted your assumptions onto your analysis and engaged in a fancy form of confirmation bias. If you’re constantly surprised, on the other hand, more often than not that means your [code] is buggy or you don’t know the field well enough; a lot of the “surprises” are really just mistakes.
—Nate Silver
Surprises are awesome, and are how discoveries are made in science. But at the same time, a lot of papers are retracted because their big surprise was actually just a glitch in the code. Whenever you find something you didn’t expect, make sure you go back through your code and assumptions - it never hurts to double check!
4.1.1 gapminder
Anyway. We’ll be working with data from the gapminder database, which contains statistics on global development metrics. We can get the data like we get most packages:
install.packages("gapminder")
library(gapminder)
Let’s also load the tidyverse:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The gapminder package includes four tables, of which we only care about one: gapminder. We can preview the data, as usual, by typing the name of the table:
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
4.1.2 Describing Your Data
If we want to get a sense of each variable, we can use base R's summary() to find basic summary statistics for each column:
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
For the categorical variables country and continent, R just returns the number of observations each category has. For the numeric variables, meanwhile, the output is a little more involved.
We can get an even fuller sense of the data using functions from other packages. My personal favorite is the describe() function included in psych:
install.packages("psych")
psych::describe(gapminder)
## vars n mean sd median trimmed
## country* 1 1704 71.50 41.00 71.50 71.50
## continent* 2 1704 2.33 1.21 2.00 2.27
## year 3 1704 1979.50 17.27 1979.50 1979.50
## lifeExp 4 1704 59.47 12.92 60.71 59.92
## pop 5 1704 29601212.32 106157896.74 7023595.50 11399459.45
## gdpPercap 6 1704 7215.33 9857.45 3531.85 5221.44
## mad min max range skew kurtosis
## country* 52.63 1.00 142.0 141 0.00 -1.20
## continent* 1.48 1.00 5.0 4 0.25 -1.34
## year 22.24 1952.00 2007.0 55 0.00 -1.22
## lifeExp 16.10 23.60 82.6 59 -0.25 -1.13
## pop 7841473.62 60011.00 1318683096.0 1318623085 8.33 77.62
## gdpPercap 4007.61 241.17 113523.1 113282 3.84 27.40
## se
## country* 0.99
## continent* 0.03
## year 0.42
## lifeExp 0.31
## pop 2571683.45
## gdpPercap 238.80
(Note: psych:: lets me call functions from the psych package without having to load it using library(). This is useful when you aren't using the functions that often - and lets me explicitly identify which functions come from which packages, for the purposes of instruction - but becomes less useful when using the same function or package multiple times. Imagine having to type dplyr:: every time we wanted to use %>%, for instance!)
This output replaces the IQR with the full range of the data, and adds a number of other important statistics. For instance, we can now see the skew and kurtosis of our data, which describe how far our data is from a normal distribution. Briefly, skew measures how asymmetric the distribution is - how long one of its "tails" is relative to the other - with a skew of 0 meaning the data is symmetric (and the mean and median roughly equal). Kurtosis, meanwhile, measures how heavy the tails of the distribution are; describe() reports excess kurtosis, so normally distributed data scores 0.
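If you just want those two statistics for a single column, psych also exposes them as standalone functions - a quick sketch (skew() and kurtosi() are the names psych uses):
# skew and excess kurtosis for population alone - both far from 0,
# suggesting pop is nowhere near normally distributed
psych::skew(gapminder$pop)
psych::kurtosi(gapminder$pop)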
You should always review these values for your data - both to get a sense of what you're working with, and to catch any errors made during data entry. If population had a negative minimum value, for instance, you'd know to be alarmed.
As we can see, some of our data is highly skewed, with extremely long tails. For instance, if we were to make a histogram of population:
ggplot(gapminder, aes(pop)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This makes some intuitive sense - most countries have relatively small populations, while a few - such as China and India - contain significant portions of the world's population.
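When data is this skewed, it's often worth re-plotting on a log scale - a sketch of the same histogram with a transformed x axis:
ggplot(gapminder, aes(pop)) +
  geom_histogram() +
  scale_x_log10() # spreads out the long right tail so most countries are visible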
Now that we have a sense of what our data looks like, we can start trying to identify trends in the data. You should never trust your data before visualizing it - summary statistics and significance tests alone can miss critical patterns in your data.
Base R's pairs() function is useful for this purpose - it makes a matrix of scatterplots for all your variables, letting you spot any correlations visually. Note that we have to subset our data to make sure that we're only graphing the numeric columns:
pairs(gapminder[, 3:6])
From this, we can immediately see some trends - it looks like all our variables increase over time (left column), and that life expectancy goes up as GDP increases (right column, second from the top). A way to see the same thing in table form is to use cor():
cor(gapminder[, 3:6])
## year lifeExp pop gdpPercap
## year 1.00000000 0.43561122 0.08230808 0.22731807
## lifeExp 0.43561122 1.00000000 0.06495537 0.58370622
## pop 0.08230808 0.06495537 1.00000000 -0.02559958
## gdpPercap 0.22731807 0.58370622 -0.02559958 1.00000000
The numbers refer to how correlated the variables are: 1 means perfectly positively correlated, while -1 is perfectly negatively correlated.
If we wanted to get even fancier, we could coerce this result to a dataframe using the tidy() function from the broom package. We can even visualize this data in ggplot, using geom_tile():
install.packages("broom")
CorMatrix <- broom::tidy(cor(gapminder[, 3:6])) %>%
  rename(Var1 = ".rownames") %>%
  gather(Var2, Cor, -Var1)
CorMatrix
## Var1 Var2 Cor
## 1 year year 1.00000000
## 2 lifeExp year 0.43561122
## 3 pop year 0.08230808
## 4 gdpPercap year 0.22731807
## 5 year lifeExp 0.43561122
## 6 lifeExp lifeExp 1.00000000
## 7 pop lifeExp 0.06495537
## 8 gdpPercap lifeExp 0.58370622
## 9 year pop 0.08230808
## 10 lifeExp pop 0.06495537
## 11 pop pop 1.00000000
## 12 gdpPercap pop -0.02559958
## 13 year gdpPercap 0.22731807
## 14 lifeExp gdpPercap 0.58370622
## 15 pop gdpPercap -0.02559958
## 16 gdpPercap gdpPercap 1.00000000
ggplot(CorMatrix, aes(Var1, Var2, fill = Cor)) +
  geom_tile()
Note that what we did above - calling cor() inside of tidy() - applies the outside function to the output of the inside function. Functions are run from the inside out. You might remember similar notation from algebra classes - f(g(x)) is the same as (f ∘ g)(x), the composition of the two functions.
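And if deeply nested calls get hard to read, the pipe runs the same computation left to right - this sketch is equivalent to the tidy(cor()) call above:
gapminder[, 3:6] %>%
  cor() %>%
  broom::tidy() # same result as broom::tidy(cor(gapminder[, 3:6]))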
Of course, there’s no need to do all of these at once - you can do whichever method makes sense to you. Either way, we get similar results - the strongest correlation is between life expectancy and GDP. We can visualize this trend using our ggplot skills:
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_jitter()
(See that example of nesting functions? ggplot(aes())!)
Hmm, weird! While we can see the correlation, it seems like there’s another factor we aren’t accounting for. What happens if we color the points by the year they represent?
ggplot(gapminder, aes(gdpPercap, lifeExp, color = year)) +
  geom_jitter()
That looks like we're onto something. What if we faceted the data by continent?
ggplot(gapminder, aes(gdpPercap, lifeExp, color = year)) +
  geom_jitter() +
  facet_wrap(~ continent)
Hmm. We can see some obvious trends - it seems like Africa has a lower average life expectancy than the other continents, for instance - but they're hard to pin down from the visuals alone. For that, we're going to have to get into some actual statistical computing.
4.2 Statistical Tests and Regressions
Now that we've identified some hypotheses about our data, we can use the full power of R to test them. First off, we can check whether any of the correlations we saw are statistically significant. For the full dataset, this is pretty easy - we can just feed two vectors into cor.test():
cor.test(gapminder$lifeExp, gapminder$gdpPercap)
##
## Pearson's product-moment correlation
##
## data: gapminder$lifeExp and gapminder$gdpPercap
## t = 29.658, df = 1702, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5515065 0.6141690
## sample estimates:
## cor
## 0.5837062
(Note that in practice we'd most likely use a different correlation test, as the data on life expectancy are non-normal. You can control the test used by setting the method argument - information about the correlation tests available in this function can be found here.)
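As a sketch, a rank-based alternative with the same two vectors looks like this:
# Spearman's rank correlation makes no normality assumption
# (R will warn that exact p-values can't be computed, due to ties)
cor.test(gapminder$lifeExp, gapminder$gdpPercap, method = "spearman")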
Looks like a yes!
Before we rush off to publish, though, it might be worth seeing if the correlations are different (or, hopefully, stronger) on each continent. First, let's check whether the different continents really do have different life expectancies. To do this, we'll perform an ANOVA using our continents as different groups, then test for differences between the groups.
Now, ANOVA is actually just another form of linear regression, in which you predict the impact an independent variable has on the response variable. The unique thing about ANOVA is that the independent variables have to be categorical - if they're all numeric, you want to stick with regression, and if there's a mix, you'll use ANCOVA.
The reason I bring this up is that running an ANOVA in R requires you to first build a linear model, using the aov() function. This function is pretty simple, taking two arguments - a formula relating the response variable to the independent variables, and the data to be used. For our purposes, modeling life expectancy as a function of continent, the call looks like this:
aov(lifeExp ~ continent, data = gapminder)
## Call:
## aov(formula = lifeExp ~ continent, data = gapminder)
##
## Terms:
## continent Residuals
## Sum of Squares 139343.2 144805.2
## Deg. of Freedom 4 1699
##
## Residual standard error: 9.231992
## Estimated effects may be unbalanced
Those values do mean something, but we won't get into that right now. More interesting for our purposes is the full output of our ANOVA, which we can get by feeding our model to anova():
anova(aov(lifeExp ~ continent, data = gapminder))
## Analysis of Variance Table
##
## Response: lifeExp
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 4 139343 34836 408.73 < 2.2e-16 ***
## Residuals 1699 144805 85
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note, by the way, that you'd usually do this by assigning the aov() call to an object (as in a <- aov(...)). I'm not doing that here, as I think this section is easier to understand conceptually if I'm explicit about what we're running anova() and the other tests on.
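If you do want the reusable version, it's just the following sketch (the object name is my own):
ContinentModel <- aov(lifeExp ~ continent, data = gapminder)
anova(ContinentModel) # the same table as above, but the fitted model can be reused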
Looks like continent does have a significant effect on life expectancy!
By the way, there are two other, equivalent ways to run the same test, swapping lm() for aov() and summary() for anova(). It really doesn't matter which you go with - they all return the same results.
anova(lm(lifeExp ~ continent, data = gapminder))
summary(aov(lifeExp ~ continent, data = gapminder))
So now we know that there are significant differences between the continents. What we want now is to be able to tell what those differences are. Base R's TukeyHSD() function will help with that, estimating the difference between each pair of continents. All we have to do is feed it our aov() object and, optionally, tell it to sort the results by the size of the difference with ordered = TRUE:
TukeyHSD(aov(lifeExp ~ continent, data = gapminder), ordered = TRUE)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
## factor levels have been ordered
##
## Fit: aov(formula = lifeExp ~ continent, data = gapminder)
##
## $continent
## diff lwr upr p adj
## Asia-Africa 11.199573 9.579887 12.819259 0.0000000
## Americas-Africa 15.793407 14.022263 17.564550 0.0000000
## Europe-Africa 23.038356 21.369862 24.706850 0.0000000
## Oceania-Africa 25.460878 20.216908 30.704848 0.0000000
## Americas-Asia 4.593833 2.664235 6.523432 0.0000000
## Europe-Asia 11.838783 10.002952 13.674614 0.0000000
## Oceania-Asia 14.261305 8.961718 19.560892 0.0000000
## Europe-Americas 7.244949 5.274203 9.215696 0.0000000
## Oceania-Americas 9.667472 4.319650 15.015293 0.0000086
## Oceania-Europe 2.422522 -2.892185 7.737230 0.7250559
So as we can see, Africa has a significantly lower life expectancy than every other continent, followed by Asia, and then the Americas. Oceania and Europe, meanwhile, have about equal life expectancies.
If you remember the tidy() function from broom that we used earlier, you probably won't be surprised to learn it can make this sort of output tidier, too:
broom::tidy(TukeyHSD(aov(lifeExp ~ continent, data = gapminder), ordered = TRUE))
## term comparison estimate conf.low conf.high adj.p.value
## 1 continent Asia-Africa 11.199573 9.579887 12.819259 2.727152e-12
## 2 continent Americas-Africa 15.793407 14.022263 17.564550 2.727152e-12
## 3 continent Europe-Africa 23.038356 21.369862 24.706850 2.727152e-12
## 4 continent Oceania-Africa 25.460878 20.216908 30.704848 2.727152e-12
## 5 continent Americas-Asia 4.593833 2.664235 6.523432 1.047859e-09
## 6 continent Europe-Asia 11.838783 10.002952 13.674614 2.727152e-12
## 7 continent Oceania-Asia 14.261305 8.961718 19.560892 5.862422e-12
## 8 continent Europe-Americas 7.244949 5.274203 9.215696 2.775336e-12
## 9 continent Oceania-Americas 9.667472 4.319650 15.015293 8.648670e-06
## 10 continent Oceania-Europe 2.422522 -2.892185 7.737230 7.250559e-01
This will make your life a lot better when you’re trying to export data.
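For instance, once the output is a dataframe, exporting it takes one line - a sketch, with a file name of my choosing:
TukeyResults <- broom::tidy(TukeyHSD(aov(lifeExp ~ continent, data = gapminder), ordered = TRUE))
write_csv(TukeyResults, "continent_tukey.csv") # write_csv() comes from readr, loaded with the tidyverse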
4.3 Functional Programming
So, it's probably worth finding the correlation between life expectancy and GDP for each continent. We could filter our data five times - once per continent - but that breaks our rule: if we have to do something more than twice, there's probably a better way.
Now, fair warning, what we’re about to do is probably the hardest topic we’ve gone over yet. Understanding it will make a lot of common tasks much easier, but don’t worry if you don’t understand it on the first go-through.
First off, let's load broom explicitly now - the only thing we'll be using it for is the tidy() function.
library(broom)
The tidyverse includes a package - purrr - that's useful for working with lists. As we saw last unit, a list is a type of object which can hold multiple types of data at once. These include the numerics, characters, and logicals we're comfortable with - but also name-value pairs (like z = 0, from the last unit) and, weirdly enough, entire dataframes.
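For instance, here's a quick sketch of a list holding three different types at once, including an entire dataframe:
MixedList <- list(x = 1:3, y = "hello", df = head(gapminder))
MixedList$df # pull the dataframe back out by name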
For instance, let's see what happens when we use the nest() function (which comes from tidyr, another tidyverse package):
gapminder %>%
  nest()
## # A tibble: 1 x 1
## data
## <list>
## 1 <tibble [1,704 x 6]>
All of our data is now stored in a single cell, as a tibble!
We can exclude columns from being nested with -, if we want, splitting the data into as many subsets as there are levels of that variable:
gapminder %>%
  nest(-continent)
## # A tibble: 5 x 2
## continent data
## <fct> <list>
## 1 Asia <tibble [396 x 5]>
## 2 Europe <tibble [360 x 5]>
## 3 Africa <tibble [624 x 5]>
## 4 Americas <tibble [300 x 5]>
## 5 Oceania <tibble [24 x 5]>
And when we do this, we can subset the list in the exact same way we would a normal vector:
Nested <- gapminder %>%
  nest(-continent)
Nested[1, 2]
## # A tibble: 1 x 1
## data
## <list>
## 1 <tibble [396 x 5]>
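That result is still a 1 x 1 tibble wrapping our data. To pull out the actual tibble, use [[ (or $) instead - a quick sketch:
Nested$data[[1]] # the first nested tibble (the Asia rows), as a plain tibble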
If we then wanted to, we could unnest() the data:
Nested <- gapminder %>%
  nest(-continent)
Nested[1, 2] %>%
  unnest()
## # A tibble: 396 x 5
## country year lifeExp pop gdpPercap
## <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan 1952 28.8 8425333 779.
## 2 Afghanistan 1957 30.3 9240934 821.
## 3 Afghanistan 1962 32.0 10267083 853.
## 4 Afghanistan 1967 34.0 11537966 836.
## 5 Afghanistan 1972 36.1 13079460 740.
## 6 Afghanistan 1977 38.4 14880372 786.
## 7 Afghanistan 1982 39.9 12881816 978.
## 8 Afghanistan 1987 40.8 13867957 852.
## 9 Afghanistan 1992 41.7 16317921 649.
## 10 Afghanistan 1997 41.8 22227415 635.
## # ... with 386 more rows
But what's interesting is that we can also manipulate the data while it's nested. purrr provides a number of "map" functions, which apply a function to each element of a list and return the outputs as a new list.
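As a minimal sketch, mapping nrow() over the nested column counts the rows in each continent's tibble (NRows is just a name I made up):
Nested %>%
  mutate(NRows = map_int(data, nrow)) # map_int() simplifies the output to an integer vector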
What this allows us to do is create additional nested lists as new columns, letting us find correlations between groups much faster. For instance, if we want to create a column Cor holding the output of the correlation tests, we could do the following:
gapminder %>%
  nest(-continent) %>%
  mutate(Cor = map(data, ~ cor.test(.$lifeExp, .$gdpPercap)))
## # A tibble: 5 x 3
## continent data Cor
## <fct> <list> <list>
## 1 Asia <tibble [396 x 5]> <S3: htest>
## 2 Europe <tibble [360 x 5]> <S3: htest>
## 3 Africa <tibble [624 x 5]> <S3: htest>
## 4 Americas <tibble [300 x 5]> <S3: htest>
## 5 Oceania <tibble [24 x 5]> <S3: htest>
The ~ indicates that what follows is a function to be applied to each element of the list. The . is what's referred to as a pronoun - it's a short name for the element currently being handed to that function. That's why we can subset it with $ - while we're inside the map function, . is your data. We'll be using . repeatedly for the rest of the course.
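If the shorthand feels opaque, the same call can be written with an ordinary anonymous function - this sketch is exactly equivalent:
gapminder %>%
  nest(-continent) %>%
  mutate(Cor = map(data, function(df) cor.test(df$lifeExp, df$gdpPercap)))
# function(df) ... is what ~ cor.test(.$lifeExp, .$gdpPercap) abbreviates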
Now, because cor.test() doesn't provide a tidy output - it produces something human-readable, but not computer-usable - we have to tidy it up before we can extract our numbers. That's where tidy() comes in, which we use much like we just used cor.test():
gapminder %>%
  nest(-continent) %>%
  mutate(Cor = map(data, ~ cor.test(.$lifeExp, .$gdpPercap)),
         TidyCor = map(Cor, ~ tidy(.)))
## # A tibble: 5 x 4
## continent data Cor TidyCor
## <fct> <list> <list> <list>
## 1 Asia <tibble [396 x 5]> <S3: htest> <data.frame [1 x 8]>
## 2 Europe <tibble [360 x 5]> <S3: htest> <data.frame [1 x 8]>
## 3 Africa <tibble [624 x 5]> <S3: htest> <data.frame [1 x 8]>
## 4 Americas <tibble [300 x 5]> <S3: htest> <data.frame [1 x 8]>
## 5 Oceania <tibble [24 x 5]> <S3: htest> <data.frame [1 x 8]>
That last column - made up of dataframes - is exactly what we want. We can extract it from this dataframe using unnest(.drop = TRUE), which will drop the other nested columns:
gapminder %>%
  nest(-continent) %>%
  mutate(Cor = map(data, ~ cor.test(.$lifeExp, .$gdpPercap)),
         TidyCor = map(Cor, ~ tidy(.))) %>%
  unnest(TidyCor, .drop = TRUE)
## # A tibble: 5 x 9
## continent estimate statistic p.value parameter conf.low conf.high
## <fct> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Asia 0.382 8.21 3.29e-15 394 0.295 0.463
## 2 Europe 0.781 23.6 4.05e-75 358 0.737 0.818
## 3 Africa 0.426 11.7 7.60e-29 622 0.359 0.488
## 4 Americas 0.558 11.6 5.45e-26 298 0.475 0.632
## 5 Oceania 0.956 15.4 2.99e-13 22 0.901 0.981
## # ... with 2 more variables: method <fct>, alternative <fct>
And tada, we have the output from five correlation tests in one step, rather than the ten it would take to do the long way.
When I was learning how to do this, I kept thinking it was a waste of my time, because learning it took significantly longer than just doing things the long way. But this method has obvious benefits: we have less code replication, which means faster programs and fewer chances for typos; all of our data is in a single dataframe, rather than in five chunks; and we can compare our results immediately, rather than sifting through paragraphs of output to find the same numbers. Beyond that, this method is extremely applicable to many other areas of coding. If you ever have to build models or analyze grouped data, this is the way to make sure that your work is reproducible and your science is sound.
So even though it might take a bit to fully internalize, keep trying these efficient paths when we use them in this reader. We’ll be returning to this format repeatedly, in order to build models, analyze results, and compare groups.
4.3.1 Column Names
In order to do that, of course, we have to understand what each of those column names represents. In order:
- estimate is whatever parameter is being estimated - here, the correlation coefficient.
- statistic is the test statistic used to calculate the p-value - in this case, t.
- p.value is, well, the p-value.
- parameter is the degrees of freedom.
- conf.low and conf.high are the bounds of the confidence interval of the estimate - so here, of the correlation coefficient. It's calculated at 95% unless you change the confidence level by setting cor.test(conf.level = ##).
- method is the test used.
- alternative is the alternative hypothesis tested - you can change it by setting cor.test(alternative = "") to "two.sided", "less", or "greater".
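In practice, you'll usually keep only a few of these columns - a sketch using select():
gapminder %>%
  nest(-continent) %>%
  mutate(Cor = map(data, ~ cor.test(.$lifeExp, .$gdpPercap)),
         TidyCor = map(Cor, ~ tidy(.))) %>%
  unnest(TidyCor, .drop = TRUE) %>%
  select(continent, estimate, conf.low, conf.high, p.value) # just the correlations and their significance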
4.4 Modeling
Okay, time to get back to data analysis.
If we want to see the effect of two continuous variables (say, gdpPercap and year) on lifeExp, we'd use what's known as multiple regression. There are a lot of different flavors of multiple regression, but the simplest is what's known as general linear regression - an extension of the y = mx + b formula you went over in high school. R makes it easy to fit a general linear model to your data (via ordinary least squares) using the lm() function, with the same arguments we used for aov():
lm(lifeExp ~ gdpPercap + year, data = gapminder)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap + year, data = gapminder)
##
## Coefficients:
## (Intercept) gdpPercap year
## -4.184e+02 6.697e-04 2.390e-01
These coefficients show how much we can expect life expectancy to increase when GDP or the year increases by one unit. The intercept is what life expectancy would be if both GDP and the year were 0 - obviously, this number doesn't make a ton of sense, since we don't have any data on life expectancy in either of those situations. There are ways to correct for this, but we aren't going to use them here - more information can be found here.
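To see those coefficients at work, we can ask the model for a prediction at more sensible values - a sketch, with GDP and year chosen purely for illustration:
Model <- lm(lifeExp ~ gdpPercap + year, data = gapminder)
# roughly -418.4 + 0.00067 * 10000 + 0.239 * 2000, or about 66 years
predict(Model, newdata = data.frame(gdpPercap = 10000, year = 2000))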
If we want to see how well this model fits our data, we can graph it:
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_jitter() +
  geom_smooth(method = "lm")
Hmm. Seems like a pretty bad fit.
If we want statistical estimates of how well our model fits the data, we can feed it as an argument to summary():
Model <- lm(lifeExp ~ gdpPercap + year, data = gapminder)
summary(Model)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap + year, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.262 -6.954 1.219 7.759 19.553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.184e+02 2.762e+01 -15.15 <2e-16 ***
## gdpPercap 6.697e-04 2.447e-05 27.37 <2e-16 ***
## year 2.390e-01 1.397e-02 17.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.694 on 1701 degrees of freedom
## Multiple R-squared: 0.4375, Adjusted R-squared: 0.4368
## F-statistic: 661.4 on 2 and 1701 DF, p-value: < 2.2e-16
As we can see, both GDP and the year are significant predictors of life expectancy, but the linear model isn't a great fit for the data - our R² is only 0.44.
Perhaps we could improve the model by accounting for the interaction between GDP and the year. There are two ways to include interaction terms in models, each of which has its benefits. One is to replace the + in the model with *, as follows:
Model <- lm(lifeExp ~ gdpPercap * year, data = gapminder)
summary(Model)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap * year, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.234 -7.314 1.002 7.951 19.780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.532e+02 3.267e+01 -10.811 < 2e-16 ***
## gdpPercap -8.754e-03 2.547e-03 -3.437 0.000602 ***
## year 2.060e-01 1.653e-02 12.463 < 2e-16 ***
## gdpPercap:year 4.754e-06 1.285e-06 3.700 0.000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.658 on 1700 degrees of freedom
## Multiple R-squared: 0.442, Adjusted R-squared: 0.441
## F-statistic: 448.8 on 3 and 1700 DF, p-value: < 2.2e-16
The other is to specify which terms you want to track the interaction of, using :, as follows:
Model <- lm(lifeExp ~ gdpPercap + year + gdpPercap:year, data = gapminder)
summary(Model)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap + year + gdpPercap:year, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.234 -7.314 1.002 7.951 19.780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.532e+02 3.267e+01 -10.811 < 2e-16 ***
## gdpPercap -8.754e-03 2.547e-03 -3.437 0.000602 ***
## year 2.060e-01 1.653e-02 12.463 < 2e-16 ***
## gdpPercap:year 4.754e-06 1.285e-06 3.700 0.000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.658 on 1700 degrees of freedom
## Multiple R-squared: 0.442, Adjusted R-squared: 0.441
## F-statistic: 448.8 on 3 and 1700 DF, p-value: < 2.2e-16
As you can see, these outputs are identical. Even so, I personally prefer using the second format, because it makes the interaction term a more explicit part of your model. While it doesn’t matter in simple models like the one we’ve developed here, once you have a large number of terms, it’s helpful to specify which interactions you’re measuring.
4.4.1 Mixed Models
That R² of 0.44 is an improvement, but still not great. We could try combining continuous and categorical variables in a single model, to try and better predict life expectancy.
Doing that in R is very simple - just add the categorical variable to your model like you would a continuous one. Let's do that with continent:
Model <- lm(lifeExp ~ gdpPercap + year + continent + gdpPercap:year + gdpPercap:continent + year:continent + gdpPercap:year:continent, data = gapminder)
summary(Model)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap + year + continent + gdpPercap:year +
## gdpPercap:continent + year:continent + gdpPercap:year:continent,
## data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.7123 -3.2453 0.1535 3.4204 17.7379
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.241e+02 3.777e+01 -11.229 < 2e-16
## gdpPercap -2.168e-02 1.226e-02 -1.768 0.07721
## year 2.377e-01 1.909e-02 12.457 < 2e-16
## continentAmericas -4.065e+02 7.317e+01 -5.556 3.20e-08
## continentAsia -2.258e+02 5.723e+01 -3.946 8.27e-05
## continentEurope 7.212e+01 7.849e+01 0.919 0.35830
## continentOceania 3.047e+02 4.846e+02 0.629 0.52956
## gdpPercap:year 1.149e-05 6.176e-06 1.860 0.06308
## gdpPercap:continentAmericas 7.174e-02 1.445e-02 4.963 7.63e-07
## gdpPercap:continentAsia 7.051e-03 1.251e-02 0.563 0.57318
## gdpPercap:continentEurope 4.049e-02 1.317e-02 3.075 0.00214
## gdpPercap:continentOceania 1.275e-02 3.527e-02 0.362 0.71768
## year:continentAmericas 2.119e-01 3.694e-02 5.736 1.15e-08
## year:continentAsia 1.196e-01 2.894e-02 4.133 3.75e-05
## year:continentEurope -2.636e-02 3.970e-02 -0.664 0.50684
## year:continentOceania -1.408e-01 2.461e-01 -0.572 0.56739
## gdpPercap:year:continentAmericas -3.634e-05 7.275e-06 -4.996 6.46e-07
## gdpPercap:year:continentAsia -3.945e-06 6.305e-06 -0.626 0.53163
## gdpPercap:year:continentEurope -2.076e-05 6.629e-06 -3.133 0.00176
## gdpPercap:year:continentOceania -6.939e-06 1.757e-05 -0.395 0.69289
##
## (Intercept) ***
## gdpPercap .
## year ***
## continentAmericas ***
## continentAsia ***
## continentEurope
## continentOceania
## gdpPercap:year .
## gdpPercap:continentAmericas ***
## gdpPercap:continentAsia
## gdpPercap:continentEurope **
## gdpPercap:continentOceania
## year:continentAmericas ***
## year:continentAsia ***
## year:continentEurope
## year:continentOceania
## gdpPercap:year:continentAmericas ***
## gdpPercap:year:continentAsia
## gdpPercap:year:continentEurope **
## gdpPercap:year:continentOceania
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.307 on 1684 degrees of freedom
## Multiple R-squared: 0.7643, Adjusted R-squared: 0.7616
## F-statistic: 287.4 on 19 and 1684 DF, p-value: < 2.2e-16
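Incidentally, R's formula syntax has a shorthand for "all main effects plus all their interactions": the long model above can be written compactly with *, as in this sketch, which fits an identical model:
Model <- lm(lifeExp ~ gdpPercap * year * continent, data = gapminder)
summary(Model) # expands to the same terms as the long-form call above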
Our R² is now up to a respectable 0.76! It looks like these three variables - and the interactions between them - explain a lot of the variance in our data. To get a cleaner table of the impact each variable has on life expectancy, we can perform an analysis of covariance, or ANCOVA. To do this, we just swap our lm() call for aov(), and then pass that aov() object to summary():
Model <- aov(lifeExp ~ gdpPercap + year + continent + gdpPercap:year + gdpPercap:continent + year:continent + gdpPercap:year:continent, data = gapminder)
summary(Model)
## Df Sum Sq Mean Sq F value Pr(>F)
## gdpPercap 1 96813 96813 2434.192 <2e-16 ***
## year 1 27495 27495 691.316 <2e-16 ***
## continent 4 79428 19857 499.269 <2e-16 ***
## gdpPercap:year 1 18 18 0.461 0.497
## gdpPercap:continent 4 5420 1355 34.069 <2e-16 ***
## year:continent 4 4331 1083 27.225 <2e-16 ***
## gdpPercap:year:continent 4 3667 917 23.048 <2e-16 ***
## Residuals 1684 66976 40
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output means we can now say that GDP has a significant impact on life expectancy when controlling for year as a covariate - and that all the interaction terms are significant, except for GDP and year.
One way to get around having to use an ANCOVA is to develop separate regression models for each continent. We can do that in much the same way we tested for correlation on each continent:
gapminder %>%
  nest(-continent) %>%
  mutate(Models = map(data, ~ lm(lifeExp ~ gdpPercap + year + gdpPercap:year, data = .)),
         TidyMods = map(Models, ~ tidy(.))) %>%
  unnest(TidyMods, .drop = TRUE)
## # A tibble: 20 x 6
## continent term estimate std.error statistic p.value
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Asia (Intercept) -650. 54.6 -11.9 4.38e-28
## 2 Asia gdpPercap -0.0146 0.00318 -4.60 5.71e- 6
## 3 Asia year 0.357 0.0276 12.9 4.34e-32
## 4 Asia gdpPercap:year 0.00000754 0.00000161 4.68 3.97e- 6
## 5 Europe (Intercept) -352. 29.8 -11.8 1.92e-27
## 6 Europe gdpPercap 0.0188 0.00208 9.06 8.85e-18
## 7 Europe year 0.211 0.0151 14.0 6.51e-36
## 8 Europe gdpPercap:year -0.00000928 0.00000104 -8.90 2.80e-17
## 9 Africa (Intercept) -424. 41.9 -10.1 2.09e-22
## 10 Africa gdpPercap -0.0217 0.0136 -1.59 1.11e- 1
## 11 Africa year 0.238 0.0212 11.2 8.83e-27
## 12 Africa gdpPercap:year 0.0000115 0.00000685 1.68 9.40e- 2
## 13 Americas (Intercept) -831. 54.3 -15.3 2.46e-39
## 14 Americas gdpPercap 0.0501 0.00663 7.55 5.53e-13
## 15 Americas year 0.450 0.0274 16.4 1.81e-43
## 16 Americas gdpPercap:year -0.0000249 0.00000333 -7.46 9.52e-13
## 17 Oceania (Intercept) -119. 41.3 -2.89 9.07e- 3
## 18 Oceania gdpPercap -0.00892 0.00283 -3.15 5.01e- 3
## 19 Oceania year 0.0969 0.0210 4.62 1.67e- 4
## 20 Oceania gdpPercap:year 0.00000455 0.00000141 3.23 4.19e- 3
Note that I deleted all the "continent" terms from the model, since we're now fitting a separate model for each continent. I've also used the . pronoun to represent our data here.
In this case, estimate represents the coefficient (beta) for each variable, while statistic is the t statistic used to test whether each coefficient differs from zero.
4.4.2 Conclusion
The important takeaways from this unit are not necessarily the specific statistical tests used - those will vary depending on your purpose - so much as the methods highlighted. Understanding how to generate hypotheses from a new dataset - and then how to drill down and analyze each one in turn - is a cross-disciplinary skill used in any new analysis project.
Later in the course, you'll be given projects which will require you to understand brand new datasets, manipulate them, perform complex analyses on them, and visualize them. EDA will give you the familiarity with your data needed to find patterns, isolate them appropriately, and perform the right analyses moving forward. The entire purpose of this unit is to give you the skillset to identify what those analyses might be, by showing how to generate hypotheses from a combination of data visualization and manipulation.
What we haven't done is confirmatory analysis, where we start our work with a model in mind and then report how well it fit the data. Once data has been used for exploratory analysis, it can't be used again for confirmatory work. There are methods to split data for confirmatory analysis while still reserving some data for exploration - for more, see Hadley Wickham's overview here. We'll be working with these methods in unit 10.
4.5 Exercises
4.5.1 Answer the following:
- The output of psych::describe(gapminder_unfiltered) put stars after country and continent. Why?
- Make a histogram of a gapminder variable other than population. Describe the graph with regards to its skewness and kurtosis.
- Compute an ANOVA for the impacts of continent on population. Report the results as you would in a manuscript.
- Fit a regression model to the impacts of the current year and life expectancy on GDP. Why does this model not make sense conceptually?
- Fit a linear model explaining life expectancy as a function of the current year for each country in the dataset (use the formula lifeExp ~ year). Then tidy the model outputs and look at the p-values for each coefficient. If you don't hate yourself, you'll try the functional programming approach.