Chapter 5 Data Analysis
5.1 Descriptive statistics
5.1.1 Univariate analysis
Index of qualitative variation (categorical variables)
## Frequencies
## allbus2012$female
## Type: Factor
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
## Male 1725 49.57 49.57 49.57 49.57
## Female 1755 50.43 100.00 50.43 100.00
## <NA> 0 0.00 100.00
## Total 3480 100.00 100.00 100.00 100.00
## Frequencies
## allbus2012$migrant
## Type: Factor
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
## Native 3298 94.77 94.77 94.77 94.77
## Migrant 182 5.23 100.00 5.23 100.00
## <NA> 0 0.00 100.00
## Total 3480 100.00 100.00 100.00 100.00
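The heading above mentions the index of qualitative variation (IQV), but the text gives no formula for it. The sketch below assumes the common standardized form, IQV = K/(K-1) * (1 - sum of squared proportions), where K is the number of categories. The helper `iqv()` is a hypothetical name, not a function from any package used in this chapter. The counts are copied from the two frequency tables above.

```r
# Hypothetical helper: standardized index of qualitative variation.
# Assumed formula: IQV = K / (K - 1) * (1 - sum(p^2)); 0 = no variation, 1 = maximal variation.
iqv <- function(counts) {
  p <- counts / sum(counts)   # category proportions
  k <- length(counts)         # number of categories
  (k / (k - 1)) * (1 - sum(p^2))
}

# Counts taken from the frequency tables for female and migrant
iqv_female  <- iqv(c(Male = 1725, Female = 1755))
iqv_migrant <- iqv(c(Native = 3298, Migrant = 182))

round(c(female = iqv_female, migrant = iqv_migrant), 3)
```

Under this definition, the nearly even split of `female` gives a value close to 1, while the very uneven split of `migrant` gives a value much closer to 0.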
Measures of central tendency (metric variables)
The most important measures of central tendency are the arithmetic mean, the median, and the mode.
allbus2012 %>%
  select(class, imp_nei, imp_fr, imp_fam, health, finance, lifesat) %>%
  descr(stats = c("min", "max", "med", "mean"), transpose = TRUE)
## Descriptive Statistics
## allbus2012
## Label: GGSScompact 2012
## N: 3480
## Min Max Median Mean
## ------------- ------ ------- -------- ------
## class 1.00 5.00 3.00 2.77
## finance 1.00 5.00 4.00 3.53
## health 1.00 5.00 4.00 3.55
## imp_fam 1.00 7.00 7.00 6.50
## imp_fr 1.00 7.00 6.00 5.68
## imp_nei 1.00 7.00 5.00 4.60
## lifesat 1.00 11.00 9.00 8.64
## Frequencies
## allbus2012$lifesat
## Type: Numeric
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 1 8 0.23 0.23 0.23 0.23
## 2 6 0.17 0.40 0.17 0.40
## 3 26 0.75 1.15 0.75 1.15
## 4 46 1.32 2.47 1.32 2.47
## 5 76 2.19 4.66 2.18 4.66
## 6 283 8.14 12.80 8.13 12.79
## 7 263 7.56 20.36 7.56 20.34
## 8 606 17.43 37.79 17.41 37.76
## 9 1071 30.80 68.59 30.78 68.53
## 10 653 18.78 87.37 18.76 87.30
## 11 439 12.63 100.00 12.61 99.91
## <NA> 3 0.09 100.00
## Total 3480 100.00 100.00 100.00 100.00
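Base R has no built-in function for the statistical mode, so it can help to see all three measures computed side by side. The sketch below rebuilds the valid `lifesat` values from the frequency table above (the three missing values are left out) and derives the mean, median, and mode. The object names are illustrative only.

```r
# Rebuild the valid lifesat observations from the frequency table above
values <- 1:11
freqs  <- c(8, 6, 26, 46, 76, 283, 263, 606, 1071, 653, 439)
lifesat <- rep(values, times = freqs)

mean_ls   <- mean(lifesat)            # arithmetic mean
median_ls <- median(lifesat)          # middle value of the sorted data
mode_ls   <- values[which.max(freqs)] # most frequent category

c(mean = round(mean_ls, 2), median = median_ls, mode = mode_ls)
```

The mean and median should agree with the `descr()` output for `lifesat`. The mode is simply the row with the highest frequency.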
Measures of dispersion (metric variables)
The most common measures of dispersion are the variance, the standard deviation (sd), and the coefficient of variation (CV = sd/mean).
allbus2012 %>%
  select(class, starts_with("imp"), health, finance, lifesat) %>%
  descr(stats = c("min", "max", "med", "mean", "sd", "cv"), transpose = TRUE)
## Descriptive Statistics
## allbus2012
## Label: GGSScompact 2012
## N: 3480
## Min Max Median Mean Std.Dev CV
## ------------- ------ ------- -------- ------ --------- ------
## class 1.00 5.00 3.00 2.77 0.66 0.24
## finance 1.00 5.00 4.00 3.53 0.80 0.23
## health 1.00 5.00 4.00 3.55 1.00 0.28
## imp_fam 1.00 7.00 7.00 6.50 1.15 0.18
## imp_fr 1.00 7.00 6.00 5.68 1.19 0.21
## imp_nei 1.00 7.00 5.00 4.60 1.59 0.35
## lifesat 1.00 11.00 9.00 8.64 1.72 0.20
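To make the relation between the three dispersion measures concrete, the following sketch recomputes them for `lifesat` in base R. The vector is reconstructed from the frequency table shown earlier in this section (valid cases only). The variable names are illustrative.

```r
# Reconstruct valid lifesat values from the frequency table
lifesat <- rep(1:11, times = c(8, 6, 26, 46, 76, 283, 263, 606, 1071, 653, 439))

var_ls <- var(lifesat)            # sample variance (divides by n - 1)
sd_ls  <- sqrt(var_ls)            # standard deviation = square root of the variance
cv_ls  <- sd_ls / mean(lifesat)   # coefficient of variation = sd / mean

round(c(variance = var_ls, sd = sd_ls, cv = cv_ls), 2)
```

Because the CV divides by the mean, it is only meaningful for variables with a positive mean and a true zero point. For rating scales like the ones used here, it is best read as a rough comparison.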
Five-number summary (see Tukey 1975)
allbus2012 %>%
  select(class, starts_with("imp"), health, finance, lifesat) %>%
  descr(stats = c("fivenum"), transpose = TRUE)
## Descriptive Statistics
## allbus2012
## Label: GGSScompact 2012
## N: 3480
## Min Q1 Median Q3 Max
## ------------- ------ ------ -------- ------- -------
## class 1.00 2.00 3.00 3.00 5.00
## finance 1.00 3.00 4.00 4.00 5.00
## health 1.00 3.00 4.00 4.00 5.00
## imp_fam 1.00 7.00 7.00 7.00 7.00
## imp_fr 1.00 5.00 6.00 7.00 7.00
## imp_nei 1.00 4.00 5.00 6.00 7.00
## lifesat 1.00 8.00 9.00 10.00 11.00
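The five-number summary is also available in base R through `fivenum()`, which returns the minimum, lower hinge, median, upper hinge, and maximum. A minimal sketch for `lifesat`, using values reconstructed from the frequency table earlier in this section:

```r
# Reconstruct valid lifesat values from the frequency table
lifesat <- rep(1:11, times = c(8, 6, 26, 46, 76, 283, 263, 606, 1071, 653, 439))

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
five <- fivenum(lifesat)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")
five
```

Hinges and quantiles from `quantile()` can differ slightly in small samples. With discrete data of this size the two usually coincide.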
Skewness and kurtosis
allbus2012 %>%
  select(class, starts_with("imp"), health, finance, lifesat) %>%
  descr(stats = c("min", "max", "med", "mean", "skewness", "kurtosis"),
        transpose = TRUE)
## Descriptive Statistics
## allbus2012
## Label: GGSScompact 2012
## N: 3480
## Min Max Median Mean Skewness Kurtosis
## ------------- ------ ------- -------- ------ ---------- ----------
## class 1.00 5.00 3.00 2.77 -0.05 0.54
## finance 1.00 5.00 4.00 3.53 -0.83 0.74
## health 1.00 5.00 4.00 3.55 -0.47 -0.20
## imp_fam 1.00 7.00 7.00 6.50 -2.89 8.67
## imp_fr 1.00 7.00 6.00 5.68 -0.84 0.48
## imp_nei 1.00 7.00 5.00 4.60 -0.35 -0.52
## lifesat 1.00 11.00 9.00 8.64 -0.98 1.33
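Skewness and kurtosis can be derived from the central moments of a variable. The sketch below uses the simple moment-based estimators, which is an assumption: the estimator used by `descr()` may apply a small-sample correction, so tiny differences are possible. The helper `central_moment()` is a hypothetical name. The data are again the `lifesat` values reconstructed from the frequency table.

```r
# Reconstruct valid lifesat values from the frequency table
lifesat <- rep(1:11, times = c(8, 6, 26, 46, 76, 283, 263, 606, 1071, 653, 439))

# Hypothetical helper: k-th central moment (divides by n)
central_moment <- function(x, k) mean((x - mean(x))^k)

m2 <- central_moment(lifesat, 2)
m3 <- central_moment(lifesat, 3)
m4 <- central_moment(lifesat, 4)

skew_ls <- m3 / m2^1.5     # negative values indicate a longer left tail
kurt_ls <- m4 / m2^2 - 3   # excess kurtosis: 0 for a normal distribution

round(c(skewness = skew_ls, kurtosis = kurt_ls), 2)
```

A negative skewness together with a positive excess kurtosis fits the picture from the frequency table: most respondents report high life satisfaction, with a thin tail of low values.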
Histogram and box-plot
ggplot(allbus2012, aes(x = lifesat)) +
  geom_histogram() +
  labs(title = "Life satisfaction") +
  scale_x_continuous(breaks = 1:11) +
  theme_stata(scheme = "s1mono")
ggplot(allbus2012, aes(x = female, y = lifesat)) +
  geom_boxplot() +
  labs(x = "", y = "Life satisfaction") +
  theme_stata(scheme = "s1mono")
5.1.2 Bivariate analysis
Categorical variables: Chi-squared and Cramér's V
ctable(allbus2012$female, allbus2012$migrant,
prop = "none")
## Cross-Tabulation
## female * migrant
## Data Frame: allbus2012
## Label: GGSScompact 2012
## -------- --------- -------- --------- -------
## migrant Native Migrant Total
## female
## Male 1643 82 1725
## Female 1655 100 1755
## Total 3298 182 3480
## -------- --------- -------- --------- -------
allbus2012 %>%
  crosstable_statistics(female, migrant,
                        statistics = "cramer",
                        correct = FALSE)
## # Measure of Association for Contingency Tables
## Chi-squared: 1.5654
## Cramer's V: 0.0212
## p-value: 0.2109
ctable(allbus2012$female, allbus2012$astrology,
prop = "none")
## Cross-Tabulation
## female * astrology
## Data Frame: allbus2012
## Label: GGSScompact 2012
## -------- ----------- ------ ----- ------ -------
## astrology No Yes <NA> Total
## female
## Male 1431 290 4 1725
## Female 1245 506 4 1755
## Total 2676 796 8 3480
## -------- ----------- ------ ----- ------ -------
allbus2012 %>%
  crosstable_statistics(female, astrology,
                        statistics = "cramer",
                        correct = FALSE)
## # Measure of Association for Contingency Tables
## Chi-squared: 71.2874
## Cramer's V: 0.1433
## p-value: <0.001
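Both statistics can be reproduced from the cross-tabulations alone. The sketch below assumes the usual definition of Cramér's V, sqrt(chi-squared / (n * min(rows - 1, columns - 1))), and uses base R's `chisq.test()` without continuity correction. The helper `cramers_v()` is a hypothetical name. The cell counts are copied from the two tables above, leaving out the missing values on `astrology`.

```r
# Hypothetical helper: chi-squared and Cramér's V for a contingency table
cramers_v <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  c(chi2 = chi2, V = sqrt(chi2 / (n * k)))
}

# female x migrant (counts from the first cross-tabulation)
tab_migrant <- matrix(c(1643, 82, 1655, 100), nrow = 2, byrow = TRUE,
                      dimnames = list(female = c("Male", "Female"),
                                      migrant = c("Native", "Migrant")))

# female x astrology (valid cases only)
tab_astro <- matrix(c(1431, 290, 1245, 506), nrow = 2, byrow = TRUE,
                    dimnames = list(female = c("Male", "Female"),
                                    astrology = c("No", "Yes")))

res_migrant <- cramers_v(tab_migrant)
res_astro   <- cramers_v(tab_astro)
round(rbind(migrant = res_migrant, astrology = res_astro), 4)
```

With n this large, even the weak association between gender and belief in astrology (V around 0.14) is highly significant, while the association with migration status is negligible.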
Metric and categorical
t.test(allbus2012$lifesat ~ allbus2012$female,
var.equal = TRUE)
## Two Sample t-test
## data: allbus2012$lifesat by allbus2012$female
## t = -1.9648, df = 3475, p-value = 0.04951
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2288332507 -0.0002433446
## sample estimates:
## mean in group Male mean in group Female
## 8.583865 8.698404
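With `var.equal = TRUE`, the test statistic is based on a pooled variance estimate. The following sketch shows that calculation on simulated data, since the raw survey data are not reproduced here. The data, group sizes, and object names are all made up for illustration.

```r
set.seed(1)

# Simulated stand-in data (not the survey data)
group <- factor(rep(c("Male", "Female"), each = 50), levels = c("Male", "Female"))
y <- rnorm(100, mean = ifelse(group == "Male", 8.6, 8.7), sd = 1.7)

y1 <- y[group == "Male"]
y2 <- y[group == "Female"]
n1 <- length(y1)
n2 <- length(y2)

# Pooled variance: weighted average of the two group variances
sp2 <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)

# t statistic: difference in means over its standard error
t_manual  <- (mean(y1) - mean(y2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df_manual <- n1 + n2 - 2

t_builtin <- t.test(y ~ group, var.equal = TRUE)
c(manual = t_manual, builtin = unname(t_builtin$statistic), df = df_manual)
```

The sign of t depends on the order of the factor levels: the first level's mean is the one that enters first. This is why the output above shows a negative t for a higher mean among women.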
Metric and metric
allbus2012 %>%
  select(lifesat, finance, health, class) %>%
  correlation()
## Parameter1 | Parameter2 | r | 95% CI | t | df | p | Method | n_Obs
## ---------------------------------------------------------------------------------------
## lifesat | finance | 0.44 | [0.42, 0.47] | 29.24 | 3471 | < .001 | Pearson | 3473
## lifesat | health | 0.31 | [0.28, 0.34] | 19.49 | 3473 | < .001 | Pearson | 3475
## lifesat | class | 0.24 | [0.21, 0.27] | 14.35 | 3428 | < .001 | Pearson | 3430
## finance | health | 0.23 | [0.20, 0.26] | 14.07 | 3472 | < .001 | Pearson | 3474
## finance | class | 0.34 | [0.31, 0.37] | 21.35 | 3427 | < .001 | Pearson | 3429
## health | class | 0.19 | [0.16, 0.23] | 11.52 | 3429 | < .001 | Pearson | 3431
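The t and df columns in the table follow directly from r and the number of complete pairs: df = n - 2 and t = r * sqrt(df / (1 - r^2)). A minimal base R sketch on simulated data (the variables `x` and `y` are made up for illustration) checks this against `cor.test()`:

```r
set.seed(42)

# Simulated stand-in data with a moderate positive correlation
x <- rnorm(200)
y <- 0.4 * x + rnorm(200)

r  <- cor(x, y)                         # Pearson correlation
df <- length(x) - 2                     # degrees of freedom
t_manual <- r * sqrt(df / (1 - r^2))    # test statistic for H0: rho = 0

ct <- cor.test(x, y, method = "pearson")
c(r = r, t_manual = t_manual, t_builtin = unname(ct$statistic), df = df)
```

This also explains why df in the table is always two less than n_Obs, and why n_Obs differs between pairs: each correlation uses only the cases that are complete on both variables.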
5.2 Inferential statistics
5.2.1 Linear regression
OLS-regression and diagnostics
ols <- lm(lifesat ~ finance + health + imp_fam + imp_fr + imp_nei +
            class + age + age2 + female + migrant,
          data = allbus2012)
tab_model(ols, show.se = TRUE, digits = 3)
lifesat | ||||
Predictors | Estimates | std. Error | CI | p |
(Intercept) | 2.772 | 0.298 | 2.188 – 3.356 | <0.001 |
finance | 0.747 | 0.035 | 0.679 – 0.815 | <0.001 |
health | 0.398 | 0.028 | 0.344 – 0.453 | <0.001 |
imp_fam | 0.122 | 0.023 | 0.078 – 0.167 | <0.001 |
imp_fr | 0.060 | 0.023 | 0.015 – 0.105 | 0.009 |
imp_nei | 0.080 | 0.018 | 0.045 – 0.114 | <0.001 |
class | 0.201 | 0.041 | 0.121 – 0.281 | <0.001 |
age | -0.021 | 0.008 | -0.036 – -0.005 | 0.008 |
age2 | 0.000 | 0.000 | 0.000 – 0.000 | 0.001 |
female [Female] | 0.113 | 0.051 | 0.013 – 0.214 | 0.026 |
migrant [Migrant] | 0.074 | 0.115 | -0.153 – 0.300 | 0.523 |
Observations | 3407 | |||
R2 / R2 adjusted | 0.270 / 0.268 |
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
## Data
## -----------------------------------
## Response : lifesat
## Variables: fitted values of lifesat
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 257.3509
## Prob > Chi2 = 6.485887e-58
resettest(ols, power = 2:4, type = "fitted", data = allbus2012)
## RESET test
## data: ols
## RESET = 0.5925, df1 = 3, df2 = 3393, p-value = 0.6199
Heteroskedasticity robust standard errors
Because the Breusch-Pagan test rejects the null hypothesis of constant variance, heteroskedasticity-robust standard errors should be estimated.
tab_model(ols, vcov.fun = "HC", vcov.type = "HC1", show.se = TRUE, digits = 3)
lifesat | ||||
Predictors | Estimates | std. Error | CI | p |
(Intercept) | 2.772 | 0.324 | 2.137 – 3.407 | <0.001 |
finance | 0.747 | 0.041 | 0.666 – 0.828 | <0.001 |
health | 0.398 | 0.033 | 0.334 – 0.463 | <0.001 |
imp_fam | 0.122 | 0.027 | 0.069 – 0.176 | <0.001 |
imp_fr | 0.060 | 0.026 | 0.009 – 0.112 | 0.022 |
imp_nei | 0.080 | 0.020 | 0.041 – 0.118 | <0.001 |
class | 0.201 | 0.046 | 0.111 – 0.291 | <0.001 |
age | -0.021 | 0.008 | -0.036 – -0.005 | 0.008 |
age2 | 0.000 | 0.000 | 0.000 – 0.000 | 0.001 |
female [Female] | 0.113 | 0.051 | 0.013 – 0.214 | 0.026 |
migrant [Migrant] | 0.074 | 0.130 | -0.182 – 0.329 | 0.571 |
Observations | 3407 | |||
R2 / R2 adjusted | 0.270 / 0.268 |
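To show what an HC1 correction does, the sketch below computes the sandwich estimator by hand in base R on simulated heteroskedastic data. It is not the code behind the table above, and the data and object names are invented for illustration. HC1 is taken here to be White's estimator scaled by n / (n - k).

```r
set.seed(123)

# Simulated data whose error variance grows with x
n <- 500
x <- runif(n, 1, 5)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

X <- model.matrix(fit)   # design matrix including the intercept
e <- residuals(fit)      # OLS residuals
k <- ncol(X)             # number of estimated coefficients

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit)))   # assumes constant error variance
se_hc1     <- sqrt(diag(vcov_hc1))    # robust to heteroskedasticity

round(cbind(estimate = coef(fit), se_classic, se_hc1), 4)
```

The coefficient estimates are unchanged. Only the standard errors, and with them the confidence intervals and p-values, are affected, which matches the pattern in the two regression tables.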
Standardized b-coefficients
Standardized coefficients (betas) can also be requested. Because they are expressed in standard deviation units, they allow the relative strength of the predictors to be compared.
tab_model(ols, show.se = TRUE, show.std = TRUE,
          digits = 3)
lifesat | |||||||
Predictors | Estimates | std. Error | std. Beta | standardized std. Error | CI | standardized CI | p |
(Intercept) | 2.772 | 0.298 | -0.036 | 0.021 | 2.188 – 3.356 | -0.077 – 0.006 | <0.001 |
finance | 0.747 | 0.035 | 0.348 | 0.016 | 0.679 – 0.815 | 0.316 – 0.380 | <0.001 |
health | 0.398 | 0.028 | 0.233 | 0.016 | 0.344 – 0.453 | 0.201 – 0.265 | <0.001 |
imp_fam | 0.122 | 0.023 | 0.081 | 0.015 | 0.078 – 0.167 | 0.052 – 0.111 | <0.001 |
imp_fr | 0.060 | 0.023 | 0.041 | 0.016 | 0.015 – 0.105 | 0.011 – 0.072 | 0.009 |
imp_nei | 0.080 | 0.018 | 0.074 | 0.016 | 0.045 – 0.114 | 0.042 – 0.106 | <0.001 |
class | 0.201 | 0.041 | 0.078 | 0.016 | 0.121 – 0.281 | 0.047 – 0.109 | <0.001 |
age | -0.021 | 0.008 | -0.215 | 0.081 | -0.036 – -0.005 | -0.374 – -0.056 | 0.008 |
age2 | 0.000 | 0.000 | 0.275 | 0.081 | 0.000 – 0.000 | 0.117 – 0.433 | 0.001 |
female [Female] | 0.113 | 0.051 | 0.066 | 0.030 | 0.013 – 0.214 | 0.008 – 0.125 | 0.026 |
migrant [Migrant] | 0.074 | 0.115 | 0.043 | 0.068 | -0.153 – 0.300 | -0.089 – 0.176 | 0.523 |
Observations | 3407 | ||||||
R2 / R2 adjusted | 0.270 / 0.268 |
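For numeric predictors, a standardized coefficient can be obtained in two equivalent ways: by rescaling the unstandardized coefficient with sd(x) / sd(y), or by refitting the model on z-standardized variables. The sketch below shows both on simulated data. The data and names are invented, and the text does not say which method `tab_model()` uses internally.

```r
set.seed(7)

# Simulated data with predictors on very different scales
n <- 300
d <- data.frame(x1 = rnorm(n, mean = 3, sd = 1),
                x2 = rnorm(n, mean = 50, sd = 15))
d$y <- 2 + 0.7 * d$x1 - 0.02 * d$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)
b <- coef(fit)[-1]   # unstandardized slopes, intercept dropped

# Way 1: rescale each slope by sd(x) / sd(y)
beta_manual <- b * sapply(d[names(b)], sd) / sd(d$y)

# Way 2: refit on z-standardized variables
fit_std <- lm(scale(y) ~ scale(x1) + scale(x2), data = d)
beta_refit <- coef(fit_std)[-1]

round(cbind(b, beta_manual, beta_refit = unname(beta_refit)), 3)
```

The unstandardized slopes are hard to compare because `x2` is measured on a much wider scale than `x1`. The betas put both on the same footing.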
Plotting coefficients
The (unstandardized) coefficients can also be plotted, here with robust confidence intervals:
plot_model(ols, vcov.fun = "HC", vcov.type = "HC1") +
  theme_sjplot2() +
  scale_y_continuous(limits = c(-.2, .8), breaks = seq(-.2, .8, by = .2)) +
  geom_hline(yintercept = 0, linetype = "dashed")
5.2.2 Logistic regression
logit <- glm(astrology ~ age + migrant + female + class + finance + health,
             family = binomial(link = "logit"),
             data = allbus2012)
tab_model(logit, transform = NULL, vcov.fun = "HC", vcov.type = "HC1", show.se = TRUE, digits = 3)
astrology | ||||
Predictors | Log-Odds | std. Error | CI | p |
(Intercept) | 0.665 | 0.301 | 0.076 – 1.255 | 0.027 |
age | -0.038 | 0.003 | -0.044 – -0.033 | <0.001 |
migrant [Migrant] | -0.424 | 0.204 | -0.824 – -0.025 | 0.037 |
female [Female] | 0.725 | 0.088 | 0.553 – 0.897 | <0.001 |
class | 0.172 | 0.067 | 0.040 – 0.303 | 0.011 |
finance | -0.110 | 0.056 | -0.220 – 0.000 | 0.050 |
health | -0.154 | 0.048 | -0.247 – -0.060 | 0.001 |
Observations | 3413 | |||
R2 Tjur | 0.092 |
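Log-odds are hard to interpret directly, so they are often converted to odds ratios by exponentiation. The sketch below does this for the coefficients in the table above. The values are typed in by hand from the table, so they carry its rounding. The object names are illustrative.

```r
# Log-odds copied from the regression table (rounded to three digits)
log_odds <- c(age = -0.038, migrant = -0.424, female = 0.725,
              class = 0.172, finance = -0.110, health = -0.154)

odds_ratios <- exp(log_odds)            # odds ratio = exp(log-odds)
pct_change  <- 100 * (odds_ratios - 1)  # percentage change in the odds

round(cbind(log_odds, odds_ratios, pct_change), 3)
```

For example, the coefficient for `female` corresponds to an odds ratio of roughly 2.06: holding the other predictors constant, the odds of believing in astrology are about twice as high for women as for men. Odds ratios below 1 indicate lower odds, as for age.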