4.4 Further exercises
We read in the bab (original) dataset, replicating the command given in the first chunk and create a new variable for birthweight divided into those less than 2500g, those between 2500g and 3000g, and those over 3000g. We do this with the tidyverse function case_when(). This function uses the formula syntax we have seen before: we explicitly set out the condition on the left hand side of the tilde and the value we want on the right (e.g. the first line of case_when says that when birthweight is less than or equal to 2500, set the value of lbw3 to 1). To use multiple conditions, we use the & operator to stand for ‘AND’. We can also use | to stand for ‘OR’ and ! to stand for ‘NOT’. Note that after applying case_when() we then convert this variable to a factor.
We can also use a nested ifelse approach - if the bweight variable is less than 2500, give it the value 1, else continue to the next statement - we repeat this process. In general, case_when() is more clear than using ifelse() as it explicitly sets out each new level and generates an NA when a variable does not meet any conditions, while ifelse() can lead to every value not specified being lumped into the final else condition. In this case, we have only three possible categories and thus the ifelse approach is relatively straightforward, but nesting many statements can get rapidly confusing.
Exercise 10.7: What percentage of birthweights were 3000g or more?
#--- Read in the bab file
bab <- read.dta("./BAB.dta", convert.factors = T)
#--- Create the low birthweight categorical variable
bab <- bab %>% mutate(lbw3 = case_when(bweight <= 2500 ~ 1,
bweight < 3000 & bweight > 2500 ~ 2,
bweight >= 3000 ~ 3)) %>%
mutate(lbw3 = as.factor(lbw3))
summary(bab$lbw3)
## 1 2 3
## 82 136 423
#--- Same as above using a nested ifelse approach
# bab <- bab %>% mutate(lbw3 = ifelse(bweight <= 2500, 1, ifelse(bweight < 3000, 2, 3))) %>%
# mutate(lbw3 = as.factor(lbw3))
Now we conduct some further analyses.
Exercise 10.8: Analyse whether there is any association between sex of infant and birthweight (using lbw3). How many degrees of freedom is the chi-square test based on?
#--- Get a 2x2 table with a chi-square test
bab %$% cc(sex, lbw3, graph = F)
##
## lbw3
## sex 1 2 3
## male 36 63 227
## female 46 73 196
##
## Odds ratio 1 0.91 0.68
## lower 95% CI 0.5 0.41
## upper 95% CI 1.63 1.12
##
## Chi-squared = 4.04 , 2 d.f., P value = 0.133
## Fisher's exact test (2-sided) P value = 0.131
Now we categorise maternal age into four groups (<30, 30-34, 35-39, 40+) to investigate the relationship with hypertension. We could do this by using three nested ifelse statements, or by using case_when() but we demonstrate a straightforward way to categorise the continuous variable using the cut() function.
Exercise 10.9: Is there any evidence of a linear association between hypertension and maternal age?
#--- Create the new variable and extract a summary
bab <- bab %>% mutate(matagegp = cut(matage,
breaks = c(0, 29, 34, 39,+Inf),
labels = c("<30", "30-34", "35-39", "40+")
))
summary(bab$matagegp)
## <30 30-34 35-39 40+
## 92 251 258 40
#--- Investigate the relationship between maternal age group and hypertension
bab %$% cc(ht, matagegp, graph = F)
##
## matagegp
## ht <30 30-34 35-39 40+
## no 76 211 232 33
## yes 16 40 26 7
##
## Odds ratio 1 0.9 0.53 1.01
## lower 95% CI 0.46 0.26 0.32
## upper 95% CI 1.83 1.12 2.89
##
## Chi-squared = 5.39 , 3 d.f., P value = 0.145
## Fisher's exact test (2-sided) P value = 0.121
We create a new variable that codes which quarter of gestation age each baby falls into. This uses exactly the same code as presented above and the same ntile() function.
Exercise 10.10: Find the values of gestational age which define the quartiles.
Exercise 10.11: What proportion of infants in the lowest quarter of gestational age were males?
#--- Get the quintiles
bab$gest4 <- as.factor(ntile(bab$gestwks, 4))
summary(bab$gest4)
## 1 2 3 4
## 161 160 160 160
#--- Extract the top of each quintile (and thus the cut points)
bab %>% group_by(gest4) %>% dplyr::summarise(max(gestwks))
## # A tibble: 4 x 2
## gest4 `max(gestwks)`
## <fct> <dbl>
## 1 1 38.0
## 2 2 39.2
## 3 3 40.2
## 4 4 42.3
#--- Get the 2x2 table of gender and gestational quartile
bab %$% tabpct(sex, gest4, graph = F)
##
## Original table
## gest4
## sex 1 2 3 4 Total
## male 80 85 86 75 326
## female 81 75 74 85 315
## Total 161 160 160 160 641
##
## Row percent
## gest4
## sex 1 2 3 4 Total
## male 80 85 86 75 326
## (24.5) (26.1) (26.4) (23) (100)
## female 81 75 74 85 315
## (25.7) (23.8) (23.5) (27) (100)
##
## Column percent
## gest4
## sex 1 % 2 % 3 % 4 %
## male 80 (49.7) 85 (53.1) 86 (53.8) 75 (46.9)
## female 81 (50.3) 75 (46.9) 74 (46.2) 85 (53.1)
## Total 161 (100) 160 (100) 160 (100) 160 (100)