4.4 Further exercises

We read in the bab (original) dataset, replicating the command given in the first chunk and create a new variable for birthweight divided into those less than 2500g, those between 2500g and 3000g, and those over 3000g. We do this with the tidyverse function case_when(). This function uses the formula syntax we have seen before: we explicitly set out the condition on the left hand side of the tilde and the value we want on the right (e.g. the first line of case_when says that when birthweight is less than or equal to 2500, set the value of lbw3 to 1). To use multiple conditions, we use the & operator to stand for ‘AND’. We can also use | to stand for ‘OR’ and ! to stand for ‘NOT’. Note that after applying case_when() we then convert this variable to a factor.

We can also use a nested ifelse approach - if the bweight variable is less than 2500, give it the value 1, else continue to the next statement - we repeat this process. In general, case_when() is more clear than using ifelse() as it explicitly sets out each new level and generates an NA when a variable does not meet any conditions, while ifelse() can lead to every value not specified being lumped into the final else condition. In this case, we have only three possible categories and thus the ifelse approach is relatively straightforward, but nesting many statements can get rapidly confusing.

Exercise 10.7: What percentage of birthweights were 3000g or more?

#--- Read in the bab file
bab <- read.dta("./BAB.dta", convert.factors = T) 

#--- Create the low birthweight categorical variable
bab <- bab %>% mutate(lbw3 = case_when(bweight <= 2500 ~ 1,
                                       bweight < 3000 & bweight > 2500 ~ 2,
                                       bweight >= 3000 ~ 3)) %>% 
               mutate(lbw3 = as.factor(lbw3))

summary(bab$lbw3)
##   1   2   3 
##  82 136 423
#--- Same as above using a nested ifelse approach                        
# bab <- bab %>% mutate(lbw3 = ifelse(bweight <= 2500, 1, ifelse(bweight < 3000, 2, 3))) %>% 
# mutate(lbw3 = as.factor(lbw3))

Now we conduct some further analyses.

Exercise 10.8: Analyse whether there is any association between sex of infant and birthweight (using lbw3). How many degrees of freedom is the chi-square test based on?

#--- Get a 2x2 table with a chi-square test
bab %$% cc(sex, lbw3, graph = F)
## 
##                 lbw3
## sex              1  2    3   
##   male           36 63   227 
##   female         46 73   196 
##                              
##   Odds ratio     1  0.91 0.68
##   lower 95% CI      0.5  0.41
##   upper 95% CI      1.63 1.12
## 
## Chi-squared = 4.04 , 2 d.f., P value = 0.133 
## Fisher's exact test (2-sided) P value = 0.131

Now we categorise maternal age into four groups (<30, 30-34, 35-39, 40+) to investigate the relationship with hypertension. We could do this by using three nested ifelse statements, or by using case_when() but we demonstrate a straightforward way to categorise the continuous variable using the cut() function.

Exercise 10.9: Is there any evidence of a linear association between hypertension and maternal age?

#--- Create the new variable and extract a summary
bab <- bab %>% mutate(matagegp = cut(matage,
    breaks = c(0, 29, 34, 39,+Inf),
    labels = c("<30", "30-34", "35-39", "40+")
  ))

summary(bab$matagegp)
##   <30 30-34 35-39   40+ 
##    92   251   258    40
#--- Investigate the relationship between maternal age group and hypertension
bab %$% cc(ht, matagegp, graph = F)
## 
##                 matagegp
## ht               <30 30-34 35-39 40+ 
##   no             76  211   232   33  
##   yes            16  40    26    7   
##                                      
##   Odds ratio     1   0.9   0.53  1.01
##   lower 95% CI       0.46  0.26  0.32
##   upper 95% CI       1.83  1.12  2.89
## 
## Chi-squared = 5.39 , 3 d.f., P value = 0.145 
## Fisher's exact test (2-sided) P value = 0.121

We create a new variable that codes which quarter of gestation age each baby falls into. This uses exactly the same code as presented above and the same ntile() function.

Exercise 10.10: Find the values of gestational age which define the quartiles.

Exercise 10.11: What proportion of infants in the lowest quarter of gestational age were males?

#--- Get the quintiles
bab$gest4 <- as.factor(ntile(bab$gestwks, 4))

summary(bab$gest4)
##   1   2   3   4 
## 161 160 160 160
#--- Extract the top of each quintile (and thus the cut points)
bab %>% group_by(gest4) %>% dplyr::summarise(max(gestwks))
## # A tibble: 4 x 2
##   gest4 `max(gestwks)`
##   <fct>          <dbl>
## 1 1               38.0
## 2 2               39.2
## 3 3               40.2
## 4 4               42.3
#--- Get the 2x2 table of gender and gestational quartile
bab %$% tabpct(sex, gest4, graph = F)
## 
## Original table 
##         gest4
## sex         1    2    3    4  Total
##   male     80   85   86   75    326
##   female   81   75   74   85    315
##   Total   161  160  160  160    641
## 
## Row percent 
##         gest4
## sex            1       2       3     4  Total
##   male        80      85      86    75    326
##           (24.5)  (26.1)  (26.4)  (23)  (100)
##   female      81      75      74    85    315
##           (25.7)  (23.8)  (23.5)  (27)  (100)
## 
## Column percent 
##         gest4
## sex         1       %    2       %    3       %    4       %
##   male     80  (49.7)   85  (53.1)   86  (53.8)   75  (46.9)
##   female   81  (50.3)   75  (46.9)   74  (46.2)   85  (53.1)
##   Total   161   (100)  160   (100)  160   (100)  160   (100)