# B Solutions to formative assessment

library(tidyverse)
library(magrittr)

## B.1 Question 1

Load the dataset into R as a tibble and check that your variables are what you expect them to be, using the command str.

Make sure you have set your working directory and saved the file there.

chd <- read.csv("CHD.csv")

str(chd)
## 'data.frame':    462 obs. of  10 variables:
##  $ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
##  $ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
##  $ ldl      : num  5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
##  $ adiposity: num  23.1 28.6 32.3 38 27.8 ...
##  $ famhist  : chr  "Present" "Absent" "Present" "Present" ...
##  $ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
##  $ obesity  : num  25.3 28.9 29.1 32 26 ...
##  $ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
##  $ age      : int  52 63 46 58 49 45 38 58 29 53 ...
##  $ chd      : int  1 1 0 1 1 0 0 1 0 1 ...
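The question asks for a tibble, whereas read.csv returns a plain data frame, as the first line of the str output shows. One way to bridge the gap is tibble::as_tibble, sketched here on a small made-up data frame (toy_df is illustrative and only borrows a few values from the output above). readr::read_csv is another option, since it reads a CSV file straight into a tibble.

```r
library(tibble)

# Hypothetical stand-in for the data frame returned by read.csv("CHD.csv")
toy_df <- data.frame(
  sbp     = c(160L, 144L, 118L),
  famhist = c("Present", "Absent", "Present")
)

# Convert the data frame to a tibble; column names and types are unchanged
toy_tbl <- as_tibble(toy_df)

class(toy_df)   # a plain data frame
class(toy_tbl)  # includes "tbl_df", the tibble class
```

With the real data the equivalent step would be chd <- as_tibble(chd).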

## B.2 Question 2

Change the variable name ‘obesity’ to ‘BMI’ and find the mean BMI of the individuals in the CHD dataset.

There are a few ways to do this. A base R solution looks like this:

colnames(chd)[7] <- "bmi"

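Or you could use dplyr (use one approach or the other: once the column has been renamed, there is no obesity column left to rename):

chd <- chd %>%
  rename(bmi = obesity)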

Then calculate the mean
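A minimal sketch of this step on a made-up data frame (toy and its values are illustrative only, not taken from the CHD data). On the real data the equivalent call would be mean(chd$bmi); the na.rm argument only matters if the column contains missing values.

```r
# Hypothetical values standing in for the renamed bmi column
toy <- data.frame(bmi = c(22, 25, 31, NA))

# na.rm = TRUE drops missing values; without it the result would be NA
mean_bmi <- mean(toy$bmi, na.rm = TRUE)
mean_bmi
```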

## B.4 Question 4

Create a boxplot for the distribution of the BMI variable.

boxplot(chd$bmi, horizontal = TRUE, xlab = "BMI (kg/m^2)")

## B.5 Question 5

Find the function to give the interquartile range (explained below) of the variable tobacco and interpret the results.

IQR(chd$tobacco)
## [1] 5.4475
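To interpret this value: the interquartile range is the distance between the first and third quartiles, so the middle 50% of the tobacco values span about 5.45 units, in whatever units tobacco is recorded. The sketch below shows that relationship on a made-up vector (x is illustrative, not the CHD data).

```r
# Hypothetical, right-skewed values standing in for chd$tobacco
x <- c(0, 1, 2, 4, 8, 16)

# First and third quartiles (R's default quantile method, type 7)
q <- quantile(x, probs = c(0.25, 0.75))

# The IQR is the distance between the two quartiles
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand
IQR(x)  # computed from the same two quartiles
```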

## B.6 Question 6

Create a new data set called y which contains only the first ten entries of the CHD data set.

Again, there are different ways to do this. In base R:

y <- chd[1:10, ]

Or using the tidyverse:

y <- chd %>%
  slice(1:10)
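A third base R option is head() with n = 10. The sketch below compares it with row indexing on a made-up data frame (toy and its columns are illustrative only).

```r
# Hypothetical data frame with 15 rows standing in for chd
toy <- data.frame(id = 1:15, sbp = seq(110, 180, by = 5))

y_index <- toy[1:10, ]    # row indexing, as in the base R solution
y_head  <- head(toy, 10)  # head() with n = 10

# Compare the two results and check the number of rows kept
all.equal(y_index, y_head)
nrow(y_head)
```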

## B.7 Question 7

Calculate the standard deviation for the systolic blood pressure variable using the operator %$%.

chd %$%
  sd(sbp)
## [1] 20.49632
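The %$% operator is magrittr's exposition pipe: it makes the columns of the left-hand data frame available by name in the right-hand expression, much like base R's with(). A minimal sketch on made-up readings (toy is illustrative, not the CHD data) compares the three equivalent spellings.

```r
library(magrittr)

# Hypothetical blood pressure readings standing in for the CHD data
toy <- data.frame(sbp = c(118, 132, 144, 160))

sd_pipe   <- toy %$% sd(sbp)     # exposition pipe: sbp is looked up inside toy
sd_with   <- with(toy, sd(sbp))  # base R equivalent
sd_dollar <- sd(toy$sbp)         # explicit column extraction

c(sd_pipe, sd_with, sd_dollar)
```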

## B.8 Question 8

Use ggplot to plot tobacco vs sbp, coloured red if they have chd and green if they don’t. Add in a line of best fit for each of the groups, with the relevant colour.

ggplot(data = chd) +
  geom_point(mapping = aes(y = tobacco, x = sbp, colour = as.factor(chd))) +
  geom_smooth(aes(y = tobacco, x = sbp, group = as.factor(chd), colour = as.factor(chd)),
              method = "lm", se = FALSE) +
  labs(title = "Systolic Blood Pressure vs Tobacco Use",
       x = "Systolic Blood Pressure (mmHg)",
       y = "Tobacco Use") +
  scale_colour_manual(aesthetics = "colour", values = c("darkgreen", "red"),
                      labels = c("No", "Yes"), name = "CHD") +
  theme_minimal()
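With method = "lm" and a grouping aesthetic, geom_smooth draws a separate least-squares line for each group. A minimal base R sketch of the same calculation, on made-up data in which each group lies exactly on its own straight line (toy and its values are illustrative only):

```r
# Hypothetical data: two groups lying exactly on different straight lines
toy <- data.frame(
  sbp     = rep(c(110, 120, 130, 140), times = 2),
  tobacco = c(1, 2, 3, 4, 2, 4, 6, 8),
  chd     = rep(c(0, 1), each = 4)
)

# Fit tobacco ~ sbp separately within each chd group
fits <- sapply(split(toy, toy$chd), function(d) coef(lm(tobacco ~ sbp, data = d)))

fits  # one column per group: intercept and slope of that group's line
```

Running the same split-and-fit on the real chd data frame would give the intercept and slope of each of the two lines in the plot.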

## B.9 Question 9

Use the filter function to subset the data to those with chd and a BMI between 18.5 and 25. Then use the arrange function to order them by tobacco intake.

sub <- chd %>%
  filter(chd == 1) %>%
  filter(bmi >= 18.5 & bmi < 25) %>%
  arrange(tobacco)

dim(sub)
## [1] 54 10
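Note that the condition bmi >= 18.5 & bmi < 25 includes the lower bound but excludes the upper one, so an individual with a BMI of exactly 25 is left out (dplyr's between() would include both ends). The base R sketch below applies the same condition and ordering to made-up rows placed on and around the bounds (toy is illustrative only).

```r
# Hypothetical rows chosen to sit on and around the BMI bounds
toy <- data.frame(
  chd     = c(1, 1, 1, 1, 0),
  bmi     = c(18.5, 24.9, 25.0, 17.0, 22.0),
  tobacco = c(3.0, 0.5, 1.0, 2.0, 0.0)
)

# Keep chd cases with 18.5 <= bmi < 25 (same condition as the filter calls)
keep <- toy$chd == 1 & toy$bmi >= 18.5 & toy$bmi < 25
sub_toy <- toy[keep, ]

# Order by tobacco intake, smallest first (what arrange(tobacco) does)
sub_toy <- sub_toy[order(sub_toy$tobacco), ]
sub_toy
```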

## B.10 Question 10

Write a function which does the same as above, but where you can enter the bounds of the BMI that you are interested in, i.e. if you call the function data_subset_bmi then the command data_subset_bmi(18.5, 25) produces the same as your answer to question 9.

# lower is included in the range, upper is excluded
data_subset_bmi <- function(lower, upper) {
  chd %>%
    filter(chd == 1) %>%
    filter(bmi >= lower & bmi < upper) %>%
    arrange(tobacco)
}
sub_2 <- data_subset_bmi(18.5, 25)
dim(sub_2)
## [1] 54 10
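The function above reads chd from the global environment. A variant that takes the data as an argument is easier to reuse and to test. The sketch below assumes that design; the name data_subset_bmi_from and the arguments data, lower and upper are illustrative and not part of the original solution.

```r
library(dplyr)

# Hypothetical variant: the data frame is passed in rather than read globally.
# Assumes data has columns chd, bmi and tobacco, and none called lower or upper.
data_subset_bmi_from <- function(data, lower, upper) {
  data %>%
    filter(chd == 1, bmi >= lower, bmi < upper) %>%
    arrange(tobacco)
}

# Hypothetical rows standing in for the CHD data
toy <- data.frame(
  chd     = c(1, 1, 0, 1),
  bmi     = c(20, 24, 22, 30),
  tobacco = c(5, 1, 0, 2)
)

sub_toy <- data_subset_bmi_from(toy, 18.5, 25)
sub_toy
```

On the real data, data_subset_bmi_from(chd, 18.5, 25) would apply the same filters as question 9.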