Session 12 Formative assessment

Throughout this exercise we will be working with data from a case-control study on coronary heart disease (CHD), stored in the file CHD.csv. To download this file, right-click on the hyperlink and select ‘Save link as’. Note that you may already have downloaded this file in an earlier chapter.

The Coronary Risk-Factor Study was carried out in three rural areas of the Western Cape region of South Africa, where the incidence of heart disease is particularly high.

We analyse a subset of the data in which the outcome of interest is CHD status (yes/no). The subset contains 160 cases (i.e. patients who developed CHD) and 302 controls (i.e. patients who did not develop CHD).

Each subject has nine measurements. These are:

'sbp' (systolic blood pressure),
'tobacco' (cumulative tobacco), 
'ldl' (low-density lipoprotein cholesterol),
'adiposity',
'famhist' (family history of heart disease),
'typea' (type-A behaviour),
'obesity' (BMI),
'alcohol' (current alcohol consumption),
'age' (age at onset, or age at testing for controls).

  1. Load the dataset into R as a tibble and check that your variables are what you expect them to be using the command str().

  2. Change the variable name ‘obesity’ to ‘BMI’ and find the mean BMI of the individuals in the CHD dataset.

  3. Plot a histogram of the BMI of the individuals in the CHD dataset, using 25 breaks. Make sure you label the axes of your histogram and add a title.

  4. Create a boxplot of the distribution of the BMI variable.

  5. Find the function that gives the interquartile range (explained below) of the variable tobacco and interpret the result.

    A simple measure of spread is the interquartile range. This is the distance between the 25th and 75th percentiles of the data.

    If we line up the data in order (smallest to largest), the 25th percentile is the value that occurs 25% of the way along the line; the 75th percentile is the value that occurs 75% of the way along the line.

  6. Create a new data set called \(y\) which contains only the first ten entries of the CHD data set.

  7. Calculate the standard deviation of the systolic blood pressure variable using the operator %$%.

  8. Use ggplot to plot tobacco vs sbp, with points coloured red for individuals who have CHD and green for those who don’t. Add a line of best fit for each of the two groups, in the corresponding colour.

  9. Use the filter function to subset the data to those individuals with CHD and a BMI between 18.5 and 25. Then use the arrange function to order them by tobacco intake.

  10. Write a function that does the same as the previous question but lets you enter the bounds of the BMI range you are interested in, i.e. if you call the function \(\verb|data_subset_BMI|\), then the command \(\verb|data_subset_BMI(18.5,25)|\) produces the same result as your answer to question 9.
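
To make the description of the interquartile range given in the questions above more concrete, here is a minimal sketch on a small made-up vector. The values in x are purely illustrative and are not taken from the CHD data. The sketch uses base R's quantile() to obtain the 25th and 75th percentiles and then takes their difference. Note that quantile() interpolates between ordered values by default, so in general the percentiles need not be values that actually occur in the data.

```r
# Illustrative values only; NOT taken from the CHD data
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15)

# 25th and 75th percentiles, using quantile()'s default method (type = 7)
q <- quantile(x, probs = c(0.25, 0.75))

# The interquartile range is the distance between the two percentiles
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand
```

With nine ordered values, the 25th percentile falls on the third value (4) and the 75th percentile on the seventh value (10), so the distance between them is 6.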


Solutions to the formative assessment are available in Appendix B. To maximise your learning, please attempt the questions as best you can before looking at the solutions.

As always, you can get help from the forums associated with your unit (where applicable), and always by emailing

We hope you enjoyed this Introductory R course. Feedback is always welcome -