Session 1 Welcome

As a health data scientist, it is vitally important that you have a firm understanding of a statistical programming language, and that you can work in a clear, reproducible fashion. This course will provide you with the baseline skills to use R for health data science.

1.1 Course objectives

  • Get you ‘up and running’ using R and RStudio on your machine.

  • Introduce the basics of programming in R (a key skill for a health data scientist).

  • Introduce good practices of workflows and reproducibility in data science.

  • Enable you to develop your skills independently in programming and data science workflow.

The emphasis is on the fundamental principles of writing scripts in R and how they are applied in practice.

1.2 Course description

A key skill for working with health data is the ability to use statistical software to manipulate, analyse and visualise data. R is free, open-source statistical software. It is appealing not only because it costs nothing but also because of its huge community of users and contributors. R has other well-documented advantages: it is widely used by academic statisticians; it is platform independent (it runs on Windows, macOS and Linux), so researchers on different platforms can work together; and there is an abundance of help and resources available, from books to forums. R also supports and interacts with tools for documenting your code, which helps make your work efficient and reproducible.

There is a huge range of free resources for learning R and through this course we will point you to some of these. Therefore the course will be a mixture of materials provided on these pages, and referrals to other resources. This should provide some variety in your learning as well as showing the wealth of resources available. Indeed, this course only scratches the surface, but we hope it will leave you with enthusiasm to learn more!

Although some parts of the course can simply be read, we encourage you to keep RStudio open and replicate the steps as they are described. There are exercises interspersed throughout the course.

1.3 How the course will run

This is a self-study module that should take between 20 and 40 hours to complete.

Online support tutorials and forums will run for this course if you are undertaking the course as part of one of your units. Please see the Blackboard page for the latest updates.

Solutions to the exercises throughout the course are in Appendix @ref{sols-ex}.

To contact the team, you can also email us at

1.4 Accessibility

This course is designed to be fully accessible. Using the buttons at the top of the HTML page, you can customise the font size and background colours. You can also download an offline version of the notes. Subtitles are available for all videos: click the subtitles/closed caption button at the bottom of the video.

If you have any issues with accessibility, please contact the course organisers at

1.5 Resources

There are many good, free, online resources for learning R and the associated workflow. None of these are essential reading; we include them here for information. We will make specific reference to parts of these resources throughout the course.

1.5.1 Online courses

  • Adventures in R – A fantastic free resource - highly recommended if you want to take a deeper dive than this course does.

  • R Bootcamp – Another great resource with interactive exercises.

  • Dataquest – Includes interactive R tutorials. Some free material, but you will have to pay for full access.

1.5.2 Books

1.5.3 Other resources

  • RStudio provides a page with further resources.

  • CRAN Task Views is a good starting point when you want to search for R packages for specific statistical challenges.

  • If you have burning questions, searching Stack Overflow (https://stackoverflow.com/) is also a good option. Type “description of your question R: stackoverflow” into any search engine, for example “how to calculate mean in R: stackoverflow”.

  • Tidy Tuesdays. This is a weekly project aimed at tidyverse users. Each week a new data file and an associated article or chart are uploaded. Your task is to try to write R code that reproduces the figure or analysis in the article. The resources are on GitHub: https://github.com/rfordatascience/tidytuesday

  • If you come across other resources that you find helpful, please share them with us all through the forums, or by contacting
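
As a small illustration of the kind of answer the example search above (“how to calculate mean in R”) is likely to lead to, here is a minimal sketch in base R. The vector name `systolic_bp` and its values are made up for illustration and are not course data.

```r
# Hypothetical example values (not course data): systolic blood pressure in mmHg
systolic_bp <- c(118, 126, 134, NA, 122)

# mean() returns NA if any value is missing ...
mean(systolic_bp)

# ... unless missing values are dropped with na.rm = TRUE
mean_bp <- mean(systolic_bp, na.rm = TRUE)
mean_bp
```

Try running these lines in your own RStudio console, then change the values to see how the result responds.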