Session 2 Getting started

2.1 What is R?

R is a computer programming language with many built-in statistical functions. The language can be easily extended with user-written functions. R also has very good graphing facilities.

R is a great place to start a data science journey because it is an environment designed from the ground up to support data science.

R is not just a programming language; it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many others.

This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process. These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.

See this Chapter on the history of R for more.

2.2 R versus Excel

Excel is convenient for data entry, and for quickly manipulating rows and columns prior to statistical analysis. However, Excel is a poor choice for statistical analysis that goes beyond textbook examples, the simplest descriptive statistics, or a handful of columns, for the following reasons:

  • Missing values are handled inconsistently, and sometimes incorrectly.
  • Data organisation differs according to analysis, forcing you to reorganise your data in many ways if you want to do many different analyses.
  • Many analyses can only be done on one column at a time, making it inconvenient to do the same analysis on many columns.
  • Output is poorly organised, sometimes inadequately labelled, and there is no record of how an analysis was accomplished.
  • Some analyses are either impossible, or very convoluted to do using Excel.

2.3 Installing R and RStudio

To demonstrate and use R, we use RStudio, an integrated development environment (IDE) for the R statistical programming language. It is a tool that can help you work better and faster. It includes docked windows for your console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

You can download and install the latest version of R for free on your own computer from the R website. Select the option appropriate to the operating system you are using.

You can either run the program, or save the program to your computer and then run it to install R. When installing, you can accept the default settings. Under ‘Documentation’ you can download the document entitled ‘An Introduction to R’ by W. N. Venables, D. M. Smith and the R Development Core Team (2006) which gives a clear introduction to the language and information on how to use R for doing statistical analysis and graphics. This manual is also available through the software at Help > Manuals (in PDF) > An Introduction to R. You can download it as a PDF file and keep it on your personal computer for reference.

Similarly, to install RStudio, go to the RStudio website.

** NOTE: You will need to have R installed before installing RStudio **

  • If you’re struggling to install, you could also try looking at this Chapter which includes links to step-by-step tutorials for the installation.

When you open RStudio, your interface is made up of four panes: script/code pane, console pane, workspace pane and the files/plots/packages/help pane, as shown below. These can be re-organised via menu options View > Panes > …

Screenshot of RStudio, with four panes highlighted. Top left: the script or code area. Top right: the workspace area. Bottom left: the console area. Bottom right: area where files, plots, packages and help can be displayed.

Each of the areas has a different function:

Script/code area

  • Where we keep records of our work
  • Write scripts/code
  • Scripts can be saved

Console area

  • Contains command line
  • Execute quick commands
  • Displays executed code
  • Displays results of executed code

Workspace area

  • Shows what is loaded in memory, e.g. data.
  • Stores any object, value, function or data you create during your R session (we will cover what those are later)
  • The history tab keeps a record of all previously submitted commands

Files, plots, packages and help area

  • The files tab lists the files in the set working directory. You can also navigate to other directories
  • The plots tab displays any graphs/figures created during the R session
  • The package tab shows a list of all add-ons currently available in RStudio (but more can be installed)
  • The help tab provides information about R and its commands. It is often very helpful, though not always (we will discuss this later)
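The working directory shown in the files tab can also be inspected from the console. This is a minimal sketch; the functions getwd() and list.files() are not introduced in the text above and are shown only as an illustration.

```r
# Print the path of the current working directory.
getwd()

# List the files in the current working directory.
list.files()
```

Both commands only read information, so they are safe to run at any time.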

The layout of the panes can be changed by selecting an option in View > Panes > …

2.4 R Scripts

We can think of R as a sophisticated calculator with its own language, and we need to learn to communicate with our new friend. Our friend requires us to type everything accurately, closing brackets and spelling things exactly as they are defined. Many of the errors you encounter will come from typing mistakes, so always double check what you have typed before looking for other problems.

You can type anything into the console at the prompt, and R will evaluate it and print the answer. However, we often want to do something more complicated than simple arithmetic and we want to be able to save and re-run code. For this reason, best practice is to open a new script where we will write, edit and save code. To open a new script, we go to File > New File > R Script. This will open an additional pane, as shown below.

Screenshot of RStudio. An arrow points to the top left pane, where a new script file has appeared.

You should set RStudio to open an empty environment whenever you start it. Go to Tools > Global Options > General. ‘Restore .RData into workspace at startup’ should be unticked. ‘Save workspace to .RData on exit’ should be set to ‘Never’.

Please annotate your code with comments, which start with the # symbol.
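As a small illustration of commenting (the calculation itself is made up for this example), a commented script might look like this:

```r
# Work out the number of hours in a week.
# R ignores everything after the # symbol on a line.
24 * 7   # prints [1] 168
```

A comment can sit on its own line or at the end of a line of code.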

2.5 R as a Calculator

Let’s try some simple calculations. Type the following commands into the R script you have created. You can then select a line and run it using the ‘Run’ button or by pressing Ctrl+Enter (Cmd+Enter on a Mac), and check that your results match those shown. Note that the ## symbols will not appear in your own output; this document uses them only to mark lines of output.

R works with vectors, and the [1] at the beginning of each line of results indicates that the answer is the first component of a vector. In this case the vector has length one. When a vector is long enough to span several lines of output, the first line still begins with [1], and each later line begins with the index of its first component. For example, [25] at the start of the second line of output would indicate that the first component on that line is the \(25^{\text{th}}\) component of the vector.

Note: A vector is a sequence of data elements of the same basic type, e.g. a row of numbers or a row of names.
We can consider a vector as simply a row of data. Members in a vector are officially called components.

## [1] 26
## [1] 64
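The commands that produced the two results above are not shown in this text. As an illustrative sketch only (these particular expressions are assumptions, chosen because they give the same answers), you could type:

```r
# Addition
20 + 6   # prints [1] 26

# Raising to a power: 2 to the power of 6
2^6      # prints [1] 64
```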

R operator precedence rules follow conventional mathematical rules, e.g. BIDMAS,

## [1] 10

That is, R evaluates the expression as 2 + ((4 * 20) / 10): \[2+\Big(\frac{4 \times 20}{10}\Big) = 10.\]
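The input for the result above is not reproduced here; judging by the bracketed form, it was presumably 2 + 4 * 20 / 10. The sketch below shows that expression alongside a second, made-up one in which brackets change the order of evaluation:

```r
# Multiplication and division are done before addition.
2 + 4 * 20 / 10      # prints [1] 10

# Brackets are evaluated first, so the answer changes.
(2 + 4) * 20 / 10    # prints [1] 12
```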

We can compare values using the greater than, less than, greater than or equal to, less than or equal to, and double equals operators (>, <, >=, <= and ==). These return a logical value, which is either TRUE or FALSE. We can use the exclamation mark to mean NOT; therefore != means not equal to.
Note: these are called Boolean statements and produce either TRUE or FALSE.

## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] FALSE
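The comparisons that produced the four results above are not shown in this text. The following sketch uses assumed values, chosen only so that the answers come out in the same order:

```r
5 > 3     # greater than: prints [1] TRUE
2 <= 2    # less than or equal to: prints [1] TRUE
4 == 4    # equal to (note the double equals sign): prints [1] TRUE
4 != 4    # not equal to: prints [1] FALSE
```

A single equals sign does not test for equality, which is why the comparison is written with ==.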

The standard mathematical constants and functions are built in, such as \(\pi\) = 3.14159…, \(\sqrt{2}\), exp(), sin(), cos(), tan(), etc.

  • R is case sensitive.

  • Commands are separated by a newline.

  • The # character can be used to make comments. R doesn’t execute the rest of the line after the # symbol - it ignores it.

  • Previous session commands can be accessed via the up and down arrow keys on the keyboard. This can save time typing as these can be edited and reused.

## [1] 3.141593
## [1] 1.772454
## [1] 314.1593
## [1] 1
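The commands behind the four results above are not shown in this text. One set of expressions consistent with them, offered as an assumption rather than as the original input, is:

```r
pi            # the built-in constant: prints [1] 3.141593
sqrt(pi)      # square root of pi: prints [1] 1.772454
pi * 10^2     # area of a circle of radius 10: prints [1] 314.1593
sin(pi / 2)   # sine of 90 degrees, given in radians: prints [1] 1
```

Because R is case sensitive, the constant must be typed as pi, in lower case.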

Some symbols have a special meaning in R, e.g. a colon between two numbers returns a vector that starts at the first number and steps by one, up or down, towards the second number.

## [1] 1 2 3 4 5 6 7 8
## [1] 5.5 4.5 3.5 2.5

You can deduce that : means a sequence of numbers between the numbers provided before and after, changing by one. If the second is larger than the first, the sequence will be increasing. If the second is smaller than the first, then it will be decreasing.
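The two sequences above can be reproduced with commands such as the following. The exact inputs are not shown in this text, so treat these as a sketch; the third line is an extra, made-up example.

```r
# Increasing sequence from 1 to 8 in steps of one.
1:8       # prints [1] 1 2 3 4 5 6 7 8

# Decreasing sequence: starts at 5.5 and stops before passing 2.
5.5:2     # prints [1] 5.5 4.5 3.5 2.5

# A longer sequence usually wraps over several lines of output. Each line
# starts with the index of its first component; where the lines break
# depends on the width of your console.
101:140
```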

2.6 Installing and loading R packages to add more functions

Part of the reason R has become so popular is the vast array of packages of user-written functions available at the CRAN and Bioconductor repositories. In the last few years, the number of packages has grown exponentially!

Installing these R packages couldn’t be easier (especially in RStudio). Note we only install a package ONCE.

One package that we will make extensive use of is the ‘tidyverse’ package. This is actually a wrapper for a number of other packages; see the Tidyverse website.

To install it, we type install.packages("tidyverse") into the R console.

Note that this will take some time: make yourself a drink! Remember, you only need to do this process once.

Alternatively, in RStudio you can simply click on the Packages tab in the bottom right corner and then Install (Packages > Install). Type tidyverse into the box ‘Packages (separate multiple with space or comma)’ and ensure that ‘Install dependencies’ is checked (it is by default).

After either of these methods, ‘tidyverse’ is installed in your library. When you want to use it, you can either load it by typing library(tidyverse) or check the box next to it in the list under Packages.
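The install-once, load-every-session pattern described above can be sketched as follows. The requireNamespace() check is an addition that does not come from the text; it simply skips the installation when the package is already present.

```r
# Install the package only if it is missing.
# Installation needs an internet connection and is done once per computer.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the package. This is needed in every new R session.
library(tidyverse)
```

If library() reports that there is no package called ‘tidyverse’, the installation step has not completed successfully.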

The range of packages contributed to R is huge. Some packages allow more in-depth statistical analysis, whereas some allow data to be imported. Some allow advanced graphics, while some import data directly from the internet. A list of popular R packages compiled by Garrett Grolemund is available at https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages.

Take care to use a version of R that supports the package; if your version is too old, the package may not install and you will need to update R. If so, a warning appears when you try to install the package. Alternatively, you can install an older version of the package by specifying a link to that version or via developer tools. The drawback is that you need to check carefully whether the older package still functions as you would expect. For details on installing older versions, see: Install older R packages
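To see which version of R you are running before installing a package, a quick check such as the one below may help. The threshold "4.0.0" is a made-up example, not a requirement of any particular package.

```r
# Show the version of R that is currently running.
R.version.string

# Compare it against a minimum version (hypothetical threshold).
getRversion() >= "4.0.0"

# Show the version of a package that is already installed,
# here the 'stats' package that ships with R.
packageVersion("stats")
```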

2.7 I need some help!

The help and support section of R is an invaluable resource that has contributed to the popularity of R. Help is easily accessed by clicking on the Help tab in the bottom right window of RStudio.

If you’re struggling to find help because you are unsure which function to search for, typing help.search("paste") will search for help files for functions that have something to do with “paste” (like glue or cut & paste). Finally, the quality and quantity of online help for R is particularly great, and a Google search beginning with “R”, e.g. “R paste”, usually returns a relevant solution to your problem.
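Help can also be requested from the console. The sketch below shows the help.search() call mentioned above together with the related commands ?, help() and ??, which are part of base R but are not covered in the text.

```r
# Open the help page for a function whose name you already know.
?paste
help("paste")

# Search the help files of installed packages for a keyword.
help.search("paste")

# Shorthand for help.search().
??paste
```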

Note that the search tool in the RStudio interface will not work unless the package is loaded into your environment. For this reason, it can often be a better option to search via a search engine such as Google. When looking at the help page, the function name is at the top left, followed by the package name in curly braces, as seen in Figure 3.

Fig 3. Screenshot of paste help page