14 Appendix 1: Basic Principles of R
Data Wrangling Recipes in R: Hilary Watt
14.1 Uses and merits of R
Uses of R
- View data (much as in Excel)
- Data Management – clean data, prepare for analysis, modify variables, create new variables
- Restructure data – merge datasets, restructure clustered datasets between long & wide formats & one & multiple lines per cluster
- Produce tables
- Produce quick and/ or high-quality graphs
- Statistical analysis – extensive packages available
Merits of R
- Can save commands to rerun later with/without modifications
- R has help files that aid with syntax
- The RStudio interface gives some guidance to help with debugging
- There are R teaching resources on the internet
- Regularly updated with new statistical methods, written by users.
- Both R and RStudio are free, so you can always access and use them.
- Many different ways to do anything, which gives the potential for finding efficient ways to do almost anything.
- Compared to Stata: almost endless flexibility to combine functions however you want, potentially easier to automate and extract results for complex mathematical modelling
- Compared to SPSS: more flexible and extensive options available. Writing code means that data-steps and analyses are documented and can be rerun following amendments and/ or repeated on other variables.
14.2 Core principles of R
14.2.1 R layout showing windows that appear when RStudio is opened:
RStudio can be opened by clicking the following icon, once RStudio is pinned to the task bar:
RStudio Windows | Location | Description |
---|---|---|
Source Window | upper-left | R script files for writing, submitting, and saving R code. |
View database | upper-left | Datasets shown in spreadsheet format (View command). |
Console Window | lower-left | Output from R code is printed here (possible, but not recommended to write and submit R code from here) |
Environment | upper-right | R datasets with variables and other available objects |
History | upper-right | R code/ commands that have been run |
Files | lower-right | Files stored on computer are shown here |
Plots | lower-right | R graphs produced are displayed here |
Packages | lower-right | Installed packages are listed here |
Help | lower-right | R help files shown here when opened |
Viewer | lower-right | Rarely used, showing web content |
14.2.2 Retrieving lost RStudio windows
If you lose an RStudio window, click on View and choose the name of the relevant window to reopen it:
To get the default arrangement of windows use the following menu option: View – Panes – Show all panes:
14.2.3 R Script file
Open an R script file to save R code, to avoid losing all your work.
Select file (top left of RStudio), select “new file”, then “R script”, to open a new R script file, as shown:
Immediately name and save the R script file by clicking on the save icon (at the top of the R script file, which is in the top left of RStudio).
Remember to create a directory/ series of directories, to save your work. Make sure you know where you have saved files! It may be worthwhile exploring R projects, as a way of keeping your files together (not taught in this handbook).
14.2.4 Write comments for your (programmer) benefit, within R code
R script files are collections of R code, interpreted by R, for action. R code can include code that amends datasets, produces tables and graphs and analyses data. Even very experienced R users are often clueless as to what bare code is doing, including code that they wrote themselves. Hence, any sensible coder adds comments into R code to help navigate and understand the code. Don’t assume you have the memory of an elephant. Instead, add in comments such as “Checking for errors”, “looking at shapes of distribution”, “this appears to be a Normal distribution”. You might want to document reasons for any choices made (in data management and methods of analysis) in the R script.
# Comments follow on from the "#" symbol, which might be at the beginning of a line or after a command.
# Checking for outliers
hist(anaemia$weight) # plots histogram
# table(anaemia$weight) a code line starting with # is commented out, so is not run by R
RStudio helps you by displaying comments in a different colour.
14.2.5 R commands and functions
I teach opening and viewing datasets within RStudio using menus. However, it is essential to write code to achieve most tasks in R. In particular, R scripts should include the code that opens datasets (RStudio provides this code when you open a dataset using menus).
There is little or no distinction between most commands and functions in R. This leads to considerable flexibility over how commands and functions can be combined. Whilst this is a great strength of R, one drawback is that when you make a mistake, R might interpret the code in a completely unexpected way, making it harder to spot any errors.
When you initially install R, commands/ functions are installed that form “base R”. However, it is common practice to use other functions. Such functions are stored within packages. Such packages can be drawn into R as required.
Functions and commands are case-sensitive.
14.2.6 Valid case-sensitive variable and object names
Variable names in R must start with a letter (or a dot that is not followed by a number), and can then include letters, numbers, dots “.” and underscores “_”.
BMI.male.kgm2, bmi_female, BMI_t1, BMI_t2, BMI_t3 are all valid names for objects/ variables/ variables within dataframes.
BMI, Bmi and bmi are three separate variable (or object) names.
Longer descriptive names may help you understand your own code, potentially avoiding many hours of struggle/ debugging. Brief variable names whose content is not clear have the opposite effect.
RStudio has drop-down menus that may appear when you start to write variable names, saving you from having to write them out for yourself; using them avoids typos, which again can save time debugging.
Names (such as BMI) could refer to a variable that is not in any dataframe, or to a dataframe, or a single value (such as BMI=23.43).
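As a minimal sketch of the naming rules above (the values are made up for illustration, and the helper is_valid_name is a hypothetical name, not part of R), the following creates three objects whose names differ only in case, then uses the base R function make.names() to check whether a candidate name is syntactically valid:

```r
# Three separate objects: R treats BMI, Bmi and bmi as different names
BMI <- 23.43
Bmi <- 30.1
bmi <- 18.9

# Hypothetical helper: a name is syntactically valid if make.names()
# leaves it unchanged
is_valid_name <- function(x) make.names(x) == x

is_valid_name("BMI.male.kgm2")  # TRUE
is_valid_name("bmi_female")     # TRUE
is_valid_name("2nd_BMI")        # FALSE: starts with a number
```

Running BMI, Bmi and bmi on separate lines prints three different values, which is a quick way to confirm that a typo in capitalisation refers to a different object.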
14.2.7 Dollar format for variable names within dataframes
Some dataframes are available within R for your use. To access them, use the data() function. Write the following into your R script file (RStudio top left), with one data() command on each line.
data(sleep) # to make dataframe named “sleep” available – look in environment window.
data(ChickWeight) # to make dataframe “ChickWeight” available – see environment window.
Put the cursor within the relevant line, then press CTRL-RETURN to submit the command.
The command (& any output/ error messages) appears in the CONSOLE window (bottom left).
The environment window (RStudio top right) also changes:
Several dataframes can be open within R at the same time. The following image shows 4 dataframes, each with a blue arrow (each is available using data() function). For CO2 and Orange dataframes, variable names are not currently revealed (blue arrow points right). After clicking on the blue arrow, variables within the dataframe are shown (blue arrow points down, for dataframes named sleep and trees):
There are 2 components of variable names within dataframes, separated by a dollar sign $: data_frame_name$variable_name
trees$Girth # refer to variable Girth within dataframe named trees
trees$Height # refer to variable Height within dataframe named trees
trees$Volume # refer to variable Volume within dataframe named trees
sleep$extra # refer to variable extra within dataframe named sleep
sleep$group # refer to variable group within dataframe named sleep
sleep$ID # refer to variable ID within dataframe named sleep
The contents of the variable are printed in R, when the name is written on its own line. Have a go at writing some yourself, within the R script file (then CTRL-RETURN with cursor on relevant line to submit).
Use of dollar format for variable names may make it easier to debug code. For instance, drop-down menus of variable names appear – you can select the correct item, to avoid risk of typing errors.
14.2.8 Matrix format df [row, column] for variables/ items within dataframes
As well as the $ format (e.g. trees$Girth), you need to know the following format, in which rows and columns are referred to by position or by name. Applied to a dataframe named df:
df[row, column] # refers to the item in row & column specified within df
df[row, ] # refers to the specified row within df (column left blank => all columns)
df[ ,column] # refers to the specified column within df (row left blank => all rows)
Examples: in the dataframe sleep, the variables are columns and the values within each variable are rows.
sleep[1,2] # refers to dataframe named sleep, 1st row, 2nd column = 1st value from “group” var = 1
sleep[1,] # refers to dataframe named sleep, 1st row, all cols = 1st row, extra=0.7, group=1, ID=1
sleep[,2] # refers to dataframe named sleep, 2nd column = variable “group” (all 20 rows/ obs)
sleep[,"group"] # refers to dataframe named sleep, column/variable named “group” (all 20 rows/ obs)
sleep[,'group'] # refers to dataframe named sleep, column/variable named “group” (all 20 rows/ obs)
Note: single quotes and double quotes are often interchangeable within R.
WARNING: when copying code from outside R, you may need to delete the quotes and type them back in for them to work. Quotes in R code should appear vertical (straight), not curly!!
Simpler version for a variable/ column/ vector: variablename[k] = kth item within the variable
sleep$extra[3] is -0.2 (3rd row for this variable; the 1st element is 0.7, the 2nd is -1.6 and the 3rd is -0.2).
14.2.9 Variables within dataframes and “stand-alone” variables
Many researchers choose to keep all variables within dataframes. An advantage is that all variables are then automatically the same length. The alternative of using “stand-alone variables” can result in error messages caused by differences in the lengths of variables. For stand-alone variables, there is sometimes the need to add explicit code (not shown here) so that the length of newly-created variables is the same as other variables.
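The length check described above can be illustrated with a small sketch. It assumes the built-in trees dataframe (31 rows); trees_copy and short_var are hypothetical names chosen for this example. A stand-alone variable of length 3 is accepted without complaint, whereas trying to store the same values inside the dataframe is rejected:

```r
data(trees)              # built-in dataframe with 31 rows
trees_copy <- trees      # work on a copy (hypothetical name)

# Stand-alone variable: R accepts any length without complaint
short_var <- c(1, 2, 3)
length(short_var)        # 3

# Inside a dataframe, the same values are rejected: 3 values, 31 rows
result <- tryCatch(
  {
    trees_copy$short_var <- short_var
    "assignment succeeded"
  },
  error = function(e) conditionMessage(e)
)
result                   # the error message reporting the row mismatch
```

Note that a single value (length 1) would be accepted and repeated on every row, so the check is on mismatched lengths rather than on every length other than 31.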
The assignment operator <- is required to create new variables.
(AVOID: a single equal sign “=” can also be used to assign a new variable. This is mentioned for debugging purposes only: please use <- instead, to make debugging easier.)
The new variable name is written to the left of the assign symbol <-
For stand-alone variables (risking errors from differing variable lengths), the new variable name has only one component.
For variables within dataframes, the new variable name has two components: an existing dataframe, then dollar $, then the new name. Once the variable is created, using this dollar format means that RStudio provides drop-down menus of variable names; using them avoids typos and helps with debugging.
For variables within dataframes, there is an alternative format: an existing dataframe (such as “anaemia”), with [row, column] format, where the new variable is a named column. This is more cumbersome, since RStudio does not provide drop-down menus to help with names in this format.
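The three forms of assignment described above can be sketched as follows. Because the “anaemia” dataframe is not available here, this example assumes the built-in trees dataframe instead; trees_copy, height_m and girth_cm are hypothetical names chosen for illustration.

```r
data(trees)                # built-in dataframe: Girth, Height, Volume
trees_copy <- trees        # hypothetical copy, so the original is untouched

# 1. Stand-alone variable: the new name has one component
height_m <- trees_copy$Height * 0.3048     # Height is recorded in feet

# 2. Variable within a dataframe, dollar format: dataframe $ new name
trees_copy$height_m <- trees_copy$Height * 0.3048

# 3. Variable within a dataframe, [row, column] format with a named column
trees_copy[, "girth_cm"] <- trees_copy$Girth * 2.54   # Girth is in inches

head(trees_copy)           # first rows, now including the two new columns
```

Forms 2 and 3 both add a column to the dataframe; form 1 leaves the dataframe unchanged and creates a separate object in the environment window.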
To understand the fundamentals of how R works with the assign symbol, the following video has received good feedback: Teaching R software “assign” fundamentals: assigning values to objects/ variables. R using R studio. (YouTube)
14.2.10 Missing data codes
Missing data (of any type) is coded as NA.
When a calculation cannot be performed/ does not result in a number, the answer might be:
Inf = plus infinity or -Inf = negative infinity
or NaN = “not a number”, such as 0/0.
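A short sketch of how these codes behave in practice, using made-up values (the vector x is hypothetical):

```r
x <- c(4, NA, 10)         # a variable with one missing value

is.na(x)                  # FALSE TRUE FALSE
mean(x)                   # NA: missing values propagate through calculations
mean(x, na.rm = TRUE)     # 7: NA removed before averaging

1 / 0                     # Inf
-1 / 0                    # -Inf
0 / 0                     # NaN
is.na(0 / 0)              # TRUE: is.na() also flags NaN
is.nan(NA)                # FALSE: NA is not NaN
```

Many base R summary functions, such as mean() and sum(), accept na.rm = TRUE to drop missing values before calculating.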
14.2.11 Installing packages – drawing in extra functions
The tidyverse package is very valuable for data-management, for tabulating and for graphing data (ggplot). If you have not already done so, please install it now, as follows:
install.packages("tidyverse") # downloads tidyverse onto computer – do this only once
library(tidyverse) # makes the relevant packages available – required each time R/ RStudio is opened
When you load the tidyverse, the following packages are loaded: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. For more information on the packages that tidyverse loads, see the tidyverse website.
Many other packages are available, and you will likely use many more as you keep on using R. Asking friends and colleagues, searching the internet, and browsing others’ R code are all good ways to discover useful packages. There is a choice of packages for many different tasks. R packages stored on the Comprehensive R Archive Network (CRAN) have passed certain checks and standards and are actively maintained.
This manual may include commands such as library(readr), which needs to be run each time R is opened. This implies the need to run the following code once to download the relevant package, if not already done: install.packages("readr"). Note, as mentioned above, that if the tidyverse is loaded then readr will also be loaded.
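One possible way to check which packages still need installing, before calling library(), is sketched below. The package list and the names needed, installed and missing_pkgs are examples chosen for illustration; requireNamespace() is a base R function that reports whether a package is installed.

```r
# Packages this session needs (example list; adjust to your own work)
needed <- c("readr", "dplyr")

# requireNamespace() returns TRUE if a package is installed,
# without attaching it to the search path
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]
missing_pkgs              # names of packages still to be installed, if any

# Run once, only if something is missing (needs an internet connection):
# install.packages(missing_pkgs)
```

The install line is left commented out so that the script does not download packages every time it is run.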
14.3 Further resources
Here are some useful references that teach data manipulation using Tidyverse:
Grammar of data manipulation in tidyverse, easier for some things than base R
This handbook occasionally uses pipes, recognised by the %>% symbol. Other aspects of tidyverse’s dplyr are also used.
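As a brief illustration of what a pipe does, the sketch below assumes that dplyr is installed and uses the built-in trees dataframe; tall_summary and mean_volume are hypothetical names. The pipe passes the result on its left to the function on its right as the first argument, so a sequence of steps can be read from top to bottom.

```r
library(dplyr)   # provides %>% and the verbs used below; assumes dplyr is installed
data(trees)

# Without a pipe: functions are nested and read inside-out
summarise(filter(trees, Height > 80), mean_volume = mean(Volume))

# With a pipe: the same steps, read top to bottom
tall_summary <- trees %>%
  filter(Height > 80) %>%
  summarise(mean_volume = mean(Volume))
tall_summary
```

Both versions give the same one-row result: the mean Volume of the trees taller than 80 feet.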
Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London.