2 Read data into R: set working directory, save R scripts, save R datafiles

Data Wrangling Recipes in R: Hilary Watt

Value of the few available menu options in RStudio

2.1 Avoid writing directory name, using a method that enables it to be copied

The next three sections show you how to avoid typing your directory name. The process is to read the data into R using menus (once you have made sure the file is in the desired directory for this project). You then copy the code from the “History” tab (the code is recorded there, even though you used menus). From that code, you can copy the directory element and point R towards this specific directory (using the setwd() command). R then looks in this directory when reading any files, and saves R files to this directory, without you needing to write the directory name again.

2.2 Prepare data prior to reading in, tidying up with Excel as required

Prepare the data in Excel or similar. The first line of the data file should contain sensible variable names, with all subsequent rows being data. It is good practice to choose variable names that are long enough for you to remember what information they contain. In RStudio, you can select names from drop-down menus to avoid typos. Debugging and amending code is easier with meaningful variable names, which ultimately saves time and effort. It is standard practice to save data in .csv format from within Excel for this purpose, although it is also easy to read data in from Excel (.xls) format.

2.3 Choose directory for project

Firstly, choose or create a directory for your R files related to this book or for your analysis project. Save the relevant files, including the anaemia.csv dataset (or your project datasets), into this directory.

2.4 Read in & view R data using RStudio menus

Open the file using menus within RStudio – this avoids the need to write the directory name.

In Environment tab (top right of R layout), click “Import Dataset” – select (From Text (base)…) for .csv file:

(find the relevant directory and double click on the filename)

Set Heading to “Yes” so that the first line of the .csv file becomes the variable names. Click on Import.

There is a check box, “Strings as factors”, which automatically converts all string (character) variables into factors. This is appropriate when you want them treated as categorical data. It is possible to amend this later for individual variables, so select this option as appropriate.
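The effect of this check box can be reproduced in code through the stringsAsFactors argument of read.csv(). The following is a minimal sketch, assuming a small made-up CSV file written to a temporary location; the column names id, sex and hb are hypothetical and are not taken from the anaemia dataset.

```r
# Write a tiny example CSV to a temporary file (made-up data for illustration)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,sex,hb",
             "1,female,11.2",
             "2,male,13.5",
             "3,female,10.8"), tmp_csv)

# Equivalent of leaving the check box unticked: strings stay as character
d_chr <- read.csv(tmp_csv, header = TRUE, stringsAsFactors = FALSE)
class(d_chr$sex)

# Equivalent of ticking the check box: strings become factors
d_fac <- read.csv(tmp_csv, header = TRUE, stringsAsFactors = TRUE)
class(d_fac$sex)
levels(d_fac$sex)

# Amend an individual variable later, if it was read in as character
d_chr$sex <- factor(d_chr$sex)
```

Converting single variables with factor() after import gives finer control than converting every string variable at once.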

When you’ve successfully imported the data using menus, click on the “History” tab (RStudio, top right). Find the “open dataset” command (read.csv) and the View command (the last 2 lines in the history, assuming the data frame was just opened).

Select the 2 code lines with the mouse and click “To Source” to move them into the open R script file. This records which dataset is used by your R script code and enables quick reopening of the relevant dataset.

Note: There are 2 menu options for reading in .csv files. The “From Text (readr)” option may be better for more complex datasets (for example those containing dates, since you can specify some data formats as you read the data in).

2.5 SPSS, SAS and Stata files can also be readily imported into R using menus

In the Environment tab (top right), click on “Import Dataset”, and select From SPSS, From SAS or From Stata as required. This is on the same menu as importing .csv files as described above. You can similarly then find the code in the History tab (top right).
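For reference, files from these packages can also be read with written code. The sketch below assumes that the haven package is installed (it provides read_dta(), read_sav() and read_sas()); check your own History tab to see which code the menus actually produce. The data are made up, and the file is written first only so that there is something to read back in.

```r
library(haven)  # assumption: haven is installed

# Made-up data written out as a Stata file, purely to have something to read in
dta_file <- tempfile(fileext = ".dta")
write_dta(data.frame(id = 1:3, hb = c(11.2, 13.5, 10.8)), dta_file)

# Read a Stata file; read_sav() and read_sas() play the same role
# for SPSS and SAS files respectively
stata_demo <- read_dta(dta_file)
head(stata_demo)
```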

2.6 Code to read in dataset

Using read.csv() for loading a CSV file

anaemia <- read.csv("<file path>/anaemia.csv", header = TRUE)

In practice, replace <file path> with your own file path:

anaemia <- read.csv("C:/Users/hilarywatt/R_handbook/anaemia.csv", header = TRUE)

(this is the command that appeared when using menus above).

Windows computers: to copy file name with file path, hold down shift key and right click on the file name. Select “copy as path” from the list that appears.
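A path copied in this way contains backslashes, which R treats as escape characters inside ordinary quoted strings, so it cannot be pasted into read.csv() unchanged. The sketch below shows two ways of handling this, using the example path from above. The raw string syntax r"(...)" assumes R version 4.0.0 or later.

```r
# Option 1: paste the copied path into a raw string (R >= 4.0.0),
# which keeps the backslashes exactly as they are
raw_path <- r"(C:\Users\hilarywatt\R_handbook\anaemia.csv)"

# Option 2: in an ordinary string, each backslash must be doubled
win_path <- "C:\\Users\\hilarywatt\\R_handbook\\anaemia.csv"

# Either version can be converted to the forward-slash form used in this chapter
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path
```

R accepts forward slashes in file paths on Windows, so r_path can be passed directly to read.csv().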

MORE ADVANCED options for reading in data:

The written command version of “From Text (readr)” requires you to load the package with library(readr) BEFORE using the relevant readr code (such as read_csv()).
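As an illustration of specifying data formats while reading in, here is a minimal sketch using read_csv() with the col_types argument. It assumes that the readr package is installed. The file and its columns (id, visit_date, hb) are made up for this example and are not part of the anaemia dataset.

```r
library(readr)  # assumption: readr is installed

# Made-up CSV containing a date column
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,visit_date,hb",
             "1,2020-01-15,11.2",
             "2,2020-02-03,13.5"), tmp_csv)

# Declare the type of each column as the data are read in
visits <- read_csv(tmp_csv,
                   col_types = cols(
                     id = col_integer(),
                     visit_date = col_date(format = "%Y-%m-%d"),
                     hb = col_double()
                   ))
str(visits)
```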

Importing data into R (including from Excel): Introduction to Importing Data in R

Intermediate importing data (from databases, from web, from stats packages): Intermediate Importing Data in R

Chapter 11 of R for Data Science, Wickham & Grolemund

2.7 Point R to chosen working directory

getwd() shows current working directory. If you save or open files without specifying a directory, R looks here.

setwd(dir) sets the working directory (replace “dir” with your chosen working directory; my strategy is to open files using menus, then copy and paste the directory element from the open-file command).

setwd("C:/Users/hilarywatt/R_handbook")

This enables use of shorter read and write commands, since they no longer need to point to the relevant directory:

anaemia <- read.csv("anaemiaB.csv", header = TRUE)
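The getwd() and setwd() workflow can be tried out without touching your own files. This is a minimal sketch, assuming a throwaway directory inside R's temporary folder as a stand-in for your project directory; the names R_handbook_demo and demo.csv are hypothetical.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Create a project directory inside R's temporary folder (stand-in for your own)
project_dir <- file.path(tempdir(), "R_handbook_demo")
dir.create(project_dir, showWarnings = FALSE)

# Point R to the project directory
setwd(project_dir)

# Files are now written to, and read from, this directory by name alone
write.csv(data.frame(id = 1:3, weight = c(61.5, 70.2, 55.0)),
          "demo.csv", row.names = FALSE)
demo <- read.csv("demo.csv", header = TRUE)

# Confirm that the file landed in the project directory
file.exists(file.path(project_dir, "demo.csv"))

# Restore the original working directory
setwd(old_wd)
```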

2.8 Saving R code into R script files

Open an R script file to save your R code, to avoid losing all your work. Be certain to add in many comments (starting with #, so that R does not attempt to interpret them as code). Otherwise, your own code may soon become impenetrable to you.

Select “File” (top left of RStudio), then “New File”, then “R Script”, to open a new R script file.

Immediately name and save the R script file by clicking on the save icon (at the top of the R script file, which is top left in RStudio). Save regularly as you code.

2.9 Write comments for your (programmer) benefit, within R code

R script files are collections of R code, interpreted by R, for action. R code can include code that amends datasets, produces tables and graphs, and analyses data. Even very experienced R users are often clueless as to what bare code is doing, including code that they wrote themselves. Hence, any sensible coder adds comments into R code to help navigate and understand the code. Don’t assume you have the memory of an elephant. Instead, add in comments such as “Checking for errors”, “looking at shapes of distribution”, “this appears to be a Normal distribution”. You might want to document reasons for any choices made (in data management and methods of analysis) in the R script.

# Comments follow-on from the “#” symbol: which might be at the beginning of a line or after a command.

# Checking for outliers

hist(anaemia$weight) #plots histogram

# table(anaemia$weight) a code line starting with # is commented out, so is not run by R

RStudio helps you by displaying comments in a different colour.

2.10 Saving R datasets

It is essential to keep a tidy file of your data cleaning and modification code, so that you can rerun it with any required amendments. If a tidy R script file creates each R dataset that you save, you can feel confident about overwriting the saved dataset. This avoids the potential confusion of having many different versions of your dataset.

Always keep a copy of your datasets as provided to you. Be certain to always write data with a different name to this original file.

Saving as R format means that the format of all variables will be retained. You can potentially keep only cleaned versions of variables that you will need for your analysis. With a tidy R script that creates this, it is easy to rerun keeping more variables (after cleaning them) if you later find you need more.

# save dataframe or similar object to R file
# saveRDS(object, file = "my_data.rds")

Save a data frame (or similar) as an RDS file in R. The file = argument specifies the name of the file where the R object is saved (or, when reading, the file it is read from). Example: save the edited and cleaned version of the anaemia dataset.

# save anaemia dataset
saveRDS(anaemia, file = "data/anaemia_cleaned.rds")

If a file called anaemia_cleaned.rds already exists, this command will replace the old dataset with the new dataset, without warning.
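The following sketch shows a full save and reload cycle, with an optional check for an existing file before it is overwritten. It assumes a small made-up data frame and a temporary folder in place of the project's data folder; the object and file names (cleaned, demo_cleaned.rds) are hypothetical. Note that saveRDS() does not create a missing folder, so the folder is created first.

```r
# Small made-up data frame standing in for a cleaned dataset
cleaned <- data.frame(id = 1:3,
                      sex = factor(c("female", "male", "female")),
                      hb = c(11.2, 13.5, 10.8))

# Use a temporary folder here; in a project this could be a "data" subfolder
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE)  # saveRDS() does not create missing folders
out_file <- file.path(out_dir, "demo_cleaned.rds")

# Optional guard: report when an existing file is about to be replaced
if (file.exists(out_file)) {
  message("Overwriting existing file: ", out_file)
}
saveRDS(cleaned, file = out_file)

# Read it back in: variable types (such as factors) are retained
restored <- readRDS(out_file)
str(restored)
```

readRDS() returns the object, so it can be assigned to any name you choose when reading it back in.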

Datasets can also be exported from R into other formats such as comma separated value (CSV), tab-delimited, SAS, or Stata.

?write.csv provides help on exporting a data frame to a CSV file.
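As a brief sketch of exporting, the code below writes a made-up data frame to CSV and to a tab-delimited text file in R's temporary folder. The object and file names are hypothetical. Setting row.names = FALSE stops R from adding an extra column of row numbers to the exported file.

```r
# Made-up data frame to export
export_demo <- data.frame(id = 1:2, weight = c(61.5, 70.2))

# Export as CSV
csv_file <- file.path(tempdir(), "export_demo.csv")
write.csv(export_demo, file = csv_file, row.names = FALSE)

# Export as tab-delimited text
tab_file <- file.path(tempdir(), "export_demo.txt")
write.table(export_demo, file = tab_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Inspect the raw contents of the CSV file
readLines(csv_file)
```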

2.11 Best not to save your workspace, so reject this suggestion when closing R

Whilst R offers to save your workspace by default, this is best avoided. The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, etc.). It is better to re-run your R scripts to generate the necessary output, or to save only the relevant plots, than to save everything each time. With large datasets, saving the workspace would quickly take up substantial amounts of storage space.

The main dataset is called anaemia, available here: https://github.com/hcwatt/data_wrangling_open.

Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London.