2 Read data into R: set working directory, save R scripts, save R datafiles
Data Wrangling Recipes in R: Hilary Watt
Value of the few available menu options in RStudio
2.1 Avoid writing directory name, using a method that enables it to be copied
The next three sections show you how to avoid typing in your directory name. The process is to read the data into R using menus (once you have made sure the file is in the desired directory for this project). Then copy the code from the “History” section (the code is written there, even though you used menus). You can then copy the directory element of that code and point R towards this specific directory (using the setwd() command). R then looks in this directory when reading any files, and saves files to it, without you needing to write the directory name again.
2.2 Prepare data prior to reading in, tidying up with excel as required
Prepare the data in Excel or similar. The first line of the data file should contain sensible variable names, with all subsequent rows being data. It is good practice to choose variable names that are long enough for you to remember what information they contain; in RStudio, you can select names from drop-down menus to avoid typos. Debugging and amending code is easier with meaningful variable names, which ultimately saves time and effort. It is standard practice to save data in .csv format from within Excel for this purpose, although it is also easy to read data in from Excel (.xls) format.
2.3 Choose directory for project
First, choose or create a directory for your R files related to this book or for your analysis project. Save the relevant files, including the anaemia.csv dataset (or your project datasets), into this directory.
2.6 Code to read in dataset
Use read.csv() to load a CSV file.
In practice, replace the file path in the read.csv() command with your own (this is the command that appeared when using menus above).
Windows computers: to copy a file name with its file path, hold down the Shift key and right-click on the file name. Select “Copy as path” from the list that appears.
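A minimal sketch of what such a read command can look like. The directory used here is a temporary folder and the small data frame is invented, purely so that the example runs on its own; in practice the path would point to your own project directory and the real anaemia.csv. The names project_dir, toy and anaemia_demo are illustrative assumptions. Note that a path pasted from Windows contains backslashes, which R treats as escape characters inside quotes, so they need to be changed to forward slashes (or doubled).

```r
# Hypothetical stand-in for the project directory; replace with your own path.
project_dir <- tempdir()

# Invented toy data, written out only so that there is a file to read back in.
toy <- data.frame(id = 1:3, weight = c(61.5, 70.2, 55.0))
csv_path <- file.path(project_dir, "anaemia_demo.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the CSV file; by default the first row is used as the variable names.
anaemia_demo <- read.csv(csv_path)
str(anaemia_demo)

# A Windows-style path (hypothetical) with its backslashes turned into forward slashes.
windows_style <- "C:\\Users\\me\\data"
r_style <- gsub("\\", "/", windows_style, fixed = TRUE)
print(r_style)
```

With your own data, only the read.csv() line is needed, with the quoted path pointing to your file.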
MORE ADVANCED options for reading in data:
The written command version of “From Text (readr)” requires us to run library(readr) BEFORE using the relevant readr code.
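As a rough illustration, the readr equivalent of read.csv() is read_csv(). The sketch below assumes the readr package has been installed (it is not part of base R), and it uses an invented toy file in a temporary folder rather than the real dataset; the object name readr_demo is hypothetical.

```r
# readr is an add-on package; install it once with install.packages("readr") if needed.
if (requireNamespace("readr", quietly = TRUE)) {
  library(readr)  # must be run before any readr function is used

  # Invented toy file so the example is self-contained.
  demo_path <- file.path(tempdir(), "readr_demo.csv")
  write.csv(data.frame(id = 1:3, weight = c(61.5, 70.2, 55.0)),
            demo_path, row.names = FALSE)

  # read_csv() (with an underscore) comes from readr; read.csv() is base R.
  readr_demo <- read_csv(demo_path)
  print(readr_demo)
}
```

read_csv() returns a tibble rather than a plain data frame, which mainly affects how the data are printed.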
Importing data into R (including from Excel): Introduction to Importing Data in R
Intermediate importing data (from databases, from web, from stats packages): Intermediate Importing Data in R
2.7 Point R to chosen working directory
getwd() shows the current working directory. If you save or open files without specifying a directory, R looks here.
setwd(dir) sets the working directory (replace “dir” with your chosen working directory; my strategy is to open files using menus, then copy and paste the directory element from the open file command).
This enables the use of shorter read and write commands, since they no longer need to point to the relevant directory.
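A minimal sketch of this, under stated assumptions: the working directory is set to a temporary folder (standing in for your own project directory), and the file wd_demo.csv is an invented toy file. Once the working directory is set, a bare file name is enough for both writing and reading.

```r
old_wd <- getwd()          # remember where R is currently looking

project_dir <- tempdir()   # hypothetical stand-in; use your own project directory
setwd(project_dir)         # point R to the chosen directory

# No directory needed in these commands: R uses the working directory.
write.csv(data.frame(id = 1:3, weight = c(61.5, 70.2, 55.0)),
          "wd_demo.csv", row.names = FALSE)
wd_demo <- read.csv("wd_demo.csv")

setwd(old_wd)              # restore the previous working directory
```

Restoring the old working directory at the end is only done here to leave your session unchanged; in a project script you would normally set the directory once at the top.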
2.8 Saving R code into R script files
Open an R script file to save your R code, to avoid losing all your work. Be certain to add in many comments (starting with #, so that R does not attempt to interpret them as code). Otherwise, your own code may soon become impenetrable to you.
Select file (top left of RStudio), select “new file”, then “R script”, to open a new R script file.
Immediately name and save the R script file: by clicking on the save icon (at top of R script file; R script file is top left of RStudio). Save regularly as you code.
2.9 Write comments for your (programmer) benefit, within R code
R script files are collections of R code for R to interpret and act on. R code can include code that amends datasets, produces tables and graphs, and analyses data. Even very experienced R users are often clueless as to what bare code is doing, including code that they wrote themselves. Hence, any sensible coder adds comments to R code to help navigate and understand it. Don’t assume you have the memory of an elephant. Instead, add in comments such as “Checking for errors”, “looking at shapes of distribution”, “this appears to be a Normal distribution”. You might want to document the reasons for any choices made (in data management and methods of analysis) in the R script.
# Comments follow on from the “#” symbol, which might be at the beginning of a line or after a command.
# Checking for outliers
hist(anaemia$weight) #plots histogram
# table(anaemia$weight)  # a code line starting with # is commented out, so it is not run by R
RStudio helps you by displaying comments in a different colour.
2.10 Saving R datasets
It is essential to keep a tidy file of your data cleaning and modification code, which you can rerun with any required amendments. If you have a tidy R script file that creates a saved R dataset, you can overwrite that dataset with confidence. This avoids the potential confusion of having many different versions of your dataset.
Always keep a copy of your datasets as provided to you. Be certain to always write data with a different name to this original file.
Saving as R format means that the format of all variables will be retained. You can potentially keep only cleaned versions of variables that you will need for your analysis. With a tidy R script that creates this, it is easy to rerun keeping more variables (after cleaning them) if you later find you need more.
Save a data frame (or similar) as an RDS file in R using saveRDS(). The file = argument specifies the name of the file where the R object is saved to or read from. Example: save the edited and cleaned version of the anaemia dataset.
If a file called anaemia_cleaned.rds already exists, this command will replace the old dataset with the new dataset, without warning.
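A minimal sketch of saving and reloading in RDS format, using the base R functions saveRDS() and readRDS(). The data frame below is invented toy data standing in for the cleaned anaemia dataset, and the file is written to a temporary folder rather than a real project directory; both are assumptions made so the example runs on its own.

```r
# Invented toy data, standing in for the cleaned dataset.
anaemia_cleaned <- data.frame(
  id    = 1:3,
  sex   = factor(c("F", "M", "F")),
  visit = as.Date(c("2020-01-05", "2020-02-11", "2020-03-20"))
)

rds_path <- file.path(tempdir(), "anaemia_cleaned.rds")

# Optional check, since saveRDS() overwrites an existing file without warning.
if (file.exists(rds_path)) message("Overwriting existing file: ", rds_path)

saveRDS(anaemia_cleaned, file = rds_path)   # save the R object
anaemia_reloaded <- readRDS(rds_path)       # read it back in

# Variable formats (factor, Date) are retained in the reloaded copy.
str(anaemia_reloaded)
```

Because readRDS() returns the object rather than restoring it under its old name, you choose the name it is assigned to when reading it back in.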
Datasets can also be exported from R into other forms such as comma separated value (CSV), tab-delimited, SAS, or STATA.
?write.csv provides help on exporting a data frame to a CSV file.
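For example, a CSV export might look like the sketch below. The toy data frame and the output file name anaemia_export.csv are hypothetical, and the file goes to a temporary folder so that nothing in your project is touched.

```r
# Invented toy data, used only to demonstrate the export.
toy <- data.frame(id = 1:3, weight = c(61.5, 70.2, 55.0))

out_path <- file.path(tempdir(), "anaemia_export.csv")

# row.names = FALSE avoids writing an extra column of row numbers.
write.csv(toy, out_path, row.names = FALSE)

# Look at the raw lines of the file that was written.
exported_lines <- readLines(out_path)
print(exported_lines)
```

Unlike RDS, CSV stores plain text only, so formats such as factors and dates need to be set again after reading the file back in.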
2.11 Best not to save your workspace, so reject this suggestion when closing R
Whilst R offers to save your workspace by default, this is best avoided. The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, etc.). It is better to re-run your R scripts and generate the necessary output, or to save only relevant plots, than to save everything each time. With large datasets, saving the workspace would quickly take up substantial amounts of storage space.
The main dataset is called anaemia, available here: https://github.com/hcwatt/data_wrangling_open.
Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London.