Section 2 Tips for effective R programming

Before getting into actual R code, we’ll start with a few notes about how to use it most effectively. Bad coding habits can make your R code difficult to read and understand, so hopefully these tips will ensure you have good habits right from the start.

2.1 Use RStudio’s “projects” feature

Every project you do in R should be set up in its own folder, through RStudio’s “projects” feature. To start a fresh project, go to File -> New project and create a new folder. Reopen the project in RStudio whenever you want to work with it.

When sending your work to other people, you can send them the whole folder and know that they’ll have access to all the required files and scripts.

One of the main benefits is that this sets R’s “working directory” to the project folder by default. Any files you load or save can just be referenced relative to that folder, so instead of:

data = haven::read_spss("R:/Project/2013/Analyses/Regressions/Data/Raw.sav")

You can just do:

data = haven::read_spss("Data/Raw.sav")

This will also work for anyone you send the project to, so you don’t have to worry that the file is in a different location on their machine.

2.2 Using scripts

It’s best to put every step of your data cleaning and analysis in a script that you save, rather than making temporary changes in the console.

Ideally, this will mean that you (or anyone else) can run the script from top to bottom, and get the same results every time, i.e. they’re reproducible.

2.2.1 Script layout

Most R scripts I write have the same basic layout:

Loading the libraries I’m using
Loading the data
Changing or analysing the data
Saving the results or the recoded data file

For a larger project, it’s good to create multiple different scripts for each stage, e.g. one file to recode the data, one to run the analyses.

When saving the recoded data, it’s best to save it as a different file - you keep the raw data, and you can recreate the recoded data exactly by rerunning your script.

R won’t overwrite your data files when you change your data, unless you specifically ask it to. When you load a file into R, it lives in R’s ‘short-term memory’, and doesn’t maintain any connection to the file on disk. It’s only when you explicitly save to a file that those changes become permanent.

2.2.2 A tip for better reproducibility

By default, RStudio will save your current “workspace” when you quit. While convenient, this can mean that you make one-off changes to your data and forget to save that command in your script. Starting with a fresh session every time you open RStudio means you’ll learn to keep every step of your analysis in your script, and you’ll know that you can get back to where you were by rerunning the script.

To disable the default setting, go to Tools -> Global options.., and in the General tab, and:

Uncheck “Restore .RData”
Set “Save workspace on exit” to “Never”

RStudio’s “save workspace” settings

2.3 Long or wide data

R works better with long data, whereas SPSS generally works better with wide data. Roughly speaking, long data means:

There is one “observation” of each participant/subject per row - e.g. one survey, one session
All the measurements of the same type are in one column - so all the K6 scores, from multiple surveys, would be in a single K6 column, with an additional Time or Survey column that identifies which survey they come from.

Example long data:

ID	Survey	Drinking	AnxietyTotal
1	1	No	6
1	2	Yes	8
2	1	No	5
2	2	No	7

Example wide data:

ID	Drinking_T1	Drinking_T2	AnxietyTotal_T1	AnxietyTotal_T2
1	No	Yes	6	8
2	No	No	5	7

2.3.1 Going from wide to long

If your data is currently in wide format, you may have to reshape it before working with it in R. The new pivot_longer and pivot_wider functions from the tidyr package are good for reshaping data like this. To go from the wide data above to a long dataset:

wide %>%
    tidyr::pivot_longer(
        cols = Drinking_T1:AnxietyTotal_T2,
        names_to = c(".value", "Time"),
        names_sep = "_"
    )

This can get more complicated if the columns haven’t been named consistently, or have multiple pieces of info stored in the name (e.g. t1_male_mean)

At the time of writing, the pivot functions were still in beta, but should be in the new tidyr version shortly.

2.4 Writing readable code

There are two very good reasons to try to write your code in a clear, understandable way:

Other people might need to use your code.
You might need to use your code, a few weeks/months/years after you’ve written it.

It’s possible to write R code that “works” perfectly, and produces all the results and output you want, but proves very difficult to make changes to when you have to come back to it (because a reviewer asked for one additional analysis, etc.)

2.4.1 Basic formatting tips

You can improve the readability of your code a lot by following a few simple rules:

Put spaces between and around variable names and operators (=+-*/)
Break up long lines of code
Use meaningful variable names composed of 2 or 3 words (avoid abbreviations unless they’re very common and you use them very consistently)

These rules can mean the difference between this:

lm1=lm(y~grp+grpTime,mydf,subset=sext1=="m")

and this:

male_difference = lm(DepressionScore ~ Group + GroupTimeInteraction,
                     data = interview_data,
                     subset = BaselineSex == "Male")

R will treat both pieces of code exactly the same, but for any humans reading, the nicer layout and meaningful names make it much easier to understand what’s happening, and spot any errors in syntax or intent.

2.4.2 Keeping a consistent style

Try to follow a consistent style for naming things, e.g. using snake_case for all your variable names in your R code, and TitleCase for the columns in your data. Either style is probably better than lowercase with no spacing allmashedtogether.

It doesn’t particularly matter what that style is, as long as you’re consistent. There is a suggested style guide for the tidyverse, but I don’t follow it 100%, I just try to be consistent within my code.

2.4.3 Writing comments

One of the best things you can do to make R code readable and understandable is write comments - R ignores lines that start with # so you can write whatever you want and it won’t affect the way your code runs.

Comments that explain why something was done are great:

# Need to reverse code the score for question 3
data$DepressionQ3 = 4 - data$DepressionQ3

Comments that explain what is being done are less useful. People who already understand R code should be able to tell what is happening just by looking at your code (especially if you’re following the other advice about writing readable code), so these comments can be redundant:

# Calculate the mean of the anxiety scores
anxiety_mean = mean(data$AnxietyTotal)

The exception to this is when you’ve run into something that was tricky to get working, and you need to explain it so other people don’t run into the same issue:

# (Example only, do not run)
# This fails to converge if we don't set the fix_missing option
drinking_model = logit_regression(Drink ~ Group, fix_missing=TRUE)

2.5 Don’t panic: dealing with SPSS withdrawal

2.5.1 RStudio has a data viewer

As you get used to R, you should find that you get more comfortable using the console to check on your data. You can often see a lot of the information you need by printing the first few rows of a dataset to the console. The head() function prints the first 6 rows of a table by default, and you can select the columns that are most relevant to what you’re working on if there are too many:

head(iris[, c("Species", "Petal.Length")])

##   Species Petal.Length
## 1  setosa          1.4
## 2  setosa          1.4
## 3  setosa          1.3
## 4  setosa          1.5
## 5  setosa          1.4
## 6  setosa          1.7

However, you can also use RStudio’s built-in data viewer to get a more familiar, spreadsheet style view of your data. In the Environment pane in the top-right, you can click on the name of any data you have loaded to bring up a spreadsheet view:

Data viewer example

This also supports basic sorting and filtering so you can explore the data more easily (you’ll still need to write code using functions like arrange() or filter() if you want to actually make changes to the data though).

2.5.2 R can read SPSS files

The haven package can read (and write) SPSS data files, so you can read in existing data:

survey_one = haven::read_spss("Data/Survey1_Final.sav")

R doesn’t deal with “value labels” in the same way as SPSS, and haven tries to keep information about the SPSS value labels available. However, it’s best to just convert everything to R’s way of dealing with categorical variables, i.e. factors, using haven’s as_factor() function:

survey_one = haven::as_factor(survey_one, levels = "both")

The levels = "both" option puts both the numeric value and the text label into the factor labels, like "[0] No", "[1] Yes". As you get more comfortable with R you may want to just use levels = "labels" so you just get the text labels like "No", "Yes".

You may need to convert your data from wide to long, since SPSS prefers wide and R prefers long.

The haven package can also read SAS and Stata data, and there are packages like readxl for Excel files. It’s generally easy to read your data into R from any format designed for tables of data.

2.6 Here be dragons: the bad parts of R

There are a few tools in R that tend to create more problems than they solve. Unfortunately beginners often end up using them (sometimes because bad tutorials recommend them). My personal list of tools to avoid includes:

attach(): This copies all the individual variables from a dataset into R’s environment, so you can access them with just var_name instead of dataset$var_name. The problem is:
- You end up with a lot of variables in your environment that are hard to keep track of.
- The variables can get out of sync with each other in ways that wouldn’t be possible if they were kept in the dataset.
get() and assign(): These allow you to look up and create variable names using strings, so instead of looking up model3, you can programmatically create the variable name like get(paste0("model", model_num)).
- Again, you can end up with a lot of variables in your environment.
- People often use these when they want to run 100 different versions of a model (there are sometimes good reasons to do this). Instead of creating 100 different variables called model1, model2, …, model100, it’s usually possible to save these in a single list or dataframe. Saving them all in a list means it’s much easier to process and work with the results.