7 Automating your work

Now that we have covered plotting, manipulating and tidying data in R, in this session we will combine those aspects of data analysis with some very basic programming to automate a series of repetitive tasks.

In this session we will work with a series of results files from the WEHI drug screening facility, available here. Please save the screening_plates folder into your Desktop folder named WEHI_tidyR_course.

Next create and save a new .R file, ‘Week_4_tidyverse.R’ in your Desktop WEHI_tidyR_course folder.

7.1 Experimental design

Cells from a cancer cell line are grown in 1536-well plates. Each test well contains a different potential anti-cancer compound. There are also positive and negative control wells in each plate. After 24 hours, a cell viability reagent is added and the luminescence in each well, corresponding to cell viability, is measured.

Our task is to identify the screening ‘hits’ in each plate by:
a) making a plot displaying cell viability per well, and
b) calculating simple statistics to identify significant deviations from control cell viability.

7.2 For loops

Before embarking on this task, let’s get familiar with a programming construct called the ‘for loop’. This is a staple of computer programming in almost any language, and allows us to perform the same task on any number of input values or datasets, known as ‘looping through inputs’.

The for loop in R has a special structure requiring both standard brackets () and curly brackets {}. The vector of values over which the task is repeated is given within the standard brackets.
The job to perform on each value is given within the curly brackets.
A single value in the input vector is referred to here as i (any valid variable name would work, but i is conventional). The value stored in this ‘i’ variable changes as the loop steps through the vector.

Furthermore, results calculated within the curly brackets are not displayed automatically, so they must be ‘printed’ to the console using the print() function.

Let’s make a very simple for loop to multiply a series of input values by 50.

First, we will get the code working for a single input: the number 3. This value, stored in ‘i’, will be multiplied by 50, and the result assigned to a variable. Then the result variable will be printed so that we can see the answer.
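The code chunk that produced the output below is not shown. A minimal sketch consistent with it, assuming the result variable is simply called result (a hypothetical name), might be:

```r
# Store a single input value in i
i <- 3

# Multiply by 50 and assign the answer to a variable
# ('result' is an assumed name, not taken from the original)
result <- i * 50

# Display the answer in the console
print(result)
```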

## [1] 150

Now let’s create a vector called ‘loop_values’ that contains several values to loop through.
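Judging by the results printed further down (50, 150, 250, 400 and 650, each an input multiplied by 50) and the final value of i, the vector presumably holds 1, 3, 5, 8 and 13. Under that assumption:

```r
# Inferred values: each printed result divided by 50
loop_values <- c(1, 3, 5, 8, 13)
```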

Finally we write the for loop
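A sketch of what the loop might look like, using the inferred contents of loop_values and the assumed variable name result:

```r
# Inferred input values (defined here so the block runs on its own)
loop_values <- c(1, 3, 5, 8, 13)

# i takes each value of loop_values in turn
for (i in loop_values) {
  result <- i * 50  # the job to repeat at each step
  print(result)     # show the result of this step in the console
}
```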

## [1] 50
## [1] 150
## [1] 250
## [1] 400
## [1] 650

Take a look at the value of i in your Environment pane. It now contains 13, not 3. This is because the for loop has run to completion, and so i contains the last value in the loop_values vector.

Around 1 minute into this clip Sheldon uses a similar loop function (not strictly a for loop) to identify a mutually agreeable activity to pursue with a new friend, Kripke. He is ‘iterating’ over a set of inputs (Kripke's interests) and calculating {if he wants to participate}.

7.3 paste()

OK now before we get sucked into a YouTube vortex let’s cover a second function that’s very handy for automating tasks. paste() is used to combine values into a single value, in a similar fashion to unite() from Week 3.
paste() is part of base R rather than the tidyverse, and is often used to create variable names and values. It takes one or more values, separated by commas (two in the examples here). By default the output will contain a space between the input values, but a different separator can be chosen using the sep argument.
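As a small illustration of the separator behaviour, with "plate" and 1 as made-up inputs rather than values from the course data:

```r
# Default: inputs are joined with a single space
paste("plate", 1)
## [1] "plate 1"

# Custom separator supplied through the sep argument
paste("plate", 1, sep = "_")
## [1] "plate_1"
```

Note that the number 1 is converted to text automatically, so the output is always a character value.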

To state Kripke’s interests, we can paste two character values together to form a sentence:
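One way to write this is sketched here. How the sentence is split into two strings is an assumption, chosen to reproduce the output below:

```r
# Join a fixed phrase and an interest; the default separator adds the space
paste("Kripke prefers", "horseriding")
```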

## [1] "Kripke prefers horseriding"

paste() is more useful in the context of a for loop, where it can add context to the results by adding a fixed value (usually text) to a variable.

Here we create a vector of Kripke’s interests, and loop through them using paste(), to produce multiple sentences.
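A sketch of such a loop, where the vector name interests is hypothetical and its contents are taken from the output below:

```r
# Vector of interests to loop through ('interests' is an assumed name)
interests <- c("horseriding", "swimming", "ventriloquism")

for (i in interests) {
  # Add the fixed text to the current interest and print the sentence
  print(paste("Kripke prefers", i))
}
```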

## [1] "Kripke prefers horseriding"
## [1] "Kripke prefers swimming"
## [1] "Kripke prefers ventriloquism"

7.4 Catching loop results

Although not required for this session, readers will eventually want to store the results of a for loop together in a single variable. Beginners can skip down to Part 2, but for completeness, here we will look at two ways to ‘catch’ all results from a for loop.

7.4.1 Catch in vector

In order to catch results as a vector, we first need to set up an empty variable. As the loop progresses, this variable should accumulate the values resulting from each step.

To create an empty vector, we use the function vector().
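For example, using the name loop_output_v that appears later in this section:

```r
# An empty vector with no elements, ready to accumulate results
loop_output_v <- vector()

# Confirm that it is empty
length(loop_output_v)
## [1] 0
```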

To make it clearer, the result generated by each step of the loop (a single-value vector) is now assigned to a variable named ‘step_result_v’.
The trick is to concatenate each step result onto the loop output vector, loop_output_v, using c(). Crucially, at each step, loop_output_v is overwritten to contain both its previous contents and one additional value: the step_result_v.
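A sketch of the accumulating loop under these assumptions: interests is a hypothetical name for the input vector, while step_result_v and loop_output_v follow the names used in the text.

```r
# Inputs to loop through ('interests' is an assumed name)
interests <- c("horseriding", "swimming", "ventriloquism")

# Empty vector that will collect the results
loop_output_v <- vector()

for (i in interests) {
  # Result of this step: a single-value character vector
  step_result_v <- paste("Kripke prefers", i)
  print(step_result_v)

  # Overwrite the output with its previous contents plus the new value
  loop_output_v <- c(loop_output_v, step_result_v)
}
```

Growing a vector with c() at every step is easy to follow but copies the vector each time, so it is best suited to short loops like this one.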

## [1] "Kripke prefers horseriding"
## [1] "Kripke prefers swimming"
## [1] "Kripke prefers ventriloquism"

We can now check the contents of the output

## [1] "Kripke prefers horseriding"   "Kripke prefers swimming"      "Kripke prefers ventriloquism"

7.4.2 Catch in data frame

Alternatively, it’s possible to catch the loop output as a data frame. The main differences are that the step result must be stored in a data frame rather than a vector, and we use bind_rows() instead of c() to add rows to the output.

First load the tidyverse:

Then create an empty data frame:

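Now write a loop which includes a step to create a 1 x 1 data frame, using the data_frame() function. Note that data_frame() is deprecated in current versions of the tibble package; tibble() is its replacement and works the same way here. This data frame stores step_result_v as a single row in one column.

The 1 x 1 data frame is then appended or ‘bound’ to loop_output_df using bind_rows().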
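The steps in this subsection can be sketched as follows. The column name sentence, the variable step_result_df and the input vector interests are hypothetical names, and tibble() is used in place of the deprecated data_frame():

```r
library(tidyverse)

# Inputs to loop through ('interests' is an assumed name)
interests <- c("horseriding", "swimming", "ventriloquism")

# Empty data frame that will collect the results
loop_output_df <- tibble()

for (i in interests) {
  step_result_v <- paste("Kripke prefers", i)

  # 1 x 1 data frame holding this step's result
  # ('sentence' is an assumed column name)
  step_result_df <- tibble(sentence = step_result_v)

  # Append the new row to the accumulated output
  loop_output_df <- bind_rows(loop_output_df, step_result_df)
}

# Inspect the result: one row per interest, in a single column
loop_output_df
```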

Readers trying to achieve more complex loop results with data frames can check out the map functions (from the purrr package) described in R for Data Science.