22 Good data practices

We have reached the substantive end of this course. You will go on with your journey as an R programmer, a data scientist and a researcher, no doubt doing things in some of the same ways as we have discussed here, and some different ways of your own. In parting, I want to communicate some good practices for writing code and archiving your work. At least, these seem like good practices to me, having worked on many projects with a wide variety of collaborators.

When you work on the data for a project, you are so engrossed in what you are doing that it all makes sense to you. You fail to appreciate that the code you are writing and the files and directories you are creating will be baffling to a collaborator or a colleague coming to them cold. More importantly, they will baffle and frustrate you two months later when you come back to them in order to revise the paper. You need to learn to structure, comment and organize your work in a way that maximizes transparency and clarity, facilitates scrutiny, and minimizes possible errors. And you need to do this now, as you go along. Often people think that they will sort out the code and the data archive later, once everything is done. But later never really comes. A good chef tidies their work area as they go along.

When we cut mere stones, we must always be envisioning cathedrals. You should always assume your code matters, is public, and will be used by many people, now and in the future. It will of course be public, as you will need to archive it in an online repository. The part about many people using it I cannot guarantee, I am afraid, but referees and other people in the field do seem to use my code and data from time to time, even code and data from quite obscure projects. In any event, the code and data archive are public parts of your work, as integral and important as the written paper. So, make them pretty.

22.1 Tips for writing good scripts

These days, my students and collaborators often send me their scripts in the course of working together on a project. This is part of the pleasure of working together. They say ‘can you have a look at the model selection part at line 387 and see if it looks right to you?’.

Almost invariably, I cannot make this part work. Even once I can, I can rarely understand the results. Why?

My correspondent is suffering from a curse of knowledge: they know so much about the situation that they do not realise their knowledge is not shared by others. In the course of working on the model selection part at line 387, they know (but I don’t) that you first have to make the data frame df7, which is created at line 93. Oh yes, and df7 is itself made from df6, which is itself made 35 lines earlier by selecting a subset of the main data frame df. Oh, and to get the Sex variable which is in the model, you have to have merged in the participant demographic variables from the file demographics.csv at line 115. Oh yes, and calculated that derived variable at line 205. Oh, and they forgot to mention, they are using the contributed package bananas, but this is not stated anywhere in the script. And the variable qf54 is the dependent measure, obviously, whilst the experimental condition is the variable c_USE_THIS_ONE, with values 1 for intervention and 4 for control. You see the problem.

This all makes for inefficient collaboration, messy thinking, difficult reproducibility, and rapid forgetting. By following the scripting tips in this section, you can try to make things better for your colleagues and your future self.

22.1.1 Use transparent and consistent naming conventions

Give all of the variables in your data files maximally clear (full word) names. Thus, Condition, not co; Neuroticism, not pers_3; and so on. It means typing a few more characters, but the gain in transparency is large.

Also, use a consistent convention about how the names are formed. If one variable name starts with a capital letter, then all should do so. If some names have multiple words, then have a consistent rule about whether the separator is . or _. (I do not recommend variable names including spaces.) Some people prefer the convention of full stops (.) in variable names and underscores (_) in function names. It does not matter too much what system you adopt, as long as it is consistent.

Relatedly, always code your categorical variables using transparent words, so that Condition has values Control and Intervention, not 1 or 2, or anything else. Again the point is not to minimize the number of bytes in the data frame, but to maximize immediate transparency.
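
For example, renaming and recoding might look something like this. This is a minimal sketch, assuming the tidyverse is loaded and a data frame d whose raw variables are called co and pers_3, as in the examples above:

# Give the variables full-word names
d <- d %>% 
  rename(Condition = co, Neuroticism = pers_3)

# Code the categorical variable with transparent words rather than numbers
d <- d %>% 
  mutate(Condition = case_when(Condition == 1 ~ "Control", 
                               Condition == 2 ~ "Intervention"))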

22.1.2 Section and comment your script

In an RStudio script file, any line that begins and ends with #### is read as a section heading. This means it becomes available to navigate to in the little bar at the bottom of the script window. It is also visibly different in the script window itself, especially if you put a line of blank space before and after it. Use this feature to give your script a clear structure of headings.

In between your headings, put in comments to tell yourself and the reader what you are doing in the following lines. Comment a lot. Too much is better than too little. Here is a hypothetical example:

#### Preparing the data frame ####

# First figure out which experimental group is which
# By seeing which one has a worse mood after than before
d %>% 
  mutate(Difference_Mood = Final_Mood - Initial_Mood) %>% 
  group_by(Mood_induction_condition) %>% 
  summarise(M = mean(Difference_Mood), SD = sd(Difference_Mood))
# Looks like 1 was negative and 2 was neutral

# Now let's recode the condition variable:
d <- d %>% 
  mutate(Condition = case_when(Mood_induction_condition == 1 ~ "Negative", 
                               Mood_induction_condition == 2 ~ "Neutral"))

#### Now the next task ####
#...

22.1.3 Make your scripts sectionally modular, with a head and a body

A problem you face is that the operations in your script must be written in a linear order, but sometimes the reader wants to jump straight to a later part, say the part where you make the figures. You don’t want the user to have to run all of the earlier sections just to make the code that produces figure 1, at line 320, work correctly. And you don’t want the figure 1 code at line 320 to stop working just because you changed something at line 115 in order to fit the statistical model.

In other words, you want your script to be modular. That means it should consist of a number of parts that can be operated and changed independently: the section that calculates the descriptive statistics; the one that fits the statistical model; the one that makes the figures; and so on.

If you took modularity to the extreme, then you would need to do basic operations common to the whole script, like reading in the data, loading contributed packages, recoding, and calculating derived variables, separately de novo in every section. This would make sure that each module worked autonomously, but it would be highly repetitive.

The compromise position I use is to have one common section, the head, and thereafter to make my sections completely modular. The head is the first section, and contains a small number of general operations that are needed for all or most of the modules in the script. To make any other section of the script work, you need only run the head plus the relevant section; no intermediate sections should be required.

The head should be identified with a heading and commented into subsections.

The head should contain the following elements:

  • Loading contributed packages. Load here all packages required by any module in the script.
  • Reading in data files. Read in all the data files that are going to be used at any point in the script.
  • Merging or reshaping data frames. Do any merging or reshaping that is going to be needed at any point.
  • Renaming and recoding variables. Give your variables their definitive names and values. Set the levels of any factors if required.
  • Calculation of derived variables. Calculate derived variables like scores from scales, or indices. This applies especially to those that are going to be used in several places.
  • Application of exclusions or subsetting. If there is some exclusion condition that is going to apply to the whole paper without exception (like, in a longitudinal cohort study, your paper only concerns the data from one time point), then I would apply this in the head. On the other hand if you are going to have different exclusions for different parts of the analysis, do the exclusions as they crop up within each module rather than in the head.

The user should know that once they have run all the code of the head, they are ready to jump to any section and find that it works correctly.
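
Putting this together, the head of a script might look something like this. This is a hypothetical sketch, borrowing variable and file names from the examples elsewhere in this chapter purely for illustration; the file experiment_data.csv is invented:

#### Head ####

# Load contributed packages
library(tidyverse)

# Read in the data files
d <- read_csv("experiment_data.csv")
demographics <- read_csv("demographics.csv")

# Merge in the participant demographic variables
d <- d %>% left_join(demographics, by = "ParticipantID")

# Rename and recode variables, giving them their definitive names and values
d <- d %>% 
  rename(Condition = c_USE_THIS_ONE) %>% 
  mutate(Condition = case_when(Condition == 1 ~ "Intervention", 
                               Condition == 4 ~ "Control"))

# Calculate derived variables that are used in several places
d <- d %>% mutate(Difference_Mood = Final_Mood - Initial_Mood)

# Apply any exclusions that hold for the whole paper
d <- d %>% filter(GRT < 700)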

22.1.4 Consider a separate data wrangling script

Sometimes the head of your script can get very long. For example you may have to read in and merge multiple data files, rename and recode many variables, add up scores from scales, and reshape the data frame from wide to long format. It is important that these operations are done reproducibly, since they are part of the chain of evidence that goes from your experiment to your paper. But they don’t make for gripping reading, and most users would rather fast forward through them.

In cases like this, I sometimes separate my work into a data wrangling script and a data analysis script. In the data wrangling script, I do all the tedious work described above, the work that takes my raw data files and makes them ready to start analysing and plotting. At the end of the data wrangling script, I save a processed version of the data (usually as an .Rdata file, as this retains factor levels for ordered factors, and a number of other useful things).

The head of the data analysis script is then very simple: it consists in loading contributed packages, loading the processed data file, and off we go. The user can choose their own adventure: either work through the data wrangling themselves, or just start at the beginning of the data analysis, where things are getting interesting.
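
A minimal sketch of how the two scripts connect, assuming the processed data frame is called d and with the file name processed_data.Rdata invented for illustration:

# At the end of the data wrangling script: save the processed data
save(d, file = "processed_data.Rdata")

# At the top of the data analysis script: load packages, then the processed data
library(tidyverse)
load("processed_data.Rdata")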

For completeness, I usually put the raw data files and wrangling script, as well as the processed data file and analysis script, in the data repository. You might want to accompany these with a ‘readme’ file explaining what everything is.

In terms of reproducibility, the data analysis script plus the processed data file should allow reproduction of the paper; and the data wrangling script plus the raw data files should allow reproduction of the processed data file.

22.1.5 Make the structure of the body correspond to the paper

So, you have a head and then modular sections. But, what sections should you have, and what order should they come in? Here, a useful tip is to mirror the structure of your results section in the script. For example, your results section may look something like this:

  • Demographics of sample

  • Descriptive statistics of measures

  • Research question 1: Model and figure

  • Research question 2: Model and figure

  • Additional exploratory analyses

In this case, the obvious choices for the sections in your script are:

  • Head

  • Demographics of sample

  • Descriptive statistics of measures

  • Statistical model for research question 1

  • Figure for research question 1

  • Statistical model for research question 2

  • Figure for research question 2

  • Additional exploratory analyses
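
Laid out as RStudio section headings, the body of the script would then look something like this:

#### Head ####
#### Demographics of sample ####
#### Descriptive statistics of measures ####
#### Statistical model for research question 1 ####
#### Figure for research question 1 ####
#### Statistical model for research question 2 ####
#### Figure for research question 2 ####
#### Additional exploratory analyses ####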

In an ideal world, there should be complete coherence between your script, your results section, and your preregistration. That is, the research questions or predictions stated in your preregistration are picked up, using the same nomenclature and order, in your script, which in turn is picked up in the Results section of the paper. The reader should be able to have these three documents side by side and see how they relate. This does not, of course, preclude justified departures from your preregistration; it only requires that the relationship between these three documents is clearly identifiable.

22.1.6 Consider using R Markdown

You can go further with the idea of making your script and your results section correspond to one another. You can make your script and your results section be the very same document. You do this by writing your document in R Markdown.

R Markdown is a way of writing formatted text documents that include sections of R code (or code in other programming languages) alongside their text. You write your R Markdown document in RStudio and save it as a file with the extension .Rmd. You can render your R Markdown into a number of formats for reading, such as Microsoft Word, PDF and HTML. When you output your file to one of the formats, the code will all be run and the output included alongside the text. This course is written in R Markdown, which I then output into HTML for the web or PDF for the print version.

You include R code within R Markdown in two ways. The first is code chunks. These are blocks of code which appear within the text in grey boxes. Underneath the code is the output that would be produced in the console and plot window by running that code.

The second way of including R within an R Markdown text document is inline code. Let’s say that within a sentence you want to cite some numbers, like the mean and standard deviation of a variable. Instead of running the relevant R code and then retyping the number you see into your text document, you simply call R to do the relevant calculation within the text. As long as the necessary variable is in your R environment when the document is rendered, R does the calculation and places the result into the document in the indicated place. This is obviously much less error prone than retyping. As someone who has spent months of their life typing numbers from R output into Word documents, I find this a game changer in terms of error reduction, and potentially time saving too. In particular, if the data change (you decide to apply a different exclusion criterion, for example), all the numbers in your tables and results update automatically and are correct. As long as there are no errors in your code, of course!
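
Here is a minimal sketch of what a code chunk followed by inline code looks like in the .Rmd file (the object names are chosen for illustration):

```{r}
# A code chunk: calculate the descriptive statistics
mean_SSRT <- mean(d$SSRT)
sd_SSRT <- sd(d$SSRT)
```

The mean SSRT was `r round(mean_SSRT, 2)` msec (SD = `r round(sd_SSRT, 2)`).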

I won’t go into detail here about how to write R Markdown. It is very intuitive and there are good materials on the web. It’s actually a very good word processor as well.

I have experimented with everything from writing my entire paper in R Markdown, to using R Markdown to make a statistical document that goes into the online repository but is not the actual paper. A good compromise solution is to write the Results section (or Methods and Results) in R Markdown, then, once you are happy with it, output it as a Word document that you drop in alongside the Introduction and Discussion, which you have written in Word. Most journals will want a Word version at some point, and often your collaborators will too, so as things currently stand, it is unlikely you will avoid Word altogether. A Results section written in R Markdown is fully reproducible by definition: by running your R Markdown file on your data file, a user will end up with the Results section you got.
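
Rendering to Word can be done with the Knit button in RStudio, or from the console; a one-line sketch, assuming your file is called results.Rmd:

rmarkdown::render("results.Rmd", output_format = "word_document")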

22.1.7 Maximize local modularity and minimize intermediate objects

I used to write scripts that made many intermediate objects. For example, if my main data frame was d but I wanted to fit a statistical model to only the data from participants with a reaction time of less than 700 msec, I would first make a data frame with the required data:

d.include <- subset(d, GRT<700)

Then later I would fit the model:

m1 <- lm(SSRT ~ Condition, data=d.include)

However, there is no need to do this in two steps. You can just write the model as:

m1 <- lm(SSRT ~ Condition, data=subset(d, GRT<700))

Why do I now prefer the second way to the first? If you are going to fit many models, all of them including only participants with GRT < 700, then the second way will probably mean more characters in your script overall. However, the second way has a number of advantages. Your environment does not get littered with intermediate objects, each of which has to have a different name. The line defining the model m1 is more locally modular, in that it does not depend on having previously run the line defining d.include, which might be at a distal location in the script. And, the line defining the model m1 is more transparent. It is more obvious that m1 has been fitted only to the data from participants with GRT < 700, because the call actually says so explicitly. Thus, these days I prefer the second version, because more of the work is done locally to the operation itself.

There are many other examples of this kind. Let’s say you want to work out the mean of the variable SSRT by participants (for a case where participants do the same task multiple times), and then work out the mean and standard deviation of those means by Sex. You could do this via the intermediate object participant.summary, as follows:

participant.summary <- d %>% group_by(ParticipantID) %>%
  summarise(participant.mean.SSRT = mean(SSRT), 
            Sex = first(Sex))

And then later:

participant.summary %>% group_by(Sex) %>%
  summarise(mean(participant.mean.SSRT), 
            sd(participant.mean.SSRT))

However, there is no need for the intermediate object. You can do everything in one go:

d %>% group_by(ParticipantID) %>%
  summarise(participant.mean.SSRT = mean(SSRT), 
            Sex = first(Sex)) %>%
  group_by(Sex) %>%
  summarise(mean(participant.mean.SSRT), 
            sd(participant.mean.SSRT))

No intermediate objects with funny names. The pipe is a bit long, but the user can work through what it is doing step by step.

There are exceptions to this rule. Sometimes it is just simpler or much more economical to make an intermediate object to which you will apply further operations. And defining the included data set as an object of its own would mean that you could change that one line and all the statistical models downstream of it would update automatically. But in general I would privilege transparency and local modularity over minimising the length of the script.

22.2 Archiving your work, and the chain of custody

In research, what you are trying to do is to provide evidence about a case. So, you need to take your evidence seriously, like police, lawyers and forensic investigators do. You need to have in mind the chain of custody: the sequence of processes that goes from you gathering the data on a paper questionnaire, in your field notebook or on the machine in your laboratory, to your processed data files, to your statistical results, to the claims in your written paper. Each link in the chain needs to be verifiable and reproducible. You need to be able to show how the evidence was gathered and stored, how it was transmitted to the next stage, and, if it was changed, exactly in which ways and why.

Taking the chain of custody seriously entails three commitments:

  • Preserve all the evidence. Keep your data sheets and digitize them; make electronic copies of your field notebooks; keep all the raw forms of the files even if you mostly work with a more processed form for the analysis. You should never throw anything away, but archive raw forms alongside any more processed forms. The only exception is the destruction you are obliged to carry out to preserve the anonymity of your participants (deleting identifying personal information, or replacing video recordings with anonymised transcripts).

  • Alter the material only in recoverable and traceable ways. Let’s say you want to exclude participants with a GRT greater than 700 msec, or replace reaction times of greater than 1000 msec with the value 1000 msec. Never alter the raw data files (for example by going into Excel and changing values in individual cells). Instead, write a section of script that applies the relevant operations explicitly (see the sketch after this list). The reasons should be obvious: if it is in the script, it is publicly verifiable that you did this operation, and at which stage. Plus, you can change your mind. You (or someone else) can still recover the original values or data points and include them in an alternative analysis.

  • Open the whole chain to inspection. You should archive your code and data on a repository such as the Open Science Framework, and include links to this archive in all your papers. The archive should not just contain the final, highly processed file and analysis code. The whole chain of custody should be open to inspection. So, in your data repository, you might have a directory of digitised raw data sheets or individual participant experimental files, then a spreadsheet or wrangling script that gets the data into processed form, then a processed data file, then a script that does the analysis in the paper and makes the figures. Users can forensically inspect any stage in this sequence if they have a mind to. As I always stress, the advantages of doing this are not just for the credibility of your work with the scrutinizing community, important though this is. There are also advantages for you. Repositories like the Open Science Framework give you essentially limitless digital storage space forever, for free. If you organize a nice public archive with a clear chain of custody for each project, you are much more likely to be able to work effectively on your own data months or years later when you have moved jobs or your old laptop has stopped working or you can’t understand the mess of files on your local drive. And some other researcher may want to include your evidence in a meta-analysis: you will have somewhere clear and simple to point them to.
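
To illustrate the second point, the alterations described in that bullet might be scripted like this (a sketch; the reaction time variable name RT is assumed for illustration, and the raw data files themselves are left untouched):

# Apply exclusions and replacements in the script, never in the raw data files
d <- d %>% 
  filter(GRT <= 700) %>%           # exclude participants with GRT greater than 700 msec
  mutate(RT = pmin(RT, 1000))      # replace reaction times greater than 1000 msec with 1000 msec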

Take some time, then, to make a good data archive, probably involving some structure of directories and sub-directories, and a ‘readme’ file that tells your future self and other people what all the files are and what they do. You will want to include the preregistration, protocols, and as much of the material from all the steps of the chain of custody as you can manage.

22.3 Summary

In this unit, we have examined some ways you can make your scripts more usable and more transparent, for yourself and other people. We have also discussed the idea of the chain of custody of evidence, and guidelines for what you should include in the archive of your project in the data repository.