Making your analyses understandable, reproducible, and extendable
Motivating scenario: You are working on a data analysis project. You want to be able to share your efforts with your future self and with other people.
The most important tool for #ReproducibleResearch is the mindset, when starting, that the end product will be reproducible.
Keith Baggerly
One of my roles as a professional scientist is to serve as a data editor for The American Naturalist. This means that I evaluate the data and code repository for each paper that is likely to be published in that journal.
My role at The American Naturalist is part of a growing movement for open science: the idea that the products of science should be “FAIR”, that is, Findable, Accessible, Interoperable, and Reusable (see this article, and this article too). This means that both the raw data and the procedures used to analyze those data should be available and easy for anyone to pick up and build on. Sharing data and code is essential for the continued refinement of our scientific understanding.
Additionally, open science is part of an ethical obligation to make the products of publicly funded research available to the people who support it. By embracing open science, we ensure that knowledge is not locked behind paywalls or limited to select groups but is instead freely available to foster innovation, education, and societal advancement.
In addition to these high-minded principles, there is a more pragmatic and selfish reason to make your work reproducible: as a scientist, “your closest collaborator is you six months ago, but you don’t reply to emails” (quote attributed to Mark Holder). This humorous observation highlights a very real challenge. Research is complex and can stretch over months or even years. Without well-documented data and code, it can be extremely difficult to retrace your steps, understand your past analyses, or explain your results. By keeping a good record of your workflows, you make your future self’s life much easier. Whether you need to revisit an analysis, explain a decision to a reviewer, or tweak a model based on new insights, maintaining clear, organized, and reproducible records ensures that your work is understandable and adaptable.
A key to the scientific method is reproducibility. This is why scientific papers have a methods section. Nowadays, the internet allows for even more detailed sharing of methods. Additionally, in most fields it is expected that data are made available after publication in repositories like Dryad or DRUM. The previous chapter discussed best practices in collecting and storing data.
So we have discussed how to ensure that your data are open and accessible. But what you do to data and how you analyze them is as much a part of science as how you collect them. As such, it is essential to make sure your code is understandable, reproducible, and extendable.
Here are the principles from The American Naturalist’s policy about this:
REQUIRED:
Scripts should start by loading required packages, then importing raw data from files archived in your data repository.
Use relative paths to files and folders (e.g. avoid setwd() with an absolute path in R), so other users can replicate your data input steps on their own computers.
Make sure your code works. Shut down R (or type rm(list=ls()) into the console) and run the code again. You should get the same results. If not, go back and fix your mistakes.
Annotate your code with comments indicating what the purpose of each set of commands is (i.e., “why?”). If the functioning of the code (i.e., “how”) is unclear, strongly consider re-writing it to be clearer/simpler. In-line comments can provide specific details about a particular command.
Annotate code to indicate how commands correspond to figure numbers, table numbers, or subheadings of results within the manuscript.
If you are adapting other researcher’s published code for your own purposes, acknowledge and cite the sources you are using. Likewise, cite the authors of packages that you use in your published article.
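As a rough illustration of the required items above, here is a minimal sketch of how the top of an analysis script might look. The file name `data/raw_measurements.csv`, the column names, and the "Table 1" label are all hypothetical, and the stand-in file is created only so the sketch can run on its own; a real script would read the file archived in your repository.

```r
# ---- Setup: load required packages first ----
# Only packages that ship with R are used in this sketch. A real script would
# list every package it needs here, before any other code.
library(stats)
library(utils)

# ---- Import raw data using a relative path ----
# "data/raw_measurements.csv" is a hypothetical file name. file.path() builds
# the path relative to the project folder, so no setwd() call is needed.
raw_data_path <- file.path("data", "raw_measurements.csv")

# For this sketch only: create a tiny stand-in file if the real one is absent,
# so the example runs anywhere. A real script would not include this step.
if (!file.exists(raw_data_path)) {
  dir.create("data", showWarnings = FALSE)
  stand_in <- data.frame(
    site = c("A", "A", "B", "B"),
    height_cm = c(10.2, 11.1, 14.8, 15.3)
  )
  write.csv(stand_in, raw_data_path, row.names = FALSE)
}

raw_data <- read.csv(raw_data_path)

# ---- Table 1: mean height per site ----
# Why: the (hypothetical) manuscript compares average plant height among sites.
table_1 <- aggregate(height_cm ~ site, data = raw_data, FUN = mean)
print(table_1)

# ---- Cite what you used ----
# citation() shows how to cite R itself; citation("pkgname") does the same
# for an installed package.
citation()
```

Restarting R and running a script like this from top to bottom is a quick check that nothing depends on objects left over from an earlier session.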
RECOMMENDED:
Test code ideally on a pristine machine without any packages installed, but at least using a new session.
Use informative names for input files, variables, and functions (and describe them in the README file).
Any data manipulations (merging, sorting, transforming, filtering) should be done in your script, for fully transparent documentation of any changes to the data.
Organise your code by splitting it into logical sections, such as importing and cleaning data, transformations, analysis and graphics and tables. Sections can be separate script files run in order (as explained in your README) or blocks of code within one script that are separated by clear breaks (e.g., comment lines, #————–), or a series of function calls (which can facilitate reuse of code).
Label code sections with headers that match the figure number, table number, or text subheading of the paper.
Omit extraneous code not used for generating the results of your publication, or place any such code in a Coda at the end of your script.
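The recommendations above about keeping data manipulations in the script and splitting code into labeled sections might look something like the following sketch. The data values, the function names `clean_counts()` and `summarise_by_treatment()`, and the "Table 2" label are made up for illustration; this is one possible layout, not the journal's prescribed one.

```r
# ---- 1. Import ----
# A small in-memory data frame stands in for the raw data file
# (hypothetical values).
raw_counts <- data.frame(
  plot_id = c(1, 2, 3, 4),
  treatment = c("control", "control", "burned", "burned"),
  seedling_count = c(12, NA, 30, 25)
)

# ---- 2. Clean and transform ----
clean_counts <- function(counts) {
  # Drop plots with missing counts here, in code, rather than by editing
  # the data file by hand, so the change is documented.
  kept <- counts[!is.na(counts$seedling_count), ]
  # Add a log-transformed count for later analyses.
  kept$log_count <- log(kept$seedling_count)
  kept
}

# ---- 3. Analysis: Table 2 (hypothetical manuscript label) ----
summarise_by_treatment <- function(counts) {
  # Mean seedling count for each treatment group.
  aggregate(seedling_count ~ treatment, data = counts, FUN = mean)
}

# ---- 4. Run the sections in order ----
cleaned <- clean_counts(raw_counts)
table_2 <- summarise_by_treatment(cleaned)
print(table_2)
```

Writing each section as a function keeps the steps easy to reuse, and the final block shows the order in which they run.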
While sometimes it makes sense to separate your code from the products it generates, we often want to create a single document that contains both the code and its output. RMarkdown allows you to do just that. You can write text just like you would in any document (in fact, this entire book is written in RMarkdown) and then insert code in designated “code chunks.”
To get started with RMarkdown, I recommend reading two helpful tutorials: one from Our Coding Club and another from Reproducible Medical Research with R. These guides should give you a solid foundation.
That said, let me offer some additional pointers to help you as you dive into RMarkdown:
Compile Often: Ideally, try to compile your document after completing each code chunk. If there’s an error in your R code, RMarkdown can be quite finicky, and it’s best to catch mistakes early to avoid frustration later.
Keep It Clean: Only include code that is essential for producing your results. Functions like head() or View() are useful when you’re developing your code, but they tend to produce unnecessary output when you’re documenting it for others.
Suppress Unnecessary Messages: When you’re writing a code chunk, you can suppress warnings and messages by adding message=FALSE, warning=FALSE to your chunk options like so:
```{r, message=FALSE, warning=FALSE}```
This helps keep your document clean and focused.
Adjust Plot Sizes: You can also adjust the size of the plots that R generates by specifying the size in your code chunk options. For example: ```{r, fig.width=6, fig.height=4}``` Customizing plot sizes is often a good idea to ensure they fit well within your document.
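If you find yourself repeating the same options in every chunk, one option is to set them once as defaults near the top of the document. The sketch below uses `knitr::opts_chunk$set()` and assumes the knitr package is installed (it is what RMarkdown uses to run chunks); the specific values simply reuse the ones from the pointers above.

```r
# Place this inside the first code chunk of the .Rmd file (often named "setup").
# It sets defaults for every later chunk; an individual chunk can still
# override any of them in its own chunk options.
knitr::opts_chunk$set(
  message = FALSE,   # hide package startup and other messages
  warning = FALSE,   # hide warnings in the rendered document
  fig.width = 6,     # default figure width, in inches
  fig.height = 4     # default figure height, in inches
)
```

Hiding warnings keeps the rendered document tidy, but it is worth reading them while you develop the code, before switching them off.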
For even more information, take a look at the RMarkdown book, which offers a deep dive into everything you can do with this powerful tool.