Chapter 09: Reproducible analyses

Making your analyses understandable, reproducible, and extendable

Motivating scenario: You are working on a data analysis project. You want to be able to share your efforts with your future self and with other people.

The most important tool for #ReproducibleResearch is the mindset, when starting, that the end product will be reproducible.

Keith Baggerly

One of my roles as a professional scientist is to serve as a data editor for The American Naturalist. This means that I evaluate the data and code repository for each paper that is likely to be published in that journal.

My role at The American Naturalist is part of a growing movement for open science: the idea that the products of science should be “FAIR, that is, Findable, Accessible, Interoperable and Reusable” (see this article, and this article too). This means that both the raw data and the procedures used to analyze those data should be available and easy for anyone to pick up and build on. The continued refinement of our scientific understanding by sharing data and code is essential for

  1. Rapid and continued scientific progress,
  2. The spread of novel methods and ideas, and
  3. The integrity of scientific research - both in the eyes of the public and the scientific community.

Additionally, open science is part of an ethical obligation to make the products of publicly funded research available to the people who support it. By embracing open science, we ensure that knowledge is not locked behind paywalls or limited to select groups but is instead freely available to foster innovation, education, and societal advancement.

In addition to these high-minded principles, there is a more pragmatic and selfish reason to make your work reproducible. As a scientist, “your closest collaborator is you six months ago, but you don’t reply to emails” (quote attributed to Mark Holder). This humorous observation highlights a very real challenge. Research is complex and can stretch over months or even years. Without well-documented data and code, it can be extremely difficult to retrace your steps, understand your past analyses, or explain your results. By keeping a good record of your workflows, you make your future self’s life much easier. Whether you need to revisit an analysis, explain a decision to a reviewer, or tweak a model based on new insights, maintaining clear, organized, and reproducible records ensures that your work is understandable and adaptable.

Open and reproducible data

A key to the scientific method is reproducibility. This is why scientific papers have a methods section. Nowadays, the internet allows for even more detailed sharing of methods. Additionally, in most fields it is expected that data are made available after publication on repositories like Dryad or DRUM. The previous chapter discussed best practices in collecting and storing data.

Reproducible code

So far, we have discussed how to ensure that your data are open and accessible. But what you do to your data and how you analyze them is as much a part of science as how you collect them. As such, it is essential to make sure your code:

  1. Reliably works, even on other computers, and
  2. Can be understood by other people.
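
As a minimal sketch of what "reliably works on other computers" can look like in practice, the R script below builds file paths relative to the project folder, fixes the random seed, and records the software versions used. The folder and file names (`data`, `results`, `toy_heights.csv`, `session_info.txt`) and the toy dataset are hypothetical, made up only for illustration. This is one possible approach, not something taken from the journal's policy.

```r
# Build paths relative to the project folder rather than hard-coding an
# absolute path that exists on only one machine.
# The folder and file names here are hypothetical.
data_dir    <- "data"
results_dir <- "results"
dir.create(data_dir, showWarnings = FALSE)
dir.create(results_dir, showWarnings = FALSE)

# Fix the random seed so that steps involving random numbers give the
# same answer every time the script is run.
set.seed(2024)

# A toy dataset standing in for real measurements (made up for this sketch).
toy_data  <- data.frame(id = 1:10, height_cm = rnorm(10, mean = 50, sd = 5))
data_file <- file.path(data_dir, "toy_heights.csv")
write.csv(toy_data, data_file, row.names = FALSE)

# Read the data back in through the same relative path and summarize it.
heights     <- read.csv(data_file)
mean_height <- mean(heights$height_cm)

# Record the R version and loaded packages alongside the results, so that
# a future reader knows what software produced them.
writeLines(capture.output(sessionInfo()),
           file.path(results_dir, "session_info.txt"))
```

Because every path is built with `file.path()` from the project folder, anyone who copies the whole folder can run the script without editing it, provided their working directory is set to that folder.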

Here are the principles from The American Naturalist’s policy about this:

REQUIRED:

RECOMMENDED:

R Markdown

Writing in RMarkdown

While sometimes it makes sense to separate your code from the products it generates, we often want to create a single document that contains both the code and its output. RMarkdown allows you to do just that. You can write text just like you would in any document (in fact, this entire book is written in RMarkdown) and then insert code in designated “code chunks.”
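
To make the idea of a code chunk concrete, here is a minimal sketch of what an RMarkdown file might look like. The title, the chunk label `summarize-cars`, and the choice of R's built-in `cars` dataset are assumptions made for illustration; they are not taken from this book's own source files.

````markdown
---
title: "A minimal example"
output: html_document
---

Text is written here just as in any other document.

```{r summarize-cars}
# Code inside a chunk is run when the document is knit,
# and its output appears directly below the chunk.
summary(cars$speed)
```

The mean speed was `r mean(cars$speed)` mph.
````

When this file is knit, the summary table appears beneath the chunk, and the inline `r mean(cars$speed)` expression is replaced by the computed value, so the text and the numbers it reports cannot drift apart.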

To get started with RMarkdown, I recommend reading two helpful tutorials: one from Our Coding Club and another from Reproducible Medical Research with R. These guides should give you a solid foundation.

That said, let me offer some additional pointers to help you as you dive into RMarkdown:

For even more information, take a look at the RMarkdown book, which offers a deep dive into everything you can do with this powerful tool.

Figure 1: The accompanying getting data quiz link.

References