3.1 Why Data Management?

Why are we talking about data management in the first place? Several factors have brought attention to good data management. Today we are going to learn about some of the mandated reasons, as well as some personal, practical reasons to care about how you manage your data.


3.1.1 Funding Agency Requirements

First, if you have applied for a grant recently, you may have noticed that some funders require data sharing or data management plans as part of the application. Since 2003, the NIH has required a data sharing plan for grant applications requesting $500,000 or more in direct costs in any single year, and a new policy that applies to all research funded by the NIH will go into effect in 2023.


3.1.2 Publisher Requirements

In addition, an increasing number of top-tier publishers are establishing data sharing requirements:

PLOS was the first publisher to require that authors show proof that they have shared their data somewhere.

Other journals like Nature, Science, and Cell have language written into their policies stating that data necessary to understand and assess the conclusions of a manuscript must be shared.


3.1.3 Why Data Management? – What’s in it for me?

OK, so we know we should practice good data management because it’s required as part of the grant and publishing processes. Because of this, data management can feel like a chore, just one more hoop to jump through. But really, data management is important because it supports core scientific values like transparency and reproducibility. Also, since data from one project can contribute to another, it helps to further scientific progress and innovation. Considering, for example, our current situation with COVID-19, it is easy to see how important open and well-documented data is for understanding the spread of the disease in a timely manner.


I mentioned reproducibility, so let’s talk about that for a minute. Many of you are probably familiar with this Nature survey of over 1,500 researchers, which found that over 70% of respondents had failed to reproduce another scientist’s experiments and more than half had failed to reproduce their own. The highlighted section on the right-hand side shows that a number of respondents cited data management issues, such as unavailability of methods, code, and raw data, as contributing to difficulties in reproducing research.


But what do you personally get out of good data management? Consider the scenarios on this slide. Are any familiar to you? Let’s say you work on a research team. If a graduate student finishes their degree and leaves the project, what happens to their data when they are gone? Can the data be understood without that person to explain it? When someone new joins the team, what is the onboarding process like? Are the methods well documented? I think everyone has probably had experience with losing data, maybe even losing a storage device or having a device become obsolete. Or maybe you have had a hard time understanding what you meant when you re-read notes or looked at data that was collected some time ago. Would someone else be able to understand the information if they needed to? Good data management can help in situations such as these.


So while having well-managed data that you can share is often a requirement, and is good for scientific efforts overall, it is also really beneficial to you as a researcher. It will help you by:

  • Improving the organization and efficiency of your project
  • Making the process more comprehensible to you, your team, and others
  • Increasing your research impact by helping you produce high-quality data (which may make you more likely to be cited)
  • Ensuring that you and others can access data in the future
  • Preventing data loss

3.1.4 Don’t end up here!

One risk of poor data management, aside from not getting your study published in the first place, is retraction of an article. Retractions are not always caused by deliberate falsification of data.

Consider this example from the New England Journal of Medicine:

The editors found multiple errors in a table within the paper, and since the authors could not find the primary data, their paper was retracted. So poor data management can have very serious consequences.