1.1 Introduction

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.

H.G. Wells (1903, paraphrased by S.S. Wilks)

Why do we need data science? One argument begins with the insight that we need certain skills and mental capacities to cope with the complex demands of modern society. By analogy to a person’s ability to read and write, such sets of skills and capacities have been referred to as kinds of “literacy”: Whereas the terms statistical literacy and numeracy emphasize the ability to deal with quantitative information, the term risk literacy emphasizes the ability to understand risk-related information. We will use data literacy as an umbrella term that spans all these concepts and additionally includes skills and tools for making sense of data.

Statistical thinking in medicine

The call for more data literacy is not a new phenomenon (see, e.g., Gigerenzer, 2002, 2014), as illustrated by the above quote from 1903 that is commonly attributed to the science fiction writer H.G. Wells. When looking for a domain that illustrates the desirability of a society with a high degree of statistical thinking or data literacy, our medical and health services are an ideal candidate. Over the past decades and centuries, rapid advances in biology and medicine have identified many risks and obstacles to healthy and fulfilled lives, as well as many means, habits, and cures for living better and longer. But the existing abundance of information and (frequently conflicting) advice on health and nutrition is difficult to navigate and understand, even for experts.

If large portions of our population are challenged by health-related information, what about professionals in the medical sciences? Again, the problem was identified quite a while ago:

Figure 1.1: First sentences of an editorial in The Lancet (1937).

Figure 1.1 shows the first sentences of an anonymous editorial on “Mathematics and Medicine” that appeared in The Lancet (in 1937). Although this editorial is more than 80 years old, its verdict on the troubled relationship between mathematics and medicine still holds true today. And as every area of biological and medical research is getting more and more specialized, the gap between generating scientific insights and understanding them may even be widening.

But misunderstandings of numbers and statistics are not limited to biological and medical information. Other scientific areas that produce vast amounts of data with important implications for our present and future societies include all branches of the natural sciences (e.g., chemistry, physics, and climate research) and of the humanities and social sciences (e.g., arts, economics, philosophy, and political sciences). Today, any news report is likely to contain a variety of scientific and numeric facts that a large proportion of its audience finds hard or impossible to understand. Decision makers in families, corporations, and countries are urged to take numeric information into account, but often find it difficult to understand and explicate how various quantitative measures were derived.

The term collective statistical illiteracy refers to the widespread inability to understand the meaning of statistical facts and numbers. Unfortunately, statistical illiteracy is ubiquitous not only among medical professionals, but also among experts from other fields (including scientists), and is more likely the norm than the exception (see Gigerenzer et al., 2007, for examples). As a consequence, we have (at least some) politicians, parties, and societal groups that are or appear to be incapable of understanding basic facts and are unable or unwilling to make sound judgments on the basis of evidence.

Health illiteracy

In 2021, we are still in the midst of an epidemic, and vaccinations by statistical literacy do not seem as widespread as deadly viruses, dangerous ignorance, blatant lies, and self-serving distortions. Although COVID-19 is a real threat, the widespread inability to understand basic facts and health-related information is a force that, in combination with biases and increasing polarization of rival interests, erodes trust in scientific facts, public institutions, and the fabric of our democratic society. Again, this problem is not new and has been called by a variety of names. In 2009, The Lancet identified health illiteracy as “the silent epidemic” (Editorial, 2009, p. 2028):

Referred to as the silent epidemic, health illiteracy is the inability to comprehend and use medical information that can affect access to and use of the health-care system. (…)
Although it is estimated that up to half of US adults have trouble interpreting medical information, the exact number is unknown because a reliable national health literacy measurement method is not available.

Editorial of The Lancet (2009, Vol 374 December 19/26, p. 2028)

Given the availability of sophisticated health-care to a majority of people, the sad irony of health illiteracy in many developed countries is that those services are “accessible to all but not understood by all” (ibid.). Thus, health illiteracy harms people in many modern societies.

Defining data literacy

A concise definition of data literacy is provided by Ridsdale et al. (2015, p. 2):

Data literacy is the ability to collect, manage, evaluate, and apply data, in a
critical manner.

While we agree that data literacy is “an essential ability required in the global knowledge-based economy,” and that “the manipulation of data occurs in daily processes across all sectors and disciplines” (ibid.), the following definition is more explicit: it mentions some preconditions, emphasizes both technical and reflective abilities and skills, and characterizes the desired outcomes:

Data literacy is the ability and skill of making sense of data.
This includes numeracy, risk-literacy, and the ability of using tools
to collect, transform, analyze, interpret, and present data,
in a transparent, reproducible, and responsible fashion.

From literacy to science

Most people would agree that a boost in data literacy would be desirable or necessary for achieving enlightenment in a modern society. But how does this relate to the new buzzword of data science? Well, statistics are based on data, and somebody needs to gather and select data, analyze and process it in appropriate ways, and present the results in a transparent fashion. As data is a key building block of science, it seems desirable that the collection, transformation, and evaluation of data are conducted in a scientific fashion. The term “data science” and a corresponding scientific discipline raise new questions:

  • What is data? And what is science?

  • What is the subject matter and scope of data science?

  • Which skills and tools do data scientists need and use?


Editorial. (1937). Mathematics and medicine. The Lancet. https://doi.org/10.1016/S0140-6736(00)86570-8
Editorial. (2009). The health illiteracy problem in the USA. The Lancet, 374, 2028. https://doi.org/10.1016/S0140-6736(09)62137-1
Gigerenzer, G. (2002). Reckoning with risk: Learning to live with uncertainty. Penguin.
Gigerenzer, G. (2014). Risk savvy: How to make good decisions. Penguin.
Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M., & Woloshin, S. (2007). Helping doctors and patients make sense of health statistics. Psychological Science in the Public Interest, 8(2), 53–96. https://doi.org/10.1111/j.1539-6053.2008.00033.x
Ridsdale, C., Rothwell, J., Smit, M., Ali-Hassan, H., Bliemel, M., Irvine, D., Kelley, D., Matwin, S., & Wuetherick, B. (2015). Strategies and best practices for data literacy education: Knowledge synthesis report. Dalhousie University. http://hdl.handle.net/10222/64578