1.1 Introduction

The plural of anecdote is data.

Raymond Wolfinger (1969/70, see link)

Motivation

Why do we need data science?

One argument begins with the insight that we need certain mental capacities and skills to deal with the complex demands of modern society. By analogy to a person’s ability to read and write, this set of capacities and skills has been referred to as some sort of literacy: Whereas the terms statistical literacy and numeracy emphasize the ability to deal with quantitative information, the term risk literacy emphasizes the ability to understand risk-related information. We will use the term data literacy as an umbrella term that includes all these terms and additionally emphasizes the skills needed to use tools that allow transforming and making sense of data.

The call for more data literacy is not a new phenomenon (see, e.g., Gigerenzer, 2002, 2014). A quote commonly attributed to the science fiction writer H.G. Wells reads:

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.

H.G. Wells (1903, paraphrased by S.S. Wilks, see link)

Figure 1.1: First sentences of an editorial in The Lancet (1937).

Figure 1.1 shows the first sentences of an anonymous editorial on “Mathematics and Medicine” that appeared in The Lancet (in 1937). Although this editorial is more than 80 years old, its verdict on the troubled relationship between mathematics and medicine still holds true today. But misunderstandings of numbers and statistics are not confined to medical information. Today, any news report is likely to contain a variety of numeric facts that a large proportion of its audience finds hard or impossible to understand. Decision makers in families, corporations, and countries are urged to take numeric information into account, but would find it difficult to explicate how various quantitative measures were derived.

The term collective statistical illiteracy refers to the widespread inability to understand the meaning of statistical facts and numbers. Unfortunately, statistical illiteracy is widespread not only among doctors, but also among experts from other areas (including scientists), and is more likely the norm than the exception (see Gigerenzer, Gaissmaier, Kurz-Milcke, Schwartz, & Woloshin, 2007, for examples).

As a consequence, we have politicians, parties, and societal groups that seem incapable of understanding basic facts and are unable or unwilling to make sound judgments on the basis of evidence. In 2020, we are in the midst of an epidemic, and statistical literacy does not seem as widespread as deadly viruses, dangerous ignorance, blatant lies, and self-serving distortions.

The problem of health illiteracy:

Referred to as the silent epidemic, health illiteracy is the inability to comprehend and use medical information that can affect access to and use of the health-care system. (…)
Although it is estimated that up to half of US adults have trouble interpreting medical information, the exact number is unknown because a reliable national health literacy measurement method is not available.

Editorial of The Lancet (2009, Vol 374 December 19/26, p. 2028)

So let’s agree that a boost in data literacy would be desirable and may perhaps even be necessary. But how does this relate to the new buzzword of data science? Well, somebody needs to gather and select data, analyze and process it in appropriate ways, and present the results in a transparent fashion.

This raises new questions:

  • What is the subject matter and scope of data science?
  • Which skills and tools do data scientists need?