The plural of anecdote is data.
Raymond Wolfinger (1969/70)
In this section, we first reflect on the notions of “data” and “science” to obtain an idea of what “data science” could be. Beyond introducing some key issues, this serves to develop a common terminology, but also to realize some important assumptions and implications of data science.
What is data? Interestingly, we are entirely familiar with the term, but find it difficult to explain its meaning. We will first consider some candidate definitions and then reflect on the connotations evoked by the term.
According to Wikipedia, data “are characteristics or information, usually numerical, that are collected through observation.” This is not very satisfying, as phrases like “characteristics or information” and “collected through observation” are quite vague and “usually numerical” allows for exceptions.
A second definition comes to the rescue: “In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable.” This sounds smart and may even be true, but the trick here is to delegate the definition to related concepts and hope that the person asking the original question will give up: If data are the values of variables, what are those?
An important aspect for understanding data is that the word is the plural of a singular datum, which is a Latin term for “something given.” Thus, the entity given denotes (or “represents”) something else. More explicitly, data consists of representations with a characteristic structure: Data are signs or symbols that refer to and describe something else.5 The data we usually deal with consists of written symbols (e.g., text) and measurements (e.g., numbers) that were recorded for some purpose.
For instance, if someone aimed to collect or compare health-related characteristics of some people, he or she would measure and record these characteristics in some file. In such a file of health records, some values may identify persons (e.g., by name or some ID code), while others describe them by numbers (e.g., their age, height, etc.) or various codes or text labels (e.g., their address, profession, diagnosis, notes on allergies, medication, vaccinations, etc.). The variables and values are typically stored in tabular form: If each of the table’s rows describes an individual person as our unit of observation, its columns are called variables and the entries in the cells (which can be referenced as combinations of rows and columns) are values. The contents of the entire table (i.e., the observations, variables, and values) are typically called “data.”
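To make this concrete, here is a minimal sketch of such a table in R (the language used throughout this book); all names and values here are invented for illustration:

```r
# A toy file of health records:
# each row describes a person (our unit of observation),
# each column is a variable, and each cell entry is a value.
health <- data.frame(
  id      = c("p1", "p2", "p3"),
  name    = c("Ada", "Ben", "Cem"),
  age     = c(34, 51, 28),           # in years
  height  = c(168, 180, 175),        # in cm
  allergy = c(TRUE, FALSE, FALSE)    # notes on allergies, as a logical value
)

health$age[2]  # the value of the variable 'age' for the 2nd observation: 51
nrow(health)   # number of observations (rows): 3
ncol(health)   # number of variables (columns): 5
```

Referencing a cell as a combination of row and column (e.g., `health$age[2]`) directly mirrors the description above.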
Note that the unit of observation (shown as the rows of our data file) does not always need to be an individual person. For instance, we could be interested in the characteristics of a region, country, or time period.
If the idea of some table stored as a file on some drive, server, or USB stick seems too mundane for your tastes, ask yourself about the “data” needed to describe a biological organism.
What data are required to discover a new species?
What are the data for or against human-made climate change?
And what data is provided by the DNA of your cells?
Although these appear to be lofty questions, collecting evidence to answer them requires translating various measurements or sensory signals (e.g., waves of sound or light) into numbers, words, diagrams, or maps.
Collecting this information and creating displays for understanding and communicating it will typically involve some types of table with variables for some unit of observation.
What is being measured and recorded crucially depends on the purpose (i.e., the question to be answered) and existing conventions (i.e., the field of investigation). Thus, the “data” is whatever we consider to be useful for answering our questions.
Rather than defining data, this section explicated the term by introducing some related notions (e.g., representations, variables, values). This is possible — and perhaps inevitable — as we are all familiar with the term and know how to use it in various contexts. Some related discussions in this context include:
The distinction between raw data (e.g., measurements, inputs) and processed data (results or interpretations): Data are often portrayed as a potential resource: Something that can be collected or generated, harvested, and — by processing — refined or distilled to turn it into something more valuable. The related term data mining also suggests that data is some passive raw material, whereas the active processing and interpretation of data (by algorithms, rules, instructions) can generate interest and revenue.
The distinction between data and information: In contrast to data, the term information has a clear formal definition (see Wikipedia: Information theory). And while data appears to be neutral, information is usually viewed as something positive and valuable. Hence, when data is being used and perceived as useful, we can generate insight, knowledge, or perhaps even wisdom from it?
The field of signal detection theory (SDT, see Wikipedia: Detection theory) distinguishes between signal and noise. Does data include both signal and noise, or should we only count signals as data?
Note that all these distinctions attribute value to data in the context of some human interest.
The image of data
Irrespective of its definition, data appears to have a split personality.
On the one hand, data clearly has an image problem.
For many, the term ‘data’ sounds as cold and uncomfortable as ‘math’ or ‘homework’ — they may have their uses, but are perceived as dry and boring, and definitely not woke or sexy. A somewhat nerdy and inhuman image of data is also evoked by the fictional character Data in the science fiction series Star Trek: The Next Generation (see Figure 1.2).
As Data is somewhere in between a computer and a human being, his skills and quirky limitations inspire both awe and alienation. To quote from Wikipedia: Data (Star Trek): “His positronic brain allows him impressive computational capabilities. He experienced ongoing difficulties during the early years of his life with understanding various aspects of human behavior and was unable to feel emotion or understand certain human idiosyncrasies, inspiring him to strive for his own humanity.”
But the image of data is not always portrayed in a negative or odd fashion. Recently, people have been alerted to the value of data and several tech companies are specializing in collecting and trading data. In this context, the job of a ‘data scientist’ has been touted as the “sexiest job of the 21st century” (Davenport & Patil, 2012).
How can we explain this discrepancy? The basic idea of the more positive view is that data is a raw material or resource that can be used and exploited — and eventually turned into something more valuable. Thus, data science is the alchemy (see Wikipedia: Alchemy) of the digital age: Just like medieval scholars aimed to purify and transform base materials (like lead) into noble materials (like gold), modern data scientists aim to create value and insights out of bits and bytes.
An assumption underlying these efforts is that data is the foundation of various information types that can eventually be turned into knowledge and wisdom. Figure 1.3 shows their arrangement in a hierarchical structure of a DIKW pyramid:
Although it is difficult to explicate the distinguishing features of these terms, it is clear that data is more basic than knowledge, and that wisdom is both rarer and more valuable than data or information.
The key distinction between the lower base and the upper tip of the pyramid is that data needs to address some hypothesis or answer some question to be useful, and must be interpreted and understood to become valuable. Thus, data can only analytically be separated from information, knowledge, and wisdom. In practice, all layers of the pyramid mutually support and complement each other: Using data in a smart and systematic fashion requires a lot of knowledge, skill, and wisdom; and successfully interpreting and understanding data informs and transforms our knowledge and wisdom.
Another reason for the close interaction between data and theory is that we need theoretical models for understanding and interpreting data. When analyzing data, we are typically interested in the underlying mechanisms (i.e., the causal relationships between variables). Importantly, any pattern of data can be useless and misleading when the data-generating process is unknown or ignored. Knowledge or at least assumptions regarding the causal process illuminate the data and are required for its sound interpretation. The importance of theoretical assumptions for data analysis cannot be overestimated (see, e.g., the notion of causal models and counterfactual reasoning in Pearl & Mackenzie, 2018). Thus, pitting data against theory is nonsense: Using and understanding data is always based on theory.
Where else does the term ‘data’ appear in textbooks, news, and media reports? Which connotations and evaluations are evoked by the term?
Rather than defining data, perhaps we can distinguish between a wide and a narrow sense of the term:
wide sense: Any signal or phenomenon that can be measured, recorded, or quantified.
narrow sense: Encoded information, serving someone’s interest(s).
Does this capture your understanding of the term? Or is the wide sense too wide and the narrow sense too narrow?
Above, we used the term representations to capture the characteristic structure of data: Data is “something given” that denotes or describes something else. A more abstract way of expressing this is that a key property of representations is their intentionality or aboutness: They are things that represent something else (see the related notions of signs and symbols).
One way to better understand data is by studying the properties of representations. Representations are the symbolic or physical instantiations of data. Despite the abstract term, representations are concrete: To become manageable and manipulable, data needs to be represented (e.g., as categories, numbers, or words).
A key insight when dealing with representations is that the assignment relation between some concept (e.g., a number) and its representation is somewhat arbitrary. And as different representations have different properties, it matters how information is represented. To illustrate this point, Figure 1.4 shows how different numeral systems represent the numbers 1 to 12 (from De Cruz et al., 2010):
Figure 1.4 shows five different symbolic representations of the number ‘one.’ Note that it would be easy to think of additional ways to represent the same number. For instance, “uno,” “eins,” and “egy” represent the same number in different languages. Interestingly, both the Arabic and the binary representation show a single digit \(1\), but the two representations still differ. For instance, adding \(1+1\) requires different symbolic operations in each system.
In addition to using different symbols, different representations require different operations. This becomes obvious when calculating in different representational systems. Whereas it may seem easy to compute the sum of \(3920 + 112\), we find it rather difficult to add up \(MMMDCCCCXX + CXII\). A lot of this impression is simply due to our extensive experience with the Hindu-Arabic system, as opposed to Roman numerals. However, beyond mere effects of familiarity, different representations require different trade-offs (e.g., between external and internal operations). For instance, adding and multiplying with Roman numerals is quite simple and straightforward. Compared to Hindu-Arabic numerals, using Roman numerals requires less memory capacity, but a larger number of symbols (see Schlimm & Neth, 2008).
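Base R happens to include a small utility for Roman numerals, which lets us play with such translations directly (the numbers below are arbitrary examples; note that `as.roman()` only represents numbers from 1 to 3899 and uses the standard subtractive notation, so the additive variant \(MMMDCCCCXX\) from the example above lies outside its scope):

```r
# Converting between Hindu-Arabic and Roman representations
# (using the as.roman() function from the utils package in base R):
as.roman(112)                   # CXII
as.integer(as.roman("CXII"))    # back to an integer: 112
as.roman(112) + as.roman(8)     # CXX: arithmetic works by converting internally
as.roman(1:12)                  # the numbers shown in Figure 1.4
```

That the addition above works at all illustrates the trade-off: the machine translates Roman numerals into an internal integer representation before computing, rather than manipulating the Roman symbols themselves.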
When dealing with representations, the distinction between a variable and its values can be expressed as follows:
A variable is a dimension or property that describes a unit of observation (e.g., a person) and can typically assume different values.
By contrast, values are the concrete instantiations that a variable assigns to every unit of observation and are further characterized by their range (e.g., categorical vs. continuous values) and their type (e.g., logical, numeric, or character values).
For instance, an individual can be characterized by the variables name, age, and whether or not s/he is classified as an adult. The values corresponding to these variables would be of type text (e.g., “Lisa”), numeric (e.g., age in years), and logical (TRUE vs. FALSE, defined as a function of age).
In scientific contexts, variables are often defined as numeric measures and computed according to some rule (e.g., by some mathematical formula) from other variables. For instance, the positive predictive value (PPV) of a diagnostic test is a conditional probability \(P(TP|T+)\) that can be computed by dividing the test’s hits (e.g., the frequency count of true positive cases, \(TP\)) by the total number of positive test outcomes (\(T+\)). Given its definition (as a probability), the range of values for the variable PPV can vary continuously from 0 to 1.
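As a minimal sketch, the PPV definition from the preceding paragraph can be computed directly; the frequency counts below are invented for illustration:

```r
# Hypothetical frequency counts from some diagnostic test:
TP <- 45          # true positives (condition present, test positive)
FP <- 55          # false positives (condition absent, test positive)

pos <- TP + FP    # total number of positive test outcomes (T+)
PPV <- TP / pos   # positive predictive value: P(TP|T+)
PPV               # 0.45, i.e., within the range from 0 to 1
```

Computing the variable PPV as a function of other variables (here: TP and T+) is exactly the kind of rule-based definition described above.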
One difficulty in addressing the properties of representations is that we cannot do so without using representations. For instance, language usually uses words (i.e., a type of representation) to refer to things, but can also talk about representations. In written language, we can use quotation marks or emphasis to indicate that we are talking about a word, rather than use it in its ordinary meaning. For instance,
- The word “rhythm” is difficult to spell.
- The word “word” is a four-letter word.
Types of data
Elementary data types:
- truth values (logical)
- numbers (numeric)
- text (characters, strings)
Other data types:
- dates and times
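In R, these data types correspond to atomic vector types and some derived classes; the specific values below are arbitrary examples:

```r
# Elementary data types, inspected with typeof():
typeof(TRUE)    # "logical": a truth value
typeof(1.5)     # "double": a number
typeof(2L)      # "integer": a whole number
typeof("Lisa")  # "character": text

# Dates are numbers with a class attribute:
today <- as.Date("2020-10-02")
class(today)            # "Date"
typeof(unclass(today))  # "double": days since 1970-01-01
</test>
```

The date example shows that even “other” data types are built out of the elementary ones, plus an interpretation.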
Exactly how numbers or text relate to actual phenomena involves many issues of representation and measurement.
Shapes of data
Elementary shapes of data:
- 1D: vectors and lists
- 2D: matrices and tables
- \(n\)D: arrays
- non-rectangular data
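These elementary shapes map directly onto R’s basic data structures; the contents below are arbitrary examples:

```r
# Shapes of data in R:
v <- 1:6                            # 1D: a vector of length 6
m <- matrix(v, nrow = 2)            # 2D: a 2 x 3 matrix
a <- array(1:24, dim = c(2, 3, 4))  # nD: a 2 x 3 x 4 array
l <- list(num = v, txt = "hi")      # lists can hold non-rectangular data

dim(m)     # 2 3
dim(a)     # 2 3 4
length(l)  # 2: elements of a list may differ in type and length
```

Note that a matrix is just an array with two dimensions, whereas lists drop the rectangularity requirement altogether.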
A further distinction is between structured data (e.g., tables with defined variables and values) and unstructured data (e.g., free-form text, images, or audio).
Data wrangling as (a) changing between scales or data types and (b) re-shaping data:

- Re-coding data in a different form or type.
- Transforming the same data into a different shape (e.g., one table into another).
Identify and distinguish between different data moves (a term used by Tim Erickson): doing something with data to change it in order to solve some task. This differs from applying a function: A function uses data, but does not necessarily change it.
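Both kinds of wrangling can be sketched in a few lines of R; the variables and values below are invented for illustration:

```r
# (a) Re-coding: changing a variable's form or type
# (here: character labels to a factor to integer codes):
sex <- c("f", "m", "f")
as.integer(factor(sex))  # 1 2 1: same data, different type

# (b) Re-shaping: transforming one (wide) table into another (long) one,
# using the reshape() function from base R's stats package:
wide <- data.frame(id = 1:2, t1 = c(10, 20), t2 = c(11, 21))
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2"), v.names = "score",
                timevar = "time", idvar = "id")
nrow(long)  # 4: one row per id-by-time combination

# Contrast with applying a function that uses data without changing it:
mean(wide$t1)  # 15
```

Both re-coding and re-shaping are data moves in Erickson’s sense: the data is changed to serve some task, whereas `mean()` merely reads it.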
What is science?
Naive view of science: Discovery of facts and phenomena.
A parody of science
See popular conspiracy theories: To gain critical distance from so-called “fake news” and “fake facts,” many people set out to discover the truth for themselves.
Research driven by some individual’s perspective and interests. Googling some detail: Fine, if it fits into my opinion or theory. If it doesn’t fit: Can I make some adjustment to my current theory that makes it fit?
They claim to be skeptical and put on scientific hats, but actually perform a parody of real science.
What is unscientific about this?
Not primarily the facts and conclusions, although many of them are blatantly wrong on many levels. Instead, the main problem lies in their motivation and methodology.
Key question: What role does scientific evidence play in the debate?
Many opinions and positions masquerading as “science” are really just pseudo-science or bullshit (BS, see Frankfurt, 2009, for a definition).
Polluting the world with BS is far easier than cleaning it up — a fact that has been elevated to the status of a principle (see Bergstrom & West, 2021, p. 11):
The amount of energy needed to refute bullshit is an order of magnitude bigger than [that needed] to produce it.
Alberto Brandolini (2014)
Hence, questioning and doubting is cheap, but discovering mechanisms and truths is difficult and challenging. Importantly, making a “scientific argument” does not consist in looking for a scientific study that conforms to one’s existing opinion.
Science is not really an edifice of facts and theories, but primarily a method and a mindset.
Any method or tool can also be abused. Selectively using science to support an existing opinion is an abuse of science. Thus, the fact that people can cite some study to support an opinion is not an indication of a scientific mindset.
What is a scientific mindset?
A hallmark of science: Openness for data, looking for evidence irrespective of whether it confirms or refutes one’s current assumptions and beliefs.
Theoretically, looking for disconfirming evidence is more successful than merely seeking confirmation.
It would be naive to assume that science is objective, as any scientific framework is also based on premises and assumptions. However, it is no contradiction to strive for objectivity.
Mutual interaction of concepts, theoretical hypotheses, and empirical facts.
Two problems with conspiracy theories:

- Misused skepticism: Skepticism and doubt can be good, but to serve their function, they must be honest and unbiased, rather than being used in a self-serving and egocentric fashion that only caters to one’s own interests and preferences. While evidence is good, selective evidence is worthless and misleading.

- Lacking truth criteria: Beyond universal doubt, we also need criteria for accepting theories, facts, and findings.
Key features of actual science:
Concepts, models, and theories: Describing phenomena by constructs and developing hypotheses about relationships between constructs. Need for criteria for determining the truth and for agreeing on assumptions and findings.
Finding out facts vs. developing measures and methodologies.
Many social elements: A community of practitioners needs to agree on criteria and methods.
- Agreeing on units and the value of constants (e.g., \(\pi\))
- Agreeing on accepted methods: internal vs. external validity
- Agreeing on criteria for evaluating results (e.g., statistical methods) and accepting conclusions
- Dealing with uncertainty (rather than delivery of certainty): Contrary to the expectations of politicians and critics, science does not discover eternal truths and certainties. Instead, it allows for multiple views (as long as they do not contradict current evidence) and changes its mind when evidence changes.
Two important dichotomies:
Specialization (digging deeper) vs. abstraction (generalization). See our matrix lens model…
Fitting vs. prediction: Explanation is cheap, but prediction is what really counts.
1.2.3 Data science
Now we can return to our initial question: What is data science?
Conundrum: We mentioned above that data is typically considered dry and dull, rather than sexy. If this is so, it should be surprising that data scientist is touted as the “sexiest job of the 21st century” (Davenport & Patil, 2012). What is the secret superpower that endows a dull topic with sex appeal? Reason for the differing evaluations: Data is increasingly ubiquitous and considered to be a valuable resource for doing other things…
Meanings differ depending on whom you ask…
As an academic discipline, we primarily focus on conceptual foundations and aim for a sound methodology: Developing reliable and valid methods for collecting, processing, and deriving conclusions from data.
An inevitable question:
- What is the relation of data science to statistics?
Which skills do we need?
Need for statistics?
Need for programming?
fortune(52) of the fortunes package:
Can one be a good data analyst without being a half-good programmer?
The short answer to that is, ‘No.’
The long answer to that is, ‘No.’
Frank Harrell (1999), S-PLUS User Conference, New Orleans
Similar arguments can be made for the disciplines of mathematics, philosophy, formal logic, psychology…
Tukey argued for detective skills (Tukey, 1969). And what about investigative skills, journalism, ability to tell a story and express numeric facts as narratives?
Key skill: Dealing with data.
Data scientists are the people who discover, mine, select, organize, transform, and present information.
Analyzing information: Transforming and processing data (in an appropriate fashion)
Organizing information: Importance of documentation, explication, and practices of reproducible research (e.g., in R Markdown, see Section 1.3).
Communicating results: Designing information and displays.
Dealing with data in an appropriate and responsible fashion requires a lot of abstract and methodological knowledge, but also mastery of corresponding tools…
I suppose it is tempting, if the only tool you have is a hammer,
to treat everything as if it were a nail.
Abraham H. Maslow (1966, p. 15f.)
Which tools should we use to process and analyze data?
A recent anecdote illustrates the troubles of importing data:
Importing data is an early and usually mundane step in the process of data analysis. Under ideal circumstances, reading data would be so seamless that it would remain unnoticed.
The fact that books include chapters on Importing data (into R) reminds us that our world is not ideal: Depending on their sources and types, messy datasets can be difficult to read. This is unfortunate, as it often prevents people from using R and drives them to less powerful software that may seem easier or more convenient.
A striking example of bad software choices is Public Health England’s (PHE) recent decision to import CSV files on Covid-19 test results into Microsoft Excel’s XLS file format. Due to this file format’s artificial limit of 65,536 rows of data, nearly 16,000 coronavirus cases went unreported in the U.K. (which amounts to almost 24% of the cases recorded in the time span from Sep 25 to Oct 2, 2020). Two noteworthy quotes from this BBC article (Kelion, 2020) are:
… one expert suggested that even a high-school computing student would know that better alternatives exist.
… insiders acknowledge that the current clunky system needs to be replaced by something more advanced that excludes Excel, as soon as possible.
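A defensive habit that would have caught this problem is to verify the number of rows right after importing a file. The following self-contained sketch writes a CSV file (with invented contents) that exceeds the XLS limit and checks that nothing was lost on re-import:

```r
# Write a CSV file with more rows than the XLS limit of 65,536,
# then read it back and verify that no rows went missing:
cases <- data.frame(id = 1:70000, result = "positive")
csv_file <- tempfile(fileext = ".csv")
write.csv(cases, csv_file, row.names = FALSE)

imported <- read.csv(csv_file)
nrow(imported)                            # 70000: R imposes no such row limit
stopifnot(nrow(imported) == nrow(cases))  # fail loudly if rows were dropped
```

The `stopifnot()` check costs one line and turns a silent data loss into an explicit error.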
When designing a tool, a possible goal could be to make it as versatile or flexible as possible. One way of increasing flexibility is combining a variety of tools in a toolbox. A typical Swiss Army knife provides a good example for this approach.
Toolbox analogy: Swiss knife with additional tools (see Section 1.1.3 on Terminology).
Interestingly, the toolbox achieves flexibility by combining many highly specialized tools.
Apply analogy to R: R can be thought of as a toolbox — but RStudio is a toolbox as well. Similarly, any R package is also a toolbox, with individual functions as its tools. Thus, we do not necessarily see an ordered hierarchy of systems (like an elaborate system of Matryoshka dolls: Swiss-knife-like tools in toolboxes, that are contained in more elaborate toolboxes), but rather a wild medley of boxes and tools.
A potential problem:
The curse of featuritis or feature creep (see Wikipedia): Many technological gadgets — and especially software tools — contain too many functions. As a consequence, the devices or programs get bloated and over-complicated, rather than simple and functional.
Given the ubiquity of computers (especially as phones and tablets, but also in watches or cars), we are getting used to universal machines. However, does a coffee machine also need to contain a clock or a radio, or to play games and videos? Do we really need a text editor that handles email, file browsing, and spreadsheet calculations?
Again, the Swiss knife analogy is helpful: We can add more and more features, but at some point, a knife gets clumsy. For instance, the Victorinox SwissChamp XAVT pocket knife is about as wide as it is long, weighs 350 grams, costs over EUR 300, and contains 83 functions (e.g., a watch, various wrenches, and a fish scaler). Its advertisement states that “versatility knows no limits” and that it is “fit for all tasks.” However, I suspect it mainly serves as an expensive collector’s item that is rarely used for anything.
Make everything as simple as possible,
but not simpler.
Albert Einstein (attributed)
Opposite idea: the KISS principle, short for “keep it simple, stupid” or “keep it short and simple” urges designers to avoid excessive complexity and keep things as simple and straightforward as possible. Many systems work best when they are reduced to their essential parts and functions.
The tension between flexibility and specialization is inevitable and will always be present. Striking the right balance can be difficult. It often is unclear what is “as simple as possible.” Especially when tasks and demands change dynamically, it can be hard or impossible to determine what is essential.
1.2.6 Fit for tasks
When is a tool a good one?
We have seen that there is an inevitable tension between specialization vs. universality. Our solution consists in embracing flexibility, simplicity, and a multiplicity of solutions, while being aware of key constraints.
Lesson to learn here: It is best to adopt a pragmatic perspective and use any tool that fits the job. Note that this use of “fit” implies a match between task and tool.
However, tasks do not occur in isolation: They are embedded in contexts that involve a certain environment and individuals or groups (often called agents or users) whose goal it is to solve the task.
These elements provide additional constraints — and the really tough ones are often not the technological ones, but people’s habits and historical and institutional conventions. Thus, the right tool needs to fit not only to the task, but also to the environment in which the task takes place and to the person using it.
Idea of ecological rationality (ER) (Todd et al., 2012) (where “ecological” means environmental fit, rather than “bio” or “green”): Triad of entities. Aim for a match between task, tool, and the user’s skills and capacities.
Figure 1.7 shows that ER is a fit between three entities:
Some task, strategy, or tool (e.g., some algorithm, physical toolbox, or software package).
The environment (e.g., the physical environment, but also commercial, institutional, media, or social environments).
The agent’s capacities and skills (e.g., cognitive and perceptual processes, but also knowledge and learned skills).
Designing for fit
Upshot of ER: Not only one, but multiple levers for promoting change.
Fit can be increased by education, but also by changing our environment, or by designing better tools.
Physical phenomena — like smoke from a fire or sound waves — can also become data when someone is interested in using and interpreting them. For convenience, we assume here that data has already been recorded in some representational form.