Goal: Introduce some key concepts and topics.
Defining technical terminology not only provides a common language, but also allows us to reflect on important assumptions and implications.
What is data?
Interestingly, we are entirely familiar with the term, but find it difficult to define. We will not actually answer this question here, but it still merits some reflection.
According to Wikipedia, data “are characteristics or information, usually numerical, that are collected through observation.” Note: Data is plural (of a singular datum), “characteristics or information” sounds fishy, and “usually numerical” and “collected through observation” are not false, but dubious.
A second definition comes to the rescue: “In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable.” This sounds smart and may even be true, but the trick here is to delegate the definition to related concepts and hope that the person asking the original question will give up: If data are the values of variables, what are those?
Typical data: Measurements (numbers) or written descriptions (text). Example: Health records on a card or chip: Past check-ups and diagnoses, allergies and vaccinations, medications.
Lofty questions: Consider various sensory signals (sounds or light), numbers, words in text, maps, etc. A new species of fish is discovered: Is this data? A tree or its leaves? What about DNA?
Mundane version: Files stored on some drive, server, or USB stick are considered to be data. (Part of this identification may be based on the fact that someone purposefully made the effort of recording and saving them on a medium that allows future access and use.)
Notion of raw data: Data as a resource: Something that can be collected or generated, harvested, and — by processing — refined or distilled to turn it into something more valuable. (Note the related concept of data mining.)
Data vs. information: Data as passive element (raw material), rather than active elements (algorithms, rules, instructions)? When data is used and useful, we can generate insight, knowledge, or perhaps even wisdom from it?
Technical view from signal detection theory (SDT): Distinguishing a signal from noise? Note: Attributions of value assume some human interest and context.
- Wide sense: Any signal or phenomenon that can be measured, recorded, or quantified
- Narrow sense: Encoded information, subject to human interest
The image of data
Data seems to have an image problem (see the inhuman image of data in Figure 1.2). The notion of ‘data’ sounds like ‘math’ or ‘homework’: They may have their uses, but seem dry and boring, and most definitely not hip or sexy.
Note: The somewhat inhuman quality of data is apparent in the fictional character Data in the science fiction series Star Trek: The Next Generation. Quote from Wikipedia: “His positronic brain allows him impressive computational capabilities. He experienced ongoing difficulties during the early years of his life with understanding various aspects of human behavior and was unable to feel emotion or understand certain human idiosyncrasies, inspiring him to strive for his own humanity.”
Interestingly, the job of a data scientist has also been touted as the “sexiest job of the 21st century” (Davenport & Patil, 2012).
- How can we explain this discrepancy?
Basic idea: Data is a raw material or resource that can be used and exploited — and often turned into something more valuable.
- Data vs. knowledge: See the popular distinctions between data, information, knowledge, and wisdom, all of which have distinct connotations. Figure 1.3 shows their arrangement in a hierarchical structure of a DIKW pyramid:
Data seems dead, when the process generating the data is ignored. Caveat: It is important to distinguish between data and the data-generating process. Knowing details of the latter often illuminates the former (see Pearl & Mackenzie, 2018, for striking examples).
To become manageable and manipulable, data needs to be represented. Importantly, different representations have different properties.
Example: Numbers vs. numerals in various systems (see De Cruz, Neth, & Schlimm, 2010, for details).
Figure 1.4 shows different ways in which numeral systems represent numbers.
1 = I = “one”/“uno”/“eins” all represent the same number, but in different ways. This seems trivial, but try calculating in different representations.
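The difference between a number and its numeral representations can be made concrete in R: Base R’s utils package provides as.roman() for working with Roman numerals. A minimal sketch:

```r
# One number, two representations:
x <- as.roman(c(1, 4, 9, 14))
x                            # I, IV, IX, XIV

# Arithmetic operates on the underlying numbers, not on the symbols:
as.roman(9) + as.roman(5)    # XIV (i.e., 14): try adding IX + V symbolically!
as.integer(as.roman("XIV"))  # 14: back to the Arabic representation
```

Note that the addition works only because R silently converts the Roman representation into a numeric one, which illustrates the trade-offs between representations: Roman numerals are poorly suited for calculation.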
Different representations engage in different trade-offs (see Schlimm & Neth, 2008).
Intentionality (or aboutness) as a key property of representations: Something that represents something else. (See related notions of signs and symbols.)
Note a non-trivial circularity: Language usually uses words (i.e., representations) to refer to things, but can also talk about representations. In written language, we need to use quotation marks or emphasis to indicate that we are talking about a word. For instance,
- The word “rhythm” is difficult to spell.
- The word “word” is a four-letter word.
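In programming, this use-mention distinction is made explicit by quotation marks: A quoted string denotes the symbols themselves, not the thing they refer to. A small illustration in R:

```r
# Quotes mark that we are talking about the word (a character string),
# not about the concept it refers to:
nchar("word")    # 4: the word "word" is a four-letter word
nchar("rhythm")  # 6: six letters, however difficult to spell
```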
Types of data
Elementary data types:
- truth values (logical)
- numbers (integer, double)
- text (character strings)
- dates and times
Other data types:
Exactly how numbers or text relate to actual phenomena involves many issues of representation and measurement.
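In R, the elementary data types listed above can be inspected directly. A brief sketch:

```r
# Querying the type of elementary data in R:
typeof(TRUE)      # "logical":   a truth value
typeof(42L)       # "integer":   a whole number
typeof(3.14)      # "double":    a decimal number
typeof("hello")   # "character": text

# Dates are represented as numbers under the hood:
d <- as.Date("2020-10-02")
class(d)          # "Date"
typeof(d)         # "double" (days since 1970-01-01)
```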
Shapes of data
Elementary shapes of data:
- 1D: vectors and lists
- 2D: matrices and tables
- \(n\)D: arrays
- non-rectangular data
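The shapes listed above correspond directly to R data structures. A minimal sketch:

```r
v <- 1:6                            # 1D: a vector
m <- matrix(1:6, nrow = 2)          # 2D: a matrix (2 rows, 3 columns)
a <- array(1:24, dim = c(2, 3, 4))  # nD: a 3-dimensional array
l <- list(name = "R", year = 1993)  # non-rectangular: a list

dim(v)  # NULL: vectors have no dim attribute
dim(m)  # 2 3
dim(a)  # 2 3 4

# Tables of data are typically stored as data frames (2D):
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
dim(df)  # 3 2
```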
Explain difference between structured vs. unstructured data.
Data wrangling as (a) changing between scales or data types and (b) re-shaping data:
- Re-coding data in a different form or type.
- Transforming the same data into a different shape (e.g., one table into another).
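Re-shaping the same data into a different table can be illustrated with base R’s reshape() function. The data here are hypothetical (two persons measured at two times):

```r
# A hypothetical "wide" table: one row per person, one column per time point:
wide <- data.frame(id = 1:2,
                   t1 = c(10, 20),
                   t2 = c(11, 21))

# Re-shape into "long" format: one row per person-time observation:
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2"), v.names = "score",
                timevar = "time", idvar = "id")
long  # 4 rows: same data, different shape
```

Note that no information is added or lost: Both tables contain the same four measurements, arranged differently.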
Identify and distinguish between different data moves (a term used by Tim Erickson): Doing something with data to change it in order to solve some task. Difference from applying a function: A function uses data, but does not necessarily change it.
What is science?
Naive view of science: Discovery of facts and phenomena.
A parody of science
See popular conspiracy theories: To gain critical distance from so-called “fake news” and “fake facts,” many people set out to discover the truth for themselves.
Research driven by some individual’s perspective and interests. Googling some detail: Fine, if it fits my opinion or theory. If it doesn’t fit: Can I make some adjustment to my current theory that makes it fit?
They claim to be skeptical and put on scientific hats, but actually perform a parody of real science.
What is unscientific about this?
Not primarily the facts and conclusions, although many of them are blatantly wrong on many levels. Instead, the main problem lies in their motivation and methodology.
Two problems with conspiracy theories:
Biased skepticism: Skepticism and doubt can be good, but to serve their function, they must be honest and unbiased, rather than used in a self-serving and egocentric fashion that only caters to one’s own interests and preferences.
Lacking truth criteria:
Beyond universal doubt, we also need criteria for accepting theories, facts, and findings.
Key features of actual science:
Models and theories: Describing phenomena by constructs and developing hypotheses about relationships between constructs. Need for criteria for determining truth / agreeing on assumptions and findings.
Finding out facts vs. developing measures and methodologies.
Many social elements: A community of practitioners needs to agree on criteria and methods.
- Agreeing on units and even the value of constants (e.g., pi)
- Agreeing on accepted methods: internal vs. external validity
- Agreeing on criteria for evaluating results (e.g., in statistics)
- Delivery of certainty vs. dealing with uncertainty: Contrary to the expectations of politicians and critics, science does not discover eternal truths and certainties. Instead, it allows for multiple views (as long as they do not contradict current evidence) and changes its mind when evidence changes.
Two important dichotomies:
Specialization (digging deeper) vs. abstraction (generalization). See our matrix lens model…
Fitting vs. prediction: Explanation is cheap, but prediction is what really counts.
1.2.3 Data science
Now we can return to our initial question: What is data science?
Conundrum: We mentioned above that data is typically considered dry and dull, rather than sexy. If this is so, it should be surprising that data scientist is touted as the “sexiest job of the 21st century” (Davenport & Patil, 2012). What is the secret superpower that endows a dull topic with sex appeal? Reason for the differing evaluations: Data is increasingly ubiquitous and considered to be a valuable resource for doing other things…
Meanings differ depending on whom you ask…
As an academic discipline, we primarily focus on conceptual foundations and aim for a sound methodology: Developing reliable and valid methods for collecting, processing, and deriving conclusions from data.
An inevitable question:
- What is the relation of data science to statistics?
Which skills do we need?
Need for statistics?
Need for programming?
fortune(52) of the fortunes package:
Can one be a good data analyst without being a half-good programmer?
The short answer to that is, ‘No.’
The long answer to that is, ‘No.’
Frank Harrell (1999), S-PLUS User Conference, New Orleans
Similar arguments can be made for the disciplines of mathematics, philosophy, formal logic, psychology…
Tukey argued for detective skills (Tukey, 1969). And what about investigative skills, journalism, and the ability to tell a story and express numeric facts as narratives?
Key skill: Dealing with data.
Data scientists are the people that discover, mine, select, organize, transform, and present information.
- Analyzing information: Transforming and processing data (in an appropriate fashion)
- Organizing information: Importance of documentation, explication, and practices of reproducible research (e.g., in R Markdown, see Section 1.3)
- Communicating results: Designing information and displays
Dealing with data in an appropriate and responsible fashion requires a lot of abstract and methodological knowledge, but also mastery of the corresponding tools…
I suppose it is tempting, if the only tool you have is a hammer,
to treat everything as if it were a nail.
Abraham H. Maslow (1966, p. 15f.)
Which tools should we use to process and analyze data?
A recent anecdote illustrates the troubles of importing data:
Importing data is an early and usually mundane step in the process of data analysis. Under ideal circumstances, reading data would be so seamless that it would remain unnoticed.
The fact that books include chapters on Importing data (into R) reminds us that our world is not ideal: Depending on their sources and types, messy datasets can be difficult to read. This is unfortunate, as it often prevents people from using R and drives them to use less powerful software, which may seem easier or more convenient to use.
A striking example of bad software choices is Public Health England (PHE)’s recent decision to import CSV files on Covid-19 test results into Microsoft Excel’s XLS file format. Due to this file format’s artificial limit of 65,536 rows of data (i.e., 2^16 rows), nearly 16,000 coronavirus cases went unreported in the U.K. (which amounts to almost 24% of the cases recorded in the time span from September 25 to October 2, 2020). Two noteworthy quotes from this BBC article (Kelion, 2020) are:
… one expert suggested that even a high-school computing student would know that better alternatives exist.
… insiders acknowledge that the current clunky system needs to be replaced by something more advanced that excludes Excel, as soon as possible.
When designing a tool, a possible goal could be to make it as versatile or flexible as possible. One way of increasing flexibility is combining a variety of tools in a toolbox. A typical Swiss Army knife provides a good example of this approach.
Toolbox analogy: Swiss knife with additional tools (see Section 1.1.3 on Terminology).
Interestingly, the toolbox achieves flexibility by combining many highly specialized tools.
Apply analogy to R: R can be thought of as a toolbox — but RStudio is a toolbox as well. Similarly, any R package is also a toolbox, with individual functions as its tools. Thus, we do not necessarily see an ordered hierarchy of systems (like an elaborate system of Matryoshka dolls: Swiss-knife-like tools in toolboxes, that are contained in more elaborate toolboxes), but rather a wild medley of boxes and tools.
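The toolbox-within-toolbox idea can be made tangible in R itself: Every attached package is a box whose exported objects are the tools. A small sketch, using the stats package (which is attached in every standard R session):

```r
# How many tools does the pre-loaded "stats" toolbox contain?
tools <- ls("package:stats")
length(tools)   # several hundred functions

# A glimpse at some individual tools in this box:
head(tools)
"median" %in% tools  # TRUE: median() is one tool in the stats toolbox
```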
A potential problem:
The curse of featuritis or feature creep (see Wikipedia): Many technological gadgets — and especially software tools — contain too many functions. As a consequence, the devices or programs get bloated and over-complicated, rather than simple and functional.
Given the ubiquity of computers (especially as phones, tablets, but also watches or cars), we are getting used to universal machines. However, does a coffee machine also need to contain a clock, a radio, or play a game or video? Do we really need a text editor handling email, file browser, and spreadsheet calculations?
Again, the Swiss knife analogy is helpful: We can add more and more features, but at some point, a knife gets clumsy. For instance, the XAVT Swiss Army Swisschamp Xavt Pocket Knife, Multi is as wide as it is long, weighs 350 grams, costs over EUR 300, and contains 83 functions (including a watch, various wrenches, and a fish scaler). Its advertisement states that “versatility knows no limits” and that it is “fit for all tasks.” However, I suspect it mainly serves as an expensive collector’s item that is rarely used for anything.
Make everything as simple as possible,
but not simpler.
Opposite idea: The KISS principle, short for “keep it simple, stupid” or “keep it short and simple,” urges designers to avoid excessive complexity and keep things as simple and straightforward as possible. Many systems work best when they are reduced to their essential parts and functions.
The tension between flexibility and specialization is inevitable. Striking the right balance can be difficult: It is often unclear what “as simple as possible” means. Especially when tasks and demands change dynamically, it can be hard or impossible to determine what is essential.
1.2.6 Fit for tasks
When is a tool a good one?
We have seen that there is an inevitable tension between specialization vs. universality. Our solution consists in embracing flexibility, simplicity, and a multiplicity of solutions, while being aware of key constraints.
Lesson to learn here: Best to adopt a pragmatic perspective and use any tool that fits the job. Note that this use of “fit” implies a match between task and tool.
However, tasks do not occur in isolation: They are embedded in contexts that involve a certain environment and individuals or groups (often called agents or users) whose goal it is to solve the task.
These elements provide additional constraints — and the really tough ones are often not the technological ones, but people’s habits and historical and institutional conventions. Thus, the right tool needs to fit not only to the task, but also to the environment in which the task takes place and to the person using it.
Idea of ecological rationality (ER) (Todd, Gigerenzer, & the ABC Research Group, 2012) (where “ecological” means environmental fit, rather than “bio” or “green”): Triad of entities. Aim for a match between task, tool, and the user’s skills and capacities.
Figure 1.7 shows that ER is a fit between three entities:
- Some task, strategy, or tool (e.g., some algorithm, physical toolbox, or software package).
- The environment (e.g., the physical environment, but also commercial, institutional, media, or social environments).
- The agent’s capacities and skills (e.g., cognitive and perceptual processes, but also knowledge and learned skills).
Designing for fit
Upshot of ER: Not only one, but multiple levers for promoting change.
Fit can be increased by education, but also by changing our environment, or by designing better tools.