This section contains some clarifications (e.g., how data science differs from statistics), explicates some assumptions, and provides recommendations on how to succeed in this course.
What is data science?
It is a capital mistake to theorize before one has data.
Arthur Conan Doyle (1891): A Scandal in Bohemia
Defining data science
In our technology-driven environment, data is cheap and ubiquitous. Every call, trip, or visit to a web page leaves a trail of data. Apart from raising important issues of privacy, the notion of big data highlights another problem: In an era in which digital information is collected everywhere, we are drowning in data. While this data promises to be valuable and meaningful, we frequently fail to understand and use it. Unfortunately, we often lack the skills and tools for dealing with data, which are a prerequisite for making sense of it.
Data science is the art and craft of dealing with data. Making sense of data is a challenging task that requires a combination of knowledge, tools, and experience on many different levels. In academia, data science is increasingly relevant not only for natural and computer sciences, but also for the social sciences, humanities, and arts. While getting good at data science often starts as a scientific enterprise, it also is an increasingly important skill, art, and craft in many applied and business contexts. Just like other skills, arts and crafts, data science requires special tools, experience in choosing and using them, and an awful lot of dedicated practice.
Data science vs. statistics
Data science is often confused with statistics — especially by social science students, who tend to view any formalism as a new type of statistics specifically designed to torture them. There is considerable overlap between data science and statistics, but they are not identical:
Statistics is primarily a mathematical discipline that examines the properties of samples, probability distributions, and inferences from samples. Statistics typically involves formulas and numbers, but does not necessarily require getting your hands dirty with real-world data. In the context of psychology, statistics mostly quantifies differences or relationships between groups and tests effects of experimental manipulations or treatments.
Data science typically begins with messy data from real-world sources. Data literacy (as a basic ability to deal with data) is an essential prerequisite for applying statistics, but does not require statistics to yield meaningful results. Data science can be described as using statistics to solve real-world problems, but pursues a different main goal: understanding the data, rather than testing effects.
In short, statistics mostly summarizes data to test hypotheses, while data science transforms and visualizes data to promote the generation of hypotheses.[3] In science, both objectives are important and complementary, but basic data literacy and data science enable us to understand and deal with data before and beyond statistics. Dealing with data in a variety of ways enables new insights (e.g., by visualizing properties and relationships) and allows us to think more clearly about the causes and implications of data.
Learning data science
This book and course are the result of a learning process. Since 2016, I’ve been teaching a variety of R courses at the University of Konstanz. While generations of students have enjoyed or fought through the iterations of this course, my own views on teaching basic aspects of data literacy to an audience of social science students have also undergone a series of transformations. My initial enthusiasm was tempered by the realization that many students have no background in computer programming, few ideas about how they might benefit from more computational approaches, and limited willingness or time to invest effort in acquiring the corresponding skills.
Whereas many students find the clean structure of formalisms intuitively appealing, others are intimidated or repulsed by the prospect of reading or writing computer code. While I cannot claim to have discovered the magic bullet that satisfies all needs, I hope that even the most reluctant and skeptical students can catch a glimpse of the promises of data science for their future aspirations. In the following paragraphs, I explicate my current working assumptions and provide some tips on mastering this course.
Here are my current assumptions when teaching this course:
An offer and an opportunity: When taking a moment to look at our world in general, being in a position to spend days, weeks, or months on learning (or teaching) technical skills in a safe and welcoming environment is an incredible luxury and blessing. Hence, try to make the most out of this situation and accept this course not only as a challenge, but as an offer and an opportunity to learn something new.
Act responsibly: In the past, I have forced students to read materials and submit weekly solutions to exercises by threatening them with bad grades. As a consequence, students complained about the amount of work required of them, challenged the deadlines for submitting their solutions, and demanded that all course materials and exercises should be available long before a particular session was taught. Paradoxically, incrementally posting exercises and withholding their solutions until some deadline had expired required a lot of effort on my part, and still left me with a sense of failure when someone was unable to solve the exercises. To respond to these requests and make my efforts more manageable, this book now contains all materials, exercises, and solutions for an entire semester’s worth of study. This may seem generous or lenient on my part, but actually shifts the responsibility from instructor to student: While I am still there to explain and clarify important concepts and commands, you now need to monitor your progress by working through the exercises and check your understanding by comparing your solutions to those provided here. Hence, this course now assumes that you are an adult who is motivated to learn its material and willing and able to invest the effort and time required to do so.
Any usefulness depends on you: Many students who take the plunge into data science with R find it quite enjoyable and rewarding after a while. But even if you do not enjoy the experience, it may still be an investment that is likely to pay off later. However, acquiring new methods and skills is not inherently rewarding or useful. Instead, their value depends on what we use them for. So even when studying and solving small practice problems, try to keep an eye open for the larger goals and purposes that you can pursue with these skills in the future.
Mastering this course
The main ingredients for succeeding in this (and any other) course of study are sustained focus and keen attention to detail. The Stoic philosopher Epictetus aptly summarized this attitude in the following quote:
Practise yourself (…) in little things,
and thence proceed to greater.
Epictetus, Discourses, Book I, 18
When spelled out more explicitly, the recipe for succeeding in this course is very similar to the recipe for succeeding in life in general:
Learning involves effort: Beginning to study data science is similar to learning to play an instrument or mastering some sport: First, you need some infrastructure — equipment (like hard- and software), plus training materials — to get started. Once the basics are in place, you can benefit from the advice of experts and peers, but mostly need a lot of practice. Just like in many other fields, being enthusiastic and having talent usually helps, but dedicated practice is essential even when you happen to be a genius. Getting good at data science requires both curiosity and routine:
Curiosity implies interest, motivation, and fun: If you really want to find out or achieve something (e.g., understand some data or conduct some analysis), you are willing to find out how this can be achieved and will overcome the obstacles that may appear along the way. Perceiving tasks as a challenge rather than a chore will allow you to enjoy the efforts invested, rather than suffering from them.
Routine implies discipline, stamina, and lots of practice. It is impossible to acquire new skills without investing time and effort. Crucially, habitual practice (e.g., daily use of data science tools) helps to develop various organizational skills (e.g., using keyboard shortcuts, naming objects, formatting code, and structuring files or projects) that are non-trivial and will profoundly affect your productivity (far beyond this course).
Use social resources: Theoretically, you could work through the entire book by yourself, but such a solitary endeavour requires a lot of determination and stamina. Fortunately, this course and its social context (instructor and classmates) are there to help you stay focused, to provide orientation and motivation, to let you ask and answer questions, and to allow you to continuously check your progress and understanding.
Ultimately, learning and practicing any art, craft, or science is also a process of socialization: striving towards common goals and belonging to a community that shares similar methods, principles, and values. And although the R community can sometimes react harshly when some newbie asks a question in the wrong place, there is hardly a more enthusiastic and welcoming bunch of people to push towards new horizons.
Beware of side effects: Becoming an R user will profoundly transform your thinking — not only about data and code, but also about the types of problems you are trying to solve and understand. While programming (in R or any other language) can be useful and enjoyable, it also has addictive potential, and can not only open doors, but also lead into dead ends. So make sure to take regular breaks, and stay focused on good questions behind and beyond the data and the tools.
A tidyverse caveat
There is a lot to admire about the set of R packages comprising the tidyverse. Personally, I like that they share a vision that strives for simplicity and transparency, and that they provide a bold and ambitious approach towards designing a consistent set of tools. That said, being bold and ambitious usually comes with costs. As most tidyverse packages are still under active development, it can be difficult to keep up with their current versions. In addition, many tidyverse developers are deliberately opinionated (see the Tidy tools manifesto) and not afraid of making radical changes (as various iterations of reshape, reshape2, and tidyr testify). This situation essentially creates two potential problems:
Moving target: Just like an expectation of deflating prices can motivate economic actors to postpone investments, expecting that some tools are likely to change in the near future could make us uncertain about studying them at this point.
Fragmentation: Adding a set of alternative tools can be well-intentioned and phrased positively — as increasing diversity, offering more choices, increasing our freedom, etc. Nevertheless, adding options also implies an increase in complexity and reduction of unity.
In my view, there are good reasons for the current hype around the tidyverse and it has now matured enough to be studied, taught, and used. Although the prospect of constant changes may curb our enthusiasm for any particular technology, we should not allow this to slow us down. Even when methods or functions may still fluctuate, the underlying paradigms and principles remain relevant or change at a much slower pace. Importantly, no particular package, method, or tool should ever be enshrined purely on ideological grounds. Instead, we should always focus on the goals and tasks beyond the tools and handle any new technology with care and a healthy dose of skepticism. Just like hypotheses in science should be abandoned in favor of more successful ones, we should not be afraid of replacing tools when better ones become available — and trust that our experience gained along the way will still be valuable in the future.
Nevertheless, tidyverse novices should be aware that not everyone is convinced by its approach. As the corresponding packages often deviate from the traditional core (and lore) of R, they are bound to confuse or complicate matters from a purist’s point of view. As a consequence, some experts are skeptical, caution against an indiscriminate and universal adoption of the tidyverse, and lament R’s fragmentation into different dialects. An important point is that the increasing popularity of the tidyverse is partly due to its close connections to RStudio, which provides a promotional platform that other packages (like data.table or vtreat) lack. So even though the arguments on the pros and cons of the tidyverse typically revolve around the features of tools and their performance, the debate’s implications ultimately touch upon serious issues of influence and power.[4]
See the following links for some arguments and elaborations of this ongoing discussion:[5]
- The tidyverse curse (by Bob Muenchen, posted on 2017-03-23)
- Tidyverse and data.table, sitting side by side (by Dirk Eddelbuettel, posted on 2018-01-21)
- What is “tidy data”? (by John Mount, posted on 2019-05-11)
- TidyverseSkeptic (by Norm Matloff, revised on 2019-09-30)
As a beginner, you probably could not care less about ideological debates, but being aware of their existence helps you to stay critical. So unless you have personal stakes in this discussion, I suggest adopting a pragmatic approach: Let’s observe the ongoing developments, use the tools that we like and that work well for us, and leave it up to the historians to decide whether the tidyverse dialect will ever become the vernacular of R. Rather than blindly jumping on the tidyverse bandwagon, we can credit its benefits, but also point out when something seems like a limitation or peculiarity. Overall, we should occasionally remind ourselves that R was a powerful language long before the tidyverse was conceived and should not be surprised when base R commands often provide good alternatives to tidyverse functions.
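As a minimal sketch of this last point, the following code solves the same small task twice: once with base R and once with dplyr. The data frame `df`, its variables `group` and `score`, and all values are hypothetical and serve only as an illustration. The dplyr part is skipped when that package is not installed.

```r
# Toy data (hypothetical values, for illustration only):
df <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  score = c(2, 4, 1, 3, 5)
)

# Base R: keep rows with score > 1, then compute the mean score per group.
base_sub <- df[df$score > 1, ]
base_res <- aggregate(score ~ group, data = base_sub, FUN = mean)
print(base_res)

# tidyverse (dplyr): the same task, run only if dplyr is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  tidy_sub <- dplyr::filter(df, score > 1)
  tidy_grp <- dplyr::group_by(tidy_sub, group)
  tidy_res <- dplyr::summarise(tidy_grp, score = mean(score))
  print(tidy_res)

  # Both approaches should agree on the group means:
  stopifnot(isTRUE(all.equal(base_res$score, tidy_res$score)))
}
```

Neither version is the "right" one: the base R solution needs no additional packages, whereas the dplyr solution names each step explicitly. Which of them reads better is largely a matter of taste and habit.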
Importantly, your primary goal at this point should be to stay enthusiastic about the tasks you can solve with R, and steadily expand the scope of your skills to tackle new tasks. Hence, focus on skills, tasks, and principles, rather than on particular functions or tools, and try to welcome situations in which multiple alternatives are available. Although choices can occasionally be confusing, we are rarely overwhelmed by them. And realizing that there are multiple ways to perform some task does not have to be paralyzing, but can be stimulating and liberating. While there are always elegant and clumsy ways to perform some task, the most important thing is getting things done. So stay calm and relaxed, and adopt an open mindset: In the beginning, whatever works is fine; we can work on polishing the details later.
[3] See Section 4.1 of Chapter 4 for the difference between exploratory and confirmatory data analysis, and Breiman (2001) and Donoho (2017) for a more nuanced discussion of the relations between data and statistics.
[4] As a commercial enterprise, RStudio, Inc. can be expected to pursue a slightly different agenda than the R Foundation, which is a non-profit organization. Just like other technological enterprises (think Google or Facebook), RStudio is creating amazing products and services. Although it is appropriate to be excited about and grateful for their innovations, it would be naive to think that there is no price to pay when becoming dependent on a company’s products.
[5] Note the recency of these posts’ dates. We are living in exciting times, as far as developments in data science are concerned.