Preparations

To teach and learn data science, we need some infrastructure (tools) and materials (information), and must agree on some rules (structure).

Software

Working through this book assumes an installation of three types of software programs:

  1. An R engine: The R project for statistical computing (R Core Team, 2021a) is the origin of all things R. A current distribution of R (e.g., R version 3.4.2) for your machine can be downloaded from one of its mirrors.

  2. An R interface: RStudio provides an integrated development environment (IDE) for R.[2]

  3. Additional tools: The R packages of the tidyverse (Wickham, 2019b) and ds4psy (Neth, 2021b).

To understand the differences between these components, two analogies are helpful:

  • R vs. RStudio: Think of a car, in which R is the engine and RStudio is the driver’s console (a means of input and output that also displays system information).

  • R vs. R packages: Think of a toolbox or a Swiss Army knife, in which R provides the basic tools and packages add further tools (see Section 1.1.3 on Terminology).
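
As a minimal sketch of how these components fit together, the following commands could be entered in the R Console to check the R engine and to install any missing packages. The names pkgs, is_installed, and to_install are illustrative choices, and the interactive() guard is an added assumption so that nothing is downloaded when the code runs unattended.

```r
# The R engine: show which version of R is currently running
R.version.string

# The additional tools named in the text (the name pkgs is an illustrative choice)
pkgs <- c("tidyverse", "ds4psy")

# Check which of these packages are already installed
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!is_installed]

# Install missing packages (needs an internet connection);
# the interactive() guard is an assumption to avoid unattended downloads
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}

# Load a package to make its tools available in the current session
if (requireNamespace("ds4psy", quietly = TRUE)) {
  library(ds4psy)
}
```

Installing a package is needed only once per machine, whereas library() must be called in every new R session that uses the package.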

Welcome to the R world

Once you have installed all this software, take a moment to reflect on a curious fact: You just installed software that was written by hundreds of highly trained experts who dedicated years of their professional lives to its creation and improvement. Interestingly, you could simply download their products without paying anything. This is possible because most R developers subscribe to an open source philosophy that was ridiculed by corporations when it started in the 1980s and 1990s, but has since become one of the most powerful paradigms in software development.

But actually, you did invest time and effort to install all these programs and packages. And by doing so, you are taking first steps to join a world-wide community that shares certain interests, assumptions, and ideals. Welcome to the R community — but be aware: Learning R can profoundly transform your life.

Working with RStudio

The distinctions between R, RStudio, and R packages are somewhat confusing at first. Thus, it is good to know that we will typically be using RStudio to interact with R and manage our R packages. The basic idea of an integrated development environment (IDE) is to make it easier to access and manage all R-related concerns through a single interface.

Given its large variety of functions, the RStudio interface is divided into many sub-windows that can be arranged and expanded in various ways. At this point, we only need to distinguish between the main Editor window (typically located on the top left), the Console (for entering R commands), and a few auxiliary windows that may display outputs (e.g., the Plots and Viewer windows for showing visualizations) and provide information on our current Environment or the Packages available on our computer. A particularly useful window is Help: Although its main page mostly provides links to online materials, every R package contains detailed documentation of its functions and examples that illustrate their use.
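
The documentation shown in the Help window can also be requested from the Console. The sketch below uses mean() from base R and the stats package as arbitrary examples. Any other function or installed package could be used in the same way.

```r
# Open the documentation page of a function (two equivalent forms)
?mean
help("mean")

# Run the examples from a function's documentation
example("mean")

# List the documented functions and data sets of a package
help(package = "stats")

# Search the documentation of all installed packages for a keyword
help.search("standard deviation")
```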

Figure 0.1 shows the RStudio cheat sheet on the RStudio IDE and illustrates that there are dozens of other functions available. As you get more experienced, you will discover lots of nifty features and shortcuts. Especially foldable sections and keyboard shortcuts (see Alt + Shift + K for an overview) can make your life in R a lot easier. But don’t let the abundance of options overwhelm you — I have yet to meet a person who needs or uses all of them.

Figure 0.1: RStudio cheat sheet (from RStudio Cheat Sheets).

A useful feature of RStudio is that collections of files can be combined into projects. For instance, it makes sense to store everything related to this course in a dedicated directory on your hard drive (e.g., in a folder “ds4psy”) and create an RStudio project (also named ds4psy) that uses this directory as its root. An immediate benefit of using projects is that your entire workflow gets more organized.[3]
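
One practical consequence, sketched here, is that files can be addressed relative to the project root rather than by absolute paths. The folder name data and the file name survey.csv are hypothetical and only serve as an illustration.

```r
# When an RStudio project is opened, the working directory is
# set to the project's root directory by default
getwd()

# Build a path relative to the project root
# (the folder "data" and the file "survey.csv" are hypothetical names)
data_file <- file.path("data", "survey.csv")
data_file

# Check whether the file exists before trying to read it
if (file.exists(data_file)) {
  survey <- read.csv(data_file)
}
```

Because the path does not mention any user-specific directory, the same code keeps working when the project folder is moved or shared.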

Reading

This i2ds book is currently being written and is likely to change frequently. Although it provides a useful and growing collection of new materials, it will remain fragmentary for a few months.

Where appropriate, we will use other books for specific topics and materials. Our two main sources are freely available online: The textbook R for Data Science (Wickham & Grolemund, 2017) provides a classic, but more tidyverse-centric introduction to data science:

  • Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Sebastopol, CA: O’Reilly Media, Inc. [Available at http://r4ds.had.co.nz.]

By contrast, our own textbook Data Science for Psychologists (Neth, 2021a) targets the special needs of social science students:

  • Neth, H. (2021). ds4psy: Data Science for Psychologists.
    Social Psychology and Decision Sciences, University of Konstanz, Germany.
    Textbook and R package (version 0.7.0, May 12, 2021).
    Retrieved from https://bookdown.org/hneth/ds4psy/.

The URL of the supporting R package ds4psy (Neth, 2021b) is https://CRAN.R-project.org/package=ds4psy.

Relying on the ds4psy textbook makes sense for many topics: As the overlap of required concepts and tools is large despite different audiences, it avoids repetition and redundancies. However, where this (i2ds) course deviates from the other (ds4psy) one, this collection contains additional and new material.

The general distinction of this (i2ds) course is that it creates more room for reflection and critical thinking.

Whereas the ds4psy book (or course) essentially covers a new topic and R package per chapter (or week), this book often devotes multiple chapters (or weeks) to one topic.

This course provides the background to many other books and courses by putting things in perspective.

On the surface, this implies:

  • less content: aiming for two main parts (in two semesters: foundations vs. applications)
  • slower pace: allowing more time for context and reflections

However, by first reflecting on the concepts and implications, this course provides a more solid foundation and aims to promote a deeper understanding.

Specific details:

  • reflect on the nature of representations, visualizations, etc.
  • start with reproducible research and R Markdown
  • two sessions for basics (base R)
  • two sessions for visualization (in base R and ggplot2)
  • use data from a survey

A more technical introduction is provided by r4ds (Wickham & Grolemund, 2017):

  • Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Sebastopol, CA: O’Reilly Media, Inc. [Available at http://r4ds.had.co.nz.]

Where it makes sense, we will point at chapters of related textbooks (e.g., Baumer, Kaplan, & Horton, 2021):

  • Baumer, B. S., Kaplan, D. T., & Horton, N. J. (2021). Modern Data Science with R (2nd ed.). CRC Press, Taylor & Francis Group, Boca Raton/London/New York.
    [Available at https://beanumber.github.io/mdsr2e/.]

Other textbook candidates include:

  • Ismay, C., & Kim, A. Y. (2020). Statistical Inference via Data Science: A ModernDive into R and the Tidyverse.
    [Available at https://moderndive.com/.]

  • Zumel, N., Mount, J., & Porzak, J. (2014). Practical data science with R. Shelter Island, NY: Manning.

Writing

One of the most important skills conveyed in this course — and one that is relevant far beyond the topic of data science — is called literate programming (Knuth, 1984).

Basic idea: Design a system that enables distinctions (e.g., between text and code) in order to merge different objects (e.g., computations, text, figures, and tables). Most importantly, separate form from content — and enable dedicated focus on content when creating new material.

This paradigm requires some new skills and tools that may seem awkward and unfamiliar at first, but are easily acquired and likely to change your life by opening doors to a world of opportunities.[4]

For our purposes, literate programming enables and develops practices of reproducible research. From the very first session, you are able to complete exercises and interpret results in the same document. The separation of content and form makes it very easy to generate and submit well-formatted reports for a variety of purposes (not just exercise solutions, but research reports, blogs, and theses).

In this course, the main tool that allows weaving text and code together in a single document is R Markdown (which is developed by the makers of RStudio and supported by the RStudio interface mentioned above).
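
To give a first impression, the following sketch writes a minimal R Markdown document (a header, one sentence of text, and one code chunk) to a temporary file and renders it to HTML. The document title, the file name first_report.Rmd, and the variable names are illustrative assumptions. The code fence of the chunk is assembled with strrep() only to keep this example self-contained. Rendering is skipped when the rmarkdown package or pandoc is not available.

```r
# Three backticks mark the beginning and end of a code chunk
fence <- strrep("`", 3)

# Lines of a minimal R Markdown document (title and content are illustrative)
rmd_lines <- c(
  "---",
  "title: \"My first report\"",
  "output: html_document",
  "---",
  "",
  "The mean of the numbers from 1 to 10 is computed below:",
  "",
  paste0(fence, "{r}"),
  "mean(1:10)",
  fence
)

# Write the document to a temporary file (the file name is an assumption)
rmd_file <- file.path(tempdir(), "first_report.Rmd")
writeLines(rmd_lines, rmd_file)

# Render the document to HTML, if the required tools are available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

In everyday use, such a file would be created in the RStudio Editor and rendered with the Knit button, which produces the same kind of output.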

For instructions on combining text and code, see


  2. Installing RStudio typically provides many additional R packages. Two packages we will use extensively are knitr (Xie, 2021) and rmarkdown (Allaire et al., 2020).

  3. See the introductory chapters of R for Data Science (Wickham & Grolemund, 2017) for short, but helpful instructions on organizing your workflow with RStudio, especially the even-numbered chapters basics (Chapter 4), scripts (Chapter 6), and projects (Chapter 8).

  4. For instance, this and countless other books and websites are enabled by this paradigm.