Chapter 6 Importing data
Most of the data we have examined so far was conveniently included in R packages. In the previous chapter on Tibbles (Chapter 5), we learned how data can be entered from scratch. In reality, however, we usually want to analyze data that we obtain from some source in some electronic format. Hence, importing data is the rule, rather than the exception.
Importing data is an early and usually mundane step in the process of data analysis. Under ideal circumstances, reading data would be so seamless that it would go unnoticed. The fact that we need a chapter on it reminds us that our world is not ideal: depending on their sources and types, messy datasets can be difficult to read. This is unfortunate, as it often prevents people from using R and drives them to less powerful software that may seem simpler and easier to use.[43]
Fortunately, R has tools to facilitate the import of all kinds of datasets. One such tool is the readr package (Wickham, Hester, et al., 2024) — a core component of the tidyverse (Wickham et al., 2019) that provides a range of functions to read data from a variety of files and formats.
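As a first impression of what such an import can look like, here is a minimal sketch. It assumes that readr (version 2.0 or later, for the `show_col_types` argument) is installed. The temporary file, its column names, and its values are made up purely for illustration; later sections may use different data and functions.

```r
# Minimal sketch: import a small CSV file with readr.
library(readr)

# Create a throwaway CSV file so that the example is self-contained.
# (The file contents are hypothetical and only serve as an illustration.)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,Ann,3.5",
             "2,Ben,4.0",
             "3,Cleo,NA"),
           con = csv_file)

# read_csv() guesses the type of each column and returns a tibble.
# show_col_types = FALSE suppresses the column specification message.
dat <- read_csv(csv_file, show_col_types = FALSE)

dat  # print the imported tibble

# Remove the temporary file again.
unlink(csv_file)
```

By default, `read_csv()` treats the string `NA` as a missing value, so the last entry of `score` is imported as `NA` rather than as text.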
While importing and exporting data may be rather mundane steps in any data analysis, they are often necessary for everything else that follows. Thus, it pays off to take a closer look at the process of importing data and learn some tricks to deal with obstinate datasets. Similarly, some knowledge about file paths and data formats is a precondition for saving your own files in forms that make them accessible to others. This chapter should reduce future frustrations by covering the most important cases.
[43] A striking example of bad software choices is Public Health England (PHE)’s recent decision to import CSV files on Covid-19 test results into Microsoft Excel’s XLS file format. Due to this file format’s limit of 65,536 rows of data, nearly 16,000 coronavirus cases went unreported in the U.K. (which amounts to almost 24% of the cases recorded in the time span from Sep. 25 to Oct. 2, 2020). Two noteworthy quotes from this BBC article (Kelion, 2020) are: “… one expert suggested that even a high-school computing student would know that better alternatives exist.” and “… insiders acknowledge that the current clunky system needs to be replaced by something more advanced that excludes Excel, as soon as possible.”