Chapter 1 Introduction
It goes without saying that data science is an expansive subject area. A key tool supporting data science work is the statistical programming language R, which has grown signficantly over the past few years and has, in the words of Roger Peng, “become the de facto programming language for data science. Its flexibility, power, sophistication, and expressiveness have made it an invaluable tool for data scientists around the world.” (Peng 2018)
Proprietary tools (including statistical programs like SAS, SPSS, and Stata) have had a one-version-at-a-time approach to development and release. With each release, many new features are added, extending the functionality. These releases are scheduled big events; as one example, SAS has been releasing version updates roughly once a year since 2010.
The open source nature of R and its package-centric ecosystem has encouraged the expansion of functionality, and like the universe as a whole, this expansion is in all directions simultaneously and non-stop. There are no annual updates with the latest bells and whistles–it sometimes seems that every day brings news that additional features are being applied to existing R packages, and new packages are being released.
This “bazaar”(Raymond 1999) of development makes it impossible to be an expert on all things R, and that paper-based text books may be out-of-date before they get to the presses. More importantly, it also means that somebody somewhere may have already found a creative solution to the problem you’ve just encountered.
This is where the R community really shines. The spirit of the bazaar extends to open data, open science, open research, and open development. Your friends in the R community around the world are making their tools and techniques available, and are using social media channels to tell one another about it, which spurs on collaboration and futher innovations, in a never-ending virtuous cycle.
This compendium is simply my attempt to gather together pieces of R methods, based on the data science problems I’ve confronted, and what I’ve found at that particular time.
The paragraphs above are a long-winded way of saying that I acknowledge that what’s here is woefully incomplete and likely out of date! If you’re aware of something newer, more streamlined or efficient, accurate, or just better, please don’t hesitate to let me know.
1.1 Structure
The book’s layout is currently in seven broad categories:
1.1.1 Data Science
- Combining chapters about the scope and definitions of data science This thing called “data science”, along with overviews of good research practice, from opinionated analysis design to script style Statistical & Data Science Practice, and some high-level materials on Using R.
1.1.2 Data
- What is data science with data? This section covers topics from Bias to Anonymity and Confidentiality.
1.1.3 Data Visualization
- The Data Visualization section points to some key resources on data visualization theory and practice, and data visualization in R, with a focus on the package ggplot2 and its related extension packages.
1.1.4 Quantitative Methods
- This section covers a variety of methods of analyzing and modeling data, from Bayesian Methods (which, along with some other approaches, gets its own chapter) to Time series analysis.
1.1.5 Spatial Data
- Chapters on Spatial Data (including mapping) and Small Area Estimation.
1.1.6 Other Methods
- Text Analysis and Text Mining covers ways of tackling text mining, including natural language processing (NLP) and sentiment analysis.
1.1.7 Communication
- The goal of data science is to provide insights that didn’t exist prior to your work. Sharing those insights, including what you did and how you did it, require effective communication. The Communicating your results chapter starts the section that also includes information about effective writing and presentation methods.
1.2 Other notes
Style and format: see “Software information and conventions” in Xie, Thomas & Presmanes Hill (Yihui Xie and Hill 2019)