Notes on Statistics with R (SwR)

Author

Peter Baumgartner

Published

2024-05-01 15:43

Preface

This is work in progress
  • I have finished chapters 1-9 (about 80% of the book). These chapters also contain experiments and additional exercises I created myself.
  • Currently I am working on chapter 10.

WATCH OUT: This is my personal learning material and is therefore neither an accurate replication nor an authoritative textbook.

I am writing this book as a text for others to read because that forces me to become explicit and explain all my learning outcomes more carefully. Please keep in mind that this text is not written by an expert but by a learner.

I have skipped text passages whose content I am already familiar with. Sections of the original text where I needed more in-depth knowledge I have elaborated on, adding my own comments that resulted from my personal research.

Be warned! Although it replicates most of the content, this Quarto book may contain many mistakes. All misapprehensions and errors are of course my own responsibility.

Content and Goals of this Book

This Quarto book collects my personal notes, trials and exercises of Statistics With R: Solving Problems Using Real-World Data by Jenine K. Harris (Harris 2020).

This introductory textbook for Statistics with R has three outstanding features:

Data Wrangling

The book applies real data sets with all their problems: missing data, inconsistent structure, inappropriate data types, incomprehensible labels, extensive accompanying code books, etc. Data management, a subject often not taught, is therefore a major part of the book. Many introductory textbooks work with already cleaned data and fail to guide students in bringing messy data into an analyzable form.

Inclusion

The author aims to support women and other underrepresented groups in pursuing a data science career. By choosing a narrative style with three prototypical female characters, the book not only discusses the approaches in detail but also shows the effects of suboptimal coding solutions. The solutions are developed step by step, and each improvement replicates the code already written. These repetitions not only help to compare the differences but also show that code has to be developed bit by bit, tested, improved, and tested again. Other interesting practices shown in the book are trying different approaches to the same problem and reusing already written code. All these practices help to lower barriers and to facilitate learning statistics with R.

Compelling social science topics

The three characters (Leslie, a statistics student who wants to learn R; Nancy, an experienced data scientist and coding specialist; Kiara, a data management guru worried especially about reproducibility) discuss the analysis of real-world problems from different angles. Every chapter starts with a short introduction to the background of the social problem the text is going to analyze. By using publicly available data sources that have to be modified and cleaned, one learns very important transferable skills. After working through the book it should be easy to work on one's own research questions using public data sets.

Text passages

Quotes and personal comments

My text consists mostly of quotes from the first edition of Harris’ book. I converted my Kindle book into a PDF file, from which I copied passages via the annotation system in Zotero into my Quarto files.

Example 1 : Quote

“NA is a reserved “word” in R. In order to use NA, both letters must be uppercase (Na or na does not work), and there can be no quotation marks (R will treat “NA” as a character rather than a true missing value)” (Harris, 2020, p. 121) (pdf)
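The behaviour described in the quote can be checked with a few lines of base R. This is only a minimal sketch: the vector `ages` and its values are made up for illustration and do not come from the book.

```r
# Illustrative values only (not taken from the book)
ages <- base::c(23, NA, 41)

base::is.na(ages)               # flags the true missing value in position 2
base::is.na("NA")               # FALSE: the quoted "NA" is an ordinary string
base::class("NA")               # "character"
base::mean(ages, na.rm = TRUE)  # 32: the NA is dropped before averaging

# Lowercase or mixed-case spellings are not reserved words. Unless an object
# with that name happens to exist, using one raises an "object not found" error.
na_spelling_fails <- base::inherits(base::try(Na, silent = TRUE), "try-error")
na_spelling_fails
```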

Example 1 has links to my PDF and also to my annotation of the PDF. These links are a practical way for me to get the context of the quote. But as the linked PDF is saved locally on my hard disk, these links do not work for you! (Zotero groups offer an option to share files, but the PDF is not free to use and so I can’t offer this possibility.)

Often I made minor edits (e.g., shortening the text) or put the content in my own words. In these cases I couldn’t quote the text, as it does not correspond to a specific annotation in my Zotero file, so I ended the paraphrase with (Harris ibid.).

In any case, most of the text in this Quarto book is not mine but comes from different resources (Harris’ book, R help files, websites). Most of the time I have put my own personal notes into a notes box as shown in Example 2.

Example 2 : Personal note

Assessment 1 : This is a personal note

In this kind of box I will write my personal thoughts and reflections. Usually this box will appear stand-alone (without the wrapping example box).

Glossary

I am using the {glossary} package to create links to glossary entries.

R Code 1 : Load glossary

## 1. Install the glossary package:
## https://debruine.github.io/glossary/

library(glossary)

## If you want to use my glossary.yml file:

## 1. fork my repo
##    https://github.com/petzi53/glossary-pb

## 2. Download the `glossary.yml` file from
##    https://github.com/petzi53/glossary-pb/blob/master/glossary.yml

## 3. Store the file on your hard disk
##    and change the following path accordingly

glossary::glossary_path("../glossary-pb/glossary.yml")
Listing / Output 1: Install and load the glossary package with the appropriate glossary.yml file

If you hover with your mouse over the double-underlined links, a window opens with the appropriate glossary text. Try this example: Z-Score.

WATCH OUT! Glossary text not authorized by the author of SWR

I added many of the glossary entries while working through other books, either taking the text passage from the book I was reading or researching other resources on the internet. I have added the source of each glossary entry. Sometimes I have used abbreviations, but I still need to provide a key explaining what these short references mean.

Jenine Harris has collected her own glossary. Wherever it suits my learning path I have added her entries to my dictionary. To apply the glossary to this text I have used the {glossary} package by Lisa DeBruine.

If you fork the repository of this Quarto book, the glossary will not work out of the box. Download the glossary.yml file from my glossary-pb GitHub repo, store it on your hard disk and change the path in the code chunk Listing / Output 1.

In any case, I alone am responsible for this text, especially if I have used code from the resources wrongly or misunderstood a quoted text passage.
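Before pointing the package at a local file it can help to check that the file is really where the path says. The following is only a sketch: the relative path is the one from Listing / Output 1 and has to be adapted to your own setup, and the variable names and the message text are my own assumptions.

```r
# Path from Listing / Output 1; adapt it to where you stored the file
glossary_file <- "../glossary-pb/glossary.yml"

# Only set the path if the file exists and the {glossary} package is installed
glossary_ready <- base::file.exists(glossary_file) &&
  base::requireNamespace("glossary", quietly = TRUE)

if (glossary_ready) {
  glossary::glossary_path(glossary_file)
} else {
  base::message("glossary.yml or the {glossary} package is missing; glossary links will not work.")
}
```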

R Code and Datasets

Harris provides R code and datasets via her GitHub site, but you can also download these files directly from the Student Resources of the publisher’s SAGE website.

In the book Harris introduces and uses Google’s R Style Guide with camelCase. The reference points to a fork of the Tidyverse Style Guide. I am going to use snake case, with underscores (_) replacing spaces, as studies have shown that it is easier to read (Sharif and Maletic 2010). But I will use the other Google modifications of the tidyverse style guide:

  • Start the names of private functions with a dot.
  • Don’t use base::attach().
  • No right-hand assignments.
  • Use explicit returns.
  • Qualify namespace.

Especially the last point (qualifying the namespace) is important for my learning. Besides preventing conflicts between functions with identical names from different packages, it helps me to learn (or remember) which function belongs to which package. I think this justifies the small overhead and helps to make R code chunks self-sufficient (no previous package loading or library calls in the setup chunk). To foster learning the relation between function and package, I enclose the package name in curly braces and format it in bold.

I also use the package name for the default installation of base R. This wouldn’t be necessary, but it helps me to understand where the base R functions come from. What follows is a list of the base R packages of the system library that are included in every installation and attached (opened) by default:
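The style rules listed above can be seen together in a small sketch. The helper `.recode_missing()`, the placeholder code 99 and the data values are hypothetical and only serve as an illustration; they are not taken from Harris’ book.

```r
# Hypothetical private helper: the leading dot marks it as private.
# It recodes a numeric placeholder for missing values (assumed here: 99) to NA.
.recode_missing <- function(x, missing_code = 99) {
  # %in% never returns NA, so already missing values are left untouched
  x[x %in% missing_code] <- NA
  return(x)  # explicit return
}

# snake_case object names, left-hand assignment only, no base::attach()
survey_ages <- base::c(34, 99, 51, NA)
clean_ages <- .recode_missing(survey_ages)

# Qualified namespace: stats::median() shows at a glance where the function lives
median_age <- stats::median(clean_ages, na.rm = TRUE)
median_age
```

Because every call carries its package prefix, the chunk runs without any preceding library() call.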

  • {base}: The R Base Package
  • {datasets}: The R Datasets Package
  • {graphics}: The R Graphics Package
  • {grDevices}: The R Graphics Devices and Support for Colours and Fonts
  • {methods}: Formal Methods and Classes
  • {stats}: The R Stats Package
  • {utils}: The R Utils Package
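One way to inspect this in a running session is sketched below. The object names are my own choice, and the output depends on the local R configuration, so the result may differ from the list above if the defaults have been changed.

```r
# Packages that R attaches at startup in addition to {base}
default_packages <- base::getOption("defaultPackages")
default_packages

# Everything currently on the search path, including "package:base"
attached <- base::search()
attached

# Ask a function which namespace it comes from
stats_home <- base::environmentName(base::environment(stats::sd))
stats_home  # "stats"
```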

I am not always using the exact code snippets for my replications, because I replicate the code not only to see how it works but also to change the values of parameters and observe their influence.

“Statistics with R” explicitly names all function arguments. This is also the case for functions with just one argument, for instance base::summary(object = <r object to summarize>). When the meaning is clear, I will follow the advice from Hadley Wickham:

When you call a function, you typically omit the names of data arguments, because they are used so commonly. If you override the default value of an argument, use the full name (tidyverse style guide).
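The difference between the two conventions is small in practice. In this sketch the vector `body_weights` and its values are invented for illustration.

```r
# Illustrative values only
body_weights <- base::c(61.5, 72.0, NA, 80.5)

# Style of the book: every argument is named, even the single data argument
summary_named <- base::summary(object = body_weights)

# Tidyverse advice: omit the name of the data argument ...
summary_short <- base::summary(body_weights)
base::identical(summary_named, summary_short)  # TRUE: both calls are equivalent

# ... but spell out the full name when overriding a default (na.rm is FALSE by default)
mean_weight <- base::mean(body_weights, na.rm = TRUE)
mean_weight
```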

For educational reasons Harris develops code step by step and replicates the complete code, including the previous, already explained snippets. In these cases I use tabs as an organizing structure so that one can see (and compare) the piecemeal development.

Resources

Resource 1 : Resources used for this Quarto book

Packages introduced in the preface

Resource 2 glossary: Glossaries for Markdown and Quarto Documents


Package Profile 1: {glossary}: A Package to Create Glossaries for Markdown and Quarto Documents

{glossary}: Glossaries for Markdown and Quarto Documents

Add glossaries to markdown and quarto documents by tagging individual words. Definitions can be provided inline or in a separate file.

There is a lot of necessary jargon to learn for coding. The goal of glossary is to provide a lightweight solution for making glossaries in educational materials written in quarto or R Markdown. This package provides functions to link terms in text to their definitions in an external glossary file, as well as create a glossary table of all linked terms at the end of a section.


Glossary

term definition
Z-score A z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is. (StatisticsHowTo: https://www.statisticshowto.com/probability-and-statistics/z-score/#Whatisazscore)
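The definition amounts to z = (x - mean) / standard deviation. The sketch below uses invented scores and, as an assumption, estimates the population values with the sample mean and the sample standard deviation (stats::sd() divides by n - 1).

```r
# Illustrative raw scores (not from the book)
raw_scores <- base::c(2, 4, 4, 4, 5, 5, 7, 9)

# z-scores computed by hand from the definition
z_manual <- (raw_scores - base::mean(raw_scores)) / stats::sd(raw_scores)

# base::scale() centers and scales the same way but returns a one-column matrix
z_scaled <- base::as.numeric(base::scale(raw_scores))

base::all.equal(z_manual, z_scaled)  # TRUE
z_manual
```

By construction the resulting z-scores have a mean of 0 and a standard deviation of 1, apart from floating-point rounding.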

Session Info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       macOS Sonoma 14.4.1
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Vienna
#>  date     2024-04-22
#>  pandoc   3.1.13 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  cli           3.6.2      2023-12-11 [1] CRAN (R 4.3.0)
#>  colorspace    2.1-1      2024-01-03 [1] R-Forge (R 4.3.2)
#>  commonmark    1.9.1      2024-01-30 [1] CRAN (R 4.3.2)
#>  curl          5.2.1      2024-03-01 [1] CRAN (R 4.3.2)
#>  digest        0.6.35     2024-03-11 [1] CRAN (R 4.3.2)
#>  evaluate      0.23       2023-11-01 [1] CRAN (R 4.3.0)
#>  fastmap       1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
#>  glossary    * 1.0.0.9000 2023-08-12 [1] Github (debruine/glossary@819e329)
#>  glue          1.7.0      2024-01-09 [1] CRAN (R 4.3.0)
#>  highr         0.10       2022-12-22 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.8.1    2024-04-04 [1] CRAN (R 4.3.2)
#>  htmlwidgets   1.6.4      2023-12-06 [1] CRAN (R 4.3.0)
#>  jsonlite      1.8.8      2023-12-04 [1] CRAN (R 4.3.0)
#>  kableExtra    1.4.0      2024-01-24 [1] CRAN (R 4.3.2)
#>  knitr         1.46       2024-04-06 [1] CRAN (R 4.3.3)
#>  lifecycle     1.0.4      2023-11-07 [1] CRAN (R 4.3.0)
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
#>  markdown      1.12       2023-12-06 [1] CRAN (R 4.3.0)
#>  munsell       0.5.1      2024-04-01 [1] CRAN (R 4.3.2)
#>  R6            2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
#>  rlang         1.1.3      2024-01-10 [1] CRAN (R 4.3.0)
#>  rmarkdown     2.26       2024-03-05 [1] CRAN (R 4.3.2)
#>  rstudioapi    0.16.0     2024-03-24 [1] CRAN (R 4.3.2)
#>  rversions     2.1.2      2022-08-31 [1] CRAN (R 4.3.0)
#>  scales        1.3.0      2023-11-28 [1] CRAN (R 4.3.2)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
#>  stringi       1.8.3      2023-12-11 [1] CRAN (R 4.3.0)
#>  stringr       1.5.1      2023-11-14 [1] CRAN (R 4.3.0)
#>  svglite       2.1.3      2023-12-08 [1] CRAN (R 4.3.0)
#>  systemfonts   1.0.6      2024-03-07 [1] CRAN (R 4.3.2)
#>  vctrs         0.6.5      2023-12-01 [1] CRAN (R 4.3.2)
#>  viridisLite   0.4.2      2023-05-02 [1] CRAN (R 4.3.0)
#>  xfun          0.43       2024-03-25 [1] CRAN (R 4.3.2)
#>  xml2          1.3.6      2023-12-04 [1] CRAN (R 4.3.0)
#>  yaml          2.3.8      2023-12-11 [1] CRAN (R 4.3.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-x86_64/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────