Chapter 2 Introduction to R

Nhan Thi Ho, MD, PhD

2.1 What is R?

R is a free software environment for statistical computing and graphics. To download, install, and learn more, go to R cran (R Core Team 2024b).

2.2 RStudio

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. To download, install, and learn more, go to RStudio page.

2.3 Rmarkdown

R Markdown is a file format for making dynamic documents with R.
It turns your analyses into high quality REPRODUCIBLE documents, reports, presentations and dashboards (Allaire et al. 2024).
Learn more at Rmarkdown page.

2.4 Base R Example

R comes with a few base packages for base functions.

Example: calculate, generate, summary data:

# Use R as a calculator: 
((2+2)*3)/2
## [1] 6
sqrt(9)
## [1] 3
mean(c(2,4,5,9))
## [1] 5
#Check how to use a function
#?rnorm
#use the function 
data1<- rnorm(n=30,mean=1.5, sd=0.5)
summary(data1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6084  1.2695  1.5829  1.5306  1.8650  2.4256
data2<-rnorm(n=35, mean=1.8,sd=1)
summary(data2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.369   1.345   2.021   1.978   2.426   4.643

Basic statistical test: t-test example:

#?t.test
t.test(x=data1, y=data2, alternative="two.sided",var.equal=FALSE, conf.level=0.95)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = -2.4155, df = 53.351, p-value = 0.01917
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.81885042 -0.07595964
## sample estimates:
## mean of x mean of y 
##  1.530573  1.977978

2.5 Install, load, update R packages

In RStudio: packages>install>packagename From R console: type: install.packages(“packagename”)

Install and load 01 package, which is my R package “metamicrobiomeR” (N. Ho 2020):

#install.packages("metamicrobiomeR")
#load package for use
library("metamicrobiomeR")

Install and load multiple packages

#install multiple packages 
Packages <- c("knitr","rmarkdown","installr","haven","readr","yaml","tidyverse")
#install.packages(Packages)
#load multiple packages
lapply(Packages, library, character.only = TRUE)

Commands useful for updating packages

# download and install newest version of the package
#update.packages()
# shows which packages have updates
#old.packages ()
# looks for new packages that are not already installed
#new.packages ()

Getting help for package, function, data

#help for package
#help(package="metamicrobiomeR")
#help for function
#help("taxa.compare", package="metamicrobiomeR")
#or load package
#library(metamicrobiomeR)
#?taxa.compare
#help for data 
#help(tabsex4, package = "metamicrobiomeR") # display data help page

2.6 Working with R objects

2.6.1 Vector

A vector is a collection of elements:

x<-c(1,3,4,7,8,10)
x
## [1]  1  3  4  7  8 10
a<-c("red","blue","white")
a
## [1] "red"   "blue"  "white"
#creating vector
seq(1,10, by=2)
## [1] 1 3 5 7 9
rep("boring",3)
## [1] "boring" "boring" "boring"

2.6.2 Matrix

A matrix is a 2-dimensional representation of a vector (two-dimensional rectangular data set):

y<-matrix(x, nrow=2)
y
##      [,1] [,2] [,3]
## [1,]    1    4    8
## [2,]    3    7   10

2.6.3 Array

An array is an n dimensional representation of a vector:

z <- array(x, dim = c(1, 3, 2)); z
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    4
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    8   10

2.6.4 List

A list is a collection of “bins”, each containing any kind of R object:

mylist <- list(x, y, z, a); mylist
## [[1]]
## [1]  1  3  4  7  8 10
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    4    8
## [2,]    3    7   10
## 
## [[3]]
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    4
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    8   10
## 
## 
## [[4]]
## [1] "red"   "blue"  "white"

2.6.5 Data frame

A data frame is a list in tabular form where each “bin” contains a data vector of the same length. A data frame is the usual tabular data set familiar to epidemiologists. Each row is an record and each column (“bin”) is a field. Both matrix and data.frame are tabular data-types.

  • All columns in a matrix must have the same data type (numeric, character, etc.) and the same length.
  • A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). Just like a table in a database or excel sheet.
smoker <- c("Y", "Y", "N", "N", "Y", "Y", "Y", "N", "Y")
cancer <- c("Y", "N", "N", "Y", "Y", "Y", "N", "N", "Y")
quant<-c(1,2,3,4,5,6,7,8,9)
mydf <- data.frame(smoker, cancer,quant);
mydf[1:4,] # display rows 1 to 4
##   smoker cancer quant
## 1      Y      Y     1
## 2      Y      N     2
## 3      N      N     3
## 4      N      Y     4

2.6.6 Assessing the structure of data objects

data("tabsex4", package="metamicrobiomeR")
str(tabsex4)
## 'data.frame':    701 obs. of  23 variables:
##  $ id                           : chr  "k__bacteria.p__actinobacteria" "k__bacteria.p__bacteroidetes" "k__bacteria.p__firmicutes" "k__bacteria.p__fusobacteria" ...
##  $ Estimate.genderMale          : num  0.2269 -0.1549 -0.056 -0.0648 -0.0887 ...
##  $ Std. Error.genderMale        : num  0.121 0.119 0.104 0.282 0.118 ...
##  $ t value.genderMale           : num  1.874 -1.305 -0.537 -0.23 -0.751 ...
##  $ Pr(>|t|).genderMale          : num  0.0619 0.1931 0.5919 0.8186 0.4532 ...
##  $ Estimate.bfNon_exclusiveBF   : num  -0.3688 0.2557 0.2362 -0.0605 0.3699 ...
##  $ Std. Error.bfNon_exclusiveBF : num  0.139 0.136 0.116 0.364 0.134 ...
##  $ t value.bfNon_exclusiveBF    : num  -2.647 1.887 2.035 -0.166 2.764 ...
##  $ Pr(>|t|).bfNon_exclusiveBF   : num  0.00857 0.0602 0.04276 0.86812 0.00606 ...
##  $ Estimate.bfNo_BF             : num  -1.329 0.135 0.356 -0.403 1.153 ...
##  $ Std. Error.bfNo_BF           : num  0.412 0.499 0.392 1.083 0.407 ...
##  $ t value.bfNo_BF              : num  -3.223 0.27 0.908 -0.372 2.834 ...
##  $ Pr(>|t|).bfNo_BF             : num  0.00141 0.7873 0.36478 0.7103 0.0049 ...
##  $ Estimate.age.sample          : num  0.1251 0.0242 0.0115 0.0285 -0.1348 ...
##  $ Std. Error.age.sample        : num  0.0377 0.0367 0.0319 0.0857 0.038 ...
##  $ t value.age.sample           : num  3.318 0.66 0.359 0.333 -3.549 ...
##  $ Pr(>|t|).age.sample          : num  0.001022 0.510068 0.719525 0.73934 0.000448 ...
##  $ pval.adjust.genderMale       : num  0.48 0.899 0.961 0.961 0.961 ...
##  $ pval.adjust.bfNon_exclusiveBF: num  0.059 0.2333 0.1767 0.9976 0.0448 ...
##  $ pval.adjust.bfNo_BF          : num  0.0219 1 1 1 0.0468 ...
##  $ pval.adjust.sample           : num  0.00786 0.82595 0.85375 0.85375 0.00505 ...
##  $ study                        : chr  "Subramanian et al 2014 (Bangladesh)" "Subramanian et al 2014 (Bangladesh)" "Subramanian et al 2014 (Bangladesh)" "Subramanian et al 2014 (Bangladesh)" ...
##  $ pop                          : chr  "Bangladesh" "Bangladesh" "Bangladesh" "Bangladesh" ...

2.6.7 Useful functions to assess R objects

Function Returns
str summary of data object structure
attributes list with data object attributes
mode mode of object
typeof type of object; similar to mode but includes double and integer, if applicable
length length of object
class class of object, if it exists
dim vector with object dimensions, if applicable
nrow number of rows, if applicable
ncol
dimnames list containing vectors of names for each dimension, if applicable
rownames vector of row names of a matrix-like object
colnames vector of column names of a matrix-like object
names vector of names for the list (for a data frame returns field names)
row.names vector of row names for a data frame

2.6.8 Understanding vectors

Operation of vectors

#Generate vector
x1 <- 5; y1 <- c(10, 20, 30)
x2 <- c(5, 5, 5); y2 <- c(10, 20, 30)
x1 * y1
## [1]  50 100 150
x2 * y2
## [1]  50 100 150

2.6.8.1 Boolean queries and subsetting of vectors

lecturer <- c('Nhan', 'Hien', 'Phuong', 'Tien', 'Duy')
lecturer
## [1] "Nhan"   "Hien"   "Phuong" "Tien"   "Duy"
ms.ages <- c(41, 39, 33, 26, 25)
ms.ages
## [1] 41 39 33 26 25
ms.ages >= 30 # logical vector for lecturers with ages >=30 
## [1]  TRUE  TRUE  TRUE FALSE FALSE
thirtysomething <- (ms.ages >= 30) & (ms.ages < 40)
lecturer[thirtysomething] # indexing using logical
## [1] "Hien"   "Phuong"
#naming vector
names(ms.ages)<-lecturer
ms.ages
##   Nhan   Hien Phuong   Tien    Duy 
##     41     39     33     26     25

2.6.8.2 Operations on single vectors

Some operations on single vectors.

Function Description Function Description
sum summation rev reverse order
cumsum cumulative sum order order
diff x[i+1]-x[i] sort sort
prod product rank rank
cumprod cumulative product sample random sample
mean mean quantile percentile
median median var variance, covariance
min minimum sd standard deviation
max maximum table tabulate character vector
range range xtabs tabulate factor vector

2.7 Managing biomedical data in R

2.7.1 Entering data

Enter simple data frame

subjname <- c('Lop', 'Chung', 'Minh')
subjno <- 1:length(subjname)
age <- c(34, 56, 56)
sex <- c('Male', 'Male', 'Female')
dat <- data.frame(subjno, subjname, age, sex); dat
##   subjno subjname age    sex
## 1      1      Lop  34   Male
## 2      2    Chung  56   Male
## 3      3     Minh  56 Female

2.7.2 Importing data from a file

From local: (Example data of Bangladeshi children (Subramanian et al. 2014)).

#your path to dataset
fdir1<-"./metamicrobiomeR-master/inst/extdata/QIIME_outputs/Bangladesh/"
#Read in text file 
dat1<-read.delim(paste(fdir1,"Subramanian_et_al_mapping_file.txt",sep=""),header=T,sep ="")
#Or simply: 
#dat1<-read.delim("./metamicrobiomeR-master/inst/extdata/QIIME_outputs/Bangladesh/Subramanian_et_al_mapping_file.txt",header=T,sep ="")

From URL:

#Read in csv file
test.csv <- read.csv("https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv", header=T)

2.7.3 Some other data import functions

Data import with “rio” makes our life much easier! All you need is import()!

#install.packages("rio")
#install.packages(c('csvy', 'feather', 'fst', 'hexView', 'readODS', 'rmatio'))
library(rio)
test1<-import(paste(fdir1,"Subramanian_et_al_mapping_file.txt",sep=""))

Import excel files with multiple sheets:

#import always  get only 1 dataframe
test<-import(paste(fdir1,"nature13421-s2.xlsx",sep=""),sheet=1,skip=1) 
 
#Import multiple sheets of an excel file and create a list of dataframes
list1<-import_list(paste(fdir1,"nature13421-s2.xlsx",sep=""))
#str(list1)
#View(list1[[1]])

Multiple files from a directory

#Sys.glob() allows you to find files using a regular expression
dir1<-Sys.glob(paste(fdir1,"alpha_div_collated/*.txt",sep=""))
list2<-import_list(dir1)
#str(list2)

2.7.4 Exporting data

#save one dataframe as an csv file
#write.csv(dat1,file=paste(fdir1,"dat1.csv",sep=""),row.names = F)

#save multiple data objects
#save(dat1,list1,list2, file=paste(fdir1,"manyfile.rda",sep=""))

#load saved files
#print(load(paste(fdir1,"manyfile.rda",sep="")))

2.7.5 Basic data manipulation with “tidyverse”

“tidyverse” includes a collection of open source packages for the R programming language that share similar design philosophy, grammar, and data structures” for tidy data (Wickham 2023b).

2.7.5.1 Pipes

Pipes take the output of a command, and feed it as the input of another command.

library(tidyverse)
#pipe from library(magrittr), loaded along with other packages
9 %>% sqrt %>% `+`(8) 
## [1] 11

2.7.5.2 Subset data

Subset data by column and row

dat2a<-test %>%
  slice(1:50) %>%
  select(-starts_with("..."))

2.7.5.3 Change column names

#change column name
colnames(dat2a)<-tolower(colnames(dat2a))
oldname<-colnames(dat2a)
newname<-c("child.id",
           "family.id",
           "cohort",
           "gender",
           "zygosity",
           "whz","waz","haz",
           "age.first.fe",
           "age.last.fe", 
           "no.fe.sam",
           "sam.int.mean",
           "month.ebf",
           "age.first.solid",
           "no.diarhea.yr",
           "pc.days.diarrhea",
           "frac.sam.atb.7day",
           "set")
dat2a<-dat2a %>% rename_at(vars(oldname), ~newname)
#colnames(dat2a)

2.7.5.4 Subset data by row

dat.tw<-dat2a %>% filter(cohort == "Healthy Twins & Triplets")
#more filter 
dat.tw<-dat2a %>% 
  filter(cohort == "Healthy Twins & Triplets")%>%
  filter(!is.na(zygosity))%>%
  filter(between(whz,-1,1))
#or
dat.tw<-dat2a %>% 
  filter(cohort == "Healthy Twins & Triplets" & 
           !is.na(zygosity) & 
           between(whz,-1,1)) 

2.7.5.5 Select columns

#Base R
#Select by position (cautious!)
subdat<-dat2a[,c(1,4:6)]
#or select specific column names
subdat<-dat2a[,c("child.id","gender","zygosity","whz")]

#pipes and select()
subdat<-dat2a %>% select(c(1,4:6))
subdat<-dat2a %>% select(child.id,gender,zygosity,whz)

2.7.5.6 Discard columns

#Base R
subdat<-dat2a[,-c(1,4:6)]
subdat<-dat2a[,colnames(dat2a)[!colnames(dat2a) %in% c("child.id","gender","zygosity","whz")]]
#or
subdat<-dat2a %>% select(-c(1,4:6))
subdat<-dat2a %>% select(-c("child.id","gender","zygosity","whz"))

2.7.5.7 Rename columns

#Base R
#colnames(dat2a)[colnames(dat2a)=="child.id"]<-"childid"
#with tidyverse
dat2a<-dat2a %>% select(childid=child.id, everything())
dat2a<-dat2a %>% rename(child.id=childid)

2.7.5.8 Editting data

#initial look at data 
summary(dat2a)
##    child.id          family.id            cohort             gender            zygosity        
##  Length:50          Length:50          Length:50          Length:50          Length:50         
##  Class :character   Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                                                
##                                                                                                
##                                                                                                
##       whz                waz               haz           age.first.fe    age.last.fe   
##  Min.   :-1.73842   Min.   :-3.7555   Min.   :-4.6586   Min.   : 1.00   Min.   :286.0  
##  1st Qu.:-0.88598   1st Qu.:-2.6037   1st Qu.:-3.4314   1st Qu.: 5.00   1st Qu.:455.0  
##  Median :-0.47063   Median :-1.5116   Median :-1.5749   Median : 6.00   Median :700.5  
##  Mean   :-0.42223   Mean   :-1.5148   Mean   :-1.8570   Mean   : 9.78   Mean   :600.6  
##  3rd Qu.:-0.03458   3rd Qu.:-0.7189   3rd Qu.:-0.7189   3rd Qu.: 9.75   3rd Qu.:706.0  
##  Max.   : 1.20133   Max.   : 1.6698   Max.   : 1.6698   Max.   :37.00   Max.   :738.0  
##    no.fe.sam      sam.int.mean     month.ebf      age.first.solid no.diarhea.yr   pc.days.diarrhea
##  Min.   : 9.00   Min.   :22.56   Min.   :0.0000   Min.   :1.216   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:18.00   1st Qu.:28.29   1st Qu.:0.1971   1st Qu.:4.920   1st Qu.:1.271   1st Qu.: 1.163  
##  Median :21.00   Median :30.48   Median :1.0348   Median :6.932   Median :2.916   Median : 2.742  
##  Mean   :19.92   Mean   :31.34   Mean   :2.0716   Mean   :6.690   Mean   :2.971   Mean   : 3.125  
##  3rd Qu.:23.00   3rd Qu.:33.27   3rd Qu.:3.2523   3rd Qu.:8.361   3rd Qu.:4.054   3rd Qu.: 4.314  
##  Max.   :26.00   Max.   :42.29   Max.   :7.3260   Max.   :9.790   Max.   :8.814   Max.   :10.826  
##  frac.sam.atb.7day     set           
##  Min.   :0.00000   Length:50         
##  1st Qu.:0.08083   Class :character  
##  Median :0.11882   Mode  :character  
##  Mean   :0.18328                     
##  3rd Qu.:0.28373                     
##  Max.   :0.50000
dat2a<-dat2a %>%
  mutate(zygosity=replace(zygosity,zygosity=="NA","singleton")) %>%
  mutate(zygosity=replace(zygosity,zygosity=="not tested",NA))

2.8 Basic data visualization

Below are some examples of basic data visualization with ggplot2 .

2.8.1 Barplot

library(ggthemes)
ggplot(dat2a,aes(gender))+
  geom_bar()

2.8.2 Histogram

ggplot(dat2a,aes(whz))+
  geom_histogram(bins=10)

2.8.2.1 Plot by group

ggplot(dat2a,aes(whz,fill=gender))+
  geom_histogram(bins=10,alpha=0.5, position="identity")

2.8.3 Density

ggplot(dat2a, aes(whz)) +
  geom_density()

2.8.3.1 Plot by group

ggplot(dat2a, aes(whz, fill=gender)) +
  geom_density(alpha=0.5)

2.8.4 Histogram with density

ggplot(dat2a,aes(whz))+
  geom_histogram(bins=10)+
  geom_density()

Something goes wrong! => Create a histogram with a density scale using the computed variable ..density..:

ggplot(dat2a)+
  geom_histogram(aes(x=whz,y=..density..),bins=10)+
  geom_density(aes(whz))

2.8.4.1 By group

ggplot(dat2a)+
  geom_histogram(aes(x=whz,y=..density..,fill = gender),bins=10, position="identity",alpha=0.5)+
  geom_density(aes(whz,color=gender))

2.8.5 Scatter plot

ggplot(dat2a,aes(x=month.ebf, y=whz)) +
  geom_point()

2.8.5.1 With regression line

ggplot(dat2a,aes(x=month.ebf, y=whz)) +
  geom_point()+
  stat_smooth(method = 'glm')

2.8.5.2 By group

ggplot(dat2a,aes(x=month.ebf, y=whz,color=gender)) +
  geom_point()+
  stat_smooth(method = 'glm')

2.8.6 Boxplot

ggplot(dat2a,aes(x=gender, y=month.ebf, color=gender)) +
  geom_boxplot()

2.8.6.1 Box plot with dots

ggplot(dat2a,aes(x=gender, y=month.ebf, color=gender)) +
  geom_boxplot()+ 
  geom_dotplot(binaxis='y', stackdir='center', dotsize=1)

2.8.7 Pie Charts

dat2a %>% 
  group_by(gender) %>%
  summarise(count=n()) %>%
ggplot(aes(x="",y=count,fill=gender)) +
  geom_bar(stat = "identity") +
  coord_polar("y", start = 0)

2.8.8 More data visualization

Please see Chapter “Data visualization” of this book.

2.9 Summary statistics

2.9.1 Basic descriptive summary functions

# Summary of variable month.ebf 
dat2a %>% 
  summarise(
    count = n(),
    mean = mean(month.ebf, na.rm = TRUE),
    sd = sd(month.ebf, na.rm = TRUE),
    median = median(month.ebf, na.rm = TRUE),
    IQR = IQR(month.ebf, na.rm = TRUE))
##   count     mean       sd   median      IQR
## 1    50 2.071644 2.154823 1.034836 3.055231

2.9.1.1 Summary by group

# Summary of month.ebf by cohort
dat2a %>% group_by(cohort) %>%
  summarise(
    count = n(),
    mean = mean(month.ebf, na.rm = TRUE),
    sd = sd(month.ebf, na.rm = TRUE),
    median = median(month.ebf, na.rm = TRUE),
    IQR = IQR(month.ebf, na.rm = TRUE))
## # A tibble: 2 × 6
##   cohort                         count  mean    sd median   IQR
##   <chr>                          <int> <dbl> <dbl>  <dbl> <dbl>
## 1 Healthy Singleton Birth Cohort    25 3.45  2.24   3.29   3.71
## 2 Healthy Twins & Triplets          25 0.694 0.706  0.854  1.02

2.9.1.2 Summary by subgroup of another group

dat2a %>% filter(cohort == "Healthy Twins & Triplets") %>%
  group_by(gender) %>%
  summarise(
    count = n(),
    mean = mean(month.ebf, na.rm = TRUE),
    sd = sd(month.ebf, na.rm = TRUE),
    median = median(month.ebf, na.rm = TRUE),
    IQR = IQR(month.ebf, na.rm = TRUE))
## # A tibble: 2 × 6
##   gender count  mean    sd median   IQR
##   <chr>  <int> <dbl> <dbl>  <dbl> <dbl>
## 1 Female    18 0.772 0.770  0.920 1.04 
## 2 Male       7 0.493 0.498  0.197 0.920

2.9.1.3 More filter

dat2a %>% filter(cohort == "Healthy Twins & Triplets") %>%
  filter(!is.na(zygosity)) %>%
  group_by(gender) %>%
  summarise(
    count = n(),
    mean = mean(month.ebf, na.rm = TRUE),
    sd = sd(month.ebf, na.rm = TRUE),
    median = median(month.ebf, na.rm = TRUE),
    IQR = IQR(month.ebf, na.rm = TRUE))
## # A tibble: 2 × 6
##   gender count  mean    sd median   IQR
##   <chr>  <int> <dbl> <dbl>  <dbl> <dbl>
## 1 Female    16 0.749 0.816  0.690 1.03 
## 2 Male       7 0.493 0.498  0.197 0.920

2.9.2 Basic statistical test examples

2.9.2.1 Continuous variables: parametric

#t-test
summary(dat2a$whz)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.73842 -0.88598 -0.47063 -0.42223 -0.03458  1.20133
hist(dat2a$whz)

t.test(whz~gender, data=dat2a)
## 
##  Welch Two Sample t-test
## 
## data:  whz by gender
## t = -0.7285, df = 47.802, p-value = 0.4699
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -0.5572199  0.2608470
## sample estimates:
## mean in group Female   mean in group Male 
##           -0.4903957           -0.3422092

2.9.2.2 Continuous variables: non-parametric

#Kruskal-Wallis 
summary(dat2a$month.ebf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1971  1.0348  2.0716  3.2523  7.3260
hist(dat2a$month.ebf)

kruskal.test(month.ebf ~ gender, data = dat2a)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  month.ebf by gender
## Kruskal-Wallis chi-squared = 0.12372, df = 1, p-value = 0.725
#wilcox.test
wilcox.test(month.ebf ~ gender, data = dat2a)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  month.ebf by gender
## W = 292.5, p-value = 0.7324
## alternative hypothesis: true location shift is not equal to 0

2.9.2.3 Categorical variables

#Chi-square
table(dat2a$cohort,dat2a$gender)
##                                 
##                                  Female Male
##   Healthy Singleton Birth Cohort      9   16
##   Healthy Twins & Triplets           18    7
chisq.test(dat2a$cohort,dat2a$gender)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  dat2a$cohort and dat2a$gender
## X-squared = 5.153, df = 1, p-value = 0.02321
#Fisher 
fisher.test(dat2a$cohort,dat2a$gender)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  dat2a$cohort and dat2a$gender
## p-value = 0.02224
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.05544835 0.83324786
## sample estimates:
## odds ratio 
##  0.2261644

2.9.3 Summary statistic table

2.9.3.1 Packages for quick, nice summary table

The “arsenal” package provides handy R functions for large-scale statistical summaries (Heinzen et al. 2021).

#install.packages(c("Deducer","arsenal","qwraps2","table1","tangram","purrrlyr","Gmisc")) 
library(arsenal)
ca.vars<-c("cohort","gender","zygosity")
co.vars<-c("whz", "waz", "haz","age.first.fe","age.last.fe","no.fe.sam","sam.int.mean",
        "month.ebf","age.first.solid","no.diarhea.yr","pc.days.diarrhea","frac.sam.atb.7day")
varuse<-c(co.vars,ca.vars[!ca.vars %in% "cohort"])
mylabels <-as.list(varuse)
names(mylabels)<-varuse
tab1 <- tableby(as.formula(paste("cohort",paste(varuse,collapse="+"),sep="~")), 
                data = dat2a)
kable(summary(tab1,
        labelTranslations = mylabels, 
        text=TRUE))
Healthy Singleton Birth Cohort (N=25) Healthy Twins & Triplets (N=25) Total (N=50) p value
whz 0.683
- Mean (SD) -0.380 (0.727) -0.464 (0.724) -0.422 (0.719)
- Range -1.623 - 0.949 -1.738 - 1.201 -1.738 - 1.201
waz < 0.001
- Mean (SD) -0.483 (0.806) -2.546 (0.561) -1.515 (1.248)
- Range -1.741 - 1.670 -3.755 - -1.392 -3.755 - 1.670
haz < 0.001
- Mean (SD) -0.483 (0.806) -3.231 (0.956) -1.857 (1.641)
- Range -1.741 - 1.670 -4.659 - -1.412 -4.659 - 1.670
age.first.fe 0.008
- Mean (SD) 6.360 (2.782) 13.200 (12.124) 9.780 (9.366)
- Range 3.000 - 15.000 1.000 - 37.000 1.000 - 37.000
age.last.fe < 0.001
- Mean (SD) 710.400 (12.430) 490.880 (152.553) 600.640 (154.167)
- Range 700.000 - 738.000 286.000 - 730.000 286.000 - 738.000
no.fe.sam 0.001
- Mean (SD) 21.920 (2.019) 17.920 (5.338) 19.920 (4.476)
- Range 18.000 - 24.000 9.000 - 26.000 9.000 - 26.000
sam.int.mean < 0.001
- Mean (SD) 34.003 (3.813) 28.677 (3.090) 31.340 (4.363)
- Range 30.087 - 42.294 22.562 - 35.000 22.562 - 42.294
month.ebf < 0.001
- Mean (SD) 3.449 (2.242) 0.694 (0.706) 2.072 (2.155)
- Range 0.000 - 7.326 0.000 - 2.332 0.000 - 7.326
age.first.solid 0.254
- Mean (SD) 6.352 (2.236) 7.028 (1.884) 6.690 (2.074)
- Range 1.216 - 9.790 3.088 - 9.396 1.216 - 9.790
no.diarhea.yr < 0.001
- Mean (SD) 4.278 (2.135) 1.664 (1.345) 2.971 (2.205)
- Range 0.520 - 8.814 0.000 - 4.986 0.000 - 8.814
pc.days.diarrhea 0.007
- Mean (SD) 4.090 (2.223) 2.159 (2.601) 3.125 (2.585)
- Range 0.142 - 10.826 0.000 - 9.563 0.000 - 10.826
frac.sam.atb.7day 0.070
- Mean (SD) 0.219 (0.145) 0.148 (0.125) 0.183 (0.139)
- Range 0.000 - 0.500 0.000 - 0.462 0.000 - 0.500
gender 0.011
- Female 9 (36.0%) 18 (72.0%) 27 (54.0%)
- Male 16 (64.0%) 7 (28.0%) 23 (46.0%)
zygosity < 0.001
- N-Miss 0 2 2
- DZ 0 (0.0%) 12 (52.2%) 12 (25.0%)
- Fraternal co-twin in set of triplets 0 (0.0%) 1 (4.3%) 1 (2.1%)
- MZ 0 (0.0%) 8 (34.8%) 8 (16.7%)
- MZ co-twin in set of triplets 0 (0.0%) 2 (8.7%) 2 (4.2%)
- singleton 25 (100.0%) 0 (0.0%) 25 (52.1%)
#Feel free to try other packages! 

2.9.3.2 More comprehensive table (control)

my_controls <- tableby.control(
  test = T,
  total = T,
  numeric.test = "kwt", cat.test = "fe",
  numeric.stats = c("meansd", "medianq1q3", "range", "Nmiss2"),
  cat.stats = c("countpct", "Nmiss2"),
  stats.labels = list(
    meansd = "Mean (SD)",
    medianq1q3 = "Median (Q1, Q3)",
    range = "Min - Max",
    Nmiss2 = "Missing"
  )
)
tab2 <- tableby(as.formula(paste("cohort",paste(varuse,collapse="+"),sep="~")),data = dat2a,
                control = my_controls)
kable(summary(tab2,labelTranslations = mylabels, text=TRUE))
Healthy Singleton Birth Cohort (N=25) Healthy Twins & Triplets (N=25) Total (N=50) p value
whz 0.705
- Mean (SD) -0.380 (0.727) -0.464 (0.724) -0.422 (0.719)
- Median (Q1, Q3) -0.468 (-0.897, 0.015) -0.560 (-0.853, -0.104) -0.471 (-0.886, -0.035)
- Min - Max -1.623 - 0.949 -1.738 - 1.201 -1.738 - 1.201
- Missing 0 0 0
waz < 0.001
- Mean (SD) -0.483 (0.806) -2.546 (0.561) -1.515 (1.248)
- Median (Q1, Q3) -0.706 (-1.031, -0.030) -2.606 (-2.879, -2.295) -1.512 (-2.604, -0.719)
- Min - Max -1.741 - 1.670 -3.755 - -1.392 -3.755 - 1.670
- Missing 0 0 0
haz < 0.001
- Mean (SD) -0.483 (0.806) -3.231 (0.956) -1.857 (1.641)
- Median (Q1, Q3) -0.706 (-1.031, -0.030) -3.437 (-4.032, -2.551) -1.575 (-3.431, -0.719)
- Min - Max -1.741 - 1.670 -4.659 - -1.412 -4.659 - 1.670
- Missing 0 0 0
age.first.fe 0.077
- Mean (SD) 6.360 (2.782) 13.200 (12.124) 9.780 (9.366)
- Median (Q1, Q3) 5.000 (5.000, 8.000) 8.000 (5.000, 15.000) 6.000 (5.000, 9.750)
- Min - Max 3.000 - 15.000 1.000 - 37.000 1.000 - 37.000
- Missing 0 0 0
age.last.fe < 0.001
- Mean (SD) 710.400 (12.430) 490.880 (152.553) 600.640 (154.167)
- Median (Q1, Q3) 704.000 (701.000, 715.000) 455.000 (366.000, 638.000) 700.500 (455.000, 706.000)
- Min - Max 700.000 - 738.000 286.000 - 730.000 286.000 - 738.000
- Missing 0 0 0
no.fe.sam 0.011
- Mean (SD) 21.920 (2.019) 17.920 (5.338) 19.920 (4.476)
- Median (Q1, Q3) 23.000 (21.000, 23.000) 18.000 (13.000, 22.000) 21.000 (18.000, 23.000)
- Min - Max 18.000 - 24.000 9.000 - 26.000 9.000 - 26.000
- Missing 0 0 0
sam.int.mean < 0.001
- Mean (SD) 34.003 (3.813) 28.677 (3.090) 31.340 (4.363)
- Median (Q1, Q3) 33.143 (31.409, 35.350) 28.154 (27.048, 30.350) 30.477 (28.290, 33.271)
- Min - Max 30.087 - 42.294 22.562 - 35.000 22.562 - 42.294
- Missing 0 0 0
month.ebf < 0.001
- Mean (SD) 3.449 (2.242) 0.694 (0.706) 2.072 (2.155)
- Median (Q1, Q3) 3.285 (1.643, 5.355) 0.854 (0.000, 1.018) 1.035 (0.197, 3.252)
- Min - Max 0.000 - 7.326 0.000 - 2.332 0.000 - 7.326
- Missing 0 0 0
age.first.solid 0.388
- Mean (SD) 6.352 (2.236) 7.028 (1.884) 6.690 (2.074)
- Median (Q1, Q3) 6.209 (4.599, 8.213) 7.063 (6.078, 8.476) 6.932 (4.920, 8.361)
- Min - Max 1.216 - 9.790 3.088 - 9.396 1.216 - 9.790
- Missing 0 0 0
no.diarhea.yr < 0.001
- Mean (SD) 4.278 (2.135) 1.664 (1.345) 2.971 (2.205)
- Median (Q1, Q3) 3.957 (3.008, 5.170) 1.276 (0.802, 2.003) 2.916 (1.271, 4.054)
- Min - Max 0.520 - 8.814 0.000 - 4.986 0.000 - 8.814
- Missing 0 0 0
pc.days.diarrhea < 0.001
- Mean (SD) 4.090 (2.223) 2.159 (2.601) 3.125 (2.585)
- Median (Q1, Q3) 3.930 (2.996, 4.755) 1.359 (0.440, 2.740) 2.742 (1.163, 4.314)
- Min - Max 0.142 - 10.826 0.000 - 9.563 0.000 - 10.826
- Missing 0 0 0
frac.sam.atb.7day 0.109
- Mean (SD) 0.219 (0.145) 0.148 (0.125) 0.183 (0.139)
- Median (Q1, Q3) 0.217 (0.087, 0.318) 0.111 (0.077, 0.167) 0.119 (0.081, 0.284)
- Min - Max 0.000 - 0.500 0.000 - 0.462 0.000 - 0.500
- Missing 0 0 0
gender 0.022
- Female 9 (36.0%) 18 (72.0%) 27 (54.0%)
- Male 16 (64.0%) 7 (28.0%) 23 (46.0%)
- Missing 0 0 0
zygosity < 0.001
- DZ 0 (0.0%) 12 (52.2%) 12 (25.0%)
- Fraternal co-twin in set of triplets 0 (0.0%) 1 (4.3%) 1 (2.1%)
- MZ 0 (0.0%) 8 (34.8%) 8 (16.7%)
- MZ co-twin in set of triplets 0 (0.0%) 2 (8.7%) 2 (4.2%)
- singleton 25 (100.0%) 0 (0.0%) 25 (52.1%)
- Missing 0 2 2

2.10 Practice

Write R code to:

  • Import sheet 2 of excel data file “nature13421-s2.xlsx”
  • Rename column names to sensible names, remove redundant rows
  • Save your cleaned data to local disc and load saved data for use
  • Make a table showing number of fecal sample per each child ID
  • Summarize number of fecal samples in all child ID and by cohort of singletons vs twins/triplets
  • Plot number of fecal samples by cohort of singletons vs twins/triplets

2.11 Practice review

2.11.1 The requirements

Import sheet 2 of excel data file (Chan et al. 2023):

library(rio)
#import all sheets
list1<-import_list(paste(fdir1,"nature13421-s2.xlsx",sep=""))
#get sheet 2
sh2<-list1[[2]]

#import sheet 2 only
sh2<-import(paste(fdir1,"nature13421-s2.xlsx",sep=""),sheet=2,skip=1) 

Rename column names to sensible names, remove redundant rows:

library(tidyverse)
#colnames(sh2)
colnames(sh2)[!is.na(sh2[1,])]<-sh2[1,][!is.na(sh2[1,])]
#or just do all at a time
oldname<-colnames(sh2)
newname<-c("cohort",
           "family.id",
           "child.id",
           "sample.id",
           "age.d",
           "age.m",
           "whz","waz","haz",
           "breast.milk",
           "formula",
           "solid",
           "diarrhea",
           "antibiotic.7d",
           "medication",
           "no.sequence",
           "serun.id",
           "barcode")
sh2<-sh2 %>% rename_at(vars(oldname), ~newname)
#colnames(sh2)
#remove unused rows 
sh2<-sh2 %>% 
  filter(!is.na(family.id))

You can save your cleaned data to local disc and load saved data for use:

#save data
#write.csv(sh2,file=paste(fdir1,"sh2.csv",sep=""),row.names=F)
#sh2<-read.csv(paste(fdir1,"sh2.csv",sep=""))

Make a table showing number of fecal sample per each child ID and summarize number of fecal samples in all child ID:

nsam<-sh2 %>%
  group_by(child.id) %>%
  summarise(no.sam=n()) 
#View(nsam)
nsam %>%
    summarise(
      mean = mean(no.sam, na.rm = TRUE),
      sd = sd(no.sam, na.rm = TRUE),
      median = median(no.sam, na.rm = TRUE),
      IQR = IQR(no.sam, na.rm = TRUE))
## # A tibble: 1 × 4
##    mean    sd median   IQR
##   <dbl> <dbl>  <dbl> <dbl>
## 1  19.9  4.48     21     5

Summarize number of fecal samples per childid by cohort of singletons vs twins/triplets:

nsam<-sh2 %>%
  group_by(cohort,child.id) %>%
  summarise(no.sam=n()) %>% 
  summarise(
    mean = mean(no.sam, na.rm = TRUE),
    sd = sd(no.sam, na.rm = TRUE),
    median = median(no.sam, na.rm = TRUE),
    IQR = IQR(no.sam, na.rm = TRUE))
nsam  
## # A tibble: 2 × 5
##   cohort                          mean    sd median   IQR
##   <chr>                          <dbl> <dbl>  <int> <dbl>
## 1 Healthy Singleton Birth Cohort  21.9  2.02     23     2
## 2 Healthy Twins & Triplets        17.9  5.34     18     9

Boxplot number of fecal samples by cohort of singletons vs twins/triplets:

sh2 %>%
  group_by(cohort,child.id) %>%
  summarise(no.sam=n()) %>%
  ggplot(aes(x=cohort,y=no.sam,fill=cohort)) +
  geom_boxplot()+
  ylab("Number of fecal samples")+
  geom_dotplot(binaxis='y', stackdir='center', dotsize=1)

Violin Plot: similar to box plot but shows the density within groups, not much info provided as in box plot:

sh2 %>%
  group_by(cohort,child.id) %>%
  summarise(no.sam=n()) %>%
  ggplot(aes(x=cohort,y=no.sam,fill=cohort)) +
  geom_violin()+
  ylab("Number of fecal samples")

Bar plot number of fecal samples by cohort of singletons vs twins/triplets and summary of number of fecal samples:

nsam<-sh2 %>%
  group_by(cohort,child.id) %>%
  summarise(no.sam=n()) %>%
  #group_by(cohort) %>% 
  summarise(
    no.child = n(),
    mean = mean(no.sam, na.rm = TRUE),
    sd = sd(no.sam, na.rm = TRUE),
    se=sd/sqrt(n()),
    ci=1.96*se,
    median = median(no.sam, na.rm = TRUE),
    IQR = IQR(no.sam, na.rm = TRUE))
nsam
## # A tibble: 2 × 8
##   cohort                         no.child  mean    sd    se    ci median   IQR
##   <chr>                             <int> <dbl> <dbl> <dbl> <dbl>  <int> <dbl>
## 1 Healthy Singleton Birth Cohort       25  21.9  2.02 0.404 0.791     23     2
## 2 Healthy Twins & Triplets             25  17.9  5.34 1.07  2.09      18     9

Bar plot with error bar:

ggplot(nsam, aes(x=cohort, y=mean, fill=cohort)) + 
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=mean-se, ymax=mean+se),
                  width=.2,                    # Width of the error bars
                  position=position_dodge(.9))+
    ylab("Number of fecal samples (mean,se)")

Bar plot with 95%CI:

ggplot(nsam, aes(x=cohort, y=mean, fill=cohort)) + 
    geom_bar(position=position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci),
                  width=.2,                    # Width of the error bars
                  position=position_dodge(.9))+
    ylab("Number of fecal samples (mean, 95%CI)")

Histogram number of fecal samples by cohort of singletons vs twins/triplets:

sh2 %>%
  group_by(cohort,child.id) %>%
  summarise(no.sam=n()) %>%
  ggplot(aes(no.sam,y=..density..,fill=cohort))+
  geom_histogram(bins=10,alpha=0.5, position="identity")+
  geom_density(aes(no.sam,color=cohort),alpha=0.5)

2.11.2 Further exploration

Basic statistical test:

nsam<-sh2 %>%
  group_by(cohort,child.id) %>%
  summarise(no.sam=n()) 
head(nsam)
## # A tibble: 6 × 3
## # Groups:   cohort [1]
##   cohort                         child.id  no.sam
##   <chr>                          <chr>      <int>
## 1 Healthy Singleton Birth Cohort Bgsng7018     21
## 2 Healthy Singleton Birth Cohort Bgsng7035     23
## 3 Healthy Singleton Birth Cohort Bgsng7052     20
## 4 Healthy Singleton Birth Cohort Bgsng7063     20
## 5 Healthy Singleton Birth Cohort Bgsng7071     18
## 6 Healthy Singleton Birth Cohort Bgsng7082     18
wilcox.test(no.sam~cohort,nsam)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  no.sam by cohort
## W = 443.5, p-value = 0.01092
## alternative hypothesis: true location shift is not equal to 0
kruskal.test(no.sam~cohort,nsam)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  no.sam by cohort
## Kruskal-Wallis chi-squared = 6.5277, df = 1, p-value = 0.01062
t.test(no.sam~cohort,nsam)
## 
##  Welch Two Sample t-test
## 
## data:  no.sam by cohort
## t = 3.5045, df = 30.73, p-value = 0.001427
## alternative hypothesis: true difference in means between group Healthy Singleton Birth Cohort and group Healthy Twins & Triplets is not equal to 0
## 95 percent confidence interval:
##  1.671264 6.328736
## sample estimates:
## mean in group Healthy Singleton Birth Cohort       mean in group Healthy Twins & Triplets 
##                                        21.92                                        17.92

Which test is appropriate? => Please see Chapter “Basic Statistics” of this book.

Statistical test for difference in data type between sheet 1 and sheet 2?

  • Sheet 1 is cross sectional data.
  • Sheet 2 is longitudinal data: many observations per childid. Therefore, different ways of data description and analysis may be used.

Longitudinal data visualization with spaghetti plot:

sh2 %>%
  filter(!is.na(whz)) %>%
  ggplot()+ 
  geom_point(aes(x = age.m, y = whz, group = child.id, colour=cohort))+
  geom_line(aes(x = age.m, y = whz, group = child.id, colour=cohort),size=0.3)

More exploration with waz:

sh2 %>%
  filter(!is.na(waz)) %>%
  ggplot()+
  geom_point(aes(x = age.m, y = waz, group = child.id, colour=cohort))+
  geom_line(aes(x = age.m, y = waz, group = child.id, colour=cohort),size=0.3)

We can add fitted lines to the plot:

sh2 %>%
  filter(!is.na(waz)) %>%
  ggplot()+
  geom_point(aes(x = age.m, y = waz, group = child.id, colour=cohort))+
  geom_line(aes(x = age.m, y = waz, group = child.id, colour=cohort),size=0.3)+
  stat_smooth(aes(x = age.m, y = waz, colour=cohort), method = 'glm',size=2)

More exploration with haz:

sh2 %>%
  filter(!is.na(haz)) %>%
  ggplot()+
  geom_point(aes(x = age.m, y = haz, group = child.id, colour=cohort))+
  geom_line(aes(x = age.m, y = haz, group = child.id, colour=cohort),size=0.3)+
  stat_smooth(aes(x = age.m, y = haz, colour=cohort), method = 'glm',size=2)

To test for the difference between two groups, check out the chapter “Longitudinal data analysis with R” in this book.

All-in-one spaghetti plot of three indices by cohort:

sh2 %>%
  filter(!is.na(haz) & !is.na(waz) & !is.na(whz)) %>%
  ggplot()+
  #add each item to the plot
  geom_point(aes(x = age.m, y = haz, group = child.id,color="haz"))+
  geom_line(aes(x = age.m, y = haz, group = child.id,color="haz"),size=0.3)+
  stat_smooth(aes(x = age.m, y = haz,color="haz"), method = 'glm',size=2)+
  geom_point(aes(x = age.m, y = waz, group = child.id,color="waz"))+
  geom_line(aes(x = age.m, y = waz, group = child.id,color="waz"),size=0.3)+
  stat_smooth(aes(x = age.m, y = waz,color="waz"), method = 'glm',size=2)+
  geom_point(aes(x = age.m, y = whz, group = child.id,color="whz"))+
  geom_line(aes(x = age.m, y = whz, group = child.id,color="whz"),size=0.3)+
  stat_smooth(aes(x = age.m, y = whz,color="whz"), method = 'glm',size=2)+
  #divide the plot into multiple panels (cohort groups)  
  facet_grid(.~ cohort)+
  theme(legend.title = element_blank())

All-in-one spaghetti plot by cohort for each index and each plot for each index:

p1<-sh2 %>%
  filter(!is.na(haz) & !is.na(waz) & !is.na(whz)) %>%
  ggplot()+
  geom_point(aes(x = age.m, y = haz, group = child.id, colour=cohort))+
  geom_line(aes(x = age.m, y = haz, group = child.id, colour=cohort),size=0.3)+
  stat_smooth(aes(x = age.m, y = haz, colour=cohort), method = 'glm',size=2)+
  theme(legend.position = "bottom")
p2<-sh2 %>%
  filter(!is.na(haz) & !is.na(waz) & !is.na(whz)) %>%
  ggplot()+
  geom_point(aes(x = age.m, y = waz, group = child.id, colour=cohort))+
  geom_line(aes(x = age.m, y = waz, group = child.id, colour=cohort),size=0.3)+
  stat_smooth(aes(x = age.m, y = waz, colour=cohort), method = 'glm',size=2)+
  theme(legend.position = "bottom")
p3<-sh2 %>%
  filter(!is.na(haz) & !is.na(waz) & !is.na(whz)) %>%
  ggplot()+
  geom_point(aes(x = age.m, y = whz, group = child.id, colour=cohort))+
  geom_line(aes(x = age.m, y = whz, group = child.id, colour=cohort),size=0.3)+
  stat_smooth(aes(x = age.m, y = whz, colour=cohort), method = 'glm',size=2)+
  theme(legend.position = "bottom")
# Combine 3 plots into 1 plot
#install.packages("ggpubr")
library(ggpubr)
ggarrange(p1, p2, p3,nrow=1, common.legend = TRUE, legend="bottom")

2.12 Reference materials

2.12.2 Datasets

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. Rmarkdown: Dynamic Documents for r. https://github.com/rstudio/rmarkdown.
Aragón, Tomás J. 2019. Population Health Data Science with r. 1st ed. Vol. 1. https://bookdown.org/taragonmd/phds/.
Chan, Chung-hong, Thomas J. Leeper, Jason Becker, and David Schoch. 2023. Rio: A Swiss-Army Knife for Data File i/o. https://cran.r-project.org/package=rio.
Heinzen, Ethan, Jason Sinnwell, Elizabeth Atkinson, Tina Gunderson, and Gregory Dougherty. 2021. Arsenal: An Arsenal of r Functions for Large-Scale Statistical Summaries. https://github.com/mayoverse/arsenal.
Ho, Nhan. 2020. metamicrobiomeR: Microbiome Data Analysis & Meta-Analysis with GAMLSS-BEZI & Random Effects. https://github.com/nhanhocu/metamicrobiomeR.
———. 2024b. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Subramanian, Sathish, Sayeeda Huq, Tanya Yatsunenko, Rashidul Haque, Mustafa Mahfuz, Mohammed A. Alam, Amber Benezra, et al. 2014. “Persistent Gut Microbiota Immaturity in Malnourished Bangladeshi Children.” Nature 510: 417–21. https://doi.org/10.1038/nature13421.
———. 2023b. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.