Preface

These are only scattered personal notes, still a long way from being a book!

Essentially, all models are wrong, but some are useful.

— George Box (Box and Draper 1987)

The writing was inspired by "Common statistical tests are linear models (or: how to teach stats)". The main references are Modern Applied Statistics with S, fourth edition (Venables and Ripley 2002), and Mixed-Effects Models in S and S-PLUS (Pinheiro and Bates 2000).

See also the 70+ tips for taking R to its limits (Tsagris and Papadakis 2018).

If you really want to assess uncertainty you need to take into account that the models are false and that several models may capture different aspects of the data and so be false in different ways.

— Brian D. Ripley [1]

Why R

Let’s not kid ourselves: the most widely used piece of software for statistics is Excel.

— Brian D. Ripley (Ripley 2002)

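Jenny Bryan's talk at the 2016 international R user conference (useR! 2016): get out of Excel! [2]

Nathan Stephens summarized the shortcomings of Excel: you lose control of your data, work cannot be reused, files are too large to share conveniently, workbooks carry large amounts of ugly VBA code and become nearly unmaintainable over time, and Excel is not a database. See https://resources.rstudio.com/wistia-rstudio-essentials-2/how-to-excel-without-using-excel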

Some people familiar with R describe it as a supercharged version of Microsoft’s Excel spreadsheet software.

— Ashlee Vance [3]

Writing Details

I think, therefore I R.

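— William B. King [4]

The R Markdown source files of this book are hosted in a GitHub repository. They are edited locally in the RStudio IDE, bookdown organizes the chapter Rmd files and the output formats, and Git handles version control. Every commit pushed to GitHub triggers Travis to build the book automatically: knitr calls the R interpreter to run the code chunks in the Rmd files and writes the results into Markdown (md) files, which Pandoc then converts to html or tex. Producing a pdf additionally requires a TeX typesetting environment. Finally, Netlify hosts the book's website and, together with Travis, provides continuous deployment, so every change is synchronized to the site.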
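The rendering step of this pipeline can be sketched as a small shell helper. This is only an illustration under stated assumptions, not the book's actual build script: `build_book` and the `DRY_RUN` variable are hypothetical names, and `index.Rmd` is assumed to be the entry file. `bookdown::render_book()` and the `gitbook` and `pdf_book` output formats are part of bookdown.

```shell
#!/bin/sh
# Hypothetical helper (not from the book's repository): build_book [FORMAT]
# FORMAT is a bookdown output format such as gitbook or pdf_book.
# By default the command is only printed; set DRY_RUN=0 to execute it,
# which requires R with the bookdown package installed.
build_book() {
  format="${1:-gitbook}"
  # index.Rmd is an assumed entry file name.
  r_expr="bookdown::render_book('index.Rmd', 'bookdown::${format}')"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'Rscript -e "%s"\n' "$r_expr"
  else
    Rscript -e "$r_expr"
  fi
}

# Show the command that would render the HTML (gitbook) version.
build_book gitbook
```

Printing the command first makes it easy to inspect what would run before starting a long build; the pdf route additionally needs the TeX environment mentioned above.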

Rickyfox: Dang it how could I have missed this? Thx for the answer, I feel incredibly stupid now.

Juba: Never underestimate the power of R to make you feel stupid.

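— Rickyfox and Juba [5]

This book depends on many R packages and a fairly complex configuration, so the whole runtime environment is packaged as a Docker image to make it easy for readers to reproduce. The Dockerfile used to build the image is hosted on GitHub alongside the book's source files for readers to study. Building the book locally takes only three steps. First, clone the book project from GitHub to your local machine [6]: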

git clone https://github.com/XiangyunHuang/MASR.git

Then, in the Git Bash terminal emulator, start the virtual machine and pull the prepared image [7]:

docker-machine.exe start default
docker pull xiangyunhuang/masr

Finally, cd into the book project directory and run the following command to build the book:

docker run --rm -u docker -v "/${PWD}://home/docker/workspace" xiangyunhuang/masr make gitbook

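After a successful build, the generated files appear in the _book/ directory. Open the file index.html in Google Chrome [8] and enjoy reading!

If you want to understand the build environment and process, I recommend reading the Dockerfile that ships with the book's source files. Docker Hub builds the image from this file, and the finished image takes up about 2 GB. This book is written in R Markdown (Xie, Allaire, and Grolemund 2018) in the RStudio IDE. Building the electronic version also requires the following R packages and software [9]: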
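Before running the three steps above, it may help to confirm that the required command-line tools are on the PATH. The following is a minimal sketch: `check_tools` is a hypothetical helper, not part of the book's repository, and it only tests whether each command can be found, not whether the Docker daemon or virtual machine is running.

```shell
#!/bin/sh
# Hypothetical helper (not from the book's repository).
# Reports each named command as found or missing and returns
# non-zero if at least one of them is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf 'found:   %s\n' "$tool"
    else
      printf 'missing: %s\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# The build steps above rely on git and docker.
check_tools git docker || echo "Install the missing tools before continuing."
```

Because the function returns a non-zero status when a tool is missing, it can also guard the build in a script, for example `check_tools git docker && git clone https://github.com/XiangyunHuang/MASR.git`.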

xfun::session_info(
  packages = c("knitr", "rmarkdown", "bookdown"),
  dependencies = FALSE
)
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Locale:
##   LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C              
##   LC_TIME=zh_CN.UTF-8        LC_COLLATE=zh_CN.UTF-8    
##   LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
##   LC_PAPER=zh_CN.UTF-8       LC_NAME=C                 
##   LC_ADDRESS=C               LC_TELEPHONE=C            
##   LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       
## 
## Package version:
##   bookdown_0.17 knitr_1.26    rmarkdown_2.1
## 
## Pandoc version: 2.9.1.1

The list of R packages installed in the image can be viewed by running the following command, or by following the link:

docker run --rm xiangyunhuang/masr Rscript -e 'xfun::session_info(.packages(TRUE))'

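bookdown (Xie 2016) organizes the Rmd files; rmarkdown (Allaire et al. 2020) and knitr (Xie 2015) compile the source files to Markdown; Pandoc converts the Markdown files to HTML and TeX; and TinyTeX compiles the TeX files to a PDF document. All figures are made with the ggplot2 package (Wickham 2016).

The R community offers a wealth of help resources. Start with the frequently asked questions collected on the R website, https://cran.r-project.org/faqs.html, then search online at https://cran.r-project.org/search.html or https://rseek.org/. For more ways to get help, see https://www.r-project.org/help.html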

ruler()
----+----1----+----2----+----3----+----4----+----5----+----6----+----
123456789012345678901234567890123456789012345678901234567890123456789

The R packages currently required to build this book are:

Table 0.1: R package dependencies
Package Version Title
abind 1.4-5 Combine Multidimensional Arrays
AER 1.2-9 Applied Econometrics with R
agridat 1.16 Agricultural Datasets
askpass 1.1 Safe Password Entry for R, Git, and SSH
assertthat 0.2.1 Easy Pre and Post Assertions
backports 1.1.5 Reimplementations of Functions Introduced Since R-3.0.0
base64enc 0.1-3 Tools for base64 encoding
bayesplot 1.7.1 Plotting for Bayesian Models
BH 1.72.0-3 Boost C++ Header Files
bookdown 0.17 Authoring Books and Technical Documents with R Markdown
boot 1.3-24 Bootstrap Functions (Originally by Angelo Canty for S)
bridgesampling 0.8-1 Bridge Sampling for Marginal Likelihoods and Bayes Factors
brms 2.11.1 Bayesian Regression Models using Stan
Brobdingnag 1.2-6 Very Large Numbers in R
broom 0.5.4 Convert Statistical Analysis Objects into Tidy Tibbles
callr 3.4.1 Call R from R
car 3.0-6 Companion to Applied Regression
carData 3.0-3 Companion to Applied Regression Data Sets
cellranger 1.1.0 Translate Spreadsheet Cell Ranges to Rows and Columns
checkmate 2.0.0 Fast and Versatile Argument Checks
class 7.3-15 Functions for Classification
classInt 0.4-2 Choose Univariate Class Intervals
cli 2.0.1 Helpers for Developing Command Line Interfaces
clipr 0.7.0 Read and Write from the System Clipboard
coda 0.19-3 Output Analysis and Diagnostics for MCMC
codetools 0.2-16 Code Analysis Tools for R
colorspace 1.4-1 A Toolbox for Manipulating and Assessing Colors and Palettes
colourpicker 1.0 A Colour Picker Tool for Shiny and for Selecting Colours in Plots
crayon 1.3.4 Colored Terminal Output
crosstalk 1.0.0 Inter-Widget Interactivity for HTML Widgets
curl 4.3 A Modern and Flexible Web Client for R
data.table 1.12.8 Extension of data.frame
DBI 1.1.0 R Database Interface
desc 1.2.0 Manipulate DESCRIPTION Files
digest 0.6.23 Create Compact Hash Digests of R Objects
dplyr 0.8.4 A Grammar of Data Manipulation
DT 0.12 A Wrapper of the JavaScript Library DataTables
dygraphs 1.1.1.6 Interface to Dygraphs Interactive Time Series Charting Library
e1071 1.7-3 Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien
ellipsis 0.3.0 Tools for Working with …
evaluate 0.14 Parsing and Evaluation Tools that Provide More Details than the Default
fansi 0.4.1 ANSI Control Sequence Aware String Functions
faraway 1.0.7 Functions and Datasets for Books by Julian Faraway
farver 2.0.3 High Performance Colour Space Manipulation
fastmap 1.0.1 Fast Implementation of a Key-Value Store
forcats 0.4.0 Tools for Working with Categorical Variables (Factors)
foreach 1.4.7 Provides Foreach Looping Construct
foreign 0.8-75 Read Data Stored by ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’, ‘Weka’, ‘dBase’, …
formatR 1.7 Format R Code Automatically
formattable 0.2.0.1 Create Formattable Data Structures
Formula 1.2-3 Extended Model Formulas
future 1.16.0 Unified Parallel and Distributed Processing in R for Everyone
gdtools 0.2.1 Utilities for Graphical Rendering
generics 0.0.2 Common S3 Generics not Provided by Base R Methods Related to Model Fitting
geoR 1.7-5.2.2 Analysis of Geostatistical Data
gganimate 1.0.4 A Grammar of Animated Graphics
ggfittext 0.8.1 Fit Text Inside a Box in ggplot2
ggfortify 0.4.8 Data Visualization Tools for Statistical Analysis Results
ggiraph 0.7.0 Make ggplot2 Graphics Interactive
ggmosaic 0.2.0 Mosaic Plots in the ggplot2 Framework
ggplot2 3.2.1 Create Elegant Data Visualisations Using the Grammar of Graphics
ggridges 0.5.2 Ridgeline Plots in ggplot2
gifski 0.8.6 Highest Quality GIF Encoder
glmmTMB 1.0.0 Generalized Linear Mixed Models using Template Model Builder
glmnet 3.0-2 Lasso and Elastic-Net Regularized Generalized Linear Models
globals 0.12.5 Identify Global Objects in R Expressions
glue 1.3.1 Interpreted String Literals
gridBase 0.4-7 Integration of base and grid graphics
gridExtra 2.3 Miscellaneous Functions for Grid Graphics
gtable 0.3.0 Arrange Grobs in Tables
gtools 3.8.1 Various R Programming Tools
haven 2.2.0 Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files
hexbin 1.28.1 Hexagonal Binning Routines
highcharter 0.7.0 A Wrapper for the Highcharts Library
highr 0.8 Syntax Highlighting for R Source Code
hms 0.5.3 Pretty Time of Day
htmltools 0.4.0 Tools for HTML
htmlwidgets 1.5.1 HTML Widgets for R
httpuv 1.5.2 HTTP and WebSocket Server Library
httr 1.4.1 Tools for Working with URLs and HTTP
igraph 1.2.4.2 Network Analysis and Visualization
inline 0.3.15 Functions to Inline C, C++, Fortran Function Calls from R
iterators 1.0.12 Provides Iterator Construct
jpeg 0.1-8.1 Read and write JPEG images
jsonlite 1.6.1 A Robust, High Performance JSON Parser and Generator for R
kableExtra 1.1.0 Construct Complex Table with kable and Pipe Syntax
KernSmooth 2.23-16 Functions for Kernel Smoothing Supporting Wand & Jones (1995)
knitr 1.28 A General-Purpose Package for Dynamic Report Generation in R
labeling 0.3 Axis Labeling
later 1.0.0 Utilities for Scheduling Functions to Execute Later with Event Loops
lattice 0.20-38 Trellis Graphics for R
latticeExtra 0.6-29 Extra Graphical Utilities Based on Lattice
lazyeval 0.2.2 Lazy (Non-Standard) Evaluation
lifecycle 0.1.0 Manage the Life Cycle of your Package Functions
listenv 0.8.0 Environments Behaving (Almost) as Lists
lme4 1.1-21 Linear Mixed-Effects Models using Eigen and S4
lmtest 0.9-37 Testing Linear Regression Models
loo 2.2.0 Efficient Leave-One-Out Cross-Validation and WAIC for Bayesian Models
lpSolve 5.6.15 Interface to ‘Lp_solve’ v. 5.5 to Solve Linear/Integer Programs
lubridate 1.7.4 Make Dealing with Dates a Little Easier
magick 2.3 Advanced Graphics and Image-Processing in R
magrittr 1.5 A Forward-Pipe Operator for R
mapproj 1.2.7 Map Projections
maps 3.3.0 Draw Geographical Maps
maptools 0.9-9 Tools for Handling Spatial Objects
markdown 1.1 Render Markdown with the C Library Sundown
MASS 7.3-51.5 Support Functions and Datasets for Venables and Ripley’s MASS
Matrix 1.2-18 Sparse and Dense Matrix Classes and Methods
MatrixModels 0.4-1 Modelling with Sparse And Dense Matrices
matrixStats 0.55.0 Functions that Apply to Rows and Columns of Matrices (and to Vectors)
mgcv 1.8-31 Mixed GAM Computation Vehicle with Automatic Smoothness Estimation
mime 0.9 Map Filenames to MIME Types
miniUI 0.1.1.1 Shiny UI Widgets for Small Screens
minqa 1.2.4 Derivative-free optimization algorithms by quadratic approximation
munsell 0.5.0 Utilities for Using Munsell Colours
mvtnorm 1.0-12 Multivariate Normal and t Distributions
nleqslv 3.3.2 Solve Systems of Nonlinear Equations
nlme 3.1-144 Linear and Nonlinear Mixed Effects Models
nloptr 1.2.1 R Interface to NLopt
nnet 7.3-12 Feed-Forward Neural Networks and Multinomial Log-Linear Models
openssl 1.4.1 Toolkit for Encryption, Signatures and Certificates Based on OpenSSL
openxlsx 4.1.4 Read, Write and Edit xlsx Files
packrat 0.5.0 A Dependency Management System for Projects and their R Package Dependencies
patchwork 1.0.0 The Composer of Plots
pbkrtest 0.4-7 Parametric Bootstrap and Kenward Roger Based Methods for Mixed Model Comparison
pdftools 2.3 Text Extraction, Rendering and Converting of PDF Documents
pillar 1.4.3 Coloured Formatting for Columns
pkgbuild 1.0.6 Find Tools Needed to Build R Packages
pkgconfig 2.0.3 Private Configuration for R Packages
plogr 0.2.0 The plog C++ Logging Library
plotly 4.9.1 Create Interactive Web Graphics via plotly.js
plyr 1.8.5 Tools for Splitting, Applying and Combining Data
png 0.1-7 Read and write PNG images
prettyunits 1.1.1 Pretty, Human Readable Formatting of Quantities
processx 3.4.1 Execute and Control System Processes
productplots 0.1.1 Product Plots for R
progress 1.2.2 Terminal Progress Bars
promises 1.1.0 Abstractions for Promise-Based Asynchronous Programming
ps 1.3.0 List, Query, Manipulate System Processes
purrr 0.3.3 Functional Programming Tools
pwr 1.2-2 Basic Functions for Power Analysis
qpdf 1.1 Split, Combine and Compress PDF Files
quantmod 0.4-15 Quantitative Financial Modelling Framework
quantreg 5.54 Quantile Regression
R6 2.4.1 Encapsulated Classes with Reference Semantics
RandomFields 3.3.8 Simulation and Analysis of Random Fields
RandomFieldsUtils 0.5.3 Utilities for the Simulation and Analysis of Random Fields
rappdirs 0.3.1 Application Directories: Determine Where to Save Data, Caches, and Logs
raster 3.0-12 Geographic Data Analysis and Modeling
rasterVis 0.47 Visualization Methods for Raster Data
RColorBrewer 1.1-2 ColorBrewer Palettes
Rcpp 1.0.3 Seamless R and C++ Integration
RcppEigen 0.3.3.7.0 Rcpp Integration for the Eigen Templated Linear Algebra Library
readr 1.3.1 Read Rectangular Text Data
readxl 1.3.1 Read Excel Files
rematch 1.0.1 Match Regular Expressions with a Nicer ‘API’
reshape2 1.4.3 Flexibly Reshape Data: A Reboot of the Reshape Package
reticulate 1.14 Interface to Python
rgdal 1.4-8 Bindings for the Geospatial Data Abstraction Library
rio 0.5.16 A Swiss-Army Knife for Data I/O
rlang 0.4.4 Functions for Base Types and Core R and Tidyverse Features
rlist 0.4.6.1 A Toolbox for Non-Tabular Data Manipulation
rmarkdown 2.1 Dynamic Documents for R
rprojroot 1.3-2 Finding Files in Project Subdirectories
rsconnect 0.8.16 Deployment Interface for R Markdown Documents and Shiny Applications
rstan 2.19.2 R Interface to Stan
rstantools 2.0.0 Tools for Developing R Packages Interfacing with Stan
rstudioapi 0.10 Safely Access the RStudio API
rvest 0.3.5 Easily Harvest (Scrape) Web Pages
sandwich 2.5-1 Robust Covariance Matrix Estimators
scales 1.1.0 Scale Functions for Visualization
selectr 0.4-2 Translate CSS Selectors to XPath Expressions
sf 0.8-1 Simple Features for R
shades 1.4.0 Simple Colour Manipulation
shape 1.4.4 Functions for Plotting Graphical Shapes, Colors
shiny 1.4.0 Web Application Framework for R
shinyjs 1.1 Easily Improve the User Experience of Your Shiny Apps in Seconds
shinystan 2.5.0 Interactive Visual and Numerical Diagnostics and Posterior Analysis for Bayesian Models
shinythemes 1.1.2 Themes for Shiny
sourcetools 0.1.7 Tools for Reading, Tokenizing and Parsing R Code
sp 1.3-2 Classes and Methods for Spatial Data
SparseM 1.78 Sparse Linear Algebra
splancs 2.01-40 Spatial and Space-Time Point Pattern Analysis
StanHeaders 2.19.0 C++ Header Files for Stan
stargazer 5.2.2 Well-Formatted Regression and Summary Statistics Tables
stringi 1.4.5 Character String Processing Facilities
stringr 1.4.0 Simple, Consistent Wrappers for Common String Operations
survival 3.1-8 Survival Analysis
sys 3.3 Powerful and Reliable Tools for Running System Commands in R
systemfonts 0.1.1 System Native Font Finding
threejs 0.3.3 Interactive 3D Scatter Plots, Networks and Globes
tibble 2.1.3 Simple Data Frames
tidyr 1.0.2 Tidy Messy Data
tidyselect 1.0.0 Select from a Set of Strings
tinytex 0.19 Helper Functions to Install and Maintain TeX Live, and Compile LaTeX Documents
TMB 1.7.16 Template Model Builder: A General Random Effect Tool Inspired by ADMB
transformr 0.1.1 Polygon and Path Transformations
treemap 2.4-2 Treemap Visualization
treemapify 2.5.3 Draw Treemaps in ggplot2
TTR 0.23-6 Technical Trading Rules
tufte 0.5 Tufte’s Styles for R Markdown Documents
tweenr 1.0.1 Interpolate Data for Smooth Animations
units 0.6-5 Measurement Units for R Vectors
utf8 1.1.4 Unicode Text Processing
uuid 0.1-2 Tools for generating and handling of UUIDs
vctrs 0.2.2 Vector Helpers
viridisLite 0.3.0 Default Color Maps from matplotlib (Lite Version)
webshot 0.5.2 Take Screenshots of Web Pages
whisker 0.4 {{mustache}} for R, Logicless Templating
withr 2.1.2 Run Code With Temporarily Modified Global State
xfun 0.12 Miscellaneous Functions by Yihui Xie
XML 3.99-0.3 Tools for Parsing and Generating XML Within R and S-Plus
xml2 1.2.2 Parse XML
xtable 1.8-4 Export Tables to LaTeX or HTML
xts 0.12-0 eXtensible Time Series
yaml 2.2.1 Methods to Convert R Data to YAML and Back
zip 2.0.4 Cross-Platform ‘zip’ Compression
zoo 1.8-7 S3 Infrastructure for Regular and Irregular Time Series (Z’s Ordered Observations)

References

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2020. Rmarkdown: Dynamic Documents for R. https://CRAN.R-project.org/package=rmarkdown.

Box, George E. P., and Norman R. Draper. 1987. Empirical Model-Building and Response Surfaces. New York, NY: John Wiley & Sons.

Pinheiro, José C., and Douglas M. Bates. 2000. Mixed-Effects Models in S and S-PLUS. New York, NY: Springer-Verlag. https://link.springer.com/book/10.1007%2Fb98882.

Ripley, Brian D. 2002. “Statistical Methods Need Software: A View of Statistical Computing.” Opening Lecture Royal Statistical Society. Plymouth. https://www.stats.ox.ac.uk/~ripley/RSS2002.pdf.

Tsagris, Michail, and Manos Papadakis. 2018. “Taking R to Its Limits: 70+ Tips.” PeerJ Preprints 6: e26605v1. https://doi.org/10.7287/peerj.preprints.26605v1.

Venables, W. N., and Brian D. Ripley. 2002. Modern Applied Statistics with S. 4th ed. New York, NY: Springer-Verlag. http://www.stats.ox.ac.uk/pub/MASS4.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. 2nd ed. New York: Springer-Verlag. https://ggplot2-book.org/.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. https://yihui.org/knitr/.

Xie, Yihui. 2016. Bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman and Hall/CRC. https://github.com/rstudio/bookdown.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown.


  1. https://stat.ethz.ch/pipermail/r-help/2007-July/136590.html

  2. https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/jailbreakr-Get-out-of-Excel-free

  3. https://www.nytimes.com/2009/01/07/technology/business-computing/07program.html

  4. https://ww2.coastal.edu/kingw/statistics/R-tutorials/

  5. https://stackoverflow.com/questions/15568942

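  6. If Git is not installed locally, download and install the version for your system from its website, https://git-scm.com/downloads.

  7. To make it easy for readers to reproduce the content of this book, the build environment is packaged as a Docker image. Before starting the image, make sure Docker is installed locally (https://www.docker.com/products/docker-desktop); see the official tutorial for the installation procedure.

  8. Do not use Internet Explorer; Google Chrome is recommended for the best reading experience.

  9. The Pandoc software (https://pandoc.org/) comes from the Fedora 30 system repository, version 2.2.1; the Pandoc bundled with recent RStudio IDE releases is usually newer than this. If you plan to build the book on your local system, the Pandoc bundled with the RStudio IDE is sufficient, though you can also download and install the latest version from https://github.com/jgm/pandoc/releases/latest. In addition, you need to follow the Dockerfile that ships with the book to set up a C++ compilation environment and install the required R packages, making sure the locally installed versions are no older than those in the image.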