Chapter 2 Introduction to statistical software - R
2.0.1 R packages required for this chapter
library(knitr)
library(tidyverse)
library(broom)
library(psych)
library(magrittr)
A Twitter post nicely summarized 10 reasons in answer to the question “Why should I bother learning to code?”:
- Encourages reproducible statistical analyses
- Enables easy incorporation of “New” and “personalized” statistical methods
- Code sharing - be inspired and inspiring
- Career prospects improve with this skill
- Data visualization -> better understanding of your data
- Avoid copy / paste frustrations
- Customization - get exactly what you want
- Develop interactive graphs and web-apps to increase dissemination and understanding of your work
- Consultations with a statistician become easier
- Personal satisfaction (it can be fun)
In summary, coding offers flexibility, transparency, and reproducibility, which should enhance overall research quality.
2.1 Statistical software - R
The most important element in clinical epidemiology is NOT which statistical software is chosen but rather an in-depth understanding of the basic epidemiologic and statistical concepts. Having said that, R has many advantages, largely summarized by the fact that it is the lingua franca of data science, used by millions of data experts.
Why R?
- Free and open source software environment for statistical computing and graphics
- Open source indicates the original source code is freely available, may be redistributed, and modified
- Allows & encourages researchers to modify, extend, and develop additions to the base program
- Additions are referred to as packages
- Use of scripts and Rmarkdown encourages reproducible research
- Active online community offers formal courses and shares solutions to coding queries
- RStudio, an integrated development environment (IDE), greatly facilitates the R experience
- Combining R with Rmarkdown makes it easy to create, reproduce, and share your work via html or pdf files
This book is not intended to be a first-line resource for learning R, as there are many excellent online learning resources. It should be noted that there are at least 2 flavors of R:
1) standard base R
2) the tidyverse version, a collection of R packages designed with a common philosophy, grammar, and data structures, especially useful for data science.
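To make the two flavors concrete, here is a minimal sketch that computes the same quantity (mean age by gender) both ways. The toy data frame and the object names `base_res` and `tidy_res` are assumptions made for illustration, and the tidyverse part only runs if the dplyr package happens to be installed.

```r
# Toy data, invented for this illustration
dat <- data.frame(id = 1:4,
                  Age = c(21, 15, 14, 18),
                  Gender = c("M", "F", "F", "M"))

# Base R flavor: formula interface of aggregate()
base_res <- aggregate(Age ~ Gender, data = dat, FUN = mean)
base_res

# tidyverse flavor: the same summary as a pipeline of dplyr verbs
# (skipped when dplyr is not installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  tidy_res <- dat %>%
    group_by(Gender) %>%
    summarise(Age = mean(Age))
  tidy_res
}
```

Both versions give the same numbers; the difference is mostly one of style, with the tidyverse version reading left to right as a sequence of steps.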
Learning and help resources
- The definitive online resource for R is CRAN, which hosts a number of manuals
- A condensed R reference card can be found at https://cran.r-project.org/doc/contrib/Short-refcard.pdf
- The swirl tutorial teaches R programming and data science interactively; install swirl with install.packages("swirl") and run it with the swirl() command
- Helpful cheat sheets can be found on the RStudio website
- UCLA
- Quick R
- R-bloggers, a daily compilation of R blogs from across the internet
- Advanced R
- After acquiring the basics, many questions can be answered with the help of Stack Overflow
- Good old Google using “r type your question”
Within the R environment, to find help for a specific function, for example epi.2by2 in the epiR package, try typing
* help("epi.2by2")
* example("epi.2by2")
* help.search("epi.2by2")
* RSiteSearch("epi.2by2") - provides online search
Packages
The capabilities of base R are greatly extended using “packages.” These are distributed over the Internet via CRAN and can be downloaded either directly during an R session by typing the command install.packages("package.name") or via RStudio, which also provides a directory of all downloaded and installed packages. In 2010, there were about 2,000 packages, by 2016 there were almost 10,000, and by 2020 this had reached almost 17,000. The rapid growth of these important resources is one of the prime reasons for the ever-increasing popularity of R. Of course, there is also a chicken-and-egg argument that sees the increasing popularity of R as a reason why more people are contributing packages.
For epidemiologists, some of the standard epidemiology packages include epiR, epibasix, epitools, and Epi, but there are over 30 packages, including some that are ultra specialized.
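A common pattern is to install a package only when it is missing and then attach it for the session. The sketch below assumes a vector of wanted package names called `wanted` (a name invented here); it lists only `stats` and `utils`, which ship with R, so nothing is downloaded when the code is run as written. In practice one would replace these with the packages actually needed, for example "epiR".

```r
# Packages assumed to be wanted for this illustration; both ship with R,
# so no download is triggered when the sketch is run as written
wanted <- c("stats", "utils")

# requireNamespace() returns TRUE when a package is installed and loadable
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- wanted[!is_installed]

# Install only what is missing (requires a reachable CRAN mirror)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Attach a package for the current session once it is installed
library(stats)
```

Installation is needed only once per machine, whereas library() must be called in every new session that uses the package.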
2.2 R - Common data and variable manipulations
R is a programming language based on the concept of objects, which may be data or code in the form of procedures. Data structures are a way of organizing and storing data, and there are four basic types: vector (one-dimensional structure of a single type), matrix (two-dimensional structure of a single type), list (one-dimensional structure whose elements may be of different types), and data frame (a special case of a list where each component has the same length). Data frames are the most common data structure used in epidemiologic analyses.
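The four basic structures can be sketched in a few lines. The object names (`v`, `m`, `l`, `df`) and the values are invented for illustration only.

```r
# vector: one dimension, a single type
v <- c(21, 15, 14, 18)

# matrix: two dimensions, a single type
m <- matrix(1:6, nrow = 2, ncol = 3)

# list: one dimension, elements may differ in type and length
l <- list(id = 1:4, site = "clinic A", flag = TRUE)

# data frame: a list whose components (columns) all have the same length
df <- data.frame(id = 1:4, Age = v, Gender = c("M", "F", "F", "M"))

# str() gives a compact description of any object
str(df)

# a data frame really is a list underneath
is.list(df)
```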
Here are some common data manipulations in R that represent the minimal knowledge the reader should be comfortable with in order to easily follow the code in later chapters.
Creating a data frame
# creation of a simple data frame (dat)
dat <- data.frame('id'=1:4, 'Age'=c(21,15,14,18), 'Gender'=c('M','F','F','M'))
dat
## id Age Gender
## 1 1 21 M
## 2 2 15 F
## 3 3 14 F
## 4 4 18 M
Read a data file
dat1 <- read.csv("data/pima_db.csv")
head(dat1,3)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
Other file formats, including Excel, SAS, SPSS, and Stata files, can be read with readxl::read_excel(), sas7bdat::read.sas7bdat(), Hmisc::spss.get(), and foreign::read.dta(), respectively.
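Because the data file used above may not be at hand, here is a self-contained sketch of the same read step: it writes a small, invented data frame to a temporary csv file and reads it back. The names `csv_path` and `dat_in` are assumptions made for this illustration.

```r
# Small invented data frame to write out
dat <- data.frame(id = 1:4,
                  Age = c(21, 15, 14, 18),
                  Gender = c("M", "F", "F", "M"))

# Write to a temporary csv file (row.names = FALSE avoids an extra column)
csv_path <- tempfile(fileext = ".csv")
write.csv(dat, csv_path, row.names = FALSE)

# Read it back and inspect the column types that read.csv() inferred
dat_in <- read.csv(csv_path)
str(dat_in)
```

Checking the result with str() or head() right after reading is a cheap way to catch columns that were imported with an unexpected type.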
Variable manipulation
# create a new variable based on cutoff on existing variable
# Base R
dat1$Glucose_hi <- NA
dat1[dat1$Glucose >120, 'Glucose_hi'] <- 1
dat1[dat1$Glucose <=120, 'Glucose_hi'] <- 0
head(dat1[,c(1:3,8:10)],4)
## Pregnancies Glucose BloodPressure Age Outcome Glucose_hi
## 1 6 148 72 50 1 1
## 2 1 85 66 31 0 0
## 3 8 183 64 32 1 1
## 4 1 89 66 21 0 0
#tidyverse
library(tidyverse)
dat2 <- dat1 %>% mutate(Age_old = ifelse(Age > 50, 1, 0))
head(dat2[,c(1:3,8:11)],10)
## Pregnancies Glucose BloodPressure Age Outcome Glucose_hi Age_old
## 1 6 148 72 50 1 1 0
## 2 1 85 66 31 0 0 0
## 3 8 183 64 32 1 1 0
## 4 1 89 66 21 0 0 0
## 5 0 137 40 33 1 1 0
## 6 5 116 74 30 0 0 0
## 7 3 78 50 26 1 0 0
## 8 10 115 0 29 0 0 0
## 9 2 197 70 53 1 1 1
## 10 8 125 96 54 1 1 1
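As a complement to the two approaches above, the following sketch shows a one-line base R version of the binary indicator and a multi-level version using cut(). The toy data frame, the extra cutoff of 99, and the labels are assumptions chosen only for illustration.

```r
# Toy data with one missing value (invented for this illustration)
toy <- data.frame(Glucose = c(148, 85, NA, 120, 183))

# Binary indicator in one step: the comparison gives TRUE/FALSE/NA,
# and as.numeric() converts it to 1/0/NA
toy$Glucose_hi <- as.numeric(toy$Glucose > 120)

# More than two categories: cut() with intervals closed on the right,
# i.e. (-Inf, 99], (99, 120], (120, Inf)
toy$Glucose_cat <- cut(toy$Glucose,
                       breaks = c(-Inf, 99, 120, Inf),
                       labels = c("low", "mid", "high"))
toy
```

Note that a value of exactly 120 falls in the middle category, which matches the `> 120` rule used for the binary indicator.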
Variable and data subsetting
#################
# Variable subsetting
#################
# Base R
dat1s = subset(dat1, select = c('Pregnancies', 'Glucose'))
head(dat1s)
## Pregnancies Glucose
## 1 6 148
## 2 1 85
## 3 8 183
## 4 1 89
## 5 0 137
## 6 5 116
# tidyverse
dat1 %>% dplyr::select(Pregnancies, Glucose) %>% head()
## Pregnancies Glucose
## 1 6 148
## 2 1 85
## 3 8 183
## 4 1 89
## 5 0 137
## 6 5 116
#################
# Data subsetting
#################
# Base R #1
dat1s <- subset(dat1, subset = Pregnancies >2 & Glucose_hi == 1) # notice need for == when looking for equality
head(dat1s[,c(1:4,8:10)])
## Pregnancies Glucose BloodPressure SkinThickness Age Outcome Glucose_hi
## 1 6 148 72 35 50 1 1
## 3 8 183 64 0 32 1 1
## 10 8 125 96 0 54 1 1
## 12 10 168 74 0 34 1 1
## 13 10 139 80 0 57 0 1
## 15 5 166 72 19 51 1 1
# Base R #2
dat1ss = dat1[which(dat1$Pregnancies >2 & dat1$Glucose_hi ==1),]
head(dat1ss[,c(1:4,8:10)])
## Pregnancies Glucose BloodPressure SkinThickness Age Outcome Glucose_hi
## 1 6 148 72 35 50 1 1
## 3 8 183 64 0 32 1 1
## 10 8 125 96 0 54 1 1
## 12 10 168 74 0 34 1 1
## 13 10 139 80 0 57 0 1
## 15 5 166 72 19 51 1 1
# tidyverse
library(tidyverse)
# head() shows the first 6 matching rows with all columns
dat1 %>% dplyr::filter(Pregnancies >2 & Glucose_hi == 1) %>% head()
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 8 183 64 0 0 23.3
## 3 8 125 96 0 0 0.0
## 4 10 168 74 0 0 38.0
## 5 10 139 80 0 0 27.1
## 6 5 166 72 19 175 25.8
## DiabetesPedigreeFunction Age Outcome Glucose_hi
## 1 0.627 50 1 1
## 2 0.672 32 1 1
## 3 0.232 54 1 1
## 4 0.537 34 1 1
## 5 1.441 57 0 1
## 6 0.587 51 1 1
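One practical difference between the base R subsetting styles appears when the filtering variable contains missing values. The sketch below uses an invented toy data frame to show that plain logical indexing keeps a row of NAs for each missing comparison, whereas which() and subset() drop those rows.

```r
# Toy data with one missing value (invented for this illustration)
toy <- data.frame(id = 1:5, Pregnancies = c(6, 1, NA, 8, 3))

# Plain logical indexing: the NA comparison yields an all-NA row
by_logical <- toy[toy$Pregnancies > 2, ]

# which() keeps only the positions where the condition is TRUE
by_which <- toy[which(toy$Pregnancies > 2), ]

# subset() also drops rows where the condition is NA
by_subset <- subset(toy, Pregnancies > 2)

nrow(by_logical)  # 4 rows, one of them all NA
nrow(by_which)    # 3 rows
nrow(by_subset)   # 3 rows
```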
Basic Data Descriptions
# Base R
summary(dat1)
## Pregnancies Glucose BloodPressure SkinThickness Insulin
## Min. : 0.00 Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1.00 1st Qu.: 99 1st Qu.: 62.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 3.00 Median :117 Median : 72.0 Median :23.0 Median : 30.5
## Mean : 3.85 Mean :121 Mean : 69.1 Mean :20.5 Mean : 79.8
## 3rd Qu.: 6.00 3rd Qu.:140 3rd Qu.: 80.0 3rd Qu.:32.0 3rd Qu.:127.2
## Max. :17.00 Max. :199 Max. :122.0 Max. :99.0 Max. :846.0
## BMI DiabetesPedigreeFunction Age Outcome
## Min. : 0.0 Min. :0.078 Min. :21.0 Min. :0.000
## 1st Qu.:27.3 1st Qu.:0.244 1st Qu.:24.0 1st Qu.:0.000
## Median :32.0 Median :0.372 Median :29.0 Median :0.000
## Mean :32.0 Mean :0.472 Mean :33.2 Mean :0.349
## 3rd Qu.:36.6 3rd Qu.:0.626 3rd Qu.:41.0 3rd Qu.:1.000
## Max. :67.1 Max. :2.420 Max. :81.0 Max. :1.000
## Glucose_hi
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.454
## 3rd Qu.:1.000
## Max. :1.000
# Other approaches
psych::describe(dat1)
## vars n mean sd median trimmed mad min
## Pregnancies 1 768 3.85 3.37 3.00 3.46 2.97 0.00
## Glucose 2 768 120.89 31.97 117.00 119.38 29.65 0.00
## BloodPressure 3 768 69.11 19.36 72.00 71.36 11.86 0.00
## SkinThickness 4 768 20.54 15.95 23.00 19.94 17.79 0.00
## Insulin 5 768 79.80 115.24 30.50 56.75 45.22 0.00
## BMI 6 768 31.99 7.88 32.00 31.96 6.82 0.00
## DiabetesPedigreeFunction 7 768 0.47 0.33 0.37 0.42 0.25 0.08
## Age 8 768 33.24 11.76 29.00 31.54 10.38 21.00
## Outcome 9 768 0.35 0.48 0.00 0.31 0.00 0.00
## Glucose_hi 10 768 0.45 0.50 0.00 0.44 0.00 0.00
## max range skew kurtosis se
## Pregnancies 17.00 17.00 0.90 0.14 0.12
## Glucose 199.00 199.00 0.17 0.62 1.15
## BloodPressure 122.00 122.00 -1.84 5.12 0.70
## SkinThickness 99.00 99.00 0.11 -0.53 0.58
## Insulin 846.00 846.00 2.26 7.13 4.16
## BMI 67.10 67.10 -0.43 3.24 0.28
## DiabetesPedigreeFunction 2.42 2.34 1.91 5.53 0.01
## Age 81.00 60.00 1.13 0.62 0.42
## Outcome 1.00 1.00 0.63 -1.60 0.02
## Glucose_hi 1.00 1.00 0.18 -1.97 0.02
# data frame tidiers work in the broom version used here but are deprecated in later releases
broom::tidy(dat1)
## # A tibble: 10 x 13
## column n mean sd median trimmed mad min max range
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Pregnancies 768 3.85 3.37 3 3.46 2 0 17 17
## 2 Glucose 768 121. 32.0 117 119. 20 0 199 199
## 3 BloodPress… 768 69.1 19.4 72 71.4 8 0 122 122
## 4 SkinThickn… 768 20.5 16.0 23 19.9 12 0 99 99
## 5 Insulin 768 79.8 115. 30.5 56.7 30.5 0 846 846
## 6 BMI 768 32.0 7.88 32 32.0 4.6 0 67.1 67.1
## 7 DiabetesPe… 768 0.472 0.331 0.372 0.422 0.168 0.078 2.42 2.34
## 8 Age 768 33.2 11.8 29 31.5 7 21 81 60
## 9 Outcome 768 0.349 0.477 0 0.312 0 0 1 1
## 10 Glucose_hi 768 0.454 0.498 0 0.443 0 0 1 1
## # … with 3 more variables: skew <dbl>, kurtosis <dbl>, se <dbl>
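To see what some of these columns represent, the main statistics can be reproduced by hand for a single variable. The sketch below uses an invented four-row data frame; the standard error is taken as sd / sqrt(n), which is consistent with the output above (for Pregnancies, 3.37 / sqrt(768) is about 0.12).

```r
# Toy data, invented for this illustration
dat <- data.frame(id = 1:4,
                  Age = c(21, 15, 14, 18),
                  Gender = c("M", "F", "F", "M"))

age <- dat$Age

# n, mean, standard deviation, median, and standard error of the mean
age_stats <- c(n      = length(age),
               mean   = mean(age),
               sd     = sd(age),
               median = median(age),
               se     = sd(age) / sqrt(length(age)))
round(age_stats, 2)
```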
2.3 RStudio - The IDE for R
RStudio is an integrated development environment (IDE) for R. For overall convenience, flexibility, educational resources, and ongoing development, it is in my opinion an unparalleled environment for working in R. It offers a multi-pane console, a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, and project and workspace management. There are many tools within RStudio that facilitate coding, including numerous shortcuts, which can be accessed from a drop-down menu within RStudio and can be found here. Several shortcuts that I find most helpful are listed below.
Table: Very useful RStudio shortcuts
| Command | Windows | Mac |
|---------------------------------- |--------------------------------------------- |----------------- |
| Assignment operator | Alt + - | Opt + - |
| Commenting & Uncommenting Code | Ctrl + Shift + C | Cmd + Shift + C |
| Add the Pipe %>% | Ctrl + Shift + M | Cmd + Shift + M |
| Keyboard Shortcut Cheat Sheet | Alt + Shift + K | Opt + Shift + K |
| Move cursor to beginning of line | Home | Cmd + Left |
| Move cursor to end of line | End | Cmd + Right |
When using RStudio, it is generally most helpful to begin by creating a New Project from the File drop-down menu. As you will soon appreciate, this has definite file management advantages. For individual files, I find it most useful to create individual RMarkdown documents. For this book, each chapter is a separate Rmd file. These files have the advantage of being able to combine free text and R code chunks which, via a synthesis of the markdown language and Pandoc, allows the output to be in the format of your choice (html, LaTeX/pdf, Word).
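For readers who have never seen one, the sketch below builds a minimal Rmd file from within R: a short header, a line of free text, and one R code chunk. The file contents, the title, and the names `rmd_lines` and `rmd_path` are assumptions made for illustration; in practice such a file is usually created from the RStudio File menu instead.

```r
# Three backticks, built with strrep() so they can appear inside this example
fence <- strrep("`", 3)

# Lines of a minimal RMarkdown document (contents invented for illustration)
rmd_lines <- c(
  "---",
  'title: "Minimal example"',
  "output: html_document",
  "---",
  "",
  "Free text written in markdown.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)

# Write the document to a temporary file
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Rendering requires the rmarkdown package and Pandoc, so it is left commented out:
# rmarkdown::render(rmd_path)
```

Changing the output line in the header (for example to pdf_document or word_document) is how the same source is directed to a different output format.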
2.4 R - More than a statistical program
R is much more than a mere statistical program. It is a complete programming language, which, while highly advantageous, does result in a non-trivial learning curve. One of the most outstanding attributes of R is the ability to produce publication-quality data visualizations with either base R or, within the tidyverse universe, ggplot2 (see next chapter). Interactive graphics can also be easily produced. To appreciate the range of graphical activities possible, here is a self portrait drawn by R. The code for this may be found here.
Some beautiful art and the accompanying R code can be found here.
2.5 R - General Public License
R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under the terms of the GNU General Public License versions 2 or 3. For more information about these matters see http://www.gnu.org/licenses/.