Chapter 1 Getting Started
R is an free software for statistics and graphics. It is widely used among statisticians and data scientists for developing statistical software and data analysis.
R is primarily developed in three programming languages: C, Fortan and R itself. Although It itself can be used with command line interface, there are several third-party integrated development environment (IDE) with nice graphical user interface, including RStudio and Jupyter Notebook.
R comes from S
programming language. S
was created by John Chambers in 1976 at Bell Labs. Later, two statisticians, Ross Ihaka and Robert Gentleman, developed R that is currently maintained by the R Development Core Team. R is named partly after the first names of the first two R authors and partly as a play on the name of S.
R can be downloaded from http://cran.r-project.org/.
1.1 RStudio
Rstudio
is an integraded environment for R. It comes in two versions: Desktop version is a desktop application and server version runs on a web browser. Regular RStudio
for personal usage is free for both desktop and server version.
If R is a car engine, then RStudio is the structure of that care, body frame, interior design and every other thing that help us operate it safely, efficiently and comfotably.
RStudio can be downloaded from their official website.
1.2 Start screen
When you open RStudio, you will see a screen that looks like figure 1.1.
The interface has four window panes. Each window pane may have several tabs or sub-windows. By default, Source Pane is on the top-left corner, Console Pane on bottom-left, Environment Pane on top-right, and Files Pane on bottom-right.
Source Pane is a text editor that will be referred to as R Script later on. You can write commands and save them, which is the main point of reproducibility. Anyone who has this R Script can review and edit it in the future.
Console Pane is the place you write your line-by-line command. It means you can only write a single command or a long paragraph of commands. After you close the RStudio, the commands will not be saved unless you specified to do so. However, it is the best way to saving R Script in Source Paneto store the commands you desire.
- Symbol
>
called prompt - Type 3 + 4, and press Enter.
- Symbol
Environment Pane has several sub-windows. For data management and beginner, you only have to know Environment and History.
Environment
is where R works.Global Environment
is the place where your data will be after importing data.History
saves the commands you run in R console.Connections
is where you connect to external databases.
Files Pane also has several sub-windows.
Files
is like a folder manager on your phone. You can manage files and folders as well as set the working directory.Plots
is where your plots will appear.Packages
is where you manage your R packages. You can install it from CRAN or other repositories. You can also install locally stored R packages.Help
is where R stores documentations. You can open the help or introduction page of the respective packages as well as individual functions.
You can arrange these windows as you like. I think putting console on the right upper window will give you a lot of freedom as you will mostly be using Source Pane and Console Pane simultaneously. You can change this by pressing the four window icon below the menu bar and choosing the option Console on the Right as shown in figure 1.2.
1.3 R Packages
Base R
is what you get after installing R. R is powerful because of its tens of thousands of packages. The number is still growing. This also creates the problem of confusion. Even no single user can check or use all packages R can offer. A solution is to stick to certain packages that work well for you. In this book, I will show you a few packages that works for me.
1.3.1 Installing a package from CRAN
To install a package, you can use the function install.packages("package_name")
if it is published on CRAN
. CRAN
is the comprehensive R Archive Network, web servers that store R packages.
Packages installed directly from Github are usually under development, but up-to-date and often free from previous issues and bugs.
Let’s install the packages used in this book.
<- c("tidyverse", "mStats", "flextable")
package_name install.packages(package_name)
1.3.2 Installing a package from GitHub
For packages published on GitHub, use either devtools::install_github("repo")
or remotes::install_github("repo")
.
Let’s illustrate this by installing remotes
and then mStats
from GitHub.
install.packages("remotes")
::install_github("myominnoo/mStats") remotes
1.3.3 Essential packages for this book
- dplyr
- mStats
- magrittr
1.4 mStats
R has a steep learning curve. For physicians and health professionals, they just want to get started with data management and analysis, rather than spending a lot of time to learn and master R. There might be an advantage to mastering R; however unlike statisticians and programmers, it seems a very daunting process for them to become comfortable with coding in R.
The mStats
package was developed to facilitate the data management and analysis in R while requiring only a few basic of R. It takes some principles from one of the most popular statistical software in public health settings, STATA
. STATA
has been in existence for a long time now that most of its commands are stable, powerful, and intuitive. These features attract power-users and help them accomplish their data tasks.
In a nutshell, mStats
is a R package that facilitates data management and analysis of health research in R environment. It includes three core sets of functions that are designed to help data management, data analysis, and reporting streamline for health research projects.
For epidemiology and public health, there are already a number of good packages available for R
, including but not limited to epiCalc
, Epi
, epibasix
, epiDisplay
, epiR
and epitools
. mStats
does not intend to replace any of them, but to only contribute as much as I can to data management and analysis of health research projects.
1.4.1 How it started
The mStats
package came into life when I started my doctoral life in Thailand in 2018. The first version of mStats
was released in late 2018. Initially it was intended to prove a concept of what I have learnt and how much I could make use of what I have learnt so far. In addition, I would like something open source unlike STATA
and something different from base R.
There are two packages that deeply inspired me to do this: dplyr
under the universe of tidyverse
by Hadley Wickham and epiCalc
by my advisor Virasakdi Chongsuvivatwong. The first version of mStats
made use of ggplot2
and included functions for generating ready-made plots. I released several small patch versions with addition of a few functions that were interesting to me at that time.
In mid 2019, I started another package called stats2
. In this package, I tried to simplify the concept of one function doing one task with a few or no options to tweak the task. Although this might be in contrast with the generous flexibility of R
, I hardly doubt beginners would mind this limitation. In fact, such simplicity might make users’ life easier to get their jobs in R done. The package was on CRAN and released on March 31, 2020.
As times progress, I hope this package will find its way to some people in their works.
1.4.2 Guiding principles
Most of the guiding principles behind dplyr
, tidyverse
, and STATA
inspire the creation and development of the mStats
package:
Compose simple functions with the pipe:
data masking
for tidy evaluation: it is very common to use datasets as in the form ofdata frame
in as R analysis project.R
functions are mostly very generic and created for basic data structures of R. The [tidyverse] uses tidy evaluation and implements data masking in itsdplyr
package. This allows manipulation of data-variables as if they were “programming” variables that live in an environment. Basically, it allows you to type, as an example,generate(airquality, lnOzone = log(Ozone))
instead ofairquality$lnOzone <- log(airquality$Ozone)
.The principle of pipes changes the paradigm of coding in
R
. The use of the pipe function%>%
from themagrittr
package allows for cleaner code and enhances readability of the codes. Most functions inmStats
is compatible with the usage of the pipe function.
Designed for humans:
KISS
: Keepfunctions
short and simple - one function does only one task with few options to change the nature of the task.mStats
intentionally limits options of arguments in functions and attempts to balance between having too many argument options and having no options at all.Labels
:mStats
consistently uses the concept of labelling variables and datasets. It also streamlineslabels
for the purpose of report generation. Though it is not necessary to label values, this can be done using certain packages likelabelled
andsjlabelled
.Outputs: outputs in the console are well-formatted for the ease of interpretation.
Reproducible research
- Making tables is One of the most common tasks in health research analysis where most researchers waste much time. Reproducibility in health research is becoming crucially important and depends on streamlining this process.
mStats
provides one of many workflows used for tables generation and link toR
codes for getting the job done.
- Making tables is One of the most common tasks in health research analysis where most researchers waste much time. Reproducibility in health research is becoming crucially important and depends on streamlining this process.
1.5 Package masking
Packages including mStats
might contain functions that are in conflicts with other packages. I suggest to explicitly use the syntax to call the desired package, package::function
.
For example, the cut
function from mStats
will be masked from the cut
function from base
. When you want to use the function from base, you need to write this: base::cut()
.
1.6 Piping your workflow
The pipe function %>%
from the magrittr
package can connect several lines of codes in a single workflow. It allows cleaner codes and enhances human readability. As an example, if you have three generic functions, you can write codes like this: function3(function32(function31(..., data), ...), ...)
. But it is confusing. By using %>%
, we can arrange codes from top to bottom and left to right as follow:
%>%
data function1(...) %>%
function2(...) %>%
function3(...)
1.7 Searching Help
1.7.1 General
If you get stuck, try to google it or search on virtual forums like Stack Overflow
. Most of what you want to know are already answered or solved in one of those forums.
1.7.2 Getting help
If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub. For questions and other discussion, please directly email me dr.myominnoo@gmail.com or use the mStats mailing list.
1.8 Example datasets
In this book, we will use datasets from several sources. The code chunks below download these datasets into a data
folder under your current working directory.
Let’s create a function that takes a URL link for zipped files, download, and unzip the files into specified directory.
<- function(url, exdir = "data/", file = tempfile(), junkpaths = TRUE) {
download_zipfile for (link in url) {
## download the dmus2 zipped file into temporary file
download.file(link, file)
## extract the dmus2 datasets into the data folder
unzip(file, exdir = exdir, junkpaths = TRUE)
} }
1.8.1 dmus2
The following code chunk downloads datasets from one of the most popular data management books for Stata, Data Management Using Stata: A Practical Handbook, Second Edition
## Download dmus2 zipped file
download_zipfile("https://www.stata-press.com/data/dmus2/dmus2.zip",
exdir = "data/dmus2")
1.8.2 WHO Mortality Data
THe following code chunks download datasets from WHO webpage.
## 1. A list of countries-years available for the mortality and population data.
## 2. Reference populations and live births.
## 3. Country codes
<- c("https://cdn.who.int/media/docs/default-source/world-health-data-platform/mortality-raw-data/mort_availability.zip?sfvrsn=23261e11_15&ua=1",
file "https://cdn.who.int/media/docs/default-source/world-health-data-platform/mortality-raw-data/mort_pop.zip?sfvrsn=937039fc_10&ua=1",
"https://cdn.who.int/media/docs/default-source/world-health-data-platform/mortality-raw-data/mort_country_codes.zip?sfvrsn=800faac2_5&ua=1")
download_zipfile(file, exdir = "data/who")
1.8.3 UCLA Stata Tutorial
The following code chunks download datasets from UCLA’s Stata tutorials.
if (!dir.exists("data/ucla"))
dir.create("data/ucla") ## this will create a directory if it's not there
download.file("http://stats.idre.ucla.edu/stat/data/patient_pt1_stata_dm.dta",
destfile = "data/ucla/patient_pt1_stata_dm.dta")
download.file("http://stats.idre.ucla.edu/stat/data/patient_pt2_stata_dm.dta",
destfile = "data/ucla/patient_pt2_stata_dm.dta")
download.file("http://stats.idre.ucla.edu/stat/data/doctor_stata_dm.dta",
destfile = "data/ucla/doctor_stata_dm.dta")
download.file("http://stats.idre.ucla.edu/stat/data/hsb2.csv",
destfile = "data/ucla/hsb2.csv")
download.file("http://stats.idre.ucla.edu/stat/data/hsb2.xls",
destfile = "data/ucla/hsb2.xls")
1.8.4 Miliary Tuberculosis
The following dataset was used to evaluated a scoring system for central nervous tuberculosis among patients with miliary tuberculosis in a Chinese hospital. Read their study here.
We put this dataset into a separate path data/others/
.
if (!dir.exists("data/others"))
dir.create("data/others") ## this will create a directory if it's not there
download.file("https://journals.plos.org/plosone/article/file?type=supplementary&id=info:doi/10.1371/journal.pone.0176651.s001",
destfile = "data/others/miliary_tb.xlsx")