Chapter 1 Getting Started With R

1.1 What is R?

R is a freely available “computational language and environment for data analysis and graphics.” R is indispensable for anyone that uses and interprets data. As medical, public health, and research epidemiologists, we use R in the following ways:

  • Full-function calculator
  • Extensible statistical package
  • High-quality graphics tool
  • Multi-use programming language

We use R to explore, analyze, and understand epidemiological data. We analyze data straight out of tables provided in reports or articles as well as analyze usual data sets. The data might be a large, individual-level data set imported from another source (e.g., cancer registry); an imported matrix of group-level data (e.g, population estimates or projections); or some data extracted from a journal article we are reviewing. The ability to quantitatively express, graphically explore, and describe epidemiologic data and processes enables one to work and strengthen one’s epidemiologic intuition.

In fact, we only use a very small fraction of the R package. For those who develop an interest or have a need, R also has many of the statistical modeling tools used by epidemiologists and statisticians, including logistic and Poisson regression, and Cox proportional hazard models. However, for many of these routine statistical models, almost any package will suffice (SAS, Stata, SPSS, etc.). The real advantage of R is the ability to easily manipulate, explore, and graphically display data. Repetitive analytic tasks can be automated or streamlined with the creation of simple functions (programs that execute specific tasks). The initial learning curve is steep, but in the long run one is able to conduct analyses that would otherwise require a tremendous amounts of programming and time.

Some may find R challenging to learn if they are not familiar with statistical programming. R was created by statistical programmers and is more often used by analysts comfortable with matrix algebra and programming. However, even for those unfamiliar with matrix algebra, there are many analyses one can accomplish in R without using any advanced mathematics, which would be difficult in other programs. The ability to easily manipulate data in R will allow one to conduct good descriptive epidemiology, life table methods, graphical displays, and exploration of epidemiologic concepts. R allows one to work with data in any way they come.

1.2 What is RStudio?

To get started quickly we need tools that makes the process of writing and compiling R code quick and (mostly) pain free. Fortunately for us there is RStudio—it is a free, open source, and powerful integrated development environment (IDE) for R that runs on most operating systems (Windows, Mac, or Linux). After installing R, install RStudio—it has all the tools we need to learn and apply R.

Screenshot of RStudio in Ubuntu Linux: Top left is for R script editor, or for viewing tabular data objects.  Top right has tabs for workspace and history. Bottom right has tabs for files, plots, packages, or help. Bottom left is the console for typing R expressions for evaluation.

Figure 1.1: Screenshot of RStudio in Ubuntu Linux: Top left is for R script editor, or for viewing tabular data objects. Top right has tabs for workspace and history. Bottom right has tabs for files, plots, packages, or help. Bottom left is the console for typing R expressions for evaluation.

1.3 Who should learn R?

Anyone that uses a calculator or spreadsheet, or analyzes numerical data at least weekly should seriously consider learning and using R. This includes epidemiologists, statisticians, physician researchers, engineers, health economists, health systems analysts, business analysts, and faculty and students of mathematics and science courses, to name just a few. We jokingly tell our staff analysts that once they learn R they will never use a spreadsheet program again (well almost never!).

1.4 Why should I learn R?

To implement numerical methods we need a computational tool. On one end of the spectrum are calculators and spreadsheets for simple calculations, and on the other end of the spectrum are specialized computer programs for such things as statistical and mathematical modeling. However, many numerical problems are not easily handled by these approaches. Calculators, and even spreadsheets, are too inefficient and cumbersome for numerical calculations whose scope and scale change frequently. Statistical packages are usually tailored for the statistical analysis of data sets and often lack an intuitive, extensible, open source programming language for tackling new problems efficiently.1 R can do the simplest and the most complex analysis efficiently and effectively.

When we learn and use R regularly we will save significant amounts of time and money. It’s powerful and it’s free! It’s a complete environment for data analysis and graphics. Its straightforward programming language facilitates the development of functions to extend and improve the efficiency of our analyses.

1.5 Where can I get R?

R is available for many computer platforms, including Mac OS, Linux, Microsoft Windows, and others. R comes as source code or a binary file. Source code needs to be compiled into an executable program for our computer. Those not familiar with compiling source code (and that’s most of us) just install the binary program. We assume most readers will be using R in the Mac OS or MS Windows environment. Listed here are useful R links:

To install R for Windows or Mac OS, do the the following:

  • Go to http://www.r-project.org;
  • From the left menu list, click on the “CRAN” (Comprehensive R Archive Network) link;
  • Select a nearby geographic site (e.g., http://cran.cnr.berkeley.edu);
  • Select appropriate operating system;
  • Select on “base” link;
  • For Windows, save R-X.X.X-win.exe to the computer; and for Mac OS, save the R-X.X.X-pkg installer package.
  • Run the installation program and accept the default installation options. That’s it!

1.6 How do I use R?

1.6.1 Using R on our computer

Use R by typing commands at the R command line prompt (>) and pressing Enter on our keyboard. This is how to use R interactively. Alternatively, from the R command line, we can also execute a list of R commands that we have saved in a text file (more on this later). Here is an example of using R as a calculator:

(4 + 6)^3 - 2*500/4
## [1] 750

Use the c function to collect data entered at the console. Name each collection of data, and then perform a numercal operation. In this example we conduct an analysis that is analogous to working in a spreadsheet.

quantity <- c(34, 56, 22)        # quantity data
price <- c(19.95, 14.95, 10.99)  # price data
subtotal <- quantity*price       # subtotal cost
cbind(quantity, price, subtotal) # column bind into spreadsheet-like display
##      quantity price subtotal
## [1,]       34 19.95   678.30
## [2,]       56 14.95   837.20
## [3,]       22 10.99   241.78

1.6.2 Does R have epidemiology programs?

The default installation of R does not have packages that specifically implement epidemiologic applications; however, many of the statistical tools that epidemiologists use are readily available, including statistical models such as unconditional logistic regression, conditional logistic regression, Poisson regression, Cox proportional hazards regression, and much more. Table 1.1 lists selected R packages with biostatistical or mathematical modeling methods applied to epidemiologic problems. The focus of this book is learning how to use R without relying on a specific packages. Learning the R basics covered in this book will help one take full advantage of these and other R packages, some of which address advanced topics such as network modeling of epidemics.

Table 1.1: Selected R packages for epidemiologic applications
Packages Description
dagbag Learning directed acyclic graphs (DAGs) through bootstrap aggregating
dagitty Graphical Analysis of Structural Causal Models
dagR R functions for directed acyclic graphs
amei Adaptive Management of Epidemiological Interventions
Epi A Package for Statistical Analysis in Epidemiology
epibasix Elementary Epidemiological Functions for Epidemiology and Biostatistics
EpiBayes Implements Hierarchical Bayesian Models for Epidemiological Applications
EpiContactTrace Epidemiological Tool for Contact Tracing
epiDisplay Epidemiological Data Display Package
EpiDynamics Dynamic Models in Epidemiology
EpiEstim EpiEstim: estimate time varying reproduction numbers from epidemic curves
epifit Flexible Modelling Functions for Epidemiological Data Analysis
EpiModel Mathematical Modeling of Infectious Disease
epinet Epidemic/Network-Related Tools
epiR Tools for the Analysis of Epidemiological Data
episensr Basic Sensitivity Analysis of Epidemiological Results
episheet Rothman’s Episheet
epitools Epidemiology Tools
EpiWeek Conversion Between Epidemiological Weeks and Calendar Dates

1.6.3 How should I use these notes?

The best way to learn R is to use it! Use it as your calculator! Use it as your spreadsheet! Finally read these notes sitting at a computer and use R interactively (this works best sitting in a cafe that brews great coffee and plays good music). Although we initially encourage you to use R interactively by typing expressions at the console, as a general rule, it is much better to type your code as a R script. Save your code with a convenient file name such as job01.R.2

RStudio comes with a text editor for creating and editing R scripts. Our focus will be learning how to use RStudio to edit and run R scripts.

The code in your text editor can be run in the following ways: - Highlight and run selected expressions in the RStudio; - Copy and paste the code directly into R console; - Run the file in batch mode from the R console using the source function (e.g., source("job01.R")).

Table 1.2: Selected math operators
Operator Description Try these examples
+ addition 5+4
- subtraction 5-4
* multiplication 5*4
/ division 5/4
^ exponentiation 5^4
- unary minus (change current sign) -5
abs absolute value abs(-23)
exp exponentiation (\(e\) to a power) exp(8)
log logarithm (default is natural log) log(exp(8))
sqrt square root sqrt(64)
%/% integer divide 10%/%3
%% modulus 10%%3
%*% matrix multiplication xx <-matrix(1:4, 2, 2)
xx%*%c(1, 1)
c(1, 1)%*%xx

1.7 Just do it!

1.7.1 Using R as your calculator

Open R and start using it as our calculator. The most common math operators are displayed in Table 1.2. From now on make R our default calculator! Study the examples and spend a few minutes experimenting with R as a calculator. Use parentheses as needed to group operations. Use the keyboard Up-arrow to recall what we previously entered at the command line prompt.

1.7.2 Useful R concepts

1.7.2.1 Types of evaluable expressions

Every expression that is entered at the R console is evaluated by R and returns a value. A literal is the simplist expression that can be evaluated (number, character string, or logical value). Mathematical operations involve numeric literals. For example, R evaluates the expression 4*4 and returns the value 16. The exception to this is when an evaluable expression is assigned an object name: x <- 4*4. To display the assigned expression, wrap the expression in parentheses: (x <- 4*4), or type the object name. Finally, evaluable expressions must be separated by either newline breaks or a semicolon. Table 1.3 summarizes evaluable R expressions.

4 * 4
## [1] 16
x <- 4 * 4    # assign expression to object named 'x'
x             # display 'x' object
## [1] 16
(x <- 4 * 4)  # evaluates expression and displays 'x'
## [1] 16
x <- 4*4; x   # expressions can be separated by semi-colons
## [1] 16
Table 1.3: Types of evaluable R expressions
Expression type Try these examples
literal "hello, I'm John Snow" #character
3.5 #numeric
TRUE; FALSE #logical
math operation 6*7
assignment x <- 4*4
data object x
function log(x)

1.7.2.2 Using the assignment operator

Most calculators have a memory function: the ability to assign a number or numerical result to a key for recalling that number or result at a later time. The same is true in R but it is much more flexible. Any evaluable expression can be assigned a name and recalled at a later time. We refer to these variables as data objects. We use the assignment operator (<-) to name an evaluable expression and save it as a data object.

xx <- "hello, what's your name"; xx
## [1] "hello, what's your name"

Multiple assignments work and are read from right to left:

aa <- bb <- 99
aa; bb
## [1] 99
## [1] 99

Finally, the equal sign (=) can be used for assignments, although we prefer and the <- symbol.

ages <- c(34, 45, 67) # equivalent
ages = c(34, 45, 67) # equivalent

The reason I prefer <- for assigning object names in the workspace is because later we use = for assigning values to function arguments. For example,

x <- 1:10 # assigning object name (x)
sample(x = 1:10, size = 5)  # assigning value to argument x

The first x is an object name assignment in the workspace which persist during the R session. The second x is a function argument assignment which is only recognized locally in the function and only for the duration of the function execution. For clarity, it is better to keep these types of assignments separate in our mind by using different assignment symbols.

Study these previous examples and spend a few minutes using the assignment operator to create and call data objects. Try to use descriptive names if possible. For example, suppose we have data on age categories; we might name the data agecat, age.cat, or age_cat.3

1.7.3 Useful R functions

When we start R you have opened a workspace. The first time we use R, the workspace is empty. Every time we create a data object, it is in the workspace. If a data object with the same name already exists, the old data object will be overwritten without warning, so be careful. To list the objects in your workspace use the ls or objects functions:

ls() # lists objects in the workspace
## [1] "aa"        "ages"      "bb"        "epi.packs" "price"     "quantity" 
## [7] "subtotal"  "x"         "xx"
objects() # equivalent
## [1] "aa"        "ages"      "bb"        "epi.packs" "price"     "quantity" 
## [7] "subtotal"  "x"         "xx"

Data objects can be saved between sessions. We will be prompted with “Save workspace image?” You can also use save.image() at the console prompt. The workspace image is saved in a file called .RData.4 Use getwd() to display the file path to the .RData file. Table 1.4 has more useful R functions.

Table 1.4: Useful R functions
Function Description Try these examples
q Quit R q()
ls List objects ls()
objects objects() #equivalent
rm Remove object(s) yy <- 1:5; ls()
remove rm(yy); ls()
#removes all: caution!
rm(list = ls()); ls()
help Open help instructions; help()
or get help on specific topic help(plot)
?plot #equivalent
help.search Search help system given character string help.search("print")
help.start Start help browser help.start()
apropos Displays all objects in the search list matching topic apropos(plot)
getwd Get working directory getwd()
setwd Set working directory setwd("c:\mywork\rproject")
args Display arguments of function args(sample)
example Runs example of a function example(plot)
data Information on available R data sets; data() #displays data sets
Load data set data(Titanic) #load data set
save.image Saves current workspace to .RData save.image()

1.7.3.1 What are packages?

R has many available functions. When we open R, several packages are attached by default. Each package has its own suite of functions. To display the list of attached packages use the search function. To display the file paths to the packages use the searchpaths function.

search() # Linux
##  [1] ".GlobalEnv"        "package:foreign"   "package:survival" 
##  [4] "ESSR"              "package:stats"     "package:graphics" 
##  [7] "package:grDevices" "package:utils"     "package:datasets" 
## [10] "package:methods"   "Autoloads"         "package:base"
searchpaths() # Linux
##  [1] ".GlobalEnv"                                                              
##  [2] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/foreign"  
##  [3] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/survival" 
##  [4] "ESSR"                                                                    
##  [5] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/stats"    
##  [6] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/graphics" 
##  [7] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/grDevices"
##  [8] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/utils"    
##  [9] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/datasets" 
## [10] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/methods"  
## [11] "Autoloads"                                                               
## [12] "/Library/Frameworks/R.framework/Resources/library/base"

To learn more about a specific package enter library(help=“Package_Name”). Alternatively, we can get more detailed information by entering help.start() which opens the HTML help page. On this page click on the Packages link to see the available packages. If we need to load a package enter library(Package_Name). For example, when we cover survival analysis we will need to load the survival package.

1.7.3.2 What are function arguments?

We will be using many R functions for data analysis, so we need to know some function basics. Suppose we are interested in taking a random sample of days from the month of June, which has 30 days. We want to use the sample function but we forgot the syntax. Let’s explore:

sample
## function (x, size, replace = FALSE, prob = NULL) 
## {
##     if (length(x) == 1L && is.numeric(x) && is.finite(x) && x >= 
##         1) {
##         if (missing(size)) 
##             size <- x
##         sample.int(x, size, replace, prob)
##     }
##     else {
##         if (missing(size)) 
##             size <- length(x)
##         x[sample.int(length(x), size, replace, prob)]
##     }
## }
## <bytecode: 0x7feea69b0c30>
## <environment: namespace:base>

Whoa! What happened? Whenever we type the function name without any parentheses it usually returns the whole function code. This is useful when we start programming and we need to alter an existing function, borrow code for our own functions, or study the code for learning how to program. If we are already familiar with the sample function we may only need to see the syntax of the function arguments. For this we use the args function:

args(sample)
## function (x, size, replace = FALSE, prob = NULL) 
## NULL

The terms x, size, replace, and prob are the function arguments. First, notice that replace and prob have default values; that is, we do not need to specify these arguments unless we want to override the default values. Second, notice the order of the arguments. If you enter the argument values in the same order as the argument list we do not need to specify the argument.

dates <- 1:30
sample(dates, 16) # sample "size = 16"
##  [1]  7 20  9 30 22 14 27 15  5 21 24 18 26  3 11  4

Third, if we enter the arguments out of order then we will get either an error message or an undesired result. Arguments entered out of their default order need to be specified.

sample(16, dates) # undesired results; wanted "size = 16"
## [1] 2
sample(size = 16, x = dates) # gives desired result
##  [1] 10 26 14 16 12  5 24  4 15 28  3 20  9 29 21  1

Fourth, when we specify an argument we only need to type a sufficient number of letters so that R can uniquely identify it from the other arguments.

sample(s = 16, x = dates, r = TRUE) # sampling w/ replacement
##  [1] 15 20 11 22 23 22 27  4 18 21 21  2 17 18 13 16

Fifth, argument values can be any valid R expression (including functions) that evaluates to an appropriate value. In the following example we see two sample functions that provide random values to the sample function arguments.<

sample(s = sample(1:36, 1), x = sample(1:10, 5), r=T)
##  [1] 6 7 7 7 7 8 1 1 6 8 6 9 9 8 1 8 8 6 1 9 8 9 8 9 9 1 6 9 1

Finally, if we need more guidance on how to use the sample function enter ?sample or help(sample).

1.7.4 How do I get help?

RStudio has extensive help capabilities. From the RStudio main menu select Help \(\rightarrow\) R Help to get you started. The Frequently Asked Questions (FAQ) and R manuals are available from this menu. From the R console, try the following options to learn about the help capabilities:

?help               # opens help page for 'help' function
help.start()        # launches HTML help page
help.search("help") # searches help system for "help"
apropos("help")     # displays 'help' objects in search list

To learn about about available data sets use the data function:

data()                          # display avail. data sets
try(data(package = "survival")) # list 'survival' data sets
help(pbc, package = "survival") # display pbc data help page

1.7.5 Is there anything else that I need?

Not really. RStudio has everything you will need to use R productively. Some analysts will select to use R with a text editor, rather than RStudio. Like RStudio, a good text editor makes programming and data processing easier and more efficient. If you are considering a text editor, the functionality we look for in a text editor are the following:

  • Toggle between wrapped and unwrapped text
  • Block cutting and pasting (also called column editing)
  • Easy macro programming
  • Search and replace using regular expressions
  • Ability to import large datasets for editing

When we are programming we want our text to wrap so we can read all of your code. When we import a data set that is wider than the screen, we do not want the data set to wrap: we want it to appear in its tabular format. Column editing allows us to cut and paste columns of text at will. A macro is just a way for the text editor to learn a set of keystrokes (including search and replace) that can be executed as needed. Searching using regular expressions means searching for text based on relative attributes. For example, suppose you want to find all words that begin with “b”, end with “g”, have any number of letters in between but not “r” and “f”. Regular expression searching makes this a trivial task. These are powerful features that once we use regularly, we will wonder how we ever got along without them.

If we do not want to install a text editing program then we can use the default text editor that comes with our computer operating system (gedit in Ubuntu Linux, TextEdit in Mac OS, Notepad in Windows). However, it is much better to install a text editor that works with R. My favorite text editor is the free and open source GNU Emacs. GNU Emacs can be extended with the “Emacs Speaks Statistics” (ESS) package. For more information on Emacs and ESS pre-installed for Windows and Mac OS, visit http://ess.r-project.org.

1.7.6 What’s ahead?

To the novice user, R may seem complicated and difficult to learn. In fact, for its immense power and versatility, R is easier to learn and deploy compared to other statistical software (e.g. SAS, Stata, SPSS). This is because R was built from the ground up to be an efficient and intuitive programming language and environment. If you understand the logic and structure of R, then learning proceeds quickly. Just like a spoken language, once you know its rules of grammar, syntax, and pronunciation, and can write legible sentences, you can figure out how to communicate almost anything. Before we get into the “trees” (next chapter), we want to describe the “forest”: the logic and structure of working with R objects and epidemiologic data.

1.7.6.1 Working with R objects

For our purposes, there are only five types of data objects in R5 and five types of actions we take on these objects (Table 1.5). That’s it! No more, no less. You will learn to create, name, index (subset), replace components of, and operate on these data objects using a systematic, comprehensive approach. As you learn about each new data object type, it will reinforce and extend what you learned previously.

Table 1.5: Types of actions taken on R data objects
Action Vector Matrix Array List Data frame
Creating Table 2.5 Table 2.12 Table 2.20 Table 3.1 Table 3.7
Naming Table 2.6 Table 2.13 Table 2.21 Table 3.2 Table 3.8
Indexing Table 2.7 Table 2.14 Table 2.22 Table 3.3 Table 3.9
Replacing Table 2.8 Table 2.15 Table 2.23 Table 3.4 Table 3.10
Operating on Table 2.9 Table 2.16 Table 2.24 Table 3.5 Table 3.11
  Table 2.10      

A vector is a collection of elements (often numbers):

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12); x
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

A matrix is a 2-dimensional representaton of a vector:

y <- matrix(x, nrow = 2); y
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    3    5    7    9   11
## [2,]    2    4    6    8   10   12

An array is a 3 or more dimensional represention of a vector:

z <- array(x, dim = c(2, 3, 2)); z
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

A list is a collection of “bins”, each containing any kind of R object:

mylist <- list(x, y, z); mylist
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
## 
## [[2]]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    3    5    7    9   11
## [2,]    2    4    6    8   10   12
## 
## [[3]]
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

A data frame is a list in tabular form where each “bin” contains a data vector of the same length. A data frame is the usual tabular data set familiar to epidemiologists. Each row is an record and each column (“bin”) is a field.

kids <- c("Tomasito", "Lusito", "Angelita")
gender <- c("boy", "boy", "girl")
age <- c(8, 7, 4)
mydf <- data.frame(kids, gender, age)
mydf
##       kids gender age
## 1 Tomasito    boy   8
## 2   Lusito    boy   7
## 3 Angelita   girl   4

In the next chapter we explore these R data objects in greater detail.

1.8 Exercises

If you have not done so already,

  • install R on your computer (https://cran.rstudio.com/),
  • install RStudio on your computer (https://www.rstudio.com/), and
  • register for a RPubs account (http://www.rpubs.com/), and open RStudio.
  • Open a new Rmarkdown template file (.Rmd extension).
  • Learn Rmarkdown and use it to answer the exercises in this chapter.
  • Submit the exercises as a HTML link, Word document, or PDF document.

1.8.1

What is the R workspace file on your operating systems? What is the file path to your R workspace file? What is the name of this workspace file?

1.8.2

By default, which R packages come already loaded? What are the file paths to the default R packages?

1.8.3

List all the object in the current workspace. If there are none, create some data objects. Using one expression, remove all the objects in the current workspace.

1.8.4

One inch equals 2.54 centimeters. Correct the following R code and create a conversion table.

inches <- 1:12 
centimeters <- inches/2.54 
cbind(inches, centimeters) 

1.8.5

To convert between temperatures in degrees Celsius (\(C\)) and Farenheit (\(F\)), we use the following conversion formulas:

\[ C = (F - 32) \frac{5}{9} \]

\[ F = \frac{9}{5} C + 32 \]

At standard temperature and pressure, the freezing and boiling points of water are 0 and 100 degrees Celsius, respectively. What are the freezing and boiling points of water in degrees Fahrenheit?

1.8.6

For the Celsius temperatures 0, 5, 10, 15, 20, 25, …, 80, 85, 90, 95, 100, construct a conversion table that displays the corresponding Fahrenheit temperatures. Hint: to create the sequence of Celsius temperatures use seq(0, 100, 5).

1.8.7

BMI is a reliable indicator of total body fat, which is related to the risk of disease and death. The score is valid for both men and women but it does have some limits. BMI does have some limitations: it may overestimate body fat in athletes and others who have a muscular build, it may underestimate body fat in older persons and others who have lost muscle mass.

Table 1.6: Body mass index classification
BMI Classification
\(<18.5\) Underweight
\([18.5, 25)\) Normal weight
\([25, 30)\) Overweight
\(\ge 30\) Obesity

Body Mass Index (BMI) is calculated from your weight in kilograms and height in meters:

\[ BMI = \frac{kg}{m^2} \]

\[ 1\,\mbox{kg} \approx 2.2\,\mbox{lb} \]

\[ 1\,\mbox{m} \approx 3.3\,\mbox{ft} \]

Calculate your BMI (don’t report it to us).

Using Table 1.2, explain in words, and use R to illustrate, the difference between modulus and integer divide.

1.8.8

In mathematics, a logarithm (to base \(b\)) of a number \(x\) is written \(\log_b(x)\) and equals the exponent \(y\) that satisfies \(x = b^y\). In other words,

\[ y = \log_b(x) \] is equivalent to

\[ x = b^y \]

In R, the log function is to the base \(e\). Implement the following R code and study the graph:

curve(log(x), 0, 6)
abline(v = c(1, exp(1)), h = c(0, 1), lty = 2)

What kind of generalizations can you make about the natural logarithm and its base—the number \(e\)?

1.8.9

Risk (\(R\)) is a probability bounded between 0 and 1. Odds is the following transformation of \(R\):

\[ Odds = \frac{R}{1-R} \]

Use the following code to plot the odds:

curve(x/(1-x), 0, 1)

Now, use the following code to plot the \(\log\)(odds):

curve(log(x/(1-x)), 0, 1)

What kind of generalizations can you make about the \(\log\)(odds) as a transformation of risk?

Table 1.7: Estimated per-act risk (transmission probability) for acquisition of HIV, by exposure route to an infected source. Source: CDC
Exposure route Risk per 10,000 exposures
Blood transfusion (BT) 9,000
Needle-sharing injection-drug use (IDU) 67
Receptive anal intercourse (RAI) 50
Percutaneous needle stick (PNS) 30
Receptive penile-vaginal intercourse (RPVI) 10
Insertive anal intercourse (IAI) 6.5
Insertive penile-vaginal intercourse (IPVI) 5
Receptive oral intercourse on penis (ROI) 1
Insertive oral intercourse with penis (IOI) 0.5

Use the data in Table 1.7. Assume one is HIV-negative. If the probability of infection per act is \(p\), then the probability of not getting infected per act is \((1-p)\). The probability of not getting infected after 2 consecutive acts is \((1-p)^2\), and after 3 consecutive acts is \((1-p)^3\). Therefore, the probability of not getting infected infected after \(n\) consecutive acts is \((1-p)^n\), and the probability of getting infected after \(n\) consecutive acts is \(1-(1-p)^n\). For each non-blood transfusion transmission probability (per act risk) in Table 1.7, calculate the cumulative risk of being infected after one year (365 days) if one carries out the same act once daily for one year with an HIV-infected partner. Do these cumulative risks make intuitive sense? Why or why not?

1.8.10

The source function in R is used to “source” (read in) ASCII text files. Take a group of R commands that worked from a previous problem above and paste them into an ASCII text file and save it with the name job01.R. Then from R command line, source the file. Here is how it looked on my Linux computer running R:

> source("~/Documents/courses/ph251d/jobs/job01.R")

Describe what happened. Now, set option to .

> source("~/Documents/courses/ph251d/jobs/job01.R", echo = TRUE)

Describe what happened. To improve your understanding read the help file on the source function.

1.8.11

Now run the source again (without and with echo = TRUE) but each time create a log file using the sink function. Create two log files: job01.log1a and job01.log1b.

> sink("~/Documents/courses/ph251d/jobs/job01.log1a")
> source("~/Documents/courses/ph251d/jobs/job01.R")
> sink() #closes connection
>
> sink("~/Documents/courses/ph251d/jobs/job01.log1b")
> source("~/Documents/courses/ph251d/jobs/job01.R", echo = TRUE)
> sink() #closes connection

Examine the log files and describe what happened.

1.8.12

Create a new job file () with the following code:

n <- 365
per.act.risk <- c(0.5, 1, 5, 6.5, 10, 30, 50, 67)/10000
risks <- 1-(1-per.act.risk)^n
show(risks)

Source this file at the R command line and describes what happens.


  1. Recommendations for mostly free and open source software (FOSS) at http://phds.io.

  2. The .R extension, although not necessary, is useful when searching for R command files. Additionally, this file extension is recognized by RStudio and many text editors.

  3. To improve readability, a period (.) or underscore (_) symbol can be used in your object name

  4. In some operating systems files names that begin with a period (.) are hidden files and are not displayed by default. You may need to change the viewing option to see the file.

  5. The sixth type of R object is a function. Functions can create, manipulate, operate on, and store data; however, we will use functions primarily to execute a series of R “commands” and not as primary data objects.