1 Data Basics

1.1 R and RStudio essentials

Since we will use R and RStudio to illustrate concepts and give examples throughout these notes, it makes sense to begin by introducing these platforms.

1.1.1 The basics

  • R is a programming language and environment for statistical computing and graphics [1].

  • RStudio is an integrated development environment (IDE) for R, which includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management [2]. Both are free and open source [3]. RStudio is the interface through which we will use R.

  • Posit Cloud is a hosted version of RStudio in the cloud.

To navigate through this course, you will need to either

  1. Download and install R and RStudio (instructions here), or

  2. Join the MA217 workspace on Posit Cloud (instructions here).

Once you launch RStudio, you should see an interface similar to this one:

The panel in the upper right contains your workspace as well as a history of the commands that you have previously entered. Any plots that you generate will show up in the panel in the lower right corner.

The panel on the left is where the action happens. It’s called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt, indicated by the > symbol. As its name suggests, this prompt is a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

Performing simple calculations in R is a good way to begin learning its features. Commands entered in the console are immediately executed by R. Most simple calculations will work just like you would expect from a typical calculator. For example, typing 3+5 in the console will return 8 (try it for yourself).

3 + 5
## [1] 8

Other examples:

3*5
## [1] 15
sqrt(81) # square root
## [1] 9

This last example demonstrates how functions are called within R as well as the use of comments. Comments are prefaced with the # character. You can also save values to named variables for later reuse.

myvariable = 3*5 # save result
myvariable # display the result
## [1] 15
myvariable <- 3*5 # you may use <- instead of = to assign values to a variable.
myvariable
## [1] 15

Once variables are defined, they can be referenced in other operations and functions.

0.5*myvariable
## [1] 7.5
log(myvariable) # (natural) log of myvariable
## [1] 2.70805

The semi-colon can be used to place multiple commands on one line. One frequent use of this is to save and print a value all in one go:

myvariable <- 3*5; myvariable # save result and show it
## [1] 15

Note that R is case sensitive, so MyVariable would be a different variable than myvariable.
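To see this, we can use the built-in function exists(), which checks whether an object with a given name has been defined (assuming nothing named MyVariable exists in your session):

```r
myvariable <- 15
exists("myvariable")  # an object with this exact name exists
## [1] TRUE
exists("MyVariable")  # different capitalization, so a different (undefined) name
## [1] FALSE
```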

We can create a set of values, which we’ll refer to as a vector, using the concatenate function c() such as:

a <- c(-1, 1, 0, 5, 2, -3) # creates a vector called 'a' with six values in it
a[2] # displays the second component of a
## [1] 1
a # displays the whole vector a
## [1] -1  1  0  5  2 -3

Basic arithmetic can also be carried out on vectors. These operations are carried out componentwise. For example, we could multiply each component of a by itself via

a*a
## [1]  1  1  0 25  4  9

or multiply each element of a by 2 as in

2*a
## [1] -2  2  0 10  4 -6

R also operates on logical quantities TRUE (or T for true) and FALSE (or F for false). Logical values are generated by conditions that are either true or false. For example,

a <- c(-3, 4, 2, -1, -5)
a > 0
## [1] FALSE  TRUE  TRUE FALSE FALSE

compares each element of the vector a with 0, returning TRUE when it is greater than 0 and FALSE otherwise. The following logical operators can be used: <, <=, >=, >, == (for equality), != (for “not equal”), as well as & (for intersection), | (for union) and ! (for negation).
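For example, these operators can be combined on the vector a defined above:

```r
a <- c(-3, 4, 2, -1, -5)
a > 0 & a < 3     # TRUE only where both conditions hold
## [1] FALSE FALSE  TRUE FALSE FALSE
a == 4 | a == -1  # TRUE where at least one condition holds
## [1] FALSE  TRUE FALSE  TRUE FALSE
!(a > 0)          # negation flips each logical value
## [1]  TRUE FALSE FALSE  TRUE  TRUE
```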

Sometimes we may have variables that take character values. While it is always possible to code these values as numbers, there is no need to do this, as R can also handle character-valued variables. For example, the commands

A <- c('a', 'b'); A
## [1] "a" "b"

create a character vector A, containing two values a and b, and then we print out this vector. Note that we included the character values in single quotes when doing the assignment.

Sometimes data values are missing and so are listed as NA (not available). Operations on missing values create missing values. Also, an impossible operation, such as 0/0, produces NaN (not a number).
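A small illustration of both behaviors:

```r
b <- c(1, NA, 3)
b + 10                # operations involving NA produce NA
## [1] 11 NA 13
sum(b)                # a single NA makes the whole sum NA
## [1] NA
sum(b, na.rm = TRUE)  # many functions can remove NAs first
## [1] 4
0/0                   # an impossible operation gives NaN
## [1] NaN
```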

A data table can be created with the data.frame() command. The inputs are the columns of the data table. For example, to create a table in which the columns are the vectors col1 = c(1,2,3,4), col2 = c(5,6,7,8), we run

mydata <- data.frame(col1 = c(1,2,3,4), col2 = c(5,6,7,8))
mydata
##   col1 col2
## 1    1    5
## 2    2    6
## 3    3    7
## 4    4    8

Here we named the data frame “mydata” and its columns “col1” and “col2”, but different names could have been chosen. We can refer to elements of a data frame by using brackets [,] or the $ sign. For example, the number in the third row and first column of mydata is

mydata[3,1]
## [1] 3
mydata$col1[3]
## [1] 3

The entire first column is

mydata[,1]
## [1] 1 2 3 4
mydata$col1
## [1] 1 2 3 4

Data frames are one of the most common ways to store data in R.
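For example, a new column can be added to a data frame by assigning to a new name with the $ sign (the column name total here is just illustrative):

```r
mydata <- data.frame(col1 = c(1,2,3,4), col2 = c(5,6,7,8))
mydata$total <- mydata$col1 + mydata$col2  # new column computed componentwise
mydata$total
## [1]  6  8 10 12
```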

Various objects can be created during an R session. To see those created so far in your session, use the command ls(). You can remove any objects in your workspace using the command rm(). For example, rm(x) removes the vector x.
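For example:

```r
x <- c(1, 2, 3)
"x" %in% ls()  # x now appears among the session's objects
## [1] TRUE
rm(x)
"x" %in% ls()  # and is gone after rm()
## [1] FALSE
```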

1.1.2 Working with R Script Files

As an alternative to typing commands directly into the console, R commands can be stored in a file, called an R script, which facilitates executing some or all of the commands at any time. To create a file, select File, then New File, then R Script from the RStudio menu. A file editor tab will open in the Source panel. R code can be entered there, and buttons and menu items are provided to run all the code (called sourcing the file) or to run the code on a single line or in a selected section of the file.

To run your code from an R script, you can copy and paste it into the console, highlight the code and hit the Run button, or highlight the code and press Command+Enter on a Mac or Ctrl+Enter on a PC. To save an R script, click on the disk icon. The first time you save, RStudio will ask for a file name; you can name it anything you like.

1.1.3 Working with R Markdown

R Markdown is a markup language that provides a convenient way to create reports combining text, R code, R output, and graphics. These notes were typed in R Markdown, and your lab reports will be created with it as well. To create an R Markdown file from RStudio, click on “File”, then “New File”, then “R Markdown…”. Type the title of your document (for example, “Lab 1 Report”) and your name, and select HTML as the default output. The file that opens contains some examples, which you will replace with your report. Headings are created with hashtags: the fewer the hashtags, the larger the heading. To insert regular text, simply type the text. To insert R code to be executed, enclose it in a code chunk: three tick marks ``` followed by {r} on the opening line, and three tick marks on the closing line. Click on the Knit button to generate an HTML file.
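For instance, a minimal R Markdown document body showing headings, regular text, and one code chunk might look like this (a sketch, not a complete document):

````markdown
# A large heading

## A smaller heading

Some regular text describing the analysis.

```{r}
# R code placed in a chunk is executed when the file is knit
3 + 5
```
````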

1.1.4 Importing a dataset into R

Even though one can generate data in R, most often the data we wish to analyze are generated elsewhere and need to be imported into R. If the data are in a spreadsheet (the most common case), it is best to save them as a .csv (comma-separated values, a.k.a. comma-delimited) file, but R can also import an Excel file (.xlsx). The data table should have variables with preferably short names. If the data have several sheets/tabs, save each sheet as its own file. Avoid complex formatting.

In RStudio, under “Environment”, click on “Import Dataset”, then “From Text (base)”. Browse and select the csv file. A dialog box will open. Among other things, the dialog box will give you the option to change the name of the dataset. If your data has a heading, select “yes” for Heading. Then click on “Import”. The newly imported dataset will be listed on the upper right box of RStudio and will be ready for use. When clicking on the dataset name, the data will be displayed on the upper left box.

All datasets used as examples in these notes will either be retrieved from R packages or will be available on Canvas for download.

1.1.5 R packages

Since R is an open-source programming language, users can contribute packages with functions that facilitate the use of R for certain types of analyses. In this course, we will often use the following packages:

  • The tidyverse “umbrella” package, which houses a suite of many different R packages for data wrangling and data visualization.

  • The openintro R package, which we’ll use mainly for datasets.

To install these packages, type the following in the R console:

install.packages("tidyverse")
install.packages("openintro")

You can also go to Tools, then click on “Install packages”, then type the name of the package you would like to install. To use an installed package, you must call its library. For example, to use tidyverse within an R session or R Markdown document, we enter the command

library(tidyverse)

1.2 How is data organized?

A data matrix [4] is a convenient and common way to organize data. Each row corresponds to a case (a.k.a. observation, individual, subject, unit, etc.) and each column corresponds to a variable (a characteristic of the cases). Other arrangements are sometimes used for certain statistical methods.

Each data matrix should be accompanied by a table with variable descriptions (a codebook). Data may need to be cleaned before being organized.

Let’s look at the county dataset, available in the package usdata, which comes with the openintro package. To load the data, type:

data(county, package = "usdata")

To view the entire data, you can type View(county) in your console. Table 1.1 has the first 5 rows of the county dataset, which includes information for 3,142 counties in the US. Its variables are summarized in Table 1.2.

Table 1.1: First five rows of the county dataset.
name state pop2000 pop2010 pop2017 pop_change poverty homeownership multi_unit unemployment_rate metro median_edu per_capita_income median_hh_income smoking_ban
Autauga County Alabama 43671 54571 55504 1.48 13.7 77.5 7.2 3.86 yes some_college 27841.70 55317 none
Baldwin County Alabama 140415 182265 212628 9.19 11.8 76.7 22.6 3.99 yes some_college 27779.85 52562 none
Barbour County Alabama 29038 27457 25270 -6.22 27.2 68.0 11.1 5.90 no hs_diploma 17891.73 33368 partial
Bibb County Alabama 20826 22915 22668 0.73 15.2 82.9 6.6 4.39 yes hs_diploma 20572.05 43404 none
Blount County Alabama 51024 57322 58013 0.68 15.6 82.0 3.7 4.02 yes hs_diploma 21367.39 47412 none
Table 1.2: Description of the variables in the county dataset.
Variable Description
name County name.
state State where the county resides or the District of Columbia.
pop2000 Population in 2000.
pop2010 Population in 2010.
pop2017 Population in 2017.
pop_change Percent change in the population from 2010 to 2017.
poverty Percent of the population in poverty in 2017.
homeownership Percent of the population that lives in their own home or lives with the owner, 2006-2010.
multi_unit Percent of living units that are in multi-unit structures, e.g. apartments, 2006-2010.
unemployment_rate Unemployment rate as a percent, 2017.
metro Whether the county contains a metropolitan area.
median_edu Median education level (2013-2017), which can take a value among below_hs, hs_diploma, some_college, and bachelors.
per_capita_income Per capita (per person) income (2013-2017).
median_hh_income Median household income for the county, where a household’s income equals the total income of its occupants who are 15 years or older.
smoking_ban Describes the type of county-level smoking ban in place in 2010, taking one of the values ‘none’, ‘partial’, or ‘comprehensive’.

1.3 Variables

Variables are characteristics of the cases. They can be of the following types:

  • Numerical / quantitative: Represented by numbers (with units), and it is sensible to do arithmetic with them. There are two types of numerical variables: continuous and discrete. Continuous variables can take any real value within an interval, while discrete variables can only take isolated values, with jumps between them (e.g., counts).

  • Categorical / qualitative: Its possible values are categories, called the levels of the variable. There are two types of categorical variables: nominal and ordinal. Nominal variables have no apparent order in their levels, while ordinal variables do.
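In R, categorical variables are typically stored as factors, and an ordinal variable can be encoded by passing ordered = TRUE to factor(). A brief sketch, using levels that mirror the median_edu variable for illustration:

```r
edu <- factor(c("hs_diploma", "bachelors"),
              levels = c("below_hs", "hs_diploma", "some_college", "bachelors"),
              ordered = TRUE)
edu[1] < edu[2]  # order comparisons are meaningful for ordinal variables
## [1] TRUE
```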

The variables in the county dataset (Table 1.2) can be classified as follows:

Table 1.3: Variables in the county dataset and their classifications.
Variable Classification
name Nominal
state Nominal
pop2000 Continuous
pop2010 Continuous
pop2017 Continuous
pop_change Continuous
poverty Continuous
homeownership Continuous
multi_unit Continuous
unemployment_rate Continuous
metro Nominal
median_edu Ordinal
per_capita_income Continuous
median_hh_income Discrete
smoking_ban Nominal

1.4 Models

Models are approximations of reality that can help our understanding of the world. This course focuses on statistical models, which describe a behavior in mathematical or statistical terms, based on data. Many models and analyses are motivated by a researcher looking for a relationship between two or more variables.

For example, are counties with a higher median household income associated with a higher percent change in population? A plot of these two quantities would be a good first step to gaining insight to answer the posed question.

library(tidyverse)
ggplot(data = county, aes(x = median_hh_income, y = pop_change)) +
  geom_point(alpha = 0.3)

Here we used the function ggplot from the package ggplot2, which comes in the tidyverse umbrella package. The first input is the dataset, and the second is the “aesthetics”, that is, which variables to place on the x- and y-axes. We then add layers to the plot with the + sign; in this case, we add points using geom_point(). The input alpha = 0.3 gives the points some transparency (the lower the value, the more transparent the points).

We can also do the same plot with R’s built-in plot function:

plot(pop_change ~ median_hh_income, data = county)

Here the symbol ~ means that we want the variable on the left-hand side to be plotted as a function of the one on the right-hand side. Even though the command above is shorter than the ggplot one, ggplot is a much more powerful tool for building visualizations.

A linear function may be a reasonable model for the relationship between these two quantities.

When two variables show some connection with one another, they are said to be associated. For numerical variables, if higher values of one variable correspond to higher values of the other (or lower values of one correspond to lower values of the other), the variables are said to have a positive association. If higher values of one variable correspond to lower values of the other, they are said to have a negative association. If two variables are not associated, they are said to be independent.

When investigating possible effects of one variable on another, the one that might affect the other is called an explanatory variable and the other the response variable.

1.5 Types of data-generating studies

There are two main types of data-generating studies: observational studies and experiments.

In an observational study, data are collected in a way that does not directly interfere with how the data arise (they are merely observed). Observational studies are generally only sufficient to show associations, not causal connections. Ex: medical records, surveys.

In an experiment, data are collected in a way that allows for investigating causal connections. Usually it involves random assignment to treatment and control groups. Ex: clinical trials.

Further, if the data are to be used to make broad conclusions, then it is important to understand who or what the data represent. Knowing how the units were selected from a larger entity allows for generalizations back to the population from which the data were randomly selected. Good questions to ask oneself before working with the data are: “How were these observations collected?” and “What possible biases might these data have?”

References

Krzanowski, W. J., and F. H. C. Marriott. 1994. Multivariate Analysis Part 1: Distributions, Ordination, and Inference. Hodder Education Publishers.

  1. For more information, visit The R Project for Statistical Computing. This page also has a free R manual.↩︎

  2. RStudio’s webpage↩︎

  3. R is a GNU project↩︎

  4. The term data matrix was defined in (Krzanowski and Marriott 1994)↩︎