Chapter 2 Working with Data (FQA)

This section provides an overview of useful commands for working with tabular data in R.

2.1 Reading in Data

2.1.1 Uploading Data to RStudio Cloud

To read in data from a file, that file must be stored in the RStudio Cloud project you are working within. Typically, the data files you need will come pre-loaded in the projects we share with you. However, if you need to upload your own data file, you can do so by clicking the “Upload” button in the bottom-right pane of RStudio (see image below).

2.1.2 Reading in CSV Files

Throughout HBAP we will typically work with data stored in comma-separated value (.csv) files. Once a file is loaded into your RStudio Cloud project, you can read it in with the read.csv() function. The file name you pass into the function must be surrounded by quotation marks ("").

data <- read.csv("employee_data.csv")
head(data)

##     ID                Name Gender Age Rating      Degree Start_Date Retired
## 1 6881  al-Rahimi, Tayyiba Female  51     10 High School  2/23/1990   FALSE
## 2 2671       Lewis, Austin   Male  34      4        Ph.D  2/23/2007   FALSE
## 3 8925   el-Jaffer, Manaal Female  50     10    Master's  2/23/1991   FALSE
## 4 2769       Soto, Michael   Male  52     10 High School  2/23/1987   FALSE
## 5 2658 al-Ebrahimi, Mamoon   Male  55      8        Ph.D  2/23/1985   FALSE
## 6 1933      Medina, Brandy Female  62      7 Associate's  2/23/1979    TRUE
##      Division Salary
## 1  Operations 108804
## 2 Engineering 182343
## 3 Engineering 206770
## 4       Sales 183407
## 5   Corporate 236240
## 6       Sales     NA

2.2 Data Frame Basics

We can count the number of observations (or rows) in our data set with nrow():

nrow(data)

## [1] 1000

Similarly, we can count the number of variables (or columns) with ncol():

ncol(data)

## [1] 10

We can get a list of the variable names with names():

names(data)

##  [1] "ID"         "Name"       "Gender"     "Age"        "Rating"    
##  [6] "Degree"     "Start_Date" "Retired"    "Division"   "Salary"

We can access specific columns in our data set with the $ operator. For example, the code below calculates the average age in our data set.

mean(data$Age)

## [1] 45.53

The str() command prints out the structure of the data, which indicates the recognized type of each variable:

str(data)

## 'data.frame':    1000 obs. of  10 variables:
##  $ ID        : int  6881 2671 8925 2769 2658 1933 3570 7821 3256 6222 ...
##  $ Name      : chr  "al-Rahimi, Tayyiba" "Lewis, Austin" "el-Jaffer, Manaal" "Soto, Michael" ...
##  $ Gender    : chr  "Female" "Male" "Female" "Male" ...
##  $ Age       : int  51 34 50 52 55 62 47 43 27 30 ...
##  $ Rating    : int  10 4 10 10 8 7 8 8 7 6 ...
##  $ Degree    : chr  "High School" "Ph.D" "Master's" "High School" ...
##  $ Start_Date: chr  "2/23/1990" "2/23/2007" "2/23/1991" "2/23/1987" ...
##  $ Retired   : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
##  $ Division  : chr  "Operations" "Engineering" "Engineering" "Sales" ...
##  $ Salary    : int  108804 182343 206770 183407 236240 NA 101138 149468 74188 110078 ...

Finally, the summary() command provides an intelligent summary of each variable in the data set, based on the variable’s type:

summary(data)

##        ID           Name              Gender               Age       
##  Min.   :1030   Length:1000        Length:1000        Min.   :25.00  
##  1st Qu.:3225   Class :character   Class :character   1st Qu.:36.00  
##  Median :5360   Mode  :character   Mode  :character   Median :46.00  
##  Mean   :5384                                         Mean   :45.53  
##  3rd Qu.:7496                                         3rd Qu.:56.00  
##  Max.   :9973                                         Max.   :65.00  
##                                                                      
##      Rating          Degree           Start_Date         Retired       
##  Min.   : 2.000   Length:1000        Length:1000        Mode :logical  
##  1st Qu.: 6.000   Class :character   Class :character   FALSE:920      
##  Median : 7.000   Mode  :character   Mode  :character   TRUE :80       
##  Mean   : 6.993                                                        
##  3rd Qu.: 8.000                                                        
##  Max.   :10.000                                                        
##                                                                        
##    Division             Salary      
##  Length:1000        Min.   : 29825  
##  Class :character   1st Qu.:129694  
##  Mode  :character   Median :156290  
##                     Mean   :156486  
##                     3rd Qu.:184742  
##                     Max.   :266235  
##                     NA's   :80

2.3 Manipulating Data Frames

2.3.1 Subsetting Data

There are two ways to subset a data set. One method is to use the subset() function, where the first argument is the data frame and the second is the filtering condition. For example, the code below creates a new data frame called dataHighPerformers with only those employees whose Rating is greater than or equal to 8.

dataHighPerformers <- subset(data, Rating >= 8)
head(dataHighPerformers)

##     ID                 Name Gender Age Rating      Degree Start_Date Retired
## 1 6881   al-Rahimi, Tayyiba Female  51     10 High School  2/23/1990   FALSE
## 3 8925    el-Jaffer, Manaal Female  50     10    Master's  2/23/1991   FALSE
## 4 2769        Soto, Michael   Male  52     10 High School  2/23/1987   FALSE
## 5 2658  al-Ebrahimi, Mamoon   Male  55      8        Ph.D  2/23/1985   FALSE
## 7 3570  Troftgruben, Tierra Female  47      8 High School  2/23/1995   FALSE
## 8 7821 Holleman, Shaquaisha Female  43      8    Master's  2/23/1999   FALSE
##          Division Salary
## 1      Operations 108804
## 3     Engineering 206770
## 4           Sales 183407
## 5       Corporate 236240
## 7      Operations 101138
## 8 Human Resources 149468

Alternatively, we can specify the condition within brackets. Within the brackets, the first position indicates the rows we want to extract, which in this case is all rows where Rating is greater than or equal to 8. The second position indicates the columns we want to extract; we can extract all columns by leaving this position empty.

dataHighPerformers <- data[data$Rating >= 8, ]
head(dataHighPerformers)

##     ID                 Name Gender Age Rating      Degree Start_Date Retired
## 1 6881   al-Rahimi, Tayyiba Female  51     10 High School  2/23/1990   FALSE
## 3 8925    el-Jaffer, Manaal Female  50     10    Master's  2/23/1991   FALSE
## 4 2769        Soto, Michael   Male  52     10 High School  2/23/1987   FALSE
## 5 2658  al-Ebrahimi, Mamoon   Male  55      8        Ph.D  2/23/1985   FALSE
## 7 3570  Troftgruben, Tierra Female  47      8 High School  2/23/1995   FALSE
## 8 7821 Holleman, Shaquaisha Female  43      8    Master's  2/23/1999   FALSE
##          Division Salary
## 1      Operations 108804
## 3     Engineering 206770
## 4           Sales 183407
## 5       Corporate 236240
## 7      Operations 101138
## 8 Human Resources 149468

2.3.2 Sorting Data

You can sort data on specific columns using the order() function:

sortedData <- data[order(data$Age),]
head(sortedData)

##       ID                  Name Gender Age Rating      Degree Start_Date Retired
## 12  7068          Dimas, Roman   Male  25      8 High School  2/23/2017   FALSE
## 96  5464      al-Pirani, Rajab   Male  25      3 Associate's  2/23/2016   FALSE
## 126 7910        Hopper, Summer Female  25      7  Bachelor's  2/23/2017   FALSE
## 261 6784 al-Siddique, Zaitoona Female  25      4    Master's  2/23/2015   FALSE
## 322 3240        Steggall, Shai Female  25      7    Master's  2/23/2017   FALSE
## 358 1413          Tanner, Sean   Male  25      2 Associate's  2/23/2016   FALSE
##            Division Salary
## 12       Operations  84252
## 96       Operations  37907
## 126     Engineering 100688
## 261 Human Resources 127618
## 322      Operations 117062
## 358      Operations  61869

You can reverse the direction of the sort by adding a negative sign (-) in front of the variable:

sortedDataDesc <- data[order(-data$Age),]
head(sortedDataDesc)

##       ID              Name Gender Age Rating      Degree Start_Date Retired
## 137 8060 al-Morad, Mastoor   Male  65      8        Ph.D  2/23/1977   FALSE
## 223 9545    Lloyd, Devante   Male  65      9  Bachelor's  2/23/1974   FALSE
## 248 7305     Law, Charisma Female  65      8 Associate's  2/23/1976   FALSE
## 295 4141  Herrera, Yarabbi Female  65      8 High School  2/23/1975   FALSE
## 340 2559     Holiday, Emma Female  65      7  Bachelor's  2/23/1975    TRUE
## 462 4407     Ross, Caitlyn Female  65      7  Bachelor's  2/23/1975    TRUE
##            Division Salary
## 137       Corporate 213381
## 223      Accounting 243326
## 248 Human Resources 214788
## 295      Operations 143728
## 340      Operations     NA
## 462       Corporate     NA

2.4 Summary Statistics

2.4.1 Quantitative Variables

We can apply all of the following functions to any quantitative variable in a data frame to calculate basic summary statistics:

mean()
median()
sd()
summary()

mean(data$Age)

## [1] 45.53

median(data$Age)

## [1] 46

sd(data$Age)

## [1] 11.62102

The summary() function outputs the mean, median, and quartiles:

summary(data$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   36.00   46.00   45.53   56.00   65.00

If the column contains any missing values (NA), we need to add the optional na.rm = TRUE argument to these functions. For example:

mean(data$Salary, na.rm = TRUE)

## [1] 156486

We can use the cor() function to calculate the correlation between two quantitative variables. For this function, we need to add the optional argument use = "complete.obs" if either column contains missing values (instead of the typical na.rm = TRUE).

cor(data$Age, data$Salary, use = "complete.obs")

## [1] 0.5635125

2.4.2 Categorical Variables

We can count the values of a categorical variable using the table() function:

table(data$Division)

## 
##      Accounting       Corporate     Engineering Human Resources      Operations 
##              63             103             236              97             287 
##           Sales 
##             214

If we want the proportion of each category (instead of the count), we need to surround our call to table() with prop.table():

prop.table(table(data$Division))

## 
##      Accounting       Corporate     Engineering Human Resources      Operations 
##           0.063           0.103           0.236           0.097           0.287 
##           Sales 
##           0.214

Two categorical variables can be summarized in a two-way table using the same table() and prop.table() commands shown above. For example:

table(data$Division, data$Degree)

##                  
##                   Associate's Bachelor's High School Master's Ph.D
##   Accounting                0         31           0       32    0
##   Corporate                 0         20           0       40   43
##   Engineering               0         36           0       43  157
##   Human Resources          35         30           0       32    0
##   Operations              110         16         146       15    0
##   Sales                    55         67          54       38    0

prop.table(table(data$Division, data$Degree))

##                  
##                   Associate's Bachelor's High School Master's  Ph.D
##   Accounting            0.000      0.031       0.000    0.032 0.000
##   Corporate             0.000      0.020       0.000    0.040 0.043
##   Engineering           0.000      0.036       0.000    0.043 0.157
##   Human Resources       0.035      0.030       0.000    0.032 0.000
##   Operations            0.110      0.016       0.146    0.015 0.000
##   Sales                 0.055      0.067       0.054    0.038 0.000

Within prop.table(), we can use the margin argument to normalize by row (margin = 1) or by column (margin = 2):

prop.table(table(data$Division, data$Degree), margin = 1)

##                  
##                   Associate's Bachelor's High School   Master's       Ph.D
##   Accounting       0.00000000 0.49206349  0.00000000 0.50793651 0.00000000
##   Corporate        0.00000000 0.19417476  0.00000000 0.38834951 0.41747573
##   Engineering      0.00000000 0.15254237  0.00000000 0.18220339 0.66525424
##   Human Resources  0.36082474 0.30927835  0.00000000 0.32989691 0.00000000
##   Operations       0.38327526 0.05574913  0.50871080 0.05226481 0.00000000
##   Sales            0.25700935 0.31308411  0.25233645 0.17757009 0.00000000

prop.table(table(data$Division, data$Degree), margin = 2)

##                  
##                   Associate's Bachelor's High School Master's  Ph.D
##   Accounting            0.000      0.155       0.000    0.160 0.000
##   Corporate             0.000      0.100       0.000    0.200 0.215
##   Engineering           0.000      0.180       0.000    0.215 0.785
##   Human Resources       0.175      0.150       0.000    0.160 0.000
##   Operations            0.550      0.080       0.730    0.075 0.000
##   Sales                 0.275      0.335       0.270    0.190 0.000

2.5 Visualizations

2.5.1 Quantitative Variables

To visualize a quantitative variable we can create a histogram using the hist() function:

hist(data$Salary)

Alternatively, we can create a boxplot using the boxplot() function:

boxplot(data$Salary)

If we want to visualize the distribution of a quantitative variable over each category of a categorical variable, we can create a side-by-side boxplot. To do this we use the same boxplot() function as above, but add a tilde (~) and the categorical variable:

boxplot(data$Salary ~ data$Degree)

To visualize the relationship between two quantitative variables we can create a scatterplot with the plot() function:

plot(data$Age, data$Salary)

2.5.2 Categorical Variables

We can create a bar plot of a categorical variable by applying the barplot() function to a table:

barplot(table(data$Division))

We can also create a pie chart with the pie() function:

pie(table(data$Division))