Chapter 5 Using R for Fundamental Data Analysis

In the past few chapters, you learned about RStudio, its user interface, and how to use it to read, save, and retrieve a data file you have created. In this chapter, we will look at how to use R for basic data manipulation and analysis. Before we read in a data file for analysis, it is important to understand the different classes of variables and how R uses them.

5.1 Classes of Variables in R

In the previous chapter, you learned how to assign a value to a variable. A variable in R can hold a single value that is numeric or character. When it is a vector, it can hold multiple values, all of the same type, either numeric or character. A vector can also hold logical values such as TRUE and FALSE, or complex numbers.

A vector has certain attributes associated with it, such as ‘names’, ‘dimensions’ (called dim in R) and ‘class’, as shown in the output below. The ‘names’ attribute is useful for assigning names or labels to the values. The dimensions attribute specifies the number of rows and columns or, in the case of multidimensional arrays, the size of each additional dimension. A matrix, or two-dimensional array, is simply a vector of values arranged in the specified number of rows and columns. This extends to arrays with more than two dimensions, which can be thought of as multiple matrices stacked along the additional dimensions.

## [1] 1 2 3 4 5 6
## [1] "numeric"
##    red   blue  green yellow orange  white 
##      1      2      3      4      5      6
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## [1] "matrix"
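The listing that produced this output is not reproduced here, so the following is only a sketch of code that would give similar results. The object names x and m are assumptions; the values and colour labels are taken from the output above. Note that in R 4.0 and later, class() on a matrix reports "array" in addition to "matrix".

```r
# A numeric vector with six values
x <- c(1, 2, 3, 4, 5, 6)
x
class(x)

# The names attribute attaches a label to each value
names(x) <- c("red", "blue", "green", "yellow", "orange", "white")
x

# The dim attribute arranges the same values in 2 rows and 3 columns,
# filling column by column
m <- c(1, 2, 3, 4, 5, 6)
dim(m) <- c(2, 3)
m
class(m)
```

Removing the attribute again with dim(m) <- NULL turns the matrix back into a plain vector, which shows that a matrix is a vector plus a dimensions attribute.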

In addition to the above, it is useful to understand the role that a ‘list’ plays in R. A list can contain variables of different types as follows:

## [[1]]
## [1] 1 2 3 4 5 6
## 
## [[2]]
## [1] "Harry"  "Robert" "Nancy"
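A list like the one printed above could be built and read as in the sketch below. The name my_list is an assumption made for illustration; the values come from the output.

```r
# A list holding a numeric vector and a character vector
my_list <- list(c(1, 2, 3, 4, 5, 6), c("Harry", "Robert", "Nancy"))
my_list

# Double square brackets extract one element of the list
my_list[[2]]

# A further single bracket indexes inside the extracted element
my_list[[2]][3]
```

Single brackets on a list, as in my_list[2], return a shorter list rather than the element itself, which is why double brackets are used to reach the values.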

Knowing how to create a list and read elements from it is useful when manipulating different types of data. Because this is an introductory text for beginners, lists are covered in more advanced lessons in R.

5.2 Reading and analyzing data

In this section, you will learn how to read and analyze data for basic analysis. We already looked at how to compute basic summary measures in the previous chapter. While that is a useful start, basic exploration, analysis, and presentation of data very often require frequency tabulations, filtering, and plotting. To do this, we will work with a specific dataset and a couple of new packages. R also has some basic analysis functions built in. However, you will find that the packages are more user friendly and more efficient in the output they generate.

In the previous chapter, you learned how to create variables and save a dataset. It is useful to have variable labels and value labels in the dataset. To read in a dataset along with its labels, we will read a sample dataset named “samplesurveydata.csv”, created from a simulated dataset courtesy of Hainke (2018). The output below shows the first six rows of the data, followed by its structure:

##   id gender age province                 product Q1 Q2 Q3 Q4 Q5
## 1  1      2   4       ON            Bass-O-Matic  2  4  4  2  8
## 2  2      1   3       BC Little Chocolate Donuts  3  2  3  5  5
## 3  3      1   4       ON Little Chocolate Donuts  1  3  1  4  8
## 4  4      2   4       ON            Bass-O-Matic  2  3  2  4  4
## 5  5      2   1       QC            Bass-O-Matic  2  4  5  5  0
## 6  6      2   1       QC            Bass-O-Matic  2  5  3  5  7
## 'data.frame':	2000 obs. of  10 variables:
##  $ id      :Class 'labelled' int  1 2 3 4 5 6 7 8 9 10 ...
##    .. .. LABEL: Id 
##  $ gender  :Class 'labelled' int  2 1 1 2 2 2 1 1 1 1 ...
##    .. .. LABEL: Gender 
##    .. .. VALUE LABELS [1:2]: 1=Male, 2=Female 
##  $ age     :Class 'labelled' int  4 3 4 4 1 1 1 3 3 3 ...
##    .. .. LABEL: Age 
##    .. .. VALUE LABELS [1:5]: 1=18-24, 2=25-34, 3=35-54, 4=55-64, 5=65+ 
##  $ province: chr  "ON" "BC" "ON" "ON" ...
##  $ product : chr  "Bass-O-Matic" "Little Chocolate Donuts" "Little Chocolate Donuts" "Bass-O-Matic" ...
##  $ Q1      :Class 'labelled' int  2 3 1 2 2 2 5 4 4 5 ...
##    .. .. LABEL: Satisfaction with store 
##    .. .. VALUE LABELS [1:5]: 1=Very dissatisfied, 2=Somewhat dissatisfied, 3=Neither satisfied nor dissatisfied, 4=Somewhat satisfied, 5=Very satisfied 
##  $ Q2      :Class 'labelled' int  4 2 3 3 4 5 2 4 4 5 ...
##    .. .. LABEL: Satisfaction with products 
##    .. .. VALUE LABELS [1:5]: 1=Very dissatisfied, 2=Somewhat dissatisfied, 3=Neither satisfied nor dissatisfied, 4=Somewhat satisfied, 5=Very satisfied 
##  $ Q3      :Class 'labelled' int  4 3 1 2 5 3 3 4 5 2 ...
##    .. .. LABEL: User friendly website 
##    .. .. VALUE LABELS [1:5]: 1=Strongly disagree, 2=Disagree, 3=Neither agree nor disagree, 4=Agree, 5=Strongly agree 
##  $ Q4      :Class 'labelled' int  2 5 4 4 5 5 4 2 1 5 ...
##    .. .. LABEL: Helpful customer service 
##    .. .. VALUE LABELS [1:5]: 1=Not at all helpful, 2=Not so helpful, 3=Somewhat helpful, 4=Very helpful, 5=Extremely helpful 
##  $ Q5      :Class 'labelled' int  8 5 8 4 0 7 4 4 2 3 ...
##    .. .. LABEL: Would recommend to others

To read the file, we use a package called expss (Demin 2020), which makes it easy to create and read variables with variable labels and value labels. The command library is used to load the package named expss. If expss is not listed under the Packages tab in the right-hand pane of RStudio, it needs to be installed first for this to work. It is also important to read the file with read_labelled_csv rather than read_csv, so that the pre-existing variable and value labels in the file are retained.
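The steps just described can be sketched as follows. Since the survey file itself may not be at hand, the sketch writes a small made-up data frame (toy, an assumption) to a temporary file with write_labelled_csv and reads it back; the commented line shows what the call for the chapter's file would presumably look like, with df as the assumed name of the resulting data frame.

```r
# install.packages("expss")   # run once if the package is not yet installed
library(expss)

# Presumed form of the call for the chapter's file (must be in the working directory):
# df <- read_labelled_csv("samplesurveydata.csv")

# Small made-up data frame with a variable label and value labels
toy <- data.frame(id = 1:4, gender = c(2, 1, 1, 2))
toy <- apply_labels(toy,
                    id = "Id",
                    gender = "Gender",
                    gender = c("Male" = 1, "Female" = 2))

# Write the data together with its labels, then read both back
tmp <- tempfile(fileext = ".csv")
write_labelled_csv(toy, tmp)
toy_back <- read_labelled_csv(tmp)

head(toy_back)   # first rows of the data
str(toy_back)    # structure, including the labels
```

Reading the same file with a plain CSV reader would not restore the labels, which is the reason for preferring read_labelled_csv here.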

5.3 Adding a Variable or a Row

A typical need when dealing with any data is to add a new observation to an existing dataset, add a new variable, or create a new variable through transformation. We will work through an example using a smaller dataframe for simplicity. To add a variable (a column), we simply use the function cbind, and similarly, to add an observation (a row), we use the function rbind. We start with the following dataframe:

##   var1 var2
## 1    1    A
## 2    2    B
## 3    3    C
## 4    4    D
## 5    5    E

As seen above, we created a dataframe with two variables and five observations. We will add a third variable var3 and a sixth observation as follows:

##   var1 var2 var3
## 1    1    A    5
## 2    2    B   10
## 3    3    C    8
## 4    4    D   11
## 5    5    E   12
## 6    1    B    8
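A minimal sketch of these two steps is given below. The name df_small is an assumption, and stringsAsFactors = TRUE is set explicitly so that var2 is stored as a factor, as it appears to be elsewhere in this chapter's output; recent versions of R would otherwise store it as character.

```r
# Starting dataframe: two variables, five observations
df_small <- data.frame(var1 = 1:5,
                       var2 = c("A", "B", "C", "D", "E"),
                       stringsAsFactors = TRUE)

# cbind adds a column, that is, a new variable
df_small <- cbind(df_small, var3 = c(5, 10, 8, 11, 12))

# rbind adds a row, that is, a new observation;
# the new row must use the same column names
new_row <- data.frame(var1 = 1, var2 = "B", var3 = 8,
                      stringsAsFactors = TRUE)
df_small <- rbind(df_small, new_row)
df_small
```

The order matters only in that the new row has to match the columns present at the time it is added, so here the third variable is added before the sixth observation.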

Such tasks of adding new variables or new observations, or changing a particular value, are very common when analyzing data. For example, if we need to change the value of var2 in the sixth observation from “B” to “A”, we can do it as follows:

## [1] B
## Levels: A B C D E
##   var1 var2 var3
## 1    1    A    5
## 2    2    B   10
## 3    3    C    8
## 4    4    D   11
## 5    5    E   12
## 6    1    A    8

As seen above, the value has been changed in the dataframe. The numbers within the square brackets of a dataframe correspond to row and column positions.
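The replacement can be sketched as below, using an assumed dataframe name df_small that is rebuilt here so the example runs on its own. Because var2 is a factor in this sketch, the new value has to be one of its existing levels.

```r
df_small <- data.frame(var1 = c(1, 2, 3, 4, 5, 1),
                       var2 = c("A", "B", "C", "D", "E", "B"),
                       var3 = c(5, 10, 8, 11, 12, 8),
                       stringsAsFactors = TRUE)

df_small[6, 2]          # row 6, column 2: the current value
df_small[6, 2] <- "A"   # replace it with "A"
df_small[6, "var2"]     # a column can also be addressed by name
df_small
```

Leaving one position empty selects everything along it: df_small[6, ] returns the whole sixth row and df_small[, 2] the whole second column.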

Now that we have covered some basics of handling, reading, and transforming data, it is time to start analyzing it. Typically, when examining a survey or any other dataset, you might want to start by exploring some basic descriptives such as measures of central tendency, dispersion, and frequency tables. R provides some basic functions for this, such as mean and table. In the next section, we will look at how to use these basic functions for some fundamental data analysis. As you will see, they are useful but also have limitations. Some of these limitations can be overcome by using packages designed for such tasks.

5.4 Descriptive Statistics on Single Variables

Typically, we first examine descriptive statistics on individual variables. For some variables, these might be measures of central tendency such as the mean and median; for others, frequency tables. R has basic built-in functions for these: mean, median, and table compute the mean, the median, and frequency tables respectively. As shown below, however, they have some limitations. Below are examples of basic statistics on two of the variables: the mean and median of Q1, which corresponds to satisfaction with the store, and a frequency table of gender.

## [1] 2.9995
## [1] 3
## 
##   Male Female 
##    995   1005
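The three functions can be tried on a small made-up dataframe, as in the sketch below. The dataframe toy and its values are assumptions for illustration and are not the survey data used in this chapter.

```r
# Made-up responses from eight people
toy <- data.frame(
  gender = factor(c(1, 2, 2, 1, 2, 1, 2, 2),
                  levels = c(1, 2), labels = c("Male", "Female")),
  Q1 = c(2, 3, 1, 5, 4, 3, 2, 4)
)

mean(toy$Q1)       # arithmetic mean of the ratings
median(toy$Q1)     # middle value of the ratings
table(toy$gender)  # number of responses in each category
```

The dollar sign picks one variable out of the dataframe, so each call works on a single column.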

At times we would like to see the responses as proportions or percentages rather than counts. R has a built-in function for this, prop.table, which may be used as follows:

## 
##   Male Female 
## 0.4975 0.5025
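As a small illustration on made-up values (the vector gender below is an assumption, not the survey variable), prop.table takes a table of counts and divides each count by the total:

```r
gender <- factor(c("Male", "Female", "Female", "Male", "Female"),
                 levels = c("Male", "Female"))

counts <- table(gender)        # counts per category
shares <- prop.table(counts)   # each count divided by the total
shares
round(100 * shares, 1)         # the same shares expressed as percentages
```

The result is a proportion between 0 and 1, so it is multiplied by 100 when percentages are wanted.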

It is also possible to use these basic functions to compute a two-way table, for example satisfaction with the store by gender, first as counts and then as proportions of all responses:

##                                     
##                                      Male Female
##   Very dissatisfied                   199    176
##   Somewhat dissatisfied               223    208
##   Neither satisfied nor dissatisfied  193    213
##   Somewhat satisfied                  200    196
##   Very satisfied                      180    212
##                                     
##                                        Male Female
##   Very dissatisfied                  0.0995 0.0880
##   Somewhat dissatisfied              0.1115 0.1040
##   Neither satisfied nor dissatisfied 0.0965 0.1065
##   Somewhat satisfied                 0.1000 0.0980
##   Very satisfied                     0.0900 0.1060
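A two-way table of this kind can be sketched on made-up data as follows. The dataframe toy and its two-category version of Q1 are assumptions for illustration. The margin argument of prop.table is also shown, since it switches from shares of the grand total to shares within each column.

```r
toy <- data.frame(
  gender = factor(c("Male", "Female", "Female", "Male", "Female", "Male"),
                  levels = c("Male", "Female")),
  Q1 = factor(c("Satisfied", "Dissatisfied", "Satisfied",
                "Dissatisfied", "Satisfied", "Satisfied"),
              levels = c("Dissatisfied", "Satisfied"))
)

two_way <- table(toy$Q1, toy$gender)   # rows: Q1, columns: gender
two_way

prop.table(two_way)               # each cell as a share of all responses
prop.table(two_way, margin = 2)   # each cell as a share of its column
```

With margin = 1 the shares are computed within each row instead.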

However, as seen here, these basic built-in functions are a little cumbersome because they need to be run separately for counts and percentages. In addition, by default table gives no information on missing values, as those are left out of the counts. It is therefore useful to turn to the functions in the expss package we loaded earlier: fre for frequency tables and cro for cross-tabulations, as follows:

Gender  | Count | Valid percent | Percent | Responses, % | Cumulative responses, %
Male    |   995 |          49.8 |    49.8 |         49.8 |                    49.8
Female  |  1005 |          50.2 |    50.2 |         50.2 |                   100.0
#Total  |  2000 |           100 |     100 |          100 |
<NA>    |     0 |               |     0.0 |              |

                                     | Gender |
Satisfaction with store              |   Male | Female
  Very dissatisfied                  |    199 |    176
  Somewhat dissatisfied              |    223 |    208
  Neither satisfied nor dissatisfied |    193 |    213
  Somewhat satisfied                 |    200 |    196
  Very satisfied                     |    180 |    212
  #Total cases                       |    995 |   1005
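The two functions can be tried on a small made-up labelled dataframe, as sketched below. The dataframe toy, its values, and the single missing gender are assumptions chosen so that the missing-value row of fre has something to show.

```r
library(expss)

# Made-up labelled data, with one missing value in gender
toy <- data.frame(gender = c(1, 2, 2, 1, 2, NA),
                  Q1     = c(5, 3, 1, 4, 4, 2))
toy <- apply_labels(toy,
                    gender = "Gender",
                    gender = c("Male" = 1, "Female" = 2),
                    Q1 = "Satisfaction with store")

# Frequency table: counts and percentages in one step,
# with missing values reported in their own row
freq_tab <- fre(toy$gender)
freq_tab

# Cross-tabulation: first argument gives the rows, second the columns
cross_tab <- cro(toy$Q1, toy$gender)
cross_tab
```

Because the variables carry labels, the tables are printed with “Gender”, “Male”, and “Female” rather than the variable name and the codes 1 and 2.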

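The function cro in expss has companion functions that report percentages instead of counts: cro_cpct for column percentages, cro_rpct for row percentages, and cro_tpct for percentages of the total. The two tables below show column percentages first, followed by percentages of the total: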

                                     | Gender |
Satisfaction with store              |   Male | Female
  Very dissatisfied                  |   20.0 |   17.5
  Somewhat dissatisfied              |   22.4 |   20.7
  Neither satisfied nor dissatisfied |   19.4 |   21.2
  Somewhat satisfied                 |   20.1 |   19.5
  Very satisfied                     |   18.1 |   21.1
  #Total cases                       |    995 |   1005

                                     | Gender |
Satisfaction with store              |   Male | Female
  Very dissatisfied                  |   10.0 |    8.8
  Somewhat dissatisfied              |   11.2 |   10.4
  Neither satisfied nor dissatisfied |    9.7 |   10.7
  Somewhat satisfied                 |   10.0 |    9.8
  Very satisfied                     |    9.0 |   10.6
  #Total cases                       |    995 |   1005
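The percentage variants of cro can be compared side by side on made-up data, as in this sketch. The dataframe toy and its two-category Q1 are assumptions for illustration.

```r
library(expss)

toy <- data.frame(gender = c(1, 2, 2, 1, 2, 1),
                  Q1     = c(1, 2, 2, 1, 1, 2))
toy <- apply_labels(toy,
                    gender = "Gender",
                    gender = c("Male" = 1, "Female" = 2),
                    Q1 = "Satisfaction with store",
                    Q1 = c("Dissatisfied" = 1, "Satisfied" = 2))

col_pct   <- cro_cpct(toy$Q1, toy$gender)  # percentages within each column
row_pct   <- cro_rpct(toy$Q1, toy$gender)  # percentages within each row
total_pct <- cro_tpct(toy$Q1, toy$gender)  # percentages of all cases

col_pct
row_pct
total_pct
```

Column percentages answer "what share of men chose each answer", row percentages answer "what share of each answer came from men", and total percentages relate every cell to the whole sample.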

5.5 Statistics on Multiple Variables

Although the above functions are useful for looking at one or two variables at a time, there are times when it is easier to compute summary statistics on several variables at once. For example, the basic function summary provides several descriptive statistics on a single variable. Since variables Q1 through Q5 are all numeric, it would be useful to run summary statistics on all five of them with a single command. This may be done by placing them in a single data frame and then running the summary command on the resulting data frame, as follows.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.999   4.000   5.000
##      df.Q1           df.Q2           df.Q3           df.Q4     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.00  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.00  
##  Mean   :2.999   Mean   :3.034   Mean   :2.995   Mean   :2.97  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.00  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.00  
##      df.Q5       
##  Min.   : 0.000  
##  1st Qu.: 2.000  
##  Median : 5.000  
##  Mean   : 4.941  
##  3rd Qu.: 8.000  
##  Max.   :10.000
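The approach can be sketched on a small made-up dataframe as below. The names toy and ratings and the values are assumptions. The column names in the combined dataframe take the form toy.Q1, in the same way that the output above shows df.Q1.

```r
toy <- data.frame(Q1 = c(2, 3, 1, 5),
                  Q2 = c(4, 2, 3, 3),
                  Q5 = c(8, 5, 8, 4))

# Summary of a single variable
summary(toy$Q1)

# Place the variables of interest in one data frame, then summarise them together
ratings <- data.frame(toy$Q1, toy$Q2, toy$Q5)
summary(ratings)

# Selecting the columns by name gives the same statistics
# and keeps the original names (Q1 rather than toy.Q1)
summary(toy[, c("Q1", "Q2", "Q5")])
```

Either form runs summary once per column and prints the results next to each other.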

It is also useful to compute statistics such as counts and percentages on multiple variables. For this, expss provides the tab functions, used together with what is known as piping, a concept that you will get more used to as you advance. A piped table has four components: first specify the dataframe on which the functions are applied, then the variables they are applied to, then the statistics of interest, and finally call the function tab_pivot to create the table.

                                     | #Total
Gender                               |
  Male                               |    995
  Female                             |   1005
  #Total cases                       |   2000

                                     | #Total
Gender                               |
  Male                               |   49.8
  Female                             |   50.2
  #Total cases                       |   2000

                                     | #Total
Gender                               |
  Male                               |   49.8
  Female                             |   50.2
  #Total cases                       |   2000
Satisfaction with store              |
  Very dissatisfied                  |   18.8
  Somewhat dissatisfied              |   21.6
  Neither satisfied nor dissatisfied |   20.3
  Somewhat satisfied                 |   19.8
  Very satisfied                     |   19.6
  #Total cases                       |   2000

                                     | Gender |
                                     |   Male | Female
Age                                  |        |
  18-24                              |   14.0 |   14.3
  25-34                              |   21.9 |   20.6
  35-54                              |   31.7 |   29.0
  55-64                              |   13.7 |   15.0
  65+                                |   18.8 |   21.1
  #Total cases                       |    995 |   1005
Satisfaction with store              |        |
  Very dissatisfied                  |   20.0 |   17.5
  Somewhat dissatisfied              |   22.4 |   20.7
  Neither satisfied nor dissatisfied |   19.4 |   21.2
  Somewhat satisfied                 |   20.1 |   19.5
  Very satisfied                     |   18.1 |   21.1
  #Total cases                       |    995 |   1005
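Tables of this kind can be sketched with the tab functions as shown below. The dataframe toy and its values are made up for illustration, and the exact calls used for the tables above are not known, so this is one plausible arrangement rather than the original code. The pipe operator %>% is assumed to be available once expss is loaded.

```r
library(expss)

toy <- data.frame(gender = c(1, 2, 2, 1, 2, 1),
                  age    = c(1, 2, 2, 1, 1, 2),
                  Q1     = c(1, 2, 2, 1, 1, 2))
toy <- apply_labels(toy,
                    gender = "Gender",
                    gender = c("Male" = 1, "Female" = 2),
                    age = "Age",
                    age = c("18-24" = 1, "25-34" = 2),
                    Q1 = "Satisfaction with store",
                    Q1 = c("Dissatisfied" = 1, "Satisfied" = 2))

# Counts for one variable, shown against a total column
counts_tab <- toy %>%          # 1. the dataframe
  tab_cells(gender) %>%        # 2. the variables shown in the rows
  tab_stat_cases() %>%         # 3. the statistic: counts
  tab_pivot()                  # 4. assemble the table
counts_tab

# Column percentages for two variables, broken down by gender
by_gender <- toy %>%
  tab_cells(age, Q1) %>%
  tab_cols(gender) %>%         # variable shown in the columns
  tab_stat_cpct() %>%          # the statistic: column percentages
  tab_pivot()
by_gender
```

Each step passes its result to the next one, so changing the statistic or adding a row variable means editing a single line of the chain.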

In summary, in this chapter you learned how to quickly compute tables and cross-tabulations, along with creating them in ways that are easy to quickly interpret and see patterns in the data. In the next chapter, the focus is on creating visual charts to represent data in different ways.

References

Demin, Gregory. 2020. Expss: Tables, Labels and Some Useful Functions from Spreadsheets and ’Spss’ Statistics. https://CRAN.R-project.org/package=expss.

Hainke, Michael. 2018. Getting Started Using Survey Data for Analysis. https://www.hainke.ca/.