Chapter 5 Using R for Fundamental Data Analysis

In the past few chapters, you learned about RStudio, its user interface, and how to use it to read, save, and retrieve a data file. In this chapter, we will take a look at how to use R for basic data manipulation and analysis. Before we read in a data file for analysis, it is important to understand the different classes of variables and how R uses them.

5.1 Classes of Variables in R

In the previous chapter, you learned how to assign a value to a variable. A variable in R can hold a single value, either a number or a character string. A vector holds multiple values of the same type, either numeric or character. A vector can also hold logical values such as TRUE and FALSE, or complex numbers.

A vector has certain attributes associated with it such as ‘names’, dimensions (‘dim’), and ‘class’, as shown in the code below. The ‘names’ attribute is useful for assigning names or labels to the values. The ‘dim’ attribute specifies the number of rows and columns or, for arrays with more dimensions, the size of each additional dimension. A matrix (a two-dimensional array) is simply a vector of values arranged in the specified rows and columns. The same idea extends to arrays, which can hold several matrices by adding further dimensions.

x<-c(1,2,3,4,5,6)  ## creating a vector with numeric values
x

## [1] 1 2 3 4 5 6

class(x)

## [1] "numeric"

names(x)<-c("red","blue","green","yellow","orange","white")  ## assigns labels to the values in vector x as shown next
x

##    red   blue  green yellow orange  white 
##      1      2      3      4      5      6

dim(x)<-c(2,3) ## creating a matrix from a vector with two rows and three columns; setting dim drops the names
x

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

class(x)  ## in R 4.0.0 and later this prints "matrix" "array"

## [1] "matrix"
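The extension from matrices to arrays described above can be sketched as follows. This is a minimal illustration with made-up values; the object names a and m and the row labels r1 and r2 are assumptions chosen for the example, not part of the survey data used later.

```r
## A 2 x 3 x 2 array: two matrices, each with two rows and three columns
a <- array(1:12, dim = c(2, 3, 2))
dim(a)           # 2 3 2
a[2, 3, 1]       # row 2, column 3 of the first matrix: 6
a[1, 1, 2]       # row 1, column 1 of the second matrix: 7

## Row and column labels of a matrix are stored in dimnames
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("red", "blue", "green")))
m["r2", "blue"]  # 4

## Logical and complex vectors have their own classes
class(c(TRUE, FALSE))    # "logical"
class(c(1+2i, 3-1i))     # "complex"
```

Values fill an array column by column, which is why the second matrix starts at 7.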

In addition to the above, it is useful to understand the role that a ‘list’ plays in R. A list can contain variables of different types as follows:

x<-c(1,2,3,4,5,6)
y<-c("Harry","Robert","Nancy")
z<-list(x,y)
z

## [[1]]
## [1] 1 2 3 4 5 6
## 
## [[2]]
## [1] "Harry"  "Robert" "Nancy"

Knowing how to create a list and read elements from it is useful when manipulating different types of data. Because this is an introductory text for beginners, lists are covered in more detail in more advanced lessons in R.
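As a brief taste of reading elements from a list, the sketch below rebuilds the list from the example above with element names. The names scores and people are illustrative assumptions; any valid names would work.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c("Harry", "Robert", "Nancy")
z <- list(scores = x, people = y)   # element names are illustrative

z[[2]]               # second element: the character vector
z$people[3]          # third value of the element named "people": "Nancy"
z[["scores"]][1:2]   # first two values of "scores": 1 2
length(z)            # number of elements in the list: 2
class(z[1])          # single brackets return a list: "list"
class(z[[1]])        # double brackets return the element itself: "numeric"
```

The difference between single and double brackets is the usual stumbling block: z[1] is still a list, while z[[1]] is the vector stored inside it.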

5.2 Reading and Analyzing Data

In this section, you will learn how to read data and carry out basic analysis. We already looked at how to compute basic summary measures in the previous chapter. While that is a useful start, basic exploration, analysis, and presentation of data very often require frequency tabulations, filtering, and plotting. To do this, we will work with a specific dataset and a couple of new packages. R also has some basic analysis functions built in. However, you will find that the R packages are more user friendly and generate their output more efficiently.

In the previous chapter, you learned how to create variables and save a dataset. It is useful to have variable labels and value labels in the dataset. In order to read in a dataset along with the labels, we will use the following command to read a sample dataset named “samplesurveydata.csv” created using a simulated dataset courtesy of Hainke (2018):

library(expss)
df<-read_labelled_csv("samplesurveydata.csv")
head(df)

##   id gender age province                 product Q1 Q2 Q3 Q4 Q5
## 1  1      2   4       ON            Bass-O-Matic  2  4  4  2  8
## 2  2      1   3       BC Little Chocolate Donuts  3  2  3  5  5
## 3  3      1   4       ON Little Chocolate Donuts  1  3  1  4  8
## 4  4      2   4       ON            Bass-O-Matic  2  3  2  4  4
## 5  5      2   1       QC            Bass-O-Matic  2  4  5  5  0
## 6  6      2   1       QC            Bass-O-Matic  2  5  3  5  7

str(df)

## 'data.frame':	2000 obs. of  10 variables:
##  $ id      :Class 'labelled' int  1 2 3 4 5 6 7 8 9 10 ...
##    .. .. LABEL: Id 
##  $ gender  :Class 'labelled' int  2 1 1 2 2 2 1 1 1 1 ...
##    .. .. LABEL: Gender 
##    .. .. VALUE LABELS [1:2]: 1=Male, 2=Female 
##  $ age     :Class 'labelled' int  4 3 4 4 1 1 1 3 3 3 ...
##    .. .. LABEL: Age 
##    .. .. VALUE LABELS [1:5]: 1=18-24, 2=25-34, 3=35-54, 4=55-64, 5=65+ 
##  $ province: chr  "ON" "BC" "ON" "ON" ...
##  $ product : chr  "Bass-O-Matic" "Little Chocolate Donuts" "Little Chocolate Donuts" "Bass-O-Matic" ...
##  $ Q1      :Class 'labelled' int  2 3 1 2 2 2 5 4 4 5 ...
##    .. .. LABEL: Satisfaction with store 
##    .. .. VALUE LABELS [1:5]: 1=Very dissatisfied, 2=Somewhat dissatisfied, 3=Neither satisfied nor dissatisfied, 4=Somewhat satisfied, 5=Very satisfied 
##  $ Q2      :Class 'labelled' int  4 2 3 3 4 5 2 4 4 5 ...
##    .. .. LABEL: Satisfaction with products 
##    .. .. VALUE LABELS [1:5]: 1=Very dissatisfied, 2=Somewhat dissatisfied, 3=Neither satisfied nor dissatisfied, 4=Somewhat satisfied, 5=Very satisfied 
##  $ Q3      :Class 'labelled' int  4 3 1 2 5 3 3 4 5 2 ...
##    .. .. LABEL: User friendly website 
##    .. .. VALUE LABELS [1:5]: 1=Strongly disagree, 2=Disagree, 3=Neither agree nor disagree, 4=Agree, 5=Strongly agree 
##  $ Q4      :Class 'labelled' int  2 5 4 4 5 5 4 2 1 5 ...
##    .. .. LABEL: Helpful customer service 
##    .. .. VALUE LABELS [1:5]: 1=Not at all helpful, 2=Not so helpful, 3=Somewhat helpful, 4=Very helpful, 5=Extremely helpful 
##  $ Q5      :Class 'labelled' int  8 5 8 4 0 7 4 4 2 3 ...
##    .. .. LABEL: Would recommend to others

As seen above, we are using a package called expss (Demin 2020) that makes it easy to create and read variables with labels and value labels. The command library is used to load the package named expss. If it is not listed in the Packages pane of RStudio (lower right by default), it needs to be installed first for library to work. It is also important to use read_labelled_csv to read the file, as opposed to read_csv, to retain pre-existing labels for variables and values in the file.
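One way to handle the installation step from within a script is sketched below. It assumes an internet connection and uses the CRAN cloud mirror as the download location; substitute another mirror if you prefer.

```r
## Install expss only if it is not already available, then load it
if (!requireNamespace("expss", quietly = TRUE)) {
  install.packages("expss", repos = "https://cloud.r-project.org")
}
library(expss)
```

The check with requireNamespace avoids reinstalling the package every time the script is run.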

5.3 Adding a Variable or a Row

A typical task that comes up when dealing with any data is adding a new observation, adding a new variable, or creating a new variable through transformation in an existing dataset. We will work through an example using a smaller dataframe for simplicity to demonstrate this task. To add a variable (a column), we use the function cbind, and similarly, to add an observation (a row), we use the function rbind. We begin by creating the dataframe as follows:

var1<-c(1,2,3,4,5)
var2<-c("A","B","C","D","E")
newdataframe<-data.frame(var1,var2)
newdataframe

##   var1 var2
## 1    1    A
## 2    2    B
## 3    3    C
## 4    4    D
## 5    5    E

As seen above, we created a dataframe with two variables and five observations. We will add a third variable var3 and a sixth observation as follows:

var3<-c(5,10,8,11,12)
newdataframe<-cbind(newdataframe,var3)
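newrow<-data.frame(var1=1,var2="B",var3=8)  ## a one-row data frame keeps var1 and var3 numeric; c(1,"B",8) would turn all three values into text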
newdataframe<-rbind(newdataframe,newrow)
newdataframe

##   var1 var2 var3
## 1    1    A    5
## 2    2    B   10
## 3    3    C    8
## 4    4    D   11
## 5    5    E   12
## 6    1    B    8

Such tasks of adding new variables or new observations or changing a particular value are very common when analyzing data. For example, if we need to change the value in the sixth observation for var2 from “B” to “A”, we can do it as follows:

newdataframe[6,2]  ## in R 4.0.0 and later, var2 is stored as character, so this prints [1] "B" with no Levels line

## [1] B
## Levels: A B C D E

newdataframe[6,2]<-"A"
newdataframe

##   var1 var2 var3
## 1    1    A    5
## 2    2    B   10
## 3    3    C    8
## 4    4    D   11
## 5    5    E   12
## 6    1    A    8

As seen above, the value has been changed in the dataframe. The numbers within the square brackets of a dataframe correspond to row and column positions.
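Creating a new variable through transformation, mentioned at the start of this section, can be sketched in the same spirit. The dataframe name small and the variable name total are hypothetical and chosen only for this illustration.

```r
## A small stand-in dataframe with two numeric variables
small <- data.frame(var1 = c(1, 2, 3, 4, 5),
                    var3 = c(5, 10, 8, 11, 12))

## New variable computed from existing ones
small$total <- small$var1 + small$var3

## Refer to a cell by row number and column name instead of column position
small[2, "var3"]                  # 10

## Change every value that meets a condition: cap var3 at 10
small$var3[small$var3 > 10] <- 10
small
```

Referring to columns by name is less fragile than referring to them by position, because the code keeps working if columns are added or reordered.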

Now that we have covered some basics on handling, reading, and transforming data, it is time to start analyzing it. Typically, when examining a survey or any other dataset, you might want to start by exploring some basic descriptives such as measures of central tendency, dispersion, and frequency tables. R provides basic functions such as mean and table for this purpose. In the next section, we will look at how to use them for some fundamental data analysis. As you will see, they are useful but also have limitations, some of which can be overcome by using packages designed for such tasks.

5.4 Descriptive Statistics on Single Variables

Typically, we first examine descriptive statistics on individual variables. For some variables, these are measures of central tendency such as the mean and median; for others, they are frequency tables. The built-in functions mean, median, and table compute the mean, the median, and frequency tables respectively, but as shown below they have some limitations. Below are examples of some basic statistics on two of the questions: one on gender and the other on variable Q1, which corresponds to satisfaction with the store.

mean(df$Q1)

## [1] 2.9995

median(df$Q1)

## [1] 3

table(df$gender)

## 
##   Male Female 
##    995   1005
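The survey data used here has no missing values, but real data often does, and the built-in functions above treat missing values in ways that are easy to overlook. The sketch below uses a short made-up vector named answers (an assumption for illustration) to show the default behaviour and the arguments that change it.

```r
answers <- c(1, 2, 2, NA, 5)       # hypothetical responses with one missing value

mean(answers)                      # NA, because one value is missing
mean(answers, na.rm = TRUE)        # 2.5, computed on the four observed values
table(answers)                     # counts only the observed values
table(answers, useNA = "ifany")    # adds a count for NA
sum(is.na(answers))                # number of missing values: 1
```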

Sometimes we would like to look at the responses as proportions rather than counts. R has a built-in function, prop.table, for this purpose, which may be used as follows:

prop.table(table(df$gender))

## 
##   Male Female 
## 0.4975 0.5025

It is also possible to use these basic functions to compute a two-way table, for example satisfaction with the store by gender, as follows:

table(df$Q1,df$gender)

##                                     
##                                      Male Female
##   Very dissatisfied                   199    176
##   Somewhat dissatisfied               223    208
##   Neither satisfied nor dissatisfied  193    213
##   Somewhat satisfied                  200    196
##   Very satisfied                      180    212

prop.table(table(df$Q1,df$gender))

##                                     
##                                        Male Female
##   Very dissatisfied                  0.0995 0.0880
##   Somewhat dissatisfied              0.1115 0.1040
##   Neither satisfied nor dissatisfied 0.0965 0.1065
##   Somewhat satisfied                 0.1000 0.0980
##   Very satisfied                     0.0900 0.1060
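By default, prop.table divides each cell by the grand total, as in the output above. Its margin argument switches to row or column proportions. The sketch below uses a tiny made-up dataset; the vectors q1 and gender are assumptions for illustration and are not taken from the survey file.

```r
## Hypothetical mini-survey: a satisfaction code and a gender for six people
q1     <- c(1, 2, 2, 1, 2, 1)
gender <- c("Male", "Male", "Male", "Female", "Female", "Female")
counts <- table(q1, gender)

prop.table(counts)                  # each cell as a share of all six responses
prop.table(counts, margin = 1)      # row proportions: each answer sums to 1
prop.table(counts, margin = 2)      # column proportions: each gender sums to 1
round(100 * prop.table(counts, margin = 2), 1)   # column percentages
```

Column proportions are usually the ones of interest when comparing groups, since they remove the effect of unequal group sizes.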

However, as seen here, these basic built-in functions are a little cumbersome, as they need to be run separately for counts and percentages. In addition, table leaves out missing values by default, so its output does not show how many there are. The functions in the expss package we loaded earlier address both issues. The functions to use are fre for frequency tables and cro for cross-tabulations, as follows:

fre(df$gender)

Gender  Count  Valid percent  Percent  Responses, %  Cumulative responses, %
Male      995           49.8     49.8          49.8                     49.8
Female   1005           50.2     50.2          50.2                    100.0
#Total   2000            100      100           100
<NA>        0                     0.0

cro(df$Q1,df$gender)

                                      Gender
                                        Male  Female
Satisfaction with store
  Very dissatisfied                      199     176
  Somewhat dissatisfied                  223     208
  Neither satisfied nor dissatisfied     193     213
  Somewhat satisfied                     200     196
  Very satisfied                         180     212
#Total cases                             995    1005

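The function cro has companion functions that report percentages instead of counts: cro_cpct gives column percentages and cro_tpct gives percentages of the table total (cro_rpct, not shown here, gives row percentages). They are used as follows: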

cro_cpct(df$Q1,df$gender)

                                      Gender
                                        Male  Female
Satisfaction with store
  Very dissatisfied                     20.0    17.5
  Somewhat dissatisfied                 22.4    20.7
  Neither satisfied nor dissatisfied    19.4    21.2
  Somewhat satisfied                    20.1    19.5
  Very satisfied                        18.1    21.1
#Total cases                             995    1005

cro_tpct(df$Q1,df$gender)

                                      Gender
                                        Male  Female
Satisfaction with store
  Very dissatisfied                     10.0     8.8
  Somewhat dissatisfied                 11.2    10.4
  Neither satisfied nor dissatisfied     9.7    10.7
  Somewhat satisfied                    10.0     9.8
  Very satisfied                         9.0    10.6
#Total cases                             995    1005

5.5 Statistics on Multiple Variables

Although the above functions are useful for looking at variables one or two at a time, it is sometimes easier to compute summary statistics on several variables at once. For example, the basic function summary provides multiple descriptive statistics on a single variable. Since variables Q1 through Q5 are all numeric, it would be useful to run summary statistics on all five of them using a single command. This may be done by placing them in a single data frame and then running summary on the resulting data frame, as follows.

summary(df$Q1) ##Provides summary statistics on a single variable Q1

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.999   4.000   5.000

newdataframe<-data.frame(df$Q1,df$Q2,df$Q3,df$Q4,df$Q5) #creating a new data frame with five variables of interest
summary(newdataframe)

##      df.Q1           df.Q2           df.Q3           df.Q4     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.00  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.00  
##  Mean   :2.999   Mean   :3.034   Mean   :2.995   Mean   :2.97  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.00  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.00  
##      df.Q5       
##  Min.   : 0.000  
##  1st Qu.: 2.000  
##  Median : 5.000  
##  Mean   : 4.941  
##  3rd Qu.: 8.000  
##  Max.   :10.000
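An alternative worth knowing is to select the columns by name, which keeps the original column names (Q1 rather than df.Q1), and to apply a single statistic to every column with sapply. The sketch below uses a small made-up dataframe named survey as a stand-in for the survey data; the values are assumptions for illustration only.

```r
## Hypothetical stand-in for the survey data
survey <- data.frame(Q1 = c(2, 3, 1, 2),
                     Q2 = c(4, 2, 3, 3),
                     Q5 = c(8, 5, 8, 4))

items <- survey[, c("Q1", "Q2", "Q5")]   # select by name; column names are kept
summary(items)                           # same summary as before, one block per column
sapply(items, mean)                      # one mean per column: 2, 3, 6.25
sapply(items, sd)                        # one standard deviation per column
colMeans(items)                          # shortcut for column means
```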

It is also useful to compute statistics such as counts and percentages on multiple variables at once. For this, expss provides the tab functions, which are combined using what is known as piping, a concept that becomes more familiar as you advance. A piped table has four components: the dataframe the functions are applied to, the variables to tabulate (tab_cells), the statistic of interest (for example tab_stat_cases or tab_stat_cpct), and a final call to tab_pivot to create the table.

df %>% tab_cells(gender) %>% tab_stat_cases() %>% tab_pivot

              #Total
Gender
  Male           995
  Female        1005
#Total cases    2000

## Provides percentages
df %>% tab_cells(gender) %>% tab_stat_cpct() %>% tab_pivot

              #Total
Gender
  Male          49.8
  Female        50.2
#Total cases    2000

## We could add additional variables as follows
df %>% tab_cells(gender,Q1) %>% tab_stat_cpct() %>% tab_pivot

                                      #Total
Gender
  Male                                  49.8
  Female                                50.2
#Total cases                            2000
Satisfaction with store
  Very dissatisfied                     18.8
  Somewhat dissatisfied                 21.6
  Neither satisfied nor dissatisfied    20.3
  Somewhat satisfied                    19.8
  Very satisfied                        19.6
#Total cases                            2000

## We could add column variables to create cross tabulations as follows
df %>% tab_cols(gender) %>% tab_cells(age,Q1) %>% tab_stat_cpct() %>% tab_pivot

                                      Gender
                                        Male  Female
Age
  18-24                                 14.0    14.3
  25-34                                 21.9    20.6
  35-54                                 31.7    29.0
  55-64                                 13.7    15.0
  65+                                   18.8    21.1
#Total cases                             995    1005
Satisfaction with store
  Very dissatisfied                     20.0    17.5
  Somewhat dissatisfied                 22.4    20.7
  Neither satisfied nor dissatisfied    19.4    21.2
  Somewhat satisfied                    20.1    19.5
  Very satisfied                        18.1    21.1
#Total cases                             995    1005
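The piping used in the tables above passes the result on the left of the pipe to the function on its right, so a chain of steps reads from left to right instead of from the inside out. The sketch below illustrates the idea with base R's own pipe, |>, which is an assumption on our part: it requires R 4.1.0 or later and is not the %>% operator used with expss, although the two behave the same way in this simple case.

```r
values <- c(1, 4, 9)

## Nested form: read from the inside out
sum(sqrt(values))              # 6

## Piped form: read from left to right
values |> sqrt() |> sum()      # 6
```

Both lines take the square root of each value and then add the results; the piped form simply lists the steps in the order they happen.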

In summary, in this chapter you learned how to quickly compute tables and cross-tabulations and how to present them in ways that are easy to interpret and that make patterns in the data easy to see. In the next chapter, the focus is on creating visual charts to represent data in different ways.

References

Demin, Gregory. 2020. Expss: Tables, Labels and Some Useful Functions from Spreadsheets and ’Spss’ Statistics. https://CRAN.R-project.org/package=expss.

Hainke, Michael. 2018. Getting Started Using Survey Data for Analysis. https://www.hainke.ca/.