Chapter 5 Using R for Fundamental Data Analysis
In the past few chapters, you learned about RStudio, its user interface, and how to use it to create, save, and retrieve a data file. In this chapter, we will look at how to use R for basic data manipulation and analysis. Before we read in a data file for analysis, it is important to understand the different classes of variables and how R uses them.
5.1 Classes of Variables in R
In the previous chapter, you learned how to assign a value to a variable. A variable in R can hold a single value that is numeric or character. A vector holds multiple values of the same type, either all numeric or all character. A vector can also hold logical values such as TRUE and FALSE, or complex numbers.
A vector has attributes associated with it, such as ‘names’, ‘dimensions’, and ‘class’, as shown in the code below. The ‘names’ attribute is useful for assigning names or labels to the values. The ‘dimensions’ attribute specifies the number of rows and columns or, for multidimensional arrays, the number of matrices as well. A matrix, or two-dimensional array, is simply a vector of values arranged in the specified rows and columns. The same idea extends to arrays, which can consist of multiple matrices, by supplying additional dimensions.
## [1] 1 2 3 4 5 6
## [1] "numeric"
names(x)<-c("red","blue","green","yellow","orange","white") ## assigns labels to the values in vector x as shown next
x
##    red   blue  green yellow orange  white
##      1      2      3      4      5      6
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## [1] "matrix"
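The input that produced the output above is not shown. A minimal sketch consistent with it, assuming the vector is called x as in the text (the names m and a are only illustrative), might look like this:

```r
# A numeric vector with six values
x <- c(1, 2, 3, 4, 5, 6)
class(x)                      # "numeric"

# Attach a label to each value through the names attribute
names(x) <- c("red", "blue", "green", "yellow", "orange", "white")
x["green"]                    # look up a value by its label

# Work on a fresh copy, because assigning dimensions removes the names
m <- c(1, 2, 3, 4, 5, 6)
dim(m) <- c(2, 3)             # 2 rows, 3 columns, filled column by column
m
class(m)                      # "matrix" in older R; R 4.0.0 and later also report "array"

# A three-dimensional array: two 2 x 3 matrices stacked together
a <- array(1:12, dim = c(2, 3, 2))
dim(a)
```

The matrix is filled down the first column before moving to the second, which is why the values 1 and 2 share a column in the printed result.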
In addition to the above, it is useful to understand the role that a ‘list’ plays in R. A list can contain variables of different types as follows:
## [[1]]
## [1] 1 2 3 4 5 6
##
## [[2]]
## [1] "Harry" "Robert" "Nancy"
Knowing how to create a list and read its elements is useful when manipulating different types of data. Lists are covered in more advanced lessons, since this is an introductory text for beginners.
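As a brief illustration, a list like the one printed above could be built and read as follows. This is a sketch only; the object names mylist and people are assumptions, not names from the original code.

```r
# A list holding a numeric vector and a character vector
mylist <- list(c(1, 2, 3, 4, 5, 6), c("Harry", "Robert", "Nancy"))

# Double brackets return the element itself
mylist[[2]]        # the character vector
mylist[[2]][3]     # third name in that vector

# Single brackets return a shorter list, not the element
class(mylist[1])

# Elements can be named and then read with $
people <- list(scores = c(1, 2, 3), names = c("Harry", "Robert", "Nancy"))
people$names[1]
length(people)
```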
5.2 Reading and analyzing data
In this section, you will learn how to read in data and carry out basic analysis. We already looked at how to compute basic summary measures in the previous chapter. While that is a useful start, basic exploration, analysis, and presentation of data very often require frequency tabulations, filtering, and plotting. To do this, we will work with a specific dataset and a couple of new packages. R also has some basic analysis functions built in. However, you will find that the packages are more user friendly and produce more polished output.
In the previous chapter, you learned how to create variables and save a dataset. It is useful to have variable labels and value labels in the dataset. In order to read in a dataset along with the labels, we will use the following command to read a sample dataset named “samplesurveydata.csv” created using a simulated dataset courtesy of Hainke (2018):
##   id gender age province                 product Q1 Q2 Q3 Q4 Q5
## 1  1      2   4       ON            Bass-O-Matic  2  4  4  2  8
## 2  2      1   3       BC Little Chocolate Donuts  3  2  3  5  5
## 3  3      1   4       ON Little Chocolate Donuts  1  3  1  4  8
## 4  4      2   4       ON            Bass-O-Matic  2  3  2  4  4
## 5  5      2   1       QC            Bass-O-Matic  2  4  5  5  0
## 6  6      2   1       QC            Bass-O-Matic  2  5  3  5  7
## 'data.frame': 2000 obs. of 10 variables:
## $ id :Class 'labelled' int 1 2 3 4 5 6 7 8 9 10 ...
## .. .. LABEL: Id
## $ gender :Class 'labelled' int 2 1 1 2 2 2 1 1 1 1 ...
## .. .. LABEL: Gender
## .. .. VALUE LABELS [1:2]: 1=Male, 2=Female
## $ age :Class 'labelled' int 4 3 4 4 1 1 1 3 3 3 ...
## .. .. LABEL: Age
## .. .. VALUE LABELS [1:5]: 1=18-24, 2=25-34, 3=35-54, 4=55-64, 5=65+
## $ province: chr "ON" "BC" "ON" "ON" ...
## $ product : chr "Bass-O-Matic" "Little Chocolate Donuts" "Little Chocolate Donuts" "Bass-O-Matic" ...
## $ Q1 :Class 'labelled' int 2 3 1 2 2 2 5 4 4 5 ...
## .. .. LABEL: Satisfaction with store
## .. .. VALUE LABELS [1:5]: 1=Very dissatisfied, 2=Somewhat dissatisfied, 3=Neither satisfied nor dissatisfied, 4=Somewhat satisfied, 5=Very satisfied
## $ Q2 :Class 'labelled' int 4 2 3 3 4 5 2 4 4 5 ...
## .. .. LABEL: Satisfaction with products
## .. .. VALUE LABELS [1:5]: 1=Very dissatisfied, 2=Somewhat dissatisfied, 3=Neither satisfied nor dissatisfied, 4=Somewhat satisfied, 5=Very satisfied
## $ Q3 :Class 'labelled' int 4 3 1 2 5 3 3 4 5 2 ...
## .. .. LABEL: User friendly website
## .. .. VALUE LABELS [1:5]: 1=Strongly disagree, 2=Disagree, 3=Neither agree nor disagree, 4=Agree, 5=Strongly agree
## $ Q4 :Class 'labelled' int 2 5 4 4 5 5 4 2 1 5 ...
## .. .. LABEL: Helpful customer service
## .. .. VALUE LABELS [1:5]: 1=Not at all helpful, 2=Not so helpful, 3=Somewhat helpful, 4=Very helpful, 5=Extremely helpful
## $ Q5 :Class 'labelled' int 8 5 8 4 0 7 4 4 2 3 ...
## .. .. LABEL: Would recommend to others
As seen above, we are using a package called expss (Demin 2020) that makes it easy to create and read variables with variable labels and value labels. The command library is used to load the package named expss. If expss is not listed under the Packages tab in the lower-right pane of RStudio, it needs to be installed first, for example with install.packages("expss"), before library will work. It is also important to read the file with read_labelled_csv rather than read_csv, so that the pre-existing variable and value labels in the file are retained.
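To see how labels travel with a file, the sketch below writes a tiny labelled data frame to a temporary file and reads it back. The three-row data frame and the names survey, path, and df_small are hypothetical; the chapter's own file would be read the same way by passing "samplesurveydata.csv" to read_labelled_csv.

```r
library(expss)  # install first with install.packages("expss") if needed

# Hypothetical three-row survey, used only for illustration
survey <- data.frame(id = 1:3, gender = c(2, 1, 1), Q1 = c(2, 3, 1))

# Attach a variable label and value labels to gender
var_lab(survey$gender) <- "Gender"
val_lab(survey$gender) <- c("Male" = 1, "Female" = 2)

# Write the data together with its labels, then read it back
path <- tempfile(fileext = ".csv")
write_labelled_csv(survey, path)
df_small <- read_labelled_csv(path)

var_lab(df_small$gender)   # the variable label after the round trip
val_lab(df_small$gender)   # the value labels after the round trip
head(df_small)             # first rows of the data
str(df_small)              # structure, including labels
```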
5.3 Adding a Variable or a Row
A typical need when working with data is to add a new observation or a new variable to an existing dataset, or to create a new variable by transforming existing ones. We will work through an example using a smaller dataframe for simplicity. To add a variable (a column), we use the function cbind, and to add an observation (a row), we use the function rbind, as follows:
## var1 var2
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
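The call that created this data frame is not shown. One way to obtain the same two columns, assuming the object is named newdataframe as in the code that follows, is sketched here:

```r
# Two variables, five observations
newdataframe <- data.frame(
  var1 = c(1, 2, 3, 4, 5),
  var2 = c("A", "B", "C", "D", "E")
)
newdataframe
dim(newdataframe)    # number of rows, then number of columns
str(newdataframe)    # var2 is character in R 4.0.0 and later, a factor by default in older versions
```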
As seen above, we created a dataframe with two variables and five observations. We will add a third variable var3 and a sixth observation as follows:
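var3<-c(5,10,8,11,12)
newdataframe<-cbind(newdataframe,var3) ## adds var3 as a third column
newrow<-data.frame(var1=1,var2="B",var3=8) ## a one-row data frame keeps var1 and var3 numeric; c(1,"B",8) would turn every value into text
newdataframe<-rbind(newdataframe,newrow) ## appends the sixth observation
newdataframe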
## var1 var2 var3
## 1 1 A 5
## 2 2 B 10
## 3 3 C 8
## 4 4 D 11
## 5 5 E 12
## 6 1 B 8
Adding new variables or observations and changing a particular value are very common tasks when analyzing data. For example, if we need to change the value of var2 in the sixth observation from “B” to “A”, we can do it as follows:
## [1] B
## Levels: A B C D E
## var1 var2 var3
## 1 1 A 5
## 2 2 B 10
## 3 3 C 8
## 4 4 D 11
## 5 5 E 12
## 6 1 A 8
As seen above, the value has been changed in the dataframe. The numbers within the square brackets of a dataframe correspond to row and column positions, in that order.
Now that we have covered some basics of handling, reading, and transforming data, it is time to start analyzing it. Typically, when examining a survey or any other dataset, you might want to start by exploring some basic descriptives such as measures of central tendency, dispersion, and frequency tables. R provides basic functions for this, such as mean and table. In the next section, we will look at how to use these functions for some fundamental data analysis. As you will see, they are useful but have limitations, some of which can be overcome by using packages designed for such tasks.
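The bracket notation can be sketched as follows. The data frame is rebuilt here so the block runs on its own, and the column name var3_doubled is a made-up example of a variable created through transformation.

```r
# Rebuild the six-row data frame from this section
newdataframe <- data.frame(
  var1 = c(1, 2, 3, 4, 5, 1),
  var2 = c("A", "B", "C", "D", "E", "B"),
  var3 = c(5, 10, 8, 11, 12, 8)
)

newdataframe[6, 2]          # row 6, column 2: current value of var2
newdataframe[6, 2] <- "A"   # overwrite that single cell
newdataframe[6, "var2"]     # the same cell, addressed by column name

# A new variable created by transforming an existing one
newdataframe$var3_doubled <- newdataframe$var3 * 2
newdataframe
```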
5.4 Descriptive Statistics on Single Variables
Typically, we first examine descriptive statistics on individual variables. For some variables, these might be measures of central tendency such as the mean and median; for others, frequency tables are more appropriate. R has basic built-in functions for such statistics, but as shown below they have some limitations. These functions include mean, median, and table, for means, medians, and frequency tables respectively. Below are examples of basic statistics on two of the questions: the mean and median of Q1, which corresponds to satisfaction with the store, and a frequency table of gender.
## [1] 2.9995
## [1] 3
##
## Male Female
## 995 1005
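The same three functions can be tried on simulated data. The sketch below does not use the chapter's survey file; the vectors q1 and gender are random stand-ins, so the numbers will differ from those shown above.

```r
set.seed(1)  # simulated stand-in for the survey; not the chapter's data
q1 <- sample(1:5, 200, replace = TRUE)
gender <- factor(sample(c("Male", "Female"), 200, replace = TRUE),
                 levels = c("Male", "Female"))

mean(q1)                 # average rating
median(q1)               # middle rating
counts <- table(gender)  # frequency of each category
counts

# mean() returns NA if any value is missing unless na.rm = TRUE
mean(c(q1, NA))
mean(c(q1, NA), na.rm = TRUE)
```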
There are times we would like to look at the responses as proportions or percentages rather than counts. R has a built-in function, prop.table, for this purpose, which may be used as follows:
##
## Male Female
## 0.4975 0.5025
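As a small worked check, prop.table simply divides each count by the total. The sketch below starts from the gender counts reported above; the names counts and shares are illustrative.

```r
# Counts taken from the gender table in this section
counts <- c(Male = 995, Female = 1005)

shares <- prop.table(counts)   # each count divided by the total of 2000
shares
round(100 * shares, 1)         # the same shares expressed as percentages
```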
It is also possible to use these basic functions to compute a two-way table, for example satisfaction with store by gender, first as counts and then as proportions of the overall total:
##
## Male Female
## Very dissatisfied 199 176
## Somewhat dissatisfied 223 208
## Neither satisfied nor dissatisfied 193 213
## Somewhat satisfied 200 196
## Very satisfied 180 212
##
## Male Female
## Very dissatisfied 0.0995 0.0880
## Somewhat dissatisfied 0.1115 0.1040
## Neither satisfied nor dissatisfied 0.0965 0.1065
## Somewhat satisfied 0.1000 0.0980
## Very satisfied 0.0900 0.1060
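A two-way table and its proportions can be sketched on simulated responses as follows. The data are random and include some missing answers on purpose; the margin argument of prop.table and the useNA argument of table are standard base R options.

```r
set.seed(2)  # simulated responses, for illustration only
gender <- factor(sample(c("Male", "Female"), 300, replace = TRUE),
                 levels = c("Male", "Female"))
q1 <- sample(c(1:5, NA), 300, replace = TRUE)  # NA marks a skipped question

two_way <- table(q1, gender)           # rows: rating, columns: gender
two_way
prop.table(two_way)                    # share of the grand total
prop.table(two_way, margin = 2)        # share within each gender (columns sum to 1)

# By default table() drops missing values; useNA brings them back
table(q1, useNA = "ifany")
```

Choosing margin = 1 instead would give shares within each rating, so that rows sum to 1.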
However, as seen here, these basic built-in functions are a little cumbersome, as they need to be run separately for counts and percentages. In addition, by default they leave missing values out of the calculations and so give no information about them. Thus, the functions provided in the package expss that we loaded earlier are useful. The functions to use in expss are fre for frequency tables and cro for cross tabulations, as follows:
Gender | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
---|---|---|---|---|---|
Male | 995 | 49.8 | 49.8 | 49.8 | 49.8 |
Female | 1005 | 50.2 | 50.2 | 50.2 | 100.0 |
#Total | 2000 | 100 | 100 | 100 | |
<NA> | 0 | 0.0 |
| | Gender: Male | Gender: Female |
---|---|---|
Satisfaction with store | ||
Very dissatisfied | 199 | 176 |
Somewhat dissatisfied | 223 | 208 |
Neither satisfied nor dissatisfied | 193 | 213 |
Somewhat satisfied | 200 | 196 |
Very satisfied | 180 | 212 |
#Total cases | 995 | 1005 |
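The function cro in expss has variants, such as cro_cpct, cro_rpct, and cro_tpct, that compute not just counts but also column, row, and total percentages. The two tables below show column percentages (each gender sums to 100) followed by percentages of the overall total: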
| | Gender: Male | Gender: Female |
---|---|---|
Satisfaction with store | ||
Very dissatisfied | 20.0 | 17.5 |
Somewhat dissatisfied | 22.4 | 20.7 |
Neither satisfied nor dissatisfied | 19.4 | 21.2 |
Somewhat satisfied | 20.1 | 19.5 |
Very satisfied | 18.1 | 21.1 |
#Total cases | 995 | 1005 |
| | Gender: Male | Gender: Female |
---|---|---|
Satisfaction with store | ||
Very dissatisfied | 10.0 | 8.8 |
Somewhat dissatisfied | 11.2 | 10.4 |
Neither satisfied nor dissatisfied | 9.7 | 10.7 |
Somewhat satisfied | 10.0 | 9.8 |
Very satisfied | 9.0 | 10.6 |
#Total cases | 995 | 1005 |
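The calls behind tables like these are not shown in the text. A sketch of how fre and the cro family might be used, on a simulated and labelled data frame named sim (an assumption; the chapter's data frame is df), is given below.

```r
library(expss)

set.seed(3)  # simulated stand-in for the survey data
sim <- data.frame(
  gender = sample(1:2, 200, replace = TRUE),
  Q1     = sample(1:5, 200, replace = TRUE)
)
var_lab(sim$gender) <- "Gender"
val_lab(sim$gender) <- c("Male" = 1, "Female" = 2)
var_lab(sim$Q1) <- "Satisfaction with store"

freq <- fre(sim$gender)       # counts, percentages, and a row for missing values
freq
cro(sim$Q1, sim$gender)       # counts
cro_cpct(sim$Q1, sim$gender)  # column percentages
cro_rpct(sim$Q1, sim$gender)  # row percentages
tab <- cro_tpct(sim$Q1, sim$gender)  # percentages of the overall total
tab
```

In each cro call the first argument supplies the rows and the second the columns.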
5.5 Statistics on Multiple Variables
Although the above functions are useful for looking at variables one or two at a time, there are times when it is easier to compute summary statistics on several variables at once. For example, the basic function summary provides multiple descriptive statistics on a single variable. Since variables Q1 through Q5 are all numeric, it would be useful to run summary statistics on all five of them using a single command. This may be done by placing them in a single data frame and then running summary on the resulting data frame, as follows.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.999 4.000 5.000
newdataframe<-data.frame(df$Q1,df$Q2,df$Q3,df$Q4,df$Q5) #creating a new data frame with five variables of interest
summary(newdataframe)
## df.Q1 df.Q2 df.Q3 df.Q4
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.00
## Median :3.000 Median :3.000 Median :3.000 Median :3.00
## Mean :2.999 Mean :3.034 Mean :2.995 Mean :2.97
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.00
## df.Q5
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 5.000
## Mean : 4.941
## 3rd Qu.: 8.000
## Max. :10.000
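An alternative that keeps the original column names, and that returns one statistic per column, can be sketched in base R. The data frame ratings is simulated and stands in for Q1 through Q3 only.

```r
set.seed(4)  # simulated ratings standing in for Q1 to Q3
ratings <- data.frame(
  Q1 = sample(1:5, 100, replace = TRUE),
  Q2 = sample(1:5, 100, replace = TRUE),
  Q3 = sample(1:5, 100, replace = TRUE)
)

summary(ratings$Q1)                 # one variable
summary(ratings[, c("Q1", "Q2")])   # selected columns, names kept as Q1 and Q2

# One set of statistics per column, collected into a small table
stats <- sapply(ratings, function(v) c(mean = mean(v), sd = sd(v), n = length(v)))
round(stats, 2)
```

Selecting columns by name avoids the df. prefix that data.frame(df$Q1, ...) adds to the variable names.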
It is also useful to compute statistics such as counts and percentages on multiple variables. The tab functions in expss do this through what is known as piping, a concept you will get more used to as you advance. A piped table has three components followed by a final step: first the dataframe on which the functions operate, then the variables to tabulate (tab_cells), then the statistic of interest (for example tab_stat_cases for counts or tab_stat_cpct for column percentages), and finally the function tab_pivot, which creates the table.
| | #Total |
---|---|
Gender | |
Male | 995 |
Female | 1005 |
#Total cases | 2000 |
| | #Total |
---|---|
Gender | |
Male | 49.8 |
Female | 50.2 |
#Total cases | 2000 |
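The pipelines that produce tables of counts and column percentages like the two above are not shown. A sketch of how they might look, using a simulated data frame named sim rather than the chapter's df, follows.

```r
library(expss)  # assumed to make the %>% pipe available, as in the chapter's code

set.seed(5)  # simulated data, for illustration only
sim <- data.frame(gender = sample(1:2, 150, replace = TRUE))
var_lab(sim$gender) <- "Gender"
val_lab(sim$gender) <- c("Male" = 1, "Female" = 2)

# Step 1: data frame; step 2: variables; step 3: statistic; step 4: build the table
counts_tab <- sim %>%
  tab_cells(gender) %>%
  tab_stat_cases() %>%
  tab_pivot()
counts_tab

# Swapping the statistic turns counts into column percentages
pct_tab <- sim %>%
  tab_cells(gender) %>%
  tab_stat_cpct() %>%
  tab_pivot()
pct_tab
```

Because only the statistic step changes between the two pipelines, the same pattern can be reused for other tab_stat functions.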
## We could add additional variables as follows
df %>% tab_cells(gender,Q1) %>% tab_stat_cpct() %>% tab_pivot()
| | #Total |
---|---|
Gender | |
Male | 49.8 |
Female | 50.2 |
#Total cases | 2000 |
Satisfaction with store | |
Very dissatisfied | 18.8 |
Somewhat dissatisfied | 21.6 |
Neither satisfied nor dissatisfied | 20.3 |
Somewhat satisfied | 19.8 |
Very satisfied | 19.6 |
#Total cases | 2000 |
## We could add column variables to create cross tabulations as follows
df %>% tab_cols(gender) %>% tab_cells(age,Q1) %>% tab_stat_cpct() %>% tab_pivot()
| | Gender: Male | Gender: Female |
---|---|---|
Age | ||
18-24 | 14.0 | 14.3 |
25-34 | 21.9 | 20.6 |
35-54 | 31.7 | 29.0 |
55-64 | 13.7 | 15.0 |
65+ | 18.8 | 21.1 |
#Total cases | 995 | 1005 |
Satisfaction with store | ||
Very dissatisfied | 20.0 | 17.5 |
Somewhat dissatisfied | 22.4 | 20.7 |
Neither satisfied nor dissatisfied | 19.4 | 21.2 |
Somewhat satisfied | 20.1 | 19.5 |
Very satisfied | 18.1 | 21.1 |
#Total cases | 995 | 1005 |
In summary, in this chapter you learned how to quickly compute tables and cross-tabulations, along with creating them in ways that are easy to quickly interpret and see patterns in the data. In the next chapter, the focus is on creating visual charts to represent data in different ways.
References
Demin, Gregory. 2020. Expss: Tables, Labels and Some Useful Functions from Spreadsheets and ’Spss’ Statistics. https://CRAN.R-project.org/package=expss.
Hainke, Michael. 2018. Getting Started Using Survey Data for Analysis. https://www.hainke.ca/.