Chapter 8 Tidying Data
Hello! In this tutorial, we’ll be going over how I cleaned the original (or “raw”) public opinion dataset for the first assignment. Let’s first make sure to load the packages we’ll be using: janitor and tidyverse.
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0 v purrr 0.3.5
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.1
## v readr 2.1.3 v forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
survey_data <- read.csv("data/pew_AmTrendsPanel_110_2022.csv") |> # don't forget to make sure this file is in your working directory!
  clean_names() # This is a janitor function that cleans the column names of your dataset (i.e., replaces spaces with underscores and lowercases everything)
Some of these functions are a repeat of what you would have learned in the dplyr tutorial.
8.1 select()
As you can see, the dataset has many (175) variables. Let’s focus on the ones we’re really interested in by using the select() function in the dplyr package. select() takes the following structure: select(<data>, <some way of telling the computer what variables to select>). There are actually quite a few ways to do this (you can check them out in ?select), but the one we’ll use here is the variable name, so the code will look like this: select(<data>, c(<variable1>, <variable2>, ...)).
survey_select <- select(survey_data, c(diffparty_w110, sengun22b_w110, persfnc_w110, party_w110, ptyissue_abpol_w110,
f_agecat, f_educcat, f_inc_sdt1, f_reg))
glimpse(survey_select) # let's take a look at the data!
## Rows: 6,174
## Columns: 9
## $ diffparty_w110 <int> 1, 1, 1, 3, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1,~
## $ sengun22b_w110 <int> 5, 2, 2, 4, 3, 1, 1, 2, 2, 2, 3, 1, 2, 1, 2, 1, 1,~
## $ persfnc_w110 <int> 2, 2, 3, 4, 2, 4, 3, 1, 3, 3, 3, 1, 2, 2, 2, 2, 2,~
## $ party_w110 <int> 1, 1, 1, 1, 1, 4, 3, 2, 2, 1, 1, 2, 1, 2, 4, 2, 1,~
## $ ptyissue_abpol_w110 <int> 1, 1, 1, 1, 1, 4, 1, 4, 4, 2, 1, 4, 1, 2, 3, 4, 4,~
## $ f_agecat <int> 4, 4, 3, 3, 1, 2, 4, 4, 4, 3, 3, 4, 4, 4, 3, 4, 3,~
## $ f_educcat <int> 2, 1, 2, 2, 1, 3, 3, 2, 3, 3, 2, 1, 1, 2, 2, 2, 1,~
## $ f_inc_sdt1 <int> 3, 9, 9, 6, 9, 1, 2, 3, 3, 8, 5, 9, 9, 4, 3, 6, 6,~
## $ f_reg <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
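To give a feel for the "quite a few ways" of selecting mentioned above, here is a minimal sketch on a small made-up data frame. The object `toy` and its values are assumptions for illustration only; the column names are borrowed from the survey so the pattern is easy to map back.

```r
library(dplyr)

# Toy data frame standing in for survey_data (values are made up)
toy <- data.frame(
  party_w110   = c(1, 2, 3),
  persfnc_w110 = c(2, 2, 4),
  f_agecat     = c(4, 3, 1),
  f_educcat    = c(2, 1, 3)
)

# 1. Select by variable name (the approach used in this chapter)
by_name <- select(toy, c(party_w110, f_agecat))

# 2. Select every column whose name starts with "f_"
by_prefix <- select(toy, starts_with("f_"))

# 3. Select a run of adjacent columns, from party_w110 through f_agecat
by_range <- select(toy, party_w110:f_agecat)

colnames(by_name)
colnames(by_prefix)
colnames(by_range)
```

Selecting by name is the most explicit option, which is why it is a reasonable default when you only need a handful of variables.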
8.2 Rename Variables
Now that we have selected specific variables, let’s rename them so they make more sense when we use them. If you remember from Chapter 7, we can use the rename() function in dplyr to do this. Remember that rename() uses the following syntax: rename(dataset, new_name = old_name)
survey_rename <- survey_select |> # |> is the same as %>%
rename(perceived_party_difference = diffparty_w110,
gun_bill_approval = sengun22b_w110,
personal_fin = persfnc_w110,
abortion_opinion = ptyissue_abpol_w110,
party = party_w110,
age = f_agecat,
education = f_educcat,
income = f_inc_sdt1,
voter = f_reg)
#View(survey_rename)
Want to see the new names of these columns? Use colnames()!
## [1] "perceived_party_difference" "gun_bill_approval"
## [3] "personal_fin" "party"
## [5] "abortion_opinion" "age"
## [7] "education" "income"
## [9] "voter"
8.3 filter()
Now that we’ve organized our columns, let’s see how to filter this dataset further. If you remember from Chapter 7, we can do this using the filter() function. Recall that the filter() function uses the following structure: filter(<data>, <some logical condition involving a variable>).
If we want to look at Republicans, for example, we could use the party affiliation variable (party). This variable has the following codes: 1 stands for Republican, 2 for Democrat, 3 for Independent, 4 for something else, and 99 for did not answer. We’ll use the party variable to filter out non-Republicans.
We can use table() to learn more about this specific variable.
##
## 1 2 3 4 99
## 1551 1886 1885 777 75
This result shows that we have 1551 Republicans (1), 1886 Democrats (2), and 1885 independents (3). So, when we use filter() to keep only Republicans, we should see 1551 rows.
survey_rename |> #take the dataset
filter(party == 1) |> #filter by Republicans
nrow() #get the number of rows (where the participant is a Republican)
## [1] 1551
8.4 subset()
You can also use subset() in base R, which is a generic function for selecting values (or rows in a data frame) based on a criterion. The subset() function also works on one-dimensional structures (vectors and lists).
While filter() and select() from dplyr are both useful, subset() remains quite popular because you can filter rows and select columns in a single call. Aside from the dataset, subset() takes two common arguments: a logical expression that indicates which rows you want to keep, and a select argument in case you want to keep only a specific set of variables. Let’s see how these are used together:
#The structure is: new_object <- subset(old_object, logical expression, variable selection)
survey_subset <- subset(survey_rename, party == 1, select = c(party, gun_bill_approval, education))
# In the above line, I focus on Republicans (party == 1) and keep the party, gun_bill_approval, and education variables
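For comparison, here is a minimal sketch of the same idea written both ways: once with subset() and once as a filter() followed by select(). The data frame `toy` and its values are made up for illustration; only the column names come from this chapter.

```r
library(dplyr)

# Toy stand-in for survey_rename (values are made up)
toy <- data.frame(
  party             = c(1, 2, 1, 3),
  gun_bill_approval = c(5, 2, 2, 4),
  education         = c(2, 1, 2, 3),
  age               = c(4, 4, 3, 1)
)

# Base R: filter rows and select columns in one call
with_subset <- subset(toy, party == 1,
                      select = c(party, gun_bill_approval, education))

# dplyr: the same two steps, written as a pipeline
with_dplyr <- toy |>
  filter(party == 1) |>
  select(party, gun_bill_approval, education)

with_subset
with_dplyr
```

Both objects contain the same rows and columns. One visible difference is that subset() keeps the original row names, while filter() renumbers the rows.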
8.5 Coercion
While the numbers are useful, it can also be annoying to constantly go back and forth with the codebook. Therefore, it may be useful to change the variable from one data type (numeric) to another (in this case, a factor). This process is called coercion. Most coercion functions in R use the following structure: as.<datatype>(variable). Let’s use this now to coerce a numeric into a factor, so we’ll use as.factor(<numeric_variable>). We’ll use mutate() to do this.
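Here is a minimal sketch of that coercion step on a small made-up data frame. The object names `toy_rename` and `toy_recoded` are hypothetical stand-ins for the survey objects, and the exact pipeline used on the real data is an assumption.

```r
library(dplyr)

# Toy stand-in for survey_rename, using the party codes from the codebook
toy_rename <- data.frame(party = c(1, 1, 4, 2, 99, 3))

# Coerce the numeric party variable into a factor inside mutate()
toy_recoded <- toy_rename |>
  mutate(party = as.factor(party))

class(toy_recoded$party)   # "factor"
levels(toy_recoded$party)  # "1" "2" "3" "4" "99"
```

Note that the levels are now character labels ("1", "2", ...) rather than numbers, which is what lets us rename them in the next section.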
8.5.1 Renaming Factor Levels
As we have discussed, the factor data type is somewhat unique to R and is most often used with categorical variables. Factor variables can have their levels renamed, making it easier to interpret results with this variable (a factor level is one category in the variable). For example, in the party variable we have been working with, there are five levels: Republicans, Democrats, independents, other party, and NA. But right now, we’re seeing this variable as numbers (1, 2, 3, 4, 99).
## [1] 1 1 1 1 1 4
## Levels: 1 2 3 4 99
So let’s change this using a base R function, levels(). levels() tells you the levels of a factor variable. levels(survey_recoded$party) will return a vector of the level names. In order to use levels(), the variable needs to be a factor (it cannot be a character or numeric vector).
## [1] "1" "2" "3" "4" "99"
But we can also use levels() to relabel these levels. We can do so using an <- assignment. To the left of the <- is the levels() function. To the right of the <- is a character vector of names (we will have 5, one for each category), listed in the same order as the existing levels. In plain English, the line of code below takes the names on the right and assigns them to the levels on the left.
levels(survey_recoded$party) <- c("Republicans","Democrats","independent", "other party", "NA")
#levels(survey_recoded$party) #use this to see the levels
We can also use logical expressions to understand more about the data. Logical expressions use “relational operators” like > (greater than) or <= (less than or equal to). == means “equal to” and != means “not equal to”. The output of a logical expression is a logical value: TRUE or FALSE. So, if we want to filter by rows where the party variable (party) is “Democrats”, we would use the logical expression party == "Democrats". Let’s just see what happens when we use this logical expression on its own with the first value of the party variable in the data frame:
## [1] FALSE
Notice that this returns FALSE. This means that the first participant is not a Democrat.
What if we used "Republicans"?
## [1] TRUE
Because this returns TRUE, we know that the first participant is a Republican.
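The checks above can be sketched on a small made-up factor. The vector `party` below is a toy example, not the survey data, and it uses the level labels assigned with levels() earlier; the comparison has to match a label exactly to return TRUE.

```r
# Toy factor using the relabeled party levels (values are made up)
party <- factor(c("Republicans", "Democrats", "independent"))

# Test only the first participant
first_is_dem <- party[1] == "Democrats"    # FALSE
first_is_rep <- party[1] == "Republicans"  # TRUE

# Applied to the whole vector, we get one TRUE/FALSE per participant
not_rep <- party != "Republicans"          # FALSE TRUE TRUE

# Relational operators work the same way on numbers
age <- c(4, 3, 1)
age_3_or_more <- age >= 3                  # TRUE TRUE FALSE
```

This vector of TRUE and FALSE values is exactly what filter() and subset() use to decide which rows to keep.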
Let’s now try this again with the education variable. We’ll first coerce it into a factor and rename the levels of the factor. Then, we will use a logical expression to filter the data for those who did not attend college (HS grad or less).
The logical expression education == "HS grad or less" means that any row in which education is “HS grad or less” will be counted as TRUE and will be included in the dataset. Rows in which education is not that (e.g., “Some college”) will be counted as FALSE and excluded from the subsetted dataset.
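A minimal sketch of these three steps (coerce, relabel, filter) on made-up data follows. The object names `toy`, `toy_recoded2`, and `toy_hs` are hypothetical. The label order follows the table() output shown later in this chapter, and the assumption that 99 is the "did not answer" code for education should be checked against the codebook.

```r
library(dplyr)

# Toy stand-in for the education column (values are made up)
toy <- data.frame(education = c(2, 1, 2, 3, 99, 3))

# Step 1: coerce the numeric codes into a factor
toy_recoded2 <- toy |>
  mutate(education = as.factor(education))

# Step 2: relabel the levels, in the same order as the existing levels
# ("1", "2", "3", "99"); this mapping is an assumption, check the codebook
levels(toy_recoded2$education) <- c("College Grad+", "Some college",
                                    "HS grad or less", "NA")

# Step 3: keep only the rows where the logical expression is TRUE
toy_hs <- toy_recoded2 |>
  filter(education == "HS grad or less")

nrow(toy_hs)  # 2
```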
8.5.2 Coercing Back
What if you want to turn a factor variable into a numeric? You can also use coercion, but instead of using as.factor(), you would use as.numeric().
When I do this, I like to use the table() function to see what is in the variable. I may also use str() to identify the data type of a variable.
table(survey_recoded2$education) #note that we are using the survey_recoded2 data, as the survey_subset data only includes HS grads or less
##
## College Grad+ Some college HS grad or less NA
## 2763 1820 1573 18
# if you wanted to remove survey_subset from your environment, since we won't be using it anymore, use remove().
# For example, to remove "survey_subset", use remove(survey_subset)
str(survey_recoded2$education)
## Factor w/ 4 levels "College Grad+",..: 2 1 2 2 1 3 3 2 3 3 ...
as.numeric(survey_recoded2$education) |> #coerce the variable back into a numeric (this returns the integer level codes)
  head() #look at the first 6 values
## [1] 2 1 2 2 1 3
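One thing to watch for: as.numeric() on a factor returns the position of each level (1, 2, 3, ...), not whatever number the label happens to show. That works out here because the education codes were 1, 2, and 3 to begin with, but it will not always. A small made-up example (the vector `f` is hypothetical):

```r
# A factor whose labels look like numbers
f <- factor(c("10", "20", "99", "10"))

# as.numeric() gives the level positions, not the labels
level_codes <- as.numeric(f)                # 1 2 3 1

# To recover the numbers shown in the labels, go through character first
label_values <- as.numeric(as.character(f)) # 10 20 99 10
```

So if the labels themselves are the numbers you want back, coerce to character first and then to numeric.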
Coercion is closely related to recoding. Both involve re-organizing the values of a variable so that it can be used for subsequent statistical analysis. Recoding is an essential (albeit tedious) process that varies from dataset to dataset.
8.5.3 Removing Levels
Often, there are levels that may not be of interest to you. There are a couple of different ways we can remove these. First, we can remove them using a base R strategy that leverages brackets.
survey_removelevels <- survey_recoded2 #we may want to look at independents in other analyses, so let's save our dataset in a new object that we can play around with. This allows us to test new things with the dataset without losing the last one.
levels(survey_removelevels$party) #check out the levels
## [1] "Republicans" "Democrats" "independent" "other party" "NA"
levels(survey_removelevels$party)[levels(survey_removelevels$party)=="independent"] <- NA #removes this level
levels(survey_removelevels$party)[levels(survey_removelevels$party)=="other party"] <- NA #removes this level
levels(survey_removelevels$party)[levels(survey_removelevels$party)=="NA"] <- NA #removes this level
levels(survey_removelevels$party)
## [1] "Republicans" "Democrats"
Notice that when we call levels() a second time, there are only two levels. This process does not delete the survey participants; it just sets “independent”, “other party”, and “NA” to NA.
## [1] Republicans Republicans Republicans Republicans Republicans <NA>
## [7] <NA> Democrats Democrats Republicans
## Levels: Republicans Democrats
The “spoken” way to interpret this code would be: in the brackets, identify the level of party that is named "independent". Then replace that level (the levels are the information outside of the brackets) with NA (hence the <- NA at the end). Every row that had that level now shows NA instead.
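To confirm that the bracket approach keeps every participant and only blanks out the level, here is a minimal sketch on a made-up factor (the vector `party` is a toy example, not the survey data):

```r
# Toy factor with three levels (values are made up)
party <- factor(c("Republicans", "Democrats", "independent", "Republicans"))
n_before <- length(party)

# Set the "independent" level to NA
levels(party)[levels(party) == "independent"] <- NA

n_after <- length(party)      # same length as before: no rows were dropped
n_missing <- sum(is.na(party)) # the former independent is now NA

# useNA = "ifany" makes table() show the NA count alongside the levels
table(party, useNA = "ifany")
```

If you later want to drop those participants entirely, that is a separate filtering step.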
Unsurprisingly, there is also a dplyr strategy. There are two important functions we will use here: mutate() and recode(). With mutate(), we can create or change variables in a data frame. With recode(), we can replace old factor levels with NA, or, in this case, NA_character_, which is a variant of NA for character strings (keep in mind that a factor’s level names are character strings, which is why we use NA_character_).
survey_removelevels <- survey_recoded2
survey_removelevels <- survey_removelevels |>
mutate(party = recode(party, Republicans = "rep", Democrats = "dem", .default = NA_character_))
Notice that in the above, we can actually use recode() both to rename levels and to replace levels with NA.
Before we finish, let’s make sure to save this file using write.csv() to save the data frame as a csv, or save() to save the data as a .Rdata file.
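Here is a minimal sketch of both saving options. The data frame `toy` and the file name `survey_clean` are made up, and the files are written to a temporary folder so the example does not touch your project; in practice you would point these at your own data folder.

```r
# Toy cleaned data frame (values are made up)
toy <- data.frame(
  party     = c("rep", "dem"),
  education = c("Some college", "HS grad or less")
)

# Hypothetical output paths in a temporary folder
out_dir    <- tempdir()
csv_path   <- file.path(out_dir, "survey_clean.csv")
rdata_path <- file.path(out_dir, "survey_clean.RData")

# Option 1: a csv; row.names = FALSE avoids writing an extra index column
write.csv(toy, csv_path, row.names = FALSE)

# Option 2: an R data file, which stores the object exactly as it is
save(toy, file = rdata_path)

# Read the csv back in to check that it was written correctly
reloaded <- read.csv(csv_path)
dim(reloaded)
```

A csv is easy to share but stores everything as plain text, so factor variables have to be coerced again after reading it back in. A file written with save() keeps the data types, and load() restores the object under its original name.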
As I mentioned, each dataset will come with its own data cleaning issues–no two datasets are alike in their problems. However, the more datasets you work with, the better you will get at devising a plan for cleaning the data. While data cleaning can be daunting, I hope this chapter (and the last one) will help you clean your own datasets.