5 Tutorial 5: Inspecting & transforming objects
After working through Tutorial 5, you’ll…
- understand how to inspect (elements of) objects
- understand how to edit (elements of) objects
For this tutorial, we’ll again use the data set “data_tutorial3.csv” (via Moodle/Data for R). The data set has already been introduced and explained in Tutorial 3: Objects & structures in R, so I won’t go into detail here.
survey <- read.csv2("data_tutorial3.csv", header = TRUE)
By now, you should already know several ways of inspecting an object:
- Simply call the object’s name (or use print()):
survey
## name age date outlet outlet_use outlet_trust
## 1 Alexandra 20 2021-09-09 TV 2 5
## 2 Alex 25 2021-09-08 Online 3 5
## 3 Maximilian 29 2021-09-09 Zeitung 4 1
## 4 Moritz 22 2021-09-06 TV 2 2
## 5 Vanessa 25 2021-09-07 Online 1 3
## 6 Andrea 26 2021-09-09 Online 3 4
## 7 Fabienne 26 2021-09-09 TV 3 2
## 8 Fabio 27 2021-09-09 Online 0 1
## 9 Magdalena 8 2021-09-08 Online 1 4
## 10 Tim 26 2021-09-07 TV NA 2
## 11 Alex 27 2021-09-09 Online NA 2
## 12 Tobias 26 2021-09-07 Online 2 2
## 13 Michael 25 2021-09-09 Online 3 2
## 14 Sabrina 27 2021-09-08 Online 1 2
## 15 Valentin 29 2021-09-09 TV 1 5
## 16 Tristan 26 2021-09-09 TV 2 5
## 17 Martin 21 2021-09-09 Online 1 2
## 18 Anna 23 2021-09-08 TV 3 3
## 19 Andreas 24 2021-09-09 TV 2 5
## 20 Florian 26 2021-09-09 Online 1 5
- Use the View() command to inspect the object in a separate tab in the window “Script”:
View(survey)
Image: Data set survey
- Or use head(), which prints out the first elements of an object:
head(survey)
## name age date outlet outlet_use outlet_trust
## 1 Alexandra 20 2021-09-09 TV 2 5
## 2 Alex 25 2021-09-08 Online 3 5
## 3 Maximilian 29 2021-09-09 Zeitung 4 1
## 4 Moritz 22 2021-09-06 TV 2 2
## 5 Vanessa 25 2021-09-07 Online 1 3
## 6 Andrea 26 2021-09-09 Online 3 4
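As a small sketch of these inspection commands, the block below rebuilds the first three rows of survey by hand. The object survey_toy is a stand-in introduced here so that the example runs without the CSV file. The sketch also uses str(), nrow() and ncol(), which summarise an object's structure rather than printing all of its values.

```r
# Stand-in for the first three rows of survey (assumption: built by hand for illustration)
survey_toy <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian"),
  age = c(20, 25, 29),
  outlet = c("TV", "Online", "Zeitung"),
  stringsAsFactors = FALSE
)

# head() accepts the argument n to control how many rows are returned
first_two <- head(survey_toy, n = 2)
print(first_two)

# str() shows the type of each column; nrow() and ncol() give the dimensions
str(survey_toy)
n_rows <- nrow(survey_toy)
n_cols <- ncol(survey_toy)
```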
5.1 Inspecting objects
5.1.1 Inspecting scalars
Inspecting scalars is easy, given that they only consist of a single value. Simply call the object. Here, we’ll use the object number consisting of the single value 3 as an example:
number <- 3
number
## [1] 3
5.1.2 Inspecting vectors
Let’s say you don’t have a single number but a row of numbers consisting of numeric data, e.g., the numbers from 1 to 20. This data is saved as the vector numbers:
numbers <- c(1:20)
If you want to inspect all elements of numbers, again, simply call the object:
numbers
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
However, maybe you only want to know the value of the first or second element of numbers, that is, the first or second number.
We can use indexing to retrieve the corresponding element of numbers. In short, you tell R to only retrieve certain elements according to their position, i.e., their index. To do so, we use square brackets [] (see footnote 5).
Let’s say we want to retrieve the first element of numbers:
numbers[1]
## [1] 1
… now the second:
numbers[2]
## [1] 2
… and now the first, third, and fifth element:
numbers[c(1,3,5)]
## [1] 1 3 5
… and, lastly, the first to eighth elements as well as the element at position thirteen:
numbers[c(1:8,13)]
## [1] 1 2 3 4 5 6 7 8 13
As you see, you only need to remember some rules:
- Objects can be indexed by their position.
- You can either specify separate positions or, using the colon :, retrieve elements at several positions at the same time.
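To illustrate the two rules, here is a minimal sketch showing that a range written with the colon is simply shorthand for typing out every position. The object names picked and picked_long are introduced here for illustration only.

```r
numbers <- c(1:20)

# The colon expands to every whole number between its two ends
one_to_eight <- 1:8
print(one_to_eight)

# Separate positions and ranges can be combined inside c()
picked <- numbers[c(1:8, 13)]
picked_long <- numbers[c(1, 2, 3, 4, 5, 6, 7, 8, 13)]

# Both spellings retrieve the same elements
print(identical(picked, picked_long))
```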
5.1.3 Inspecting data frames
Remember that data frames consist of vectors of the same length.
Take, for instance, our data set survey: Its columns describe variables, its rows describe observations.
head(survey)
## name age date outlet outlet_use outlet_trust
## 1 Alexandra 20 2021-09-09 TV 2 5
## 2 Alex 25 2021-09-08 Online 3 5
## 3 Maximilian 29 2021-09-09 Zeitung 4 1
## 4 Moritz 22 2021-09-06 TV 2 2
## 5 Vanessa 25 2021-09-07 Online 1 3
## 6 Andrea 26 2021-09-09 Online 3 4
Unlike with the scalars and vectors above, we can inspect elements of data frames both by indexing/their position and by their name.
5.1.3.1 Inspection by indexing/position
Say we want to get the name of the seventh respondent in our survey. We know that:
- the first column contains data on respondents’ names
- data for the respondent we are interested in (respondent number 7) is saved in the seventh row
Image: Survey data
We can now access this value by its position using square brackets [], similar to vectors. Here,
- the first element in square brackets contains the row number(s)
- the second element in square brackets contains the column number(s)
Thus, if we want to retrieve data from the seventh row and the first column, we can do so via indexing:
survey[7,1]
## [1] "Fabienne"
In turn, if you do not want to retrieve a single value, but for instance all values for a specific respondent or all values for a specific variable, you could do the same via indexing.
For instance, this command retrieves all answers for the respondent called “Fabienne” (respondent number seven):
survey[7,]
## name age date outlet outlet_use outlet_trust
## 7 Fabienne 26 2021-09-09 TV 3 2
This command, in turn, retrieves all answers for the variable called “name” (column number one):
survey[,1]
## [1] "Alexandra" "Alex" "Maximilian" "Moritz" "Vanessa" "Andrea" "Fabienne" "Fabio" "Magdalena" "Tim" "Alex" "Tobias" "Michael"
## [14] "Sabrina" "Valentin" "Tristan" "Martin" "Anna" "Andreas" "Florian"
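Note that the two commands above return different kinds of objects: a single row comes back as a data frame, while a single column is simplified to a vector. The following sketch makes this visible on a hand-built stand-in called survey_toy (an assumption for illustration, not the real data set). The argument drop = FALSE is a standard option of [ for data frames that prevents the simplification.

```r
# Stand-in for a few rows of survey (assumption: built by hand for illustration)
survey_toy <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian"),
  age = c(20, 25, 29),
  stringsAsFactors = FALSE
)

row_two <- survey_toy[2, ]                   # one row: still a data frame
col_one <- survey_toy[, 1]                   # one column: simplified to a vector
col_one_df <- survey_toy[, 1, drop = FALSE]  # drop = FALSE keeps the data frame

# Several rows and columns at once: rows 1 and 3, columns 1 to 2
subset_toy <- survey_toy[c(1, 3), 1:2]

print(class(row_two))
print(class(col_one))
print(class(col_one_df))
```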
5.1.3.2 Inspection by name
Another way of accessing elements is by their name. For instance, we could retrieve the variable “name” by simply using the variable name: We specify the object we want to access, the data frame survey, to then retrieve the column name via the operator $:
survey$name
## [1] "Alexandra" "Alex" "Maximilian" "Moritz" "Vanessa" "Andrea" "Fabienne" "Fabio" "Magdalena" "Tim" "Alex" "Tobias" "Michael"
## [14] "Sabrina" "Valentin" "Tristan" "Martin" "Anna" "Andreas" "Florian"
5.1.3.3 Inspection by index/positioning and name
You will often retrieve data via a mix of indexing/positioning and data retrieval via column names.
For instance, if you have a data set with many different variables, you may not want to count the exact position of the variable of interest to you. In this case, it may be more useful to index rows by their position but columns by their name like so:
survey[7,c("name")]
## [1] "Fabienne"
Or you could simply retrieve the variable itself to then get the 7th element of the corresponding vector (knowing that observation number seven belongs to “Fabienne”):
survey$name[7]
## [1] "Fabienne"
In short: As with many problems in R, there isn’t just one solution - many variations of code will lead to the same solution (but may differ in efficiency).
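A short sketch can confirm that these variations really do return the same value. It uses a hand-built stand-in survey_toy (an assumption for illustration) and adds one more spelling, double square brackets with a column name, which is standard base R but not otherwise used in this tutorial.

```r
# Stand-in for a few rows of survey (assumption: built by hand for illustration)
survey_toy <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian"),
  age = c(20, 25, 29),
  stringsAsFactors = FALSE
)

by_position <- survey_toy[3, 1]        # row and column by position
by_mix <- survey_toy[3, "name"]        # row by position, column by name
by_dollar <- survey_toy$name[3]        # column via $, then the third element
by_double <- survey_toy[["name"]][3]   # [[ ]] also extracts a column by name

# All four spellings yield the same single value
print(identical(by_position, by_mix))
print(identical(by_mix, by_dollar))
print(identical(by_dollar, by_double))
```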
5.1.4 Inspecting lists
Lastly, you can also inspect lists according to their position (or name).
Let’s create a list consisting of the following two elements:
- the data frame survey
- the vector numbers
list <- list(survey,numbers)
You can now access each element of this list via its position. For instance, you could retrieve the second element of list, here the vector numbers, using double square brackets:
list[[2]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
We could also give each element of the list a name via names() to then access elements within list via their name:
names(list) <- c("survey", "numbers")
list["numbers"]
## $numbers
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
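Single and double square brackets behave differently for lists, which is why the output above still carries the label $numbers. As a minimal sketch: single brackets return a (shorter) list, whereas double brackets and $ return the element itself. The object my_list and its contents are made up for illustration; a different name than list is used here so that the function list() is not masked.

```r
numbers <- c(1:20)

# Hypothetical named list with two elements
my_list <- list(words = c("hello", "goodbye"), numbers = numbers)

as_sublist <- my_list["numbers"]    # single brackets: a list of length 1
as_vector <- my_list[["numbers"]]   # double brackets: the vector itself
via_dollar <- my_list$numbers       # $ extracts a named element, like [[ ]]

print(class(as_sublist))
print(class(as_vector))
```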
5.2 Transforming objects
Now that we know that objects in R can be inspected/accessed by position and name, we also (almost) know how to transform data.
In many cases, we need to change existing objects in our working environment - for instance to filter data sets by conditions or to change specific values within an existing object.
5.2.1 Transforming scalars
In the case of scalars, we can simply overwrite our data. Since scalars consist of a single value, this is easily done:
word <- "hello"
word
## [1] "hello"
word <- "goodbye"
word
## [1] "goodbye"
5.2.2 Transforming vectors
We know that we can access vectors via indexing/their position.
Let’s first take the simple case that we want to change all elements belonging to a vector. For instance, we have a vector numbers that consists of all numbers from 1 to 20:
numbers <- c(1:20)
numbers
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
We can edit all values of numbers by, again, overwriting the vector:
numbers <- numbers + 1
numbers
## [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
However, if we only want to change a specific value within numbers, we need to do so via indexing. Say that we only want to change the second number in the vector:
numbers[2] <- 100
numbers
## [1] 2 100 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
We can also, of course, change several elements within a vector. For instance, we may want to also replace the first and the last number:
numbers[c(1,20)] <- 0
numbers
## [1] 0 100 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0
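Positions do not have to be typed out by hand. As a sketch that anticipates the logical conditions used for data frames later in this tutorial, a comparison such as numbers_demo > 15 can stand inside the square brackets. The object numbers_demo is introduced here only for illustration.

```r
numbers_demo <- c(1:20)

# The comparison returns TRUE or FALSE for every element
is_large <- numbers_demo > 15

# Only elements for which the condition is TRUE are overwritten
numbers_demo[is_large] <- 0
print(numbers_demo)
```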
5.2.3 Transforming data frames
The same logic applies to changing values in data frames.
Using the survey data set, we may for example want to only include specific variables or only selected observations (filter data set by condition). Or we may want to change variables, for instance transform the variable on trust in news media scores to low and high trust instead of numeric values from 1 to 5 (change specific values).
5.2.3.1 Including/excluding columns of data frames
Say we want to use the data set survey, but only want to use the variables name and age.
For a better overview, we decide to create a new data set called survey_new, including only the variables name and age from the object survey.
How would we do this?
5.2.3.1.1 Option 1: Indexing/selection via position
We only include selected variables via their position. We know that the variables name and age are saved in the first and second column of our object survey, thus:
survey_new <- survey[,c(1:2)]
str(survey_new)
## 'data.frame': 20 obs. of 2 variables:
## $ name: chr "Alexandra" "Alex" "Maximilian" "Moritz" ...
## $ age : int 20 25 29 22 25 26 26 27 8 26 ...
We could also specify the columns that we want to exclude. The minus sign - in front of the column positions tells R to drop those columns:
survey_new <- survey[,-c(3:6)]
str(survey_new)
## 'data.frame': 20 obs. of 2 variables:
## $ name: chr "Alexandra" "Alex" "Maximilian" "Moritz" ...
## $ age : int 20 25 29 22 25 26 26 27 8 26 ...
5.2.3.1.2 Option 2: Selection via name
Again, the more variables your data frame contains, the more error-prone selection via a variable’s position becomes. In many cases, it may just be easier to select the variables we want to include via their names (see footnote 6):
survey_new <- survey[,c("name", "age")]
str(survey_new)
## 'data.frame': 20 obs. of 2 variables:
## $ name: chr "Alexandra" "Alex" "Maximilian" "Moritz" ...
## $ age : int 20 25 29 22 25 26 26 27 8 26 ...
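Excluding columns by name needs one extra step, because the minus sign only works with positions, not with names. One possible base R sketch uses setdiff() to list every column name except the unwanted ones. The object survey_toy is a hand-built stand-in (an assumption for illustration), and keep is a helper name introduced here.

```r
# Stand-in for a few rows and columns of survey (assumption: built by hand)
survey_toy <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian"),
  age = c(20, 25, 29),
  date = c("2021-09-09", "2021-09-08", "2021-09-09"),
  stringsAsFactors = FALSE
)

# All column names except "date"
keep <- setdiff(names(survey_toy), "date")

# Select the remaining columns by name
survey_toy_new <- survey_toy[, keep]
str(survey_toy_new)
```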
5.2.3.2 Including/excluding rows of data frames
In addition, we may want to only keep or analyze selected rows (in this data set: observations).
Imagine that we want to include only those respondents that are older than 21 years (see footnote 7). While we could also do this via indexing, i.e., write down all row numbers of respondents older than 21 years, this often becomes highly inefficient and prone to errors for many observations (imagine having to do such counting if your data set includes 1,000 observations).
Instead, we work with logical conditions to select rows. In principle, we ask R to consider a specific column - here, the variable survey$age - and only consider those rows where values for this variable take on values higher than 21.
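To make the idea of a logical condition concrete before any package is involved, here is a base R sketch. It assumes a hand-built stand-in survey_toy with four respondents taken from the data set; the names older_than_21 and survey_toy_new are introduced here for illustration.

```r
# Stand-in for four rows of survey (assumption: built by hand for illustration)
survey_toy <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian", "Martin"),
  age = c(20, 25, 29, 21),
  stringsAsFactors = FALSE
)

# The condition is checked row by row and yields TRUE or FALSE
older_than_21 <- survey_toy$age > 21
print(older_than_21)

# Used in the row position, the logical vector keeps only the TRUE rows
survey_toy_new <- survey_toy[older_than_21, ]
print(survey_toy_new)
```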
Here, you should work with the tidyverse, in particular the dplyr package.
The so-called tidyverse is a popular collection of different R packages especially useful and accessible for R “beginners” (see footnote 8).
In the following, I’ll give you a short introduction to the tidyverse (especially dplyr for data wrangling), but not an overall introduction. For more detailed sources on this, see
- The tidyverse Website, especially Wickham’s book R for data science
- Tutorial 12: Tidy data by G. Grolemund and H. Wickham
dplyr is one of the most popular packages belonging to the tidyverse. It contains a lot of really helpful functions for data manipulation, out of which I will only mention a few. In particular, the pipe operator %>% plays a central role.
The pipe operator %>% takes a certain object which is then passed over to one or several functions to its right.
To use dplyr, you first have to install and activate the package. For help, see this useful cheat sheet on dplyr.
install.packages("dplyr")
library("dplyr")
Let’s take the same example as before: We want to include only those observations from the object survey where respondents said that they were older than 21 years.
To do this, we can use dplyr’s filter() function:
survey_new <- survey %>% filter(age > 21)
str(survey_new)
## 'data.frame': 17 obs. of 6 variables:
## $ name : chr "Alex" "Maximilian" "Moritz" "Vanessa" ...
## $ age : int 25 29 22 25 26 26 27 26 27 26 ...
## $ date : chr "2021-09-08" "2021-09-09" "2021-09-06" "2021-09-07" ...
## $ outlet : chr "Online " "Zeitung" "TV" "Online " ...
## $ outlet_use : int 3 4 2 1 3 3 0 NA NA 2 ...
## $ outlet_trust: int 5 1 2 3 4 2 1 2 2 2 ...
How does dplyr work?
- We define the object with which we want to work: the data frame survey
- We pass on the object survey to our pipeline: %>%
- We filter the data set, i.e., keep only those observations where respondents are older than 21 years. We do not have to define the object - note that we only specify age instead of survey$age. We already told R to only use data from the object survey: filter(age > 21)
- We assign the result to a new object survey_new: survey_new <-
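The steps above can be sketched as follows. The pipe hands the object on its left to the function on its right as that function's first argument, so the piped call and the ordinary nested call should give the same result. The stand-in survey_toy and the names piped and nested are assumptions introduced for illustration; the sketch requires the dplyr package to be installed.

```r
library(dplyr)

# Stand-in for four rows of survey (assumption: built by hand for illustration)
survey_toy <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian", "Martin"),
  age = c(20, 25, 29, 21),
  stringsAsFactors = FALSE
)

# With the pipe: survey_toy is passed on as the first argument of filter()
piped <- survey_toy %>% filter(age > 21)

# Without the pipe: the same call written out in full
nested <- filter(survey_toy, age > 21)

print(identical(piped, nested))
```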
5.2.3.3 Including/excluding columns and rows of data frames
In some cases, you may also want to reduce your data frame to specific columns and rows.
Let’s assume you want to reduce the object survey to (a) respondents older than 21 years and (b) the variables name and age.
Using the tidyverse approach, we can combine the functions filter() and select():
survey_new <- survey %>%
  filter(age > 21) %>%
  select(name, age)
str(survey_new)
## 'data.frame': 17 obs. of 2 variables:
## $ name: chr "Alex" "Maximilian" "Moritz" "Vanessa" ...
## $ age : int 25 29 22 25 26 26 27 26 27 26 ...
5.2.3.4 Transforming values in data frames
The same logic applies to transforming existing data.
Let’s say that we want to transform the variable outlet_trust in the object survey.
Remember: The variable indicates how much each student trusts a specific media outlet described in the variable outlet (from 1 = not at all to 5 = very much).
Instead of numeric values ranging from 1 to 5, we want to create a new variable named survey$outlet_trust_new which should include:
- “low trust” instead of the numeric values 1 and 2
- “medium trust” instead of the numeric value 3
- “high trust” instead of the numeric values 4 and 5
Using dplyr, this becomes easy. Here, we use the mutate() function to create new variables based on our existing data.
#create empty variable
survey$outlet_trust_new <- NA
#transform all trust scores in the same pipeline
survey <- survey %>%
  mutate(outlet_trust_new=replace(outlet_trust_new, outlet_trust<=2, "low trust")) %>%
  mutate(outlet_trust_new=replace(outlet_trust_new, outlet_trust==3, "medium trust")) %>%
  mutate(outlet_trust_new=replace(outlet_trust_new, outlet_trust>=4, "high trust"))
#inspect results
str(survey)
## 'data.frame': 20 obs. of 7 variables:
## $ name : chr "Alexandra" "Alex" "Maximilian" "Moritz" ...
## $ age : int 20 25 29 22 25 26 26 27 8 26 ...
## $ date : chr "2021-09-09" "2021-09-08" "2021-09-09" "2021-09-06" ...
## $ outlet : chr "TV" "Online " "Zeitung" "TV" ...
## $ outlet_use : int 2 3 4 2 1 3 3 0 1 NA ...
## $ outlet_trust : int 5 5 1 2 3 4 2 1 4 2 ...
## $ outlet_trust_new: chr "high trust" "high trust" "low trust" "low trust" ...
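As an alternative sketch (not the approach used above), dplyr's case_when() can express the same three rules inside a single mutate() call, without creating an empty variable first. The data frame trust_toy is a made-up stand-in with five trust scores; the sketch requires the dplyr package to be installed.

```r
library(dplyr)

# Hypothetical stand-in holding only a few trust scores
trust_toy <- data.frame(outlet_trust = c(5, 1, 3, 2, 4))

# Each line reads: condition ~ value to assign when the condition is TRUE
trust_toy <- trust_toy %>%
  mutate(outlet_trust_new = case_when(
    outlet_trust <= 2 ~ "low trust",
    outlet_trust == 3 ~ "medium trust",
    outlet_trust >= 4 ~ "high trust"
  ))

print(trust_toy)
```

Because the conditions are checked in order and do not overlap, every score from 1 to 5 receives exactly one label.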
5.3 Take Aways
- Inspecting objects:
- simply type in the object’s name, use print(), View(), str(), or head()
- to retrieve elements of a vector: vector[number of element]
- to retrieve elements of a data frame:
- for specific rows: data frame[number of row(s), ]
- for specific columns: data frame[ ,number of column(s) ] OR data frame$column
- for values in specific rows and columns: data frame[number of row(s), number of column(s)] OR data frame$column[number of row(s)]
- for lists: list[[number of list element]] OR list$name of list element
- Transforming objects:
- Useful commands via dplyr, for instance: %>%, select(), filter(), mutate() (here combined with base R’s replace())
5.4 More tutorials on this
You still have questions? The following tutorials & papers can help you with that:
5.5 Test your knowledge
You’ve worked through all the material of Tutorial 5? Let’s see it - the following tasks will test your knowledge.
Since it is almost Halloween season, we’ll work with a data set from FiveThirtyEight on The Ultimate Halloween Candy Power Ranking.
In short, the data contains an online survey in which participants were asked to choose their favorite out of two different candy types. The data is accessible via a Creative Commons Attribution 4.0 International License here. You’ll have to copy the data into a “.txt” file. Alternatively, you can download the data from Moodle/Data for R (file called data_tutorial4.txt).
The data includes a range of variables, including
- the name of each candy bar: competitorname
- whether the candy bar contains chocolate: chocolate
- whether the candy bar contains fruit flavor: fruity
- whether the candy bar contains caramel: caramel
- the unit price percentile the candy bar is in compared to the rest of the set: pricepercent
5.5.1 Task 5.1
Read the data set into R. Writing the corresponding R code, find out
- how many observations and how many variables the data set contains.
5.5.2 Task 5.2
Writing the corresponding R code, find out
- how many candy bars contain chocolate.
- how many candy bars contain fruit flavor.
5.5.3 Task 5.3
Writing the corresponding R code, find out
- the name(s) of candy bars containing both chocolate and fruit flavor.
5.5.4 Task 5.4
Create a new data frame called data_new. Writing the corresponding R code,
- reduce the data set to only those observations containing chocolate but not caramel. The data set should also only include the variables competitorname and pricepercent.
- round the variable pricepercent to two decimals.
- sort the data by pricepercent in descending order, i.e., make sure that candy bars with the highest price are on top of the data frame and those with the lowest price on the bottom.
The corresponding data frame should look like this:
## competitorname pricepercent
## 1 Nestle Smarties 0.98
## 2 Hershey's Krackel 0.92
## 3 Hershey's Milk Chocolate 0.92
## 4 Hershey's Special Dark 0.92
## 5 Mr Good Bar 0.92
## 6 Mounds 0.86
## 7 Whoppers 0.85
## 8 Almond Joy 0.77
## 9 Nestle Butterfinger 0.77
## 10 Nestle Crunch 0.77
## 11 Peanut butter M&M's 0.65
## 12 M&M's 0.65
## 13 Peanut M&Ms 0.65
## 14 Reese's Peanut Butter cup 0.65
## 15 Reese's pieces 0.65
## 16 Reese's stuffed with pieces 0.65
## 17 3 Musketeers 0.51
## 18 Charleston Chew 0.51
## 19 Junior Mints 0.51
## 20 Kit Kat 0.51
## 21 Tootsie Roll Juniors 0.51
## 22 Tootsie Pop 0.32
## 23 Tootsie Roll Snack Bars 0.32
## 24 Reese's Miniatures 0.28
## 25 Hershey's Kisses 0.09
## 26 Sixlets 0.08
## 27 Tootsie Roll Midgies 0.01
This is where you’ll find solutions for tutorial 5.
Let’s keep going: Tutorial 6: Data Collection: Testing Intercoder Reliability.
5. We could have also done that for the scalar word. However, since the object only contains one value, this would not make much sense. To understand that, compare the commands word[1] and word[2].
6. We could already rely on the dplyr package here, which eases data transformations; we’ll get to that in a minute.
7. Remember Tutorial 3.1.6? We’ve already encountered this kind of code!
8. Though I never met anyone who “finished” R, so remember that we are all beginners in R, or that we are all learning R, but are at different stages of doing so.