4 Using dplyr
4.1 Intro
The most intutive way to manipulate your data is using the functions inside the dplyr package, which is downloaded and installed into R (as all packages are, see Chapter 2 for more details). “Manipulating” your data refers to choosing a subset of the variables and/or observations in your dataset, as well as filtering (selecting observations based on their variable values), creating new variables, and more. Now that you understand the interface of R and R Studio and how to work with objects and functions, the rest of these tutorials will introduce techniques in R designed to address specific needs in standard data analyses. We start with data manipulation.
4.2 dplyr Functions
The package dplyr has many functions, each designed to do a specific task in data manipulation. The main ones are: select(), filter(), arrange(),mutate(), and summarize(). We will cover all of these, as well as other functions such as group_by(), spread(), gather(), separate(), and unite(). As with all packages, we first must load the package after it has been installed.
# install.packages("dplyr")
library(dplyr) # loads package
4.2.1 Select, filter, and arrange
The functions select(), filter(), and arrange() handle the usual subsetting operations discussed in the previous chapter. We will go through each one in order.
The select() function is used to include or omit chosen variables (columns) from the dataset, as well as rearrange the order of the variables. The syntax is the following
select(dataset, variable_1, variable_2, …)
where the argument “dataset” is where you state the name of the object that represents the dataset of interest, and “variable_1”, “variable_2”,… are the arguments where you state the name of the variables/columns you want to include or remove (to remove the variable, begin the name with a -) in the order you would like them to apear in the dataset. There is no limit (I believe) to the number of variables that can be referenced. You can reference an ordered collection of variables (ordered meaning from left to right in the dataset) using variable_i:variable_j as an argument in the function, where “variable_i” is the name of the first variable and “variable_j” is the name of the last. This idea works with - as well. Lastly, unlike with subsetting in Base R, you can reference variables to include and remove (with a -) in a single select() function call.
The filter() function is used to reduce the dataset to a subset of observations based on a set of criteria for their variable values. The syntax is the following
filter(dataset, criteria_1, criteria_2, …)
where the arguments “criteria_1”, “criteria_2”,… are where you state the conditions needed for a observation to be included. Note that these are AND conditions, meaning an observation has to meet all provided criteria to be included and OR conditions where an observation has to meet at least one of the provided criteria. Let’s go into more detail on a AND condition and an OR condition, as well as logical statements in general.
In R, a logical statement is an operator that returns the value TRUE, FALSE, or NA (missing). They are conditions where if the condition is met, TRUE is returned, if not FALSE is returned, and if the variable in the condition is missing NA is returned. Conditional operators include > (greater then), < (less then), equal (==), >= (greater then or equal to), <= (less then or equal to), and ! (not). Consider the following simple examples.
x <- 5
x==5
## [1] TRUE
!(x==5) # not equal
## [1] FALSE
x!=5 # also not equal
## [1] FALSE
x>3
## [1] TRUE
x<3
## [1] FALSE
!(x<3)
## [1] TRUE
x<=5
## [1] TRUE
Logical statements can be combined and evaluated as a single statement using AND and OR statements. An AND statement means all individual statements must be TRUE to be evaluated as TRUE, otherwise is FALSE (or NA) while an OR statement means at least one statement must be TRUE to be evaluated as TRUE, and is FALSE only if all statements are FALSE. AND statements are created by separating individual logical statements with & while OR statements are created by separating individual logical statements with |. Parentheses may also be needed so that R evaluates the individual statements in the correct order. Consider the following examples.
x <- 5
y <- 3
x==5|y==4
## [1] TRUE
x==5 & y==4
## [1] FALSE
x==5 & y==3
## [1] TRUE
x>3 & x<7
## [1] TRUE
!(x>3 & x<7) # negates TRUE expression => FALSE returned
## [1] FALSE
Now, let’s apply select() and filter() to some IBIS data. We will use the small AOSI dataset from before, and do the following:
- Select specific variables to keep
- Filter observations by AOSI total score and diagnosis group
Remember that all changes are not saved nor do they overwrite the original dataset object in R, unless done by the user by either creating a new object with these changes or saving these new changes as an object of the same R object name as the dataset. Even if one changes the original R object, the Excel sheet is not changed (we will see later how to have the changes done in R be reflected in either the original Excel sheet or in a new Excel sheet).
excel_data <- read.csv("Data/AOSI_small.csv", na.strings=c(".", " ")) # read-in data as data frame, save as named object
# View whole dataset
excel_data
## Identifiers GROUP Study_Site Gender V06.aosi.Candidate_Age
## 1 1 HR_ASD PHI Male 7.1
## 2 2 HR_ASD PHI Male 6.1
## 3 3 HR_ASD PHI Male 6.6
## 4 4 HR_ASD PHI Male NA
## 5 5 HR_neg PHI Male 6.8
## 6 6 HR_neg PHI Male NA
## 7 7 HR_ASD PHI Male 6.2
## 8 8 HR_ASD PHI Male 6.2
## 9 9 HR_neg PHI Male NA
## 10 10 HR_ASD PHI Male NA
## V06.aosi.total_score_1_18 V12.aosi.Candidate_Age
## 1 8 12.5
## 2 18 12.4
## 3 4 12.9
## 4 NA 12.7
## 5 6 NA
## 6 NA 12.1
## 7 16 12.3
## 8 10 12.2
## 9 NA 12.4
## 10 NA 12.2
## V12.aosi.total_score_1_18
## 1 3
## 2 10
## 3 2
## 4 3
## 5 NA
## 6 4
## 7 4
## 8 14
## 9 1
## 10 7
# Select only ID and Group and view data
select(excel_data, Identifiers, GROUP)
## Identifiers GROUP
## 1 1 HR_ASD
## 2 2 HR_ASD
## 3 3 HR_ASD
## 4 4 HR_ASD
## 5 5 HR_neg
## 6 6 HR_neg
## 7 7 HR_ASD
## 8 8 HR_ASD
## 9 9 HR_neg
## 10 10 HR_ASD
# Remove gender and site from data
select(excel_data, -Gender, -Study_Site)
## Identifiers GROUP V06.aosi.Candidate_Age V06.aosi.total_score_1_18
## 1 1 HR_ASD 7.1 8
## 2 2 HR_ASD 6.1 18
## 3 3 HR_ASD 6.6 4
## 4 4 HR_ASD NA NA
## 5 5 HR_neg 6.8 6
## 6 6 HR_neg NA NA
## 7 7 HR_ASD 6.2 16
## 8 8 HR_ASD 6.2 10
## 9 9 HR_neg NA NA
## 10 10 HR_ASD NA NA
## V12.aosi.Candidate_Age V12.aosi.total_score_1_18
## 1 12.5 3
## 2 12.4 10
## 3 12.9 2
## 4 12.7 3
## 5 NA NA
## 6 12.1 4
## 7 12.3 4
## 8 12.2 14
## 9 12.4 1
## 10 12.2 7
# Only include observations with ASD diagnosis
filter(excel_data, GROUP=="HR_ASD")
## Identifiers GROUP Study_Site Gender V06.aosi.Candidate_Age
## 1 1 HR_ASD PHI Male 7.1
## 2 2 HR_ASD PHI Male 6.1
## 3 3 HR_ASD PHI Male 6.6
## 4 4 HR_ASD PHI Male NA
## 5 7 HR_ASD PHI Male 6.2
## 6 8 HR_ASD PHI Male 6.2
## 7 10 HR_ASD PHI Male NA
## V06.aosi.total_score_1_18 V12.aosi.Candidate_Age
## 1 8 12.5
## 2 18 12.4
## 3 4 12.9
## 4 NA 12.7
## 5 16 12.3
## 6 10 12.2
## 7 NA 12.2
## V12.aosi.total_score_1_18
## 1 3
## 2 10
## 3 2
## 4 3
## 5 4
## 6 14
## 7 7
# only include observations with ASD diagnosis and month 6 visit AOSI total score of at least 10
filter(excel_data, GROUP=="HR_ASD" & V06.aosi.total_score_1_18>=10)
## Identifiers GROUP Study_Site Gender V06.aosi.Candidate_Age
## 1 2 HR_ASD PHI Male 6.1
## 2 7 HR_ASD PHI Male 6.2
## 3 8 HR_ASD PHI Male 6.2
## V06.aosi.total_score_1_18 V12.aosi.Candidate_Age
## 1 18 12.4
## 2 16 12.3
## 3 10 12.2
## V12.aosi.total_score_1_18
## 1 10
## 2 4
## 3 14
# only include Female observations
filter(excel_data, Gender=="Female") # Blank dataset returned, only have male subjects
## [1] Identifiers GROUP
## [3] Study_Site Gender
## [5] V06.aosi.Candidate_Age V06.aosi.total_score_1_18
## [7] V12.aosi.Candidate_Age V12.aosi.total_score_1_18
## <0 rows> (or 0-length row.names)
Finally, the arrange() function is used to reorder the observations in the dataset based on specific variables. The syntax for the function is arrange(dataset, variable1, variable2, …) where variable1, variable2, … are the variables you wish to used for the reordering. Simply stating the variable name will order the dataset by increasing values of the variable. To use a decreasing order, use desc(variable) instead.
arrange(excel_data, GROUP, V12.aosi.Candidate_Age) # increasing age
## Identifiers GROUP Study_Site Gender V06.aosi.Candidate_Age
## 1 8 HR_ASD PHI Male 6.2
## 2 10 HR_ASD PHI Male NA
## 3 7 HR_ASD PHI Male 6.2
## 4 2 HR_ASD PHI Male 6.1
## 5 1 HR_ASD PHI Male 7.1
## 6 4 HR_ASD PHI Male NA
## 7 3 HR_ASD PHI Male 6.6
## 8 6 HR_neg PHI Male NA
## 9 9 HR_neg PHI Male NA
## 10 5 HR_neg PHI Male 6.8
## V06.aosi.total_score_1_18 V12.aosi.Candidate_Age
## 1 10 12.2
## 2 NA 12.2
## 3 16 12.3
## 4 18 12.4
## 5 8 12.5
## 6 NA 12.7
## 7 4 12.9
## 8 NA 12.1
## 9 NA 12.4
## 10 6 NA
## V12.aosi.total_score_1_18
## 1 14
## 2 7
## 3 4
## 4 10
## 5 3
## 6 3
## 7 2
## 8 4
## 9 1
## 10 NA
arrange(excel_data, GROUP, desc(V12.aosi.Candidate_Age)) # decreasing age
## Identifiers GROUP Study_Site Gender V06.aosi.Candidate_Age
## 1 3 HR_ASD PHI Male 6.6
## 2 4 HR_ASD PHI Male NA
## 3 1 HR_ASD PHI Male 7.1
## 4 2 HR_ASD PHI Male 6.1
## 5 7 HR_ASD PHI Male 6.2
## 6 8 HR_ASD PHI Male 6.2
## 7 10 HR_ASD PHI Male NA
## 8 9 HR_neg PHI Male NA
## 9 6 HR_neg PHI Male NA
## 10 5 HR_neg PHI Male 6.8
## V06.aosi.total_score_1_18 V12.aosi.Candidate_Age
## 1 4 12.9
## 2 NA 12.7
## 3 8 12.5
## 4 18 12.4
## 5 16 12.3
## 6 10 12.2
## 7 NA 12.2
## 8 NA 12.4
## 9 NA 12.1
## 10 6 NA
## V12.aosi.total_score_1_18
## 1 2
## 2 3
## 3 3
## 4 10
## 5 4
## 6 14
## 7 7
## 8 1
## 9 4
## 10 NA
arrange(excel_data, desc(GROUP), V12.aosi.Candidate_Age) # 'decreasing GROUP', notice R uses alphabetical order when character variables are used
## Identifiers GROUP Study_Site Gender V06.aosi.Candidate_Age
## 1 6 HR_neg PHI Male NA
## 2 9 HR_neg PHI Male NA
## 3 5 HR_neg PHI Male 6.8
## 4 8 HR_ASD PHI Male 6.2
## 5 10 HR_ASD PHI Male NA
## 6 7 HR_ASD PHI Male 6.2
## 7 2 HR_ASD PHI Male 6.1
## 8 1 HR_ASD PHI Male 7.1
## 9 4 HR_ASD PHI Male NA
## 10 3 HR_ASD PHI Male 6.6
## V06.aosi.total_score_1_18 V12.aosi.Candidate_Age
## 1 NA 12.1
## 2 NA 12.4
## 3 6 NA
## 4 10 12.2
## 5 NA 12.2
## 6 16 12.3
## 7 18 12.4
## 8 8 12.5
## 9 NA 12.7
## 10 4 12.9
## V12.aosi.total_score_1_18
## 1 4
## 2 1
## 3 NA
## 4 14
## 5 7
## 6 4
## 7 10
## 8 3
## 9 3
## 10 2
4.2.2 Mutate and summarize
The functions mutate() and summarize() are used to create new variables to add to the dataset. First, we cover mutate(). The syntax for this function is when used with the pipe is mutate(dataset, variable1, variable2, …) where variable1, variable2, … are the definitions of the variables you want to add. They will be calculated for each observation in the dataset. We consider an IBIS dataset of high risk children (HR) with Vineland, Mullen, and AOSI information.
Suppose we wanted create the following variables 1) variable containing each child’s mean Mullen composite standard score across the 4 visits in the dataset and 2) variable indicating if the site of the visit is on the East Coast, Midwest, or West Coast of the United States. We can easily do this using mutate(), as shown below.
# Create data set of only HR children
HR_data <- read.csv("Data/Cross-sec_full.csv", na.strings=c(".", " "))
HR_data <- filter(HR_data, V24.demographics.Risk=="HR")
# Create variables using mutate()
HR_data <- mutate(HR_data, Mullen_Mean=(`V06.mullen.composite_standard_score`+`V12.mullen.composite_standard_score`+
`V24.mullen.composite_standard_score`+`V36.mullen.composite_standard_score`)/4,
Site_Location=factor(ifelse(Study_Site=="PHI"|Study_Site=="UNC","East_Coast",
ifelse(Study_Site=="STL","Midwest","West_Coast"))))
ftable(HR_data$Study_Site, HR_data$Site_Location)
## East_Coast Midwest West_Coast
##
## PHI 108 0 0
## SEA 0 0 101
## STL 0 110 0
## UNC 90 0 0
From the above example, you will notice the use of the function ifelse(). Using this function, you can create a series of if…else commands so that R execute pieces of code based on specified criteria. The syntax for ifelse() is ifelse(condition, action, else) where condition is some logical statement to test (i.e., this condition needs to return TRUE or FALSE), action is the command R will execute if the condition is TRUE, and else is the command R will execute if the condition if FALSE. As you can see in the above example, you ca create a series of if…else commands by chaining ifelse() calls by setting the else argument to another call to ifelse(). The ifelse chain used in the above example is interpreted as follows.
If site is UNC or site is PHI, set variable Site_Location’s value to East_Coast
Else, if site is STL, set variable Site_Location’s value to Midwest
Else, if 1) and 2) are not true, set variable Site_Location’s value to West_Coast (since site then must be SEA)
Make sure to keep track of missing values; in our example, there are no missing values for site so this was not a concern. If there were missing values for site, the variable Site_Location would be set to West_Coast and not NA as desired.
The function summarize() is used to reduce a dataset to a set of chosen summary statistics such as the mean, sample size, standard deviation, etc. The syntax is summarize(dataset, variable1, variable2, …) where variable1, variable2, … are the definitions of the summary statistics you wish to calculate. For example, suppose in the AOSI dataset, we want to create a dataset containing the mean values of total score at month 6 and at month 12. This can be done using summarise().
aosi_data <-read.csv("Data/cross-sec_aosi.csv", na.strings=c(".", " "))
summarise(aosi_data, sample_size=n(), mean_aosi_ts_V6=mean(V06.aosi.total_score_1_18), mean_aosi_ts_V12=mean(V12.aosi.total_score_1_18))
## sample_size mean_aosi_ts_V6 mean_aosi_ts_V12
## 1 587 NA NA
This function becomes much more useful when combined with the function group_by(). The calling group_by() on the dataset before using summarise() will have R incorporate the created groups in calculations; after calling group_by(), while the data may look to the same to us, R has marked observations as being grouped together in the background. Let’s repeat the above example, but first group by gender and diagnosis.
aosi_data_grouped <- group_by(aosi_data, Gender, GROUP) # Notice in the environment window, this object is marked as type grouped_...
summarise(aosi_data_grouped, sample_size=n(), mean_aosi_ts_V6=mean(V06.aosi.total_score_1_18), mean_aosi_ts_V12=mean(V12.aosi.total_score_1_18))
## # A tibble: 8 x 5
## # Groups: Gender [?]
## Gender GROUP sample_size mean_aosi_ts_V6 mean_aosi_ts_V12
## <fct> <fct> <int> <dbl> <dbl>
## 1 Female HR_ASD 20 NA NA
## 2 Female HR_neg 145 NA NA
## 3 Female LR_ASD 1 6 5
## 4 Female LR_neg 69 NA NA
## 5 Male HR_ASD 75 NA NA
## 6 Male HR_neg 173 NA NA
## 7 Male LR_ASD 2 8 3
## 8 Male LR_neg 102 NA NA
We can see the summary statistics are calculated for each specified group. There are many summary statistics that can be calculated using the summarise() function; note that due to missing values, some of the means are NA (this can be changed by adding na.omit=TRUE as arguments to the calls to mean(). For a list of some of these statistics and how to implement them, use ?summarise.
4.2.3 Spread, Gather, Separate and Unite
The dplyr package also has functions used to transform the dimensions of your data, for example going from wide to long. Recall the full dataset has data on children for multiple time points; thus, it will be essential for certain analyses to have the data in long form or in wide form. The functions spread(), gather(), separate(), and unite() can be used to do these transformations. We show all four functions through an example. For more explanation of these functions, please see and from Hadley Wickham’s excellent book R for Data Science.
Consider the AOSI_small.csv data set. Recall, we previously loaded this dataset into R and saved it as excel_data. This dataset has the child’s age and AOSI total score at 6 months and 12 months and is “wide” form (1 row per child). Let us transform this to “long” form (multiple rows per subject).
excel_data
## Identifiers GROUP Study_Site Gender V06.aosi.Candidate_Age
## 1 1 HR_ASD PHI Male 7.1
## 2 2 HR_ASD PHI Male 6.1
## 3 3 HR_ASD PHI Male 6.6
## 4 4 HR_ASD PHI Male NA
## 5 5 HR_neg PHI Male 6.8
## 6 6 HR_neg PHI Male NA
## 7 7 HR_ASD PHI Male 6.2
## 8 8 HR_ASD PHI Male 6.2
## 9 9 HR_neg PHI Male NA
## 10 10 HR_ASD PHI Male NA
## V06.aosi.total_score_1_18 V12.aosi.Candidate_Age
## 1 8 12.5
## 2 18 12.4
## 3 4 12.9
## 4 NA 12.7
## 5 6 NA
## 6 NA 12.1
## 7 16 12.3
## 8 10 12.2
## 9 NA 12.4
## 10 NA 12.2
## V12.aosi.total_score_1_18
## 1 3
## 2 10
## 3 2
## 4 3
## 5 NA
## 6 4
## 7 4
## 8 14
## 9 1
## 10 7
First, we extract the variable names which represent the repeated measures (age and AOSI total score) for the children.
vars_to_convert <- names(excel_data)[5:8]
vars_to_convert
## [1] "V06.aosi.Candidate_Age" "V06.aosi.total_score_1_18"
## [3] "V12.aosi.Candidate_Age" "V12.aosi.total_score_1_18"
Then, we force each of these variables to be it’s own row, creating two new variables; 1) “variable” which contains the variable’s name and 2) “var_value” with the value of the variable. This is done using gather(). With gather(), first you the generate new variable name which holds the old variable, second you specify the new variable name which holds of the value of the old variable, and thirdly you specify the names of the old variables. Notice the use of %>% in the following code. This symbol is called the pipe and is used to connect various lines together, instead of writing and saving each line as a separate object to then be overwritten. We discuss the pipe in detail in a later part of this chapter; for now, all you need to understand is that it connects each of the proceeding lines together into a single command.
library(tidyr)
excel_data %>%
gather(variable, var_value, vars_to_convert)
## Identifiers GROUP Study_Site Gender variable
## 1 1 HR_ASD PHI Male V06.aosi.Candidate_Age
## 2 2 HR_ASD PHI Male V06.aosi.Candidate_Age
## 3 3 HR_ASD PHI Male V06.aosi.Candidate_Age
## 4 4 HR_ASD PHI Male V06.aosi.Candidate_Age
## 5 5 HR_neg PHI Male V06.aosi.Candidate_Age
## 6 6 HR_neg PHI Male V06.aosi.Candidate_Age
## 7 7 HR_ASD PHI Male V06.aosi.Candidate_Age
## 8 8 HR_ASD PHI Male V06.aosi.Candidate_Age
## 9 9 HR_neg PHI Male V06.aosi.Candidate_Age
## 10 10 HR_ASD PHI Male V06.aosi.Candidate_Age
## 11 1 HR_ASD PHI Male V06.aosi.total_score_1_18
## 12 2 HR_ASD PHI Male V06.aosi.total_score_1_18
## 13 3 HR_ASD PHI Male V06.aosi.total_score_1_18
## 14 4 HR_ASD PHI Male V06.aosi.total_score_1_18
## 15 5 HR_neg PHI Male V06.aosi.total_score_1_18
## 16 6 HR_neg PHI Male V06.aosi.total_score_1_18
## 17 7 HR_ASD PHI Male V06.aosi.total_score_1_18
## 18 8 HR_ASD PHI Male V06.aosi.total_score_1_18
## 19 9 HR_neg PHI Male V06.aosi.total_score_1_18
## 20 10 HR_ASD PHI Male V06.aosi.total_score_1_18
## 21 1 HR_ASD PHI Male V12.aosi.Candidate_Age
## 22 2 HR_ASD PHI Male V12.aosi.Candidate_Age
## 23 3 HR_ASD PHI Male V12.aosi.Candidate_Age
## 24 4 HR_ASD PHI Male V12.aosi.Candidate_Age
## 25 5 HR_neg PHI Male V12.aosi.Candidate_Age
## 26 6 HR_neg PHI Male V12.aosi.Candidate_Age
## 27 7 HR_ASD PHI Male V12.aosi.Candidate_Age
## 28 8 HR_ASD PHI Male V12.aosi.Candidate_Age
## 29 9 HR_neg PHI Male V12.aosi.Candidate_Age
## 30 10 HR_ASD PHI Male V12.aosi.Candidate_Age
## 31 1 HR_ASD PHI Male V12.aosi.total_score_1_18
## 32 2 HR_ASD PHI Male V12.aosi.total_score_1_18
## 33 3 HR_ASD PHI Male V12.aosi.total_score_1_18
## 34 4 HR_ASD PHI Male V12.aosi.total_score_1_18
## 35 5 HR_neg PHI Male V12.aosi.total_score_1_18
## 36 6 HR_neg PHI Male V12.aosi.total_score_1_18
## 37 7 HR_ASD PHI Male V12.aosi.total_score_1_18
## 38 8 HR_ASD PHI Male V12.aosi.total_score_1_18
## 39 9 HR_neg PHI Male V12.aosi.total_score_1_18
## 40 10 HR_ASD PHI Male V12.aosi.total_score_1_18
## var_value
## 1 7.1
## 2 6.1
## 3 6.6
## 4 NA
## 5 6.8
## 6 NA
## 7 6.2
## 8 6.2
## 9 NA
## 10 NA
## 11 8.0
## 12 18.0
## 13 4.0
## 14 NA
## 15 6.0
## 16 NA
## 17 16.0
## 18 10.0
## 19 NA
## 20 NA
## 21 12.5
## 22 12.4
## 23 12.9
## 24 12.7
## 25 NA
## 26 12.1
## 27 12.3
## 28 12.2
## 29 12.4
## 30 12.2
## 31 3.0
## 32 10.0
## 33 2.0
## 34 3.0
## 35 NA
## 36 4.0
## 37 4.0
## 38 14.0
## 39 1.0
## 40 7.0
Next, we split the “variable” into two new variables; 1) contains the visit and 2) contains the variable at that visit. This is done using separate(). With separate(), first you specify the variable which you are separating, second you specify the names of the new variables created from the separating, and third you specify where in the old variable you want to separate (here we spit after the 3rd character, hence sep=3).
library(tidyr)
excel_data %>%
gather(variable, var_value, vars_to_convert) %>%
separate(variable,c("Visit","Variable"),sep=3)
## Identifiers GROUP Study_Site Gender Visit Variable
## 1 1 HR_ASD PHI Male V06 .aosi.Candidate_Age
## 2 2 HR_ASD PHI Male V06 .aosi.Candidate_Age
## 3 3 HR_ASD PHI Male V06 .aosi.Candidate_Age
## 4 4 HR_ASD PHI Male V06 .aosi.Candidate_Age
## 5 5 HR_neg PHI Male V06 .aosi.Candidate_Age
## 6 6 HR_neg PHI Male V06 .aosi.Candidate_Age
## 7 7 HR_ASD PHI Male V06 .aosi.Candidate_Age
## 8 8 HR_ASD PHI Male V06 .aosi.Candidate_Age
## 9 9 HR_neg PHI Male V06 .aosi.Candidate_Age
## 10 10 HR_ASD PHI Male V06 .aosi.Candidate_Age
## 11 1 HR_ASD PHI Male V06 .aosi.total_score_1_18
## 12 2 HR_ASD PHI Male V06 .aosi.total_score_1_18
## 13 3 HR_ASD PHI Male V06 .aosi.total_score_1_18
## 14 4 HR_ASD PHI Male V06 .aosi.total_score_1_18
## 15 5 HR_neg PHI Male V06 .aosi.total_score_1_18
## 16 6 HR_neg PHI Male V06 .aosi.total_score_1_18
## 17 7 HR_ASD PHI Male V06 .aosi.total_score_1_18
## 18 8 HR_ASD PHI Male V06 .aosi.total_score_1_18
## 19 9 HR_neg PHI Male V06 .aosi.total_score_1_18
## 20 10 HR_ASD PHI Male V06 .aosi.total_score_1_18
## 21 1 HR_ASD PHI Male V12 .aosi.Candidate_Age
## 22 2 HR_ASD PHI Male V12 .aosi.Candidate_Age
## 23 3 HR_ASD PHI Male V12 .aosi.Candidate_Age
## 24 4 HR_ASD PHI Male V12 .aosi.Candidate_Age
## 25 5 HR_neg PHI Male V12 .aosi.Candidate_Age
## 26 6 HR_neg PHI Male V12 .aosi.Candidate_Age
## 27 7 HR_ASD PHI Male V12 .aosi.Candidate_Age
## 28 8 HR_ASD PHI Male V12 .aosi.Candidate_Age
## 29 9 HR_neg PHI Male V12 .aosi.Candidate_Age
## 30 10 HR_ASD PHI Male V12 .aosi.Candidate_Age
## 31 1 HR_ASD PHI Male V12 .aosi.total_score_1_18
## 32 2 HR_ASD PHI Male V12 .aosi.total_score_1_18
## 33 3 HR_ASD PHI Male V12 .aosi.total_score_1_18
## 34 4 HR_ASD PHI Male V12 .aosi.total_score_1_18
## 35 5 HR_neg PHI Male V12 .aosi.total_score_1_18
## 36 6 HR_neg PHI Male V12 .aosi.total_score_1_18
## 37 7 HR_ASD PHI Male V12 .aosi.total_score_1_18
## 38 8 HR_ASD PHI Male V12 .aosi.total_score_1_18
## 39 9 HR_neg PHI Male V12 .aosi.total_score_1_18
## 40 10 HR_ASD PHI Male V12 .aosi.total_score_1_18
## var_value
## 1 7.1
## 2 6.1
## 3 6.6
## 4 NA
## 5 6.8
## 6 NA
## 7 6.2
## 8 6.2
## 9 NA
## 10 NA
## 11 8.0
## 12 18.0
## 13 4.0
## 14 NA
## 15 6.0
## 16 NA
## 17 16.0
## 18 10.0
## 19 NA
## 20 NA
## 21 12.5
## 22 12.4
## 23 12.9
## 24 12.7
## 25 NA
## 26 12.1
## 27 12.3
## 28 12.2
## 29 12.4
## 30 12.2
## 31 3.0
## 32 10.0
## 33 2.0
## 34 3.0
## 35 NA
## 36 4.0
## 37 4.0
## 38 14.0
## 39 1.0
## 40 7.0
Finally, we force the non-visit index variable to be separate columns instead of a single column of variable names. This is done using spread(). With spread(), first we specify the list of variables (key=) and second we specify the list of values of these variables (value=)
library(tidyr)
excel_data %>%
gather(variable, var_value, vars_to_convert) %>%
separate(variable,c("Visit","Variable"),sep=3) %>%
spread(key=Variable, value=var_value)
## Identifiers GROUP Study_Site Gender Visit .aosi.Candidate_Age
## 1 1 HR_ASD PHI Male V06 7.1
## 2 1 HR_ASD PHI Male V12 12.5
## 3 2 HR_ASD PHI Male V06 6.1
## 4 2 HR_ASD PHI Male V12 12.4
## 5 3 HR_ASD PHI Male V06 6.6
## 6 3 HR_ASD PHI Male V12 12.9
## 7 4 HR_ASD PHI Male V06 NA
## 8 4 HR_ASD PHI Male V12 12.7
## 9 5 HR_neg PHI Male V06 6.8
## 10 5 HR_neg PHI Male V12 NA
## 11 6 HR_neg PHI Male V06 NA
## 12 6 HR_neg PHI Male V12 12.1
## 13 7 HR_ASD PHI Male V06 6.2
## 14 7 HR_ASD PHI Male V12 12.3
## 15 8 HR_ASD PHI Male V06 6.2
## 16 8 HR_ASD PHI Male V12 12.2
## 17 9 HR_neg PHI Male V06 NA
## 18 9 HR_neg PHI Male V12 12.4
## 19 10 HR_ASD PHI Male V06 NA
## 20 10 HR_ASD PHI Male V12 12.2
## .aosi.total_score_1_18
## 1 8
## 2 3
## 3 18
## 4 10
## 5 4
## 6 2
## 7 NA
## 8 3
## 9 6
## 10 NA
## 11 NA
## 12 4
## 13 16
## 14 4
## 15 10
## 16 14
## 17 NA
## 18 1
## 19 NA
## 20 7
Finally, we clean up the formatting of the variables using the functions discussed previously (rename(), mutate(), etc.). We save this new transformed dataset as long_data.
library(tidyr)
long_data <- excel_data %>%
gather(variable, var_value, vars_to_convert) %>%
separate(variable,c("Visit","Variable"),sep=3) %>%
spread(key=Variable, value=var_value) %>%
plyr::rename(c(".aosi.Candidate_Age"="AOSI_Age",
".aosi.total_score_1_18"="AOSI_Total_Score")) %>%
mutate(ASD_Diag = factor(ifelse(grepl("ASD", GROUP), "ASD_Neg", "ASD_Pos")),
Visit=factor(Visit)) %>%
arrange(Identifiers, Visit)
long_data
## Identifiers GROUP Study_Site Gender Visit AOSI_Age AOSI_Total_Score
## 1 1 HR_ASD PHI Male V06 7.1 8
## 2 1 HR_ASD PHI Male V12 12.5 3
## 3 2 HR_ASD PHI Male V06 6.1 18
## 4 2 HR_ASD PHI Male V12 12.4 10
## 5 3 HR_ASD PHI Male V06 6.6 4
## 6 3 HR_ASD PHI Male V12 12.9 2
## 7 4 HR_ASD PHI Male V06 NA NA
## 8 4 HR_ASD PHI Male V12 12.7 3
## 9 5 HR_neg PHI Male V06 6.8 6
## 10 5 HR_neg PHI Male V12 NA NA
## 11 6 HR_neg PHI Male V06 NA NA
## 12 6 HR_neg PHI Male V12 12.1 4
## 13 7 HR_ASD PHI Male V06 6.2 16
## 14 7 HR_ASD PHI Male V12 12.3 4
## 15 8 HR_ASD PHI Male V06 6.2 10
## 16 8 HR_ASD PHI Male V12 12.2 14
## 17 9 HR_neg PHI Male V06 NA NA
## 18 9 HR_neg PHI Male V12 12.4 1
## 19 10 HR_ASD PHI Male V06 NA NA
## 20 10 HR_ASD PHI Male V12 12.2 7
## ASD_Diag
## 1 ASD_Neg
## 2 ASD_Neg
## 3 ASD_Neg
## 4 ASD_Neg
## 5 ASD_Neg
## 6 ASD_Neg
## 7 ASD_Neg
## 8 ASD_Neg
## 9 ASD_Pos
## 10 ASD_Pos
## 11 ASD_Pos
## 12 ASD_Pos
## 13 ASD_Neg
## 14 ASD_Neg
## 15 ASD_Neg
## 16 ASD_Neg
## 17 ASD_Pos
## 18 ASD_Pos
## 19 ASD_Neg
## 20 ASD_Neg
4.2.4 Renaming variables
The final function in dplyr we discuss is rename(), used for renaming variables. The syntax for this function is rename(data, variable1_new_name=variable1_old_name, variable2_new_name=variable2_old_name, …). See the example below where the ID variable and diagnosis group variable are renamed.
rename_ex <- read.csv("Data/cross-sec_aosi.csv", na.strings=c(".", " ")) %>%
rename(ID=Identifiers, Diagnosis_group=GROUP)
rename_ex[1:10,]
## ID Diagnosis_group Study_Site Gender V06.aosi.Candidate_Age
## 1 1 HR_ASD PHI Male 7.1
## 2 2 HR_ASD PHI Male 6.1
## 3 3 HR_ASD PHI Male 6.6
## 4 4 HR_ASD PHI Male NA
## 5 5 HR_neg PHI Male 6.8
## 6 6 HR_neg PHI Male NA
## 7 7 HR_ASD PHI Male 6.2
## 8 8 HR_ASD PHI Male 6.2
## 9 9 HR_neg PHI Male NA
## 10 10 HR_ASD PHI Male NA
## V06.aosi.total_score_1_18 V12.aosi.Candidate_Age
## 1 8 12.5
## 2 18 12.4
## 3 4 12.9
## 4 NA 12.7
## 5 6 NA
## 6 NA 12.1
## 7 16 12.3
## 8 10 12.2
## 9 NA 12.4
## 10 NA 12.2
## V12.aosi.total_score_1_18
## 1 3
## 2 10
## 3 2
## 4 3
## 5 NA
## 6 4
## 7 4
## 8 14
## 9 1
## 10 7
4.2.5 Using the pipe
All of these functions in dplyr can be used in combination efficiently by using the pipe operator. The idea of the pipe is very simple; instead of writing each new data transformation as a new line and saving each previous transformation as an object to then overwrite, the pipe allows you to connect all of the steps together in a single line.
The pipe operator is denoted by %>%. Many functions can be connected using the pipe, including all of the dplyr package functions described above along with most of the other functions you will use in R when reading-in and manipulating data.
For example, let’s revisit calculating the mean AOSI scores by gender and site. Previously, this required multiple lines of R code, including creating a separate form of the dataset as an R object in which the dataset was grouped by gender and site. Let’s do this example using the pipe operator, seen below. You can see it is essentially the same code as before, with the first arguments of each function removed and each line connected by the operator %>%. The first argument is removed because by connecting the function calls with %>%, the result of the previous function is carried forward into the first argument of the next function by %>%. Thus, any function in which the first argument is an R dataset can bse used with the pipe operator. Notice also that no intermediate objects need to be saved.
aosi_data %>%
group_by(Gender, GROUP) %>%
summarise(sample_size=n(), mean_aosi_ts_V6=mean(V06.aosi.total_score_1_18), mean_aosi_ts_V12=mean(V12.aosi.total_score_1_18))
## # A tibble: 8 x 5
## # Groups: Gender [?]
## Gender GROUP sample_size mean_aosi_ts_V6 mean_aosi_ts_V12
## <fct> <fct> <int> <dbl> <dbl>
## 1 Female HR_ASD 20 NA NA
## 2 Female HR_neg 145 NA NA
## 3 Female LR_ASD 1 6 5
## 4 Female LR_neg 69 NA NA
## 5 Male HR_ASD 75 NA NA
## 6 Male HR_neg 173 NA NA
## 7 Male LR_ASD 2 8 3
## 8 Male LR_neg 102 NA NA
As a result, using the pipe operator makes your code more efficient (memory-wise in your computer and in the amount of code used) and readable. In fact, you can even avoid saving the dataset aosi_data all together and writing your code in the folowing fashion.
read.csv("Data/cross-sec_aosi.csv", na.strings=c(".", " ")) %>%
group_by(Gender, GROUP) %>%
summarise(sample_size=n(), mean_aosi_ts_V6=mean(V06.aosi.total_score_1_18), mean_aosi_ts_V12=mean(V12.aosi.total_score_1_18))
## # A tibble: 8 x 5
## # Groups: Gender [?]
## Gender GROUP sample_size mean_aosi_ts_V6 mean_aosi_ts_V12
## <fct> <fct> <int> <dbl> <dbl>
## 1 Female HR_ASD 20 NA NA
## 2 Female HR_neg 145 NA NA
## 3 Female LR_ASD 1 6 5
## 4 Female LR_neg 69 NA NA
## 5 Male HR_ASD 75 NA NA
## 6 Male HR_neg 173 NA NA
## 7 Male LR_ASD 2 8 3
## 8 Male LR_neg 102 NA NA
As a result, no objects are saved in R, making this code very efficient if you do not plan on using the aosi_data dataset for eany other analyses.
If you want to save the results of this code as a new dataset, just use new_data <- … as usual to save the output as a data object in R named new_data. This underscores how R is a language and has fundamental rules which allow your instructions to be understood by the software. If you can understand these rules and properties, you can apply them in general and understand how R is operating under the surface. This knowledge allows you to go beyond memorizing specific commands and examples, providing the ability to continually learn new techniques in R.
new_data <-
read.csv("Data/cross-sec_aosi.csv", na.strings=c(".", " ")) %>%
group_by(Gender, GROUP) %>%
summarise(sample_size=n(), mean_aosi_ts_V6=mean(V06.aosi.total_score_1_18), mean_aosi_ts_V12=mean(V12.aosi.total_score_1_18))
new_data
## # A tibble: 8 x 5
## # Groups: Gender [?]
## Gender GROUP sample_size mean_aosi_ts_V6 mean_aosi_ts_V12
## <fct> <fct> <int> <dbl> <dbl>
## 1 Female HR_ASD 20 NA NA
## 2 Female HR_neg 145 NA NA
## 3 Female LR_ASD 1 6 5
## 4 Female LR_neg 69 NA NA
## 5 Male HR_ASD 75 NA NA
## 6 Male HR_neg 173 NA NA
## 7 Male LR_ASD 2 8 3
## 8 Male LR_neg 102 NA NA
4.2.6 Editing factor variables: recode() and relevel()
Recall categorical variables are referred to as factor variables in R. Every factor variable also includes a set of levels which is an vector containing the group labels, in a specific order. Often times you will want to either 1) change these group levels or 2) change the order of these group levels. Changing the levels can be done easily with the function fct_recode() and changing the order can be done easily with the function fct_relevel(). These functions are actually in the forcats package.
To use fct_recode(), the first argument is the factor variable vector, and then you specify the old and desired new value for the group levels you wish to change using “new”=“old” where old is the old level and new is the new level. You then separate each level change with commas. We take the AOSI data and change “HR_neg” to “HR_NoASD” and “LR_neg” to “LR_NoASD” below. Note that we only print the first 10 observations.
library(forcats)
fct_recode(aosi_data$GROUP, "HR_NoASD"="HR_neg", "LR_NoASD"="LR_neg")[1:10]
## [1] HR_ASD HR_ASD HR_ASD HR_ASD HR_NoASD HR_NoASD HR_ASD
## [8] HR_ASD HR_NoASD HR_ASD
## Levels: HR_ASD HR_NoASD LR_ASD LR_NoASD
To permanently change the GROUP variable in the dataset, we can use the pipe. Since we are changing the variable, we actually have to use mutate() to have R overwrite the old GROUP variable.
aosi_data_2 <- aosi_data %>%
mutate(GROUP=fct_recode(GROUP, "HR_NoASD"="HR_neg", "LR_NoASD"="LR_neg"))
aosi_data_2$GROUP[1:10]
## [1] HR_ASD HR_ASD HR_ASD HR_ASD HR_NoASD HR_NoASD HR_ASD
## [8] HR_ASD HR_NoASD HR_ASD
## Levels: HR_ASD HR_NoASD LR_ASD LR_NoASD
# Same result
Now, we rearrange these levels using fct_relevel. Note that this does NOT change the actual values of the variable in data, simply what order R considers when using these group levels. This order is important when sorting the dataset, plotting, running linear regression and ANOVA analyses, and in other contexts. To use fct_relevel(), the first argument is the factor variable vector, then list the levels in increasing order using strings. For example, using the new group levels from fct_recode, we change the diagnosis group label order to start at Low Risk: Negative and end at High Risk: ASD.
fct_relevel(aosi_data_2$GROUP, "LR_NoASD", "LR_ASD", "HR_NoASD", "HR_ASD")[1:10]
## [1] HR_ASD HR_ASD HR_ASD HR_ASD HR_NoASD HR_NoASD HR_ASD
## [8] HR_ASD HR_NoASD HR_ASD
## Levels: LR_NoASD LR_ASD HR_NoASD HR_ASD
# Notice the Levels: row
To change the variable permanently in the dataset, we again need to use mutate() and can use the pipe for cleaner looking code.
aosi_data_3 <- aosi_data_2 %>%
mutate(GROUP=fct_relevel(GROUP, "LR_NoASD", "LR_ASD", "HR_NoASD", "HR_ASD"))
aosi_data_3$GROUP[1:10]
## [1] HR_ASD HR_ASD HR_ASD HR_ASD HR_NoASD HR_NoASD HR_ASD
## [8] HR_ASD HR_NoASD HR_ASD
## Levels: LR_NoASD LR_ASD HR_NoASD HR_ASD
# Same result