Chapter 15 Appendix 1 – Selected additional R code and resources
This section contains code, tips, and resources that are not elsewhere in the book and that may be useful at times. Full explanations are not provided in most cases. This appendix is perpetually under construction and is not comprehensive in any way. Please report feedback, suggestions, and links that no longer work to Anshul at akumar@mghihp.edu.
This chapter is not part of the required materials for HE-902.
15.1 General (non-R) quantitative analysis topics
15.1.1 Other selected resources on quantitative analysis or R
I get asked sometimes about whether Python or R is better. There is no single right answer. Below are my personal thoughts about choosing a data analysis platform. It is mostly written as a comparison of R and Python but it could apply to other platforms and tools as well.
Both R and Python are equally capable of doing everything needed for healthcare and social science quantitative analysis. I say this based on the needs of my coworkers and students as well as my own projects.
Some people prefer one over the other. I think that’s fine. People should use whatever they’re comfortable with that can get the job done.
Some people might try to get you to agree with them that one is better than the other. I personally try to avoid getting involved in that kind of thinking. It’s irrelevant. It’s like asking if apples or oranges are better. Both are fine. If Oscar prefers oranges and Ann prefers apples, who cares? Oscar eating oranges isn’t hurting Ann. Ann eating apples isn’t hurting Oscar. There’s no need for us to choose just one side. Of course, it might not hurt for Oscar to have some familiarity with apples, in case he ever goes to a picnic planned by Ann. And it might not hurt for Ann to have some exposure to oranges, in case she ever goes over to Oscar’s house for dinner. Translation: If one data analysis team works in R and the other works in Python, and now they have to collaborate on a project, it could be useful if each data team has some familiarity with the other team’s language, so that they can work together. But even if they don’t have much experience with the other’s language/platform, it’s okay. Most of this type of data analysis is just running calculations and manipulations on data in spreadsheets. The two teams can discuss what they did to the data with each other and then each do further work in their preferred language/platform.
Note that there are lots of platforms that we’re not even discussing here: SPSS, Stata, SAS, Matlab, Mathematica, and many more. That’s fine. For our purposes, they all mostly do the same thing and you don’t need to worry about learning them all; just learn the one that help you with your work.
The building blocks for each language or platform are similar to the building blocks for the others. For example, if you can code at an intermediate level in R, you can probably learn to code at an intermediate level in Python pretty quickly. The actual code you will write will look a little bit different, but the logic is all the same.
15.2 Manipulating data or other items
15.2.1 Extract cells from a dataframe
General form:
NameOfDataFrame[RowToExtract,ColumnToExtract]
Examples
Using row and column names:
n <- mtcars["Valiant","hp"]n
## [1] 105
Using row and column numbers:
n <- mtcars[6,4]n
## [1] 105
Using a combination of names and numbers:
mtcars[6,"hp"]
## [1] 105
mtcars["Valiant",4]
## [1] 105
15.2.2 Extract information from a lookup table or database
Let’s say we want to look up median incomes for zip codes from a data frame called zipcodes and put those incomes into another dataframe called demographics.
15.2.7 Separate/split a variable into multiple dummy variables
d <-data.frame(person =c("Audi","Broof","Chruuma","Deenolo", "Eeman"),gender=c("A","A","B","B","B"), IceCreamFlavorsYouLikeSelectAllApply =c("chocolate","strawberry", "chocolate,vanilla","strawberry,vanilla,chocolate","vanilla,other") )d
## person gender IceCreamFlavorsYouLikeSelectAllApply
## 1 Audi A chocolate
## 2 Broof A strawberry
## 3 Chruuma B chocolate,vanilla
## 4 Deenolo B strawberry,vanilla,chocolate
## 5 Eeman B vanilla,other
Result not shown:
# Split the values in the column by commasflavors <-strsplit(d$IceCreamFlavorsYouLikeSelectAllApply, ",")# Get all unique ice cream flavorsunique_flavors <-unique(unlist(flavors))# Create dummy variables for each unique flavorfor (flavor in unique_flavors) {# Create a new column with the flavor name and initialize with 0 d[[flavor]] <-0# Set the value to 1 if the flavor is in the list for that row d[[flavor]] <-as.integer(sapply(flavors, function(x) flavor %in% x))}d
The code above to create dummy variables was written by OpenAI’s GPT-3 model and accessed on 24 Sep 2023. This prompt was used: in r, i have a dataframe called d. there is a variable called IceCreamFlavorsYouLikeSelectAllApply in which each person (row in in the dataframe) might have one or multiple answers. some might say just ‘chocolate’ or ‘vanilla’ but others might say ’chocolate,vanilla,strawberry,other” or just “chocolate,strawberry”, for example. I want to create separate dummy variables in dataframe d for each of the ice cream flavors.
We get the code below with the following follow-up prompt to GPT-3: rewrite the code above so that each new variable will be called flavor_chocolate instead of just chocolate.
# Split the values in the column by commasflavors <-strsplit(d$IceCreamFlavorsYouLikeSelectAllApply, ",")# Get all unique ice cream flavorsunique_flavors <-unique(unlist(flavors))# Create dummy variables with names like 'flavor_chocolate', 'flavor_vanilla', etc.for (flavor in unique_flavors) {# Create a new column with the flavor name and initialize with 0 d[[paste0("flavor_", flavor)]] <-0# Set the value to 1 if the flavor is in the list for that row d[[paste0("flavor_", flavor)]] <-as.integer(sapply(flavors, function(x) flavor %in% x))}d
## person gender IceCreamFlavorsYouLikeSelectAllApply flavor_chocolate
## 1 Audi A chocolate 1
## 2 Broof A strawberry 0
## 3 Chruuma B chocolate,vanilla 1
## 4 Deenolo B strawberry,vanilla,chocolate 1
## 5 Eeman B vanilla,other 0
## flavor_strawberry flavor_vanilla flavor_other
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 1 1 0
## 5 0 1 1
Follow-up prompt: how do i calculate summary counts and percentages of how many people like each flavor?
# Calculate summary counts for each flavorflavor_counts <-colSums(d[paste0("flavor_", unique_flavors)])# Calculate total number of respondentstotal_respondents <-nrow(d)# Calculate percentagesflavor_percentages <- (flavor_counts / total_respondents) *100# Create a summary dataframesummary_df <-data.frame(Flavor = unique_flavors, Counts = flavor_counts, Percentages = flavor_percentages)# Print the summary dataframeprint(summary_df)
Follow-up prompt (after a few attempts): we now have a dataframe called d with variables for gender as well as flavor_chocolate dummy variables and similar variables for each flavor. for each flavor, we need to know the count and what percentage of people in each gender group is equal to 1
# Load the dplyr package if not already loaded# install.packages("dplyr")library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Assuming you have a 'gender' column in your dataframe 'd'# Calculate the count and percentage of each flavor for each gendersummary_df <- d %>%group_by(gender) %>%summarize(across(starts_with("flavor_"), ~sum(. ==1), .names ="Count_{.col}")) %>%mutate(across(starts_with("Count_"), ~. /sum(.), .names ="Percentage_{.col}"))# Rename columnscolnames(summary_df) <-c("Gender", paste0("Count_", unique_flavors), paste0("Percentage_", unique_flavors))# Print the summary dataprint(summary_df)
Now we can transpose the result to make it easier to read (we also have to change gender into the column names):
summary_df <-t(summary_df)colnames(summary_df) <-unlist(summary_df[1, ])summary_df <- summary_df[-1, ] # Remove the first row (which is now the column names)
And here’s a version that’s possibly simpler, for flavor counts by gender (result not shown):
Let’s say you have one dataset called dtrain and another one called dtest. And you want to make sure that dtest has the same variables (columns) as dtrain.
The code below tells the computer to retain within dtestonly the variables that are in dtrain:
15.2.21 Remove observations with missing values (NA values) in a single column
Take the existing dataframe OldData and make a new dataframe called NewData which only contains the rows in OldData that do NOT have a missing value—meaning do not have NA—for a variable called Var1:
NewData <- OldData[which(!is.na(OldData$Var1)),]
NewData should be a version of OldData in which any observation (row) coded as NA for Var1 has been removed.
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## function (x, replace = NULL, warn_missing = TRUE)
## {
## if (!is.null(x) && !is.factor(x) && !is.character(x)) {
## stop("x is not a factor or a character vector.")
## }
## mapvalues(x, from = names(replace), to = replace, warn_missing = warn_missing)
## }
## <bytecode: 0x00000286eb1525d8>
## <environment: namespace:plyr>
likertFix <-function(df, questionPrefix="", variablePrefix=""){# df is the dataset you're starting with# questionPrefix is the characters in quotation marks that label relevant columns for conversion to numbers and totaling up# variablePrefix is a short word in quotation marks that you want to label the columns being fixed and added df.q <- df %>%select(starts_with(questionPrefix)) df.q[] <-lapply(df.q, function(x) substring(x,1,1)) df.q[] <-lapply(df.q, function(x) as.numeric(x)) df.q$total <-rowSums(df.q)colnames(df.q) <-paste(variablePrefix, colnames(df.q), sep =".") df <-cbind(df,df.q) df <- df %>%select(-starts_with(questionPrefix))return(df)}
Run the likertFix function on the initial data and save the result
d <- mtcars # example datad$carName <-rownames(d) # make a new variable called carName containing the row names of dView(d) # check if it worked
15.2.47 Assign row names based on a variable
# example datad <-data.frame(name =c("Kazaan","Kaalaa","Koona"), age =c(1,2,3), employed =c("no","no","no"))rownames(d) <- d$name # change row names to be the wt of each carView(d) # check if it worked
15.2.48 Format numbers as dollar currency
Let’s pretend that we want to take the numbers in the disp variable in the mtcars data and change them to USD (United States Dollars):
# Prepare datadf <- mtcarsdf$disp <- df$disp *100# Do the conversionif (!require(scales)) install.packages('scales') library(scales)df$disp <-dollar(df$disp)# Inspect resultsdfdf$dispView(df)
Above, I multiplied disp by 100 just to illustrate how the dollar function would work on large numbers.
15.2.49 Check if values in one variable are in another variable
## name age signedUp
## 1 Beebo 23 0
## 2 Brakaansha 45 0
## 3 Bettle 93 1
## 4 Bo 23 1
## 5 Erl 4 0
15.2.50 Unique identifiers
Make a unique identification (ID) number for observations or groups of observations.
15.2.50.1 Simple ID number by row
Add a new variable with the row number of each observation:
YourDataFrame$IDnum <-seq(1:nrow(YourDataFrame))
15.2.50.2 More complicated ID numbers
Sample data to practice with:
d <-data.frame(name =c("Aaaaaaron","Beela","Cononan","Duh","Eeena","Beela","Eeena","Beela"), age =c(1,2,3,4,1,2,1,2), occupation =c("hunter","vegan chef","plumber","plumbing destroyer","omnivore chef","vegan chef","omnivore chef","vegan chef"), day =c(1,1,1,1,15,15,15,15), month =c("January","February","March","April","January","February","March","April"), year =rep(2020,8), result =seq(1,8))d
## name age occupation day month year result
## 1 Aaaaaaron 1 hunter 1 January 2020 1
## 2 Beela 2 vegan chef 1 February 2020 2
## 3 Cononan 3 plumber 1 March 2020 3
## 4 Duh 4 plumbing destroyer 1 April 2020 4
## 5 Eeena 1 omnivore chef 15 January 2020 5
## 6 Beela 2 vegan chef 15 February 2020 6
## 7 Eeena 1 omnivore chef 15 March 2020 7
## 8 Beela 2 vegan chef 15 April 2020 8
Generate variable ID in dataset d containing a unique identification number for each person:
if (!require(udpipe)) install.packages('udpipe') library(udpipe)d$ID <-unique_identifier(d, c("name"))d
## name age occupation day month year result ID
## 1 Aaaaaaron 1 hunter 1 January 2020 1 1
## 2 Beela 2 vegan chef 1 February 2020 2 2
## 3 Cononan 3 plumber 1 March 2020 3 3
## 4 Duh 4 plumbing destroyer 1 April 2020 4 4
## 5 Eeena 1 omnivore chef 15 January 2020 5 5
## 6 Beela 2 vegan chef 15 February 2020 6 2
## 7 Eeena 1 omnivore chef 15 March 2020 7 5
## 8 Beela 2 vegan chef 15 April 2020 8 2
More sample data for practice:
d2 <-data.frame(name =c("Aaaaaaron","Beela","Cononan","Duh","Eeena","Fewe","Graam","Hiol"), number =c(1,1,1,1,0,0,0,0), color =c("green","brown","green","brown","green","brown","green","brown"))d2
## name number color
## 1 Aaaaaaron 1 green
## 2 Beela 1 brown
## 3 Cononan 1 green
## 4 Duh 1 brown
## 5 Eeena 0 green
## 6 Fewe 0 brown
## 7 Graam 0 green
## 8 Hiol 0 brown
Generate variable group in dataset d2 containing a unique identification number for each number-color pair:
if (!require(udpipe)) install.packages('udpipe') library(udpipe)d2$group <-unique_identifier(d2, c("number","color"))d2
## name number color group
## 1 Aaaaaaron 1 green 4
## 2 Beela 1 brown 3
## 3 Cononan 1 green 4
## 4 Duh 1 brown 3
## 5 Eeena 0 green 2
## 6 Fewe 0 brown 1
## 7 Graam 0 green 2
## 8 Hiol 0 brown 1
15.2.51 Within-group count of observations and group size, within-group ID number
If you have multiple observations in one group or for one person and you need to count each observation’s number within each group or person, the code below should help.
# Install and load the required packagelibrary(dplyr)# Create an example dataframedf <-data.frame(person =c("John", "John", "Mary", "Mary", "Mary", "Peter"),age =c(25, 30, 35, 40, 45, 50))# Create a new variable with the count of rows for each persondf <- df %>% dplyr::group_by(person) %>% dplyr::mutate(row_count =row_number()) %>%ungroup()# Print the modified dataframeprint(df)
The code and comments above were generated by ChatGPT on May 22 2023, with minor modifications by Anshul.
If you want to generate a group size variable instead, this might help:
# Install and load the required packagelibrary(dplyr)# Create an example dataframedf <-data.frame(person =c("John", "John", "Mary", "Mary", "Mary", "Peter"),age =c(25, 30, 35, 40, 45, 50))# Create a new variable with the count of rows for each persondf <- df %>% dplyr::group_by(person) %>% dplyr::mutate(row_count =n()) %>%ungroup()# Print the modified dataframeprint(df)
The code and comments above were generated by ChatGPT on May 22 2023.
15.2.52 Make new character variable or summary report based on other variables
Example data:
d <-data.frame(flavor =c("chocolate","vanilla","strawberry","other"),numberEaten =c(1,2,3,4))
d$joinedVariable <-paste0("Flavor and number: ",as.character(d$flavor)," ", as.character(d$numberEaten))
d
## flavor numberEaten joinedVariable
## 1 chocolate 1 Flavor and number: chocolate 1
## 2 vanilla 2 Flavor and number: vanilla 2
## 3 strawberry 3 Flavor and number: strawberry 3
## 4 other 4 Flavor and number: other 4
You might have data in which there are time stamps which you need to convert into a continuous variable that you can put into a regression. For example, maybe you have data in which each row (observation) is a patient and then you have a variable (column) for the date and time on which the patient came into the hospital. To analyze this data as a continuous variable, maybe you want to calculate how many seconds after midnight in each day a patient came in.
How can we convert a time to a number of seconds? There are helper functions in R that help us do this. Let’s start with an example:
if (!require(lubridate)) install.packages('lubridate')
## Warning: package 'lubridate' was built under R version 4.2.3
library(lubridate) # this package has the period_to_seconds function# example of what the data looks likeExampleTimestamp <-"01-01-2019 09:04:58"# extract just the time (remove the date)(ExampleTimeOnly <-substr(ExampleTimestamp,12,19))
## [1] "09:04:58"
# convert from time to number of seconds(TotalSeconds <-period_to_seconds(hms(ExampleTimeOnly)))
## [1] 32698
As you can see above, the time “09:04:59” was converted into 32698 seconds. But we only did it for a single stored value, ExampleTimestamp. How do we do it for the entire variable in a dataset? Let’s say you have a datset called d with a variable with a time stamp called timestamp and you want to make a new variable (column) in the dataset called seconds. Here’s how you can do it:
if (!require(lubridate)) install.packages('lubridate') library(lubridate)d$seconds <-period_to_seconds(hms(substr(d$timestamp,12,19)))
15.2.54 Making and modifying lists
This is not complete.
l <-list() # make empty listl <-append(l, "bob") # add something to the list
one <-1two <-c(1,2,3,4)three <-list("byron","anshul")four <-"bob"t <-list(one, two, three) # make a listt <-append(t, four) # add something to the list
15.2.55 Categorize variable into quantiles
Make a new variable called quintile that identifies each observation’s quintile for the variable mpg:
d <- mtcarsd$mpg[4] <-NA# create missing data for illustration onlylibrary(dplyr)d$mpg.quintile <-ntile(d$mpg, 5)
As an option, recode the new variable as a factor and label NA values as missing, so that the computer doesn’t know they are NA anymore (the NA observations won’t get thrown out of an analysis):
## name skill.x skill.y
## 1 A. Onlyleft pig latin <NA>
## 2 B. Onlyleft latin <NA>
## 3 C. Both pig farming speling
## 4 D. Both pig surgery boating
## 5 <NA> pig liberating gloating
## name skill.x skill.y
## 1 C. Both pig farming speling
## 2 D. Both pig surgery boating
## 3 <NA> pig liberating gloating
## 4 E. Onlyright <NA> reading
## 5 F. Onlyright <NA> writing
## name skill.x skill.y
## 1 A. Onlyleft pig latin <NA>
## 2 B. Onlyleft latin <NA>
## 3 C. Both pig farming speling
## 4 D. Both pig surgery boating
## 5 <NA> pig liberating gloating
## 6 E. Onlyright <NA> reading
## 7 F. Onlyright <NA> writing
merged <-merge(left, right, by="name", all =TRUE, incomparables =NA) # all.x or all.y also possible to do only left or right joins# change to all=F to only include observations that matchmerged
## name skill.x skill.y
## 1 A. Onlyleft pig latin <NA>
## 2 B. Onlyleft latin <NA>
## 3 C. Both pig farming speling
## 4 D. Both pig surgery boating
## 5 E. Onlyright <NA> reading
## 6 F. Onlyright <NA> writing
## 7 <NA> pig liberating <NA>
## 8 <NA> <NA> gloating
What if we wanted to make a new variable called dataSource which identifies who all came from which dataset? I’m not sure of the very best way to do it, but here’s one way that seems to work:
left$inLeft <-1right$inRight <-1merged2 <-merge(left, right, by="name", all =TRUE, incomparables =NA)merged2$dataSource <-NAmerged2$dataSource[merged2$inLeft==1& merged2$inRight==1] <-"Present in both left and right"merged2$dataSource[merged2$inLeft==1&is.na(merged2$inRight)] <-"Present in left only"merged2$dataSource[is.na(merged2$inLeft) & merged2$inRight==1] <-"Present in right only"merged2
## name skill.x inLeft skill.y inRight
## 1 A. Onlyleft pig latin 1 <NA> NA
## 2 B. Onlyleft latin 1 <NA> NA
## 3 C. Both pig farming 1 speling 1
## 4 D. Both pig surgery 1 boating 1
## 5 E. Onlyright <NA> NA reading 1
## 6 F. Onlyright <NA> NA writing 1
## 7 <NA> pig liberating 1 <NA> NA
## 8 <NA> <NA> NA gloating 1
## dataSource
## 1 Present in left only
## 2 Present in left only
## 3 Present in both left and right
## 4 Present in both left and right
## 5 Present in right only
## 6 Present in right only
## 7 Present in left only
## 8 Present in right only
15.3.2 Identify number of unique values in a variable
List all unique values:
unique(mtcars$cyl)
## [1] 6 4 8
Calculate number of unique values:
length(unique(mtcars$cyl))
## [1] 3
15.3.3 Compare elements of two vectors
Let’s say we have two vectors called first and second and we want to compare the elements in each vector. Here are the vectors:
first <-c("a","b","c","d")second <-c("c","d","e","f")
This is how we can see which elements are common to the two vectors:
intersect(first, second)
## [1] "c" "d"
Here are the elements that are only in first and not in second:
setdiff(first,second)
## [1] "a" "b"
And here are the elements that are only in second and not in first:
setdiff(second,first)
## [1] "e" "f"
15.3.4 Basic descriptive table
There are many ways to make a concise descriptive statistics table. A few options are shown below.
15.3.4.1 Row and column totals
Make some data:
d <-data.frame(variableWeCareAbout =c("red","black","blue","blue","red"))d$Frequency <-"Frequency"d
## variableWeCareAbout Frequency
## 1 red Frequency
## 2 black Frequency
## 3 blue Frequency
## 4 blue Frequency
## 5 red Frequency
Make a table with a total count:
addmargins(table(d$variableWeCareAbout, d$Frequency),1, FUN =list(TOTAL=sum) )
##
## Frequency
## black 1
## blue 2
## red 2
## TOTAL 5
Above, we trick R into displaying the table vertically instead of its default horizontal format, by adding a new variable called Frequency that has only one level. I’m not sure if there’s a better way to do this, but I like the way this comes out looking.
In the addmargins function, change the 1 to a 2 to get row totals instead of column totals.
15.3.4.2table1 function
This option works well with HTML outputs but not PDF or Word, as far as I know.
Basic table:
if(!require(table1)) install.packages("table1")library(table1)table1(~ mpg + hp +as.factor(am), data=mtcars, topclass="Rtable1-zebra")
Grouped table:
if(!require(table1)) install.packages("table1")library(table1)table1(~ mpg + hp +as.factor(am) | cyl, data=mtcars, topclass="Rtable1-zebra")
The code below looks like a lot of lines, but you should be able to just copy and paste it without making any changes, other than replacing mtcars.partial with the name of your own dataset.
Default descriptive statistics table from the pastecs package:
Below, we make a new dataset or a descriptive table in which rows are created based on means of all variables for selected groups.
Prepare data:
d <-data.frame(name =c("Aaron","Baron","Caron","Daron","Earon","Faron","Aaron","Baron","Caron","Daron","Earon","Faron"),performance1 =c(1,7,3,8,3,7,3,7,3,6,8,3),performance2 =seq(1:12)+30,group =c("A","A","A","B","B","B","A","A","A","B","B","B"),time =c(rep(1,6),rep(2,6)) )d
## name performance1 performance2 group time
## 1 Aaron 1 31 A 1
## 2 Baron 7 32 A 1
## 3 Caron 3 33 A 1
## 4 Daron 8 34 B 1
## 5 Earon 3 35 B 1
## 6 Faron 7 36 B 1
## 7 Aaron 3 37 A 2
## 8 Baron 7 38 A 2
## 9 Caron 3 39 A 2
## 10 Daron 6 40 B 2
## 11 Earon 8 41 B 2
## 12 Faron 3 42 B 2
Calculate mean of all variables for each group-time pair:
GroupMeanData <-aggregate(.~group+time, d, mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
GroupMeanData
## group time name performance1 performance2
## 1 A 1 NA NA NA
## 2 B 1 NA NA NA
## 3 A 2 NA NA NA
## 4 B 2 NA NA NA
Above, group and time are the grouping variables according to which all observations will be aggregated. Any number of grouping variables can be placed after the ~.
Remove the now-unnecessary name variable:
GroupMeanData$name <-NULLGroupMeanData
## group time performance1 performance2
## 1 A 1 NA NA
## 2 B 1 NA NA
## 3 A 2 NA NA
## 4 B 2 NA NA
15.3.10 Principal Component Analysis (PCA)
These, I think, are the most important steps, below.
Load the data:
d <- mtcars
Run PCA on selected variables (after they have been scaled and centered):
pca1 <-prcomp(d[c("mpg","cyl","am","gear","wt","disp")], scale =TRUE, center =TRUE)
The code below makes a scatterplot with two numeric variables (mpg plotted against wt, from the mtcars dataset), color coded according to a different variable (am):
plot(mtcars$wt, mtcars$mpg, pch=19, col=factor(mtcars$am))legend("topright", title="am",legend =levels(factor(mtcars$am)), pch =19, col =factor(levels(factor(mtcars$am))))
Sometimes, I find that the legend will annoyingly overlap with the plotted data. One way to fix this is to add an xlim argument to the plot command:
Above, the legend is now in the top-left corner, where it would get in the way of some of our plotted points. To make space for the legend and plotted points to display separately, we tell the computer to start plotting the x-axis at 0 and end at 6. Once we do this, the legend is well to the left of the plotted points.
mpg plotted against wt in mtcars, with coloring based on am, with am treated as a categorical variable.
# prepare data for this example onlyd <- mtcarsd$ambinary <-as.factor(d$am) if (!require(ggplot2)) install.packages('ggplot2')library(ggplot2)ggplot(d,aes(wt,mpg,colour=ambinary))+geom_point()
All you need to do is make the following changes to the line of code ggplot(d,aes(wt,mpg,colour=ambinary))+geom_point():
change d to your own data set’s name
change wt to your desired X variable
change mpg to your desired Y variable
change ambinary to your desired grouping (coloring) variable
We can also add connecting lines to the plot above:
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
15.4.4.2 Base R
d <- mtcarshgA <-hist(d[which(d$am==1),]$mpg, plot =FALSE)hgB <-hist(d[which(d$am==0),]$mpg, plot =FALSE)c1 <-rgb(173,216,230,max =255, alpha =80, names ="lt.blue")c2 <-rgb(255,192,203, max =255, alpha =80, names ="lt.pink")plot(hgA, col = c1, xlim =c(0,40), ylim =c(0,10))plot(hgB, col = c2, xlim =c(0,40), ylim =c(0,10), add =TRUE)
15.4.5 Line plot by groups
Example data:
d <-data.frame(country =c(rep("A",3),rep("B",3),rep("C",3)),score =c(5,2,3,3,5,2,4,3,4),time =c(1,2,3,1,2,3,1,2,3))d
## country score time
## 1 A 5 1
## 2 A 2 2
## 3 A 3 3
## 4 B 3 1
## 5 B 5 2
## 6 B 2 3
## 7 C 4 1
## 8 C 3 2
## 9 C 4 3
keywords: line chart, line graph, line segment, scatterplot with lines
15.4.6 Side-by-side grouped boxplots
In separate plots:
library(ggplot2)ggplot(mtcars, aes(x=as.factor(cyl), y=mpg))+geom_boxplot()+facet_wrap(.~as.factor(am), scales ="free")+labs(title="Miles per gallon by number of cylinders and transmission type", caption="cyl groups indicate number of cylinders; transmission type: 0 = automatic, 1 = manual")
library(ggplot2)ggplot(mtcars, aes(x=factor(cyl), y=mpg, color =factor(am)))+geom_boxplot()+theme( legend.position ="right" )+ylim(0,40)+labs(title="Miles per gallon by number of cylinders and transmission type", caption="transmission type: 0 = automatic, 1 = manual", x ="number of cylinders", y="miles per gallon", color="Transmission")
Different colors:
library(ggplot2)ggplot(mtcars, aes(x=factor(cyl), y=mpg, fill =factor(am)))+geom_boxplot()+theme( legend.position ="right" )+ylim(0,40)+labs(title="Miles per gallon by number of cylinders and transmission type", caption="transmission type: 0 = automatic, 1 = manual", x ="number of cylinders", y="miles per gallon", fill="Transmission")
Different colors, grayscale:
library(ggplot2)ggplot(mtcars, aes(x=factor(cyl), y=mpg, fill =factor(am)))+geom_boxplot()+theme( legend.position ="right" )+ylim(0,40)+labs(title="Miles per gallon by number of cylinders and transmission type", caption="transmission type: 0 = automatic, 1 = manual", x ="number of cylinders", y="miles per gallon", fill="Transmission")+scale_fill_grey(start =0.8, end =0.5)
15.4.7 Flowchart or DAG
15.4.7.1 DiagrammeR package
This is my preferred way to make a flowchart or DAG.
I consider ggdag to be more primitive than DiagrammeR, but both might be useful depending on the situation. ggdag is likely faster, which could occasionally be useful. An example is below.
if (!require(ggdag)) install.packages('ggdag')
## Warning: package 'ggdag' was built under R version 4.2.3
15.4.8 Dumbbell plot for pre-post or other paired data
How can we visualize pre-post data when we don’t have too many observations?
# example pre-post dataname<-c("Zeld","Xya","Blork","Weeda","Mobant")pre<-c(5,6,4,5,7)post<-c(6,8,4,9,5)age<-c(10,10,11,11,12)d <-data.frame(name,pre,post,age)d
## name pre post age
## 1 Zeld 5 6 10
## 2 Xya 6 8 10
## 3 Blork 4 4 11
## 4 Weeda 5 9 11
## 5 Mobant 7 5 12
if (!require(dumbbell)) install.packages('dumbbell')
## Loading required package: dumbbell
## Warning: package 'dumbbell' was built under R version 4.2.3
library(ggplot2)ggplot(mtcars, aes(as.factor(am), as.factor(cyl), fill= mpg)) +geom_tile() +scale_fill_gradient(low="white", high="red") +labs(caption ="Average miles per gallon, by transmission and cylinders")
If you ran a regression and saved the regression result as regobj, you can of course see the results whenever you want by running summary(regobj) as you already know. But you can also see the 95% confidence intervals for the estimated coefficients by running confint(regobj).
To measure or count the duration of how long it takes for R to run something for you, you can put the code
(start <- Sys.time())
and
(end <- Sys.time())
(duration <- end-start)
around the code you want to measure/count.
Let’s say you want to load a data file into R using the code d <- read.csv("SomeData.csv") and you want to measure how long it takes to load the file. You would write this code:
Above, the stored object duration contains the amount of time it took to load the data file.
15.6.14 Save R objects to a folder on the computer
Let’s say you have an R object called MyRobject, which could be a dataframe, saved regression model, table, or anything else.
You can save that R object to your computer like this (it will go to the working directory or directory where your code file is):
saveRDS(MyRobject, file ="ChooseAname.rds")
With the code above, MyRobject will be saved as a file on your computer with the name ChooseAname.rds. You can of course change it to a name other than ChooseAname.rds.
Later, when you want to open the saved object again in RStudio, you can run this code:
MyRobjectFromBefore <-readRDS("ChooseAname.rds")
Above, we load the object saved in ChooseAname.rds into R, where it will have the name MyRobjectFromBefore. Of course you can choose to call it anything other than MyRobjectFromBefore.
Here is the code above once again, for easy copying and pasting:
15.6.15 Comment out portions of a single R command
In the examples below, you can add the # in front of any line to rapidly remove a variable from the code. You can then delete the # to once again include that variable.
15.6.15.1 Adding variables together example
Initial code:
d <- mtcarsd$newVariable = (d$cyl+ d$hp+ d$mpg )
Easily comment out the mpg variable:
d <- mtcarsd$newVariable = (d$cyl+ d$hp# + d$mpg )
15.6.15.2 Regression formula example
Initial code:
r1 <-lm(mpg~cyl+am+hp+disp+drat ,mtcars)
Easily remove the hp and drat variables:
r1 <-lm(mpg~cyl+am# +hp+disp# +drat ,mtcars)
15.6.16 Get the name of an object or dataframe as a string
Below, the name of the dataset mtcars will be printed out as a string:
15.7.2 Page breaks to restart content on a new page
To stop content from appearing on one page and continue on a brand new page (what we would call a page break in a word processor), write the following into your RMarkdown document where you would like the break to happen.
\newpage
Put it in the plain text portion of your RMarkdown file, where you put sentences that you write. Do NOT put it into a code chunk or other field of any kind.
I often use this if I want to make sure that a table or other output appears entirely on one page.
15.7.3 Opening links in a new window
To insert a hyperlink (clickable link that takes the reader/user somewher) in R Markdown the usual way, you write the following into your plain text portion of your R Markdown file:
[List of helicopter prison escapes, article on Wikipedia](https://en.wikipedia.org/wiki/List_of_helicopter_prison_escapes)
Above, when the user clicks on the link, they will go to the link’s destination in the same browser tab/window that they are already in.
If you add {target="_blank"} to the end of the link syntax, the new link will open in a new tab/window on the user’s device:
Paste the two versions above into your own R Markdown file (in the plain text part, not in a code chunk), and see what happens when you knit it!
15.7.4 YAML headers examples
You can try some of these YAML headers at the top of your R Markdown documents to see what they make your knitted file look like. The ones below are some of my favorites. There are—of course—infinite other possibilities.
HTML with automatic date, code folding, table of contents up to 4th level:
---title:"Problems with Bugs Bunny"subtitle:"This is not real"author:"Duck, Daffy"date:"Version -- `r format(Sys.time(), '%d %B %Y')`"output: html_document: code_folding: hide toc: true toc_depth:4---
Another HTML example:
title:'How to write good titles'author:"J.K."date:"Date -- `r format(Sys.time(), '%d %B %Y')`"output: html_document: toc: yes toc_depth:'3' df_print: paged number_sections: yes
readthedown HTML (click Knit -> Knit to readthedown):
---title:"The argument against wainscoting"author:"Compiled by Anshul Kumar"date:"March"output: rmdformats::readthedown: toc_depth:3 use_bookdown: true---
PDF with automatic date, numbered sections, and table of contents up to level 3:
---title:"This title has five words"author:"Prepared by Anshul Kumar"date:"Version -- `r format(Sys.time(), '%d %B %Y')`"output: pdf_document: number_sections: yes toc: yes toc_depth:3---
Word document with automatic date, table of contents, and numbered sections:
---title:'Supplementary Content: Full analysis and results'author:"Horticulture working group"date:"`r format(Sys.time(), '%d %B %Y')`"output: word_document: toc: yes number_sections: yes toc_depth:'3'---
slidy presentation with footer and slightly decreased font:
---title:"Meat-based vegetables: a rebuttal against plant-based diets"subtitle:"just kidding"author:"Anshul Kumar"date:"Version -- `r format(Sys.time(), '%d %B %Y')`"output: slidy_presentation: duration:15 footer:"This text will be visible on all slides!" highlight: espresso font_adjustment:-1---
ioslides presentation:
---title:"Spoons: useful little bowls on sticks"subtitle:"Why you should ditch forks and use more spoons"author:"There is no rule that my name has to go here"date:"Click here and then use arrow keys to advance slides."output: ioslides_presentation: highlight: espresso smaller: true---
15.7.5 Two columns or multiple columns
This only works for HTML outputs, as far as I know. Within your R Markdown file (not in a code chunk), put:
This text is not in multiple columns. Below, we initiate multiple columns.
:::: {style="display: flex;"}
::: {}
This text will go in the left column.
This text will also go in the left column.
:::
::: {}
<!-- optional empty middle column for spacing; sometimes I put something here to create space between the other two columns -->
:::
::: {}
This text will go in the right column.
This text will also go in the right column.
:::
::::
Now the multiple columns are over. This text will appear normally, without columns involved.
In general, all commands should go into your RMarkdown file which is likely in the top-left of your RStudio screen. This way, you have them saved for later when you want to replicate your work or copy the code you used to use again with different data.
However, there are at least two exceptions. The following two commands should never be put in the RMarkdown file and only be run directly in the console:
View(YourDataName) – This is the command you use to view your data in spreadsheet form, within R.
file.choose() – This is the command you can use to get the exact file path of a file or folder.
These commands cannot go into your RMarkdown file. Most of you will “Knit” your RMarkdown file into a PDF file to submit/publish your work. But the View command requires a separate window to open in RStudio to show you a spreadsheet. This separate spreadsheet window can’t open in a PDF! You wouldn’t want your entire dataset to show up in the PDF in spreadsheet form, would you? No. You just want the results of the other commands you run to show up. That’s why you get an error if you include the View command in your RMarkdown file and you try to Knit it.
15.7.8 Significant figures in output (number of decimal places)
The following command can be used to potentially control how many decimal places are displayed in your output.
A common error in R is: Error in plot.new() : figure margins too large
Running this code might fix it:
par(mar=c(1,1,1,1))
You can read more about this issue by clicking here.
15.8.2 rlang or xfun package version error
I recommend you read this entire section before you decide what solution to use.
This is an error I have seen multiple times, when trying to load packages other than rlang. For example, I might try to run:
library(somePackage)
And then I’ll see something like this:
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
namespace ‘rlang’ 0.4.12 is already loaded, but >= 1.0.2 is required
The error is not always written exactly as it is above, but it usually is telling you that your version of the package rlang is outdated.
Resolving this error is not always easy. The following procedure has worked for me before:
Open R alone, outside of RStudio.
Remove the existing rlang package: remove.packages("rlang")
Make sure it is removed: run library(rlang) and you should get an error message.
Re-install rlang: install.packages("rlang")
(might or might not be needed) Save all of your open work in R or RStudio and then run: q() to start a fresh session of R.
Another approach that has worked for me is:
Run .libPaths().
Go to the resulting folder(s).
Manually delete the folders for the package causing trouble.
Reinstall the package causing trouble like usual: install.packages("packageCausingTrouble").
I do not have too much experience with this error and cannot guarantee that the procedure above will work. While doing the procedure above, you can check whether rlang is loaded or not and its current version by running sessionInfo() after each step.
Similar issues can also happen with the package xfun, in which case you can try the procedure above but replace rlang with xfun, like below:
remove.packages("xfun")
library(xfun) # should result in error
install.packages("xfun")
library(xfun) # should be successful this time
sessionInfo() # should show a newer version of the package
Sometimes, you can just run update.packages(ask = FALSE) and that can solve everything.
The solution above, in my RStudio version226 involves clicking on: Tools -> Global Options -> Code -> Diagnostics. Then uncheck items related to diagnostics, like the ones that say “show diagnostics…”.
Below is a way to completely remove everything related to R and RStudio from your computer and reinstall them:
Put all of the PDF files that you want to combine with each other into a single directory (folder) on your computer.227 Then, set that directory containing your PDF files as the working directory in R. Once everything is ready, run the code below:
Above, the images get rescaled to the size 200 by 200, and each of the individual images will display for 2 seconds each, because fps (frames per second) is 0.5.
Export the image as a GIF:
image_write(mygif, path ="My Favorite File Name.gif", format ="gif")getwd()
To force a video you made to show subtitles (captions) by default, add the following to the list of tags in the video: yt:cc=on.228
I have not tested this yet myself, at the time of writing.↩︎
RStudio 2022.07.1+554 “Spotted Wakerobin” Release (7872775ebddc40635780ca1ed238934c3345c5de, 2022-07-22) for Windows. Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.12.8 Chrome/69.0.3497.128 Safari/537.36.↩︎
You might find it useful to rename your files to order them the way you want. For example, you can add the number 1 to the start of the file name of the first file, 2 to the start of the file name of the second file.↩︎
15.6.15 Comment out portions of a single R command
In the examples below, you can add the
#
in front of any line to rapidly remove a variable from the code. You can then delete the#
to once again include that variable.15.6.15.1 Adding variables together example
Initial code:
Easily comment out the
mpg
variable:15.6.15.2 Regression formula example
Initial code:
Easily remove the
hp
anddrat
variables: