Chapter 15 Appendix 1 – Selected additional R code and resources

This section contains code, tips, and resources that are not elsewhere in the book and that may be useful at times. Full explanations are not provided in most cases. This appendix is perpetually under construction and is not comprehensive in any way. Please report feedback, suggestions, and links that no longer work to Anshul.

This chapter is not part of the required materials for HE-902.

15.1 General (non-R) quantitative analysis topics

15.1.1 Other selected resources on quantitative analysis or R

15.1.2 Where to find data

If you are looking for some datasets to practice with or even do a project with, here are some places you could look:

This list is not comprehensive. Please email Anshul with additional suggestions to add to this list.

15.1.3 Noteworthy datasets

15.1.4 Python or R?

I get asked sometimes about whether Python or R is better. There is no single right answer. Below are my personal thoughts about choosing a data analysis platform. It is mostly written as a comparison of R and Python but it could apply to other platforms and tools as well.

  • R and Python are equally capable of doing everything needed for healthcare and social science quantitative analysis. I say this based on the needs of my coworkers and students as well as my own projects.

  • Some people prefer one over the other. I think that’s fine. People should use whatever they’re comfortable with that can get the job done.

  • Some people might try to get you to agree with them that one is better than the other. I personally try to avoid getting involved in that kind of thinking. It’s irrelevant. It’s like asking if apples or oranges are better. Both are fine. If Oscar prefers oranges and Ann prefers apples, who cares? Oscar eating oranges isn’t hurting Ann. Ann eating apples isn’t hurting Oscar. There’s no need for us to choose just one side. Of course, it might not hurt for Oscar to have some familiarity with apples, in case he ever goes to a picnic planned by Ann. And it might not hurt for Ann to have some exposure to oranges, in case she ever goes over to Oscar’s house for dinner. Translation: If one data analysis team works in R and the other works in Python, and now they have to collaborate on a project, it could be useful if each data team has some familiarity with the other team’s language, so that they can work together. But even if they don’t have much experience with the other’s language/platform, it’s okay. Most of this type of data analysis is just running calculations and manipulations on data in spreadsheets. The two teams can discuss what they did to the data with each other and then each do further work in their preferred language/platform.

  • Note that there are lots of platforms that we’re not even discussing here: SPSS, Stata, SAS, Matlab, Mathematica, and many more. That’s fine. For our purposes, they all mostly do the same thing and you don’t need to worry about learning them all; just learn the one that helps you with your work.

  • The building blocks for each language or platform are similar to the building blocks for the others. For example, if you can code at an intermediate level in R, you can probably learn to code at an intermediate level in Python pretty quickly. The actual code you will write will look a little bit different, but the logic is all the same.

15.2 Manipulating data or other items

15.2.1 Extract cells from a dataframe

General form:

NameOfDataFrame[RowToExtract,ColumnToExtract]

Examples

Using row and column names:

n <- mtcars["Valiant","hp"]

n
## [1] 105

Using row and column numbers:

n <- mtcars[6,4]

n
## [1] 105

Using a combination of names and numbers:

mtcars[6,"hp"]
## [1] 105
mtcars["Valiant",4]
## [1] 105

15.2.2 Extract information from a lookup table or database

Let’s say we want to look up median incomes for zip codes from a data frame called zipcodes and put those incomes into another dataframe called demographics.

Make the data:

demographics <- data.frame(
  name = c("Aandy", "Bratfoy","Char","Deedee","Eena","Farthington"),
  occupation = c("architect","brewer","chocolatier","dentist","educator","financier"),
  # zip codes are stored as text so that leading zeros are preserved
  zipcode = c("02140","79901","36511","02140","36511","021400000")
  )

zipcodes <- data.frame(
  zipcode = c("02140","79901","36511","75001"),
  medIncome = c(119827,13110,14060, 73383)
)

Look up:

demographics <- dplyr::left_join(demographics, zipcodes, by="zipcode")
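The same kind of lookup can also be sketched in base R with match(), which may be handy when dplyr is not available. This is only an illustration under assumptions: the data frames people and incomes below are hypothetical and made up for this sketch, with zip codes stored as text.

```r
# Hypothetical example data (made up for this sketch)
people <- data.frame(
  name = c("Ann", "Oscar", "Pat"),
  zipcode = c("02140", "79901", "99999"),
  stringsAsFactors = FALSE
)

incomes <- data.frame(
  zipcode = c("02140", "79901"),
  medIncome = c(119827, 13110),
  stringsAsFactors = FALSE
)

# match() returns, for each zip code in people, its row position in incomes
# (or NA when there is no match)
rowInLookup <- match(people$zipcode, incomes$zipcode)

# Use those positions to pull the incomes into the people data frame
people$medIncome <- incomes$medIncome[rowInLookup]

people
```

As with a left join, rows without a match in the lookup table are kept and receive NA for the looked-up value.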

15.2.3 Add random noise to a variable

mtcarscopy <- mtcars

mtcarscopy$am <- mtcarscopy$am+runif(nrow(mtcarscopy),-1,1)

The jitter() function can also be useful for this:

mtcarscopy$am <- jitter(mtcarscopy$am)

15.2.4 Sum all variables into a new variable

Create a new variable called VarSum within dataset d which is the sum of all other variables in d (all variables in d must be numeric for this to work):

d$VarSum <- rowSums(d)

The computer will add together the values for each variable in each row.
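If the dataset also contains non-numeric variables, one possible approach is to sum across the numeric variables only. The sketch below uses a small hypothetical data frame (its names and values are assumptions made up for illustration).

```r
# Hypothetical example data with one non-numeric column
d <- data.frame(
  name = c("A", "B", "C"),
  x = c(1, 2, 3),
  y = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Identify which columns are numeric
numericColumns <- sapply(d, is.numeric)

# Sum across the numeric columns only; drop = FALSE keeps a data frame
# even if only one numeric column exists
d$VarSum <- rowSums(d[, numericColumns, drop = FALSE])

d
```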

15.2.5 Sum all values of a variable

For entire dataset:

colSums(mtcars)

For a single variable:

colSums(as.data.frame(mtcars$cyl))[[1]]

Above, colSums can be changed to rowSums to sum up rows instead of columns.
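For a single variable, the base R function sum() is a shorter route to the same total. The sketch below compares the two approaches on mtcars and also shows the na.rm argument; the small vector with a missing value is a made-up illustration.

```r
# Sum of one variable, two ways
viaColSums <- colSums(as.data.frame(mtcars$cyl))[[1]]
viaSum <- sum(mtcars$cyl)

# Both approaches should agree
viaColSums == viaSum

# na.rm = TRUE skips missing values instead of returning NA
sumIgnoringMissing <- sum(c(4, 6, NA, 8), na.rm = TRUE)
sumIgnoringMissing
```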

15.2.6 Join words together

Use the paste(...) function:

something <- "three"
somethingelse <- "word"
anotherthing <- "phrase"
allthethingswithspaces <- paste(something, somethingelse, anotherthing)

allthethingswithspaces
## [1] "three word phrase"
allthethingswithoutspaces <- paste(something, somethingelse, anotherthing, sep = '')

allthethingswithoutspaces
## [1] "threewordphrase"

More or fewer than three items, separated by commas, can be added to the list of items in the paste(...) function. For example:

evenmorethings <- paste(something, somethingelse, anotherthing, anotherthing,something,somethingelse,somethingelse, sep = '')

evenmorethings
## [1] "threewordphrasephrasethreewordword"
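Two related options may also be useful here: paste0(...), which joins items with no separator, and the collapse argument, which joins the elements of a single vector into one string. A minimal sketch, reusing the same three example words:

```r
something <- "three"
somethingelse <- "word"
anotherthing <- "phrase"

# paste0() joins items with no separator
joinedNoSpaces <- paste0(something, somethingelse, anotherthing)

# collapse joins the elements of one vector into a single string
wordVector <- c(something, somethingelse, anotherthing)
joinedWithCommas <- paste(wordVector, collapse = ", ")

joinedNoSpaces
joinedWithCommas
```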

15.2.7 Separate/split a variable into multiple dummy variables

d <- data.frame(person = c("Audi","Broof","Chruuma","Deenolo", "Eeman"),gender=c("A","A","B","B","B"), IceCreamFlavorsYouLikeSelectAllApply = c("chocolate","strawberry", "chocolate,vanilla","strawberry,vanilla,chocolate","vanilla,other") )

d
##    person gender IceCreamFlavorsYouLikeSelectAllApply
## 1    Audi      A                            chocolate
## 2   Broof      A                           strawberry
## 3 Chruuma      B                    chocolate,vanilla
## 4 Deenolo      B         strawberry,vanilla,chocolate
## 5   Eeman      B                        vanilla,other

Result not shown:

# Split the values in the column by commas
flavors <- strsplit(d$IceCreamFlavorsYouLikeSelectAllApply, ",")

# Get all unique ice cream flavors
unique_flavors <- unique(unlist(flavors))

# Create dummy variables for each unique flavor
for (flavor in unique_flavors) {
  # Create a new column with the flavor name and initialize with 0
  d[[flavor]] <- 0
  
  # Set the value to 1 if the flavor is in the list for that row
  d[[flavor]] <- as.integer(sapply(flavors, function(x) flavor %in% x))
}

d

The code above to create dummy variables was written by OpenAI’s GPT-3 model and accessed on 24 Sep 2023. This prompt was used: in r, i have a dataframe called d. there is a variable called IceCreamFlavorsYouLikeSelectAllApply in which each person (row in in the dataframe) might have one or multiple answers. some might say just ‘chocolate’ or ‘vanilla’ but others might say ’chocolate,vanilla,strawberry,other” or just “chocolate,strawberry”, for example. I want to create separate dummy variables in dataframe d for each of the ice cream flavors.

We get the code below with the following follow-up prompt to GPT-3: rewrite the code above so that each new variable will be called flavor_chocolate instead of just chocolate.

# Split the values in the column by commas
flavors <- strsplit(d$IceCreamFlavorsYouLikeSelectAllApply, ",")

# Get all unique ice cream flavors
unique_flavors <- unique(unlist(flavors))

# Create dummy variables with names like 'flavor_chocolate', 'flavor_vanilla', etc.
for (flavor in unique_flavors) {
  # Create a new column with the flavor name and initialize with 0
  d[[paste0("flavor_", flavor)]] <- 0
  
  # Set the value to 1 if the flavor is in the list for that row
  d[[paste0("flavor_", flavor)]] <- as.integer(sapply(flavors, function(x) flavor %in% x))
}

d
##    person gender IceCreamFlavorsYouLikeSelectAllApply flavor_chocolate
## 1    Audi      A                            chocolate                1
## 2   Broof      A                           strawberry                0
## 3 Chruuma      B                    chocolate,vanilla                1
## 4 Deenolo      B         strawberry,vanilla,chocolate                1
## 5   Eeman      B                        vanilla,other                0
##   flavor_strawberry flavor_vanilla flavor_other
## 1                 0              0            0
## 2                 1              0            0
## 3                 0              1            0
## 4                 1              1            0
## 5                 0              1            1

Follow-up prompt: how do i calculate summary counts and percentages of how many people like each flavor?

# Calculate summary counts for each flavor
flavor_counts <- colSums(d[paste0("flavor_", unique_flavors)])

# Calculate total number of respondents
total_respondents <- nrow(d)

# Calculate percentages
flavor_percentages <- (flavor_counts / total_respondents) * 100

# Create a summary dataframe
summary_df <- data.frame(Flavor = unique_flavors, Counts = flavor_counts, Percentages = flavor_percentages)

# Print the summary dataframe
print(summary_df)
##                       Flavor Counts Percentages
## flavor_chocolate   chocolate      3          60
## flavor_strawberry strawberry      2          40
## flavor_vanilla       vanilla      3          60
## flavor_other           other      1          20

Follow-up prompt (after a few attempts): we now have a dataframe called d with variables for gender as well as flavor_chocolate dummy variables and similar variables for each flavor. for each flavor, we need to know the count and what percentage of people in each gender group is equal to 1

# Load the dplyr package if not already loaded
# install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Assuming you have a 'gender' column in your dataframe 'd'
# Calculate the count and percentage of each flavor within each gender
summary_df <- d %>%
  group_by(gender) %>%
  summarize(
    # Count of people in each gender group who like each flavor
    across(starts_with("flavor_"), ~sum(. == 1), .names = "Count_{.col}"),
    # Percentage of people in each gender group who like each flavor
    across(starts_with("flavor_"), ~mean(. == 1) * 100, .names = "Percentage_{.col}")
  )

# Rename columns
colnames(summary_df) <- c("Gender", paste0("Count_", unique_flavors), paste0("Percentage_", unique_flavors))

# Print the summary data
print(summary_df)
## # A tibble: 2 × 9
##   Gender Count_chocolate Count_strawberry Count_vanilla Count_other
##   <chr>            <int>            <int>         <int>       <int>
## 1 A                    1                1             0           0
## 2 B                    2                1             3           1
## # ℹ 4 more variables: Percentage_chocolate <dbl>, Percentage_strawberry <dbl>,
## #   Percentage_vanilla <dbl>, Percentage_other <dbl>

Now we can transpose the result to make it easier to read (we also have to change gender into the column names):

summary_df <- t(summary_df)

colnames(summary_df) <- unlist(summary_df[1, ])
summary_df <- summary_df[-1, ]  # Remove the first row (which is now the column names)

And here’s a version that’s possibly simpler, for flavor counts by gender (result not shown):

d %>%
  group_by(gender) %>% 
  summarise(
    count = n(),
    `count chocolate` = sum(flavor_chocolate, na.rm = TRUE),
    `count strawberry` = sum(flavor_strawberry, na.rm = TRUE)
  )

15.2.8 Output plain text

something <- "three"
somethingelse <- "word"
anotherthing <- "phrase"
cat(something, somethingelse, sep = '')
## threeword
allthethingswithspaces <- paste(something, somethingelse, anotherthing)

cat(allthethingswithspaces)
## three word phrase

15.2.9 Select rows (observations) from a dataset

Select first 50 rows:

d.new <-  d.old[c(1:50),]

Select individual rows (in this example, select rows 3, 17, 22):

d.new <-  d.old[c(3,17,22),]

15.2.10 Select rows (observations) from a dataset using dplyr filter

library(dplyr)
mtcars %>% filter(cyl==4)

Save a copy:

mtcars.cyl4 <- mtcars %>% filter(cyl==4)

15.2.11 Remove rows (observations) from a dataset, based on negative criteria

Create a new data set called newdata.cars that contains the observations (rows) in mtcars that do NOT have cyl equal to 8.

newdata.cars <- mtcars[ which(mtcars$cyl!=8), ]

table(mtcars$cyl)
table(newdata.cars$cyl)

Above, we see that the 14 cars with 8 cylinders were removed from the original data.

15.2.12 Remove rows (observations) from a dataset, based on observation numbers

Remove first 50 rows:

d.new <-  d.old[-c(1:50),]

Remove selected rows (in this example, remove rows 3, 17, 22):

d.new <-  d.old[-c(3,17,22),]

Remove all rows except header (to create an empty dataset while preserving column names):

d.new <-  d.old[-c(1:nrow(d.old)),]

15.2.13 Subset based on multiple qualitative variable levels

d <- data.frame(name = c("Aabe","Bobay","Chock","Deela","Edweeeena","Foort","Gooba","Hi"),
                group = c("A","A","B","B","C","C","D","D")
                )

dpartial <- d[d$group %in% c("A", "B"), ]

15.2.14 Subset and rename selected variables (columns)

library(dplyr)
newData <- mtcars %>% dplyr::select(mileage=mpg, transmission=am)

15.2.15 Rename selected variables (columns)

library(dplyr)
newData <- mtcars %>% dplyr::rename(mileage=mpg, transmission=am)

15.2.16 Remove selected variables (columns) from a dataset

Remove variable Var1 from dataset d:

d$Var1 <- NULL

Remove variable mpg from the data set mtcars and save the new version of the data set as d:

d <- subset(mtcars, select = -mpg)

Remove variables mpg, cyl, and carb from the data set mtcars and save it as d:

d <- subset(mtcars, select = -c(mpg, cyl, carb))

Above, deleting the - sign in the select argument will keep (rather than remove) the listed variables and remove all others.

Using dplyr, remove variable Var1 from dataset d:

library(dplyr)
d <- d %>% select(-Var1)

Using dplyr, remove variables Var1 and Var2 from dataset d:

library(dplyr)
d <- d %>% select(-Var1,-Var2)

15.2.17 Remove variables with missing data

15.2.17.1 Option 1

This removes all variables that have one or more missing values:

NewData <- OldData[ , colSums(is.na(OldData)) == 0]

The code above was taken from the following resource:

15.2.17.2 Option 2

Remove all variables which contain only NA values:

NewData <- OldData[ , colSums(is.na(OldData)) < nrow(OldData)]

The code above was taken from the following resource:
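To see how Option 1 and Option 2 differ, here is a small sketch on a hypothetical data frame (the data are made up for illustration). Adding drop = FALSE is an extra precaution, not part of the code above; it keeps the result as a data frame even when only one variable remains.

```r
# Hypothetical example data: b has one missing value, c is entirely missing
OldData <- data.frame(
  a = c(1, 2, 3),
  b = c(4, NA, 6),
  c = c(NA, NA, NA)
)

# Option 1: keep only variables with zero missing values
noMissing <- OldData[, colSums(is.na(OldData)) == 0, drop = FALSE]

# Option 2: keep variables that have at least one non-missing value
notAllMissing <- OldData[, colSums(is.na(OldData)) < nrow(OldData), drop = FALSE]

names(noMissing)
names(notAllMissing)
```

Option 1 keeps only a, while Option 2 keeps both a and b and removes only c.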

15.2.18 Remove non-numeric variables from a dataset

Example data:

d <- data.frame(
  name = factor(c("Aronda","Baeoi","Chromp","Daroona")),
  age = c(23,45,56,67),
  citizenship = c("Tanzania","Nigeria","Mexico","France"),
  educationYears = c(1,2,3,4)
)

d
##      name age citizenship educationYears
## 1  Aronda  23    Tanzania              1
## 2   Baeoi  45     Nigeria              2
## 3  Chromp  56      Mexico              3
## 4 Daroona  67      France              4
str(d)
## 'data.frame':    4 obs. of  4 variables:
##  $ name          : Factor w/ 4 levels "Aronda","Baeoi",..: 1 2 3 4
##  $ age           : num  23 45 56 67
##  $ citizenship   : chr  "Tanzania" "Nigeria" "Mexico" "France"
##  $ educationYears: num  1 2 3 4

Remove non-numeric variables:

dNumericOnly <- d[,sapply(d, is.numeric)]

View new data:

dNumericOnly
##   age educationYears
## 1  23              1
## 2  45              2
## 3  56              3
## 4  67              4

Do something with the data that you couldn’t do before:

cor(dNumericOnly)
##                      age educationYears
## age            1.0000000      0.9827076
## educationYears 0.9827076      1.0000000

keywords: keep numeric variables, retain numeric variables

15.2.19 Identical variables in two datasets

Let’s say you have one dataset called dtrain and another one called dtest. And you want to make sure that dtest has the same variables (columns) as dtrain.

The code below tells the computer to retain within dtest only the variables that are in dtrain:

library(dplyr)
dtest <- dtest %>% select(names(dtrain))

15.2.20 Remove observations with missing values (NA values) in a single column

Take the existing dataframe OldData and make a new dataframe called NewData which only contains the rows in OldData that do NOT have a missing value—meaning do not have NA—for a variable called Var1:

NewData <- OldData[which(!is.na(OldData$Var1)),]

NewData should be a version of OldData in which any observation (row) coded as NA for Var1 has been removed.

15.2.21 Replace missing values (NA values) with 0

15.2.21.1 In an entire dataset

my_dataframe[is.na(my_dataframe)] <- 0

15.2.21.2 In one column of the dataset

my_dataframe["pages"][is.na(my_dataframe["pages"])] <- 0
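Here is a short worked sketch of both versions on a hypothetical dataset (the column names and values are made up for illustration). It works on copies so that the two results can be compared side by side.

```r
# Hypothetical example data with missing values in two columns
my_dataframe <- data.frame(
  title = c("Book A", "Book B", "Book C"),
  pages = c(120, NA, 300),
  chapters = c(NA, 8, 12),
  stringsAsFactors = FALSE
)

# Replace NA with 0 in the pages column only
oneColumn <- my_dataframe
oneColumn$pages[is.na(oneColumn$pages)] <- 0

# Replace NA with 0 everywhere
allColumns <- my_dataframe
allColumns[is.na(allColumns)] <- 0

oneColumn
allColumns
```

Keep in mind that replacing NA with 0 changes means, sums, and other statistics, so it is only appropriate when a missing value truly means zero.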

15.2.22 Combine or concatenate strings

s <- ""
s1 <- "something"
s2 <- "Else"
s <- paste(s, s1, s2, sep = "")

s
## [1] "somethingElse"

15.2.23 Search vector or variable for values

x <- c(1,2,3,4,5)
any(x>5)
## [1] FALSE
any(x<2)
## [1] TRUE

15.2.24 Search to see if a row is contained in a data frame

Let’s say we want to check if mtcars contains any rows with am = 1, gear = 5, and carb = 6:

library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
match_df(mtcars, data.frame(am=1, gear=5, carb=6))
## Matching on: am, gear, carb
##               mpg cyl disp  hp drat   wt qsec vs am gear carb
## Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

And now let’s test for something that doesn’t exist:

library(plyr)
match_df(mtcars, data.frame(am=1, gear=5, carb="bobb"))
## Matching on: am, gear, carb
##  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
## <0 rows> (or 0-length row.names)

15.2.25 Search for text strings in variables

Given a dataset like d below, identify which names do and do not contain any of the strings ee and a:

d <- data.frame(Name = c("Aweo","Beena","Cidu","Deleek","Erga", "Fymo","Henny"), LevelOfInterest = c(3,1,2,3,2,1,2))
searchStrings <- c("ee","a")

d$searchMatch <- grepl(paste(toupper(searchStrings),collapse="|"), toupper(d$Name))

d
##     Name LevelOfInterest searchMatch
## 1   Aweo               3        TRUE
## 2  Beena               1        TRUE
## 3   Cidu               2       FALSE
## 4 Deleek               3        TRUE
## 5   Erga               2        TRUE
## 6   Fymo               1       FALSE
## 7  Henny               2       FALSE

In each row in d, searchMatch is TRUE if Name contains at least one of the strings in searchStrings.
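As a possible alternative to converting everything to upper case, grepl() has an ignore.case argument. The sketch below rebuilds the Name column from the example and should give the same matches.

```r
d <- data.frame(
  Name = c("Aweo", "Beena", "Cidu", "Deleek", "Erga", "Fymo", "Henny"),
  stringsAsFactors = FALSE
)
searchStrings <- c("ee", "a")

# Build one pattern in which "|" means "or": "ee|a"
searchPattern <- paste(searchStrings, collapse = "|")

# ignore.case = TRUE makes the match case-insensitive
d$searchMatch <- grepl(searchPattern, d$Name, ignore.case = TRUE)

d
```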

15.2.26 Look up or search for a single item or value

mtcars['Fiat 128','mpg']
## [1] 32.4

15.2.27 Randomly reorder/resort data set

15.2.27.1 One-line version

Quick code to randomly reorder df and save it still as df:

df <- df[sample(1:nrow(df)), ]

15.2.27.2 More details

Re-sort myOriginalData and save it as myRandomlySortedData:

myOriginalData <- mtcars

rows <- sample(nrow(myOriginalData))
myRandomlySortedData <- myOriginalData[rows, ]

Check if it worked (results not shown):

head(myOriginalData)
head(myRandomlySortedData)

15.2.28 Sort/order data set by one or more variables

Sort dataframe oldData by Var1 from low to high:

newData <- oldData[order(oldData$Var1),]

Sort dataframe oldData by Var1 from high to low:

newData <- oldData[order(-oldData$Var1),]

Sort dataframe oldData by Var1 and then Var2, both from low to high:

newData <- oldData[order(oldData$Var1, oldData$Var2),]

Sort dataframe oldData by Var1 from low to high and then Var2 from high to low:

newData <- oldData[order(oldData$Var1, -oldData$Var2),]
  • The code above has not been tested for accuracy, as of July 22 2021.
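One way to check this kind of sorting code is to run it on a built-in dataset and then test the result. The sketch below uses mtcars, with cyl and mpg standing in for Var1 and Var2 (that substitution is an assumption made for illustration). Note that the minus sign only works for numeric variables.

```r
oldData <- mtcars

# Sort by cyl (low to high), then by mpg (high to low) within each cyl value
newData <- oldData[order(oldData$cyl, -oldData$mpg), ]

# Check 1: cyl never decreases from one row to the next
cylIsSorted <- !is.unsorted(newData$cyl)

# Check 2: within each cyl group, mpg never increases
mpgSortedWithinCyl <- all(tapply(newData$mpg, newData$cyl,
                                 function(x) !is.unsorted(rev(x))))

cylIsSorted
mpgSortedWithinCyl
```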

15.2.29 Create a rank variable based on another variable

In the dataset below, we want to take each student’s score and use that to determine their rank in the class.

(d <- data.frame(studentName = c("Aabe","Beebe","Cheech","Doola","Eena","Fon"), score = c(77,89,45,33,99,77)))
##   studentName score
## 1        Aabe    77
## 2       Beebe    89
## 3      Cheech    45
## 4       Doola    33
## 5        Eena    99
## 6         Fon    77
d$classRank <- rank(d$score)

d
##   studentName score classRank
## 1        Aabe    77       3.5
## 2       Beebe    89       5.0
## 3      Cheech    45       2.0
## 4       Doola    33       1.0
## 5        Eena    99       6.0
## 6         Fon    77       3.5

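Above, we now have a classRank variable which identifies how each student compares to the rest on score. Note that rank() assigns 1 to the lowest score, and tied scores (here, the two scores of 77) share the average of their ranks (3.5).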
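If the highest score should instead receive rank 1, one option is to rank the negated scores. The sketch below also uses ties.method = "min" so that tied students share the same whole-number rank; the variable name classRankHighFirst is made up for this illustration.

```r
d <- data.frame(
  studentName = c("Aabe", "Beebe", "Cheech", "Doola", "Eena", "Fon"),
  score = c(77, 89, 45, 33, 99, 77)
)

# Negating score makes the highest score rank first;
# ties.method = "min" gives tied students the same (best available) rank
d$classRankHighFirst <- rank(-d$score, ties.method = "min")

d
```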

15.2.30 Count the occurrence number of each subject or within each group

Below, we have a data frame d1:

d1 <- data.frame(subjectID = c("a","a","a","b","b"))

d1
##   subjectID
## 1         a
## 2         a
## 3         a
## 4         b
## 5         b

We want a new variable that numbers each subject’s records in order, so that the first time a subject appears is labeled 1, the second time is labeled 2, and so on:

library(dplyr)

d2 <- d1 %>%
      dplyr::group_by(subjectID) %>%
      dplyr::mutate(withinPersonRecordNumber = row_number()) %>%
      ungroup()

d2
## # A tibble: 5 × 2
##   subjectID withinPersonRecordNumber
##   <chr>                        <int>
## 1 a                                1
## 2 a                                2
## 3 a                                3
## 4 b                                1
## 5 b                                2

15.2.31 Count the frequency (number of times) that a single subject ID or variable value/level occurs

Below, we have a data frame d1:

d1 <- data.frame(subjectID = c("a","a","a","b","b"))

d1
##   subjectID
## 1         a
## 2         a
## 3         a
## 4         b
## 5         b

We want a new variable that counts how often each subject appears in the data:

library(dplyr)

d2 <- d1 %>%
  add_count(subjectID, name = "Frequency")

d2
##   subjectID Frequency
## 1         a         3
## 2         a         3
## 3         a         3
## 4         b         2
## 5         b         2

15.2.32 Change variable names to numbers

d <- mtcars
colnames(d) <- seq_len(ncol(d))

15.2.33 Select start of variable name

Select variables that start with the characters StartOfVar1 or apple:

library(dplyr)
NewData <- OldData %>% select(someVariableIwant,starts_with(c("StartOfVar1","apple")))

NewData will now contain the variable someVariableIwant from OldData as well as any in OldData that start with the characters StartOfVar1 or apple.

If you don’t want someVariableIwant to be included, you can just do this:

NewData <- OldData %>% select(starts_with(c("StartOfVar1","apple")))

15.2.34 Select numeric variables

Create a copy of olddata that contains only its numeric variables, and save it as newdata:

library(dplyr)
newdata <- olddata %>% select_if(is.numeric)

15.2.35 Select variables from another data set

mtcars.temp <- mtcars[c("mpg","cyl")]

mtcars.copy1 <- mtcars[names(mtcars.temp)]

Above, mtcars.copy1 contains the same variables that are in mtcars.temp, taken from mtcars.

15.2.36 Converting categorical and numeric data

The following code converts a numeric variable to a categorical one:

DataSet$NewVariable <- as.factor(DataSet$OldVariable)

The following code converts a factor (categorical) variable to a numeric one:

DataSet$NewVariable <- as.numeric(as.character(DataSet$OldVariable))
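The as.character() step matters because a factor is stored internally as integer codes. A minimal sketch with a hypothetical factor (made up for illustration) shows the difference:

```r
# Hypothetical factor whose labels are numbers stored as text
f <- factor(c("10", "20", "10", "30"))

# Converting directly returns the internal level codes, not the labels
codes <- as.numeric(f)

# Converting to character first recovers the original numbers
values <- as.numeric(as.character(f))

codes
values
```

Here codes contains 1, 2, 1, 3, while values contains 10, 20, 10, 30.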

Check what type of data is in each variable:

class(DataSet$OldVariable)
class(DataSet$NewVariable)

Here is how we can convert a numeric variable to a categorical variable and then relabel the values:

mtcarscopy <- mtcars
mtcarscopy$amfactor <- as.factor(mtcarscopy$am)

library(plyr)
mtcarscopy$amlabeled <- revalue(mtcarscopy$amfactor, c("0"="automatic", "1"="manual"))

head(mtcarscopy[c("am", "amfactor", "amlabeled")], n=10)
##                   am amfactor amlabeled
## Mazda RX4          1        1    manual
## Mazda RX4 Wag      1        1    manual
## Datsun 710         1        1    manual
## Hornet 4 Drive     0        0 automatic
## Hornet Sportabout  0        0 automatic
## Valiant            0        0 automatic
## Duster 360         0        0 automatic
## Merc 240D          0        0 automatic
## Merc 230           0        0 automatic
## Merc 280           0        0 automatic

15.2.37 Converting Likert responses to numeric and summing up totals

Initial data, called d.original:

d.original <- data.frame(
  name = c("Abby","Beeta","Chock"),
  Q1 = c("2 - somewhat bad","3 - neutral","5 - very good"),
  Q2 = c("4 - somewhat good","1 - very bad","3 - neutral"),
  Q3 = c("2 - somewhat bad","5 - very good","2 - somewhat bad")
                )

Make likertFix function to handle data like this:

likertFix <- function(df, questionPrefix="", variablePrefix=""){
  
  # df is the dataset you're starting with
  # questionPrefix is the characters in quotation marks that label relevant columns for conversion to numbers and totaling up
  # variablePrefix is a short word in quotation marks that you want to label the columns being fixed and added
  
  df.q <- df %>% select(starts_with(questionPrefix))
  df.q[] <- lapply(df.q, function(x) substring(x,1,1))
  df.q[] <- lapply(df.q, function(x) as.numeric(x))
  df.q$total <- rowSums(df.q)
  colnames(df.q) <- paste(variablePrefix, colnames(df.q), sep = ".")
  df <- cbind(df,df.q)
  df <- df %>% select(-starts_with(questionPrefix))
  return(df)
}

Run the likertFix function on the initial data and save the result:

dfixed <- likertFix(df = d.original, questionPrefix = "Q", variablePrefix = "Pretest")
dfixed
##    name Pretest.Q1 Pretest.Q2 Pretest.Q3 Pretest.total
## 1  Abby          2          4          2             8
## 2 Beeta          3          1          5             9
## 3 Chock          5          3          2            10

15.2.38 Assign reference group or reference category in factor variables

Generic code:

mydata$variableToFix <- relevel(as.factor(mydata$variableToFix), ref = "Label of reference group")

Example in which we want to use the cyl variable in mtcars (which we’ll re-save as d) as a factor variable with 6 as the reference group:

d<-mtcars
d$newCyl <- relevel(as.factor(d$cyl), ref = "6")

summary(lm(mpg~newCyl,d))
## 
## Call:
## lm(formula = mpg ~ newCyl, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2636 -1.8357  0.0286  1.3893  7.2364 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.743      1.218  16.206 4.49e-16 ***
## newCyl4        6.921      1.558   4.441 0.000119 ***
## newCyl8       -4.643      1.492  -3.112 0.004152 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared:  0.7325, Adjusted R-squared:  0.714 
## F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09

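As we can see above, there is no coefficient for cars with 6 cylinders because they are the reference group. The intercept (19.743) is the estimated mean mpg for 6-cylinder cars, and the newCyl4 and newCyl8 coefficients are differences from that reference group.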

15.2.39 Assign values in dataframe based on matched criteria

Below, set mpg equal to 200 for all observations for which am = 0, gear = 3, and carb = 2.

d <- mtcars

d[which(d$am==0 & d$gear==3 & d$carb==2),]$mpg <- 200

d

15.2.40 Save table as dataframe

tableAMxCYL <- with(mtcars, table(am, cyl, useNA = 'ifany'))
dfAMxCYL <- as.data.frame.matrix(tableAMxCYL)
dfAMxCYL

Add am column as variable:

dfAMxCYL$am <- rownames(dfAMxCYL)
dfAMxCYL

15.2.41 Convert all columns in data frame to numeric

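# Assigning into dNew[] keeps the result as a data frame
# (sapply() on its own would return a matrix)
dNew <- dOriginal
dNew[] <- lapply(dOriginal, as.numeric)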

Read more:

15.2.42 Make a copy of an object

Make a copy of df called dfcopy:

dfcopy <- df

This can be done for any object like a dataset, stored value, regression object, and so on.

15.2.43 Remove missing data

Remove rows in dataset df that contain any missing data:

df <- na.omit(df)

15.2.44 Reverse code a numeric variable

If you have a variable like LevelOfInterest in the dataset d below…

d <- data.frame(Name = c("Aweo","Been","Cida","Deleek","Erga"), LevelOfInterest = c(3,1,2,3,2))

d
##     Name LevelOfInterest
## 1   Aweo               3
## 2   Been               1
## 3   Cida               2
## 4 Deleek               3
## 5   Erga               2

…you might want to change it so that:

  • 3 becomes 1
  • 2 remains 2
  • 1 becomes 3

This might help you do that:

d$LevelOfInterest.reverse <- (max(d$LevelOfInterest, na.rm = TRUE)+1)-d$LevelOfInterest

Use a two-way table to see if this worked:

with(d, table(LevelOfInterest, LevelOfInterest.reverse))
##                LevelOfInterest.reverse
## LevelOfInterest 1 2 3
##               1 0 0 1
##               2 0 2 0
##               3 2 0 0

This code should work even for a variable with more or fewer than three levels, as long as the lowest level is 1 and the highest level appears at least once in the data.
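For scales that do not start at 1, a more general version subtracts each value from the sum of the scale’s lowest and highest possible values. The sketch below assumes a hypothetical item scored from 0 to 4; the names scaleMin and scaleMax are made up for illustration. Typing in the scale’s endpoints, rather than taking the maximum from the data, also covers the case where nobody chose the highest value.

```r
# Hypothetical item scored from 0 to 4 rather than 1 to 3
x <- c(0, 1, 4, 2, NA, 3)

# General reverse-coding: highest possible value + lowest possible value - x
scaleMin <- 0
scaleMax <- 4
xReversed <- (scaleMax + scaleMin) - x

xReversed
```

Missing values stay missing, and each remaining value is mirrored around the middle of the scale.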

This issue is discussed more here:

15.2.45 Put row names into a variable

d <- mtcars # example data

d$carName <- rownames(d) # make a new variable called carName containing the row names of d

View(d) # check if it worked

15.2.46 Assign row names based on a variable

# example data
d <- data.frame(name = c("Kazaan","Kaalaa","Koona"), age = c(1,2,3), employed = c("no","no","no"))

rownames(d) <- d$name # change row names to be the name of each person

View(d) # check if it worked

15.2.47 Format numbers as dollar currency

Let’s pretend that we want to take the numbers in the disp variable in the mtcars data and change them to USD (United States Dollars):

# Prepare data
df <- mtcars
df$disp <- df$disp * 100

# Do the conversion
if (!require(scales)) install.packages('scales') 
library(scales)
df$disp <- dollar(df$disp)

# Inspect results
df
df$disp
View(df)

Above, I multiplied disp by 100 just to illustrate how the dollar function would work on large numbers.

15.2.48 Check if values in one variable are in another variable

entireGroup <- data.frame(name = c("Beebo","Brakaansha","Bettle","Bo","Erl"), age = c(23,45,93,23,4))
signedUpList <- data.frame(writtenName = c("Bettle","Bo"), profession = c("Sword swallower swallower", "Anti snake charming activist"))

entireGroup$signedUp <- ifelse(entireGroup$name %in% signedUpList$writtenName, 1,0)

entireGroup
##         name age signedUp
## 1      Beebo  23        0
## 2 Brakaansha  45        0
## 3     Bettle  93        1
## 4         Bo  23        1
## 5        Erl   4        0

15.2.49 Unique identifiers

Make a unique identification (ID) number for observations or groups of observations.

15.2.49.1 Simple ID number by row

Add a new variable with the row number of each observation:

YourDataFrame$IDnum <- seq_len(nrow(YourDataFrame))

15.2.49.2 More complicated ID numbers

Sample data to practice with:

d <- data.frame(name = c("Aaaaaaron","Beela","Cononan","Duh","Eeena","Beela","Eeena","Beela"), age = c(1,2,3,4,1,2,1,2), occupation = c("hunter","vegan chef","plumber","plumbing destroyer","omnivore chef","vegan chef","omnivore chef","vegan chef"), day = c(1,1,1,1,15,15,15,15), month = c("January","February","March","April","January","February","March","April"), year = rep(2020,8), result = seq(1,8))

d
##        name age         occupation day    month year result
## 1 Aaaaaaron   1             hunter   1  January 2020      1
## 2     Beela   2         vegan chef   1 February 2020      2
## 3   Cononan   3            plumber   1    March 2020      3
## 4       Duh   4 plumbing destroyer   1    April 2020      4
## 5     Eeena   1      omnivore chef  15  January 2020      5
## 6     Beela   2         vegan chef  15 February 2020      6
## 7     Eeena   1      omnivore chef  15    March 2020      7
## 8     Beela   2         vegan chef  15    April 2020      8

Generate variable ID in dataset d containing a unique identification number for each person:

if (!require(udpipe)) install.packages('udpipe') 
library(udpipe)

d$ID <- unique_identifier(d, c("name"))

d
##        name age         occupation day    month year result ID
## 1 Aaaaaaron   1             hunter   1  January 2020      1  1
## 2     Beela   2         vegan chef   1 February 2020      2  2
## 3   Cononan   3            plumber   1    March 2020      3  3
## 4       Duh   4 plumbing destroyer   1    April 2020      4  4
## 5     Eeena   1      omnivore chef  15  January 2020      5  5
## 6     Beela   2         vegan chef  15 February 2020      6  2
## 7     Eeena   1      omnivore chef  15    March 2020      7  5
## 8     Beela   2         vegan chef  15    April 2020      8  2

More sample data for practice:

d2 <- data.frame(name = c("Aaaaaaron","Beela","Cononan","Duh","Eeena","Fewe","Graam","Hiol"), number = c(1,1,1,1,0,0,0,0), color = c("green","brown","green","brown","green","brown","green","brown"))

d2
##        name number color
## 1 Aaaaaaron      1 green
## 2     Beela      1 brown
## 3   Cononan      1 green
## 4       Duh      1 brown
## 5     Eeena      0 green
## 6      Fewe      0 brown
## 7     Graam      0 green
## 8      Hiol      0 brown

Generate variable group in dataset d2 containing a unique identification number for each number-color pair:

if (!require(udpipe)) install.packages('udpipe') 
library(udpipe)

d2$group <- unique_identifier(d2, c("number","color"))

d2
##        name number color group
## 1 Aaaaaaron      1 green     4
## 2     Beela      1 brown     3
## 3   Cononan      1 green     4
## 4       Duh      1 brown     3
## 5     Eeena      0 green     2
## 6      Fewe      0 brown     1
## 7     Graam      0 green     2
## 8      Hiol      0 brown     1
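
If you do not want to depend on the udpipe package, something similar can probably be done in base R with match. This is only a sketch: d3 is a made-up dataset, and the ID numbers are assigned in order of first appearance, so they will not necessarily agree with the numbering that unique_identifier produces.

```r
# Made-up sample data, for illustration only
d3 <- data.frame(
  name = c("Aaaaaaron", "Beela", "Cononan", "Beela"),
  number = c(1, 1, 0, 1),
  color = c("green", "brown", "green", "brown")
)

# One ID per person: match() returns the position of each name
# in the list of unique names
d3$ID <- match(d3$name, unique(d3$name))

# One ID per number-color pair: build a combined key, then number the keys
key <- paste(d3$number, d3$color)
d3$group <- match(key, unique(key))

d3
```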

15.2.50 Within-group count of observations and group size, within-group ID number

If you have multiple observations in one group or for one person and you need to count each observation’s number within each group or person, the code below should help.

# Install (if needed) and load the required package
if (!require(dplyr)) install.packages('dplyr')
library(dplyr)

# Create an example dataframe
df <- data.frame(
  person = c("John", "John", "Mary", "Mary", "Mary", "Peter"),
  age = c(25, 30, 35, 40, 45, 50)
)

# Create a new variable that numbers each row within each person (1, 2, 3, ...)
df <- df %>%
  dplyr::group_by(person) %>%
  dplyr::mutate(row_count = row_number()) %>%
  ungroup()

# Print the modified dataframe
print(df)

The code and comments above were generated by ChatGPT on May 22 2023, with minor modifications by Anshul.

If you want to generate a group size variable instead, this might help:

# Install and load the required package
library(dplyr)

# Create an example dataframe
df <- data.frame(
  person = c("John", "John", "Mary", "Mary", "Mary", "Peter"),
  age = c(25, 30, 35, 40, 45, 50)
)

# Create a new variable with the count of rows for each person
df <- df %>%
  dplyr::group_by(person) %>%
  dplyr::mutate(row_count = n()) %>%
  ungroup()

# Print the modified dataframe
print(df)

The code and comments above were generated by ChatGPT on May 22 2023.
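
Both variables can also be made without dplyr. The sketch below uses the base R function ave. The variable names row_in_group and group_size are my own choices for illustration.

```r
# Example data frame (same as above)
df <- data.frame(
  person = c("John", "John", "Mary", "Mary", "Mary", "Peter"),
  age = c(25, 30, 35, 40, 45, 50)
)

# ave() applies a function separately within each person and
# returns the results in the original row order
df$row_in_group <- ave(seq_along(df$person), df$person, FUN = seq_along)
df$group_size   <- ave(seq_along(df$person), df$person, FUN = length)

df
```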

15.2.51 Make new character variable or summary report based on other variables

Example data:

d <- data.frame(
  flavor = c("chocolate","vanilla","strawberry","other"),
  numberEaten = c(1,2,3,4)
)
d$joinedVariable <- paste0("Flavor and number: ",as.character(d$flavor)," ", as.character(d$numberEaten))
d
##       flavor numberEaten                  joinedVariable
## 1  chocolate           1  Flavor and number: chocolate 1
## 2    vanilla           2    Flavor and number: vanilla 2
## 3 strawberry           3 Flavor and number: strawberry 3
## 4      other           4      Flavor and number: other 4

Source: Solution by user sandeep. Concatenating two string variables in r. https://stackoverflow.com/questions/26321702/concatenating-two-string-variables-in-r.

keywords: concatenate, join

15.2.52 Converting time variables in R

You might have data in which there are time stamps which you need to convert into a continuous variable that you can put into a regression. For example, maybe you have data in which each row (observation) is a patient and then you have a variable (column) for the date and time on which the patient came into the hospital. To analyze this data as a continuous variable, maybe you want to calculate how many seconds after midnight in each day a patient came in.

How can we convert a time to a number of seconds? There are helper functions in R that help us do this. Let’s start with an example:

if (!require(lubridate)) install.packages('lubridate') 
library(lubridate) # this package has the period_to_seconds function

# example of what the data looks like
ExampleTimestamp <- "01-01-2019 09:04:58"

# extract just the time (remove the date)
(ExampleTimeOnly <- substr(ExampleTimestamp,12,19))
## [1] "09:04:58"
# convert from time to number of seconds
(TotalSeconds <- period_to_seconds(hms(ExampleTimeOnly)))
## [1] 32698

As you can see above, the time “09:04:58” was converted into 32698 seconds. But we only did it for a single stored value, ExampleTimestamp. How do we do it for an entire variable in a dataset? Let’s say you have a dataset called d with a time stamp variable called timestamp, and you want to make a new variable (column) in the dataset called seconds. Here’s how you can do it:

if (!require(lubridate)) install.packages('lubridate') 
library(lubridate)

d$seconds <- period_to_seconds(hms(substr(d$timestamp,12,19)))
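
The substr approach above depends on the time always sitting in characters 12 through 19. If that is not guaranteed, one alternative is to parse the whole time stamp with an explicit format in base R. This is a sketch only: the stamps vector is made up, and the day-month-year order in the format string is my assumption about how the data is written.

```r
# Example time stamps; the day-month-year order is an assumption for this sketch
stamps <- c("01-01-2019 09:04:58", "15-03-2019 23:59:59")

# Parse the text into date-time objects using an explicit format
parsed <- as.POSIXlt(stamps, format = "%d-%m-%Y %H:%M:%S", tz = "UTC")

# Seconds since midnight = hours * 3600 + minutes * 60 + seconds
seconds <- parsed$hour * 3600 + parsed$min * 60 + parsed$sec

seconds
```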

15.2.53 Making and modifying lists

This is not complete.

l <- list() # make empty list
l <- append(l, "bob") # add something to the list
one <- 1
two <- c(1,2,3,4)
three <- list("byron","anshul")
four <- "bob"
t <- list(one, two, three) # make a list

t <- append(t, four) # add something to the list
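
A few more list operations that tend to come up are sketched below, using a made-up named list called items. One thing to watch for: assigning with $ keeps a nested list as a single item, whereas append with a list as its second argument adds each of that list's elements separately.

```r
# A named list can hold items of different types
items <- list(one = 1, two = c(1, 2, 3, 4))

# Add a named item; the nested list stays together as one item
items$three <- list("byron", "anshul")

items[["two"]]  # get one item by name
items[[2]]      # get the same item by position
length(items)   # number of items in the list

# Remove an item by assigning NULL to it
items[["one"]] <- NULL
names(items)
```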

15.2.54 Categorize variable into quantiles

Make a new variable called mpg.quintile that identifies each observation’s quintile for the variable mpg:

d <- mtcars
d$mpg[4] <- NA # create missing data for illustration only

library(dplyr)
d$mpg.quintile <- ntile(d$mpg, 5)

Check if it worked:

table(d$mpg.quintile, useNA = "always")
class(d$mpg.quintile)

As an option, recode the new variable as a factor and label NA values as “missing”, so that R no longer treats them as NA (those observations won’t get dropped from an analysis):

d$mpg.quintile.fac <- factor(ifelse(is.na(d$mpg.quintile), "missing", as.character(d$mpg.quintile)))
table(d$mpg.quintile.fac, useNA = "always")
class(d$mpg.quintile.fac)
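
ntile splits the observations into groups of (nearly) equal size, which means that tied values can land in different groups. If you would rather group by cutoff values, so that equal values always share a group, a base R sketch with quantile and cut is below. The variable name mpg.quintile.cut is made up, and cut will stop with an error if two cutoffs happen to be identical, which is not the case for mpg here.

```r
d <- mtcars
d$mpg[4] <- NA # create missing data for illustration only

# Cutoff values at the 0th, 20th, 40th, 60th, 80th, and 100th percentiles of mpg
cutoffs <- quantile(d$mpg, probs = seq(0, 1, 0.2), na.rm = TRUE)

# Assign each observation to the interval that its mpg falls in
d$mpg.quintile.cut <- cut(d$mpg, breaks = cutoffs, labels = 1:5, include.lowest = TRUE)

table(d$mpg.quintile.cut, useNA = "always")
```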

15.2.55 Combine similar levels into groups

This is incomplete. See this reference:

15.2.56 Merging and joining datasets together

left <- data.frame(name=c("A. Onlyleft","B. Onlyleft","C. Both","D. Both", NA), skill=c("pig latin","latin","pig farming","pig surgery","pig liberating"))
left
##          name          skill
## 1 A. Onlyleft      pig latin
## 2 B. Onlyleft          latin
## 3     C. Both    pig farming
## 4     D. Both    pig surgery
## 5        <NA> pig liberating
right <- data.frame(name = c("E. Onlyright","F. Onlyright","C. Both","D. Both", NA), skill=c("reading","writing","speling","boating","gloating"))
right
##           name    skill
## 1 E. Onlyright  reading
## 2 F. Onlyright  writing
## 3      C. Both  speling
## 4      D. Both  boating
## 5         <NA> gloating
leftjoined <- dplyr::left_join(left, right, by="name")
leftjoined
##          name        skill.x  skill.y
## 1 A. Onlyleft      pig latin     <NA>
## 2 B. Onlyleft          latin     <NA>
## 3     C. Both    pig farming  speling
## 4     D. Both    pig surgery  boating
## 5        <NA> pig liberating gloating
rightjoined <- dplyr::right_join(left, right, by="name")
rightjoined
##           name        skill.x  skill.y
## 1      C. Both    pig farming  speling
## 2      D. Both    pig surgery  boating
## 3         <NA> pig liberating gloating
## 4 E. Onlyright           <NA>  reading
## 5 F. Onlyright           <NA>  writing
fulljoined <- dplyr::full_join(left, right, by="name")
fulljoined
##           name        skill.x  skill.y
## 1  A. Onlyleft      pig latin     <NA>
## 2  B. Onlyleft          latin     <NA>
## 3      C. Both    pig farming  speling
## 4      D. Both    pig surgery  boating
## 5         <NA> pig liberating gloating
## 6 E. Onlyright           <NA>  reading
## 7 F. Onlyright           <NA>  writing
merged <- merge(left, right, by="name", all = TRUE, incomparables = NA) 
# all.x or all.y also possible to do only left or right joins
# change to all=F to only include observations that match
merged
##           name        skill.x  skill.y
## 1  A. Onlyleft      pig latin     <NA>
## 2  B. Onlyleft          latin     <NA>
## 3      C. Both    pig farming  speling
## 4      D. Both    pig surgery  boating
## 5 E. Onlyright           <NA>  reading
## 6 F. Onlyright           <NA>  writing
## 7         <NA> pig liberating     <NA>
## 8         <NA>           <NA> gloating
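
Notice that the dplyr joins above treat the two missing names as a match (7 rows in the full join), while merge with incomparables = NA keeps them apart (8 rows). If you want a dplyr join to keep them apart too, the na_matches argument seems to be the relevant setting. A small sketch with made-up data frames a and b:

```r
library(dplyr)

# Made-up data, each with one missing name
a <- data.frame(name = c("C. Both", NA), skill = c("pig farming", "pig liberating"))
b <- data.frame(name = c("C. Both", NA), skill = c("speling", "gloating"))

# Default: the NA in a is matched to the NA in b
joined.default <- full_join(a, b, by = "name")
nrow(joined.default)

# na_matches = "never": rows with a missing name are never matched to each other
joined.never <- full_join(a, b, by = "name", na_matches = "never")
nrow(joined.never)
```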

What if we wanted to make a new variable called dataSource which identifies which dataset each observation came from? I’m not sure of the very best way to do it, but here’s one way that seems to work:

left$inLeft <- 1
right$inRight <- 1

merged2 <- merge(left, right, by="name", all = TRUE, incomparables = NA)

merged2$dataSource <- NA
merged2$dataSource[merged2$inLeft==1 & merged2$inRight==1] <- "Present in both left and right"
merged2$dataSource[merged2$inLeft==1 & is.na(merged2$inRight)] <- "Present in left only"
merged2$dataSource[is.na(merged2$inLeft) & merged2$inRight==1] <- "Present in right only"

merged2
##           name        skill.x inLeft  skill.y inRight
## 1  A. Onlyleft      pig latin      1     <NA>      NA
## 2  B. Onlyleft          latin      1     <NA>      NA
## 3      C. Both    pig farming      1  speling       1
## 4      D. Both    pig surgery      1  boating       1
## 5 E. Onlyright           <NA>     NA  reading       1
## 6 F. Onlyright           <NA>     NA  writing       1
## 7         <NA> pig liberating      1     <NA>      NA
## 8         <NA>           <NA>     NA gloating       1
##                       dataSource
## 1           Present in left only
## 2           Present in left only
## 3 Present in both left and right
## 4 Present in both left and right
## 5          Present in right only
## 6          Present in right only
## 7           Present in left only
## 8          Present in right only

15.3 Descriptive statistics

15.3.1 Calculate mode of a variable

Run this code to create a new function:

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

Use the function to get the mode:

getmode(mtcars$cyl)
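
One limitation of getmode is that when two or more values are tied for most frequent, it returns only the first one it encounters. A possible variation that returns all tied values is sketched below. The name getmodes is made up, and note that it returns the values as text (character) rather than as numbers.

```r
# Return every value that is tied for most frequent
getmodes <- function(v) {
  counts <- table(v)                    # frequency of each unique value
  names(counts)[counts == max(counts)]  # values with the highest frequency
}

getmodes(c(1, 1, 2, 2, 3)) # two values are tied
getmodes(mtcars$cyl)       # a single mode
```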

15.3.2 Identify number of unique values in a variable

List all unique values:

unique(mtcars$cyl)
## [1] 6 4 8

Calculate number of unique values:

length(unique(mtcars$cyl))
## [1] 3

15.3.3 Compare elements of two vectors

Let’s say we have two vectors called first and second and we want to compare the elements in each vector. Here are the vectors:

first <- c("a","b","c","d")
second <- c("c","d","e","f")

This is how we can see which elements are common to the two vectors:

intersect(first, second)
## [1] "c" "d"

Here are the elements that are only in first and not in second:

setdiff(first,second)
## [1] "a" "b"

And here are the elements that are only in second and not in first:

setdiff(second,first)
## [1] "e" "f"
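
A few related base R functions can round out the comparison. This is a short sketch using the same two vectors:

```r
first <- c("a","b","c","d")
second <- c("c","d","e","f")

# All elements that appear in either vector, without duplicates
union(first, second)

# For each element of first, is it also in second?
first %in% second

# Do the two vectors contain exactly the same set of elements?
setequal(first, second)
```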

15.3.4 Basic descriptive table

There are many ways to make a concise descriptive statistics table. A few options are shown below.

15.3.4.1 Row and column totals

Make some data:

d <- data.frame(variableWeCareAbout = c("red","black","blue","blue","red"))

d$Frequency <- "Frequency"

d
##   variableWeCareAbout Frequency
## 1                 red Frequency
## 2               black Frequency
## 3                blue Frequency
## 4                blue Frequency
## 5                 red Frequency

Make a table with a total count:

addmargins(table(d$variableWeCareAbout, d$Frequency),1, FUN = list(TOTAL=sum) )
##        
##         Frequency
##   black         1
##   blue          2
##   red           2
##   TOTAL         5

Above, we trick R into displaying the table vertically instead of its default horizontal format, by adding a new variable called Frequency that has only one level. I’m not sure if there’s a better way to do this, but I like the way this comes out looking.

In the addmargins function, change the 1 to a 2 to get row totals instead of column totals.
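
Another way to get a vertical table, without adding the extra Frequency variable, might be to convert the table into a data frame and add the total as a new row. This is only a sketch; the column name color and the object name counts are made up.

```r
d <- data.frame(variableWeCareAbout = c("red","black","blue","blue","red"))

# Converting a table to a data frame puts one level on each row
counts <- as.data.frame(table(color = d$variableWeCareAbout),
                        responseName = "Frequency",
                        stringsAsFactors = FALSE)

# Add a TOTAL row at the bottom
counts <- rbind(counts, data.frame(color = "TOTAL", Frequency = sum(counts$Frequency)))

counts
```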

15.3.4.2 table1 function

This option works well with HTML outputs but not PDF or Word, as far as I know.

Basic table:

if(!require(table1)) install.packages("table1")
library(table1)

table1(~ mpg + hp + as.factor(am), data=mtcars, topclass="Rtable1-zebra")

Grouped table:

if(!require(table1)) install.packages("table1")
library(table1)

table1(~ mpg + hp + as.factor(am) | cyl, data=mtcars, topclass="Rtable1-zebra")

More information:

15.3.4.3 pastecs function

This option works better with PDF and Word documents, as far as I know.

Prepare data:

mtcars.partial <- mtcars[c("mpg","cyl","am","hp")]

The code below looks like a lot of lines, but you should be able to just copy and paste it without making any changes, other than replacing mtcars.partial with the name of your own dataset.

Default descriptive statistics table from the pastecs package:

if(!require(pastecs)) install.packages("pastecs")
library(pastecs)

descriptivetable <- stat.desc(mtcars.partial)

descriptivetable
##                      mpg         cyl          am           hp
## nbr.val       32.0000000  32.0000000 32.00000000   32.0000000
## nbr.null       0.0000000   0.0000000 19.00000000    0.0000000
## nbr.na         0.0000000   0.0000000  0.00000000    0.0000000
## min           10.4000000   4.0000000  0.00000000   52.0000000
## max           33.9000000   8.0000000  1.00000000  335.0000000
## range         23.5000000   4.0000000  1.00000000  283.0000000
## sum          642.9000000 198.0000000 13.00000000 4694.0000000
## median        19.2000000   6.0000000  0.00000000  123.0000000
## mean          20.0906250   6.1875000  0.40625000  146.6875000
## SE.mean        1.0654240   0.3157093  0.08820997   12.1203173
## CI.mean.0.95   2.1729465   0.6438934  0.17990541   24.7195501
## var           36.3241028   3.1895161  0.24899194 4700.8669355
## std.dev        6.0269481   1.7859216  0.49899092   68.5628685
## coef.var       0.2999881   0.2886338  1.22828533    0.4674077

Export the table so that you can open it in a separate spreadsheet:

write.csv(descriptivetable, "mtcars partial descriptive table 1.csv")

The table above might be all you need. If you want to further customize the table, keep reading below.

Transpose the table so that variables are in rows:

descriptivetable.transpose <- as.data.frame(t(as.matrix(descriptivetable)))

descriptivetable.transpose
##     nbr.val nbr.null nbr.na  min   max range    sum median      mean
## mpg      32        0      0 10.4  33.9  23.5  642.9   19.2  20.09062
## cyl      32        0      0  4.0   8.0   4.0  198.0    6.0   6.18750
## am       32       19      0  0.0   1.0   1.0   13.0    0.0   0.40625
## hp       32        0      0 52.0 335.0 283.0 4694.0  123.0 146.68750
##         SE.mean CI.mean.0.95          var    std.dev  coef.var
## mpg  1.06542396    2.1729465   36.3241028  6.0269481 0.2999881
## cyl  0.31570933    0.6438934    3.1895161  1.7859216 0.2886338
## am   0.08820997    0.1799054    0.2489919  0.4989909 1.2282853
## hp  12.12031731   24.7195501 4700.8669355 68.5628685 0.4674077

Select and re-order specific items from the transposed table and only display those, also rounding certain statistics:

descriptivetable.abbreviated <- data.frame(N = descriptivetable.transpose$nbr.val, Mean = round(descriptivetable.transpose$mean,1), SD = round(descriptivetable.transpose$std.dev,1), Missing = descriptivetable.transpose$nbr.na)

row.names(descriptivetable.abbreviated) <- row.names(descriptivetable.transpose)

descriptivetable.abbreviated
##      N  Mean   SD Missing
## mpg 32  20.1  6.0       0
## cyl 32   6.2  1.8       0
## am  32   0.4  0.5       0
## hp  32 146.7 68.6       0

Make it look nicer if you are using R Markdown (this should come out well in PDF and Word outputs, in addition to HTML):

library(knitr)
knitr::kable(descriptivetable.abbreviated)
      N   Mean    SD  Missing
mpg  32   20.1   6.0        0
cyl  32    6.2   1.8        0
am   32    0.4   0.5        0
hp   32  146.7  68.6        0

Make it look even nicer in R Markdown:

if (!require(DT)) install.packages('DT') 
library(DT)

DT::datatable(descriptivetable.abbreviated)

Export the customized table so that you can open it in a separate spreadsheet:

write.csv(descriptivetable.abbreviated, "mtcars partial descriptive table 2.csv")

15.3.5 Make one-way table for each variable in data

For a dataset called d:

lapply(d, table) # returns a list with one frequency table per variable

15.3.6 Missing values in two-way tables

with(mtcars, table(am,vs, useNA = 'always'))
##       vs
## am      0  1 <NA>
##   0    12  7    0
##   1     6  7    0
##   <NA>  0  0    0

15.3.7 Adding totals to tables

There is a section called Sum totals in tables in Chapter 1 which could be useful. Another method is here:

# Prepare data
df <- data.frame(Student = c("Aannaa","Beena","Chep"), Major = c("Fishing","Architecture","Soup"), Credits = c(2,3,1), Blocks = c(33,14,2))
View(df)

# Remove variables that can't be added, temporarily
tempFactorsDF <- df[c("Student","Major")]
library(dplyr)
df <- select(df, -c(Student, Major))

# Add TOTAL label, if desired
tempFactorsDF <- rbind(tempFactorsDF,c("TOTAL",""))

# Calculate totals for numeric variables
df <- rbind(df,colSums(df))

# Join everything back together
df <- cbind(tempFactorsDF,df)

# Inspect results
df

# Display results as a formatted table
knitr::kable(df)

15.3.8 Two-way table with formatting

Simplest version:

if (!require(expss)) install.packages('expss')
library(expss)

cro(mtcars$am, mtcars$vs)
                          mtcars$vs
                           0     1
mtcars$am  0              12     7
           1               6     7
           #Total cases   18    14
With only variables as row and column labels:

if (!require(expss)) install.packages('expss')
library(expss)

with(mtcars, cro(am, vs))
                   vs
                    0     1
am  0              12     7
    1               6     7
    #Total cases   18    14

Without column totals:

if (!require(expss)) install.packages('expss')
library(expss)

cro(mtcars$am, mtcars$vs, total_row_position = "none")
               mtcars$vs
                0     1
mtcars$am  0   12     7
           1    6     7

Change the row and column labels:

if (!require(expss)) install.packages('expss')
library(expss)

`Transmission` <- mtcars$am
`Engine` <- mtcars$vs

cro(`Transmission`, `Engine`)
                             Engine
                              0     1
Transmission  0              12     7
              1               6     7
              #Total cases   18    14

Here is another option, using the kable function from the very popular and handy knitr package:

if (!require(knitr)) install.packages('knitr')
library(knitr)

knitr::kable(with(mtcars, table(cyl,am)))
      0    1
4     3    8
6     4    3
8    12    2

15.3.9 Grouped descriptive statistics

15.3.9.1 Grouped descriptive statistics with multiple nested levels

library(dplyr) # provides the %>% pipe used below

dplyr::group_by(myDataframe, groupingVariable1, groupingVariable2) %>% 
  dplyr::summarise(
    Count = n(),
    Mean = mean(anotherVariable)
  )
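
The template above uses placeholder names. A runnable version with the built-in mtcars data might look like the sketch below, where cyl and am are the grouping variables and mpg is the variable being summarized. The .groups = "drop" argument is my addition so that the result is returned ungrouped.

```r
library(dplyr)

# Count and mean mpg for every combination of cyl and am
grouped <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(
    Count = n(),
    Mean = mean(mpg),
    .groups = "drop" # return an ungrouped result
  )

grouped
```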

15.3.9.2 Descriptive table with grouped means

Below, we make a new dataset or a descriptive table in which rows are created based on means of all variables for selected groups.

Prepare data:

d <- data.frame(
  name = c("Aaron","Baron","Caron","Daron","Earon","Faron","Aaron","Baron","Caron","Daron","Earon","Faron"),
  performance1 = c(1,7,3,8,3,7,3,7,3,6,8,3),
  performance2 = 1:12 + 30,
  group = c("A","A","A","B","B","B","A","A","A","B","B","B"),
  time = c(rep(1,6),rep(2,6))
  )

d
##     name performance1 performance2 group time
## 1  Aaron            1           31     A    1
## 2  Baron            7           32     A    1
## 3  Caron            3           33     A    1
## 4  Daron            8           34     B    1
## 5  Earon            3           35     B    1
## 6  Faron            7           36     B    1
## 7  Aaron            3           37     A    2
## 8  Baron            7           38     A    2
## 9  Caron            3           39     A    2
## 10 Daron            6           40     B    2
## 11 Earon            8           41     B    2
## 12 Faron            3           42     B    2

Calculate the mean of all numeric variables for each group-time pair. The name variable is text, so we leave it out of the data that we give to aggregate; if it is left in, recent versions of R return NA for every mean (with a warning), because the mean of text cannot be calculated:

GroupMeanData <- aggregate(. ~ group + time, d[, names(d) != "name"], mean)

GroupMeanData
##   group time performance1 performance2
## 1     A    1     3.666667           32
## 2     B    1     6.000000           35
## 3     A    2     4.333333           38
## 4     B    2     5.666667           41

Above, group and time are the grouping variables according to which all observations will be aggregated. Any number of grouping variables can be placed after the ~.

15.3.10 Principal Component Analysis (PCA)

Below are what I think are the most important steps.

Load the data:

d <- mtcars

Run PCA on selected variables (after they have been scaled and centered):

pca1 <- prcomp(d[c("mpg","cyl","am","gear","wt","disp")], scale. = TRUE, center = TRUE)

Inspect the components:

summary(pca1)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6
## Standard deviation     2.1243 0.9373 0.51315 0.40214 0.3625 0.22820
## Proportion of Variance 0.7521 0.1464 0.04389 0.02695 0.0219 0.00868
## Cumulative Proportion  0.7521 0.8986 0.94247 0.96942 0.9913 1.00000

Inspect details:

print(pca1)
## Standard deviations (1, .., p=6):
## [1] 2.1243489 0.9373486 0.5131459 0.4021381 0.3625038 0.2282031
## 
## Rotation (n x k) = (6 x 6):
##             PC1        PC2        PC3        PC4         PC5         PC6
## mpg  -0.4265888  0.2890011  0.3004707 -0.3396689  0.66803456 -0.28735448
## cyl   0.4178524 -0.3418832  0.4624782  0.4577860  0.23280466 -0.48048222
## am   -0.3732542 -0.5494741  0.4638622 -0.3872494 -0.43492772 -0.06683618
## gear -0.3437230 -0.6479430 -0.5246747  0.1769507  0.39389769  0.01654900
## wt    0.4399693 -0.1132439 -0.3916253 -0.6143673 -0.02650202 -0.51194039
## disp  0.4386162 -0.2549866  0.2280552 -0.3410810  0.39308008  0.64787370

Add components as columns to original dataset:

d <- cbind(d, pca1$x)
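
The Proportion of Variance and Cumulative Proportion rows in the summary output can also be calculated by hand from the standard deviations, which may help if you want to use those numbers in a plot or table. A sketch, with object names of my own choosing:

```r
d <- mtcars
pca1 <- prcomp(d[c("mpg","cyl","am","gear","wt","disp")], scale. = TRUE, center = TRUE)

# Variance of each component is its standard deviation squared
component.variance <- pca1$sdev^2

# Share of total variance explained by each component, and the running total
prop.variance <- component.variance / sum(component.variance)
cumulative.variance <- cumsum(prop.variance)

round(prop.variance, 4)
round(cumulative.variance, 4)
```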

15.4 Visualization

15.4.1 Other visualization options

See these resources:

15.4.2 Pie chart

Regular:

pie(table(mtcars$cyl), main="Distribution of cars by number of cylinders")

3-D:

if(!require(plotrix)) install.packages("plotrix")
library(plotrix)

pie3D(table(mtcars$cyl), main="Distribution of cars by number of cylinders", explode = 0.1, labels=table(mtcars$cyl))

pie3D(table(mtcars$cyl), main="Distribution of cars by number of cylinders", explode = 0.1, labels=labels(table(mtcars$cyl))[[1]])

Source: Quick-R by datacamp. Pie Charts. https://www.statmethods.net/graphs/pie.html

15.4.3 Scatterplot with group colors

15.4.3.1 Using plot function built into R

The code below makes a scatterplot with two numeric variables (mpg plotted against wt, from the mtcars dataset), color coded according to a different variable (am):

plot(mtcars$wt, mtcars$mpg, pch=19, col=factor(mtcars$am))

legend("topright", 
       title= "am",
       legend = levels(factor(mtcars$am)), 
       pch = 19, 
       col = factor(levels(factor(mtcars$am))))

Sometimes, I find that the legend will annoyingly overlap with the plotted data. One way to fix this is to add an xlim argument to the plot command:

plot(mtcars$wt, mtcars$mpg, pch=19, col=factor(mtcars$am), xlim = c(0,6))

legend("topleft", 
       title= "am",
       legend = levels(factor(mtcars$am)), 
       pch = 19, 
       col = factor(levels(factor(mtcars$am))))

Above, the legend is now in the top-left corner, where it would get in the way of some of our plotted points. To make space for the legend and plotted points to display separately, we tell the computer to start plotting the x-axis at 0 and end at 6. Once we do this, the legend is well to the left of the plotted points.

More information about labels is available at:

15.4.3.2 Using ggplot

mpg plotted against wt in mtcars, with coloring based on am, with am treated as a continuous numeric variable.

d <- mtcars

if (!require(ggplot2)) install.packages('ggplot2')
library(ggplot2)

ggplot(d,aes(wt,mpg,colour=am))+geom_point()

mpg plotted against wt in mtcars, with coloring based on am, with am treated as a categorical variable.

# prepare data for this example only
d <- mtcars
d$ambinary <- as.factor(d$am) 

if (!require(ggplot2)) install.packages('ggplot2')
library(ggplot2)

ggplot(d,aes(wt,mpg,colour=ambinary))+geom_point()

All you need to do is make the following changes to the line of code ggplot(d,aes(wt,mpg,colour=ambinary))+geom_point():

  • change d to your own data set’s name
  • change wt to your desired X variable
  • change mpg to your desired Y variable
  • change ambinary to your desired grouping (coloring) variable

We can also add connecting lines to the plot above:

ggplot(d,aes(wt,mpg,colour=ambinary))+geom_point()+geom_line()

Style the line:

ggplot(d,aes(wt,mpg,colour=ambinary))+geom_point()+geom_line(linetype="dashed")

15.4.4 Multiple histograms on a single plot

15.4.4.1 ggplot

Sources consulted:

df <- iris

library(ggplot2) # the code below uses linewidth, which needs ggplot2 3.4.0 or later
library(dplyr)

# Freedman-Diaconis rule for the histogram bin width
bw <- 2 * IQR(df$Sepal.Width) / length(df$Sepal.Width)^(1/3)

# Median sepal width for each species, used for the dashed vertical lines
med_df <- df %>%
  group_by(Species) %>%
  summarize(median = median(Sepal.Width))

df %>%
  ggplot(aes(x=Sepal.Width, color=Species, fill=Species)) +
  geom_density(alpha=0.3, linewidth=1)+ 
  geom_histogram(alpha=0.3, position="identity", aes(y = after_stat(density)), color="black", binwidth = bw) +
  geom_vline(data = med_df, aes(xintercept = median, color = Species), linewidth=1, linetype = "dashed")+
  labs(x= "Sepal Width",
       subtitle="Sepal width distribution by species (iris data)")+
  theme(legend.position="bottom")

15.4.4.2 Base R

d <- mtcars

hgA <- hist(d[which(d$am==1),]$mpg, plot = FALSE)
hgB <- hist(d[which(d$am==0),]$mpg, plot = FALSE)

c1 <- rgb(173,216,230,max = 255, alpha = 80, names = "lt.blue")
c2 <- rgb(255,192,203, max = 255, alpha = 80, names = "lt.pink")

plot(hgA, col = c1, xlim = c(0,40), ylim = c(0,10))
plot(hgB, col = c2, xlim = c(0,40), ylim = c(0,10),  add = TRUE)

15.4.5 Line plot by groups

Example data:

d <- data.frame(
  country = c(rep("A",3),rep("B",3),rep("C",3)),
  score = c(5,2,3,3,5,2,4,3,4),
  time = c(1,2,3,1,2,3,1,2,3)
)

d
##   country score time
## 1       A     5    1
## 2       A     2    2
## 3       A     3    3
## 4       B     3    1
## 5       B     5    2
## 6       B     2    3
## 7       C     4    1
## 8       C     3    2
## 9       C     4    3
library(ggplot2)

ggplot(data=d, aes(x=time, y=score, group=country)) +
  geom_line(linetype="dashed", linewidth=1.2, aes(color=country))+ # linewidth needs ggplot2 3.4.0 or later; use size in older versions
  geom_point(size=3, aes(color=country)) + ggtitle("Country scores over time")

source: http://www.sthda.com/english/wiki/ggplot2-line-plot-quick-start-guide-r-software-and-data-visualization

keywords: line chart, line graph, line segment, scatterplot with lines

15.4.6 Side-by-side grouped boxplots

In separate plots:

library(ggplot2)
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg))+
  geom_boxplot()+
  facet_wrap(.~as.factor(am), scales = "free")+
  labs(title="Miles per gallon by number of cylinders and transmission type", caption="cyl groups indicate number of cylinders; transmission type: 0 = automatic, 1 = manual")

Reference:

On the same plot:

library(ggplot2)
ggplot(mtcars, aes(x=factor(cyl), y=mpg, color = factor(am)))+
  geom_boxplot()+
  theme( legend.position = "right" )+
  ylim(0,40)+
  labs(title="Miles per gallon by number of cylinders and transmission type", caption="transmission type: 0 = automatic, 1 = manual", x = "number of cylinders", y="miles per gallon", color="Transmission")

Different colors:

library(ggplot2)
ggplot(mtcars, aes(x=factor(cyl), y=mpg, fill = factor(am)))+
  geom_boxplot()+
  theme( legend.position = "right" )+
  ylim(0,40)+
  labs(title="Miles per gallon by number of cylinders and transmission type", caption="transmission type: 0 = automatic, 1 = manual", x = "number of cylinders", y="miles per gallon", fill="Transmission")

Different colors, grayscale:

library(ggplot2)
ggplot(mtcars, aes(x=factor(cyl), y=mpg, fill = factor(am)))+
  geom_boxplot()+
  theme( legend.position = "right" )+
  ylim(0,40)+
  labs(title="Miles per gallon by number of cylinders and transmission type", caption="transmission type: 0 = automatic, 1 = manual", x = "number of cylinders", y="miles per gallon", fill="Transmission")+
  scale_fill_grey(start = 0.8, end = 0.5) 

15.4.7 Flowchart or DAG

15.4.7.1 DiagrammeR package

This is my preferred way to make a flowchart or DAG.

Please see this guide to making a basic flowchart in R: https://rpubs.com/anshulkumar/FlowchartsInR

15.4.7.2 ggdag package

I consider ggdag to be more primitive than DiagrammeR, but both might be useful depending on the situation. ggdag is likely faster, which could occasionally be useful. An example is below.

if (!require(ggdag)) install.packages('ggdag')
library(ggdag)

test <- dagify(CardiacArrest ~ cholesterol,
       cholesterol ~ smoking + weight,
       smoking ~ unhealthy,
       weight ~ unhealthy)
ggdag(test, node = FALSE, text_col = "black") + remove_axes() + remove_grid() + labs(caption = "DAG #1")

15.4.8 Dumbbell plot for pre-post or other paired data

How can we visualize pre-post data when we don’t have too many observations?

# example pre-post data
name<- c("Zeld","Xya","Blork","Weeda","Mobant")
pre<- c(5,6,4,5,7)
post<- c(6,8,4,9,5)
age<- c(10,10,11,11,12)

d <- data.frame(name,pre,post,age)

d
##     name pre post age
## 1   Zeld   5    6  10
## 2    Xya   6    8  10
## 3  Blork   4    4  11
## 4  Weeda   5    9  11
## 5 Mobant   7    5  12
if (!require(dumbbell)) install.packages('dumbbell')
## Loading required package: dumbbell
## Warning: package 'dumbbell' was built under R version 4.2.3
library(dumbbell)
# dumbbell plot
dumbbell(xdf=d,id = "name",key="age",column1 = "pre",column2 = "post", lab1 = "pre",lab2 = "post", pointsize=6, delt=1, pt_val=0, textsize=4, arrow=1, arrow_size=0.2,col_seg1 = "black", col_seg2 = "black") 
## Selecting by val

More information:

15.4.9 Heat map

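library(ggplot2)
# average mpg for each combination of transmission and cylinders
mpg_means <- aggregate(mpg ~ am + cyl, data = mtcars, FUN = mean)
ggplot(mpg_means, aes(as.factor(am), as.factor(cyl), fill= mpg)) + 
  geom_tile() + scale_fill_gradient(low="white", high="red") +
  labs(caption = "Average miles per gallon, by transmission and cylinders")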

15.5 Regression

15.5.1 Automatically get regression equation

reg <- lm(mpg~., mtcars)
cc <- reg$coefficients
(eqn <- paste("Y =", paste(round(cc[1],2), paste(round(cc[-1],2), names(cc[-1]), sep=" * ", collapse=" + "), sep=" + "), "+ e"))
## [1] "Y = 12.3 + -0.11 * cyl + 0.01 * disp + -0.02 * hp + 0.79 * drat + -3.72 * wt + 0.82 * qsec + 0.32 * vs + 2.52 * am + 0.66 * gear + -0.2 * carb + e"

Source: keithpjolley. 20 Sep 2018. Response to How to translate the results from lm() to an equation?. Accessed Feb 2021. https://stats.stackexchange.com/questions/63600/how-to-translate-the-results-from-lm-to-an-equation.

15.5.2 Confidence Intervals

If you ran a regression and saved the regression result as regobj, you can of course see the results whenever you want by running summary(regobj) as you already know. But you can also see the 95% confidence intervals for the estimated coefficients by running confint(regobj).
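Here is a small sketch of what that looks like in practice. The model below (mpg predicted by wt and hp in mtcars) is just an example I picked for illustration; regobj stands in for whatever regression you saved.

```r
# Fit a small example regression; mtcars ships with R
regobj <- lm(mpg ~ wt + hp, data = mtcars)

summary(regobj)                        # estimates, standard errors, p-values
ci95 <- confint(regobj)                # 95% confidence intervals (the default)
ci90 <- confint(regobj, level = 0.90)  # a different confidence level
ci95
```

The level argument is optional. Leave it out to get the usual 95% intervals.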

15.6 General R use and processes

15.6.1 Run R in web browser without signing in

This should work on your phone, too!

15.6.2 Check R version and packages

See which version of R you are using, which packages are loaded, and other information about the current session of R:

sessionInfo()

15.6.3 Remove an object from the environment

Remove df from the environment in R:

rm(df)

15.6.4 Remove all objects from the environment

rm(list=ls())

15.6.5 Get code to recreate a dataframe or object

dput(nameOfObject)

15.6.6 Fix variable names automatically

With a data set called d:

names(d) <- make.names(names(d))

See the changes:

names(d)

More information:

?make.names
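To see what make.names() actually changes, you can try it on a few made-up names. The vector messy below is hypothetical and only for illustration.

```r
# Some names that R would not accept as-is
messy <- c("my var", "2nd", "ok_name", "a-b")

cleaned <- make.names(messy)
cleaned
## [1] "my.var"  "X2nd"    "ok_name" "a.b"

# unique = TRUE also de-duplicates repeated names
deduped <- make.names(c("score", "score"), unique = TRUE)
deduped
## [1] "score"   "score.1"
```

Spaces and dashes become periods, and names that start with a number get an X in front.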

15.6.7 List all variables in a dataset

The following code outputs a list of all variables in the dataset mtcars, in quotation marks and separated by commas:

for(n in names(mtcars)){cat('"',n,'",', sep='')}
## "mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb",
  • In the code above, replace mtcars with the name of your own dataset.
  • There will be an extra—likely unwanted—comma at the end of the list.

This code can give you a list of your variables to manually modify and then paste into a regression formula:

for(n in names(mtcars)){cat(' + ',n, sep='')}
##  + mpg + cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

This does nearly the same thing, without the spaces or the leading plus sign:

(VarListString <- paste(names(mtcars), collapse="+"))
## [1] "mpg+cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb"

Without quotation marks:

cat(paste(names(mtcars), collapse="+"))
## mpg+cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb
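If the goal is a regression formula, another option is to build the formula directly instead of copying and pasting. This is only a sketch: it assumes mpg is the outcome and that every other variable in mtcars should be a predictor.

```r
# Every variable except the outcome
predictors <- setdiff(names(mtcars), "mpg")

# reformulate() (from the stats package, loaded by default) builds a formula
f <- reformulate(predictors, response = "mpg")
f
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

# The formula can be passed straight to lm()
reg_all <- lm(f, data = mtcars)
```

To drop a variable, remove it from predictors before calling reformulate().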

15.6.8 List all objects in R environment

ls()
##  [1] "age"                          "allthethingswithoutspaces"   
##  [3] "allthethingswithspaces"       "anotherthing"                
##  [5] "bw"                           "c1"                          
##  [7] "c2"                           "cc"                          
##  [9] "d"                            "d.original"                  
## [11] "d1"                           "d2"                          
## [13] "demographics"                 "descriptivetable"            
## [15] "descriptivetable.abbreviated" "descriptivetable.transpose"  
## [17] "df"                           "dfcopy"                      
## [19] "dfixed"                       "dNumericOnly"                
## [21] "dpartial"                     "Engine"                      
## [23] "entireGroup"                  "eqn"                         
## [25] "evenmorethings"               "ExampleTimeOnly"             
## [27] "ExampleTimestamp"             "first"                       
## [29] "flavor"                       "flavor_counts"               
## [31] "flavor_percentages"           "flavors"                     
## [33] "four"                         "fulljoined"                  
## [35] "GroupMeanData"                "hgA"                         
## [37] "hgB"                          "l"                           
## [39] "left"                         "leftjoined"                  
## [41] "likertFix"                    "med_df"                      
## [43] "merged"                       "merged2"                     
## [45] "mtcars.copy1"                 "mtcars.partial"              
## [47] "mtcars.temp"                  "mtcarscopy"                  
## [49] "n"                            "name"                        
## [51] "one"                          "pca1"                        
## [53] "post"                         "pre"                         
## [55] "reg"                          "right"                       
## [57] "rightjoined"                  "s"                           
## [59] "s1"                           "s2"                          
## [61] "searchStrings"                "second"                      
## [63] "signedUpList"                 "something"                   
## [65] "somethingelse"                "summary_df"                  
## [67] "t"                            "test"                        
## [69] "three"                        "total_respondents"           
## [71] "TotalSeconds"                 "Transmission"                
## [73] "two"                          "unique_flavors"              
## [75] "VarListString"                "x"                           
## [77] "zipcodes"

If you see character(0) as the output, that means the environment is empty.

15.6.9 List all files in current working directory or a selected directory

Working directory:

list.files()

A selected directory:

list.files("C:/Users/MyUserName/Path/To/Desired/Folder")
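You can also ask list.files() for only certain kinds of files. The sketch below makes a throwaway folder (the folder and file names are made up) so that it can run on any computer.

```r
# Build a temporary folder with a few empty files
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.csv", "b.csv", "notes.txt")))

list.files(demo_dir)                                    # all files
csv_files <- list.files(demo_dir, pattern = "\\.csv$")  # only names ending in .csv
csv_files
## [1] "a.csv" "b.csv"

# full.names = TRUE returns complete paths, ready to pass to read.csv()
list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
```

The pattern argument is a regular expression, which is why the period is escaped and the $ marks the end of the file name.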

15.6.10 Assigning new values or making copies

Here’s how you make a copy of anything in R:

thing1 <- thing2

The code above is doing the following:

  • Create a new object called thing1
  • The <- operator assigns whatever is on the right to whatever is on the left.
  • Assign the value of thing2 to thing1. Note that thing2 still exists, too.

You can also make a copy of a dataset like this:

library(car)
d <- GSSvocab

This makes a copy of the dataset GSSvocab called d. Then you can just use this dataset without having to type GSSvocab each time.
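One useful consequence is that changing the copy does not change the original. A quick check, using mtcars and the made-up names original and mycopy:

```r
original <- mtcars
mycopy <- original      # make the copy
mycopy$mpg[1] <- 999    # change only the copy

original$mpg[1]         # the original is unchanged
## [1] 21
mycopy$mpg[1]
## [1] 999
```

This is why it can be a good idea to make a copy of your data before you start modifying it.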

15.6.11 Cite R

citation()

15.6.12 Run code from one R file in another R file

It is possible to load functions and run code from other files within one file.

If…

  • Your first file is called one.R
  • Your second file is called two.R

Then you can run the command

source('one.R')

within the file called two.R and all of the code in the file one.R will be run as if it was run from within the file two.R.[225]

Make sure that the file one.R is in your working directory.
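Here is a self-contained sketch you can try. It writes a tiny one.R into a temporary folder and then sources it. The contents of one.R (the function double_it and the object greeting) are made up for this example.

```r
# Create a small one.R in a temporary folder
one_path <- file.path(tempdir(), "one.R")
writeLines(c(
  "double_it <- function(x) x * 2",
  "greeting <- 'hello from one.R'"
), one_path)

# This is the line that would go in two.R
source(one_path)

# Everything defined in one.R is now available
double_it(21)
## [1] 42
greeting
## [1] "hello from one.R"
```

Because the example gives source() the full path to the file, one.R does not need to be in the working directory here.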

More information:

15.6.13 Count how long a process in R takes

To measure how long it takes R to run something, you can put the code

(start <- Sys.time())

and

(end <- Sys.time())
(duration <- end-start)

around the code you want to measure/count.

Let’s say you want to load a data file into R using the code d <- read.csv("SomeData.csv") and you want to measure how long it takes to load the file. You would write this code:

(start <- Sys.time())
d <- read.csv("SomeData.csv")
(end <- Sys.time())
(duration <- end-start)

Above, the stored object duration contains the amount of time it took to load the data file.
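Base R also has system.time(), which wraps the timing around a block of code for you. The sketch below times sorting one million random numbers, which is just a stand-in for whatever you want to measure.

```r
# Option 1: system.time() times whatever is inside the braces
timing <- system.time({
  x <- sort(rnorm(1e6))
})
timing["elapsed"]    # wall-clock seconds

# Option 2: with Sys.time(), difftime() lets you choose the units
start <- Sys.time()
y <- sort(rnorm(1e6))
duration <- difftime(Sys.time(), start, units = "secs")
as.numeric(duration) # duration as a plain number of seconds
```

Choosing the units yourself avoids surprises, because subtracting two times picks seconds, minutes, or hours automatically.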

15.6.14 Save R objects to a folder on the computer

Let’s say you have an R object called MyRobject, which could be a dataframe, saved regression model, table, or anything else.

You can save that R object to your computer like this (it will go to the working directory or directory where your code file is):

saveRDS(MyRobject, file = "ChooseAname.rds")

With the code above, MyRobject will be saved as a file on your computer with the name ChooseAname.rds. You can of course change it to a name other than ChooseAname.rds.

Later, when you want to open the saved object again in RStudio, you can run this code:

MyRobjectFromBefore <- readRDS("ChooseAname.rds")

Above, we load the object saved in ChooseAname.rds into R, where it will have the name MyRobjectFromBefore. Of course you can choose to call it anything other than MyRobjectFromBefore.

Here is the code above once again, for easy copying and pasting:

saveRDS(obj, file = "obj.rds") 
obj <- readRDS("obj.rds")
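If you want to convince yourself that nothing is lost along the way, here is a quick round-trip check. It saves a small regression model to a temporary file (so it does not clutter your folders) and reads it back. The names model and model_again are made up.

```r
tmp <- tempfile(fileext = ".rds")       # a temporary file path
model <- lm(mpg ~ wt, data = mtcars)    # any R object would work here

saveRDS(model, file = tmp)
model_again <- readRDS(tmp)

# The reloaded model has the same coefficients as the one we saved
all.equal(coef(model), coef(model_again))
## [1] TRUE
```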

15.6.15 Comment out portions of a single R command

In the examples below, you can add the # in front of any line to rapidly remove a variable from the code. You can then delete the # to once again include that variable.

15.6.15.1 Adding variables together example

Initial code:

d <- mtcars
d$newVariable = 
  (d$cyl
   + d$hp
   + d$mpg
  )

Easily comment out the mpg variable:

d <- mtcars
d$newVariable = 
  (d$cyl
   + d$hp
  # + d$mpg
  )

15.6.15.2 Regression formula example

Initial code:

r1 <- lm(mpg~cyl
         +am
         +hp
         +disp
         +drat
         ,mtcars)

Easily remove the hp and drat variables:

r1 <- lm(mpg~cyl
         +am
        # +hp
         +disp
        # +drat
         ,mtcars)

15.6.16 Get the name of an object or dataframe as a string

Below, the name of the dataset mtcars will be printed out as a string:

deparse(substitute(mtcars))
## [1] "mtcars"

We could also save the string:

theSavedString <- deparse(substitute(mtcars))

theSavedString
## [1] "mtcars"
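This trick is most useful inside a function, where it recovers the name of whatever object was passed in. The function describe() below is a made-up example.

```r
describe <- function(x) {
  obj_name <- deparse(substitute(x))   # the name the caller typed
  paste0(obj_name, " has ", nrow(x), " rows")
}

describe(mtcars)
## [1] "mtcars has 32 rows"
describe(iris)
## [1] "iris has 150 rows"
```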

15.6.17 Get the R code for a saved object

If we have a stored object and want R code that will recreate it, we can use dput():

dput(mtcars$mpg)
## c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 
## 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 
## 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7, 15, 21.4)

15.6.18 Generating R code using R

15.6.18.1 Parsing and running code from saved strings

Below, parse(text = ...) turns a string into an R expression, and eval(...) then runs that expression as if it were typed as R code.

target <- "myvector"
values <- "c(1,2,3,4)"

eval(parse(text=paste(target, " <- ", values, sep = "")))
myvector
## [1] 1 2 3 4

The code above made the computer run the following:

myvector <- c(1,2,3,4)
  • target and values are saved strings of characters.
  • Replace target, " <- ", values with anything you want and R will run it as if it is R code.
  • More than just target, <-, and values can be added to the comma separated list within the paste(...) function. The list can be endless.
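Where this becomes handy is in a loop, when the code to run changes a little each time. The sketch below creates one new object per variable. The object names mean_mpg and mean_hp are simply what the pasted strings produce.

```r
for (v in c("mpg", "hp")) {
  # Build a line of code as a string, e.g. "mean_mpg <- mean(mtcars$mpg)"
  code <- paste0("mean_", v, " <- mean(mtcars$", v, ")")
  print(code)               # show the code that is about to run
  eval(parse(text = code))  # run it
}

mean_mpg
mean_hp
```

Printing the string before running it makes it much easier to spot mistakes in the generated code.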

15.6.19 Check if one string is contained in another string

library(stringr)
## 
## Attaching package: 'stringr'
## The following objects are masked from 'package:expss':
## 
##     fixed, regex
str_detect("this is the text that the computer will search within","this is the string the computer is trying to find within the other text")
## [1] FALSE
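If you would rather not load a package, base R's grepl() does a similar check. Note that the order of the arguments is reversed: the string to find comes first. The text in haystack is just an example.

```r
haystack <- "this is the text that the computer will search within"

found <- grepl("computer", haystack, fixed = TRUE)
found
## [1] TRUE
grepl("banana", haystack, fixed = TRUE)
## [1] FALSE

# It also works on a whole vector (such as a column of a dataset) at once
fruit_check <- grepl("an", c("banana", "cherry", "mango"), fixed = TRUE)
fruit_check
## [1]  TRUE FALSE  TRUE
```

Setting fixed = TRUE searches for the exact characters instead of treating them as a regular expression.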

15.6.20 Replace/substitute text in strings

gsub("text to be replaced", "replacement text", "string of text in which there is some text to be replaced")

old <- "red"
new <- "blue"
oldtext <- "On Mondays, my favorite color is red"

newtext <- gsub(old, new, oldtext)

newtext
## [1] "On Mondays, my favorite color is blue"
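For reference, gsub() takes the text to find first, then the replacement, then the string to search in. Two related details are worth knowing: sub() replaces only the first match, and fixed = TRUE turns off regular expressions. The strings below are made up for illustration.

```r
txt <- "red car, red bike"

first_only <- sub("red", "blue", txt)    # only the first match
first_only
## [1] "blue car, red bike"

all_matches <- gsub("red", "blue", txt)  # every match
all_matches
## [1] "blue car, blue bike"

# A period normally means "any character"; fixed = TRUE treats it literally
dashed <- gsub(".", "-", "a.b.c", fixed = TRUE)
dashed
## [1] "a-b-c"
```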

Read more (including a more complex and useful example):

15.6.21 Using packages

15.6.21.1 Install and load a package

if (!require(packagename)) install.packages('packagename')
library(packagename)
  • Replace packagename with the name of the package you want to use.

15.6.21.2 Fixing a package

If a package is not loading, you can try this:

install.packages("packagename", dependencies = TRUE)
  • Replace packagename with the name of the package you want to use.

15.6.21.3 Cite a package

citation(package = "packagename")
  • Replace packagename with the name of the package for which you seek the citation.

15.6.21.4 Remove a loaded package

detach("package:NAMEOFPACKAGE", unload = TRUE)
  • Replace NAMEOFPACKAGE with the name of the already-loaded package that you want to remove.

15.6.22 Update all installed packages

update.packages(ask = FALSE, checkBuilt = TRUE)

15.6.23 For loop

k <- 5

for(i in 1:k) {
  cat("\n\n=== Repetition",i,"===\n\n")
  cat(i^2)
}

counter <- 0
for (i in 1:5){
  counter <- counter+1
  print(paste(counter,"hello",i))
}
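Often you will want to keep the results of a loop instead of only printing them. A minimal sketch, using a made-up vector called squares:

```r
k <- 5
squares <- numeric(k)     # an empty vector with room for k results

for (i in seq_len(k)) {
  squares[i] <- i^2       # store each result in its own slot
}

squares
## [1]  1  4  9 16 25
```

Creating the full-length vector before the loop starts is usually faster than growing it one item at a time.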

15.7 General R Markdown use and processes

15.7.1 Page breaks to restart content on a new page

To stop content from appearing on one page and continue on a brand new page (what we would call a page break in a word processor), write the following into your RMarkdown document where you would like the break to happen.

\newpage

Put it in the plain text portion of your RMarkdown file, where you put sentences that you write. Do NOT put it into a code chunk or other field of any kind.

I often use this if I want to make sure that a table or other output appears entirely on one page.

15.7.3 YAML headers examples

You can try some of these YAML headers at the top of your R Markdown documents to see what they make your knitted file look like. The ones below are some of my favorites. There are—of course—infinite other possibilities.

HTML with automatic date, code folding, table of contents up to 4th level:

---
title: "Problems with Bugs Bunny"
subtitle: "This is not real"
author: "Duck, Daffy"
date: "Version -- `r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    code_folding: hide
    toc: true
    toc_depth: 4
---

Another HTML example:

---
title: 'How to write good titles'
author: "J.K."
date: "Date -- `r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    toc: yes
    toc_depth: '3'
    df_print: paged
    number_sections: yes
---

readthedown HTML (click Knit -> Knit to readthedown):

---
title: "The argument against wainscoting"
author: "Compiled by Anshul Kumar"
date: "March"
output:
  rmdformats::readthedown:
    toc_depth: 3
    use_bookdown: true
---

PDF with automatic date, numbered sections, and table of contents up to level 3:

---
title: "This title has five words"
author: "Prepared by Anshul Kumar"
date: "Version -- `r format(Sys.time(), '%d %B %Y')`"
output:
  pdf_document:
    number_sections: yes
    toc: yes
    toc_depth: 3
---

Word document with automatic date, table of contents, and numbered sections:

---
title: 'Supplementary Content: Full analysis and results'
author: "Horticulture working group"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  word_document:
    toc: yes
    number_sections: yes
    toc_depth: '3'
---

slidy presentation with footer and slightly decreased font:

---
title: "Meat-based vegetables: a rebuttal against plant-based diets"
subtitle: "just kidding"
author: "Anshul Kumar"
date: "Version -- `r format(Sys.time(), '%d %B %Y')`"
output:
  slidy_presentation:
    duration: 15
    footer: "This text will be visible on all slides!"
    highlight: espresso
    font_adjustment: -1
---

ioslides presentation:

---
title: "Spoons: useful little bowls on sticks"
subtitle: "Why you should ditch forks and use more spoons"
author: "There is no rule that my name has to go here"
date: "Click here and then use arrow keys to advance slides."
output:
  ioslides_presentation:
    highlight: espresso
    smaller: true
---

15.7.4 Two columns or multiple columns

This only works for HTML outputs, as far as I know. Within your R Markdown file (not in a code chunk), put:

This text is not in multiple columns. Below, we initiate multiple columns. 

:::: {style="display: flex;"}

::: {}

This text will go in the left column. 

This text will also go in the left column. 

:::

::: {}

<!-- optional empty middle column for spacing; sometimes I put something here to create space between the other two columns -->

:::

::: {}

This text will go in the right column. 

This text will also go in the right column. 

:::

::::

Now the multiple columns are over. This text will appear normally, without columns involved. 

Notes

15.7.5 Nice looking data table

The DT package creates nice tables that are easy to sort and search.

Display entire dataset (result not shown):

if (!require(DT)) install.packages('DT') 
library(DT)

DT::datatable(mtcars)

Display partial dataset (first 6 rows and 3 columns):

DT::datatable(mtcars[c(1:6),c(1:3)])

15.7.6 Console versus RMarkdown file

In general, all commands should go into your RMarkdown file which is likely in the top-left of your RStudio screen. This way, you have them saved for later when you want to replicate your work or copy the code you used to use again with different data.

However, there are at least two exceptions. The following two commands should never be put in the RMarkdown file and should only be run directly in the console:

  • View(YourDataName) – This is the command you use to view your data in spreadsheet form, within R.
  • file.choose() – This is the command you can use to get the exact file path of a file or folder.

These commands cannot go into your RMarkdown file. Most of you will “Knit” your RMarkdown file into a PDF file to submit/publish your work. But the View command requires a separate window to open in RStudio to show you a spreadsheet. This separate spreadsheet window can’t open in a PDF! You wouldn’t want your entire dataset to show up in the PDF in spreadsheet form, would you? No. You just want the results of the other commands you run to show up. That’s why you get an error if you include the View command in your RMarkdown file and you try to Knit it.

15.7.7 Significant figures in output (number of decimal places)

The following command sets how many significant digits are displayed when tibbles are printed (replace n with a number, such as 4). It might not change how other kinds of output are displayed.

options(pillar.sigfig=n)
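The pillar.sigfig option applies to how tibbles are printed. For ordinary numbers, a few base R functions can control rounding directly. This is a small sketch with a made-up value x.

```r
x <- 3.14159265

round(x, 2)          # round to 2 decimal places
## [1] 3.14
signif(x, 3)         # keep 3 significant digits
## [1] 3.14
sprintf("%.2f", x)   # format as text with exactly 2 decimal places
## [1] "3.14"
format(2.5, nsmall = 2)  # text with at least 2 decimal places
## [1] "2.50"
```

round() and signif() return numbers, while sprintf() and format() return text, which matters if you plan to do more calculations afterwards.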

More information might be available here:

15.8 R errors

15.8.1 Plot margins error

A common error in R is: Error in plot.new() : figure margins too large

Running this code might fix it:

par(mar=c(1,1,1,1))
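The four numbers are the bottom, left, top, and right margins, and R's default is c(5, 4, 4, 2) + 0.1. Since par() changes stay in effect for later plots, you may want to save the old settings and put them back afterwards. A sketch, with old_par as a made-up name:

```r
# par() returns the previous settings, so we can save them
old_par <- par(mar = c(1, 1, 1, 1))

plot(mtcars$wt, mtcars$mpg)   # this plot uses the small margins

par(old_par)                  # restore the previous margins
```

With margins this small, the axis labels may be cut off, so this is more of a quick fix than a permanent setting.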

You can read more about this issue by clicking here.

15.8.2 rlang or xfun package version error

I recommend you read this entire section before you decide what solution to use.

This is an error I have seen multiple times, when trying to load packages other than rlang. For example, I might try to run:

library(somePackage)

And then I’ll see something like this:

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 

namespace ‘rlang’ 0.4.12 is already loaded, but >= 1.0.2 is required

The error is not always written exactly as it is above, but it usually is telling you that your version of the package rlang is outdated.

Resolving this error is not always easy. The following procedure has worked for me before:

  1. Open R alone, outside of RStudio.
  2. Remove the existing rlang package: remove.packages("rlang")
  3. Make sure it is removed: run library(rlang) and you should get an error message.
  4. Re-install rlang: install.packages("rlang")
  5. (might or might not be needed) Save all of your open work in R or RStudio, run q() to quit R, and then reopen R to start a fresh session.

Another approach that has worked for me is:

  1. Run .libPaths().
  2. Go to the resulting folder(s).
  3. Manually delete the folders for the package causing trouble.
  4. Reinstall the package causing trouble like usual: install.packages("packageCausingTrouble").

I do not have too much experience with this error and cannot guarantee that the procedure above will work. While doing the procedure above, you can check whether rlang is loaded or not and its current version by running sessionInfo() after each step.
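If you only want the version of one package, without reading through all of the sessionInfo() output, something like the sketch below can help. The version "1.0.2" is taken from the example error message above; replace it with whatever version your own error message asks for.

```r
pkg <- "rlang"   # or "xfun", or whichever package the error names

if (pkg %in% rownames(installed.packages())) {
  installed_version <- packageVersion(pkg)
  print(installed_version)
  # TRUE means the installed version is new enough
  print(installed_version >= "1.0.2")
} else {
  message(pkg, " is not installed")
}

# TRUE if the package is currently loaded in this R session
is_loaded <- pkg %in% loadedNamespaces()
is_loaded
```

If the installed version is new enough but the error persists, an older copy is probably still loaded, and restarting R is the next thing to try.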

Similar issues can also happen with the package xfun, in which case you can try the procedure above but replace rlang with xfun, like below:

remove.packages("xfun")
library(xfun) # should result in error
install.packages("xfun")
library(xfun) # should be successful this time
sessionInfo() # should show a newer version of the package

Sometimes, you can just run update.packages(ask = FALSE) and that can solve everything.

15.9 R Markdown errors

15.9.1 Formatting when knitting to Word

15.9.2 Knitting to PDF on a Mac

If you encounter problems with knitting your RMarkdown file to PDF on a Mac, you can try the following code and see if it solves the problem:

if (!require(tinytex)) install.packages('tinytex') 
library(tinytex)

tinytex::install_tinytex()

You can read more about this potential solution here, if you wish: https://stackoverflow.com/questions/58078345/failed-to-compile-tex-in-r-markdown.

15.10 RStudio

15.10.1 Resetting RStudio or RStudio running slowly

To reset your RStudio (perhaps because some aspect of it is not working or it is running slowly), you can follow the instructions here:

It could potentially also help you to change the project that you are working in:

Unchecking diagnostics is also a known way to speed up RStudio:

The solution above, in my RStudio version,[226] involves clicking on: Tools -> Global Options -> Code -> Diagnostics. Then uncheck items related to diagnostics, like the ones that say “show diagnostics…”.

Below is a way to completely remove everything related to R and RStudio from your computer and reinstall them:

15.11 Tricks and tools unrelated to data analysis or R Markdown

15.11.1 Join and manipulate PDF files

if (!require(pdftools)) install.packages('pdftools') 
library(pdftools)

Combine two PDF files

first.file <- file.choose()
second.file <- file.choose()

pdf_combine(c(first.file, second.file), output = "Combined File.pdf")

The new combined file, called Combined File.pdf, will be saved in your working directory. You might need to run one line at a time in the code above.

See and modify the well-written code here:

Combine all PDF files in a single directory

Put all of the PDF files that you want to combine with each other into a single directory (folder) on your computer.[227] Then, set that directory containing your PDF files as the working directory in R. Once everything is ready, run the code below:

pdf_combine(list.files(), output = "Combined File.pdf")

And this code will pick out only the PDF files from your working directory and ignore the other files (this code is adapted from the solution by user Miff in “Merging PDF easily with pdftools” at https://stackoverflow.com/questions/57807165/merging-pdf-easily-with-pdftools):

pdf_combine(list.files(getwd(), pattern="\\.pdf$", full.names=TRUE), output  = "Combined File.pdf")

Make a partial version of a PDF file

pdf_subset('original.pdf', pages = 1:3, output = "partial.pdf")

15.11.2 Make a GIF image from static images

The results of this code are not shown.

Install and/or load magick package:

if (!require(magick)) install.packages('magick')
library(magick)

Identify directory where images are stored:

setwd("path/to/folder/containing/images")

Identify the static images that will go into the GIF:

image1 <- image_read("myfirstsavedimage.png")
image2 <- image_read("mynextsavedimage.jpg")
image3 <- image_read("myfinalsavedimage.png")

You could also identify them this way if you want, instead (this way is slower but more user-friendly):

image1 <- image_read(file.choose())
image2 <- image_read(file.choose())
image3 <- image_read(file.choose())

Put all of the selected images together in a vector:

allimages <- c(image1, image2, image3)

Put all the images together in an animation:

mygif <- image_animate(image_scale(allimages, "200x200"), fps = 0.5, dispose = "previous")

Above, the images get rescaled to fit within 200 by 200 pixels, and each individual image will display for 2 seconds, because fps (frames per second) is 0.5.

Export the image as a GIF:

image_write(mygif, path = "My Favorite File Name.gif", format = "gif")

getwd()

More details:

15.11.3 YouTube video tips

  • To force a video you made to show subtitles (captions) by default, add the following to the list of tags in the video: yt:cc=on.[228]

  225. I have not tested this yet myself, at the time of writing.

  226. RStudio 2022.07.1+554 “Spotted Wakerobin” Release (7872775ebddc40635780ca1ed238934c3345c5de, 2022-07-22) for Windows. Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.12.8 Chrome/69.0.3497.128 Safari/537.36.

  227. You might find it useful to rename your files to order them the way you want. For example, you can add the number 1 to the start of the file name of the first file, 2 to the start of the file name of the second file, and so on.

  228. I learned this from: Rev. How To Force Captions & Subtitles On YouTube (2018). https://www.youtube.com/watch?v=DM-shScjSyg.