This section contains code, tips, and resources that are not elsewhere in the book and that may be useful at times. Full explanations are not provided in most cases. This appendix is perpetually under construction and is not comprehensive in any way. Please report feedback, suggestions, and links that no longer work to Anshul at akumar@mghihp.edu.
Manipulating data or other items
Add random noise to a variable
mtcarscopy <- mtcars
mtcarscopy$am <- mtcarscopy$am+runif(nrow(mtcarscopy),-1,1)
The jitter() function can also be useful for this:
mtcarscopy$am <- jitter(mtcarscopy$am)
Sum all variables into a new variable
Create a new variable called VarSum within dataset d which is the sum of all other variables in d:
The computer will add together the values for each variable in each row.
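The code for this step is not shown above. A minimal sketch, assuming every variable in d is numeric (the small data frame here is made up for illustration):

```r
# Hypothetical example data; all variables are numeric
d <- data.frame(Var1 = c(1, 2, 3), Var2 = c(10, 20, 30), Var3 = c(100, 200, 300))

# rowSums() adds across the columns within each row
d$VarSum <- rowSums(d)
d
```

If d contains missing values, rowSums(d, na.rm = TRUE) ignores them when summing. If d contains non-numeric variables, select the numeric ones first.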
Sum all values of a variable
For entire dataset:
For a single variable:
colSums(as.data.frame(mtcars$cyl))[[1]]
Above, colSums can be changed to rowSums to sum up rows instead of columns.
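The code for the entire-dataset case is not shown above. A sketch using the built-in mtcars data, along with sum() as a simpler alternative for a single variable:

```r
# Sum of every variable in the dataset, one total per column
# (all columns must be numeric)
colSums(mtcars)

# Sum of a single variable
sum(mtcars$cyl)
```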
Join words together
Use the paste(...) function:
something <- "three"
somethingelse <- "word"
anotherthing <- "phrase"
allthethingswithspaces <- paste(something, somethingelse, anotherthing)
allthethingswithspaces
## [1] "three word phrase"
allthethingswithoutspaces <- paste(something, somethingelse, anotherthing, sep = '')
allthethingswithoutspaces
## [1] "threewordphrase"
More or fewer than three items, separated by commas, can be added to the list of items in the paste(...) function. For example:
evenmorethings <- paste(something, somethingelse, anotherthing, anotherthing,something,somethingelse,somethingelse, sep = '')
evenmorethings
## [1] "threewordphrasephrasethreewordword"
Separate a variable into multiple dummy variables
d <- data.frame(person = c("Audi","Broof","Chruuma","Deenolo", "Eeman"),gender=c("A","A","B","B","B"), IceCreamFlavorsYouLikeSelectAllApply = c("chocolate","strawberry", "chocolate,vanilla","strawberry,vanilla,chocolate","vanilla,other") )
d
## person gender IceCreamFlavorsYouLikeSelectAllApply
## 1 Audi A chocolate
## 2 Broof A strawberry
## 3 Chruuma B chocolate,vanilla
## 4 Deenolo B strawberry,vanilla,chocolate
## 5 Eeman B vanilla,other
Result not shown:
# Split the values in the column by commas
flavors <- strsplit(d$IceCreamFlavorsYouLikeSelectAllApply, ",")
# Get all unique ice cream flavors
unique_flavors <- unique(unlist(flavors))
# Create dummy variables for each unique flavor
for (flavor in unique_flavors) {
# Create a new column with the flavor name and initialize with 0
d[[flavor]] <- 0
# Set the value to 1 if the flavor is in the list for that row
d[[flavor]] <- as.integer(sapply(flavors, function(x) flavor %in% x))
}
d
The code above to create dummy variables was written by OpenAI’s GPT-3 model and accessed on 24 Sep 2023. This prompt was used: in r, i have a dataframe called d. there is a variable called IceCreamFlavorsYouLikeSelectAllApply in which each person (row in in the dataframe) might have one or multiple answers. some might say just ‘chocolate’ or ‘vanilla’ but others might say ’chocolate,vanilla,strawberry,other” or just “chocolate,strawberry”, for example. I want to create separate dummy variables in dataframe d for each of the ice cream flavors.
We get the code below with the following follow-up prompt to GPT-3: rewrite the code above so that each new variable will be called flavor_chocolate instead of just chocolate.
# Split the values in the column by commas
flavors <- strsplit(d$IceCreamFlavorsYouLikeSelectAllApply, ",")
# Get all unique ice cream flavors
unique_flavors <- unique(unlist(flavors))
# Create dummy variables with names like 'flavor_chocolate', 'flavor_vanilla', etc.
for (flavor in unique_flavors) {
# Create a new column with the flavor name and initialize with 0
d[[paste0("flavor_", flavor)]] <- 0
# Set the value to 1 if the flavor is in the list for that row
d[[paste0("flavor_", flavor)]] <- as.integer(sapply(flavors, function(x) flavor %in% x))
}
d
## person gender IceCreamFlavorsYouLikeSelectAllApply flavor_chocolate
## 1 Audi A chocolate 1
## 2 Broof A strawberry 0
## 3 Chruuma B chocolate,vanilla 1
## 4 Deenolo B strawberry,vanilla,chocolate 1
## 5 Eeman B vanilla,other 0
## flavor_strawberry flavor_vanilla flavor_other
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 1 1 0
## 5 0 1 1
Follow-up prompt: how do i calculate summary counts and percentages of how many people like each flavor?
# Calculate summary counts for each flavor
flavor_counts <- colSums(d[paste0("flavor_", unique_flavors)])
# Calculate total number of respondents
total_respondents <- nrow(d)
# Calculate percentages
flavor_percentages <- (flavor_counts / total_respondents) * 100
# Create a summary dataframe
summary_df <- data.frame(Flavor = unique_flavors, Counts = flavor_counts, Percentages = flavor_percentages)
# Print the summary dataframe
print(summary_df)
## Flavor Counts Percentages
## flavor_chocolate chocolate 3 60
## flavor_strawberry strawberry 2 40
## flavor_vanilla vanilla 3 60
## flavor_other other 1 20
Follow-up prompt (after a few attempts): we now have a dataframe called d with variables for gender as well as flavor_chocolate dummy variables and similar variables for each flavor. for each flavor, we need to know the count and what percentage of people in each gender group is equal to 1
# Load the dplyr package if not already loaded
# install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Assuming you have a 'gender' column in your dataframe 'd'
# Calculate the count and percentage of each flavor for each gender
summary_df <- d %>%
group_by(gender) %>%
summarize(across(starts_with("flavor_"), ~sum(. == 1), .names = "Count_{.col}")) %>%
mutate(across(starts_with("Count_"), ~. / sum(.), .names = "Percentage_{.col}"))
# Rename columns
colnames(summary_df) <- c("Gender", paste0("Count_", unique_flavors), paste0("Percentage_", unique_flavors))
# Print the summary data
print(summary_df)
## # A tibble: 2 × 9
## Gender Count_chocolate Count_strawberry Count_vanilla Count_other
## <chr> <int> <int> <int> <int>
## 1 A 1 1 0 0
## 2 B 2 1 3 1
## # ℹ 4 more variables: Percentage_chocolate <dbl>, Percentage_strawberry <dbl>,
## # Percentage_vanilla <dbl>, Percentage_other <dbl>
Now we can transpose the result to make it easier to read (we also have to change gender into the column names):
summary_df <- t(summary_df)
colnames(summary_df) <- unlist(summary_df[1, ])
summary_df <- summary_df[-1, ] # Remove the first row (which is now the column names)
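One caution about the percentages above: sum(.) in the mutate() step adds up the counts across the gender rows, so each Percentage_ column gives a gender's share of everyone who likes that flavor, not the share of that gender group who like it. A minimal sketch of the within-gender version, using a small made-up copy of the dummy variables (within_gender is a name chosen here for illustration):

```r
library(dplyr)

# Small stand-in for d with the dummy variables already created
d <- data.frame(
  gender = c("A", "A", "B", "B", "B"),
  flavor_chocolate = c(1, 0, 1, 1, 0),
  flavor_vanilla   = c(0, 0, 1, 1, 1)
)

# Within each gender: count of people who like each flavor,
# and the percentage of that gender group who like it
within_gender <- d %>%
  group_by(gender) %>%
  summarize(
    across(starts_with("flavor_"), ~ sum(. == 1), .names = "Count_{.col}"),
    across(starts_with("flavor_"), ~ 100 * mean(. == 1), .names = "Percentage_{.col}")
  )
within_gender
```

Because the dummy variables are coded 0/1, the mean of (. == 1) within a group is the proportion of that group equal to 1.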
Output plain text
something <- "three"
somethingelse <- "word"
anotherthing <- "phrase"
cat(something, somethingelse, sep = '')
## threeword
allthethingswithspaces <- paste(something, somethingelse, anotherthing)
cat(allthethingswithspaces)
## three word phrase
Select rows (observations) from a dataset
Select first 50 rows:
Select individual rows (in this example, select rows 3, 17, 22):
d.new <- d.old[c(3,17,22),]
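The code for selecting the first 50 rows is not shown above. A sketch, using a made-up d.old with 100 rows:

```r
# Hypothetical example data with 100 rows
d.old <- data.frame(id = 1:100, value = (1:100)^2)

# Keep only the first 50 rows (and all columns)
d.new <- d.old[1:50, ]

# head() gives the same result
d.new2 <- head(d.old, 50)
```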
Select rows (observations) from a dataset using dplyr filter
library(dplyr)
mtcars %>% filter(cyl==4)
Save a copy:
mtcars.cyl4 <- mtcars %>% filter(cyl==4)
Remove rows (observations) from a dataset, based on negative criteria
Create a new data set called newdata.cars that contains the observations (rows) in mtcars that do NOT have cyl equal to 8.
newdata.cars <- mtcars[ which(mtcars$cyl!=8), ]
table(mtcars$cyl)
table(newdata.cars$cyl)
Above, we see that the 14 cars with 8 cylinders were removed from the original data.
Remove rows (observations) from a dataset, based on observation numbers
Remove first 50 rows:
d.new <- d.old[-c(1:50),]
Remove selected rows (in this example, remove rows 3, 17, 22):
d.new <- d.old[-c(3,17,22),]
Remove all rows except header (to create an empty dataset while preserving column names):
d.new <- d.old[-c(1:nrow(d.old)),]
Subset based on multiple qualitative variable levels
d <- data.frame(name = c("Aabe","Bobay","Chock","Deela","Edweeeena","Foort","Gooba","Hi"),
group = c("A","A","B","B","C","C","D","D")
)
dpartial <- d[d$group %in% c("A", "B"), ]
Subset and rename selected variables (columns)
library(dplyr)
newData <- mtcars %>% dplyr::select(mileage=mpg, transmission=am)
Rename selected variables (columns)
library(dplyr)
newData <- mtcars %>% dplyr::rename(mileage=mpg, transmission=am)
Remove selected variables (columns) from a dataset
Remove variable Var1 from dataset d:
d$Var1 <- NULL
Remove variable mpg from the data set mtcars and save the new version of the data set as d:
d <- subset(mtcars, select = -mpg)
Remove variables mpg, cyl, and carb from the data set mtcars and save it as d:
d <- subset(mtcars, select = -c(mpg, cyl, carb))
Above, deleting the - sign in the select argument will keep (rather than remove) the listed variables and remove all others.
Using dplyr, remove variable Var1 from dataset d:
library(dplyr)
d <- d %>% select(-Var1)
Using dplyr, remove variables Var1 and Var2 from dataset d:
library(dplyr)
d <- d %>% select(-Var1,-Var2)
Remove variables with missing data
Option 1
This removes all variables that have one or more missing values:
NewData <- OldData[ , colSums(is.na(OldData)) == 0]
The code above was taken from the following resource:
Option 2
Remove all variables which contain only NA values:
NewData <- OldData[ , colSums(is.na(OldData)) < nrow(OldData)]
The code above was taken from the following resource:
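To see the difference between the two options, here is a small sketch on made-up data (the column names are invented for illustration). It adds drop = FALSE, which is not in the code above, so that the result stays a data frame even when only one variable survives:

```r
# Hypothetical data: one complete variable, one partly missing, one entirely missing
OldData <- data.frame(
  complete = c(1, 2, 3),
  someNA   = c(1, NA, 3),
  allNA    = c(NA, NA, NA)
)

# Option 1: keep only variables with no missing values
opt1 <- OldData[ , colSums(is.na(OldData)) == 0, drop = FALSE]
names(opt1)

# Option 2: drop only the variables that are entirely NA
opt2 <- OldData[ , colSums(is.na(OldData)) < nrow(OldData), drop = FALSE]
names(opt2)
```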
Remove non-numeric variables from a dataset
Example data:
d <- data.frame(
name = factor(c("Aronda","Baeoi","Chromp","Daroona")),
age = c(23,45,56,67),
citizenship = c("Tanzania","Nigeria","Mexico","France"),
educationYears = c(1,2,3,4)
)
d
## name age citizenship educationYears
## 1 Aronda 23 Tanzania 1
## 2 Baeoi 45 Nigeria 2
## 3 Chromp 56 Mexico 3
## 4 Daroona 67 France 4
str(d)
## 'data.frame': 4 obs. of 4 variables:
## $ name : Factor w/ 4 levels "Aronda","Baeoi",..: 1 2 3 4
## $ age : num 23 45 56 67
## $ citizenship : chr "Tanzania" "Nigeria" "Mexico" "France"
## $ educationYears: num 1 2 3 4
Remove non-numeric variables:
dNumericOnly <- d[,sapply(d, is.numeric)]
View new data:
dNumericOnly
## age educationYears
## 1 23 1
## 2 45 2
## 3 56 3
## 4 67 4
Do something with the data that you couldn’t do before:
cor(dNumericOnly)
## age educationYears
## age 1.0000000 0.9827076
## educationYears 0.9827076 1.0000000
keywords: keep numeric variables, retain numeric variables
Identical variables in two datasets
Let’s say you have one dataset called dtrain and another one called dtest. And you want to make sure that dtest has the same variables (columns) as dtrain.
The code below tells the computer to retain within dtest only the variables that are in dtrain:
library(dplyr)
dtest <- dtest %>% select(names(dtrain))
Remove observations with missing values (NA values) in a single column
Take the existing dataframe OldData and make a new dataframe called NewData which only contains the rows in OldData that do NOT have a missing value (meaning do not have NA) for a variable called Var1:
NewData <- OldData[which(!is.na(OldData$Var1)),]
NewData should be a version of OldData in which any observation (row) coded as NA for Var1 has been removed.
Replace missing values (NA values) with 0
In an entire dataset
my_dataframe[is.na(my_dataframe)] <- 0
In one column of the dataset
my_dataframe["pages"][is.na(my_dataframe["pages"])] <- 0
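A quick sketch to check the one-column version on made-up data (the chapters column is invented for illustration). It uses the $ form, which does the same thing as the bracket form above:

```r
# Hypothetical data with missing values in two columns
my_dataframe <- data.frame(pages = c(10, NA, 30), chapters = c(NA, 2, 3))

# Replace NA with 0 in the pages column only
my_dataframe$pages[is.na(my_dataframe$pages)] <- 0
my_dataframe
```

The NA in chapters is left alone, because only pages was targeted.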
Combine or concatenate strings
s <- ""
s1 <- "something"
s2 <- "Else"
s <- paste(s, s1, s2, sep = "")
s
## [1] "somethingElse"
Search vector or variable for values
x <- c(1,2,3,4,5)
any(x>5)
## [1] FALSE
## [1] TRUE
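The call that produced the second result ([1] TRUE) is not shown above. The sketch below lists a few searches on the same x; which of them, if any, was the original is an assumption:

```r
x <- c(1, 2, 3, 4, 5)

any(x > 5)    # is any value greater than 5?
any(x == 5)   # is any value exactly 5?
5 %in% x      # the same question, written with %in%
which(x > 3)  # positions of the values greater than 3
```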
Search to see if a row is contained in a data frame
Let’s say we want to check if mtcars contains any rows with am = 1, gear = 5, and carb = 6:
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
match_df(mtcars, data.frame(am=1, gear=5, carb=6))
## Matching on: am, gear, carb
## mpg cyl disp hp drat wt qsec vs am gear carb
## Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
And now let’s test for something that doesn’t exist:
library(plyr)
match_df(mtcars, data.frame(am=1, gear=5, carb="bobb"))
## Matching on: am, gear, carb
## [1] mpg cyl disp hp drat wt qsec vs am gear carb
## <0 rows> (or 0-length row.names)
Search for text strings in variables
Given a dataset like d below, identify which names do and do not contain any of the strings ee and a:
d <- data.frame(Name = c("Aweo","Beena","Cidu","Deleek","Erga", "Fymo","Henny"), LevelOfInterest = c(3,1,2,3,2,1,2))
searchStrings <- c("ee","a")
d$searchMatch <- grepl(paste(toupper(searchStrings),collapse="|"), toupper(d$Name))
d
## Name LevelOfInterest searchMatch
## 1 Aweo 3 TRUE
## 2 Beena 1 TRUE
## 3 Cidu 2 FALSE
## 4 Deleek 3 TRUE
## 5 Erga 2 TRUE
## 6 Fymo 1 FALSE
## 7 Henny 2 FALSE
In each row in d, searchMatch is TRUE if Name contains at least one of the strings in searchStrings. The search ignores upper and lower case because both sides are converted with toupper().
Look up or search for a single item or value
## [1] 32.4
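The code that produced the value above is not shown. One lookup in the built-in mtcars data that returns 32.4 is sketched below; that this was the original example is an assumption:

```r
# Look up one cell by row name and column name
mtcars["Fiat 128", "mpg"]

# The same lookup written as a condition on the row names
mtcars$mpg[rownames(mtcars) == "Fiat 128"]
```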
Randomly reorder/resort data set
One-line version
Quick code to randomly reorder df and save it still as df:
df <- df[sample(1:nrow(df)), ]
More details
Re-sort myOriginalData and save it as myRandomlySortedData:
myOriginalData <- mtcars
rows <- sample(nrow(myOriginalData))
myRandomlySortedData <- myOriginalData[rows, ]
Check if it worked (results not shown):
head(myOriginalData)
head(myRandomlySortedData)
Sort/order data set by one or more variables
Sort dataframe oldData by Var1 from low to high:
newData <- oldData[order(oldData$Var1),]
Sort dataframe oldData by Var1 from high to low:
newData <- oldData[order(-oldData$Var1),]
Sort dataframe oldData by Var1 and then Var2, both from low to high:
newData <- oldData[order(oldData$Var1, oldData$Var2),]
Sort dataframe oldData by Var1 from low to high and then Var2 from high to low:
newData <- oldData[order(oldData$Var1, -oldData$Var2),]
- The code above has not been tested for accuracy, as of July 22 2021.
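A small sketch that tries the last pattern on the built-in mtcars data, with cyl standing in for Var1 and mpg for Var2. The minus sign only works for numeric variables:

```r
oldData <- mtcars[c("cyl", "mpg")]

# cyl from low to high, then mpg from high to low within each cyl value
newData <- oldData[order(oldData$cyl, -oldData$mpg), ]
head(newData)

# Check: cyl never decreases, and mpg never increases among the 4-cylinder cars
!is.unsorted(newData$cyl)
all(diff(newData$mpg[newData$cyl == 4]) <= 0)
```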
Create a rank variable based on another variable
In the dataset below, we want to take each student’s score and use that to determine their rank in the class.
(d <- data.frame(studentName = c("Aabe","Beebe","Cheech","Doola","Eena","Fon"), score = c(77,89,45,33,99,77)))
## studentName score
## 1 Aabe 77
## 2 Beebe 89
## 3 Cheech 45
## 4 Doola 33
## 5 Eena 99
## 6 Fon 77
d$classRank <- rank(d$score)
d
## studentName score classRank
## 1 Aabe 77 3.5
## 2 Beebe 89 5.0
## 3 Cheech 45 2.0
## 4 Doola 33 1.0
## 5 Eena 99 6.0
## 6 Fon 77 3.5
Above, we now have a classRank variable which identifies how each student compares to the rest on score.
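Note that rank() gives rank 1 to the lowest score and averages the ranks of tied students (hence the 3.5 values above). If rank 1 should go to the highest score instead, one option is sketched below; classRankTopFirst is a name chosen here for illustration:

```r
d <- data.frame(studentName = c("Aabe","Beebe","Cheech","Doola","Eena","Fon"),
                score = c(77,89,45,33,99,77))

# Negating score makes the highest score rank 1;
# ties.method = "min" gives tied students the same (best available) rank
d$classRankTopFirst <- rank(-d$score, ties.method = "min")
d
```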
Count the occurrence number of each subject or within each group
Below, we have a data frame d1:
d1 <- data.frame(subjectID = c("a","a","a","b","b"))
d1
## subjectID
## 1 a
## 2 a
## 3 a
## 4 b
## 5 b
We want a new variable that numbers each subject's records in order of appearance (1 for a subject's first row, 2 for the second, and so on):
library(dplyr)
d2 <- d1 %>%
dplyr::group_by(subjectID) %>%
dplyr::mutate(withinPersonRecordNumber = row_number()) %>%
ungroup()
d2
## # A tibble: 5 × 2
## subjectID withinPersonRecordNumber
## <chr> <int>
## 1 a 1
## 2 a 2
## 3 a 3
## 4 b 1
## 5 b 2
Count the frequency (number of times) that a single subject ID or variable value/level occurs
Below, we have a data frame d1:
d1 <- data.frame(subjectID = c("a","a","a","b","b"))
d1
## subjectID
## 1 a
## 2 a
## 3 a
## 4 b
## 5 b
We want a new variable that counts how often each subject appears in the data:
library(dplyr)
d2 <- d1 %>%
add_count(subjectID, name = "Frequency")
d2
## subjectID Frequency
## 1 a 3
## 2 a 3
## 3 a 3
## 4 b 2
## 5 b 2
Change variable names to numbers
d <- mtcars
colnames(d) <- seq_len(ncol(d))
Select start of variable name
Select variables that start with the characters StartOfVar1 or apple:
library(dplyr)
NewData <- OldData %>% select(someVariableIwant,starts_with(c("StartOfVar1","apple")))
NewData will now contain the variable someVariableIwant from OldData as well as any variables in OldData that start with the characters StartOfVar1 or apple.
If you don’t want someVariableIwant to be included, you can just do this:
NewData <- OldData %>% select(starts_with(c("StartOfVar1","apple")))
Select numeric variables
Create a copy of olddata which contains only its numeric variables, and save it as newdata:
library(dplyr)
newdata <- olddata %>% select_if(is.numeric)
Select variables from another data set
mtcars.temp <- mtcars[c("mpg","cyl")]
mtcars.copy1 <- mtcars[names(mtcars.temp)]
Above, mtcars.copy1 contains the same variables that are in mtcars.temp, taken from mtcars.
Converting categorical and numeric data
The following code converts a numeric variable to a categorical one:
DataSet$NewVariable <- as.factor(DataSet$OldVariable)
The following code converts a factor (categorical) variable to a numeric one:
DataSet$NewVariable <- as.numeric(as.character(DataSet$OldVariable))
Check what type of data is in each variable:
class(DataSet$OldVariable)
class(DataSet$NewVariable)
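The as.character() step in the factor-to-numeric conversion matters. A short sketch with a made-up factor f shows what happens without it:

```r
# Hypothetical factor whose labels are numbers
f <- factor(c("10", "20", "10", "30"))

as.numeric(f)                # 1 2 1 3: the internal level codes, not the labels
as.numeric(as.character(f))  # 10 20 10 30: the numbers we actually want
class(f)
```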
Here is how we can convert a numeric variable to a categorical variable and then relabel the values:
mtcarscopy <- mtcars
mtcarscopy$amfactor <- as.factor(mtcarscopy$am)
library(plyr)
plyr::revalue
## function (x, replace = NULL, warn_missing = TRUE)
## {
## if (!is.null(x) && !is.factor(x) && !is.character(x)) {
## stop("x is not a factor or a character vector.")
## }
## mapvalues(x, from = names(replace), to = replace, warn_missing = warn_missing)
## }
## <bytecode: 0x00000236f16f9c08>
## <environment: namespace:plyr>
mtcarscopy$amlabeled <- revalue(mtcarscopy$amfactor, c("0"="automatic", "1"="manual"))
head(mtcarscopy[c("am", "amfactor", "amlabeled")], n=10)
## am amfactor amlabeled
## Mazda RX4 1 1 manual
## Mazda RX4 Wag 1 1 manual
## Datsun 710 1 1 manual
## Hornet 4 Drive 0 0 automatic
## Hornet Sportabout 0 0 automatic
## Valiant 0 0 automatic
## Duster 360 0 0 automatic
## Merc 240D 0 0 automatic
## Merc 230 0 0 automatic
## Merc 280 0 0 automatic
Converting likert responses to numeric and summing up totals
Initial data, called d.original:
d.original <- data.frame(
name = c("Abby","Beeta","Chock"),
Q1 = c("2 - somewhat bad","3 - neutral","5 - very good"),
Q2 = c("4 - somewhat good","1 - very bad","3 - neutral"),
Q3 = c("2 - somewhat bad","5 - very good","2 - somewhat bad")
)
Make likertFix function to handle data like this (it uses dplyr, so run library(dplyr) first):
likertFix <- function(df, questionPrefix="", variablePrefix=""){
# df is the dataset you're starting with
# questionPrefix is the characters in quotation marks that label relevant columns for conversion to numbers and totaling up
# variablePrefix is a short word in quotation marks that you want to label the columns being fixed and added
df.q <- df %>% select(starts_with(questionPrefix))
df.q[] <- lapply(df.q, function(x) substring(x,1,1))
df.q[] <- lapply(df.q, function(x) as.numeric(x))
df.q$total <- rowSums(df.q)
colnames(df.q) <- paste(variablePrefix, colnames(df.q), sep = ".")
df <- cbind(df,df.q)
df <- df %>% select(-starts_with(questionPrefix))
return(df)
}
Run the likertFix function on the initial data and save the result:
dfixed <- likertFix(df = d.original, questionPrefix = "Q", variablePrefix = "Pretest")
dfixed
## name Pretest.Q1 Pretest.Q2 Pretest.Q3 Pretest.total
## 1 Abby 2 4 2 8
## 2 Beeta 3 1 5 9
## 3 Chock 5 3 2 10
Assign reference group or reference category in factor variables
Generic code:
mydata$variableToFix <- relevel(as.factor(mydata$variableToFix), ref = "Label of reference group")
Example in which we want to use the cyl variable in mtcars (which we’ll re-save as d) as a factor variable with 6 as the reference group:
d<-mtcars
d$newCyl <- relevel(as.factor(d$cyl), ref = "6")
summary(lm(mpg~newCyl,d))
##
## Call:
## lm(formula = mpg ~ newCyl, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2636 -1.8357 0.0286 1.3893 7.2364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.743 1.218 16.206 4.49e-16 ***
## newCyl4 6.921 1.558 4.441 0.000119 ***
## newCyl8 -4.643 1.492 -3.112 0.004152 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
## F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
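As we can see above, there is no coefficient for cars with 6 cylinders because they are the reference group. The newCyl4 and newCyl8 estimates are differences in mpg relative to 6-cylinder cars, whose mean mpg is the intercept.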
Assign values in dataframe based on matched criteria
Below, set mpg equal to 200 for all observations for which am = 0, gear = 3, and carb = 2.
d <- mtcars
d[which(d$am==0 & d$gear==3 & d$carb==2),]$mpg <- 200
d # view the modified copy (mtcars itself is unchanged)
Save table as dataframe
tableAMxCYL <- with(mtcars, table(am, cyl, useNA = 'ifany'))
dfAMxCYL <- as.data.frame.matrix(tableAMxCYL)
dfAMxCYL
Add am column as variable:
dfAMxCYL$am <- rownames(dfAMxCYL)
dfAMxCYL
Convert all columns in data frame to numeric
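dNew <- as.data.frame(lapply(dOriginal, as.numeric)) # lapply keeps a list of columns, so the result is still a data frame (sapply alone returns a matrix)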
Read more:
Make a copy of an object
Make a copy of df called dfcopy:
This can be done for any object like a dataset, stored value, regression object, and so on.
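The code for this step is not shown above. A minimal sketch, with a made-up df:

```r
# Hypothetical example data
df <- data.frame(x = 1:3)

# Assigning an object to a new name makes a copy
dfcopy <- df

# Changing the copy leaves the original untouched
dfcopy$x <- dfcopy$x * 10
df
dfcopy
```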
Remove missing data
Remove rows in dataset df that contain any missing data:
df <- na.omit(df)
Reverse code a numeric variable
If you have a variable like LevelOfInterest in the dataset d below…
d <- data.frame(Name = c("Aweo","Been","Cida","Deleek","Erga"), LevelOfInterest = c(3,1,2,3,2))
d
## Name LevelOfInterest
## 1 Aweo 3
## 2 Been 1
## 3 Cida 2
## 4 Deleek 3
## 5 Erga 2
…you might want to change it so that:
- 3 becomes 1
- 2 remains 2
- 1 becomes 3
This might help you do that:
d$LevelOfInterest.reverse <- (max(d$LevelOfInterest, na.rm = T)+1)-d$LevelOfInterest
Use a two-way table to see if this worked:
with(d, table(LevelOfInterest, LevelOfInterest.reverse))
## LevelOfInterest.reverse
## LevelOfInterest 1 2 3
## 1 0 0 1
## 2 0 2 0
## 3 2 0 0
This code should work even for a variable with more or fewer than three levels, as long as the lowest level is 1 and the highest level actually appears in the data.
This issue is discussed more here:
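For a scale that does not start at 1 (for example 0 to 4), a slightly more general sketch adds the minimum instead of 1. The vector x below is made up, and this variant is a suggestion rather than the method used above:

```r
# Hypothetical responses on a 0-to-4 scale
x <- c(0, 1, 2, 3, 4, 4)

# Reverse code: the highest value becomes the lowest and vice versa
x.reverse <- max(x, na.rm = TRUE) + min(x, na.rm = TRUE) - x
table(x, x.reverse)
```

This relies on the lowest and highest possible values both appearing in the data. If they might not, replace max() and min() with the known ends of the scale.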
Put row names into a variable
d <- mtcars # example data
d$carName <- rownames(d) # make a new variable called carName containing the row names of d
View(d) # check if it worked
Check if values in one variable are in another variable
entireGroup <- data.frame(name = c("Beebo","Brakaansha","Bettle","Bo","Erl"), age = c(23,45,93,23,4))
signedUpList <- data.frame(writtenName = c("Bettle","Bo"), profession = c("Sword swallower swallower", "Anti snake charming activist"))
entireGroup$signedUp <- ifelse(entireGroup$name %in% signedUpList$writtenName, 1,0)
entireGroup
## name age signedUp
## 1 Beebo 23 0
## 2 Brakaansha 45 0
## 3 Bettle 93 1
## 4 Bo 23 1
## 5 Erl 4 0
Unique identifiers
Make a unique identification (ID) number for observations or groups of observations.
Simple ID number by row
Add a new variable with the row number of each observation:
YourDataFrame$IDnum <- seq_len(nrow(YourDataFrame))
More complicated ID numbers
Sample data to practice with:
d <- data.frame(name = c("Aaaaaaron","Beela","Cononan","Duh","Eeena","Beela","Eeena","Beela"), age = c(1,2,3,4,1,2,1,2), occupation = c("hunter","vegan chef","plumber","plumbing destroyer","omnivore chef","vegan chef","omnivore chef","vegan chef"), day = c(1,1,1,1,15,15,15,15), month = c("January","February","March","April","January","February","March","April"), year = rep(2020,8), result = seq(1,8))
d
## name age occupation day month year result
## 1 Aaaaaaron 1 hunter 1 January 2020 1
## 2 Beela 2 vegan chef 1 February 2020 2
## 3 Cononan 3 plumber 1 March 2020 3
## 4 Duh 4 plumbing destroyer 1 April 2020 4
## 5 Eeena 1 omnivore chef 15 January 2020 5
## 6 Beela 2 vegan chef 15 February 2020 6
## 7 Eeena 1 omnivore chef 15 March 2020 7
## 8 Beela 2 vegan chef 15 April 2020 8
Generate variable ID in dataset d containing a unique identification number for each person:
if (!require(udpipe)) install.packages('udpipe')
library(udpipe)
d$ID <- unique_identifier(d, c("name"))
d
## name age occupation day month year result ID
## 1 Aaaaaaron 1 hunter 1 January 2020 1 1
## 2 Beela 2 vegan chef 1 February 2020 2 2
## 3 Cononan 3 plumber 1 March 2020 3 3
## 4 Duh 4 plumbing destroyer 1 April 2020 4 4
## 5 Eeena 1 omnivore chef 15 January 2020 5 5
## 6 Beela 2 vegan chef 15 February 2020 6 2
## 7 Eeena 1 omnivore chef 15 March 2020 7 5
## 8 Beela 2 vegan chef 15 April 2020 8 2
More sample data for practice:
d2 <- data.frame(name = c("Aaaaaaron","Beela","Cononan","Duh","Eeena","Fewe","Graam","Hiol"), number = c(1,1,1,1,0,0,0,0), color = c("green","brown","green","brown","green","brown","green","brown"))
d2
## name number color
## 1 Aaaaaaron 1 green
## 2 Beela 1 brown
## 3 Cononan 1 green
## 4 Duh 1 brown
## 5 Eeena 0 green
## 6 Fewe 0 brown
## 7 Graam 0 green
## 8 Hiol 0 brown
Generate variable group in dataset d2 containing a unique identification number for each number-color pair:
if (!require(udpipe)) install.packages('udpipe')
library(udpipe)
d2$group <- unique_identifier(d2, c("number","color"))
d2
## name number color group
## 1 Aaaaaaron 1 green 4
## 2 Beela 1 brown 3
## 3 Cononan 1 green 4
## 4 Duh 1 brown 3
## 5 Eeena 0 green 2
## 6 Fewe 0 brown 1
## 7 Graam 0 green 2
## 8 Hiol 0 brown 1
Within-group count of observations and group size, within-group ID number
If you have multiple observations in one group or for one person and you need to count each observation’s number within each group or person, the code below should help.
# Load the required package (install it first if needed)
library(dplyr)
# Create an example dataframe
df <- data.frame(
person = c("John", "John", "Mary", "Mary", "Mary", "Peter"),
age = c(25, 30, 35, 40, 45, 50)
)
# Create a new variable that numbers the rows within each person (1, 2, 3, ...)
df <- df %>%
dplyr::group_by(person) %>%
dplyr::mutate(row_count = row_number()) %>%
ungroup()
# Print the modified dataframe
print(df)
The code and comments above were generated by ChatGPT on May 22 2023, with minor modifications by Anshul.
If you want to generate a group size variable instead, this might help:
# Load the required package (install it first if needed)
library(dplyr)
# Create an example dataframe
df <- data.frame(
person = c("John", "John", "Mary", "Mary", "Mary", "Peter"),
age = c(25, 30, 35, 40, 45, 50)
)
# Create a new variable with the count of rows for each person
df <- df %>%
dplyr::group_by(person) %>%
dplyr::mutate(row_count = n()) %>%
ungroup()
# Print the modified dataframe
print(df)
The code and comments above were generated by ChatGPT on May 22 2023.
Make new character variable or summary report based on other variables
Example data:
d <- data.frame(
flavor = c("chocolate","vanilla","strawberry","other"),
numberEaten = c(1,2,3,4)
)
d$joinedVariable <- paste0("Flavor and number: ",as.character(d$flavor)," ", as.character(d$numberEaten))
## flavor numberEaten joinedVariable
## 1 chocolate 1 Flavor and number: chocolate 1
## 2 vanilla 2 Flavor and number: vanilla 2
## 3 strawberry 3 Flavor and number: strawberry 3
## 4 other 4 Flavor and number: other 4
Source: Solution by user sandeep. Concatenating two string variables in r. https://stackoverflow.com/questions/26321702/concatenating-two-string-variables-in-r.
keywords: concatenate, join
Converting time variables in R
You might have data in which there are time stamps which you need to convert into a continuous variable that you can put into a regression. For example, maybe you have data in which each row (observation) is a patient and then you have a variable (column) for the date and time on which the patient came into the hospital. To analyze this data as a continuous variable, maybe you want to calculate how many seconds after midnight in each day a patient came in.
How can we convert a time to a number of seconds? There are helper functions in R that help us do this. Let’s start with an example:
if (!require(lubridate)) install.packages('lubridate')
## Warning: package 'lubridate' was built under R version 4.2.3
library(lubridate) # this package has the period_to_seconds function
# example of what the data looks like
ExampleTimestamp <- "01-01-2019 09:04:58"
# extract just the time (remove the date)
(ExampleTimeOnly <- substr(ExampleTimestamp,12,19))
## [1] "09:04:58"
# convert from time to number of seconds
(TotalSeconds <- period_to_seconds(hms(ExampleTimeOnly)))
## [1] 32698
As you can see above, the time “09:04:58” was converted into 32698 seconds. But we only did it for a single stored value, ExampleTimestamp. How do we do it for the entire variable in a dataset? Let’s say you have a dataset called d with a variable with a time stamp called timestamp and you want to make a new variable (column) in the dataset called seconds. Here’s how you can do it:
if (!require(lubridate)) install.packages('lubridate')
library(lubridate)
d$seconds <- period_to_seconds(hms(substr(d$timestamp,12,19)))
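The same calculation can be cross-checked without lubridate. The sketch below is a base-R alternative, not the method used above, and the three timestamps are made up for illustration:

```r
# Hypothetical data with timestamps in the same "DD-MM-YYYY HH:MM:SS" layout
d <- data.frame(timestamp = c("01-01-2019 09:04:58",
                              "02-01-2019 00:00:10",
                              "03-01-2019 23:59:59"))

# Extract "HH:MM:SS" and split it into hours, minutes, and seconds
timeOnly <- substr(d$timestamp, 12, 19)
parts <- do.call(rbind, strsplit(timeOnly, ":"))

# Seconds after midnight = hours * 3600 + minutes * 60 + seconds
d$seconds <- as.numeric(parts[, 1]) * 3600 +
             as.numeric(parts[, 2]) * 60 +
             as.numeric(parts[, 3])
d
```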
Making and modifying lists
This is not complete.
l <- list() # make empty list
l <- append(l, "bob") # add something to the list
one <- 1
two <- c(1,2,3,4)
three <- list("byron","anshul")
four <- "bob"
t <- list(one, two, three) # make a list
t <- append(t, four) # add something to the list
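A short sketch of how to inspect a list like the one above and how append() treats a list that is being added. The name t2 is chosen here for illustration:

```r
t <- list(1, c(1, 2, 3, 4), list("byron", "anshul"))
t <- append(t, "bob")
length(t)      # number of top-level items in the list

t[[2]]         # double brackets return the item itself
t[[3]][[1]]    # reach inside a nested list

# To add a whole list as ONE new item, wrap it in list();
# without the wrapper, each of its elements would be added separately
t2 <- append(t, list(list("a", "b")))
length(t2)
```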
Categorize variable into quantiles
Make a new variable called mpg.quintile that identifies each observation’s quintile for the variable mpg:
d <- mtcars
d$mpg[4] <- NA # create missing data for illustration only
library(dplyr)
d$mpg.quintile <- ntile(d$mpg, 5)
Check if it worked:
table(d$mpg.quintile, useNA = "always")
class(d$mpg.quintile)
As an option, recode the new variable as a factor and label NA values as missing, so that the computer doesn’t know they are NA anymore (the NA observations won’t get thrown out of an analysis):
d$mpg.quintile.fac <- factor(ifelse(is.na(d$mpg.quintile), "missing", as.character(d$mpg.quintile)))
table(d$mpg.quintile.fac, useNA = "always")
class(d$mpg.quintile.fac)
Combine similar levels into groups
This is incomplete. See this reference:
Merging and joining datasets together
left <- data.frame(name=c("A. Onlyleft","B. Onlyleft","C. Both","D. Both", NA), skill=c("pig latin","latin","pig farming","pig surgery","pig liberating"))
left
## name skill
## 1 A. Onlyleft pig latin
## 2 B. Onlyleft latin
## 3 C. Both pig farming
## 4 D. Both pig surgery
## 5 <NA> pig liberating
right <- data.frame(name = c("E. Onlyright","F. Onlyright","C. Both","D. Both", NA), skill=c("reading","writing","speling","boating","gloating"))
right
## name skill
## 1 E. Onlyright reading
## 2 F. Onlyright writing
## 3 C. Both speling
## 4 D. Both boating
## 5 <NA> gloating
leftjoined <- dplyr::left_join(left, right, by="name")
leftjoined
## name skill.x skill.y
## 1 A. Onlyleft pig latin <NA>
## 2 B. Onlyleft latin <NA>
## 3 C. Both pig farming speling
## 4 D. Both pig surgery boating
## 5 <NA> pig liberating gloating
rightjoined <- dplyr::right_join(left, right, by="name")
rightjoined
## name skill.x skill.y
## 1 C. Both pig farming speling
## 2 D. Both pig surgery boating
## 3 <NA> pig liberating gloating
## 4 E. Onlyright <NA> reading
## 5 F. Onlyright <NA> writing
fulljoined <- dplyr::full_join(left, right, by="name")
fulljoined
## name skill.x skill.y
## 1 A. Onlyleft pig latin <NA>
## 2 B. Onlyleft latin <NA>
## 3 C. Both pig farming speling
## 4 D. Both pig surgery boating
## 5 <NA> pig liberating gloating
## 6 E. Onlyright <NA> reading
## 7 F. Onlyright <NA> writing
merged <- merge(left, right, by="name", all = TRUE, incomparables = NA)
# all.x or all.y also possible to do only left or right joins
# change to all=F to only include observations that match
merged
## name skill.x skill.y
## 1 A. Onlyleft pig latin <NA>
## 2 B. Onlyleft latin <NA>
## 3 C. Both pig farming speling
## 4 D. Both pig surgery boating
## 5 E. Onlyright <NA> reading
## 6 F. Onlyright <NA> writing
## 7 <NA> pig liberating <NA>
## 8 <NA> <NA> gloating
What if we wanted to make a new variable called dataSource that identifies which dataset each observation came from? I’m not sure of the very best way to do it, but here’s one way that seems to work:
left$inLeft <- 1
right$inRight <- 1
merged2 <- merge(left, right, by="name", all = TRUE, incomparables = NA)
merged2$dataSource <- NA
merged2$dataSource[merged2$inLeft==1 & merged2$inRight==1] <- "Present in both left and right"
merged2$dataSource[merged2$inLeft==1 & is.na(merged2$inRight)] <- "Present in left only"
merged2$dataSource[is.na(merged2$inLeft) & merged2$inRight==1] <- "Present in right only"
merged2
## name skill.x inLeft skill.y inRight
## 1 A. Onlyleft pig latin 1 <NA> NA
## 2 B. Onlyleft latin 1 <NA> NA
## 3 C. Both pig farming 1 speling 1
## 4 D. Both pig surgery 1 boating 1
## 5 E. Onlyright <NA> NA reading 1
## 6 F. Onlyright <NA> NA writing 1
## 7 <NA> pig liberating 1 <NA> NA
## 8 <NA> <NA> NA gloating 1
## dataSource
## 1 Present in left only
## 2 Present in left only
## 3 Present in both left and right
## 4 Present in both left and right
## 5 Present in right only
## 6 Present in right only
## 7 Present in left only
## 8 Present in right only
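The same labels can also be assigned in a single step with nested ifelse() calls. This is only a minimal sketch, not necessarily the best approach; leftSmall and rightSmall are hypothetical cut-down versions of left and right, used so that the example runs on its own:

```r
# Cut-down versions of the example data, each with an indicator column
leftSmall <- data.frame(name = c("A. Onlyleft", "C. Both"),
                        skill = c("latin", "pig farming"),
                        inLeft = 1, stringsAsFactors = FALSE)
rightSmall <- data.frame(name = c("E. Onlyright", "C. Both"),
                         skill = c("reading", "speling"),
                         inRight = 1, stringsAsFactors = FALSE)

# Full join; unmatched rows get NA in the other dataset's indicator
mergedSmall <- merge(leftSmall, rightSmall, by = "name", all = TRUE)

# Label each row according to which indicators are present
mergedSmall$dataSource <- ifelse(
  !is.na(mergedSmall$inLeft) & !is.na(mergedSmall$inRight),
  "Present in both left and right",
  ifelse(!is.na(mergedSmall$inLeft),
         "Present in left only",
         "Present in right only")
)
mergedSmall[, c("name", "dataSource")]
```

Because the conditions test for missing values directly, no row is left with an unassigned label.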
General R use and processes
Run R in web browser without signing in
This should work on your phone, too!
Check R version and packages
See which version of R you are using, which packages are loaded, and other information about the current session of R:
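One function that reports this kind of information is sessionInfo(), which ships with R. The sketch below assumes that this is the information you are after; the object name info is just an illustration:

```r
# R version, platform, and attached/loaded packages for this session
info <- sessionInfo()
info

# Just the R version, as a single string
R.version.string
```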
Remove an object from the environment
Remove df from the environment in R:
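A minimal sketch using rm(), assuming df is an object you created earlier (here a throwaway data frame is created first so that the example runs on its own):

```r
# Create a throwaway object named df
df <- data.frame(x = 1:3)

# Remove it from the environment
rm(df)

# Check whether an object named df still exists in the current environment
stillThere <- exists("df", inherits = FALSE)
stillThere
```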
Remove all objects from the environment
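A common way to do this combines rm() with ls(). The sketch below uses two hypothetical objects, tempA and tempB, to show the effect. Be careful: this removes every object in the environment and cannot be undone.

```r
# Two throwaway objects
tempA <- 1
tempB <- "two"

# ls() lists every object in the environment; rm() then removes them all
rm(list = ls())

# Confirm which objects are left
ls()
```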
Fix variable names automatically
With a data set called d:
names(d) <- make.names(names(d))
See the changes:
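As a small illustration of what make.names() does, the sketch below uses a hypothetical data frame called dExample with two awkward variable names and compares the names before and after:

```r
# A small data frame with names that are awkward to use in R code
dExample <- data.frame(a = 1:2, b = 3:4)
names(dExample) <- c("my var", "2nd.var")

before <- names(dExample)
names(dExample) <- make.names(names(dExample))
after <- names(dExample)

# Compare old and new names side by side
data.frame(before, after)
```

Spaces are replaced with periods, and a name that starts with a digit gets an X added in front.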
More information:
List all variables in a dataset
The following code outputs a list of all variables in the dataset mtcars, in quotation marks and separated by commas:
for(n in names(mtcars)){cat('"',n,'",', sep='')}
## "mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb",
- In the code above, replace mtcars with the name of your own dataset.
- There will be an extra (likely unwanted) comma at the end of the list.
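If the trailing comma is a nuisance, one possible alternative is to build the whole string with paste0() and its collapse argument. This is a sketch; varList is a made-up object name:

```r
# Wrap each variable name in quotation marks, then join with commas
varList <- paste0('"', names(mtcars), '"', collapse = ",")

# cat() prints the string without escaping the quotation marks
cat(varList)
```

Since collapse only puts commas between items, there is nothing to delete at the end.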
This code can give you a list of your variables to manually modify and then paste into a regression formula:
for(n in names(mtcars)){cat(' + ',n, sep='')}
## + mpg + cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
This does the same thing:
(VarListString <- paste(names(mtcars), collapse="+"))
## [1] "mpg+cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb"
Without quotation marks:
cat(paste(names(mtcars), collapse="+"))
## mpg+cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb
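Instead of pasting the string into a formula by hand, it may be possible to convert it directly with as.formula(). The sketch below assumes that mpg is the dependent variable and that all other variables in mtcars are independent variables; predictors, f, and fit are illustrative names:

```r
# All variable names except the dependent variable
predictors <- setdiff(names(mtcars), "mpg")

# Build the formula from a string, then convert it
f <- as.formula(paste("mpg ~", paste(predictors, collapse = " + ")))
f

# Use the formula in a regression
fit <- lm(f, data = mtcars)
```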
List all objects in R environment
ls()
## [1] "age" "allthethingswithoutspaces"
## [3] "allthethingswithspaces" "anotherthing"
## [5] "bw" "c1"
## [7] "c2" "cc"
## [9] "d" "d.original"
## [11] "d1" "d2"
## [13] "demographics" "descriptivetable"
## [15] "descriptivetable.abbreviated" "descriptivetable.transpose"
## [17] "df" "dfcopy"
## [19] "dfixed" "dNumericOnly"
## [21] "dpartial" "Engine"
## [23] "entireGroup" "eqn"
## [25] "evenmorethings" "ExampleTimeOnly"
## [27] "ExampleTimestamp" "first"
## [29] "flavor" "flavor_counts"
## [31] "flavor_percentages" "flavors"
## [33] "four" "fulljoined"
## [35] "GroupMeanData" "hgA"
## [37] "hgB" "l"
## [39] "left" "leftjoined"
## [41] "likertFix" "med_df"
## [43] "merged" "merged2"
## [45] "mtcars.copy1" "mtcars.partial"
## [47] "mtcars.temp" "mtcarscopy"
## [49] "n" "name"
## [51] "one" "pca1"
## [53] "post" "pre"
## [55] "reg" "right"
## [57] "rightjoined" "s"
## [59] "s1" "s2"
## [61] "searchStrings" "second"
## [63] "signedUpList" "something"
## [65] "somethingelse" "summary_df"
## [67] "t" "test"
## [69] "three" "total_respondents"
## [71] "TotalSeconds" "Transmission"
## [73] "two" "unique_flavors"
## [75] "VarListString" "x"
## [77] "zipcodes"
If you see character(0) as the output, that means the environment is empty.
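The sketch below shows both situations with ls(), using a separate, hypothetical environment called e so that your own objects are not touched. It also shows the pattern argument, which limits the list to matching names:

```r
# A fresh, empty environment
e <- new.env()
emptyListing <- ls(e)
emptyListing

# Add a few objects to it
assign("height", 1, envir = e)
assign("heavy", 2, envir = e)
assign("age", 3, envir = e)

# List everything, then only names beginning with "he"
ls(e)
ls(e, pattern = "^he")
```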
List all files in current working directory or a selected directory
Working directory:
A selected directory:
list.files("C:/Users/MyUserName/Path/To/Desired/Folder")
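A short sketch of both cases follows. Calling list.files() with no arguments lists the working directory. The folder demoDir and the files in it are made up for illustration and are created in a temporary location:

```r
# Files in the current working directory
getwd()
list.files()

# Create a demonstration folder with two empty files
demoDir <- file.path(tempdir(), "listFilesDemo")
dir.create(demoDir, showWarnings = FALSE)
file.create(file.path(demoDir, c("a.csv", "b.txt")))

# List everything in the selected folder, then only the .csv files
list.files(demoDir)
csvFiles <- list.files(demoDir, pattern = "\\.csv$")
csvFiles
```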
Assigning new values or making copies
Here’s how you make a copy of anything in R:
thing1 <- thing2
The code above is doing the following:
- Create a new object called thing1.
- The <- operator assigns whatever is on the right to whatever is on the left.
- Assign thing1 to be the value of thing2. Note that thing2 still exists too.
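A quick way to convince yourself that the copy is independent of the original is to change the copy and then look at both. In this sketch, thing2 is given some example values first:

```r
# The original object
thing2 <- c(1, 2, 3)

# Make a copy
thing1 <- thing2

# Change the copy only
thing1[1] <- 100

thing1
thing2   # the original is unchanged
```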
You can also make a copy of a dataset like this:
library(car)
d <- GSSvocab
This makes a copy of the dataset GSSvocab called d. Then you can just use this dataset without having to type GSSvocab each time.
Run code from one R file in another R file
It is possible to load functions and run code from one file while working in another file.
If…
- Your first file is called one.R
- Your second file is called two.R
Then you can run the command
source('one.R')
within the file called two.R and all of the code in the file one.R will be run as if it were run from within the file two.R.
Make sure that the file one.R is in your working directory.
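To try this without creating files by hand, the sketch below writes a small one.R into a temporary folder and then sources it using its full path. The function addTwo and the object greeting are made up for illustration:

```r
# Write a tiny one.R file into a temporary folder
onePath <- file.path(tempdir(), "one.R")
writeLines(c("addTwo <- function(x) x + 2",
             "greeting <- 'hello from one.R'"),
           onePath)

# Run all of the code in one.R
source(onePath)

# Objects defined in one.R are now available
addTwo(3)
greeting
```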
More information:
Count how long a process in R takes
To measure how long R takes to run something, put the line
(start <- Sys.time())
before the code you want to time, and the lines
(end <- Sys.time())
(duration <- end-start)
after it.
Let’s say you want to load a data file into R using the code d <- read.csv("SomeData.csv") and you want to measure how long it takes to load the file. You would write this code:
(start <- Sys.time())
d <- read.csv("SomeData.csv")
(end <- Sys.time())
(duration <- end-start)
Above, the stored object duration contains the amount of time it took to load the data file.
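Two possible variations are sketched below. The function system.time() times a block of code in one step, and difftime() lets you fix the units of the result. The calculation being timed is just a stand-in for your own code:

```r
# Option 1: wrap the code in system.time()
timing <- system.time({
  x <- sum(sqrt(1:1e6))   # stand-in for the code you want to time
})
timing["elapsed"]         # wall-clock seconds

# Option 2: Sys.time() before and after, with explicit units
start <- Sys.time()
x <- sum(sqrt(1:1e6))
end <- Sys.time()
duration <- difftime(end, start, units = "secs")
duration
```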
Save R objects to a folder on the computer
Let’s say you have an R object called MyRobject, which could be a dataframe, saved regression model, table, or anything else.
You can save that R object to your computer like this (it will go to the working directory or directory where your code file is):
saveRDS(MyRobject, file = "ChooseAname.rds")
With the code above, MyRobject will be saved as a file on your computer with the name ChooseAname.rds. You can of course change it to a name other than ChooseAname.rds.
Later, when you want to open the saved object again in RStudio, you can run this code:
MyRobjectFromBefore <- readRDS("ChooseAname.rds")
Above, we load the object saved in ChooseAname.rds into R, where it will have the name MyRobjectFromBefore. Of course you can choose to call it anything other than MyRobjectFromBefore.
Here is the code above once again, for easy copying and pasting:
saveRDS(obj, file = "obj.rds")
obj <- readRDS("obj.rds")
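To check that a saved object comes back unchanged, you can compare the reloaded version to the original. This sketch saves mtcars to a temporary file, so the file name is arbitrary:

```r
# Save to a temporary file rather than the working directory
rdsPath <- tempfile(fileext = ".rds")
saveRDS(mtcars, file = rdsPath)

# Load it back and compare with the original
mtcarsFromBefore <- readRDS(rdsPath)
identical(mtcarsFromBefore, mtcars)
```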
Get the name of an object or dataframe as a string
Below, the name of the dataset mtcars will be printed out as a string:
deparse(substitute(mtcars))
## [1] "mtcars"
We could also save the string:
theSavedString <- deparse(substitute(mtcars))
theSavedString
## [1] "mtcars"
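This technique is probably most useful inside a function, where it recovers the name of whatever object was passed in. The function nameOf below is a hypothetical helper written for this sketch:

```r
# Return the name of the object supplied as the argument
nameOf <- function(x) {
  deparse(substitute(x))
}

nameOf(mtcars)
nameOf(iris)
```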
Get the R code for a saved object
Let’s say we have a stored object and we want R code that will recreate it. We can use dput():
dput(mtcars$mpg)
## c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
## 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
## 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7, 15, 21.4)
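The code that dput() produces can also be written to a file and read back later with dget(). A minimal sketch, using a temporary file and the mpg variable from mtcars as the example object:

```r
# Write the R code for the object to a file
codeFile <- tempfile(fileext = ".R")
dput(mtcars$mpg, file = codeFile)

# Look at the generated code
readLines(codeFile)

# Rebuild the object from that code and compare with the original
rebuilt <- dget(codeFile)
all.equal(rebuilt, mtcars$mpg)
```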
Generating R code using R
Parsing and running code from saved strings
Below, the parse(...) function converts a string into R code, and the eval(...) function then tries to run it.
target <- "myvector"
values <- "c(1,2,3,4)"
eval(parse(text=paste(target, " <- ", values, sep = "")))
myvector
## [1] 1 2 3 4
The code above made the computer run the following:
myvector <- c(1,2,3,4)
- target and values are saved strings of characters.
- Replace target, " <- ", values with anything you want and R will run it as if it is R code.
- More than just target, <-, and values can be added to the comma-separated list within the paste(...) function. The list can be as long as you need.
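It can help to look at the pasted string before running it. The sketch below stores the string in a made-up object called codeString, prints it, and then runs it. For the simple case of creating an object whose name is stored in a string, assign() is a possible alternative that avoids parsing code:

```r
target <- "myvector"
values <- "c(1,2,3,4)"

# Build the code as a string and inspect it
codeString <- paste(target, " <- ", values, sep = "")
codeString

# Run the string as R code
eval(parse(text = codeString))
myvector

# Alternative: create an object from a name stored in a string
assign("myvector2", c(1, 2, 3, 4))
myvector2
```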
Check if one string is contained in another string
library(stringr)
##
## Attaching package: 'stringr'
## The following objects are masked from 'package:expss':
##
## fixed, regex
str_detect("this is the text that the computer will search within","this is the string the computer is trying to find within the other text")
## [1] FALSE
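If you would rather not load a package, base R offers grepl(), which can do a similar check. Note that the argument order differs from str_detect(): the text to find comes first. A sketch with made-up strings:

```r
haystack <- "this is the text that the computer will search within"

# fixed = TRUE searches for the exact characters, not a regular expression
found <- grepl("the computer", haystack, fixed = TRUE)
notFound <- grepl("something else entirely", haystack, fixed = TRUE)

found
notFound
```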
Replace/substitute text in strings
gsub("text to be replaced", "replacement text", "string of text in which there is some text to be replaced")
new <- "blue"
old <- "red"
oldtext <- "On Mondays, my favorite color is red"
newtext <- gsub(old, new, oldtext)
newtext
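The gsub() function takes the text to find first, then the replacement, then the string to search. It replaces every match, while the related function sub() replaces only the first one. A short sketch with made-up strings:

```r
sentence <- "On Mondays, my favorite color is red"

# Replace "red" with "blue"
blueSentence <- gsub("red", "blue", sentence)
blueSentence

# sub() changes only the first match; gsub() changes all of them
firstOnly <- sub("a", "A", "banana")
allMatches <- gsub("a", "A", "banana")
firstOnly
allMatches
```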
Read more (including a more complex and useful example):
Using packages
Install and load a package
if (!require(packagename)) install.packages('packagename')
library(packagename)
- Replace packagename with the name of the package you want to use.
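If you do this often, the same idea can be wrapped in a small function that takes the package name as a string. The function loadOrInstall is a hypothetical helper, not part of R, and the example uses the stats package only because it is always installed:

```r
# Install a package only if it is missing, then load it
loadOrInstall <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept a name stored in a string
  library(pkg, character.only = TRUE)
}

loadOrInstall("stats")
```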
Fixing a package
If a package is not loading, you can try this:
install.packages("packagename", dependencies = TRUE)
- Replace packagename with the name of the package you want to use.
Cite a package
citation(package = "packagename")
- Replace packagename with the name of the package for which you seek the citation.
Remove a loaded package
detach("package:NAMEOFPACKAGE", unload = TRUE)
- Replace NAMEOFPACKAGE with the name of the already-loaded package that you want to remove.
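You can confirm that a package is no longer attached by checking the output of search(). The sketch below uses the splines package as an example, on the assumption that it came with your R installation:

```r
# Attach a package, then confirm it is on the search path
library(splines)
attachedBefore <- "package:splines" %in% search()

# Remove it and check again
detach("package:splines", unload = TRUE)
attachedAfter <- "package:splines" %in% search()

attachedBefore
attachedAfter
```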
Update all installed packages
update.packages(ask = FALSE, checkBuilt = TRUE)
For loop
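# Print the square of each number from 1 to k
k <- 5
for(i in 1:k) {
  cat("\n\n=== Repetition",i,"===\n\n")
  cat(i^2)
}
# Keep a running count alongside the loop index
# (named counter so that it does not hide the c() function)
counter <- 0
for (i in 1:5){
  counter <- counter+1
  print(paste(counter,"hello",i))
}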
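Often you will want to keep the results of a loop rather than only print them. One common pattern, sketched here with a made-up vector called squares, is to create an empty vector first and fill it in one position at a time:

```r
k <- 5

# Create a vector of k zeros to hold the results
squares <- numeric(k)

# seq_len(k) gives 1, 2, ..., k
for (i in seq_len(k)) {
  squares[i] <- i^2
}

squares
```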
Comment out portions of a single R command
In the examples below, you can add a # in front of any line to rapidly remove a variable from the code. You can then delete the # to once again include that variable.
Adding variables together example
Initial code:
Easily comment out the mpg variable:
Regression formula example
Initial code:
Easily remove the hp and drat variables:
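The following sketch illustrates the idea with mtcars. The particular variables, and the object names d and fit, are assumptions chosen for illustration. Each variable sits on its own line, so a single # removes it from the command:

```r
d <- mtcars

# Adding variables together, with mpg commented out
d$VarSum <- d$cyl +
  # d$mpg +
  d$hp +
  d$wt

# Regression formula, with hp and drat commented out
fit <- lm(mpg ~
    cyl
  # + hp
  # + drat
  + wt,
  data = mtcars)

names(coef(fit))
```

Delete a # to bring that variable back into the command.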