A Appendix
A.1 Git
This appendix provides a concise reference to essential Git concepts and commands, tailored for data analysts and researchers managing code and collaboration. For extended learning, explore the following resources:
- Git Cheat Sheet (PDF)
- Git Cheat Sheets in Other Languages
- Interactive Git Tutorial
- Visual Git Cheat Sheet
- Happy Git with R (for R Users)
A.1.1 Basic Setup
Configure your Git environment using the git config
command:
-
Set your name and email (used in commits):
-
Set your preferred text editor (e.g., for writing commit messages):
A.1.2 Creating a Repository
To create a new Git repository in your project directory:
This creates a .git
directory where Git stores all version control information.
A.1.3 Tracking Changes
Git tracks changes through a three-tier structure:
- Working Directory: your local folder with files.
- Staging Area: where you prepare changes before committing.
- Local Repository: stores committed snapshots of your code.
Common commands:
-
Check status:
-
Add files to the staging area:
-
Commit staged changes:
A.1.4 Viewing History and Changes
-
Show changes not yet staged:
-
Show committed changes:
-
Restore previous versions of files:
A.1.5 Ignoring Files
To prevent certain files from being tracked by Git, create a .gitignore
file. For example:
-
View contents using:
A.1.6 Remote Repositories
Git supports linking local and remote repositories (e.g., GitHub):
-
Add a remote:
-
Push changes to remote:
-
Pull changes from remote:
A.1.7 Collaboration
-
Clone a remote repository:
This creates a local copy and sets up a remote named
origin
.
A.1.8 Branching and Merging
-
Create and switch to a new branch:
-
Switch back to main branch:
-
Merge another branch into the current one:
A.1.9 Handling Conflicts
Merge conflicts occur when multiple changes affect the same lines of a file. Git will:
- Mark the conflict in the file.
- Require manual resolution before committing.
Always review and test code after resolving conflicts.
A.1.10 Licensing
Understanding software licensing is essential in open-source collaboration:
- GPL (General Public License): Requires derivative software to also be GPL-licensed.
- Creative Commons: Offers flexible combinations of attribution, sharing, and commercial use restrictions.
Choose licenses aligned with your intended use and contributions.
A.2 Short-cut
These are shortcuts that you probably you remember when working with R. Even though it might take a bit of time to learn and use them as your second nature, but they will save you a lot of time.
Just like learning another language, the more you speak and practice it, the more comfortable you are speaking it.
function | short-cut |
---|---|
navigate folders in console | " " + tab |
pull up short-cut cheat sheet | ctrl + shift + k |
go to file/function (everything in your project) | ctrl + . |
search everything | cmd + shift + f |
navigate between tabs | Crtl + shift + . |
type function faster | snip + shift + tab |
type faster | use tab for fuzzy match |
cmd + up |
|
ctrl + . |
Sometimes you can’t stage a folder because it’s too large. In such case, use Terminal
pane in Rstudio then type git add -A
to stage all changes then commit and push like usual.
A.3 Function short-cut
apply one function to your data to create a new variable: mutate(mod=map(data,function))
instead of using i in 1:length(object)
: for (i in seq_along(object))
apply multiple function: map_dbl
apply multiple function to multiple variables:map2
autoplot(data)
plot times series datamod_tidy = linear(reg) %>% set_engine('lm') %>% fit(price ~ ., data=data)
fit lm model. It could also fit other models (stan, spark, glmnet, keras)
- Sometimes, data-masking will not be able to recognize whether you’re calling from environment or data variables. To bypass this, we use
.data$variable
or.env$variable
. For exampledata %>% mutate(x=.env$variable/.data$variable
- Problems with data-masking:
- Unexpected masking by data-var: Use
.data
and.env
to disambiguate
- Data-var cant get through:
- Tunnel data-var with {{}} + Subset
.data
with [[]]
- Unexpected masking by data-var: Use
- Passing Data-variables through arguments
library("dplyr")
mean_by <- function(data,by,var){
data %>%
group_by({{{by}}}) %>%
summarise("{{var}}":=mean({{var}})) # new name for each var will be created by tunnel data-var inside strings
}
mean_by <- function(data,by,var){
data %>%
group_by({{{by}}}) %>%
summarise("{var}":=mean({{var}})) # use single {} to glue the string, but hard to reuse code in functions
}
- Trouble with selection:
library("purrr")
name <- c("mass","height")
starwars %>% select(name) # Data-var. Here you are referring to variable named "name"
starwars %>% select(all_of((name))) # use all_of() to disambiguate when
averages <- function(data,vars){ # take character vectors with all_of()
data %>%
select(all_of(vars)) %>%
map_dbl(mean,na.rm=TRUE)
}
x = c("Sepal.Length","Petal.Length")
iris %>% averages(x)
# Another way
averages <- function(data,vars){ # Tunnel selectiosn with {{}}
data %>%
select({{vars}}) %>%
map_dbl(mean,na.rm=TRUE)
}
x = c("Sepal.Length","Petal.Length")
iris %>% averages(x)
A.4 Citation
To cite the R packages used during this session, the following code prints BibTeX-formatted citations:
# List all non-base packages loaded in the session
packages <- ls(sessionInfo()$loadedOnly)
# Print BibTeX citations for each package
for (pkg in packages) {
print(toBibtex(citation(pkg)))
}
You may wish to redirect this output to a .bib
file for integration with LaTeX or R Markdown documents using writeLines()
.
A.5 Install All Necessary Packages on Your Local Machine
To replicate the environment used in this book or session on another machine, you can follow these steps.
A.5.1 Step 1: Export Installed Packages from Your Current Session
# Get all installed packages
installed <- as.data.frame(installed.packages())
# Preview the installed packages
head(installed)
#> Package LibPath Version Priority
#> abind abind C:/Program Files/R/R-4.4.3/library 1.4-8 <NA>
#> ade4 ade4 C:/Program Files/R/R-4.4.3/library 1.7-23 <NA>
#> ADGofTest ADGofTest C:/Program Files/R/R-4.4.3/library 0.3 <NA>
#> admisc admisc C:/Program Files/R/R-4.4.3/library 0.38 <NA>
#> AER AER C:/Program Files/R/R-4.4.3/library 1.2-14 <NA>
#> afex afex C:/Program Files/R/R-4.4.3/library 1.4-1 <NA>
#> Depends
#> abind R (>= 1.5.0)
#> ade4 R (>= 3.5.0)
#> ADGofTest <NA>
#> admisc R (>= 3.5.0)
#> AER R (>= 3.0.0), car (>= 2.0-19), lmtest, sandwich (>= 2.4-0),\nsurvival (>= 2.37-5), zoo
#> afex R (>= 3.5.0), lme4 (>= 1.1-8)
#> Imports
#> abind methods, utils
#> ade4 graphics, grDevices, methods, stats, utils, MASS, pixmap, sp,\nRcpp
#> ADGofTest <NA>
#> admisc methods
#> AER stats, Formula (>= 0.2-0)
#> afex pbkrtest (>= 0.4-1), lmerTest (>= 3.0-0), car, reshape2,\nstats, methods, utils
#> LinkingTo
#> abind <NA>
#> ade4 Rcpp, RcppArmadillo
#> ADGofTest <NA>
#> admisc <NA>
#> AER <NA>
#> afex <NA>
#> Suggests
#> abind <NA>
#> ade4 ade4TkGUI, adegraphics, adephylo, adespatial, ape, CircStats,\ndeldir, lattice, spdep, splancs, waveslim, progress, foreach,\nparallel, doParallel, iterators, knitr, rmarkdown
#> ADGofTest <NA>
#> admisc QCA (>= 3.7)
#> AER boot, dynlm, effects, fGarch, forecast, foreign, ineq,\nKernSmooth, lattice, longmemo, MASS, mlogit, nlme, nnet, np,\nplm, pscl, quantreg, rgl, ROCR, rugarch, sampleSelection,\nscatterplot3d, strucchange, systemfit (>= 1.1-20), truncreg,\ntseries, urca, vars
#> afex emmeans (>= 1.4), coin, xtable, parallel, plyr, optimx,\nnloptr, knitr, rmarkdown, R.rsp, lattice, latticeExtra,\nmultcomp, testthat, mlmRev, dplyr, tidyr, dfoptim, Matrix,\npsychTools, ggplot2, MEMSS, effects, carData, ggbeeswarm, nlme,\ncowplot, jtools, ggpubr, ggpol, MASS, glmmTMB, brms, rstanarm,\nstatmod, performance (>= 0.7.2), see (>= 0.6.4), ez,\nggResidpanel, grid, vdiffr
#> Enhances License License_is_FOSS License_restricts_use
#> abind <NA> MIT + file LICENSE <NA> <NA>
#> ade4 <NA> GPL (>= 2) <NA> <NA>
#> ADGofTest <NA> GPL <NA> <NA>
#> admisc <NA> GPL (>= 3) <NA> <NA>
#> AER <NA> GPL-2 | GPL-3 <NA> <NA>
#> afex <NA> GPL (>= 2) <NA> <NA>
#> OS_type MD5sum NeedsCompilation Built
#> abind <NA> <NA> no 4.4.1
#> ade4 <NA> <NA> yes 4.4.3
#> ADGofTest <NA> <NA> <NA> 4.4.0
#> admisc <NA> <NA> yes 4.4.3
#> AER <NA> <NA> no 4.4.3
#> afex <NA> <NA> no 4.4.3
# Export the list to a CSV file
write.csv(installed$Package, file = file.path(getwd(), "installed.csv"), row.names = FALSE)
A.5.2 Step 2: Install Packages on a New Machine
Once you have transferred the installed.csv
file to the new machine, run the following code to install any missing packages.
# Read the list of required packages
required <- read.csv("installed.csv", stringsAsFactors = FALSE)$Package
# Get the list of already installed packages on the current machine
current <- installed.packages()[, "Package"]
# Identify packages that are not yet installed
missing <- setdiff(required, current)
# Install the missing packages
install.packages(missing)
⚠️ Note: This approach assumes that all packages are available from CRAN. For packages from GitHub or Bioconductor, use
devtools::install_github()
orBiocManager::install()
as appropriate.
This approach ensures a reproducible computational environment, which is essential for robust data analysis and collaboration.