Chapter 7 Creating your first script

A script can be broken down into a few layers which we are going to go through in detail, but here’s a general overview.

  1. Load/install required packages
  2. Load your data
  3. Perform data wrangling tasks
  4. Then what?
    1. create figures
    2. run statistics
    3. create tables
  5. Save outputs

Figures are saved in your /images directory, whereas the statistics and tables are saved in /data for future use. Each section is coded so that I can fold or unfold it, which lets me focus only on the section of code that matters at a given point in time.

7.1 Headers

Before we get started, you’ll see “Headers” throughout the script, which are used to keep things organized. I use the following code within an .R file (not an .Rmd):

# Header 1 ----------
# ▐ Header 2 --------
# ▐ ▬ Header 3 --------
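Putting the overview and the headers together, a skeleton script might look like the sketch below. It uses the built-in mtcars dataset and a temporary folder in place of your own raw/ and data/ directories (both are stand-ins for illustration). In RStudio, a comment line ending in four or more dashes becomes a foldable section.

```r
# Load/Install required packages ---------------------
# (base R only in this sketch)

# Load your dataset ---------------------
df <- mtcars                               # stand-in for your own data

# Data wrangling ---------------------
df$cyl <- factor(df$cyl)                   # treat cylinders as a grouping factor
df <- df[!is.na(df$mpg), ]                 # drop rows with a missing outcome

# Statistics ---------------------
# ▐ Linear model --------
model  <- lm(mpg ~ wt + cyl, data = df)
modsum <- summary(model)

# Save outputs ---------------------
out_dir <- file.path(tempdir(), "data")    # hypothetical stand-in for data/
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
save(model, modsum, file = file.path(out_dir, "analyzedData.RData"))
```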

7.2 Part 1: Loading/installing packages needed

At the very top of my script I load the packages I need to get things done. This will vary slightly depending on the script. For example, some of my packages (e.g., lme4 or easystats) are only loaded when I run statistics. There are a few examples below; I suggest skipping to the section that applies to your situation.

# Load/Install required packages ---------------------
if (!require("pacman")) install.packages("pacman")
# p_load() is a wrapper for library() and require(): it checks whether each
# package is installed, installs it if not, and then loads it.
pacman::p_load(conflicted, readxl, ggplot2, esquisse, Rmisc, tidyverse, car, easystats, apastats, sjlabelled, rio)

# rio::install_formats() # Run once to install the optional packages rio needs for some formats
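If you are curious what p_load saves you from writing, here is a rough base-R sketch of the same idea: check whether a package is installed, install it if not, then attach it. The helper name load_or_install is hypothetical, and this is an approximation rather than pacman's actual implementation.

```r
# Hypothetical helper: install (if needed) and attach each package in `pkgs`.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)              # only runs when the package is missing
    }
    library(pkg, character.only = TRUE)  # attach it to the search path
  }
  invisible(pkgs)
}

load_or_install(c("stats", "utils"))     # base packages, so nothing is installed
```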

7.3 Part 2: Loading your data

You should have data that you are trying to manipulate. Below I show the most common examples. I almost always call my data df, which lets me copy and paste code between projects. It’s good practice to do this if you can: as your R scripts advance, you won’t need to spend precious time using Find/Replace.

I have recently switched my philosophy to using rio, whose import() function picks the right reader from the file extension. This greatly reduces what students need to learn.

7.3.1 xlsx

# Load your dataset ---------------------
# Option 1: uses the `readxl` package
df <- read_excel("raw/CC_Body_FA.xlsx", sheet = "Sheet1")
# Option 2: uses the `rio` package
df <- import("raw/data.xlsx", which = "Sheet1")

7.3.2 csv

When I have particularly large files to write from MATLAB, I prefer *.csv files over *.xlsx because they write faster. If you are dealing with datasets larger than 1 GB, you should consider using data.table instead of data.frame.

# Load your dataset ---------------------
df <- import("raw/data.csv") # Import using `rio`
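As a sketch of the data.table route, the example below writes a small CSV to a temporary file (a stand-in for a large MATLAB export) and reads it back with fread(), which returns a data.table rather than a plain data.frame. It assumes the data.table package is installed; the column names are made up.

```r
library(data.table)

# Write a small example file; in practice this would be your large export
csv_path <- tempfile(fileext = ".csv")
fwrite(data.frame(id = 1:3, value = c(0.42, 0.38, 0.51)), csv_path)

# fread() detects the separator and column types automatically
df <- fread(csv_path)

class(df)  # a data.table is also a data.frame, so most code still works
```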

7.3.3 Google Sheets

It is also possible to read from a Google Sheet using the googlesheets4 package.

df <- read_sheet("https://docs.google.com/spreadsheets/d/1V99DMca-Qdy3G7kyg9zTONvBVagtnBrj4nm78Fj1vU8", sheet = "Head Measures & Information") # requires 'googlesheets4' library

7.3.4 sav (SPSS)

This is the general data format for SPSS. With this file type, “attributes” are also imported, which I normally like to remove. In my own experience, some functions don’t play nicely with dataframes that have labels. You can read more on this here.

df <- rio::import("data/data.sav") 

A full tutorial on importing other data types can be seen here on DataCamp. In general, try and stick to the formats shown above. If you are importing data from another statistical program (e.g., SPSS, STATA or SAS) you will often get a ton of attributes that are imported in the data.frame. This can be a good thing at times because it may give you additional information on the column variable. However, some statistical functions tend to get fussy when your data.frame contains these attributes. These are shown below.


Figure 7.1: A dataframe with attributes being imported

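# Option 1: remove specific attributes by name (uses `labelled`)
df.auto <- import("http://www.stata-press.com/data/r13/auto.dta") %>% 
  labelled::remove_attributes(c("label", "format.stata"))

# Option 2: remove only the format attributes (uses `haven`)
df.auto <- import("http://www.stata-press.com/data/r13/auto.dta") %>% 
  haven::zap_formats()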
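To see what these attributes are and what stripping them does, here is a small base-R sketch. The data frame and its labels are made up for illustration; a real SPSS or Stata import will typically carry more attributes than this.

```r
# A made-up data frame with a label and a format attached to one column
df <- data.frame(mpg = c(22, 17, 29))
attr(df$mpg, "label") <- "Mileage (mpg)"
attr(df$mpg, "format.stata") <- "%8.0g"

attributes(df$mpg)   # lists the label and format attributes

# Remove the attributes column by column, leaving the values untouched
df[] <- lapply(df, function(col) {
  attr(col, "label") <- NULL
  attr(col, "format.stata") <- NULL
  col
})

attributes(df$mpg)   # NULL: only the plain numeric values remain
```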

Finally, it’s possible you want to open data from other formats, including:

  1. .txt
  2. .
  3. SPSS
  4. Mini-table

7.4 Cleaning your imported data

It’s possible to clean up your dataset as it comes in by using the janitor package. Click the link for a couple of examples. In essence, it scans through the column names and fixes them according to a notation you specify.
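As a minimal sketch, assuming the janitor package is installed, clean_names() can be applied right after the data is loaded. The messy column names here are invented for illustration; the case argument selects the naming notation.

```r
library(janitor)

# Invented example with spaces, capitals and symbols in the column names
df <- data.frame(`Subject ID` = 1:2, `DTI Value (%)` = c(0.4, 0.5),
                 check.names = FALSE)

df <- clean_names(df, case = "snake")   # "snake" is the default notation
names(df)
```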

Now that you have your df loaded, let’s take a look and see what we have. There are four common types of data that you will find in the columns of a data.frame; in R, the type of a column is referred to as its class.

  1. Numeric
  2. Characters
  3. Factors
  4. Dates

You can view the class of each column by running the following code:

sapply(df, class)
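For example, on a small made-up data frame containing one column of each type, the call returns one class per column:

```r
df <- data.frame(
  dti_value = c(0.42, 0.38, 0.51),                                   # numeric
  subject   = c("s01", "s02", "s03"),                                # character
  group     = factor(c("control", "patient", "control")),            # factor
  scan_date = as.Date(c("2021-01-05", "2021-01-12", "2021-02-01")),  # date
  stringsAsFactors = FALSE
)

sapply(df, class)
```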

The class of your columns may not seem important right now, but later on when we manipulate the data, it will be crucial to make sure these are accurate. Below is an example of an xlsx file which is imported. We expected dti_value to be numeric, but due to a dash in one of the cells, it was imported as character.
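A sketch of how that situation might be repaired, using a made-up dti_value column: replace the dash with NA, then convert the column to numeric. Converting without first handling the dash would also give NA for that cell, but with a coercion warning.

```r
# Made-up column: the "-" forces the whole column to be read as character
df <- data.frame(dti_value = c("0.42", "-", "0.51"), stringsAsFactors = FALSE)
class(df$dti_value)                       # "character"

df$dti_value[df$dti_value == "-"] <- NA   # mark the dash as missing
df$dti_value <- as.numeric(df$dti_value)  # now safe to convert
class(df$dti_value)                       # "numeric"
```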

7.5 Part 3: Saving Outputs

Once you are done running your R Scripts you will want to save some outputs (notably statistical models and dataframes) so they can become part of your RMarkdown document (manuscript.Rmd).

We will want to save our results as an *.RData file. You can save outputs in a few different ways. The first uses the default save function:

# Save your environment ------------
    # Save it to .RData -----------
    save(journey_time, modsum, model, file = "data/analyzedData.RData") # Save the named objects that I'll use in the .Rmd file.

    # Save the tables into data/tables.RData using "patterns" ==================
    save(list = ls(pattern = "table"), file = "data/tables.RData") # Save every object whose name contains "table".
    save(list = ls(pattern = "mod"), file = "data/stats.RData") # Save every object whose name contains "mod".
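On the RMarkdown side, the saved objects come back with load(). The sketch below saves two placeholder objects to a temporary file (standing in for data/stats.RData) and restores them into a fresh environment, which makes it easy to see exactly what the file contained.

```r
# Placeholder objects whose names match the "mod" pattern
model  <- lm(mpg ~ wt, data = mtcars)
modsum <- summary(model)

rdata_path <- tempfile(fileext = ".RData")   # stand-in for data/stats.RData
save(list = ls(pattern = "mod"), file = rdata_path)

# In the .Rmd: load into a separate environment and list what arrived
loaded <- new.env()
load(rdata_path, envir = loaded)
ls(loaded)
```

Calling load(rdata_path) without envir puts the objects straight into your current environment, which is usually what you want inside manuscript.Rmd.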

However, save() overwrites the file every time you run it. What if you want to add objects from your environment to an existing RData file? We can use the resave function from cgwtools.

# Save your environment ------------
    # Save the tables into data/tables.RData by listing them individually
    cgwtools::resave(tbl.demo.mios, tbl.demo.acap, file = "data/tables.RData") # Add the listed tables to the existing file.

    # Save the tables into data/tables.RData using "patterns" ==================
    cgwtools::resave(list = ls(pattern = "tbl"), file = "data/tables.RData")
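If you would rather not add a dependency, the same append behaviour can be approximated in base R: load the existing file into a temporary environment, add the new objects, and save everything back. This is a sketch of the idea, not cgwtools' implementation, and the helper name append_rdata is hypothetical.

```r
# Hypothetical helper: add the named objects to an existing .RData file
append_rdata <- function(names, file, envir = parent.frame()) {
  store <- new.env()
  if (file.exists(file)) load(file, envir = store)   # keep what is already saved
  for (nm in names) assign(nm, get(nm, envir = envir), envir = store)
  save(list = ls(store), file = file, envir = store)
}

rdata_path <- tempfile(fileext = ".RData")   # stand-in for data/tables.RData
tbl.demo.mios <- head(mtcars)                # placeholder tables
tbl.demo.acap <- head(iris)

save(tbl.demo.mios, file = rdata_path)       # first write
append_rdata("tbl.demo.acap", rdata_path)    # adds to the file instead of replacing it

check <- new.env()
load(rdata_path, envir = check)
ls(check)                                    # both tables are present
```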

Finally, it’s possible that you need to export the data for a colleague into a more usable file format (because they aren’t cool enough to run their analyses in R yet…). We once again use the rio package to accomplish this.

# Optional - Save df as xlsx --------
export(list(mtcars = mtcars, iris = iris), "multi.xlsx") # a named list writes one sheet per element
export(processed_data, "processed_data.xlsx") # uses the rio package

# Other Options (not recommended) --------
xlsx::write.xlsx(tmp2, "data/interactions.xlsx", sheetName = "Interaction2", append = TRUE) # uses the xlsx package

openxlsx::write.xlsx(daily, "data/daily.xlsx") # uses the openxlsx package; write.xlsx() on its own replaces the file rather than appending a sheet.
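For what it is worth, openxlsx can add a sheet to an existing file, just not through write.xlsx() alone: load the workbook, add a worksheet, and save it back. A minimal sketch, assuming the openxlsx package is installed and using a temporary file in place of data/daily.xlsx:

```r
library(openxlsx)

xlsx_path <- tempfile(fileext = ".xlsx")   # stand-in for data/daily.xlsx
write.xlsx(mtcars, xlsx_path)              # creates the file with one sheet

wb <- loadWorkbook(xlsx_path)              # reopen the existing workbook
addWorksheet(wb, "Interaction2")           # add a second sheet
writeData(wb, sheet = "Interaction2", x = iris)
saveWorkbook(wb, xlsx_path, overwrite = TRUE)

getSheetNames(xlsx_path)                   # now lists both sheets
```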