1 Week1: Introduction to R

1.1 Reading for this lecture

1.2 Two ways to read lecture notes

  • On the shinyapps server
    • You can access the lecture notes on the web through the link provided on the cyber campus.
  • On your local computer
    • You can compile the week1_introduction_to_R.Rmd file in RStudio.
    • I will post instructions for compiling the week1_introduction_to_R.Rmd file on our cyber campus.

1.3 Welcome

  • Hi everyone, welcome to the course :) This is the introduction to R course at Ewha Womans University in 2020.

  • R is a great programming language for statistical analysis and data science. I hope you enjoy R in this course and find many useful applications for your own field.

  • This course is designed for those who don’t have any programming background, especially social science majors.

  • Today, I will just introduce R, RStudio, and some basic terminology.

  • In these lecture notes, this font represents R keywords/packages and this font represents a hyperlink you can click.

1.4 Why R?

  • R is a language designed for statistical analysis.
    • R is very popular, especially in academia. Many researchers first implement their experimental methods in statistics and machine learning using R. So you can use many cutting-edge algorithms by downloading R packages.
  • R provides many tools for publication-quality data visualization.
    • Data visualization helps you understand your data and present your findings to others. In this course, you will learn the ggplot2 package for data visualization.
  • R provides many tools for data wrangling.
    • You need to transform your own data into a useful form for visualization and modeling. This process is often called data wrangling. In this course, you will learn many packages in tidyverse (e.g., dplyr, tidyr, purrr, stringr) for data wrangling.
    • Data wrangling is an important skill these days for effectively handling data from diverse sources (e.g., facebook, twitter, fMRI, eyetracker, EEG). For example, many servers store user data in the JSON format for data exchange, so you need to handle JSON files if you want to use user data for your analysis.
  • R has active and supportive communities.
    • You can find many useful resources from those communities. For example, bookdown is an R package for writing books, and bookdown.org provides useful R books written with the bookdown package for free. On bookdown.org, you will find our textbook for this class, which is R for Data Science.
  • R is free, open-source software.

1.5 R vs Python

  • Python is another great programming language. You can find many interesting debates over R vs Python on the internet. Here is an example.

  • In short, Python is a general-purpose programming language used for a wide variety of applications (e.g., data science, web development, database), whereas R is a programming language focusing on data science.

  • One of the goals of this course is to give you an opportunity to explore R so that you can decide which language is best for your purpose. You may need both languages :)

“Generally, there are a lot of people who talk about R versus Python like it’s a war that either R or Python is going to win. I think that is not helpful because it is not actually a battle. These things exist independently and are both awesome in different ways.”

— Hadley Wickham

1.6 Let’s install R on your local computer

  • Please install base R for your own operating system from https://cloud.r-project.org/

  • You will find lots of videos demonstrating the installation process on YouTube.

1.7 Let’s install RStudio on your local computer

  • Please install the FREE version of RStudio for your own operating system from https://rstudio.com/products/rstudio/download/.

  • RStudio is an integrated development environment (IDE) for R.

    • R will do all the calculations for you.
    • RStudio is just a convenient interface between you and R.
    • So, you can just launch RStudio to use R.
  • You will find lots of videos demonstrating the installation process on YouTube.

1.8 Three ways you can use R in this course

  • You can use RStudio on your local computer.
    • You can launch the RStudio you downloaded above on your Windows or Mac computer.
  • You can use RStudio Cloud for free.
    • RStudio Cloud allows you to use RStudio in a web browser.
    • You can get a free RStudio Cloud account here.
  • You can use the interactive code chunks in these lecture notes.
    • The interactive code chunks in these lecture notes were created using the learnr package.

    • If you see Do It! throughout this semester, it means that I want you to actually do something :) The Do It! below contains the interactive code chunk. You can actually run R commands in this interactive code chunk and see the results instantly.

  • Do It (1-1)
    • Please type the following R command in the interactive code chunk and click the Run Code button on the right side to see what happens: ggplot(data = diamonds, aes(x = carat, y = price, color = color)) + geom_point(). This code uses the ggplot2 package, an R package for data visualization. You will learn the ggplot2 package next week. For now, just type the code in the interactive code chunk.
# type `ggplot(data = diamonds, aes(x = carat, y = price, color = color)) + geom_point()`
# in R, # is used for comments, which R will ignore

1.9 Useful R resources

  • You can find many useful resources (many of them are FREE) for learning R on the internet.

  • Free ebooks

    • bookdown is an R package that helps you write and publish books using RStudio. Please check out the wonderful books on the bookdown website (bookdown.org). You can freely read those books online.

    • On this site, you can find our textbook, R for Data Science (Hadley Wickham).

    • If you are interested in writing and publishing books on the internet like the ones on the bookdown website, please read bookdown: Authoring Books and Technical Documents with R Markdown (Yihui Xie). (This class will not cover this topic)

    • As you may already know, many researchers around the world create R packages and share them with others through the R package system. If you are interested in creating your own R package, please read R Packages (Hadley Wickham). (This class will not cover this topic)

    • If you think you want to learn more advanced R at the end of this course, please read Advanced R (Hadley Wickham). (This class will not cover this topic)

    • If you prefer to read those books in Korean, try translating them with the Chrome web browser: open the ebook in Chrome, right-click on the page, and choose “translate to Korean.”

    • You may have already noticed that the name “Hadley Wickham” appears here and there. Hadley Wickham contributes a lot to the R community as chief scientist at RStudio, creator of the tidyverse packages (ggplot2, dplyr, tidyr, stringr, etc.), and author of many books. You can find out more about Hadley on Hadley’s website.

  • RStudio Cheatsheets

    • One of the strengths of R is its package system. There are more than 16,000 packages that extend the functionality of base R. However, it’s difficult to remember all the details of such a large number of packages.

    • RStudio provides RStudio Cheatsheets which summarize features of some important R packages in one or two pages. The cheatsheets will be nice references when you actually work with R for your own project.

1.10 R Packages

  • R packages are collections of functions and data sets developed by the R community. Currently, the Comprehensive R Archive Network (CRAN) repository contains more than 16,000 packages. A list of R packages is available here. Packages increase the power of R by improving and extending base R.

  • Installing vs Loading

    • Installing means you download the package files from the CRAN repository to your local computer. You only need to install a package once. An R function named install.packages() is used to install packages.
    • Loading means you load the functions and data in the package into your computer’s memory. Loading the functions and data of all packages into memory would be inefficient in terms of memory management, so you only load a package when you need it. In other words, whenever you want to use functions in a package, you need to load the package that contains them. An R function named library() is used to load packages.
    • For example, we will learn the ggplot2 package for data visualization starting in week 2. In order to use functions in ggplot2, you need to install the ggplot2 package by running install.packages("ggplot2") once, and load the package by running library(ggplot2) in every new R session where you need the package.
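
The install-once, load-every-session pattern described above can be sketched as follows. This is only an illustration, not part of the course material: the variable names pkg and is_installed are made up for the example, and the install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Name of the package we want to use (ggplot2 is the example from the text)
pkg <- "ggplot2"

# Step 1 (once per computer): install the package from CRAN.
# Uncomment the next line if the package is not installed yet.
# install.packages(pkg)

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
is_installed <- requireNamespace(pkg, quietly = TRUE)

# Step 2 (every new R session): load the package before using its functions
if (is_installed) {
  library(ggplot2)
  # After loading, the package appears on R's search path
  print("package:ggplot2" %in% search())
} else {
  message("Please run install.packages(\"", pkg, "\") first.")
}
```

If you restart R, the package stays installed on your computer, but you need to run library(ggplot2) again before using its functions.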

1.11 Base R vs Tidyverse

  • Base R
    • Base R refers to the collection of functions and packages that come with a clean install of R.
    • Many packages extend Base R.
    • You will learn Base R later in this class.
    • Here’s a cheatsheet for Base R
  • Tidyverse
    • The tidyverse package is a collection of packages for more efficient data science in R.
    • In the tidyverse package, the dplyr, tidyr, ggplot2, and purrr packages provide many useful functions for efficient data transformation, data tidying, data visualization, and iteration, respectively.
    • Our textbook R for Data Science (Hadley Wickham) is all about tidyverse.
    • A significant portion of this class will be devoted to learning various packages in tidyverse.
    • You can quickly check the features of core tidyverse packages here.

1.12 R script file vs R markdown file

  • Both R script and R markdown files are just plain text files, meaning that you can open and edit them with Windows’ Notepad or any other text editor.

  • In an R script, you can only include R code (with comments), whereas in an R markdown file, you can include both code chunks and your own text. In this class, we will mainly use and practice R markdown.
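
To see that an R markdown file really is just plain text that mixes ordinary sentences with code chunks, here is a small sketch that writes a tiny document to a temporary file and reads it back. The document content (title, sentence, chunk) and the names fence, rmd_lines, and rmd_path are made up for this illustration.

```r
# Build the three-backtick fence that marks the start and end of a code chunk
fence <- strrep("`", 3)

# The lines of a tiny, made-up R markdown document
rmd_lines <- c(
  "---",
  "title: \"My first report\"",
  "output: html_document",
  "---",
  "",
  "This sentence is ordinary text.",
  "",
  paste0(fence, "{r}"),
  "summary(mtcars$mpg)",
  fence
)

# Save the lines as a plain text file with the .Rmd extension
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Reading the file back gives exactly the same text
identical(readLines(rmd_path), rmd_lines)

# Display the file the way any text editor would show it
cat(readLines(rmd_path), sep = "\n")
```

An R script holding the same analysis would contain only the line summary(mtcars$mpg), with any explanation written as # comments.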

1.13 R markdown

  • Along with tidyverse, R markdown is one of my favorite parts of R.

  • R markdown is a file format that can be converted into many different documents such as HTML, PDF, Microsoft Word, and other dynamic documents. That is, it’s a nice example of “one source, multiple uses.”

  • For example, this R markdown text file can be converted into this nice HTML.

  • These lecture notes were also made with R markdown :)

  • Let’s watch a short introductory video for R markdown.

  • You will be amazed how many nice documents can be created from a simple R markdown text file. Please check nice documents from R markdown here.

  • These days, how you present your work is just as important as what you present. If you learn R markdown, you can present your contents in many different wonderful formats.

  • Here’s the R markdown cheatsheet.

  • If you are serious about R markdown, please read R Markdown: The Definitive Guide (Yihui Xie, J. J. Allaire, Garrett Grolemund).

  • In this class, we will cover the basics of R markdown.

  • Let’s explore R markdown. You can open a new R markdown template by going to File > New File > R Markdown...

R Markdown template in RStudio

  • Notice that R markdown contains white and gray areas. The white area is where your text goes, whereas the gray area is where your R code goes.
    • You can create a code chunk by
      • clicking Insert and choosing R on the menu (check the green circle on the above image) or
      • using a keyboard shortcut:
        • Windows: Ctrl + Alt + I
        • Mac: Cmd + Option + I
    • You can run the code in a code chunk by
      • clicking Run on the menu (check the green circle on the above image) or
      • pressing Ctrl + Enter (Cmd + Return on Mac) with your cursor on the line of code you want to run
    • You can find more information on code chunks here
  • Again, the biggest benefit of using R markdown is that you can create multiple output formats (e.g., HTML, PDF, MS Word, Beamer slides, shiny applications, websites).
    • For example, you can easily create HTML, PDF, and MS Word documents from your R markdown document by simply clicking the Knit button and choosing the format you want (check the green circle on the below image).
    • To knit to PDF, you need a TeX typesetting system. One option is TinyTeX, which you can install from R as follows:
# installing tinytex 
install.packages('tinytex')
tinytex::install_tinytex()

R Markdown template in RStudio

  • You can even publish your R markdown document to the web very easily. I will talk about this in another class.

1.14 Let’s get familiar with RStudio

  • You may not be familiar with RStudio menus and buttons when you launch RStudio for the first time. However, you will get familiar with RStudio very soon.

  • Let’s watch a short introductory video for RStudio here

  • Notice that RStudio consists of four panels (or panes)

    • Source editor pane
      • This is where you can edit your code in R script or R markdown files
    • Console pane
      • This is where you can execute your code instantly
    • Environment pane
      • This is where you can see the objects (or variables)
    • Files/Plots/Packages/Help pane
      • This is where you can see files, plots, packages, and help documents.

Four main panes in RStudio IDE

  • You can find more RStudio features here.

1.15 Things that you can do with R and RStudio

  • In Ch1 of R for Data Science, the author illustrates the process in a typical data science project as follows:

    • import: read data into R
    • tidy, transform: get the original data into our desired form
    • visualize: create plots to understand data or to present findings
    • model: detect patterns from data
    • communicate: tell your findings to others

The process in typical data science project

  • Let’s go over this process with a simple example using the built-in mtcars data set.
    • For now, just go over each step even if you don’t understand the functions used in the code. The only goal of this example is to get you familiar with R and RStudio.
  • mtcars is a built-in data set in R. The data come from the 1974 Motor Trend US magazine and include fuel consumption and 10 other features of 32 automobiles. You can type mtcars in your RStudio console to see the data, and type ?mtcars to see the help document for the data.
# typing the name of a data set displays its contents
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
  • Do It (1-2)
    • In the interactive code chunk below, type mtcars and click Run Code to see the data, and type ?mtcars and click Run Code to see the help document for the data.
# type `mtcars` to see the data
# type `?mtcars` to see the help document
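
Printing all 32 rows is fine for a small data set like this one, but it helps to know a few more compact ways to look at data. The sketch below uses only base R functions (dim(), head(), str(), summary()); these functions are not covered in the notes yet, so treat this as a preview.

```r
# Number of rows (cars) and columns (variables)
dim(mtcars)

# Show only the first six rows instead of the whole data set
head(mtcars)

# Show each variable's name and type along with its first few values
str(mtcars)

# Summarize a single column, selected with the $ operator
summary(mtcars$mpg)
```

Here dim(mtcars) reports 32 rows and 11 columns, matching the 32 automobiles and the 11 variables (fuel consumption plus 10 other features) described above.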
  • (transform) If you look at the help document of mtcars, you will see that cyl is the number of cylinders of each car. Suppose that we are only interested in the cars with 4 or 6 cylinders. You can first check the frequency of the number of cylinders and then filter the cars with 4 or 6 cylinders as follows:
# Check the frequency of the number of cylinders
# count() is used to get the distribution of values 
# count() is a function in `dplyr` package
# So in order to use count(), you should install and load the `dplyr` package first. 
# installing dplyr package: install.packages("dplyr")
# loading dplyr package: library(dplyr)
# library(tidyverse) will load core packages in the tidyverse including dplyr
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## √ ggplot2 3.3.2     √ purrr   0.3.4
## √ tibble  3.0.1     √ dplyr   1.0.0
## √ tidyr   1.1.0     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
mtcars %>% count(cyl)
##   cyl  n
## 1   4 11
## 2   6  7
## 3   8 14
  • Do It (1-3)
    • Type mtcars %>% count(cyl)
# filter() subsets (or selects) observations
# filter() is also a function in the `dplyr` package
# %>% is called a `pipe`; can you guess the meaning? 
# function A %>% function B:  
# the pipe delivers the output of function A as the input of function B
# %>% is a feature of the `magrittr` package. We will talk about this later. 
mtcars %>% 
  filter(cyl %in% c(4,6)) %>% 
  count(cyl)
##   cyl  n
## 1   4 11
## 2   6  7
  • Do It (1-4)
    • Type mtcars %>% filter(cyl %in% c(4,6)) %>% count(cyl)
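
To make the meaning of the pipe concrete, the sketch below writes the same filter-then-count steps twice: once with %>% and once as ordinary nested function calls. It assumes the dplyr package is installed; the object names piped and nested are made up for this comparison.

```r
library(dplyr)  # provides filter(), count(), and the %>% pipe

# Piped version: read it top to bottom as
# "take mtcars, then filter the rows, then count by cyl"
piped <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%
  count(cyl)

# Nested version: the same calls written inside out, without the pipe
nested <- count(filter(mtcars, cyl %in% c(4, 6)), cyl)

# Both versions produce the same table
identical(piped, nested)
```

The two versions do the same work. The pipe simply lets you write the steps in the order in which they happen, which becomes easier to read as the number of steps grows.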
  • (visualize) We transformed our data mtcars into our desired form, and now we want to visualize our relationship of interest. Suppose that we are interested in the relationship between mpg (miles per gallon, a fuel-efficiency measure similar to the kilometers per liter used in Korea) and disp (engine displacement, i.e., engine size). We can create a scatter plot of mpg against disp to visually inspect the relationship using the ggplot2 package.
# you will learn the `ggplot2` package next week. 
# for now, it's ok to just know that ggplot2 is a package in tidyverse for data visualization
mtcars %>% 
  filter(cyl %in% c(4,6)) %>% 
  ggplot(aes(x = disp, y = mpg)) + geom_point() 

# you can even fit a line to the data and plot it by adding `geom_smooth()`
mtcars %>% 
  filter(cyl %in% c(4,6)) %>% 
  ggplot(aes(x = disp, y = mpg)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

  • (model) We visualized our relationship of interest, and now we want to model the relationship with a line. The code below fits a simple regression to the filtered data (i.e., only the cars with 4 or 6 cylinders) and reports the regression coefficients.
# lm() is a function in Base R to fit linear models (or simply regression)

mtcars2 <- mtcars %>% 
  filter(cyl %in% c(4,6)) 
  
summary(lm(mpg ~ disp, data = mtcars2))
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7873 -3.0124 -0.8294  1.7969  6.5373 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.05456    2.30665  14.764 9.69e-11 ***
## disp        -0.07439    0.01599  -4.651 0.000266 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.345 on 16 degrees of freedom
## Multiple R-squared:  0.5748,	Adjusted R-squared:  0.5483 
## F-statistic: 21.63 on 1 and 16 DF,  p-value: 0.0002663
  • Do It (1-5)
    • Run the regression above in the interactive code chunk below
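
Once a regression is fitted, the coefficients can be used to make predictions. The sketch below refits the same model using only base R (subset() stands in for the filter() step) and predicts mpg for a hypothetical car with disp = 100. The names cars_46, fit, new_car, pred, and by_hand are made up for this illustration.

```r
# Keep only cars with 4 or 6 cylinders (base R equivalent of the filter() step)
cars_46 <- subset(mtcars, cyl %in% c(4, 6))

# Fit the same simple regression as above
fit <- lm(mpg ~ disp, data = cars_46)

# Extract the intercept and the slope
coef(fit)

# Predicted mpg for a hypothetical car with disp = 100
new_car <- data.frame(disp = 100)
pred <- predict(fit, newdata = new_car)
pred

# The same prediction computed by hand: intercept + slope * disp
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["disp"]] * 100
all.equal(unname(pred), by_hand)
```

With the coefficients reported above, the prediction is roughly 34.05 - 0.074 × 100, or about 26.6 miles per gallon.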

1.16 Homework (this HW will not be graded !!)

  • Do the following:
    1. Create an R markdown template by going to File > New File > R Markdown...
    2. Replicate Do It (1-2), Do It (1-3), Do It (1-4), and Do It (1-5). (You need to create a code chunk. Check section 1.13 R markdown to see how to create a code chunk.)
    3. Create HTML, PDF, and MS Word documents by knitting the R markdown file. (Check section 1.13 R markdown to see how to knit your R markdown file.)
      • Note that you need a TeX system for PDF.
    4. You can add your own text in the white area for your own report.
    5. Next week, I will post a video demonstrating how I did this HW :)

1.17 Comments of Week1 lecture

  • The goal of the week 1 lecture is just to introduce basic concepts and terminology in R and RStudio, and to get you familiar with the RStudio interface. The only way to improve your programming skills is to actually write code. Please play with RStudio as much as you can, and practice the code in the lecture material.

  • Next week, we will talk about our first tidyverse package, ggplot2, for data visualization. Starting next week, I will provide more videos demonstrating how to actually run code in RStudio.

  • To give you enough time to get familiar with R materials and lecture formats, we will not have any quiz until week 3.

  • Please feel free to post any questions about the course materials on the cyber campus.