Chapter 2 Basic and Useful R Coding

Before we start plotting great-looking, professional visuals with RStudio, let me show you a few lines of code that will help you handle your data. For simplicity, I am going to use a couple of R’s built-in data sets. That’s right, R ships with an array of built-in data sets (available in any R session, including RStudio) that you can use to practice your coding skills. If you already know the basics of R, feel free to skip ahead to Chapters 3 and 4, where I show you how to create awesome, clean and elegant plots.

Open RStudio for a new session and create a new script by using the shortcut Ctrl+Shift+N (Cmd+Shift+N on a Mac). You can also create a new script by accessing the option File/New File/R Script from the main menu. Once your session window is open, type the following code to get started. Remember to save your scripts; they are the repository of your lines of code.

TIP: You can use a hashtag (#) to write notes in your scripts. R ignores everything that follows a # on a line, so a line that starts with # is not executed at all.
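Here is a quick illustration of the tip above. The object name x is just a placeholder I picked for this sketch:

```r
# This entire line is a note; R skips it
x <- 5 + 3  # R runs the code on the left and ignores this note
x
```

Notes like these cost nothing at run time, so use them freely to remind yourself what each line is for.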

2.1 Available Built-in Datasets

To run a line of code, use the keyboard shortcut Ctrl+Enter or simply click on the button “Run” located at the top right corner of your script pane.

This is how you can see the data sets that are available in R:

data()

Running the code above should give you a list of built-in data sets that you can play with.
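If the full list feels overwhelming, you can narrow it down to a single package. The sketch below assumes you only want the data sets from the datasets package, which is where mtcars lives; the object name available is my own choice:

```r
# List only the data sets that ship with the "datasets" package
available <- data(package = "datasets")

# The result holds a table; its "Item" column contains the data set names
head(available$results[, "Item"])

# Check whether a particular data set is on the list
"mtcars" %in% available$results[, "Item"]
```

Typing ?mtcars in the console opens the help page that describes where a data set comes from and what each variable means.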

2.2 Looking Into the Data

For our first steps, I am going to use the mtcars data set. Let’s see what’s inside the mtcars data set by using the function head() - I use this function all the time. It will give you the first six records of any data set. In your console, or in a new script file, type:

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Now, if you’d like to see more than six records, you can specify how many by adding the argument n = followed by the number of records. Let’s try displaying the first ten records.

head(mtcars, n = 10)

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Similarly, you can use the function tail() to see the last records of the data set. Again, six is the default in R but you can specify how many you’d like to see. Let’s check the end of this data set, more specifically, the last eight records.

We do this by running this code:

tail(mtcars, n = 8)

##                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9        27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2    26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa     30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L   15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino     19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora    15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E       21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Ok, you now know how to call up a built-in data set in R and display its first and last records using the head() and tail() functions. Later on I will show you how to create data in R and upload files such as .xlsx (Excel workbooks) and .csv (plain-text, comma-separated files that Excel can also open), both common in data analysis.
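Since head() and tail() only show the two ends of a data set, it also helps to know how big the whole thing is. A minimal sketch using three base R functions:

```r
dim(mtcars)   # number of rows, then number of columns
nrow(mtcars)  # number of rows only
ncol(mtcars)  # number of columns only
```

For mtcars, this reports 32 rows and 11 columns.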

2.3 Checking the Structure of the Data

Another important check you might want to run on any data you are working with is the verification of its structure. There are different data types in R, for example, character, numeric (stored as double or integer), date, and the like. It is very important to learn how to check for the type of data you are dealing with and how to manipulate the data for plotting purposes. You will see later on what I mean, when we are learning how to plot data.

For now, I just want to show you a simple line of code to check the structure of the data set. Here it is:

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

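By running the code str(mtcars) you can see what type of data each of the variables actually holds. The first line of the output also tells us the size of the data set: 32 observations (rows) of 11 variables (columns). In this example, the variables are all numeric. There will be other examples throughout the book.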
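To give you a first taste of what manipulating the data type looks like, here is a small sketch. It assumes we want to treat the number of cylinders as a category rather than a quantity; the object name my_cars is hypothetical, and I work on a copy so the built-in data set stays untouched:

```r
# Report the class of every column at once
sapply(mtcars, class)

# Work on a copy of the built-in data set
my_cars <- mtcars

# Convert the number of cylinders from a number into a category (factor)
my_cars$cyl <- factor(my_cars$cyl)

str(my_cars$cyl)
levels(my_cars$cyl)
```

After the conversion, str() reports cyl as a factor with three levels instead of a numeric variable.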

2.4 Column and Row Names

Another useful thing we can do is to check the names of the columns and rows. For that, we can simply run this code for columns:

options(width = 60)
colnames(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"  
##  [9] "am"   "gear" "carb"

And this code for rows:

rownames(mtcars)

##  [1] "Mazda RX4"           "Mazda RX4 Wag"      
##  [3] "Datsun 710"          "Hornet 4 Drive"     
##  [5] "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"          
##  [9] "Merc 230"            "Merc 280"           
## [11] "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"        
## [15] "Cadillac Fleetwood"  "Lincoln Continental"
## [17] "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"     
## [21] "Toyota Corona"       "Dodge Challenger"   
## [23] "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"          
## [27] "Porsche 914-2"       "Lotus Europa"       
## [29] "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

Now we know that the columns in this data set are the variables mpg, cyl, disp, and so on. And we know that the rows are composed of car models: Mazda RX4, Mazda RX4 Wag, and so on.

As a good practice, always run these lines of code and get to know your data frame or data set before you start working with it, especially if you do not know how the data was built or collected.
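Knowing the row and column names pays off right away, because you can use them to pull specific values out of the data set. A minimal sketch using square brackets, with the row name first and the column name second:

```r
# Pick one cell by row name and column name
mtcars["Valiant", "mpg"]

# Pick a few cars and a few variables
mtcars[c("Fiat 128", "Honda Civic"), c("mpg", "hp")]

# Check whether a column exists before using it
"wt" %in% colnames(mtcars)
```

The first line returns 18.1, the same mpg value shown for the Valiant in the head(mtcars) output earlier in this chapter.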

2.5 Basic Statistics

Let’s talk about basic statistics, something very important for any Continuous Improvement professional. R can give us quick summary statistics on any data set, or specific variables.

To “slice” the data set, and select one variable at a time, we can use the dollar sign after the data set’s name inside the function summary.

For example, let’s look at the summary statistics for the variable mpg in the mtcars data set. Here’s the code:

summary(mtcars$mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

Let’s do it again, this time for the variable disp:

summary(mtcars$disp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.1   120.8   196.3   230.7   326.0   472.0

Super easy. We can quickly see here the 5-number summary (minimum, first quartile, median, third quartile, maximum) plus the mean. This is a very important set of statistics you should always learn about your data.
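The summary function is not limited to one variable at a time. As a sketch, here is how you could summarize a couple of variables side by side, or the entire data set in one go:

```r
# Summarize two variables side by side
summary(mtcars[, c("mpg", "disp")])

# Summarize every variable in the data set
summary(mtcars)
```

Each column of the output holds the same six statistics we saw above for the corresponding variable.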

2.6 Single-value Statistics

Sometimes all you need is a quick check on a specific statistic. Instead of running the entire summary statistics from the previous section, we can call a function dedicated to that one statistic. Let’s have a look at some of the specific statistics for the mtcars data set.

2.6.1 Minimum Value

We are going to use the function min and then specify within the data set what exactly we want R to apply the function to. Once again, let’s use the dollar sign to “slice” the data. First, notice that we call the data set (mtcars in this case), then we use $ to bring in one specific vector mpg.

min(mtcars$mpg)

## [1] 10.4

The result is 10.4, the minimum value found in the vector mpg from the mtcars data set. Similarly, we can run the following codes for other single-value statistics such as maximum value, mean, median, and standard deviation.

2.6.2 Maximum Value

max(mtcars$mpg)

## [1] 33.9

2.6.3 Mean (Average)

mean(mtcars$mpg)

## [1] 20.09062

2.6.4 Median

median(mtcars$mpg)

## [1] 19.2

2.6.5 Standard Deviation

sd(mtcars$mpg)

## [1] 6.026948
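One practical note that applies to the single-value functions above: if your data contains missing values (NA), these functions return NA unless you tell them to skip the gaps with na.rm = TRUE. The sketch below fakes a missing value by appending an NA to the mpg values; the object name mpg_with_gap is hypothetical:

```r
# Append a missing value to a copy of the mpg values
mpg_with_gap <- c(mtcars$mpg, NA)

mean(mpg_with_gap)                # returns NA because of the missing value
mean(mpg_with_gap, na.rm = TRUE)  # skips the missing value
sd(mpg_with_gap, na.rm = TRUE)    # same idea for the standard deviation
```

The same na.rm = TRUE argument works with min, max, and median.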

2.6.6 First and Third Quartiles

We can also subset the result of the summary function to display only the values that we want to see. Out of the six statistics summary normally returns (see the previous section), the code below keeps only the second and the fifth, which are the first and third quartiles of the data set.

summary(mtcars$mpg)[c(2,5)]

## 1st Qu. 3rd Qu. 
##  15.425  22.800
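As an alternative sketch, base R also has a quantile function that lets you ask for the quartiles directly, and an IQR function for the distance between them:

```r
# First and third quartiles requested directly (25% and 75%)
quantile(mtcars$mpg, probs = c(0.25, 0.75))

# Interquartile range: third quartile minus first quartile
IQR(mtcars$mpg)
```

The quartiles match the values returned by summary above; which approach you use is a matter of preference.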

2.7 Creating Vectors

A vector in R is, simply put, a single set of values, a storage unit so to speak. I will be using the terms vector, object, data set, and list of values interchangeably throughout the book (strictly speaking, R also has a separate data type called list, which is not what I am referring to here). As you start working with data in R you’ll get more familiar with these terms.

There are different ways of creating vectors (and data frames), but I’ll stick to the simplest possible ones.

The basic syntax to create a vector, an object or a data frame in R is:

name of your vector <- function and arguments that define values

To recap: the name of your vector, object or data frame goes on the left, then we use <-, then the functions and arguments that define your vector, object or data frame on the right side of this expression.

Here’s how you can create a simple list of values:

values <- c(2,5,4,3,7,2,6,7)

Let’s check if this works. Simply type values in your script or console:

values

## [1] 2 5 4 3 7 2 6 7

As you can see, when I execute the line values, which is the object I created previously, I see the values I put in it.
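Once a vector exists, you can ask questions about it or pull pieces out of it. Here is a short sketch of a few common operations, recreating the values object so the example runs on its own:

```r
values <- c(2, 5, 4, 3, 7, 2, 6, 7)

length(values)      # how many values the vector holds
values[3]           # the third value
values[values > 4]  # only the values greater than 4
values * 2          # every value doubled
```

Notice that values * 2 does not change the object itself; to keep the result you would assign it to a name with <-.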

Now let’s create another list of values, this time using the built-in sample function in R:

values2 <- sample(1:10,8,replace = TRUE)

In this case, I am using the function sample, which I will use again throughout the book. For the object values I used the combine function (that’s what the lowercase c means). For values2, I used the sample function. Because sample draws numbers at random, the values you get will most likely differ from mine.

Checking the contents of the object values2:

values2

## [1] 10  1  6  4  2 10  8 10
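If you want to get the same random numbers every time you run your script, you can set a seed first. This is a minimal sketch; the seed value 123 is arbitrary and the object names first_draw and second_draw are my own:

```r
set.seed(123)  # the seed value itself is arbitrary
first_draw <- sample(1:10, 8, replace = TRUE)

set.seed(123)  # same seed, so the same numbers come back
second_draw <- sample(1:10, 8, replace = TRUE)

identical(first_draw, second_draw)
```

The last line returns TRUE. Setting a seed is handy when you want someone else to reproduce your results exactly.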

We can also create a list of characters, like this:

chr <- c("A","B","C","D","E","F","G","H")

Checking chr:

chr

## [1] "A" "B" "C" "D" "E" "F" "G" "H"
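You can confirm what kind of data a vector holds with the class function. The sketch below also shows something worth knowing early: a vector holds one type of data only, so mixing numbers and text turns everything into text. The object name mixed is hypothetical:

```r
chr <- c("A", "B", "C", "D", "E", "F", "G", "H")
values <- c(2, 5, 4, 3, 7, 2, 6, 7)

class(chr)     # "character"
class(values)  # "numeric"

# Mixing numbers and text turns every element into text
mixed <- c(1, "A", 3)
class(mixed)
```

This is one reason a stray letter in a column of numbers can cause trouble later on when you try to calculate statistics or plot.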

As mentioned earlier, there are different ways of creating data in R. I will be using different methods for the plotting chapters. I will also show you how to upload files.

2.8 Creating a Data Frame

Data frames are just like a regular table you’d see in Excel (rows and columns). We can create a data frame from scratch or we can “grab” vectors that have already been created to form a data frame. Let’s explore these two options (they are by no means the only ones).

From scratch:

df <- data.frame(Beatle = c("John", "Paul", "Ringo", "George"),
                 Born = c(1940,1942,1940,1943))

Let’s see what this simple data frame looks like:

df

##   Beatle Born
## 1   John 1940
## 2   Paul 1942
## 3  Ringo 1940
## 4 George 1943
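With the data frame in place, the dollar sign and square brackets we used on mtcars work here too. A short sketch, recreating df so the example runs on its own:

```r
df <- data.frame(Beatle = c("John", "Paul", "Ringo", "George"),
                 Born = c(1940, 1942, 1940, 1943))

df$Born                # one column, returned as a vector
df[df$Born == 1940, ]  # only the rows where Born equals 1940
nrow(df)               # number of rows
```

The second line keeps only the rows for John and Ringo, the two entries with 1940 in the Born column.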

From previously created vectors:

Let’s use the vectors created in the previous section:

df2 <- data.frame(chr,values,values2)

Once again, let’s check what this newly created data frame looks like:

df2

##   chr values values2
## 1   A      2      10
## 2   B      5       1
## 3   C      4       6
## 4   D      3       4
## 5   E      7       2
## 6   F      2      10
## 7   G      6       8
## 8   H      7      10
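Two details are worth a quick sketch here. First, you can give the columns more descriptive names while building the data frame. Second, the vectors need compatible lengths, otherwise R stops with an error. The names df3, letter, measurement, and result are hypothetical, and I use tryCatch only so that the example keeps running after the error:

```r
chr <- c("A", "B", "C", "D", "E", "F", "G", "H")
values <- c(2, 5, 4, 3, 7, 2, 6, 7)

# Name the columns while building the data frame
df3 <- data.frame(letter = chr, measurement = values)
colnames(df3)

# Eight letters cannot be paired with three numbers; R raises an error
result <- tryCatch(
  data.frame(letter = chr, measurement = c(1, 2, 3)),
  error = function(e) "error: differing number of rows"
)
result
```

If you ever see an error about a differing number of rows, check the length of each vector with the length function.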

2.9 Creating Data with R Functions

The two most helpful functions I use to generate data to practice plotting in R are the rnorm and sample functions.

2.9.1 The rnorm Function

The rnorm function creates normally distributed random data according to three pieces of information: sample size, mean, and standard deviation.

Let’s try an example, and place the values created by the rnorm function in an object named n1.

n1 <- rnorm(30,72,2.5)

In this example, I created a data set with 30 data points drawn from a normal distribution with a mean of 72 and a standard deviation of 2.5. Let’s print the values to see what this data set looks like:

n1

##  [1] 72.51060 73.31964 67.66270 72.51043 70.32431 72.55095
##  [7] 70.87872 72.84726 70.95460 75.28300 71.69999 75.51884
## [13] 72.09144 70.70468 72.91325 73.47660 68.16062 70.00517
## [19] 68.43992 67.79581 68.20192 68.59704 73.19116 71.89066
## [25] 73.29277 69.49930 72.69001 75.77247 70.76243 74.65174

As you can see from the output above, I have 30 data points created in a completely random fashion. Because they are random, your values will differ from mine.
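Here is the same idea as a sketch with the arguments spelled out by name, which I find easier to read. The seed value 42 is arbitrary; I only set it so the example is repeatable:

```r
set.seed(42)  # arbitrary seed so the example is repeatable

# Same call as before, with each argument named
n1 <- rnorm(n = 30, mean = 72, sd = 2.5)

length(n1)  # 30 data points
mean(n1)    # close to 72, but not exactly 72
sd(n1)      # close to 2.5, but not exactly 2.5
```

The mean and standard deviation you request describe the distribution the numbers are drawn from, so the statistics of any one sample will land near those targets rather than exactly on them.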

2.9.2 The sample Function

Now let’s try creating another set of 30 data points, but this time our code will look like this:

n2 <- sample(70:74,30,replace = TRUE)

Checking the contents of the object n2:

n2

##  [1] 73 74 72 72 73 72 73 72 71 74 72 71 72 71 71 71 72 72
## [19] 72 74 71 74 74 72 70 71 73 74 72 73

Notice a couple of things here. Firstly, I set a specific range for my sampling, 70 to 74. Secondly, I used the replace = TRUE argument so that my sampling method is done with replacement; that is, each number sampled is placed back into the sampling process.
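To see why replace = TRUE matters in this example, consider that the range 70 to 74 only contains five values. The sketch below counts the draws with the table function and then shows what happens without replacement. The seed is arbitrary, the name too_many is hypothetical, and tryCatch is there only so the example keeps running after the error:

```r
set.seed(7)  # arbitrary seed so the example is repeatable
n2 <- sample(70:74, 30, replace = TRUE)

# Count how many times each value was drawn
table(n2)

# Without replacement, five values can supply at most five draws
sample(70:74, 5)

# Asking for 30 draws without replacement is an error
too_many <- tryCatch(sample(70:74, 30), error = function(e) "error")
too_many
```

In other words, whenever you ask for more data points than there are values in your range, you need sampling with replacement.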

2.10 Uploading a File

The simplest way to upload a file into RStudio is via the Files tab, located at the bottom right corner of your RStudio window (for a quick refresher, refer back to “RStudio Quick Tour” in Chapter 1). For most of the code you will run for DMAIC-based projects, you will need a few data source files, and my guess is that in most cases, they will be Excel-based files.

TIP: Always keep files of interest in the same folder when using RStudio.

Uploading a file into RStudio means running code for the task, but you don’t have to type that code in the script or console yourself. Once you are more familiar with how R works, you may find it useful to write your own line of code to upload files; until then, you can simply use the Files tab.

In the Files tab, go to the three dots all the way to the right and make sure that you select the directory you want to be in. From there, look for your file on the list, click on it once and select Import Dataset. Notice that a window pops up to show you a preview of the file. Notice also that the bottom right corner of that window shows the actual code doing the job.

TIP: As you learn how to upload files into RStudio, always work with “clean” files - that is, simple matrix-based Excel files: no merged cells, no missing values, no special characters where you should have a number. Simple and clean rows-and-columns data sources will help you significantly as you get more used to working in RStudio.
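When you are ready to write the upload code yourself, here is a minimal sketch for a .csv file using the base R function read.csv. Since I cannot know what files you have, the example first creates a tiny file in a temporary folder; the file name, column names, and values are all made up for illustration:

```r
# Build a tiny example file in a temporary folder (a stand-in for your own file)
example_file <- file.path(tempdir(), "example_data.csv")
write.csv(data.frame(part = c("A", "B", "C"), weight = c(2.5, 3.1, 2.8)),
          example_file, row.names = FALSE)

# Read the file into a data frame
my_data <- read.csv(example_file)

# Get to know the data, just like we did with mtcars
head(my_data)
str(my_data)
```

With your own data, you would replace example_file with the path to your file in quotes. For Excel workbooks, the code shown in the Import Dataset window typically relies on the read_excel function from the readxl package, which has to be installed first.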