# Chapter 2 Basic and Useful R Coding

Before we start learning about plotting great-looking and professional visuals in R, let me show you a few lines of code that might help with handling your data. For simplicity, I am going to use a couple of R’s built-in data sets. That’s right, R ships with an array of built-in data sets that you can use to practice your coding skills, and they are available in any RStudio session. If you already know the basics of R, feel free to skip ahead to Chapters 3 and 4, where I show you how to create awesome, clean, and elegant plots.

Open RStudio for a new session and create a new script by using the shortcut Ctrl+Shift+N (Cmd+Shift+N on a Mac). You can also create a new script by accessing the option File/New File/R Script from the main menu. Once your script window is open, type the following lines of code to get started. Remember to save your scripts; they are the depository of your lines of code.

TIP: You can use a hash sign (#) to write notes in your scripts. R ignores everything that follows a # on a line, so a line that starts with # is not executed at all.
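Here is a minimal sketch of the tip above. The object name x is only a placeholder I made up for the illustration:

```r
# This whole line is a note, so R skips it
x <- 5 + 3   # R runs the code on the left and ignores this note
x
```

Running these lines prints 8, which shows that the notes had no effect on the calculation.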

## 2.1 Available Built-in Datasets

To run a line of code, use the keyboard shortcut Ctrl+Enter or simply click on the button “Run” located at the top right corner of your script pane.

This is how you can see the data sets that are available in R:

data()

Running the code above should give you a list of built-in data sets that you can play with.
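If the full list feels overwhelming, the sketch below narrows it down. I am assuming here that you only want the data sets from the datasets package that ships with R, which is where mtcars lives:

```r
# Restrict the listing to the datasets package that ships with R
data(package = "datasets")

# Open the documentation page for one data set
?mtcars

# Check the number of rows and columns before printing anything
dim(mtcars)
```

The help page is worth a look because it explains what each column of the data set means.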

## 2.2 Looking Into the Data

For our first steps, I am going to use the mtcars data set. Let’s see what’s inside it by using the function head(), which I use all the time. It returns the first six records of any data set. In your console, or in a new script file, type:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Now, if you’d like to see more than six records, you can specify how many by adding the argument n followed by the number of records. Let’s try displaying the first ten records.

head(mtcars, n = 10)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Similarly, you can use the function tail() to see the last records of the data set. Again, six is the default in R but you can specify how many you’d like to see. Let’s check the end of this data set, more specifically, the last eight records.

We do this by running this code:

tail(mtcars, n = 8)
##                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9        27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2    26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa     30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L   15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino     19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora    15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E       21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Ok, you now know how to call up a built-in data set in R and display its first and last records using the head() and tail() functions. Later on I will show you how to create data in R and import files such as .xlsx and .csv (file formats commonly used in data analysis).
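Two small variations are worth knowing, sketched below. Both rely on documented behavior of head(): the argument n can be passed by position, and a negative n drops rows from the end instead of counting from the start.

```r
# Positional argument: same result as head(mtcars, n = 3)
head(mtcars, 3)

# Negative n: everything except the last 30 rows.
# mtcars has 32 rows, so this leaves the first 2.
head(mtcars, n = -30)
```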

## 2.3 Checking the Structure of the Data

Another important check you might want to run on any data you are working with is a check of its structure. R stores values as different data types, for example character, numeric (stored as double or integer), date, and the like. It is very important to learn how to check the type of data you are dealing with and how to manipulate it for plotting purposes. You will see what I mean later on, when we learn how to plot data.

For now, I just want to show you a simple line of code to check the structure of the data set. Here it is:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

By running the code str(mtcars) you can see what type of data each of the variables holds. In this example, they are all numeric. There will be other examples throughout the book.

## 2.4 Column and Row Names

Another useful thing we can do is to check the names of the columns and rows. For that, we can simply run this code for columns:

options(width = 60)
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"
##  [9] "am"   "gear" "carb"

And this code for rows:

rownames(mtcars)
##  [1] "Mazda RX4"           "Mazda RX4 Wag"
##  [3] "Datsun 710"          "Hornet 4 Drive"
##  [5] "Hornet Sportabout"   "Valiant"
##  [7] "Duster 360"          "Merc 240D"
##  [9] "Merc 230"            "Merc 280"
## [11] "Merc 280C"           "Merc 450SE"
## [13] "Merc 450SL"          "Merc 450SLC"
## [15] "Cadillac Fleetwood"  "Lincoln Continental"
## [17] "Chrysler Imperial"   "Fiat 128"
## [19] "Honda Civic"         "Toyota Corolla"
## [21] "Toyota Corona"       "Dodge Challenger"
## [23] "AMC Javelin"         "Camaro Z28"
## [25] "Pontiac Firebird"    "Fiat X1-9"
## [27] "Porsche 914-2"       "Lotus Europa"
## [29] "Ford Pantera L"      "Ferrari Dino"
## [31] "Maserati Bora"       "Volvo 142E"

Now we know that the columns in this data set are the variables mpg, cyl, disp, and so on. And we know that the rows are named after car models: Mazda RX4, Mazda RX4 Wag, and so on.

As a good practice, always run these lines of code and get to know your data frame or data set before you start working with it, especially if you do not know how the data was built or collected.

## 2.5 Basic Statistics

Let’s talk about basic statistics, something very important for any Continuous Improvement professional. R can give us quick summary statistics on any data set, or on specific variables. To “slice” the data set and select one variable at a time, we can use the dollar sign after the data set’s name inside the function summary.

For example, let’s look at the summary statistics for the variable mpg in the mtcars data set. Here’s the code:

summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   10.40   15.43   19.20   20.09   22.80   33.90
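The same function also works on more than one variable at a time. The sketch below is only an illustration; the choice of the columns mpg, hp, and wt is mine:

```r
# Summary statistics for every variable in the data set
summary(mtcars)

# Summary statistics for a few columns, selected by name
summary(mtcars[, c("mpg", "hp", "wt")])
```

The square brackets select columns by name, which is handy when the dollar sign would only give you one variable.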

Let’s do it again, this time for the variable disp:

summary(mtcars$disp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    71.1   120.8   196.3   230.7   326.0   472.0

### 2.6.1 Minimum Value

min(mtcars$mpg)
## [1] 10.4

The result is 10.4, the minimum value found in the vector mpg from the mtcars data set. Similarly, we can run the following lines of code for other single-value statistics such as maximum value, mean, median, and standard deviation.

### 2.6.2 Maximum Value

max(mtcars$mpg)
## [1] 33.9

### 2.6.3 Mean

mean(mtcars$mpg)
## [1] 20.09062

### 2.6.4 Median

median(mtcars$mpg)
## [1] 19.2

### 2.6.5 Standard Deviation

sd(mtcars$mpg)
## [1] 6.026948

### 2.6.6 First and Third Quartiles

We can also use the summary function to display only the values that we want to see. Notice that, out of the six statistics the summary function normally returns (see the previous section), the example below displays only the second and the fifth, which are the first and third quartiles of the variable.

summary(mtcars$mpg)[c(2,5)]
## 1st Qu. 3rd Qu.
##  15.425  22.800
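As an alternative sketch, base R also has dedicated functions for these values. I am using quantile(), IQR(), and range() here, which are not covered elsewhere in this chapter:

```r
# First and third quartiles requested directly
quantile(mtcars$mpg, probs = c(0.25, 0.75))

# Interquartile range (third quartile minus first quartile)
IQR(mtcars$mpg)

# Minimum and maximum in a single call
range(mtcars$mpg)
```

The quartiles match the ones returned by summary, since summary relies on the same default quantile calculation.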

## 2.7 Creating Vectors

A vector in R is, simply put, a set of values of the same type stored under one name, a storage unit so to speak. I will be using the terms vector, object, data set, and list of values interchangeably throughout the book. As you start working with data in R you’ll get more familiar with these terms.

There are different ways of creating vectors (and data frames), but I’ll stick to the simplest possible ones.

The basic syntax to create a vector, an object or a data frame in R is:

name of your vector <- function and arguments that define values

To recap: the name of your vector, object or data frame goes on the left, then we use <-, then the functions and arguments that define your vector, object or data frame on the right side of this expression.

Here’s how you can create a simple list of values:

values <- c(2,5,4,3,7,2,6,7)

Let’s check if this works. Simply type values in your script or console:

values
## [1] 2 5 4 3 7 2 6 7

As you can see, when I execute the line values, which is the object I created previously, I see the values I put in it.
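Once a vector exists, you can ask questions about it or pull out individual values. Here is a short sketch using the same values object; the square-bracket indexing is something I am introducing only for this illustration:

```r
values <- c(2, 5, 4, 3, 7, 2, 6, 7)

length(values)   # how many values the vector holds
values[3]        # the third value
values[1:3]      # the first three values
mean(values)     # functions such as mean() accept the whole vector
```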

Now let’s create another list of values, this time using the built-in sample function in R:

values2 <- sample(1:10,8,replace = TRUE)

In this case, I am using the function sample, which I will use again throughout the book; here it draws eight values at random from the numbers 1 to 10. For the object values I used the combine function (that’s what the lowercase c means).

Checking the contents of the object values2:

values2
## [1] 10  1  6  4  2 10  8 10
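Because sample draws at random, your values2 will most likely differ from mine. If you need the same draw every time, one option is to set a seed first. This is a minimal sketch; the seed 123 and the object names first_draw and second_draw are arbitrary choices of mine:

```r
set.seed(123)  # any integer works; 123 is arbitrary
first_draw <- sample(1:10, 8, replace = TRUE)

set.seed(123)  # resetting the same seed reproduces the same draw
second_draw <- sample(1:10, 8, replace = TRUE)

identical(first_draw, second_draw)
```

The last line returns TRUE, which confirms that both draws contain exactly the same values.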

We can also create a list of characters, like this:

chr <- c("A","B","C","D","E","F","G","H")

Checking chr:

chr
## [1] "A" "B" "C" "D" "E" "F" "G" "H"
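To tie this back to data types, the sketch below checks what kind of values each vector holds. The object mixed is a hypothetical example I added to show what happens when numbers and characters share one vector:

```r
values <- c(2, 5, 4, 3, 7, 2, 6, 7)
chr <- c("A", "B", "C", "D", "E", "F", "G", "H")

class(values)  # "numeric"
class(chr)     # "character"

# Mixing types in one call to c() converts everything to character
mixed <- c(1, "A", 2)
class(mixed)   # "character"
```

This matters for plotting later on, because a column of numbers stored as characters will not behave like numbers.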

As mentioned earlier, there are different ways of creating data in R. I will be using different methods for the plotting chapters. I will also show you how to import files.

## 2.8 Creating a Data Frame

Data frames are just like a regular table you’d see in Excel, with rows and columns. We can create a data frame from scratch, or we can “grab” vectors that have already been created and combine them into a data frame. Let’s explore these two options, which are by no means the only ones.

From scratch:

df <- data.frame(Beatle = c("John", "Paul", "Ringo", "George"),
                 Born = c(1940, 1942, 1940, 1943))

Let’s see what this simple data frame looks like:

df
##   Beatle Born
## 1   John 1940
## 2   Paul 1942
## 3  Ringo 1940
## 4 George 1943
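With the data frame in place, the dollar sign and the checks from earlier in the chapter work on it too. A short sketch follows; the row filter in the middle is my own addition for illustration:

```r
df <- data.frame(Beatle = c("John", "Paul", "Ringo", "George"),
                 Born = c(1940, 1942, 1940, 1943))

df$Born                 # one column, returned as a vector
df[df$Born == 1940, ]   # only the rows where Born equals 1940
str(df)                 # the type of each column
```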

From previously created vectors:

Let’s use the vectors created in the previous section:

df2 <- data.frame(chr,values,values2)

Once again, let’s check what this newly created data frame looks like:

df2
##   chr values values2
## 1   A      2      10
## 2   B      5       1
## 3   C      4       6
## 4   D      3       4
## 5   E      7       2
## 6   F      2      10
## 7   G      6       8
## 8   H      7      10
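By default the columns take the names of the vectors. If you would rather pick the names yourself, something along these lines should work. The names df3, label, count, and doubled are hypothetical and only serve the illustration:

```r
chr <- c("A", "B", "C", "D", "E", "F", "G", "H")
values <- c(2, 5, 4, 3, 7, 2, 6, 7)

# Give the columns explicit names while building the data frame
df3 <- data.frame(label = chr, count = values)

# Add a new column computed from an existing one
df3$doubled <- df3$count * 2

head(df3, n = 3)
```

Keep in mind that the vectors you combine need to have the same number of values, as chr and values do here.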

## 2.9 Creating Data with R Functions

The two most helpful functions I use to generate data to practice plotting in R are the rnorm and sample functions.

### 2.9.1 The rnorm Function

The rnorm function creates normally distributed random data according to three pieces of information: sample size, mean, and standard deviation.

Let’s try an example, and place the values created by the rnorm function in an object named n1.

n1 <- rnorm(30,72,2.5)

In this example, I created a data set with 30 data points, a mean of 72, and a standard deviation of 2.5. Let’s print the object to see the values:

n1
##  [1] 72.51060 73.31964 67.66270 72.51043 70.32431 72.55095
##  [7] 70.87872 72.84726 70.95460 75.28300 71.69999 75.51884
## [13] 72.09144 70.70468 72.91325 73.47660 68.16062 70.00517
## [19] 68.43992 67.79581 68.20192 68.59704 73.19116 71.89066
## [25] 73.29277 69.49930 72.69001 75.77247 70.76243 74.65174

As you can see from the output above, I have 30 data points drawn at random from a normal distribution. Because the values are random, your own numbers will differ from mine.
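The sketch below writes the same call with named arguments, which I find easier to read, and then checks the result. The seed 42 is an arbitrary choice of mine so the draw can be repeated:

```r
set.seed(42)  # arbitrary seed so the draw can be repeated
n1 <- rnorm(n = 30, mean = 72, sd = 2.5)

length(n1)  # 30 data points
mean(n1)    # close to 72, but not exactly 72
sd(n1)      # close to 2.5, but not exactly 2.5
```

The mean and standard deviation you ask for describe the distribution the data is drawn from, so a sample of 30 points will only land near them.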

### 2.9.2 The sample Function

Now let’s try creating another set of 30 data points, but this time our code will look like this:

n2 <- sample(70:74,30,replace = TRUE)

Checking the contents of the object n2:

n2
##  [1] 73 74 72 72 73 72 73 72 71 74 72 71 72 71 71 71 72 72
## [19] 72 74 71 74 74 72 70 71 73 74 72 73

Notice a couple of things here. Firstly, I set a specific range for my sampling, 70 to 74. Secondly, I used the replace = TRUE argument so that my sampling method is done with replacement; that is, each number sampled is placed back into the sampling process.
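To see why replacement matters here, consider the sketch below. Without replacement each number can be drawn only once, so the range 70 to 74 can supply at most five values, and asking for 30 would stop with an error. The use of table() to count the draws is my own addition:

```r
# Without replacement: each of the five numbers appears exactly once
sample(70:74, 5)

# With replacement: 30 draws are possible, and numbers repeat
n2 <- sample(70:74, 30, replace = TRUE)

# Count how often each number came up
table(n2)
```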