Your Code Basics

This section covers the fundamentals of the R language. Things like data types and structures, functions and libraries, data formats, loops and more, are all the horrible things that you need to have some understanding of right now. I wish we could skip this shit and go straight to good stuff, but unfortunately, we can’t. These topics will come up again and again and you will just fuck yourself up if you don’t at least know about them. I will try to keep it as interesting as possible so bear with me on the adult material.

Remember, you’re still sitting in your cubicle at work familiarizing yourself with R. Your boss is out for a while and gave you a few days to settle down and get more comfortable with R in particular. Nobody else has really bothered you up until now, but you suspect that it might change soon. You don’t want to look like an incompetent degenerate, so you jump back into learning R.

0.7 Data Types & Structures

We have already written a few lines of code. Let’s just erase all that trash and start over.

Erasing and starting over is very useful in the beginning because it helps you get better. Rewriting the same stuff over and over with some minor improvements really cements the fundamentals in your head.

Start a new script and write the following lines of code:

There are four simple data types in R. Characters, numbers, booleans (logical), and stupid factors. Nerds will tell you that there are more types. There are, they are called complex, integers and some other nonsense. Leave these types to them, you do not need them right now. Let’s go in order.

In this book, I’ve tried to illustrate code the way it looks in real life. However, sometimes a line of code can be too long and won’t fit on a single line of a book (even digital), resulting in a scroll view, which isn’t a good reading experience. Therefore, from time to time, you will see me breaking a single line of code into two or three lines. Don’t worry, it will work just the same in RStudio. It might just look weird at first.

Example:

Something that looks like this:

Will sometimes look like this:
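To make the line-breaking point concrete, here is a minimal sketch. The strings and the variable names (oneLine, brokenUp) are made up for illustration and are not the original chunks. R keeps reading onto the next line as long as the expression is obviously unfinished, for example while a parenthesis is still open, so both forms give the same result.

```r
# One long line (hypothetical example)
oneLine <- paste("This", "is", "a", "long", "line", "of", "code", "that", "barely", "fits")

# The same call broken across three lines; R treats it as one expression
# because the parenthesis is still open at the end of each line
brokenUp <- paste("This", "is", "a", "long",
                  "line", "of", "code",
                  "that", "barely", "fits")

# Both variables hold exactly the same string
print(identical(oneLine, brokenUp))
```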

0.7.1 Characters

Character type is just that – characters, letters, text, sentences, names and so forth. There is not much to it. Any text that you print or store in a variable or table is a character. Numbers can be characters, dates can be characters, factors and logicals can be converted to characters. A simple example is the line of code we wrote earlier:

The character string ‘Trump’ is stored inside of the variable ‘president’. What is a character string? It is just a bunch of characters put together. The character string ‘Trump’ is a combination of the characters ‘T’, ‘r’, ‘u’, ‘m’, ‘p’. What can we do with characters? Lots of things, actually. Characters store names, descriptions, text, and other things of that nature, nothing special here. Some interesting things that we can do with characters are counting them, chopping them up and combining different strings together; we can also filter by characters or by the number of characters in a string, and we can group by characters. Let’s just look at some of those. In the following chunk, we will paste two strings together and print the result.
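The pasting step can be sketched like this. The variable names firstName, lastName and fullName are assumptions picked for illustration.

```r
# Store two character strings in two separate variables
firstName <- "Donald"
lastName <- "Trump"

# paste() glues them together, separated by a single space by default
fullName <- paste(firstName, lastName)
print(fullName)
```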

## [1] "Donald Trump"

We just created two character strings, stored them in two separate variables, and pasted them together into one string. Simple and cool. We did not necessarily need to store the strings in variables. You can also do it like this:

## [1] "Donald Trump"
## [1] "Donald Trump"

As you can see, the results of all three methods are the same. Here, we used the function paste(). It’s a very useful function, and we will be using it a lot. Have you noticed that when we pasted those strings together, there was a space automatically added? We pasted ‘Donald’ and ‘Trump’ together and got ‘Donald Trump’? If you are not a programmer, this is something you would expect. Because, why the hell not? In programming, you have to be precise, because a computer does not know what the fuck is going on. It just interprets your commands. This time, it just so happened, that this particular function paste() automatically adds a space between the things that you are trying to put together. But what if you did not want that space there? For that, there is a second variation of the paste() function called paste0(). Let’s see how it works.

## [1] "DonaldTrump"

Very nice.

You probably won’t understand why at the moment, but the paste0() function is much cooler and we’ll use it more often than ‘paste,’ because it gives us more control. It allows us to add a space whenever we want, and by avoiding that default setting it runs faster.

Now, look at how you can add that space using paste0().
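Here is a minimal sketch of both paste0() variations, again with assumed variable names (firstName, lastName, withSpace).

```r
firstName <- "Donald"
lastName <- "Trump"

# paste0() puts nothing between the pieces
print(paste0(firstName, lastName))

# To get the space back, pass it in as its own string
withSpace <- paste0(firstName, " ", lastName)
print(withSpace)
```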

## [1] "Donald Trump"

Pasting things together is a very useful and simple thing. I want to show you a couple more things that we can do with characters. Chopping character strings up will be something we will be doing a lot. Here is how it works. Let’s say we have ‘Donald’ and ‘Trump’, but we want to convert them into ‘Dump’.

To get ‘Dump’ out of ‘Donald’ and ‘Trump’, we are going to need to get the first letter of the first name and the last three letters of the last name. We will be using function substr() for this.

## [1] "D"

Function substr() takes a character string, a start position, and a stop position. So, we basically said: take ‘Donald’, start at ‘D’, finish at ‘D’, and store that in the variable ‘D’. Let’s do it again, but without storing any variables, and in one line:

## [1] "D"

Good, now we need the ‘ump’ part. Same thing, but different numbers.

## [1] "ump"

Now, we need to paste these together and we already know how to do it.

## [1] "Dump"

substr() will be a big part of your day-to-day programming. It is a very simple but very important function. It can be a little confusing in the beginning. Are these positions inclusive or exclusive? They are inclusive, but you’re likely to forget that. Don’t worry, it will become natural later.
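The whole ‘Dump’ walk-through can be sketched in one go. The variable names are assumptions. Both positions are inclusive, so substr(x, 3, 5) returns characters 3, 4 and 5.

```r
firstName <- "Donald"
lastName <- "Trump"

# substr(x, start, stop): start and stop are both inclusive
D <- substr(firstName, 1, 1)     # the first letter only
ump <- substr(lastName, 3, 5)    # letters 3 through 5

# Glue the two pieces together without a space
dump <- paste0(D, ump)
print(dump)
```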

As you might have noticed, every time I execute a function, I print the result. I’m only doing it for you, and you don’t really have to do it. In real life, I don’t print the results of every single operation, because I understand what the code output will be. However, it’s still a good practice to print your result, at least in the beginning. Let’s move on.

One last function I want to show you that you will be using a lot is trimws(). This function eliminates white spaces around a character string that you are passing to it. As you should remember from our paste() exercise, R treats an empty space as a separate character. Imagine that you are dealing with some hand-typed data; instead of typing ‘Donald’, someone typed ‘Donald ’ or ‘ Donald’. These empty spaces make those two entries almost unusable because they won’t match the proper ‘Donald’ entry. That is where we’ll absolutely need functions like trimws(). Let’s simulate a situation to see how it works:

Now, let’s check if they are the same shit; for this we need the logical equality operator ‘==’. It reads like the equals sign from math class, but instead of assigning anything it asks a question and returns TRUE or FALSE (a single ‘=’ in R assigns a value). This is how we can compare characters and other non-numbers. Do not worry about it now, we will talk about it in depth later. So, let’s see if ‘Donald ’ equals ‘Donald’:

## [1] FALSE

As you can see, they are not. Let’s fix it.

## [1] "Donald"

Let’s check again:

## [1] TRUE

Now they are, because we trimmed that empty space. Perfect.
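The trimming exercise can be sketched like this. The variable names clean and messy are assumptions, and the trailing space in messy is deliberate.

```r
clean <- "Donald"
messy <- "Donald "   # note the trailing space

# The two strings differ by one character, so this is not a match
print(clean == messy)

# trimws() strips whitespace from both ends of the string
messy <- trimws(messy)
print(messy)

# Now the comparison succeeds
print(clean == messy)
```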

This sums up characters for now; I showed you a few things that we can do with them. There are many more functions, and many interesting things we can do with characters, but the main point of this introduction is to get you to see the distinction between different data types. As we move along, you will see these functions more often along with other new functions.

0.7.2 Numbers

In R, numbers are called numerics. There are also integers and complex numbers, but don’t even worry about this right now as we won’t be working with them. For us, right now, numbers are numbers, and that is it. As far as you are concerned, 15.5 is a number, 5 is a number, 0 is a number. Everything you can do with numbers anywhere else, you can also do here. Let’s take a look at some basic operations. First, one distinction to get out of the way. Check this out:

## [1] "numeric"
## [1] "character"

Sage Tip: Putting quotation marks around a number will make it a character. Remember!
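A quick way to see what R thinks a value is: the function class(). In this minimal sketch the values are picked for illustration. It is the quotation marks that turn a number into a character; plain parentheses change nothing.

```r
print(class(5))      # a bare number is numeric
print(class("5"))    # the same digit inside quotation marks is a character
print(class((5)))    # parentheses alone do not change the type
```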

Now, back to the operations:

Adding the two variables to get Trump’s age by the end of the first term.

## [1] 74

Or simply:

## [1] 74

Let’s print his year of birth:

## [1] 1949

Hopefully, you get the idea; hopefully I don’t have to teach you basic math or statistics. You can definitely do more with numbers in R, but the main point of this part is to show you that numbers in R are the same as the numbers anywhere else.
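The addition step can be sketched like this. The variable names and the starting values (70 and 4) are assumptions that happen to add up to the 74 printed above.

```r
age <- 70    # assumed age at the start of the term
term <- 4    # assumed length of one term, in years

# Adding two variables
print(age + term)

# Or adding the numbers directly, without storing anything
print(70 + 4)
```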

0.7.3 Booleans or Logicals

You probably haven’t noticed, but we already used this variable type when checking if ‘Donald’ was equal to ‘Donald ’ with a space at the end. Booleans are also called ‘Logicals,’ and they are quite simple, because there are only two of them: TRUE and FALSE (all caps, R is picky about that). Also, you should know that TRUE and FALSE have the corresponding numbers 1 and 0 that basically mean the same thing. Don’t worry about the numbers part; I’ll show you how that works.

First, let’s see if TRUE and 1 and FALSE and 0 are the same things.

## [1] TRUE
## [1] TRUE

As you can see, both returned as TRUE, which means they’re equal to each other. Now, let’s prove that not every number is equal to TRUE:

## [1] FALSE
## [1] FALSE

See? We are not going to be using 0 and 1 as true or false really, but you should be aware that there is such a thing.
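A minimal sketch of those comparisons. The number 5 is an arbitrary stand-in for "some other number".

```r
# TRUE behaves like 1 and FALSE behaves like 0
print(TRUE == 1)
print(FALSE == 0)

# Any other number matches neither of them
print(TRUE == 5)
print(FALSE == 5)

# Because of that, summing logicals counts the TRUEs
print(sum(c(TRUE, FALSE, TRUE)))
```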

Let’s see how we WILL be using Booleans:

## [1] TRUE
## [1] FALSE
## [1] TRUE

Now, to the big one:

I’m going to show you an {if else} operation here. I will briefly explain what it does, but you don’t have to remember it right now. We’ll get into ‘if else’ later.

## [1] "Trump is the president"
## [1] TRUE

What the fuck just happened here? Number 0: we stored ‘Trump’ in the president variable. Number 1: in the {if else} statement we basically asked: ‘does the variable president equal ‘Trump’?’ and if it does, run Number 2 and Number 3. Number 2: paste together the contents of the variable president and the character string ‘is the president’ and print it. Number 3: print a logical expression comparing the contents of the variable president to ‘Trump’, which only has two possible results, as you remember. In Number 4 we basically check what should happen if Number 1 is FALSE. If the variable president doesn’t equal ‘Trump’, Number 5 should kick in. Number 5 here is pretty much the same as Number 2. The main point of {if else} statements is that only one of these two options can be TRUE and therefore only one gets executed; in this case it is Number 2 and Number 3.
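Here is a sketch of what the numbered steps could look like in code. The comments map each line to the numbers in the explanation above; the exact layout is an assumption.

```r
president <- "Trump"                                # Number 0
if (president == "Trump") {                         # Number 1
  print(paste(president, "is the president"))       # Number 2
  print(president == "Trump")                       # Number 3
} else {                                            # Number 4
  print(paste(president, "is not the president"))   # Number 5
}
```

Only one branch ever runs: with ‘Trump’ stored, it is Number 2 and Number 3; with any other name, it is Number 5.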

Now, let’s change the president variable to good old Obama, and see if he is still president.

## [1] "Obama is not the president"

If all of that sounded like a lot of nonsense to you, do not worry about it. The ‘if else’ example is too much for now anyway. Just remember that Booleans (Logicals) consist of only two values, TRUE and FALSE. They exist so we could compare not only numbers but characters and other data types as well.

0.7.4 Factors

As I mentioned before, factors are stupid and we are going to try to avoid them as much as possible. If you want to learn about factors, you’re not going to do it here. I am going to give you a basic explanation and some examples; I will also explain why I think they are stupid.

Think of factors as categories or levels. Genders would be a factor; colors would be factors as well.

We haven’t reached ‘vectors’ and other data structures yet, but I need to use a vector here to show you factors. So, don’t worry if you can’t follow 100%.

Let’s store some colors in a vector of character strings.

This vector is not a factor yet, but it’s a good candidate to be one. There are a limited number of colors and colors can be treated as categories.

Let’s first check what colors is.

## [1] "character"

As you can see, it says ‘character’. It is really a group of characters, but the class is still ‘character’. Fine.

Let’s convert it to a factor. I will use the function ‘factor()’, but do not try to remember it, we will never use it. Never!

## [1] "factor"

As you can see, it is a factor now. Finally, I want to print them side by side to show the difference:

## [1] "red"   "blue"  "green" "red"   "blue"  "green" "red"   "blue"  "green"
## [1] red   blue  green red   blue  green red   blue  green
## Levels: blue green red

Factors group categories into levels, characters do not.
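The factor steps can be sketched like this. The name colorsFactor is an assumption, since the original may simply overwrite colors.

```r
# A plain character vector with repeating values
colors <- c("red", "blue", "green", "red", "blue", "green", "red", "blue", "green")
print(class(colors))

# factor() turns it into a factor with one level per distinct value
colorsFactor <- factor(colors)
print(class(colorsFactor))

# Side by side: the character vector prints quoted strings,
# the factor prints bare values plus a Levels line
print(colors)
print(colorsFactor)

# The levels on their own, sorted alphabetically
print(levels(colorsFactor))
```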

Factors are useful for some advanced statistical operations. Once you reach that level, by all means, go ahead and start using them. Right now, factors will only get in the way. They will mess up your code. I am pretty sure you’ll hate them just like I do. Whatever factors accomplish with their levels can be accomplished by grouping regular characters, without the worry of messed-up code. Again, we will be avoiding factors as much as possible, but at least you are aware of them.

0.7.5 Dates & Times

This one is a big topic by itself, and we will spend a good amount of time diving deeper into it later in this book. Dates are important. One of the best things about R is how well it handles dates and how much flexibility it offers when dealing with them. That flexibility, though, does not come without cost. The cost is complexity. You will be dealing with dates and times a lot, and it will become a big source of frustration for you. However, once you master R and start looking into some other languages, you will appreciate how many options and how much flexibility R gives you. There are so many packages that deal with dates that if I start going over all of them, you will close this book. Therefore, I will show you just the one that I found to be the most universal. It is called ‘lubridate’. Let me show you a few examples of dates and what we can do with them. The first thing we need to do is install the package.

Now, load it.

You are doing great! Just messing with you, you have not really done shit yet!

Let’s first create a date:

We just stored a character string that looks like a date inside of the date variable. Let’s double check:

## [1] "character"

Just a character right now. Let’s convert using the function ymd() from the package ‘lubridate’:

The name ymd here stands for year, month, and day. Lubridate has other variations as well.

## [1] "Date"

You would ask, ‘How is it different from having a date as a character?’ Sometimes it isn’t, but, at some point, you’ll want to do some math with your dates. For example, adding a day to a date. It’s impossible to do with a character, unless you want to manually retype shit every time.

Let’s see how it works with dates:

## [1] "2019-01-02"
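The date steps above can be sketched as follows, assuming the ‘lubridate’ package is installed. The starting value "2019-01-01" is inferred from the printed result and is an assumption.

```r
# install.packages("lubridate")   # run once if the package is missing
library(lubridate)

date <- "2019-01-01"
print(class(date))   # still just text at this point

# ymd() parses a year-month-day string into a real Date
date <- ymd(date)
print(class(date))

# Adding 1 to a Date moves it forward by one day
print(date + 1)
```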

This was just a short introduction to the topic of dates. I only showed you one function of one package. Dates are perfect for working with charts and graphs, projections and other calculations involving time. We are going to dive deeper into dates later in this book, but not until we really need that knowledge.

0.7.6 Type Conversions

You are going to encounter many instances when you will be trying to match one dataset with another and it just won’t match. You look at your data, and it seems fine and clean. You check the column names, and they seem good for matching. What is going on? In many cases, it is just that your data types are different. For example, a column where you have dates got converted to a character type, or some column where you had colors stored as characters got converted to a factor type. It’s super annoying and it happens more often than you think. You need to be able to deal with that shit. Besides fixing types when they are causing you problems, there are going to be even more cases when you’re going to want to convert types for your analysis or something, so, don’t worry, it is not all bad. Let’s see how it works:

Luckily, conversions are quite simple and very uniform across different types, meaning that the functions are kind of similar.

  1. Number to Character.
## [1] "numeric"
## [1] "character"
  2. Character back to Number.
## [1] "numeric"
  3. Stupid Factor to Character and Numbers (will be doing this a lot).
## [1] "factor"
## [1] "character"

I want you to pay attention here:

## [1] 1

The number that we stored in the factor was 5, so why is it printing 1 now? Because it’s a fucking factor. It will mess you up! I’ll tell you why. Instead of returning the number that we stored, it returned the internal position of that value among the factor’s levels, and since 5 is the only level, that position is 1. The only thing that you need to remember, besides not using factors, is the following:

## [1] 5
## [1] "numeric"

If you are converting a number that is a factor back into a number, you must first convert it into a character!

  4. Dates to Characters and Back
## [1] "2020-04-06"
## [1] "Date"
## [1] "2020-04-06"
## [1] "character"

I want to show you two ways to convert it back to a date. The first one is the one that we used before - ymd() from the ‘lubridate’ package. It is the most intuitive so I will insist on using it. The second is from base R, meaning that you do not need any external packages to use it.

## [1] "2020-04-06"
## [1] "Date"
## [1] "2020-04-06"
## [1] "Date"

As you can see, they do the same shit. However, as we progress, we will be doing more sophisticated date and time operations, and you’ll see why I am insisting on lubridate.
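All four conversions can be sketched together. The variable names are assumptions, the values follow the printed outputs, and ‘lubridate’ is assumed to be installed.

```r
library(lubridate)

# Number to character and back
num <- 5
numAsChar <- as.character(num)
backToNum <- as.numeric(numAsChar)
print(class(numAsChar))
print(class(backToNum))

# Factor: going straight to numeric returns the internal level position...
numFactor <- factor(5)
print(as.numeric(numFactor))
# ...so convert to character first, and only then to numeric
print(as.numeric(as.character(numFactor)))

# Date to character and back, two ways
someDate <- ymd("2020-04-06")
dateAsChar <- as.character(someDate)
viaLubridate <- ymd(dateAsChar)       # lubridate
viaBase <- as.Date(dateAsChar)        # base R, no package needed
print(viaLubridate == viaBase)
```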

These conversions were basic, but even this basic stuff will cover 95% of what you will ever need when dealing with data type conversions. The only part that we will need to spend more time on in the next chapters is the dates and times part. Other than that, your data type conversion foundation is built.

This, sort of, concludes the introduction to the basic data types that I want to cover. So far, you got to play with characters, numbers, booleans, stupid factors, and dates. These are the building blocks for the next part where we are going to look at the more complex data structures like vectors, lists, and data tables. You have already seen some of them, so it will not be anything special or complicated, but still, something that we just can’t skip. Strap in.

0.7.7 Vectors

If you are new to programming, vectors will be hard to wrap your head around right away. They are quite simple, though. However, because you are not used to working with data in that format, it will take some time to get used to. At least it did for me. I don’t know, maybe I am such a good teacher that you will get it right away. We will see.

Anyway, we have already seen a few vectors so far. Now, I think, the easiest way to understand vectors right away is like this:

  • step 1: imagine a table with some data;
  • step 2: each column in that table is a vector;
  • step 3: that is it!

If you ever worked with Excel tables, you should be able to picture one. Every column with data in that table is a separate vector, where headers are just names for those vectors. It doesn’t matter what type of data are in those columns. If it is just numbers, it is a numeric vector; if it’s characters, a character vector. You can even try to mix them, but a vector can only hold one type, so R will quietly convert everything in it to characters. Let’s quickly take a look.

Let’s create three vectors. To create a vector you need to use the following syntax: c(…).

  1. With characters:
## [1] "blue"   "yellow" "green"  "red"
  2. With numbers:
## [1] 1 2 3 4
  3. Mixed:
## [1] "1"    "dog"  "55"   "tree"

That’s it. You are basically just storing your data in a bigger data structure. If you want, you can also apply functions to them. As an example, let’s convert a numeric vector to a character vector.

## [1] "1" "2" "3" "4"
## [1] "character"

Whatever we did to a single number before, we are doing to every number in that vector.
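A minimal sketch of the vectors above. The names match the column headers that show up in the table a bit later. Note what happens to the mixed one: R converts every element to a character.

```r
charVector <- c("blue", "yellow", "green", "red")
numVector <- c(1, 2, 3, 4)
mixVector <- c(1, "dog", 55, "tree")   # numbers and text mixed together

# A vector holds one type only, so the mixed vector ends up as character
print(class(mixVector))

# as.character() is applied to every element of the vector at once
numCharVector <- as.character(numVector)
print(numCharVector)
print(class(numCharVector))
```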

Now, do you remember that comparison to the columns of a data table that I brought up? Let me show you. Let’s create a table out of these vectors:

##   charVector numVector mixVector numCharVector
## 1       blue         1         1             1
## 2     yellow         2       dog             2
## 3      green         3        55             3
## 4        red         4      tree             4

We were able to create this table because the number of records in all these vectors is the same (4). Don’t worry about it now, we will cover data tables soon. Another note though: in this book, I use ‘data tables’ and ‘data frames’ interchangeably.
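Building that table can be sketched like this. The name newTable is the one used later in this chapter; data.frame() is the base R function for the job.

```r
charVector <- c("blue", "yellow", "green", "red")
numVector <- c(1, 2, 3, 4)
mixVector <- c(1, "dog", 55, "tree")
numCharVector <- as.character(numVector)

# Each vector becomes one column; the column names come from the variable names
newTable <- data.frame(charVector, numVector, mixVector, numCharVector)
print(newTable)
```

If the vectors had different lengths (say 4 and 3), data.frame() would stop with an error instead of building a lopsided table.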

This is about all you need to know about vectors at the moment. I don’t know about you, but for me, it took quite some time to get them. Anyway, we will not be doing anything crazy with vectors anytime soon. I showed you that we can apply functions to them, but won’t be doing that either. We kind of will, but it will be in the context of applying them to columns of a dataframe and not just separate vectors. I think, this is enough for now.

0.7.8 Lists

Lists are similar to vectors in that they also store data inside. They are more complicated and are harder to understand right away. Lists store data hierarchically and apart from storing things like characters and numbers, they can also store vectors, data tables, and even other lists. They become very useful when you start working with loops. To give you an example, if you had fifty data tables and you wanted to do the same operation to all of them (something like converting all headers to upper case), instead of doing that shit fifty times, you would store all these data tables in a list, loop over it and apply that operation not just once, but to all elements of the list. It is an intermediate technique, so I don’t expect you to follow too much right now. Let me show you a few examples instead, so you can see how lists are different:

Now, do me a favor and create the same four vectors that we did in the previous section. Also, create a table with those four vectors, just like we did. Name it ‘table’.

Now that you have prepared everything, let’s create some lists.

Just with simple data types:

## [1] "list"
## [[1]]
## [1] "a"
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] "b"
## 
## [[4]]
## [1] 55
## 
## [[5]]
## [1] "100"

The structure above is a bit more complex than the ones we have looked at so far. The double square brackets indicate the position of the item in the list. They let us access that element, for example when looping. On top of that, you can go even deeper and access the values inside of a list’s item. To access ‘a’ we need to do the following:

## [1] "a"

Double brackets access the first element of the list and the single brackets give us the first element inside, which is ‘a’.
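The simple list and the bracket access can be sketched like this. The name myList is an assumption; the five items are the ones shown in the printed output.

```r
# A list can hold values of different types without converting them
myList <- list("a", 1, "b", 55, "100")
print(class(myList))

# [[1]] pulls out the first item of the list;
# [1] then takes the first element inside that item
print(myList[[1]][1])

# Unlike in a vector, the number 1 stays a number inside the list
print(class(myList[[2]]))
```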

It will only get more complex from now, so I will stop printing the results to save space. You should still poke around and see what the following lists are all about:

List of vectors:

List of vectors + data tables:

List of tables:

List of lists:

As you can see, lists can get quite complex very quickly. They can be very useful when you are ready to use them. We aren’t ready and won’t be using them any time soon. The only reason I showed them to you is for your overall understanding of the R data types. We’ll definitely come back to lists in time. For now, let’s move on to the topic of data tables. That is the one we will be using right away and a lot.
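For poking around, here is a minimal sketch of the four kinds of lists, plus the loop idea from the start of this section (upper-casing the headers of every table in a list). All names and values here are assumptions made up for illustration.

```r
numVector <- c(1, 2, 3, 4)
charVector <- c("blue", "yellow", "green", "red")
tableA <- data.frame(charVector, numVector)
tableB <- data.frame(id = 1:2, name = c("x", "y"))

listOfVectors <- list(numVector, charVector)
listOfMixed <- list(numVector, tableA)             # vectors + data tables
listOfTables <- list(tableA, tableB)
listOfLists <- list(listOfVectors, listOfTables)

# Loop over the list and apply the same operation to every table in it
for (i in seq_along(listOfTables)) {
  colnames(listOfTables[[i]]) <- toupper(colnames(listOfTables[[i]]))
}
print(colnames(listOfTables[[1]]))
print(colnames(listOfTables[[2]]))
```

The same loop would work unchanged for fifty tables, which is the whole point of putting them in a list.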

0.7.9 Data Tables (Data Frames)

Remember, a few topics ago, I told you to think of vectors as just columns inside data tables? Well, now, here is my (and not only my) definition of data frames (or data tables; as you might have noticed, we are using the terms interchangeably).

A data frame is a collection of vectors of the same length. That’s it. Why the same length? Because, imagine a table, it is a rectangle, right? Even if some cells are empty, every column still has the same number of rows.

One of the best things about working with R, if not the best, is its handling of data tables. If you have ever worked with Excel or some other tabular data handler, then R’s way of dealing with such data will be very intuitive for you as well. We will be working with tables a lot. We will use them for everything, even for things that we don’t really need them for. For example, sometimes it is much faster and more efficient to do things with vectors, but we will be converting vectors into tables and then doing things to them instead. But why? Because tables are intuitive and vectors are confusing. I am trying to explain things to you the way I wish they were explained to me. If you are a genius who gets everything right away, then you don’t need this book. Get out of here. For the rest of us, tabular analysis is key to understanding everything. Through working with tables, you will eventually get the rest of the data structures and will decide when and what to use on your own. Let’s look at the tables that we have already created.

We already have a data frame. Let’s use it:

##   charVector numVector mixVector numCharVector
## 1       blue         1         1             1
## 2     yellow         2       dog             2
## 3      green         3        55             3
## 4        red         4      tree             4

When you use the function print(), it prints everything in the console (view 3). You can review your outputs like that, but it is not very practical. The better way is to use the Environment (view 2). Select the ‘newTable’ in your Environment. You should see the same table, but as a separate full screen tab. Point at the column names and hold your pointer there for a second, you should see the data type of that column. There isn’t much more that you can do with this view, but that’s why it’s great. It does not have a ton of options to confuse you. Close it.

Let’s do a couple of things to our new table.

Removing first row and fourth column:

It’s confusing in the beginning, but this syntax: ‘table[rows , columns]’ is standard and the most intuitive for dealing with tables. Within square brackets, the left side of the comma deals with rows and the right side with columns. So, if we wanted to eliminate the second row and the second column we would write: ‘newTable[-2,-2]’ and if we wanted to eliminate the first and the second rows and the first column, we would write: ‘newTable[-c(1,2),-1]’. You get the idea.
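The bracket syntax can be sketched like this. The table is rebuilt here so the example runs on its own, and the name trimmed is an assumption.

```r
newTable <- data.frame(
  charVector = c("blue", "yellow", "green", "red"),
  numVector = c(1, 2, 3, 4),
  mixVector = c("1", "dog", "55", "tree"),
  numCharVector = c("1", "2", "3", "4")
)

# Negative numbers drop things: here, row 1 and column 4
trimmed <- newTable[-1, -4]
print(trimmed)

# Drop rows 1 and 2 and column 1
print(newTable[-c(1, 2), -1])

# An empty side of the comma means "keep everything":
# all rows, only columns 1 and 2
print(newTable[, c(1, 2)])
```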

Now, let’s rename the columns:

If you want to rename just one column:

##     col1 Sherman col3
## 2 yellow       2  dog
## 3  green       3   55
## 4    red       4 tree

Let’s also count the number of rows and columns:

## [1] 3
## [1] 3
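Renaming and counting can be sketched like this. The table is rebuilt in its trimmed three-by-three form so the example runs on its own; ‘col1’, ‘col3’ and ‘Sherman’ are the names from the printed output, while ‘col2’ is an assumed placeholder for the name that gets replaced.

```r
newTable <- data.frame(
  charVector = c("yellow", "green", "red"),
  numVector = c(2, 3, 4),
  mixVector = c("dog", "55", "tree")
)

# Rename all columns at once
colnames(newTable) <- c("col1", "col2", "col3")

# Rename just the second column
colnames(newTable)[2] <- "Sherman"
print(newTable)

# Count rows and columns
print(nrow(newTable))
print(ncol(newTable))
```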

These were some of the basics of data frames. It’s all you need to know for now. If I start to show you more shit now, I will lose your attention, because you’ll stop following at some point. I am telling you; we are going to be working with data frames so much that it will be the first thing that you master.

Let’s summarize what we have looked at, and hopefully learned, in this section. First, we looked at R’s basic data types, which are characters, numbers, booleans, factors, as well as dates. We also applied a few functions to them and saw how we can convert one type to another. Then, we looked at the main data structures. These are vectors, lists, and data frames (also known as data tables). We created them, used one as a part of the other, and saw how we can modify them at will. You might have found some of that boring as fuck, but you need to know this stuff to proceed. In the next section, we’re going to talk about functions (not the ones that you write for yourself, but the existing ones), libraries (packages), file formats, loops, and SQL queries.

0.8 Coding Tools

We are done talking about the data types and structures. In this section, I would like to cover some ground on coding tools. I wasn’t sure what to name this section so I picked coding tools. Things like functions, libraries, loops, file formats, and SQL queries aren’t really data types or structures, nor are they coding, strictly speaking. Whatever, just fucking stick with coding tools.

0.8.1 Functions

We are going to split functions into two types: ones that you wrote, and ones that were written for you. Forget about writing your own functions for now. Here, we will be covering the second type. R has tons of functions. One of the best things about R is that it’s open source. Anybody can write a function, package it, and release it. Because of that, there are functions for almost anything. The flip side of that is that there is a bunch of functions that do the same shit. This makes it almost impossible to have one guiding rule for language usage - there is no one right way to do things. It also creates a lot of competition between package creators, as each tries to be the shit who created the best tool for the job.

I personally enjoy knowing more than one way to skin a cat (This is a joke!!! I do not support animal cruelty!!! I am serious!!!), but you might be different. There are distinct schools of thought in R, though. It’s not complete chaos so don’t worry. We will be following the two biggest ones.

Anyway, there are also two types of functions (out of those that are written for you): built-in and from the outside. The built-in (also known as base) are the functions that come pre-installed with R. You can do a lot with just them. You can do all the math and stats operations, dates, basic plots, and more. That would be fucking stupid though. I am sure, there are purists who do that. Fuck them. R is powerful and amazing because of the wealth of third-party functions that are available for free to everyone. There are functions for everything. I, obviously, can’t show you all the functions, and there is no need to dive into them now, but I will show you a few as an example.

We already installed and loaded the libraries that we will be using. But, in case you started a new session or something, let’s load them again.

We have already applied a few functions here and there before: print(), substr(), trimws(), ymd(), as.Date(), colnames(), and others are all functions.

Let’s use a few more. First, let’s create a table with three columns: one with characters, a second with numbers, and a third with numeric dates.
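The mangled column names in the output below suggest how the table was probably built: with assignments written inside data.frame(). Here is a sketch under that assumption; the name df is mine.

```r
# Each argument is an assignment, not a name = value pair, so data.frame()
# builds the column names from the text of each expression
df <- data.frame(a <- letters[1:4], b <- seq(1.345, 4.345), c <- seq(20190101, 20190104))
print(df)
```

Writing `a = letters[1:4]` instead of `a <- letters[1:4]` inside data.frame() would have named the column ‘a’ directly.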

##   a....letters.1.4. b....seq.1.345..4.345. c....seq.20190101..20190104.
## 1                 a                  1.345                     20190101
## 2                 b                  2.345                     20190102
## 3                 c                  3.345                     20190103
## 4                 d                  4.345                     20190104

We used four functions here:

Ok, two things right away. Column names are messed up, and I also want to show you how to access columns of data tables in R.

Fixing column names:

To access a column of a dataframe, you will use the $ operator between the dataframe’s name and the column’s name. Like this:

Working with numbers. Rounding all numbers in that column to two decimals:

This section is a good place to talk about arguments in functions. Arguments are the things that you pass to a function. There can be any number of arguments in a function; it depends on the person who wrote the function. For example, the function round() that we just used accepts two arguments. Number 1: a number or a vector of numbers. Number 2: the number of decimal places to round to. In this function, if you do not specify the second argument it defaults to 0. That is called a default argument. The first argument here is, of course, mandatory. Without it, R will throw an error. Functions can have different numbers of arguments. Some are more important than others. Every time you’re about to use a new function, google it first to see its main arguments and avoid unintended results.
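The column fix, the $ operator and the round() arguments can be sketched together. The name df and the value 2.718 are assumptions for illustration; the column names are the ones from the printed tables.

```r
df <- data.frame(a <- letters[1:4], b <- seq(1.345, 4.345), c <- seq(20190101, 20190104))

# Fix the messed-up column names
colnames(df) <- c("letters", "numbers", "dates")

# The $ operator pulls one column out of the table as a vector
print(df$numbers)

# round(x, digits): digits is optional and defaults to 0
print(round(2.718))               # whole number
print(round(2.718, 2))            # two decimals
print(round(2.718, digits = 1))   # arguments can also be passed by name

# Round a whole column and store the result back in the table
df$numbers <- round(df$numbers, 2)
```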

Let’s go back to the examples.

Creating another column and rounding all numbers down to the whole number using the function floor().

Same, but rounded up. Using the function ceiling().
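A sketch of both steps, assuming the table already holds the rounded `numbers` column and that the new columns are named as in the printout (roundDown, roundUp):

```r
data <- data.frame(letters = c("a", "b", "c", "d"),
                   numbers = c(1.34, 2.34, 3.34, 4.34),
                   dates   = seq(20190101, 20190104),
                   stringsAsFactors = FALSE)

# floor() always goes down to the nearest whole number...
data$roundDown <- floor(data$numbers)

# ...and ceiling() always goes up.
data$roundUp <- ceiling(data$numbers)

print(data)
```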

##   letters numbers    dates roundDown roundUp
## 1       a    1.34 20190101         1       2
## 2       b    2.34 20190102         2       3
## 3       c    3.34 20190103         3       4
## 4       d    4.34 20190104         4       5

Here, we are using the function mean() to print the mean of the column ‘numbers’. We are not storing the result anywhere. Instead, we are printing it right away.

## [1] 2.84

Using the function sum() to print the sum without storing.

## [1] 11.36

Printing min/max without storing.

## [1] 1.34
## [1] 4.34
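The four summaries above, sketched in one place. The column values are copied from the printout; nothing is stored, every result goes straight to print():

```r
data <- data.frame(numbers = c(1.34, 2.34, 3.34, 4.34))

print(mean(data$numbers))   # average of the column
print(sum(data$numbers))    # total of the column
print(min(data$numbers))    # smallest value
print(max(data$numbers))    # largest value
```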

Now, let’s do something to the character column.

Using the function toupper(), we are changing the column ‘letters’ to upper case.

##   letters numbers    dates roundDown roundUp
## 1       A    1.34 20190101         1       2
## 2       B    2.34 20190102         2       3
## 3       C    3.34 20190103         3       4
## 4       D    4.34 20190104         4       5

Switching back by using the function tolower().
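Both directions in one sketch, assuming the column is called `letters` as in the printout:

```r
data <- data.frame(letters = c("a", "b", "c", "d"),
                   stringsAsFactors = FALSE)

# Overwrite the column with its upper-case version.
data$letters <- toupper(data$letters)
print(data$letters)   # "A" "B" "C" "D"

# And switch it back to lower case.
data$letters <- tolower(data$letters)
print(data$letters)   # "a" "b" "c" "d"
```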

Let’s add the string ‘ test’ to our letters by using the function paste0().
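A minimal sketch, assuming the same `letters` column. Note that the space is part of the string being appended:

```r
data <- data.frame(letters = c("a", "b", "c", "d"),
                   stringsAsFactors = FALSE)

# paste0() glues its arguments together with nothing in between,
# so the leading space in " test" is what separates the two words.
data$letters <- paste0(data$letters, " test")

print(data$letters)   # "a test" "b test" "c test" "d test"
```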

##   letters numbers    dates roundDown roundUp
## 1  a test    1.34 20190101         1       2
## 2  b test    2.34 20190102         2       3
## 3  c test    3.34 20190103         3       4
## 4  d test    4.34 20190104         4       5

Splitting one column into two at the empty space in the middle (a little advanced).
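A sketch of that split, assuming the tidyr package is installed. It uses tidyr’s separate() with its `sep` argument to say where to cut:

```r
library(tidyr)   # provides separate() and re-exports the %>% pipe

data <- data.frame(letters = c("a test", "b test", "c test", "d test"),
                   numbers = c(1.34, 2.34, 3.34, 4.34),
                   stringsAsFactors = FALSE)

# Split 'letters' at the space into two new columns.
# The original 'letters' column is dropped and replaced by the new ones.
data <- data %>% separate(letters, into = c("columnA", "columnB"), sep = " ")

print(data)
```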

##   columnA columnB numbers    dates roundDown roundUp
## 1       a    test    1.34 20190101         1       2
## 2       b    test    2.34 20190102         2       3
## 3       c    test    3.34 20190103         3       4
## 4       d    test    4.34 20190104         4       5

This is a good time to introduce the ‘%>%’ (pipe operator). The pipe operator will be very important for us. We will be using it in this book a lot and will continue using it all the way through the last book, where we will be doing some very advanced asynchronous programming. The pipe is what glues all of that together. Very important.

Now, here is what it means. Think of %>% (pipe) operator as the word ‘THEN’. Let’s look at the operation we executed above:

  • STEP 2) data <-
  • STEP 1) data %>% separate(letters, into = c("columnA", "columnB"), sep = " ")
  • STEP 3) print(data).

Step by step:

    1. take data, then separate the column ‘letters’ into the columns ‘columnA’ and ‘columnB’ at the empty space.
    2. store the result in data.
    3. print data.

Apart from the ordering being not from left to right, it is pretty fucking straightforward. Now, on top of being somewhat easy to understand, this type of syntax is also very efficient. You can chain lots of operations like that. I know that you don’t give a shit about efficiency at the moment, you are just trying to get this to work. Therefore, here is a very simple scheme:

dataframe <- dataframe %>% function(column) %>% function(column) %>% ….

You can basically chain as many as you want. We will see that in action later.
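To make the scheme concrete, here is a hedged sketch of a three-step chain. It assumes the tidyr and dplyr packages are installed; the particular steps are picked only for illustration and are not the book’s own pipeline:

```r
library(dplyr)   # mutate()
library(tidyr)   # separate()

data <- data.frame(letters = c("a test", "b test", "c test", "d test"),
                   numbers = c(1.345, 2.345, 3.345, 4.345),
                   stringsAsFactors = FALSE)

# Read every %>% as the word THEN.
data <- data %>%
  separate(letters, into = c("columnA", "columnB"), sep = " ") %>%  # split the column, THEN
  mutate(columnA = toupper(columnA)) %>%                            # upper-case columnA, THEN
  mutate(roundDown = floor(numbers))                                # add a rounded-down column

print(data)
```

Each step receives the dataframe produced by the step before it, so you never have to store the in-between results.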

Finally, let’s convert a numeric date column into an actual date:
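A sketch of the conversion, assuming the lubridate package is installed and that ymd() is the function doing the work (it was mentioned earlier among the functions already used):

```r
library(lubridate)

data <- data.frame(dates = seq(20190101, 20190104))

# ymd() reads the digits as year-month-day and returns a real Date.
data$dates <- ymd(data$dates)

print(data$dates)
print(class(data$dates))   # "Date"
```

Once the column is a real Date, R can sort it, subtract dates from each other, and so on, which it cannot do properly with a number like 20190101.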

##   columnA columnB numbers      dates roundDown roundUp
## 1       a    test    1.34 2019-01-01         1       2
## 2       b    test    2.34 2019-01-02         2       3
## 3       c    test    3.34 2019-01-03         3       4
## 4       d    test    4.34 2019-01-04         4       5

I just showed you a fraction of what I usually use, and what we are going to be using in this book. Some of these functions were from base R and some from external packages like lubridate, tidyr, and dplyr. The concept is simple: you need to do something, you look up a function for that, you install and load the package, you use the function like this: ‘result <- function(arguments)’. Next, we will look into packages.

0.8.2 Packages (Libraries)

Packages or Libraries are just containers for functions. There are tons of libraries out there, and that is great. We pretty much covered this whole topic of how packages and functions are amazing and how they do all these different things. So, in this part, I am just going to give you a list of libraries that you are going to install and drag with you every time you launch a new project. You will see some people who will be like “Bro, it is wrong to have so many packages loaded all the time.” Do not listen, they do not know what they are talking about. Dragging a bunch of packages, even if you will not use some of them, will save you a lot of time. So, here:

There are many more libraries we will use. I will be introducing them gradually, rather than all at once. For now, just install all of these and start dragging them from script to script when you start a new project. You don’t need to look into them now.

0.8.3 Data Formats

If you have ever seen a computer, then data formats shouldn’t be anything new to you. You don’t have to be a programmer to know the term. There are a bunch of different formats out there. Formats like pdf, xlsx, txt, csv, doc, docx, and others are a day to day thing that you see anyway. Why do we care? You will be saving a lot of your analysis using some sort of files, right? Also, you need to get your data from somewhere, because you almost never generate your own. Here, I will show you the formats that we are going to be using the most.

First, let’s create a table. Same one we did before.

0.8.4 CSV

In my experience, CSV is, by far, the most used format to store data in R. It’s compact, fast, supported by Excel and other similar editors, and well supported by R. The fact that it can be opened in Excel-like software is very important. A lot of times you will be doing some analysis for your superiors. They might not know R, but they will know Excel. You should be able to send them your data in a format that they can consume.

This is how you save your table as a CSV:

  • 1. Get the path of where you want to store the file. If you do not know how: on Windows, go to that folder and you will see the path at the top. Copy it.
  • 2. Use that path as the second argument to the function, like this: fwrite(data, ‘path’).
  • 3. When you paste the path, you will see single backslashes separating the folders, like this: \a\b\c. You must change this to \\a\\b\\c or /a/b/c for it to work.

Here is my example:

Let’s read it back.
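A hedged sketch of the full round trip, assuming the data.table package is installed (fwrite() and fread() live there). The file goes into R’s temporary folder, which is a stand-in for your own path:

```r
library(data.table)

data <- data.frame(letters = c("a", "b", "c", "d"),
                   numbers = c(1.345, 2.345, 3.345, 4.345),
                   dates   = seq(20190101, 20190104),
                   stringsAsFactors = FALSE)

# Stand-in location (assumption): a file in the session's temp folder.
# On Windows your own path would look like "C:/a/b/c/example.csv".
path <- file.path(tempdir(), "example.csv")

fwrite(data, path)          # write the table to disk as CSV
data_back <- fread(path)    # read it back in

print(data_back)
```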

##    a....letters.1.4. b....seq.1.345..4.345. c....seq.20190101..20190104.
## 1:                 a                  1.345                     20190101
## 2:                 b                  2.345                     20190102
## 3:                 c                  3.345                     20190103
## 4:                 d                  4.345                     20190104

0.8.5 XLSX (EXCEL)

Similar to CSV but can have multiple tables. Basically, an Excel spreadsheet. We won’t be writing this format, but sometimes you have to work with it and you need to know how to read it in.

## Note: zip::zip() is deprecated, please use zip::zipr() instead

0.8.6 FST

This one is extremely fast. It comes from the fst package, which can use Facebook’s Zstandard compression to store huge data fast. We will be using it down the line in the next books.
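A minimal sketch of writing and reading this format, assuming the fst package is installed. The temp-folder path is again only a stand-in:

```r
library(fst)

data <- data.frame(letters = c("a", "b", "c", "d"),
                   numbers = c(1.345, 2.345, 3.345, 4.345),
                   stringsAsFactors = FALSE)

# Stand-in location (assumption): a file in the session's temp folder.
path <- file.path(tempdir(), "example.fst")

write_fst(data, path)          # save the dataframe
data_back <- read_fst(path)    # load it again

print(data_back)
```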

There are also other popular formats that we will not be working with. They are RDS, JSON, and XML. You can look them up yourself.

As you can see, reading and writing these formats is easy. On top of that, all these functions are pretty much the same. CSV will be our number one, fst will be the distant second. You get the idea. All these file types are good for saving and sharing your data on small and local scale. You’ll use these to save some data for a future analysis, or to send to your boss, something like that. The way big boys work with data is through databases. In the next and final introductory section, I will show you how some of that is done.

0.8.7 SQL Queries

SQL is extremely important. You won’t go far without it. I won’t teach you the language itself in this book, but I’ll show you the most used commands and how to execute them. Here, I want to show you how to connect to a database and retrieve some data. I’ve set up a practice MySQL database, and, from now on, we will be interacting with it a lot. Let’s connect to it and retrieve some data. Before we do though, I want to say a few things about databases.

This is how I think about the types of databases out there.

  • There are old piece of shit databases that, if you’re unlucky, you will be forced to work with at work. That would happen if your organization has old infrastructure and isn’t planning to change it. There won’t be a difference in SQL language between fast and slow databases so it doesn’t matter. You won’t be wasting time learning it, you will just struggle a lot with speed.
  • If you are lucky, your database will be extremely fast, so fast that you will start to think that all databases are that fast. Such databases are not the most popular, because they are usually designed for some specific tasks and speed. They DO look like regular databases at first, but at some point, you will encounter their limitations - something small, something that you will need to do but it just will not be able to do. They might also be not free to use. I will show you such databases in the future and even teach you to set them up. Not now, though.

First, let’s connect to our database. For this, we will use a function from the library ‘DBI’. Even though we never loaded this package ourselves, it was attached automatically when we loaded RMySQL. This happens all the time. The same goes for installing: when you install a package, R checks whether that package relies on other packages to work. If it does and they are not installed, R will install them automatically.

Without the confusing comments and spaces, it looks like this:

I am not going to explain each argument, because you do not need that at the moment. This setup will work for now. At work, this will be provided for you by a database administrator or something. We WILL be setting up our own databases in the future books, but it is too advanced right now.

So, when you execute the connection, it gets stored in your environment. It kind of just sits there until you disconnect. But now, you can use it to get the data from the database.

First, let’s look at the list of tables inside:
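The practice MySQL server and its credentials are not reproduced here, so this sketch swaps in a throwaway in-memory SQLite database (an assumption, it needs the DBI and RSQLite packages). The DBI functions are the same ones you would call on a MySQL connection. The toy `book_table` holds just the three rows that appear in a printout later in this section:

```r
library(DBI)

# Stand-in (assumption): an in-memory SQLite database instead of the MySQL server.
connection <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy copy of the table, so there is something to look at.
book_table <- data.frame(
  `vehicle vin number` = c("4T1BK1EB5FU154526", "2LMHJ5AT3ABJ10630", "5TDZK3EH0DS100457"),
  `vehicle year`       = c(2015, 2010, 2013),
  `last date updated`  = rep("10/24/2019", 3),
  check.names = FALSE, stringsAsFactors = FALSE
)
dbWriteTable(connection, "book_table", book_table)

# List every table the connection can see.
tables <- dbListTables(connection)
print(tables)

dbDisconnect(connection)
```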

Now, let’s take the table ‘book_table’ and check the first 5 rows in there.

I want to spend some time here and take a look at what we just did. The overall syntax should be more or less clear to you now: we are applying some function on the right side of the arrow ( dbGetQuery(connection, "SELECT * FROM book_table limit 5") ) and storing the result inside of the variable on the left side (data). The function that we are using is ‘dbGetQuery’. It has two arguments: the database connection, and the actual SQL command. The connection should be clear to you: we connected to the database with the credentials and stored that connection in the variable connection. That’s it. Now, about the actual SQL command: "SELECT * FROM book_table limit 5". This is actual SQL language. Basically, we just used another programming language inside of R. This is how SQL works: there are key commands, and there are inputs. In our case, key commands are SELECT, FROM, LIMIT. The inputs are: *, book_table, 5.

Let’s go line by line:

  • ‘SELECT’ - every SQL query that reads data will start with SELECT
  • ‘*’ - means all or everything
  • ‘FROM’ - will also be there every time you read from a table
  • ‘book_table’ - specifying the table
  • ‘LIMIT’ - lets us set the number of rows to pull
  • ‘5’ - the number for the limit
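The pieces listed above, put together in a runnable sketch. As an assumption, it uses an in-memory SQLite database with a three-row toy `book_table` instead of the practice MySQL server, so the limit is set to 2 to make its effect visible:

```r
library(DBI)

# Stand-in (assumption): in-memory SQLite with a toy copy of book_table.
connection <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(connection, "book_table", data.frame(
  `vehicle vin number` = c("4T1BK1EB5FU154526", "2LMHJ5AT3ABJ10630", "5TDZK3EH0DS100457"),
  `vehicle year`       = c(2015, 2010, 2013),
  `last date updated`  = rep("10/24/2019", 3),
  check.names = FALSE, stringsAsFactors = FALSE
))

# SELECT * (all columns) FROM book_table, but only the first 2 rows.
data_limited <- dbGetQuery(connection, "SELECT * FROM book_table LIMIT 2")
print(data_limited)

# Same query without LIMIT pulls the whole table.
data_all <- dbGetQuery(connection, "SELECT * FROM book_table")
print(nrow(data_all))

dbDisconnect(connection)
```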

The key commands let us filter the database to only extract what we need. The usual pattern is like this: SELECT something FROM datatable WHERE something GROUP BY something.

If you got that, perfect. If not, it does not matter. It took me a while to even be able to execute a query, let alone understand the commands. For now, we will only be using SQL to pull entire tables, without filtering anything. Later, we will gradually start adding key commands to our queries. Let’s pull the whole table:

If we can just pull the entire table like that, why did I show you the limit thing, and why do we need SQL filtering and all that extra shit? Even with 100,000 rows, the table that we just pulled is considered tiny by big data standards. This table took a second to pull and about 20 MB of your RAM. Imagine you are working with a table that stores daily porn searches. It probably has billions of rows added every day. Pulling big tables like that will crash your computer every time in minutes. To avoid that, you must be able to pre-aggregate your pulls using SQL. There you go. We are not doing any porn aggregation yet so we are going to be ok with pulling the entire thing.

Let me show you a few more pulls just for fun.

Pulling just vin, year, and record date. I will be surrounding the names of the columns in backticks (`). Backticks are needed when column names contain spaces. Something like ‘vehicle_year’ would be ok without them. Generally, avoid using spaces when naming things in programming.

Remember! There is no comma before FROM!
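A sketch of such a pull. Assumptions: an in-memory SQLite stand-in for the MySQL database (SQLite accepts the same backtick quoting), the three column names taken from the printout, and one extra made-up column so that selecting only some columns actually changes something:

```r
library(DBI)

# Stand-in (assumption): in-memory SQLite with a toy copy of book_table.
connection <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(connection, "book_table", data.frame(
  `vehicle vin number` = c("4T1BK1EB5FU154526", "2LMHJ5AT3ABJ10630", "5TDZK3EH0DS100457"),
  `vehicle year`       = c(2015, 2010, 2013),
  `last date updated`  = rep("10/24/2019", 3),
  `made up column`     = c("x", "y", "z"),   # hypothetical, only here to be left out
  check.names = FALSE, stringsAsFactors = FALSE
))

# Column names with spaces go inside backticks, separated by commas.
# Note: no comma after the last column, right before FROM.
data <- dbGetQuery(connection, "
  SELECT `vehicle vin number`, `vehicle year`, `last date updated`
  FROM book_table
")

print(head(data, 3))

dbDisconnect(connection)
```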

Let’s see the first few rows. head() lets you select the first n rows.

##   vehicle vin number vehicle year last date updated
## 1  4T1BK1EB5FU154526         2015        10/24/2019
## 2  2LMHJ5AT3ABJ10630         2010        10/24/2019
## 3  5TDZK3EH0DS100457         2013        10/24/2019

Let’s count the number of VIN numbers by vehicles’ year.
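One way to do that count is inside the query itself, with COUNT and GROUP BY. This is a sketch under assumptions: an in-memory SQLite stand-in for the MySQL database, and a toy table where the last VIN is invented so that one year shows up twice:

```r
library(DBI)

# Stand-in (assumption): in-memory SQLite with a toy copy of book_table.
connection <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(connection, "book_table", data.frame(
  `vehicle vin number` = c("4T1BK1EB5FU154526", "2LMHJ5AT3ABJ10630",
                           "5TDZK3EH0DS100457", "MADEUPVIN00000001"),  # last one is invented
  `vehicle year`       = c(2015, 2010, 2013, 2015),
  check.names = FALSE, stringsAsFactors = FALSE
))

# For every distinct year, count how many VINs fall into it.
counts <- dbGetQuery(connection, "
  SELECT `vehicle year`, COUNT(`vehicle vin number`) AS n
  FROM book_table
  GROUP BY `vehicle year`
  ORDER BY `vehicle year`
")

print(counts)

dbDisconnect(connection)
```

The database does the counting and sends back one row per year, which is exactly the kind of pre-aggregation that keeps big pulls from killing your RAM.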

You should have noticed that some years are messed up. This is called ‘dirty data’. This data is not super ‘dirty’, but it still needs cleaning.

You should always disconnect from the database after you are done using it, because if you and hundreds of parasites like you do not, the database will freeze at some point.

## [1] TRUE

This was a basic introduction to SQL. If you did not get all of it, it is fine. SQL is not hard but the number of new functions, signs, and characters can be overwhelming considering that you are also trying to memorize the rest of the stuff. When I was learning this, all these queries had layers of other functions attached to them. Very confusing. Remember me writing this: print(head(something,5))?

Well, that is a chained function. When it is short like this, you still can decipher it. You are like: ‘Ok, maybe head takes first five rows and then we print it’. But now, imagine something like this:

The funny thing is that this query is exactly the same as the last one that we wrote, but it has all that extra shit on top of it. Misplace one comma or parenthesis and your whole code is fucked. I would be sitting for hours getting bombarded by fucking errors because of some stupid shit like that. When you are new, you just do not know what each dot or whatever means and that is natural. We will practice with it much more in this book.

You are done with the basics of R. Whatever was in this section is enough to get you started. We have looked at data types, structures, type conversions, functions, libraries, file formats, and SQL queries. That is a lot of boring, but absolutely necessary slush. There are a couple more things that I should have covered here but decided to leave for later: loops and writing your own functions. We will not be using them any time soon. I will still talk about them later in this book, because you do need to know about them eventually. We are going back to the story. Tomorrow is Monday, you are going back to work. You have covered a lot of R ground and you think that whatever your boss throws at you tomorrow will be a piece of cake.


Creative Commons License
R, Not the Best Practices by Nikita Voevodin is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.