Chapter 8 Basic programming
Machine dreams hold a special vertigo.
–William Gibson133
Up to this point in the book I’ve tried hard to avoid using the word “programming” too much because – at least in my experience – it’s a word that can cause a lot of fear. For one reason or another, programming (like mathematics and statistics) is often perceived by people on the “outside” as a black art, a magical skill that can be learned only by some kind of super-nerd. I think this is a shame. It’s certainly true that advanced programming is a very specialised skill: several different skills actually, since there’s quite a lot of different kinds of programming out there. However, the basics of programming aren’t all that hard, and you can accomplish a lot of very impressive things just using those basics.
With that in mind, the goal of this chapter is to discuss a few basic programming concepts and how to apply them in R. However, before I do, I want to make one further attempt to point out just how non-magical programming really is, via one very simple observation: you already know how to do it. Stripped to its essentials, programming is nothing more (and nothing less) than the process of writing out a bunch of instructions that a computer can understand. To phrase this slightly differently, when you write a computer program, you need to write it in a programming language that the computer knows how to interpret. R is one such language. Although I’ve been having you type all your commands at the command prompt, and all the commands in this book so far have been shown as if that’s what I were doing, it’s also quite possible (and as you’ll see shortly, shockingly easy) to write a program using these R commands. In other words, if this is the first time reading this book, then you’re only one short chapter away from being able to legitimately claim that you can program in R, albeit at a beginner’s level.
8.1 Scripts
Computer programs come in quite a few different forms: the kind of program that we’re mostly interested in from the perspective of everyday data analysis using R is known as a script. The idea behind a script is that, instead of typing your commands into the R console one at a time, instead you write them all in a text file. Then, once you’ve finished writing them and saved the text file, you can get R to execute all the commands in your file by using the source()
function. In a moment I’ll show you exactly how this is done, but first I’d better explain why you should care.
8.1.1 Why use scripts?
Before discussing scripting and programming concepts in any more detail, it’s worth stopping to ask why you should bother. After all, if you look at the R commands that I’ve used everywhere else this book, you’ll notice that they’re all formatted as if I were typing them at the command line. Outside this chapter you won’t actually see any scripts. Do not be fooled by this. The reason that I’ve done it that way is purely for pedagogical reasons. My goal in this book is to teach statistics and to teach R. To that end, what I’ve needed to do is chop everything up into tiny little slices: each section tends to focus on one kind of statistical concept, and only a smallish number of R functions. As much as possible, I want you to see what each function does in isolation, one command at a time. By forcing myself to write everything as if it were being typed at the command line, it imposes a kind of discipline on me: it prevents me from piecing together lots of commands into one big script. From a teaching (and learning) perspective I think that’s the right thing to do… but from a data analysis perspective, it is not. When you start analysing real world data sets, you will rapidly find yourself needing to write scripts.
To understand why scripts are so very useful, it may be helpful to consider the drawbacks to typing commands directly at the command prompt. The approach that we’ve been adopting so far, in which you type commands one at a time, and R sits there patiently in between commands, is referred to as the interactive style. Doing your data analysis this way is rather like having a conversation … a very annoying conversation between you and your data set, in which you and the data aren’t directly speaking to each other, and so you have to rely on R to pass messages back and forth. This approach makes a lot of sense when you’re just trying out a few ideas: maybe you’re trying to figure out what analyses are sensible for your data, or maybe just you’re trying to remember how the various R functions work, so you’re just typing in a few commands until you get the one you want. In other words, the interactive style is very useful as a tool for exploring your data. However, it has a number of drawbacks:
It’s hard to save your work effectively. You can save the workspace, so that later on you can load any variables you created. You can save your plots as images. And you can even save the history or copy the contents of the R console to a file. Taken together, all these things let you create a reasonably decent record of what you did. But it does leave a lot to be desired. It seems like you ought to be able to save a single file that R could use (in conjunction with your raw data files) and reproduce everything (or at least, everything interesting) that you did during your data analysis.
It’s annoying to have to go back to the beginning when you make a mistake. Suppose you’ve just spent the last two hours typing in commands. Over the course of this time you’ve created lots of new variables and run lots of analyses. Then suddenly you realise that there was a nasty typo in the first command you typed, so all of your later numbers are wrong. Now you have to fix that first command, and then spend another hour or so combing through the R history to try and recreate what you did.
You can’t leave notes for yourself. Sure, you can scribble down some notes on a piece of paper, or even save a Word document that summarises what you did. But what you really want to be able to do is write down an English translation of your R commands, preferably right “next to” the commands themselves. That way, you can look back at what you’ve done and actually remember what you were doing. In the simple exercises we’ve engaged in so far, it hasn’t been all that hard to remember what you were doing or why you were doing it, but only because everything we’ve done could be done using only a few commands, and you’ve never been asked to reproduce your analysis six months after you originally did it! When your data analysis starts involving hundreds of variables, and requires quite complicated commands to work, then you really, really need to leave yourself some notes to explain your analysis to, well, yourself.
It’s nearly impossible to reuse your analyses later, or adapt them to similar problems. Suppose that, sometime in January, you are handed a difficult data analysis problem. After working on it for ages, you figure out some really clever tricks that can be used to solve it. Then, in September, you get handed a really similar problem. You can sort of remember what you did, but not very well. You’d like to have a clean record of what you did last time, how you did it, and why you did it the way you did. Something like that would really help you solve this new problem.
It’s hard to do anything except the basics. There’s a nasty side effect of these problems. Typos are inevitable. Even the best data analyst in the world makes a lot of mistakes. So the chance that you’ll be able to string together dozens of correct R commands in a row are very small. So unless you have some way around this problem, you’ll never really be able to do anything other than simple analyses.
It’s difficult to share your work other people. Because you don’t have this nice clean record of what R commands were involved in your analysis, it’s not easy to share your work with other people. Sure, you can send them all the data files you’ve saved, and your history and console logs, and even the little notes you wrote to yourself, but odds are pretty good that no-one else will really understand what’s going on (trust me on this: I’ve been handed lots of random bits of output from people who’ve been analysing their data, and it makes very little sense unless you’ve got the original person who did the work sitting right next to you explaining what you’re looking at)
Ideally, what you’d like to be able to do is something like this… Suppose you start out with a data set myrawdata.csv
. What you want is a single document – let’s call it mydataanalysis.R
– that stores all of the commands that you’ve used in order to do your data analysis. Kind of similar to the R history but much more focused. It would only include the commands that you want to keep for later. Then, later on, instead of typing in all those commands again, you’d just tell R to run all of the commands that are stored in mydataanalysis.R
. Also, in order to help you make sense of all those commands, what you’d want is the ability to add some notes or comments within the file, so that anyone reading the document for themselves would be able to understand what each of the commands actually does. But these comments wouldn’t get in the way: when you try to get R to run mydataanalysis.R
it would be smart enough would recognise that these comments are for the benefit of humans, and so it would ignore them. Later on you could tweak a few of the commands inside the file (maybe in a new file called mynewdatanalaysis.R
) so that you can adapt an old analysis to be able to handle a new problem. And you could email your friends and colleagues a copy of this file so that they can reproduce your analysis themselves.
In other words, what you want is a script.
8.1.2 Our first script
Okay then. Since scripts are so terribly awesome, let’s write one. To do this, open up a simple text editing program, like TextEdit (on a Mac) or Notebook (on Windows). Don’t use a fancy word processing program like Microsoft Word or OpenOffice: use the simplest program you can find. Open a new text document, and type some R commands, hitting enter after each command. Let’s try using x <- "hello world"
and print(x)
as our commands. Then save the document as hello.R
, and remember to save it as a plain text file: don’t save it as a word document or a rich text file. Just a boring old plain text file. Also, when it asks you where to save the file, save it to whatever folder you’re using as your working directory in R. At this point, you should be looking at something like Figure 8.1. And if so, you have now successfully written your first R program. Because I don’t want to take screenshots for every single script, I’m going to present scripts using extracts formatted as follows:
## --- hello.R
x <- "hello world"
print(x)
The line at the top is the filename, and not part of the script itself. Below that, you can see the two R commands that make up the script itself. Next to each command I’ve included the line numbers. You don’t actually type these into your script, but a lot of text editors (including the one built into Rstudio that I’ll show you in a moment) will show line numbers, since it’s a very useful convention that allows you to say things like “line 1 of the script creates a new variable, and line 2 prints it out”.
So how do we run the script? Assuming that the hello.R
file has been saved to your working directory, then you can run the script using the following command:
source( "hello.R" )
If the script file is saved in a different directory, then you need to specify the path to the file, in exactly the same way that you would have to when loading a data file using load()
. In any case, when you type this command, R opens up the script file: it then reads each command in the file in the same order that they appear in the file, and executes those commands in that order. The simple script that I’ve shown above contains two commands. The first one creates a variable x
and the second one prints it on screen. So, when we run the script, this is what we see on screen:
source("./rbook-master/scripts/hello.R")
## [1] "hello world"
If we inspect the workspace using a command like who()
or objects()
, we discover that R has created the new variable x
within the workspace, and not surprisingly x
is a character string containing the text "hello world"
. And just like that, you’ve written your first program R. It really is that simple.
8.1.3 Using Rstudio to write scripts
In the example above I assumed that you were writing your scripts using a simple text editor. However, it’s usually more convenient to use a text editor that is specifically designed to help you write scripts. There’s a lot of these out there, and experienced programmers will all have their own personal favourites. For our purposes, however, we can just use the one built into Rstudio. To create new script file in R studio, go to the “File” menu, select the “New” option, and then click on “R script”. This will open a new window within the “source” panel. Then you can type the commands you want (or code as it is generally called when you’re typing the commands into a script file) and save it when you’re done. The nice thing about using Rstudio to do this is that it automatically changes the colour of the text to indicate which parts of the code are comments and which are parts are actual R commands (these colours are called syntax highlighting, but they’re not actually part of the file – it’s just Rstudio trying to be helpful. To see an example of this, let’s open up our hello.R
script in Rstudio. To do this, go to the “File” menu again, and select “Open…”. Once you’ve opened the file, you should be looking at something like Figure 8.2. As you can see (if you’re looking at this book in colour) the character string “hello world” is highlighted in green.
Using Rstudio for your text editor is convenient for other reasons too. Notice in the top right hand corner of Figure 8.2 there’s a little button that reads “Source”? If you click on that, Rstudio will construct the relevant source()
command for you, and send it straight to the R console. So you don’t even have to type in the source()
command, which actually I think is a great thing, because it really bugs me having to type all those extra keystrokes every time I want to run my script. Anyway, Rstudio provide several other convenient little tools to help make scripting easier, but I won’t discuss them here.134
8.1.4 Commenting your script
When writing up your data analysis as a script, one thing that is generally a good idea is to include a lot of comments in the code. That way, if someone else tries to read it (or if you come back to it several days, weeks, months or years later) they can figure out what’s going on. As a beginner, I think it’s especially useful to comment thoroughly, partly because it gets you into the habit of commenting the code, and partly because the simple act of typing in an explanation of what the code does will help you keep it clear in your own mind what you’re trying to achieve. To illustrate this idea, consider the following script:
## --- itngscript.R
# A script to analyse nightgarden.Rdata_
# author: Dan Navarro_
# date: 22/11/2011_
# Load the data, and tell the user that this is what we're
# doing.
cat( "loading data from nightgarden.Rdata...\n" )
load( "./rbook-master/data/nightgarden.Rdata" )
# Create a cross tabulation and print it out:
cat( "tabulating data...\n" )
itng.table <- table( speaker, utterance )
print( itng.table )
You’ll notice that I’ve gone a bit overboard with my commenting: at the top of the script I’ve explained the purpose of the script, who wrote it, and when it was written. Then, throughout the script file itself I’ve added a lot of comments explaining what each section of the code actually does. In real life people don’t tend to comment this thoroughly, but the basic idea is a very good one: you really do want your script to explain itself. Nevertheless, as you’d expect R completely ignores all of the commented parts. When we run this script, this is what we see on screen:
## --- itngscript.R
# A script to analyse nightgarden.Rdata
# author: Dan Navarro
# date: 22/11/2011
# Load the data, and tell the user that this is what we're
# doing.
cat( "loading data from nightgarden.Rdata...\n" )
## loading data from nightgarden.Rdata...
load( "./rbook-master/data/nightgarden.Rdata" )
# Create a cross tabulation and print it out:
cat( "tabulating data...\n" )
## tabulating data...
itng.table <- table( speaker, utterance )
print( itng.table )
## utterance
## speaker ee onk oo pip
## makka-pakka 0 2 0 2
## tombliboo 1 0 1 0
## upsy-daisy 0 2 0 2
Even here, notice that the script announces its behaviour. The first two lines of the output tell us a lot about what the script is actually doing behind the scenes (the code do to this corresponds to the two cat()
commands on lines 8 and 12 of the script). It’s usually a pretty good idea to do this, since it helps ensure that the output makes sense when the script is executed.
8.1.5 Differences between scripts and the command line
For the most part, commands that you insert into a script behave in exactly the same way as they would if you typed the same thing in at the command line. The one major exception to this is that if you want a variable to be printed on screen, you need to explicitly tell R to print it. You can’t just type the name of the variable. For example, our original hello.R
script produced visible output. The following script does not:
## --- silenthello.R
x <- "hello world"
x
It does still create the variable x
when you source()
the script, but it won’t print anything on screen.
However, apart from the fact that scripts don’t use “auto-printing” as it’s called, there aren’t a lot of differences in the underlying mechanics. There are a few stylistic differences though. For instance, if you want to load a package at the command line, you would generally use the library()
function. If you want do to it from a script, it’s conventional to use require()
instead. The two commands are basically identical, the only difference being that if the package doesn’t exist, require()
produces a warning whereas library()
gives you an error. Stylistically, what this means is that if the require()
command fails in your script, R will boldly continue on and try to execute the rest of the script. Often that’s what you’d like to see happen, so it’s better to use require()
. Clearly, however, you can get by just fine using the library()
command for everyday usage.
8.1.6 Done!
At this point, you’ve learned the basics of scripting. You are now officially allowed to say that you can program in R, though you probably shouldn’t say it too loudly. There’s a lot more to learn, but nevertheless, if you can write scripts like these then what you are doing is in fact basic programming. The rest of this chapter is devoted to introducing some of the key commands that you need in order to make your programs more powerful; and to help you get used to thinking in terms of scripts, for the rest of this chapter I’ll write up most of my extracts as scripts.
8.2 Loops
The description I gave earlier for how a script works was a tiny bit of a lie. Specifically, it’s not necessarily the case that R starts at the top of the file and runs straight through to the end of the file. For all the scripts that we’ve seen so far that’s exactly what happens, and unless you insert some commands to explicitly alter how the script runs, that is what will always happen. However, you actually have quite a lot of flexibility in this respect. Depending on how you write the script, you can have R repeat several commands, or skip over different commands, and so on. This topic is referred to as flow control, and the first concept to discuss in this respect is the idea of a loop. The basic idea is very simple: a loop is a block of code (i.e., a sequence of commands) that R will execute over and over again until some termination criterion is met. Looping is a very powerful idea. There are three different ways to construct a loop in R, based on the while
, for
and repeat
functions. I’ll only discuss the first two in this book.
8.2.1 The while
loop
A while
loop is a simple thing. The basic format of the loop looks like this:
while ( CONDITION ) {
STATEMENT1
STATEMENT2
ETC
}
The code corresponding to CONDITION needs to produce a logical value, either TRUE
or FALSE
. Whenever R encounters a while
statement, it checks to see if the CONDITION is TRUE
. If it is, then R goes on to execute all of the commands inside the curly brackets, proceeding from top to bottom as usual. However, when it gets to the bottom of those statements, it moves back up to the while
statement. Then, like the mindless automaton it is, it checks to see if the CONDITION is TRUE
. If it is, then R goes on to execute all … well, you get the idea. This continues endlessly until at some point the CONDITION turns out to be FALSE
. Once that happens, R jumps to the bottom of the loop (i.e., to the }
character), and then continues on with whatever commands appear next in the script.
To start with, let’s keep things simple, and use a while
loop to calculate the smallest multiple of 17 that is greater than or equal to 1000. This is a very silly example since you can actually calculate it using simple arithmetic operations, but the point here isn’t to do something novel. The point is to show how to write a while
loop. Here’s the script:
## --- whileexample.R
x <- 0
while ( x < 1000 ) {
x <- x + 17
}
print( x )
When we run this script, R starts at the top and creates a new variable called x
and assigns it a value of 0. It then moves down to the loop, and “notices” that the condition here is x < 1000
. Since the current value of x
is zero, the condition is true, so it enters the body of the loop (inside the curly braces). There’s only one command here135 which instructs R to increase the value of x
by 17. R then returns to the top of the loop, and rechecks the condition. The value of x
is now 17, but that’s still less than 1000, so the loop continues. This cycle will continue for a total of 59 iterations, until finally x
reaches a value of 1003 (i.e., \(59 \times 17 = 1003\)). At this point, the loop stops, and R finally reaches line 5 of the script, prints out the value of x
on screen, and then halts. Let’s watch:
source( "./rbook-master/scripts/whileexample.R" )
## [1] 1003
Truly fascinating stuff.
8.2.2 The for
loop
The for
loop is also pretty simple, though not quite as simple as the while
loop. The basic format of this loop goes like this:
for ( VAR in VECTOR ) {
STATEMENT1
STATEMENT2
ETC
}
In a for
loop, R runs a fixed number of iterations. We have a VECTOR which has several elements, each one corresponding to a possible value of the variable VAR. In the first iteration of the loop, VAR is given a value corresponding to the first element of VECTOR; in the second iteration of the loop VAR gets a value corresponding to the second value in VECTOR; and so on. Once we’ve exhausted all of the values in VECTOR, the loop terminates and the flow of the program continues down the script.
Once again, let’s use some very simple examples. Firstly, here is a program that just prints out the word “hello” three times and then stops:
## --- forexample.R
for ( i in 1:3 ) {
print( "hello" )
}
This is the simplest example of a for
loop. The vector of possible values for the i
variable just corresponds to the numbers from 1 to 3. Not only that, the body of the loop doesn’t actually depend on i
at all. Not surprisingly, here’s what happens when we run it:
source( "./rbook-master/scripts/forexample.R" )
## [1] "hello"
## [1] "hello"
## [1] "hello"
However, there’s nothing that stops you from using something non-numeric as the vector of possible values, as the following example illustrates. This time around, we’ll use a character vector to control our loop, which in this case will be a vector of words
. And what we’ll do in the loop is get R to convert the word to upper case letters, calculate the length of the word, and print it out. Here’s the script:
## --- forexample2.R
#the words_
words <- c("it","was","the","dirty","end","of","winter")
#loop over the words_
for ( w in words ) {
w.length <- nchar( w ) # calculate the number of letters_
W <- toupper( w ) # convert the word to upper case letters_
msg <- paste( W, "has", w.length, "letters" ) # a message to print_
print( msg ) # print it_
}
And here’s the output:
source( "./rbook-master/scripts/forexample2.R" )
## [1] "IT has 2 letters"
## [1] "WAS has 3 letters"
## [1] "THE has 3 letters"
## [1] "DIRTY has 5 letters"
## [1] "END has 3 letters"
## [1] "OF has 2 letters"
## [1] "WINTER has 6 letters"
Again, pretty straightforward I hope.
8.2.3 A more realistic example of a loop
To give you a sense of how you can use a loop in a more complex situation, let’s write a simple script to simulate the progression of a mortgage. Suppose we have a nice young couple who borrow $300000 from the bank, at an annual interest rate of 5%. The mortgage is a 30 year loan, so they need to pay it off within 360 months total. Our happy couple decide to set their monthly mortgage payment at $1600 per month. Will they pay off the loan in time or not? Only time will tell.136 Or, alternatively, we could simulate the whole process and get R to tell us. The script to run this is a fair bit more complicated.
## --- mortgage.R
# set up
month <- 0 # count the number of months
balance <- 300000 # initial mortgage balance
payments <- 1600 # monthly payments
interest <- 0.05 # 5% interest rate per year
total.paid <- 0 # track what you've paid the bank
# convert annual interest to a monthly multiplier
monthly.multiplier <- (1+interest) ^ (1/12)
# keep looping until the loan is paid off...
while ( balance > 0 ) {
# do the calculations for this month
month <- month + 1 # one more month
balance <- balance * monthly.multiplier # add the interest
balance <- balance - payments # make the payments
total.paid <- total.paid + payments # track the total paid
# print the results on screen
cat( "month", month, ": balance", round(balance), "\n")
} # end of loop
# print the total payments at the end
cat("total payments made", total.paid, "\n" )
To explain what’s going on, let’s go through it carefully. In the first block of code (under #set up
) all we’re doing is specifying all the variables that define the problem. The loan starts with a balance
of $300,000 owed to the bank on month
zero, and at that point in time the total.paid
money is nothing. The couple is making monthly payments
of $1600, at an annual interest
rate of 5%. Next, we convert the annual percentage interest into a monthly multiplier. That is, the number that you have to multiply the current balance by each month in order to produce an annual interest rate of 5%. An annual interest rate of 5% implies that, if no payments were made over 12 months the balance would end up being \(1.05\) times what it was originally, so the annual multiplier is \(1.05\). To calculate the monthly multiplier, we need to calculate the 12th root of 1.05 (i.e., raise 1.05 to the power of 1/12). We store this value in as the monthly.multiplier
variable, which as it happens corresponds to a value of about 1.004. All of which is a rather long winded way of saying that the annual interest rate of 5% corresponds to a monthly interest rate of about 0.4%.
Anyway… all of that is really just setting the stage. It’s not the interesting part of the script. The interesting part (such as it is) is the loop. The while
statement on tells R that it needs to keep looping until the balance
reaches zero (or less, since it might be that the final payment of $1600 pushes the balance below zero). Then, inside the body of the loop, we have two different blocks of code. In the first bit, we do all the number crunching. Firstly we increase the value month
by 1. Next, the bank charges the interest, so the balance
goes up. Then, the couple makes their monthly payment and the balance
goes down. Finally, we keep track of the total amount of money that the couple has paid so far, by adding the payments
to the running tally. After having done all this number crunching, we tell R to issue the couple with a very terse monthly statement, which just indicates how many months they’ve been paying the loan and how much money they still owe the bank. Which is rather rude of us really. I’ve grown attached to this couple and I really feel they deserve better than that. But, that’s banks for you.
In any case, the key thing here is the tension between the increase in balance
on and the decrease. As long as the decrease is bigger, then the balance will eventually drop to zero and the loop will eventually terminate. If not, the loop will continue forever! This is actually very bad programming on my part: I really should have included something to force R to stop if this goes on too long. However, I haven’t shown you how to evaluate “if” statements yet, so we’ll just have to hope that the author of the book has rigged the example so that the code actually runs. Hm. I wonder what the odds of that are? Anyway, assuming that the loop does eventually terminate, there’s one last line of code that prints out the total amount of money that the couple handed over to the bank over the lifetime of the loan.
Now that I’ve explained everything in the script in tedious detail, let’s run it and see what happens:
source( "./rbook-master/scripts/mortgage.R" )
## month 1 : balance 299622
## month 2 : balance 299243
## month 3 : balance 298862
## month 4 : balance 298480
## month 5 : balance 298096
## month 6 : balance 297710
## month 7 : balance 297323
## month 8 : balance 296934
## month 9 : balance 296544
## month 10 : balance 296152
## month 11 : balance 295759
## month 12 : balance 295364
## month 13 : balance 294967
## month 14 : balance 294569
## month 15 : balance 294169
## month 16 : balance 293768
## month 17 : balance 293364
## month 18 : balance 292960
## month 19 : balance 292553
## month 20 : balance 292145
## month 21 : balance 291735
## month 22 : balance 291324
## month 23 : balance 290911
## month 24 : balance 290496
## month 25 : balance 290079
## month 26 : balance 289661
## month 27 : balance 289241
## month 28 : balance 288820
## month 29 : balance 288396
## month 30 : balance 287971
## month 31 : balance 287545
## month 32 : balance 287116
## month 33 : balance 286686
## month 34 : balance 286254
## month 35 : balance 285820
## month 36 : balance 285385
## month 37 : balance 284947
## month 38 : balance 284508
## month 39 : balance 284067
## month 40 : balance 283625
## month 41 : balance 283180
## month 42 : balance 282734
## month 43 : balance 282286
## month 44 : balance 281836
## month 45 : balance 281384
## month 46 : balance 280930
## month 47 : balance 280475
## month 48 : balance 280018
## month 49 : balance 279559
## month 50 : balance 279098
## month 51 : balance 278635
## month 52 : balance 278170
## month 53 : balance 277703
## month 54 : balance 277234
## month 55 : balance 276764
## month 56 : balance 276292
## month 57 : balance 275817
## month 58 : balance 275341
## month 59 : balance 274863
## month 60 : balance 274382
## month 61 : balance 273900
## month 62 : balance 273416
## month 63 : balance 272930
## month 64 : balance 272442
## month 65 : balance 271952
## month 66 : balance 271460
## month 67 : balance 270966
## month 68 : balance 270470
## month 69 : balance 269972
## month 70 : balance 269472
## month 71 : balance 268970
## month 72 : balance 268465
## month 73 : balance 267959
## month 74 : balance 267451
## month 75 : balance 266941
## month 76 : balance 266428
## month 77 : balance 265914
## month 78 : balance 265397
## month 79 : balance 264878
## month 80 : balance 264357
## month 81 : balance 263834
## month 82 : balance 263309
## month 83 : balance 262782
## month 84 : balance 262253
## month 85 : balance 261721
## month 86 : balance 261187
## month 87 : balance 260651
## month 88 : balance 260113
## month 89 : balance 259573
## month 90 : balance 259031
## month 91 : balance 258486
## month 92 : balance 257939
## month 93 : balance 257390
## month 94 : balance 256839
## month 95 : balance 256285
## month 96 : balance 255729
## month 97 : balance 255171
## month 98 : balance 254611
## month 99 : balance 254048
## month 100 : balance 253483
## month 101 : balance 252916
## month 102 : balance 252346
## month 103 : balance 251774
## month 104 : balance 251200
## month 105 : balance 250623
## month 106 : balance 250044
## month 107 : balance 249463
## month 108 : balance 248879
## month 109 : balance 248293
## month 110 : balance 247705
## month 111 : balance 247114
## month 112 : balance 246521
## month 113 : balance 245925
## month 114 : balance 245327
## month 115 : balance 244727
## month 116 : balance 244124
## month 117 : balance 243518
## month 118 : balance 242911
## month 119 : balance 242300
## month 120 : balance 241687
## month 121 : balance 241072
## month 122 : balance 240454
## month 123 : balance 239834
## month 124 : balance 239211
## month 125 : balance 238585
## month 126 : balance 237958
## month 127 : balance 237327
## month 128 : balance 236694
## month 129 : balance 236058
## month 130 : balance 235420
## month 131 : balance 234779
## month 132 : balance 234136
## month 133 : balance 233489
## month 134 : balance 232841
## month 135 : balance 232189
## month 136 : balance 231535
## month 137 : balance 230879
## month 138 : balance 230219
## month 139 : balance 229557
## month 140 : balance 228892
## month 141 : balance 228225
## month 142 : balance 227555
## month 143 : balance 226882
## month 144 : balance 226206
## month 145 : balance 225528
## month 146 : balance 224847
## month 147 : balance 224163
## month 148 : balance 223476
## month 149 : balance 222786
## month 150 : balance 222094
## month 151 : balance 221399
## month 152 : balance 220701
## month 153 : balance 220000
## month 154 : balance 219296
## month 155 : balance 218590
## month 156 : balance 217880
## month 157 : balance 217168
## month 158 : balance 216453
## month 159 : balance 215735
## month 160 : balance 215014
## month 161 : balance 214290
## month 162 : balance 213563
## month 163 : balance 212833
## month 164 : balance 212100
## month 165 : balance 211364
## month 166 : balance 210625
## month 167 : balance 209883
## month 168 : balance 209138
## month 169 : balance 208390
## month 170 : balance 207639
## month 171 : balance 206885
## month 172 : balance 206128
## month 173 : balance 205368
## month 174 : balance 204605
## month 175 : balance 203838
## month 176 : balance 203069
## month 177 : balance 202296
## month 178 : balance 201520
## month 179 : balance 200741
## month 180 : balance 199959
## month 181 : balance 199174
## month 182 : balance 198385
## month 183 : balance 197593
## month 184 : balance 196798
## month 185 : balance 196000
## month 186 : balance 195199
## month 187 : balance 194394
## month 188 : balance 193586
## month 189 : balance 192775
## month 190 : balance 191960
## month 191 : balance 191142
## month 192 : balance 190321
## month 193 : balance 189496
## month 194 : balance 188668
## month 195 : balance 187837
## month 196 : balance 187002
## month 197 : balance 186164
## month 198 : balance 185323
## month 199 : balance 184478
## month 200 : balance 183629
## month 201 : balance 182777
## month 202 : balance 181922
## month 203 : balance 181063
## month 204 : balance 180201
## month 205 : balance 179335
## month 206 : balance 178466
## month 207 : balance 177593
## month 208 : balance 176716
## month 209 : balance 175836
## month 210 : balance 174953
## month 211 : balance 174065
## month 212 : balance 173175
## month 213 : balance 172280
## month 214 : balance 171382
## month 215 : balance 170480
## month 216 : balance 169575
## month 217 : balance 168666
## month 218 : balance 167753
## month 219 : balance 166836
## month 220 : balance 165916
## month 221 : balance 164992
## month 222 : balance 164064
## month 223 : balance 163133
## month 224 : balance 162197
## month 225 : balance 161258
## month 226 : balance 160315
## month 227 : balance 159368
## month 228 : balance 158417
## month 229 : balance 157463
## month 230 : balance 156504
## month 231 : balance 155542
## month 232 : balance 154576
## month 233 : balance 153605
## month 234 : balance 152631
## month 235 : balance 151653
## month 236 : balance 150671
## month 237 : balance 149685
## month 238 : balance 148695
## month 239 : balance 147700
## month 240 : balance 146702
## month 241 : balance 145700
## month 242 : balance 144693
## month 243 : balance 143683
## month 244 : balance 142668
## month 245 : balance 141650
## month 246 : balance 140627
## month 247 : balance 139600
## month 248 : balance 138568
## month 249 : balance 137533
## month 250 : balance 136493
## month 251 : balance 135449
## month 252 : balance 134401
## month 253 : balance 133349
## month 254 : balance 132292
## month 255 : balance 131231
## month 256 : balance 130166
## month 257 : balance 129096
## month 258 : balance 128022
## month 259 : balance 126943
## month 260 : balance 125861
## month 261 : balance 124773
## month 262 : balance 123682
## month 263 : balance 122586
## month 264 : balance 121485
## month 265 : balance 120380
## month 266 : balance 119270
## month 267 : balance 118156
## month 268 : balance 117038
## month 269 : balance 115915
## month 270 : balance 114787
## month 271 : balance 113654
## month 272 : balance 112518
## month 273 : balance 111376
## month 274 : balance 110230
## month 275 : balance 109079
## month 276 : balance 107923
## month 277 : balance 106763
## month 278 : balance 105598
## month 279 : balance 104428
## month 280 : balance 103254
## month 281 : balance 102074
## month 282 : balance 100890
## month 283 : balance 99701
## month 284 : balance 98507
## month 285 : balance 97309
## month 286 : balance 96105
## month 287 : balance 94897
## month 288 : balance 93683
## month 289 : balance 92465
## month 290 : balance 91242
## month 291 : balance 90013
## month 292 : balance 88780
## month 293 : balance 87542
## month 294 : balance 86298
## month 295 : balance 85050
## month 296 : balance 83797
## month 297 : balance 82538
## month 298 : balance 81274
## month 299 : balance 80005
## month 300 : balance 78731
## month 301 : balance 77452
## month 302 : balance 76168
## month 303 : balance 74878
## month 304 : balance 73583
## month 305 : balance 72283
## month 306 : balance 70977
## month 307 : balance 69666
## month 308 : balance 68350
## month 309 : balance 67029
## month 310 : balance 65702
## month 311 : balance 64369
## month 312 : balance 63032
## month 313 : balance 61688
## month 314 : balance 60340
## month 315 : balance 58986
## month 316 : balance 57626
## month 317 : balance 56261
## month 318 : balance 54890
## month 319 : balance 53514
## month 320 : balance 52132
## month 321 : balance 50744
## month 322 : balance 49351
## month 323 : balance 47952
## month 324 : balance 46547
## month 325 : balance 45137
## month 326 : balance 43721
## month 327 : balance 42299
## month 328 : balance 40871
## month 329 : balance 39438
## month 330 : balance 37998
## month 331 : balance 36553
## month 332 : balance 35102
## month 333 : balance 33645
## month 334 : balance 32182
## month 335 : balance 30713
## month 336 : balance 29238
## month 337 : balance 27758
## month 338 : balance 26271
## month 339 : balance 24778
## month 340 : balance 23279
## month 341 : balance 21773
## month 342 : balance 20262
## month 343 : balance 18745
## month 344 : balance 17221
## month 345 : balance 15691
## month 346 : balance 14155
## month 347 : balance 12613
## month 348 : balance 11064
## month 349 : balance 9509
## month 350 : balance 7948
## month 351 : balance 6380
## month 352 : balance 4806
## month 353 : balance 3226
## month 354 : balance 1639
## month 355 : balance 46
## month 356 : balance -1554
## total payments made 569600
So our nice young couple have paid off their $300,000 loan in just 4 months shy of the 30 year term of their loan, at a bargain basement price of $568,046 (since 569600 - 1554 = 568046). A happy ending!
8.3 Conditional statements
A second kind of flow control that programming languages provide is the ability to evaluate conditional statements. Unlike loops, which can repeat over and over again, a conditional statement only executes once, but it can switch between different possible commands depending on a CONDITION that is specified by the programmer. The power of these commands is that they allow the program itself to make choices, and in particular, to make different choices depending on the context in which the program is run. The most prominent of example of a conditional statement is the if
statement, and the accompanying else
statement. The basic format of an if
statement in R is as follows:
if ( CONDITION ) {
STATEMENT1
STATEMENT2
ETC
}
And the execution of the statement is pretty straightforward. If the CONDITION is true, then R will execute the statements contained in the curly braces. If the CONDITION is false, then it dose not. If you want to, you can extend the if
statement to include an else
statement as well, leading to the following syntax:
if ( CONDITION ) {
STATEMENT1
STATEMENT2
ETC
} else {
STATEMENT3
STATEMENT4
ETC
}
As you’d expect, the interpretation of this version is similar. If the CONDITION is true, then the contents of the first block of code (i.e., STATEMENT1, STATEMENT2, ETC) are executed; but if it is false, then the contents of the second block of code (i.e., STATEMENT3, STATEMENT4, ETC) are executed instead.
To give you a feel for how you can use if
and else
to do something useful, the example that I’ll show you is a script that prints out a different message depending on what day of the week you run it. We can do this making use of some of the tools that we discussed in Section 7.11.3. Here’s the script:
## --- ifelseexample.R
# find out what day it is...
today <- Sys.Date() # pull the date from the system clock
day <- weekdays( today ) # what day of the week it is_
# now make a choice depending on the day...
if ( day == "Monday" ) {
print( "I don't like Mondays" )
} else {
print( "I'm a happy little automaton" )
}
## [1] "I'm a happy little automaton"
Since today happens to be a Sunday, when I run the script here’s what happens:
source( "./rbook-master/scripts/ifelseexample.R" )
## [1] "I'm a happy little automaton"
There are other ways of making conditional statements in R. In particular, the ifelse()
function and the switch()
functions can be very useful in different contexts. However, my main aim in this chapter is to briefly cover the very basics, so I’ll move on.
8.4 Writing functions
In this section I want to talk about functions again. Functions were introduced in Section 3.5, but you’ve learned a lot about R since then, so we can talk about them in more detail. In particular, I want to show you how to create your own. To stick with the same basic framework that I used to describe loops and conditionals, here’s the syntax that you use to create a function:
FNAME <- function ( ARG1, ARG2, ETC ) {
STATEMENT1
STATEMENT2
ETC
return( VALUE )
}
What this does is create a function with the name FNAME, which has arguments ARG1, ARG2 and so forth. Whenever the function is called, R executes the statements in the curly braces, and then outputs the contents of VALUE to the user. Note, however, that R does not execute the commands inside the function in the workspace. Instead, what it does is create a temporary local environment: all the internal statements in the body of the function are executed there, so they remain invisible to the user. Only the final results in the VALUE are returned to the workspace.
To give a simple example of this, let’s create a function called quadruple()
which multiplies its inputs by four. In keeping with the approach taken in the rest of the chapter, I’ll use a script to do this:
## --- functionexample.R
quadruple <- function(x) {
y <- x*4
return(y)
}
When we run this script, as follows
source( "./rbook-master/scripts/functionexample.R" )
nothing appears to have happened, but there is a new object created in the workspace called quadruple
. Not surprisingly, if we ask R to tell us what kind of object it is, it tells us that it is a function:
class( quadruple )
## [1] "function"
And now that we’ve created the quadruple()
function, we can call it just like any other function And if I want to store the output as a variable, I can do this:
my.var <- quadruple(10)
print(my.var)
## [1] 40
An important thing to recognise here is that the two internal variables that the quadruple()
function makes use of, x
and y
, stay internal. That is, if we inspect the contents of the workspace,
library(lsr)
## Warning: package 'lsr' was built under R version 3.5.2
who()
## -- Name -- -- Class -- -- Size --
## balance numeric 1
## day character 1
## i integer 1
## interest numeric 1
## itng.table table 3 x 4
## month numeric 1
## monthly.multiplier numeric 1
## msg character 1
## my.var numeric 1
## payments numeric 1
## quadruple function
## speaker character 10
## today Date 1
## total.paid numeric 1
## utterance character 10
## w character 1
## W character 1
## w.length integer 1
## words character 7
## x numeric 1
we see everything in our workspace from this chapter including the quadruple()
function itself, as well as the my.var
variable that we just created.
Now that we know how to create our own functions in R, it’s probably a good idea to talk a little more about some of the other properties of functions that I’ve been glossing over. To start with, let’s take this opportunity to type the name of the function at the command line without the parentheses:
quadruple
## function (x)
## {
## y <- x * 4
## return(y)
## }
As you can see, when you type the name of a function at the command line, R prints out the underlying source code that we used to define the function in the first place. In the case of the quadruple()
function, this is quite helpful to us – we can read this code and actually see what the function does. For other functions, this is less helpful, as we saw back in Section 3.5 when we tried typing citation
rather than citation()
.
8.4.1 Function arguments revisited
Okay, now that we are starting to get a sense for how functions are constructed, let’s have a look at two, slightly more complicated functions that I’ve created. The source code for these functions is contained within the functionexample2.R
and functionexample3.R
scripts. Let’s start by looking at the first one:
## --- functionexample2.R
pow <- function( x, y = 1) {
out <- x^y # raise x to the power y
return( out )
}
and if we type source("functionexample2.R")
to load the pow()
function into our workspace, then we can make use of it. As you can see from looking at the code for this function, it has two arguments x
and y
, and all it does is raise x
to the power of y
. For instance, this command
pow(x=3, y=2)
## [1] 9
calculates the value of \(3^2\). The interesting thing about this function isn’t what it does, since R already has has perfectly good mechanisms for calculating powers. Rather, notice that when I defined the function, I specified y=1
when listing the arguments? That’s the default value for y
. So if we enter a command without specifying a value for y
, then the function assumes that we want y=1
:
pow( x=3 )
## [1] 3
However, since I didn’t specify any default value for x
when I defined the pow()
function, we always need to input a value for x
. If we don’t R will spit out an error message.
So now you know how to specify default values for an argument. The other thing I should point out while I’m on this topic is the use of the ...
argument. The ...
argument is a special construct in R which is only used within functions. It is used as a way of matching against multiple user inputs: in other words, ...
is used as a mechanism to allow the user to enter as many inputs as they like. I won’t talk at all about the low-level details of how this works at all, but I will show you a simple example of a function that makes use of it. To that end, consider the following script:
## --- functionexample3.R
doubleMax <- function( ... ) {
max.val <- max( ... ) # find the largest value in ...
out <- 2 * max.val # double it
return( out )
}
When we type source("functionexample3.R")
, R creates the doubleMax()
function. You can type in as many inputs as you like. The doubleMax()
function identifies the largest value in the inputs, by passing all the user inputs to the max()
function, and then doubles it. For example:
doubleMax( 1,2,5 )
## [1] 10
8.4.2 There’s more to functions than this
There’s a lot of other details to functions that I’ve hidden in my description in this chapter. Experienced programmers will wonder exactly how the “scoping rules” work in R,137 or want to know how to use a function to create variables in other environments138, or if function objects can be assigned as elements of a list139 and probably hundreds of other things besides. However, I don’t want to have this discussion get too cluttered with details, so I think it’s best – at least for the purposes of the current book – to stop here.
8.5 Implicit loops
There’s one last topic I want to discuss in this chapter. In addition to providing the explicit looping structures via while
and for
, R also provides a collection of functions for implicit loops. What I mean by this is that these are functions that carry out operations very similar to those that you’d normally use a loop for. However, instead of typing out the whole loop, the whole thing is done with a single command. The main reason why this can be handy is that – due to the way that R is written – these implicit looping functions are usually about to do the same calculations much faster than the corresponding explicit loops. In most applications that beginners might want to undertake, this probably isn’t very important, since most beginners tend to start out working with fairly small data sets and don’t usually need to undertake extremely time consuming number crunching. However, because you often see these functions referred to in other contexts, it may be useful to very briefly discuss a few of them.
The first and simplest of these functions is sapply()
. The two most important arguments to this function are X
, which specifies a vector containing the data, and FUN
, which specifies the name of a function that should be applied to each element of the data vector. The following example illustrates the basics of how it works:
words <- c("along", "the", "loom", "of", "the", "land")
sapply( X = words, FUN = nchar )
## along the loom of the land
## 5 3 4 2 3 4
Notice how similar this is to the second example of a for
loop in Section 8.2.2. The sapply()
function has implicitly looped over the elements of words
, and for each such element applied the nchar()
function to calculate the number of letters in the corresponding word.
The second of these functions is tapply()
, which has three key arguments. As before X
specifies the data, and FUN
specifies a function. However, there is also an INDEX
argument which specifies a grouping variable.140 What the tapply()
function does is loop over all of the different values that appear in the INDEX
variable. Each such value defines a group: the tapply()
function constructs the subset of X
that corresponds to that group, and then applies the function FUN
to that subset of the data. This probably sounds a little abstract, so let’s consider a specific example, using the nightgarden.Rdata
file that we used in Chapter 7.
gender <- c( "male","male","female","female","male" )
age <- c( 10,12,9,11,13 )
tapply( X = age, INDEX = gender, FUN = mean )
## female male
## 10.00000 11.66667
In this extract, what we’re doing is using gender
to define two different groups of people, and using their ages
as the data. We then calculate the mean()
of the ages, separately for the males and the females. A closely related function is by()
. It actually does the same thing as tapply()
, but the output is formatted a bit differently. This time around the three arguments are called data
, INDICES
and FUN
, but they’re pretty much the same thing. An example of how to use the by()
function is shown in the following extract:
by( data = age, INDICES = gender, FUN = mean )
## gender: female
## [1] 10
## --------------------------------------------------------
## gender: male
## [1] 11.66667
The tapply()
and by()
functions are quite handy things to know about, and are pretty widely used. However, although I do make passing reference to the tapply()
later on, I don’t make much use of them in this book.
Before moving on, I should mention that there are several other functions that work along similar lines, and have suspiciously similar names: lapply
, mapply
, apply
, vapply
, rapply
and eapply
. However, none of these come up anywhere else in this book, so all I wanted to do here is draw your attention to the fact that they exist.
8.6 Summary
In this chapter I talked about several key programming concepts, things that you should know about if you want to start converting your simple scripts into full fledged programs:
- Writing and using scripts (Section 8.1).
- Using loops (Section 8.2) and implicit loops (Section 8.5).
- Making conditional statements (Section 8.3)
- Writing your own functions (Section 8.4)
As always, there are lots of things I’m ignoring in this chapter. It takes a lot of work to become a proper programmer, just as it takes a lot of work to be a proper psychologist or a proper statistician, and this book is certainly not going to provide you with all the tools you need to make that step. However, you’d be amazed at how much you can achieve using only the tools that I’ve covered up to this point. Loops, conditionals and functions are very powerful things, especially when combined with the various tools discussed in Chapters 3, 4 and 7. Believe it or not, you’re off to a pretty good start just by having made it to this point. If you want to keep going, there are (as always!) several other books you might want to look at. One that I’ve read and enjoyed is “A first course in statistical programming with R” Braun and Murdoch (2007), but quite a few people have suggested to me that “The art of programming with R” Matloff and Matloff (2011) is worth the effort too.
References
Braun, John, and Duncan J Murdoch. 2007. A First Course in Statistical Programming with R. Cambridge University Press Cambridge.
Matloff, Norman, and Norman S Matloff. 2011. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.
The quote comes from Count Zero (1986)↩
Okay, I lied. Sue me. One of the coolest features of Rstudio is the support for R Markdown, which lets you embed R code inside a Markdown document, and you can automatically publish your R Markdown to the web on Rstudio’s servers. If you’re the kind of nerd interested in this sort of thing, it’s really nice. And, yes, since I’m also that kind of nerd, of course I’m aware that iPython notebooks do the same thing and that R just nicked their idea. So what? It’s still cool. And anyway, this book isn’t called Learning Statistics with Python now, is it? Hm. Maybe I should write a Python version…↩
As an aside: if there’s only a single command that you want to include inside your loop, then you don’t actually need to bother including the curly braces at all. However, until you’re comfortable programming in R I’d advise always using them, even when you don’t have to.↩
Okay, fine. This example is still a bit ridiculous, in three respects. Firstly, the bank absolutely will not let the couple pay less than the amount required to terminate the loan in 30 years. Secondly, a constant interest rate of 30 years is hilarious. Thirdly, you can solve this much more efficiently than through brute force simulation. However, we’re not exactly in the business of being realistic or efficient here.↩
Lexical scope.↩
The
assign()
function.↩Yes.↩
Or a list of such variables.↩