11 Coding loops
If you’re newer to coding and haven’t taken a programming class, chances are you haven’t coded a for loop. For loops are part of something called control flow, whereby you repeatedly execute a chunk of code as long as some condition or statement is met.
In data science you’ll find that loops pop up in lots of places. For example, you might have a list of URLs that you want to scrape. So, you write a bit of code that starts at the first URL, downloads the HTML into R, and then extracts certain bits into a data frame. It then does this for the 2nd URL, then the 3rd, and so on until you have a big data frame with all the data from your list of webpages. In this week’s topic we’re using a loop to repeatedly fit a set of models and calculate error across multiple lists of split indexes.
The common theme with both of the above is that there are two main parts to a for loop:
The main body of the statement that contains the code you want to run each time
The header of the statement, which controls how many times the code runs or what the code runs over.
11.1 A simple loop
Let’s see a basic for loop in action. Here we’re going to simply tell R to take the sequence of values from 1 to 5 and print each of those numbers.
for(i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
Here the header statement is the for(i in 1:5) part, which is saying for each value i in the list of numbers starting at 1 and ending at 5. The code part is just saying print i. So the loop starts with i = 1, goes and prints that back, then starts over with the next i in the list, so i = 2, prints it, and so on.
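To make it concrete that the body runs once for each value of i, here is a minimal sketch that uses i in a calculation rather than only printing it. The object total is a name introduced here purely for illustration; it is not part of the lesson’s code.

```r
# Add up the values 1 through 5, one loop iteration at a time.
# `total` is a hypothetical helper object used only in this sketch.
total <- 0
for (i in 1:5) {
  total <- total + i  # the body runs once for each value of i
}
print(total)
```

Because the body runs five times, with i taking the values 1, 2, 3, 4, and 5 in turn, total ends up as their sum.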
11.2 Using i with square brackets
Obviously we want to do more than print off a list of numbers. Frequently we want to use i to refer to a specific place in a vector or a row or column in a data frame.
Let’s consider a simple vector of names:
names <- c('Nick', 'Candace', 'Spencer', 'Yujia', 'Lauren', 'Abby')
If you wanted to get the 3rd name you could just use square brackets to do that (remember that vectors are one dimensional so your square brackets only need one value to refer to a position, not two like a data frame).
names[3]
## [1] "Spencer"
But what if you instead make i = 3 and then ask for names[i]? You get the same thing.
i <- 3
names[i]
## [1] "Spencer"
The point here is that i is just an abstraction for a number in the sequence part of the header statement, and that it can be used to refer to positions in a vector or data frame.
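The same idea carries over to data frames, where the square brackets take a row and a column. The sketch below uses a small made-up data frame called people, which is an assumption for illustration and not an object from the lesson.

```r
# A small made-up data frame, just for illustration
people <- data.frame(
  name = c('Nick', 'Candace', 'Spencer'),
  num_char = c(4, 7, 7),
  stringsAsFactors = FALSE
)

i <- 3
people[i, 'name']  # the value in row i of the 'name' column
people[i, ]        # all of row i (every column)
```

Here i picks the row, and the value after the comma picks the column. Leaving the column blank returns the whole row.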
Let’s apply this by using square brackets to print off each name in our names vector:
for(i in 1:6) {
  print(names[i])
}
## [1] "Nick"
## [1] "Candace"
## [1] "Spencer"
## [1] "Yujia"
## [1] "Lauren"
## [1] "Abby"
11.3 Adding functionality to our for loop
Alright, the above loop is doing a simple thing by printing out the i’th name in our vector. But the whole point is to actually do something to whatever is in the i’th position of our data. So let’s add a function to the mix.
We’ll use the function tolower() to convert each name to lower case text. You’ll notice that we’re going to store our converted name in the object lower_case_name and then print that back. So each time our loop iterates to the next place it’ll update that object as well.
for(i in 1:6) {
  lower_case_name <- tolower(names[i])
  print(lower_case_name)
}
## [1] "nick"
## [1] "candace"
## [1] "spencer"
## [1] "yujia"
## [1] "lauren"
## [1] "abby"
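The body of a loop can hold more than one step. As a rough sketch, the version below chains two base R functions, toupper() and paste(), and stores each intermediate result in its own object. The object names shout and greeting are invented for this example.

```r
names <- c('Nick', 'Candace', 'Spencer', 'Yujia', 'Lauren', 'Abby')

for (i in 1:6) {
  shout <- toupper(names[i])          # step 1: convert to upper case
  greeting <- paste('Hello,', shout)  # step 2: build a short message
  print(greeting)
}

# After the loop finishes, the objects only hold the values from the last pass
greeting
```

Note that shout and greeting are overwritten on every pass, so once the loop is done only the final value is left. That is fine for printing, but it is the reason results need to be stored somewhere if you want to keep all of them.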
11.4 Making your header not reliant on a fixed sequence
The above simple loop does the job, but how can we make it so it’s not contingent on us knowing exactly how long the vector is? It’s easy to see that our vector is six items long, but what if we’re dealing with millions of entries, or some dynamic data that’s always changing length?
For example, let’s add a couple new names to our vector
names <- c(names, 'Chung-Ting', 'Fahmeda')
names
## [1] "Nick" "Candace" "Spencer" "Yujia" "Lauren"
## [6] "Abby" "Chung-Ting" "Fahmeda"
Running our loop again gives the same first six results because we’re only telling it to start at i = 1 and end at i = 6. But it really needs to end at i = 8.
for(i in 1:6) {
  lower_case_name <- tolower(names[i])
  print(lower_case_name)
}
## [1] "nick"
## [1] "candace"
## [1] "spencer"
## [1] "yujia"
## [1] "lauren"
## [1] "abby"
We could update our header to read for(i in 1:8), but that’s a temporary and clunky solution. How about instead we use one of our simple functions to ask how long our data is? In this case that’s length(), since names is a vector, but nrow() and ncol() can do the same for the number of rows and columns in a data frame.
length(names)
## [1] 8
We can make a sequence using length(names) as our ending value…
1:length(names)
## [1] 1 2 3 4 5 6 7 8
We can use this to make our header, and thus it’ll iterate over the entire vector regardless of whether its length changes.
for(i in 1:length(names)) {
  lower_case_name <- tolower(names[i])
  print(lower_case_name)
}
## [1] "nick"
## [1] "candace"
## [1] "spencer"
## [1] "yujia"
## [1] "lauren"
## [1] "abby"
## [1] "chung-ting"
## [1] "fahmeda"
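One caveat worth knowing: 1:length(x) misbehaves when x is empty, because 1:0 counts down from 1 to 0 instead of producing nothing. Base R’s seq_along() avoids this, and seq_len(nrow(df)) plays the same role for the rows of a data frame. The sketch below is an alternative to the header used above rather than a replacement for it; the objects empty and n_runs are made up for the demonstration.

```r
names <- c('Nick', 'Candace')

# seq_along(names) gives 1, 2, ... up to the length of names
for (i in seq_along(names)) {
  print(tolower(names[i]))
}

# With an empty vector the two approaches differ
empty <- character(0)
1:length(empty)   # the sequence 1, 0 (counts down)
seq_along(empty)  # an empty sequence

# Count how many times a loop over the empty vector runs
n_runs <- 0
for (i in seq_along(empty)) {
  n_runs <- n_runs + 1
}
n_runs
```

With seq_along() the loop body is skipped entirely when there is nothing to iterate over, which is usually what you want with data whose length can change.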
11.5 Filling a data frame with a loop
Printing stuff to the console is fine, but normally the goal of a loop is to apply a function a bunch of times and store the results so we can do some additional analysis or plotting. For example, in this class we use loops to calculate error multiple times and then plot it. So you need to store each measurement in a data frame to plot with.
In order to do this, we need to do two things. First, generate an empty data frame. Second, fill that data frame with each iteration of our loop.
For the last part of this lesson we’re going to fill a data frame with our names, the lower case version of them, and a final column that contains the number of characters in each.
11.5.1 Making an empty data frame
First thing we need to do is make an empty data frame and give it some column names. We’re going to make a matrix with three columns (specified with the ncol = argument), and wrap the data.frame() function around that. You’ll see that calling it returns a data frame with three columns and a single row of NA placeholders.
names_df <- data.frame(matrix(ncol = 3))
names_df
## X1 X2 X3
## 1 NA NA NA
We can name those columns with the colnames() function. We call this function to the left of our assignment arrow, and then assign a vector of column names.
colnames(names_df) <- c('original_name', 'lower_name', 'num_char')
names_df
## original_name lower_name num_char
## 1 NA NA NA
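Another way to set up the container, shown here only as an alternative sketch, is to name and type the columns directly in data.frame(). The object name names_df2 is made up so that it does not overwrite the lesson’s names_df. Unlike the matrix approach, this version starts with zero rows rather than one row of NA.

```r
# An empty data frame with named, typed columns and zero rows.
# `names_df2` is a hypothetical name used only in this sketch.
names_df2 <- data.frame(
  original_name = character(),
  lower_name = character(),
  num_char = integer(),
  stringsAsFactors = FALSE
)

str(names_df2)   # shows the three columns and their types
nrow(names_df2)  # number of rows in the empty data frame
```

Declaring the column types up front makes it easier to spot when a loop writes the wrong kind of value into a column.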
11.5.2 Filling your data frame
Now that we have our empty data frame with proper column names, all we have to do is fill it with our values. To do this we need to store our data row-wise in the appropriate column. In this case, we don’t want to overwrite the entry in the data frame, so you need to add each converted value to the i’th position. This way it starts by adding it in the first row, then the second, then third, all the way until the last value i.
for(i in 1:length(names)) {
  lower_case_name <- tolower(names[i])
  name_length <- nchar(names[i]) #nchar() gets number of characters
  # fill data frame
  names_df[i, 'lower_name'] <- lower_case_name
  names_df[i, 'num_char'] <- name_length
}
See
names_df
## original_name lower_name num_char
## 1 NA nick 4
## 2 NA candace 7
## 3 NA spencer 7
## 4 NA yujia 5
## 5 NA lauren 6
## 6 NA abby 4
## 7 NA chung-ting 10
## 8 NA fahmeda 7
Ah, but how do we fill our original name? Well, remember that our original name is just the i’th position in the names vector? We can just add that as well!
for(i in 1:length(names)) {
  lower_case_name <- tolower(names[i])
  name_length <- nchar(names[i]) #nchar() gets number of characters
  # fill data frame
  names_df[i, 'lower_name'] <- lower_case_name
  names_df[i, 'num_char'] <- name_length
  names_df[i, 'original_name'] <- names[i]
}
Check
names_df
## original_name lower_name num_char
## 1 Nick nick 4
## 2 Candace candace 7
## 3 Spencer spencer 7
## 4 Yujia yujia 5
## 5 Lauren lauren 6
## 6 Abby abby 4
## 7 Chung-Ting chung-ting 10
## 8 Fahmeda fahmeda 7
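One way to sanity-check a loop like this is to compare its result against the same functions applied to the whole vector at once, since tolower() and nchar() both accept a full vector. The sketch below rebuilds the data frame from scratch so it can run on its own. The comparison lines at the end are an addition for illustration and not part of the lesson’s workflow.

```r
names <- c('Nick', 'Candace', 'Spencer', 'Yujia', 'Lauren', 'Abby',
           'Chung-Ting', 'Fahmeda')

# Rebuild the data frame with the loop
names_df <- data.frame(matrix(ncol = 3))
colnames(names_df) <- c('original_name', 'lower_name', 'num_char')

for (i in 1:length(names)) {
  names_df[i, 'lower_name'] <- tolower(names[i])
  names_df[i, 'num_char'] <- nchar(names[i])
  names_df[i, 'original_name'] <- names[i]
}

# Compare each column to the whole-vector result; both should be TRUE
all(names_df$lower_name == tolower(names))
all(names_df$num_char == nchar(names))
```

If either comparison came back FALSE, that would point to a mistake in the indexing inside the loop.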
Now we have a data frame that we can graph with! We can make this fascinating bar graph to show who has the longest name. Chung-Ting takes home that title.
library(ggplot2) # provides ggplot(), aes(), and geom_col()
ggplot(names_df,
       aes(x = lower_name, y = num_char)) +
  geom_col()
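If you would rather confirm the longest name with code than read it off the bars, base R’s which.max() returns the position of the largest value. This is a small sketch that rebuilds just the two needed columns from the values shown above; the object name longest is made up for the example.

```r
# Rebuild the relevant columns so this sketch runs on its own
names_df <- data.frame(
  original_name = c('Nick', 'Candace', 'Spencer', 'Yujia', 'Lauren',
                    'Abby', 'Chung-Ting', 'Fahmeda'),
  num_char = c(4, 7, 7, 5, 6, 4, 10, 7),
  stringsAsFactors = FALSE
)

# which.max() gives the row position of the largest num_char
longest <- names_df$original_name[which.max(names_df$num_char)]
longest
```

Keep in mind that which.max() only returns the first position when there is a tie for the largest value.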
11.6 Conclusion and tips
Hopefully this lesson helps tease apart the different parts that are involved in for loops. There are a bunch of small things to keep track of, which trips people up when they’re starting. My biggest piece of advice is to build up loops step-by-step. Don’t worry about making it work with any length of data, or adding to data frames right off the bat. Just make it apply a function to your data and print the result. Once that works, make it work with data of any length. After that’s good, have it add just one part to a data frame. So, start small and build from there!