Chapter 2 Problem set 1

2.1 Intro

For this section I’m planning on giving two solutions - one in base R, one with Tidyverse. I hope to demonstrate that Tidyverse solutions are easier to follow.

2.2 Problem 1

Multiples of 3 and 5

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

2.2.1 Base R

For this I’d like to start with a vector from 1:1000:

x <- 1:999

Then filter for multiples of 3 or 5:

x <- x[x %% 3 == 0 | x %% 5 == 0] # x modulo 3 = 0 or x modulo 5 = 0.

And finally, sum:

sum(x)
## [1] 233168

2.2.2 Tidyverse

This is easy enough to do with a single pipe:

library("tidyverse")
tibble(x = 1:999) %>% # Start with all numbers 1:999
  filter(x %% 3 == 0 | x %% 5 == 0) %>% # keep multiples of 3 or 5
  summarise(solution = sum(x)) %>% # Add them together
  knitr::kable() # Print it as a table
solution
233168

2.2.3 Other Comments

You can also do this as an inclusion/exclusion problem. You can add up the multiples of 3, plus the multiples of 5, but then you’ve double-counted the multiples of 3 & 5, so you need to subtract the multiples of 15.

R has no problems with a 999-element vector, so in this case I prefer starting with every integer and filtering down to reduce the risk of programmer error.

2.3 Problem 2

Even Fibonacci numbers

Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 1 and 2, the first 10 terms will be:

1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

By considering the terms in the Fibonacci sequence whose values do not exceed four million, find the sum of the even-valued terms.

2.3.1 Base R

We need a loop to define the next Fibonacci number.

fib <- c(1, 1)

while (fib[length(fib)] < 4000000) {
  new_fib <- fib[length(fib)] + fib[length(fib) - 1] # sum of last 2 values
  fib <- c(fib, new_fib) # extend the vector
}

Then filter and sum:

sum(fib[fib %% 2 == 0 & fib < 4000000])
## [1] 4613732

2.3.2 Tidyverse

We still need our loop, but we can use {dplyr}’s last function.

fib <- c(1, 1)

while (last(fib) < 4000000) {
  new_fib <- last(fib) + nth(fib, -2) # sum of last 2 values
  fib <- c(fib, new_fib) # extend the vector
}

Then the filter and sum is a bit clearer:

tibble(fib) %>%
  filter(fib < 4000000, fib %% 2 == 0) %>%
  summarise(sum = sum(fib)) %>%
  knitr::kable()
sum
4613732

2.4 Problem 3

The prime factors of 13195 are 5, 7, 13 and 29.

What is the largest prime factor of the number 600851475143 ?

Largest prime factor

The prime factors of 13195 are 5, 7, 13 and 29.

What is the largest prime factor of the number 600851475143 ?

2.4.1 Base R

We need two tools for this - a test for what numbers are factors of 600851475143, and what numbers are prime.

Base R has %% which gives divisibility. If x %% y == 0 then x divided by y leaves remainder 0, i.e. y is a factor of x.

A prime number is one that only has 1 and itself as a factor. It is sufficient to test up to the square root of the number we’re testing. We can write that as a function:

is_prime <- function(x) {
  # Tests if x is prime
  # RETURNS:
  # True if x is prime
  # False if x is not prime

  if (x != round(x)) return(FALSE) # x is not a whole number
  if (x == 2) return(TRUE) # test below fails on 2

  test <- seq(from = 2,
      to = ceiling(sqrt(x)),
      by = 1)
  
  all(x %% test != 0)
}

I would normally start with a vector 1:600851475143, but my machine refuses to build a vector that will be 4 Tb in size! If 600851475143 is prime, then this question is slightly boring.

target <- 600851475143

is_prime(target)
## [1] FALSE

So 6.008514810^{11} is not prime, so we are looking for a proper factor.

factors <- seq(from = 2,
               to = ceiling(sqrt(target)))

factors <- factors[target %% factors == 0] # which of these numbers are factors?

prime_factors <- c()

for (i in factors) {
  if (is_prime(i)) {
    prime_factors <- c(prime_factors, i)
  }
}

max(prime_factors) # which prime factor is largest?
## [1] 6857

The for-loop was necessary because my is_prime function is not vectorised.

2.4.2 Tidyverse

I will keep is_prime.

tibble(x = seq(from = 2,
               to = ceiling(sqrt(target)))) %>%
         filter(target %% x == 0) %>%
  mutate(prime = map_lgl(x, is_prime)) %>%
  filter(prime) %>%
  select(-prime) %>%
  slice_max(x) %>%
  rename(prime_factor = x) %>%
  knitr::kable()
prime_factor
6857

The map_lgl lets us use the is_prime function which only expects 1 number at a time, and apply it to each factor of 6.008514810^{11} in turn. The rest is filtering that we should be familiar with by now.

2.4.3 Closing

Now that we know that we can write code that tests for prime-ness, I’m going to use the Primes package from now on. This is not cheating! Using library functions reduces the chance of us making errors, and is often faster than what we would think of writing.

2.5 Problem 4

Largest palindrome product

A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009 = 91 × 99.

Find the largest palindrome made from the product of two 3-digit numbers.

2.5.1 Initial thoughts

The first thing to note is that the product of two 3-digit numbers must be less than a million, so we expect our palindrome to have 6 digits (although there’s a chance it could have fewer).

There are two ways we might approach this problem. (A) We could search through pairs of 3-digit numbers, multiply them together, and look for the largest result which is also a palindrome. Or (B), we could search through 6-digit palindromes, and look for the largest palindrome which can be factorised into two 3-digit numbers.

Which approach is better? Searching through the palindromes (Method B) has two advantages:

  • It is a much smaller set. There are ~1000 6-digit palindromes, but ~1000 000 pairs of 3-digit numbers.

  • The ordering is clear, so we can stop searching as soon as we’ve found the largest match. It would be more complicated to move through the set of 3-digit pairs in a way that means their product always decreases.

But B has one disadvantage:

  • Checking whether a number can be written as the product of two 3-digit numbers is much more computationally expensive than checking whether a number is a palindrome. Factorising numbers is hard, and we don’t want to have to repeat doing it over and over again. We have a ~1000 step task to complete with every check.

Though based on this, Method B seems like it should still just have the edge. The factor ~1000 differences should cancel out, but B has the early stopping advantage (although the rough order of magnitude estimates in these bullet points need to be made more precise).

Lets try both methods.

2.5.2 Method A

First lets create a function to test whether a number is a palindrome. We will only want to apply this to 6-digit numbers, but if we’re writing a function it’s good practice to make it as generic as possible, if it doesn’t take too much effort. This function can test whether a d-digit number, n, is a palindrome or not, for any even d:

#Is a d-digit number, n, a palindrome? (d-even)
is_palindrome <- function(n, d) {
  if (d %% 2 != 0) {
    stop("The number of digits must be even")
  }
  if (d == 2) {
    return(n %% 11 == 0)
  }
  digits <- array(0, d / 2) #For storing the second half of the digits
  digits[1] <- n %% 10
  for (i in 2:length(digits)) {
    digits[i] <- ((n %% (10**i)) - sum(digits * (10**(0:(length(digits)-1))))) / 10**(i-1)
  }

  #Now compare the second half of digits to the first half
  if (sum(digits * (10**((length(digits) - 1):0))) == (n - sum(digits * (10**(0:(length(digits) - 1))))) / 10**length(digits)) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

With this function, we can now search through pairs of 3-digit numbers, multiply them, and test if the result is a palindrome. Multiplication is commutative, so we actually only have to test half of the pairings. But in R, it’s quicker to use vectorisation where we can, so we test them all anyway. The tensor package lets us take a tensor product, a fast way of getting all the possible products of 3-digit numbers. We use Sys.time() to time how long our code takes to execute.

start_time <- Sys.time()

dplyr::tibble(x = array(tensor::tensor(100:999, 100:999))) %>%
  dplyr::filter(x > 99999) %>% #We only want to test 6-digit answers. We're assuming there will be at least one 6-digit palindrome.
  dplyr::mutate(palindrome = purrr::map_lgl(x, is_palindrome, 6)) %>%
  dplyr::filter(palindrome) %>%
  dplyr::filter(x == max(x)) %>%
  .[1,"x"] %>%
  pull(.) %>%
  print(.)
## [1] 906609
end_time <- Sys.time()

print(end_time - start_time)
## Time difference of 5.965245 secs

We get the answer, 906 609, but it takes a while.

2.5.3 Method B

Method B is implemented below. We test every 6-digit palindrome, largest first, to see whether it can be written as a product of 3-digit numbers. If it can, we stop the loop, and print the result.

start_time <- Sys.time()

n = 999
while (n >= 100) {
  digits <- array(0, 3)
  digits[1] <- n %% 10
  for (i in 2:length(digits)) {
    digits[i] <- ((n %% (10**i)) - sum(digits * (10**(0:2)))) / 10**(i-1)
  }
  palindrome <- n * 1000 + sum(digits * 10**(2:0))
  
  factors <- (100:999)[which(palindrome %% (100:999) == 0)]
  
  if (any(palindrome / factors >= 100 & palindrome / factors <= 999)) {
    print(palindrome)
    break
  }
  
  n <- n - 1
}
## [1] 906609
end_time <- Sys.time()

print(end_time - start_time)
## Time difference of 0.02927542 secs

This was much faster! We can also test whether the early stopping is what made the difference by removing the break from the loop, to see how long the entire loop would have taken.

start_time <- Sys.time()

n = 999
while (n >= 100) {
  digits <- array(0, 3)
  digits[1] <- n %% 10
  for (i in 2:length(digits)) {
    digits[i] <- ((n %% (10**i)) - sum(digits * (10**(0:2)))) / 10**(i-1)
  }
  palindrome <- n * 1000 + sum(digits * 10**(2:0))
  
  factors <- (100:999)[which(palindrome %% (100:999) == 0)]
  
#  if (any(palindrome / factors >= 100 & palindrome / factors <= 999)) {
#    print(palindrome)
#    break
#  }
  
  n <- n - 1
}

end_time <- Sys.time()

print(end_time - start_time)
## Time difference of 0.05148315 secs

This was still a lot faster! Theoretically, it’s hard to understand this, because both methods should have involved similar numbers of steps once you take out the early stopping in Method B.

The reason for the very large difference is down to the way R works. R is an interpreted language, rather than compiled, which means it is slow in general (the code is interpreted as we go along, whereas a compiled language like C can be run directly on the processor). But certain operations in R are performed by compiled functions in the background, which are fast.

Here, we’ve used vectorisation to test the factors of the number in Method B, which is fast, whereas testing for palindromes in Method A was done using a custom function. This custom function becomes extremely slow when repeated ~1000 000 times.

If you are having trouble with code taking a long time to run in R, try to replace custom functions and loops with in-built functions and vectorisation where possible. As this example shows, this can make a much larger difference than trying to work out which approach is theoretically better.