Chapter 9 Useful tips

Following the principles detailed above should make you a better programmer, so we hope you will stick to them. To help you along the way that tiny bit more, here are a few additional tips.

9.1 Break it down

Maybe you find yourself struggling with a task and turn to StackOverflow for help. Maybe you manage to find a ready-made answer to your problem or maybe some helpful soul writes one just for you. And maybe it works but you don’t quite understand why.

These things happen and sometimes a line of code can appear quite obscure. Take for example:

for (i in 1:10) eval(parse(text = paste0(
  "df_", i, " <- as.data.frame(matrix(rnorm(100 * ", i ,", 0, ", i, "), ncol = i))")))

When faced with a bit of code like this, it is generally a good idea to try to reverse-engineer it. Let’s give it a go.

First, we can see that this is a for loop that repeats itself 10 times: it starts by assigning the value of 1 to the iterator object i, then executes the code, moves i on to the next value in 1:10 and repeats until the code has been run with i == 10. So to see what the code inside the loop does, we need to set i to some value (1 is a reasonable choice).

i <- 1

Now, let’s start by running the code from the innermost command outwards:

paste0("df_", i, " <- as.data.frame(matrix(rnorm(100 * ", i ,", 0, ", i, "), ncol = i))")
## [1] "df_1 <- as.data.frame(matrix(rnorm(100 * 1, 0, 1), ncol = i))"

Right, so the first command created a character string that looks like a command. Let’s break it down even further:

rnorm(100 * 1, 0, 1)
##   [1] -1.047907263  0.334647380  0.435923495 -0.480989597  1.702518182
##   [6] -0.370730068  1.052210386  1.108929130  0.018052376  1.279100669
##  [11] -1.504444115  0.424034485 -1.204517636 -0.021164629  1.211853449
##  [16]  0.332288148  1.078427383  1.025987337  0.172459867  0.795208031
##  [21]  0.440109993 -0.827218657  0.596764479  0.745033312 -3.025576114
##  [26] -1.762091803  0.413914646  0.845674843 -0.694720934 -0.308303527
##  [31]  1.195282698 -0.911274774  0.115036852  0.539872015  2.341801712
##  [36] -0.190848300  0.461041703 -0.184094516  2.208509955  1.044633826
##  [41] -0.786850302 -0.054910095  0.252311968  1.319205985 -0.364450048
##  [46]  1.001338090 -1.540652875 -0.669952228  0.355237452  0.245582746
##  [51]  0.815361338 -1.470188139  0.922409071  1.359885235  0.047637818
##  [56] -0.139715380  0.536770964  0.714087608 -2.007776828  1.353881648
##  [61]  1.388877526  1.640218197 -0.962296390 -1.649831027 -0.336591424
##  [66] -0.524440172 -1.461580211  1.786026822 -0.081735195  0.550506575
##  [71] -0.790132175 -0.778848364 -1.538555873  0.486646149 -1.480791963
##  [76]  1.274834396 -0.381704736 -1.392629198  1.840397983  1.874917568
##  [81]  0.445369658 -0.485186904  1.255907351  1.095966740 -0.054362077
##  [86]  0.064135598 -0.832767137  1.154166817 -0.009764447 -1.420798097
##  [91] -0.666841665 -0.260189689 -1.241116990 -0.312476256  0.585133232
##  [96]  0.220116142  0.321966905 -0.125602859  0.144848682 -1.150526461

OK, this is easy. The first bit generates \(100 \times i\) random numbers from a normal distribution with a mean of zero and a standard deviation of i (here, i is 1). Let’s move one layer out:

# printout truncated to first 10 lines
matrix(rnorm(100 * 1, 0, 1), ncol = i)
##               [,1]
##   [1,]  0.03877248
##   [2,] -0.69695493
##   [3,]  0.55118318
##   [4,] -0.40594141
##   [5,] -0.32389321
##   [6,] -0.99335789
##   [7,] -1.93083243
##   [8,] -0.85998697
##   [9,]  1.77928830
##  [10,]  1.37858208
##  [ reached getOption("max.print") -- omitted 90 rows ]

This command puts those numbers into a matrix with 100 rows and i columns. Next:

df_1 <- as.data.frame(matrix(rnorm(100 * 1, 0, 1), ncol = i))

This line converts the matrix into a data.frame and stores it in an object called “df_i”. Remember, i takes values of 1-10, increasing by one each time the loop is repeated.
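If you want to check the shape for a value other than 1, the following minimal sketch may help. The name k is a stand-in for i introduced here (so that i stays at 1 for the rest of the walk-through), and 3 is an arbitrary choice.

```r
k <- 3  # hypothetical stand-in for i; any value from 1 to 10 would do

# 100 * k draws from a normal distribution with mean 0 and sd k,
# arranged into k columns, which leaves 100 rows
m <- matrix(rnorm(100 * k, 0, k), ncol = k)
dim(m)

# as.data.frame() keeps the shape and adds default column names V1, V2, ...
d <- as.data.frame(m)
names(d)
```

The dimensions should come out as 100 and 3, and the column names as V1, V2 and V3.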

All good thus far, but why is the command a character string (in “quotes”)? What is that good for? Well, it turns out that the parse() function can take a string with valid R code inside and turn it into an expression:

parse(text = paste0("df_", i, " <- as.data.frame(matrix(rnorm(100 * ",
                    i ,", 0, ", i, "), ncol = i))"))
## expression(df_1 <- as.data.frame(matrix(rnorm(100 * 1, 0, 1), ncol = i)))
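A quick way to see that parse() only translates text into an unevaluated expression is to hand it a command that would fail if it were run. The sketch below uses a made-up command string and a hypothetical object name ex; neither comes from the example above.

```r
# A made-up command that would raise an error if it were executed
ex <- parse(text = "stop('this is never run')")

class(ex)         # "expression"
length(ex)        # 1, because the string holds one command
is.call(ex[[1]])  # TRUE: the element is an unevaluated call to stop()
```

No error appears, because the call is only stored, not executed.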

 

The parsed expression for df_1 can then be evaluated using the eval() function:

eval(parse(text = paste0(
  "df_", i, " <- as.data.frame(matrix(rnorm(100 * ", i , ", 0, ", i, "), ncol = i))")))

# printout truncated
df_1
##            V1
## 1  -0.2964653
## 2  -1.1495345
## 3   1.4206396
## 4   0.2983326
## 5   1.8263633
## 6  -0.8628447
## 7   1.1240684
## 8  -2.1283718
## 9  -0.2896466
## 10 -0.5050959
##  [ reached 'max' / getOption("max.print") -- omitted 90 rows ]

So what the entire loop does is create 10 data frames named df_1 to df_10, each containing 100 rows and a different number of columns (1 for df_1, 6 for df_6 etc.) with random numbers. Moreover, each data.frame contains random numbers with different standard deviations.
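One way to check this summary for yourself is to run the loop and inspect the objects it creates. The sketch below is only an illustration: the seed value and the name dfs are arbitrary choices made here, and mget() is simply one convenient way to collect objects by name.

```r
set.seed(1)  # arbitrary seed, only so that reruns give the same numbers

for (i in 1:10) eval(parse(text = paste0(
  "df_", i, " <- as.data.frame(matrix(rnorm(100 * ", i, ", 0, ", i, "), ncol = i))")))

# Collect the ten objects into a named list, looking them up by name
dfs <- mget(paste0("df_", 1:10))

sapply(dfs, nrow)  # 100 for every data frame
sapply(dfs, ncol)  # 1, 2, ..., 10

# Sample standard deviation over all values in each data frame;
# these should land near, but not exactly on, 1, 2, ..., 10
round(sapply(dfs, function(d) sd(unlist(d))), 2)
```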

And so, just like that, with a single line of code we can create 10 (or more!) different R objects with different properties. Cool, isn’t it? We hope this example demonstrates how, using systematic reverse-engineering, you can come to understand even daunting-looking code with functions you haven’t seen before.

 

9.2 Handy functions that return logicals

Finally, here are some useful functions with which you might want to familiarise yourself. They will make cleaning your data much easier.

  • ==, takes a vector, matrix, or a data.frame and compares every element thereof to a single value or, element by element, to another object of the same size. Returns a logical vector (or matrix) with TRUE for elements that are equal to the compared value and FALSE otherwise. Comparing NA returns NA.

    c(1:5, NA) == c(100, 2, 2, 8, 5, 9)
    ## [1] FALSE  TRUE FALSE FALSE  TRUE    NA
  • <, same as ==, but TRUE is returned if element is less than the compared value.
  • >, same as ==, but TRUE is returned if element is greater than the compared value.
  • <=, same as ==, but TRUE is returned if element is less than or equal to the compared value. In other words, it is a negation of (complementary operation to) >.
  • >=, same as ==, but TRUE is returned if element is greater than or equal to the compared value. Negation of <.
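  • %in%, tests membership rather than position-by-position equality. Each element of the vector to the left is checked against all elements of the vector to the right, and TRUE is returned if it matches any of them. Unlike ==, a missing value on the left gives FALSE rather than NA (unless NA itself is among the values on the right). For example: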

    c(1:5, NA) %in% c(100, 4, 2, 8)
    ## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE
  • all functions that begin with ‘is’, e.g.:
    • is.na(), takes a vector, matrix, or a data.frame and returns a logical vector or matrix of the same shape, with TRUE where the given element is NA and FALSE otherwise.
    • is.numeric(), takes any object and returns TRUE if it is a numeric vector and FALSE otherwise.
    • is.factor(), is.matrix(), is.data.frame(), is.list(), same as is.numeric() but return TRUE if the object provided is a factor, matrix, data.frame, or list, respectively.
    • isTRUE(), returns a single TRUE if the expression provided evaluates to a single TRUE and a single FALSE otherwise. isTRUE(FALSE), isTRUE(NA), isTRUE(c(TRUE, TRUE)) and anything else return FALSE. Because it never returns NA, it is useful for wrapping comparisons that return NA when missing values are involved. For example:
    NA > 4
    ## [1] NA
    isTRUE(NA > 4)
    ## [1] FALSE
  • any(), takes a logical vector and returns TRUE if any of its elements equals TRUE, and FALSE otherwise, e.g., any(1:5 > 4) returns TRUE.
  • all(), like any() but returns TRUE only if all of the elements of the vector provided are TRUE.
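  • all.equal(), takes two objects and returns TRUE if they are equal (numbers are compared up to a small tolerance) and a character vector describing the discrepancies otherwise. Sensitive to attributes so all.equal(1:5, factor(1:5)) does not return TRUE. Because the result is not always a logical, wrap it in isTRUE() when a plain TRUE/FALSE is needed.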

    all.equal(df, df)
    ## [1] TRUE
    all.equal(df, my_list)
    ##  [1] "Names: 3 string mismatches"                                       
    ##  [2] "Attributes: < names for target but not for current >"             
    ##  [3] "Attributes: < Length mismatch: comparison on first 0 components >"
    ##  [4] "Length mismatch: comparison on first 3 components"                
    ##  [5] "Component 1: Lengths: 5, 20"                                      
    ##  [6] "Component 1: Attributes: < target is NULL, current is list >"     
    ##  [7] "Component 1: target is numeric, current is matrix"                
    ##  [8] "Component 2: Modes: numeric, character"                           
    ##  [9] "Component 2: target is numeric, current is character"             
    ## [10] "Component 3: Modes: numeric, list"                                
    ##  [ reached getOption("max.print") -- omitted 4 entries ]
    # use with isTRUE() if T/F desired
    isTRUE(all.equal(1:5, factor(1:5)))
    ## [1] FALSE
  • &, “AND” takes two Booleans and returns TRUE if both of them are TRUE, FALSE if either of them is FALSE, and NA otherwise (so TRUE & NA is NA but FALSE & NA is FALSE). Can be applied element-wise over two logical vectors of the same length:

    c(T, T, F) & c(T, T, T)
    ## [1]  TRUE  TRUE FALSE
  • |, “OR” is the same as & but returns TRUE if either or both of the two compared elements is TRUE.
  • xor(), “exclusive OR” is same as above but returns TRUE only if either the first or the second, but not both of the two compared elements, is TRUE.

    xor(c(T, F, F), c(T, F, T))
    ## [1] FALSE FALSE  TRUE
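  • && and ||, single-element versions of & and |. Both operands should be a single logical value. Older versions of R silently compared only the first elements of longer vectors (i.e., x[1] vs y[1]), but since R 4.3.0 passing a vector longer than one is an error, so subset explicitly:

    c(T, F, F)[1] || c(T, F, T)[1]
    ## [1] TRUE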
  • all of the above can be negated using the ‘!’ operator, e.g.:
    • x != y
    • !x > y
    • !is.na(x)
    • !any(is.na(x)) is equivalent to all(!is.na(x))
    • !(x & y) is equivalent to xor(x, y) | (!x & !y)
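To see how several of these functions combine when cleaning data, here is a minimal sketch. The vector age and its values are invented for illustration (one missing value and one impossible entry), and the name valid is likewise a hypothetical choice.

```r
# Invented toy data: one missing value and one impossible (negative) age
age <- c(25, NA, 41, -3, 67)

# A plain comparison passes the NA through
age > 0

# Combining with !is.na() removes the NA, because FALSE & NA is FALSE
valid <- !is.na(age) & age > 0
valid

# %in% gives FALSE, not NA, for the missing value
age %in% c(25, 67)

# Summaries over the whole vector
any(is.na(age))  # is anything missing?
all(valid)       # is every entry usable?

# == tests exact equality, all.equal() allows a small numeric tolerance
0.1 + 0.2 == 0.3
isTRUE(all.equal(0.1 + 0.2, 0.3))
```

Here valid should come out as TRUE, FALSE, TRUE, FALSE, TRUE, so age[valid] keeps only the three usable entries.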