Chapter 4 Putting it all together

Now that we’ve discussed the key principles of talking to computers, let’s solidify this new understanding using an example you will often encounter. According to our second principle, if we want to keep something for later, we must put it in an object. Let’s have a look at some health and IQ data stored in a data frame object called df:

head(df)

##   ID Agegroup ExGroup        IQ
## 1  1        1       1  89.95680
## 2  2        1       1 104.35860
## 3  3        1       1  98.56626
## 4  4        1       1 118.60979
## 5  5        1       1 116.72901
## 6  6        1     999 107.97970

Now, let’s replace all values of IQ that are further than $\pm 2$ standard deviations from the mean of the variable with NAs.

First, we need to think conceptually and algorithmically about this task: What does it actually mean for a data point to be further than $\pm 2$ standard deviations from the mean? Well, that means that if $Mean(x) = 100$ and $std.dev(x) = 15.34$, we want to select all data points (elements of x) that are either smaller than $100 - 2 \times 15.34 = 69.32$ or larger than $100 + 2 \times 15.34 = 130.68$.

# let's start by calculating the mean
# (the outer brackets are there for instant printing)
# na.rm = T is there to disregard any potential NAs
(m_iq <- mean(df$IQ, na.rm = T))

## [1] 99.99622

# now let's get the standard deviation
(sd_iq <- sd(df$IQ, na.rm = T))

## [1] 15.34238

# now calculate the lower and upper critical values
(crit_lo <- m_iq - 2 * sd_iq)

## [1] 69.31145

(crit_hi <- m_iq + 2 * sd_iq)

## [1] 130.681

This tells us that we want to replace all elements of df$IQ that are smaller than 69.31 or larger than 130.68. Let’s do this!

# let's get a logical vector with TRUE where IQ is larger than crit_hi and
# FALSE otherwise
condition_hi <- df$IQ > crit_hi
# same for IQ smaller than crit_lo
condition_lo <- df$IQ < crit_lo

Since we want all data points that fulfil either condition, we need to use the OR operator. The R symbol for OR is a vertical bar “|” (see bottom of document for more info on logical operators):

# create logical vector with TRUE if df$IQ meets
# condition_lo OR condition_hi
condition <- condition_lo | condition_hi
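If the element-by-element nature of | is not obvious yet, here is a minimal sketch with two made-up logical vectors (too_low and too_high are invented names for illustration, they are not part of df):

```r
# toy logical vectors, made up for illustration only
too_low  <- c(TRUE, FALSE, FALSE, FALSE)
too_high <- c(FALSE, FALSE, TRUE, FALSE)

# | compares the two vectors element by element:
# the result is TRUE wherever at least one of them is TRUE
either <- too_low | too_high
either
## [1]  TRUE FALSE  TRUE FALSE
```

Note that the double-bar version, ||, is meant for single TRUE/FALSE values, so it is not what we want when working with whole vectors.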

Next, we want to replace the values that fulfil the condition with NAs, in other words, we want to do a little subsetting. As we’ve discussed, there are only two ways of doing this: indices and logicals. If we heed principles 5 and 6, think of our code in terms of its output and know what to expect, we will understand that the code above returns a logical vector of length(df$IQ) with TRUEs in places corresponding to positions of those elements of df$IQ that are further than $\pm 2SD$ from the mean and FALSEs elsewhere. Let’s check:

condition

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [199] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [210] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [221] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [243] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [254] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [276] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [287] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [298] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [309] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [320] FALSE FALSE FALSE FALSE FALSE FALSE

# now let's use this object to index out elements of df$IQ which
# fulfil the condition
df$IQ[condition]

##  [1]  20.00000 132.35251  58.85790 132.98465  58.24403 135.92026 132.92835
##  [8] 130.98682  61.39313  67.00030 135.09655  55.77873  68.37715
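Subsetting with logicals and subsetting with indices should give the same result here. A small sketch, assuming a made-up vector iq that stands in for df$IQ and round cut-off values of 70 and 130 chosen only for the example:

```r
# a small made-up vector standing in for df$IQ
iq <- c(101, 20, 95, 133, 110)
cond <- iq < 70 | iq > 130

# subsetting with logicals
iq[cond]
## [1]  20 133

# which() turns the logical vector into the positions of its TRUEs
idx <- which(cond)
idx
## [1] 2 4

# subsetting with indices returns the same elements
iq[idx]
## [1]  20 133
```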

Finally, we want to replace these values with NAs. That’s easy, right? All we need to do is put this vector into []s next to df$IQ (or df[["IQ"]], df[ , "IQ"], df[ , 4], or, different still, df[[4]]) and assign the value of NA to them:

df$IQ[condition] <- NA
# see the result (only rows with NAs in IQ)
df[is.na(df[[4]]), ]

##      ID Agegroup ExGroup IQ
## 10   10        1       1 NA
## 11   11        1       1 NA
## 71   71        1       2 NA
## 77   77        1       2 NA
## 122 122        1       2 NA
## 125 125        1       2 NA
## 133 133        1       2 NA
## 135 135        1       2 NA
## 171 171        2       1 NA
## 212 212        2       1 NA
## 247 247        2       2 NA
## 270 270        2       2 NA
## 288 288        2       2 NA
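The different ways of getting at the IQ column mentioned above can be compared directly. The sketch below uses a tiny made-up data frame called toy (not the real df) with the same column names, and it assumes a plain data.frame rather than some other table-like object:

```r
# a tiny made-up data frame with the same columns as df
toy <- data.frame(ID = 1:3,
                  Agegroup = c(1, 1, 2),
                  ExGroup = c(1, 2, 2),
                  IQ = c(101, 95, 110))

# each of these should pull out the IQ column as a plain numeric vector
all_same <- c(
  identical(toy$IQ, toy[["IQ"]]),
  identical(toy$IQ, toy[ , "IQ"]),
  identical(toy$IQ, toy[ , 4]),
  identical(toy$IQ, toy[[4]])
)
all_same
## [1] TRUE TRUE TRUE TRUE
```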

SUCCESS!

We replaced outlying values of IQ with NAs. Or, to be pedantic (and that is a virtue when talking to computers), we took the labels identifying the relevant elements of the df$IQ vector, put those labels on a bunch of NAs and burned the original elements. All that because you cannot really change an R object.
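A rough way to see this idea in action, using two made-up vectors x and y (what R does internally is more subtle than this sketch suggests):

```r
x <- c(1, 2, 3)
y <- x        # y is now another label for the same values

y[2] <- NA    # "changing" y gives y a new set of values...

x             # ...while x still holds the original ones
## [1] 1 2 3
y
## [1]  1 NA  3
```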

4.1 Are there quicker ways?

You might be wondering if there are other ways of achieving the same outcome, perhaps with fewer steps. Well, aren’t you lucky, there are indeed! For instance, you can put all of the code above in a single command, like this:

# IQ[(IQ is smaller than mean - 2SD) OR (IQ is larger than mean + 2SD)] <- NA
df$IQ[(df$IQ < mean(df$IQ, na.rm = T) - 2 * sd(df$IQ, na.rm = T)) |
        (df$IQ > mean(df$IQ, na.rm = T) + 2 * sd(df$IQ, na.rm = T))] <- NA

Of course, fewer commands doesn’t necessarily mean better code. The above has the benefit of not creating any additional objects (m_iq, condition_lo, etc.) and not cluttering your environment. However, it may be less intelligible to a novice R user (the annotation does help though).

A particularly smart and elegant way would be to realise that the condition above is the same as saying we want all the points x_i for which $|x_i - Mean(x)| > 2 \times 15.34$. The $x_i - Mean(x)$ has the effect of centring x so that its mean is zero and the absolute value ($|...|$) disregards the sign. Thus $|x| > 1$ is the same as $x < -1$ OR $x > 1$.

Good, so the condition we want to apply to subset the IQ variable of df is abs(df$IQ - mean(df$IQ, na.rm = T)) > 2 * sd(df$IQ, na.rm = T). The rest is the same:

df$IQ[abs(df$IQ - mean(df$IQ, na.rm = T)) > 2 * sd(df$IQ, na.rm = T)] <- NA
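If you would like to convince yourself that the two-sided condition and the abs() condition pick out the same elements, a quick check along these lines should do. The vector iq is made up for the example and is not the real df$IQ:

```r
# made-up values standing in for df$IQ
iq <- c(101, 20, 95, 133, 110, 99, 104, 97)

m <- mean(iq, na.rm = TRUE)
s <- sd(iq, na.rm = TRUE)

# condition written as two comparisons joined by OR
two_sided <- (iq < m - 2 * s) | (iq > m + 2 * s)

# condition written using the absolute value of the centred variable
centred <- abs(iq - m) > 2 * s

# both logical vectors flag the same elements
identical(two_sided, centred)
## [1] TRUE
```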

This is quite a neat way of replacing outliers with NAs and code like this shows a desire to make things elegant and efficient. However, all three approaches discussed above (and potentially others) are correct. If it works, it’s fine!
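If you find yourself doing this to several variables, one further option is to wrap the abs() approach in a function. The sketch below is only an illustration: the function name replace_outliers and its argument n_sd are made up here, and the vector iq is invented data rather than the real df$IQ:

```r
# hypothetical helper (not part of the original example):
# returns a copy of x with values further than n_sd SDs from the mean set to NA
replace_outliers <- function(x, n_sd = 2) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  x[abs(x - centre) > n_sd * spread] <- NA
  x
}

# made-up values standing in for df$IQ
iq <- c(101, 20, 95, 133, 110, 99, 104, 97)
cleaned <- replace_outliers(iq)
cleaned
## [1] 101  NA  95 133 110  99 104  97
```

The function returns a new vector and leaves iq untouched, so to keep the result you would assign it back, for instance df$IQ <- replace_outliers(df$IQ).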

Elementary, my dear Watson! ;)