Chapter 4 Putting it all together
Now that we’ve discussed the key principles of talking to computers, let’s solidify this new understanding using an example you will often encounter. According to our second principle, if we want to keep something for later, we must put it in an object. Let’s have a look at some health and IQ data stored in a data frame object called df:
head(df)
## ID Agegroup ExGroup IQ
## 1 1 1 1 89.95680
## 2 2 1 1 104.35860
## 3 3 1 1 98.56626
## 4 4 1 1 118.60979
## 5 5 1 1 116.72901
## 6 6 1 999 107.97970
Now, let’s replace all values of IQ that lie more than 2 standard deviations away from the mean of the variable with NAs.
First, we need to think conceptually and algorithmically about this task: What does it actually mean for a data point to lie more than 2 standard deviations away from the mean? Well, if \(Mean(x) = 100\) and \(std.dev(x) = 15.34\), it means we want to select all data points (elements of x) that are either smaller than \(100 - 2 \times 15.34 = 69.32\) or larger than \(100 + 2 \times 15.34 = 130.68\).
# let's start by calculating the mean
# (the outer brackets make R print the result straight away)
# na.rm = TRUE is there to disregard any potential NAs
(m_iq <- mean(df$IQ, na.rm = TRUE))
## [1] 99.99622
# now let's get the standard deviation
(sd_iq <- sd(df$IQ, na.rm = TRUE))
## [1] 15.34238
# now calculate the lower and upper critical values
(crit_lo <- m_iq - 2 * sd_iq)
## [1] 69.31145
(crit_hi <- m_iq + 2 * sd_iq)
## [1] 130.681
This tells us that we want to replace all elements of df$IQ that are smaller than 69.31 or larger than 130.68. Let’s do this!
# let's get a logical vector with TRUE where IQ is larger than crit_hi and
# FALSE otherwise
condition_hi <- df$IQ > crit_hi
# same for IQ smaller than crit_lo
condition_lo <- df$IQ < crit_lo
Since we want all data points that fulfil either condition, we need to use the OR operator. The R symbol for OR is a vertical bar “|” (see bottom of document for more info on logical operators):
# create logical vector with TRUE if df$IQ meets
# condition_lo OR condition_hi
condition <- condition_lo | condition_hi
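To see how the OR operator combines two logical vectors element by element, here is a minimal sketch on a made-up vector. The names x, too_low, too_high and either, as well as the cut-offs 70 and 130, are invented for illustration and are not part of our df example:

```r
# toy scores, invented for illustration only
x <- c(65, 100, 140, 90)

too_low  <- x < 70    # TRUE only for the first element
too_high <- x > 130   # TRUE only for the third element

# element-wise OR: TRUE wherever at least one of the conditions is TRUE
either <- too_low | too_high
either
## [1]  TRUE FALSE  TRUE FALSE
```

The result has the same length as the two input vectors, which is exactly what we need for subsetting.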
Next, we want to replace the values that fulfil the condition with NAs; in other words, we want to do a little subsetting. As we’ve discussed, there are only two ways of doing this: indices and logicals. If we heed principles 5 and 6, think of our code in terms of its output, and know what to expect, we will understand that the code above returns a logical vector of length(df$IQ) with TRUEs in the positions of those elements of df$IQ that lie more than 2 SD away from the mean and FALSEs elsewhere. Let’s check:
condition
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [199] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [210] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [221] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [243] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [254] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [276] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [287] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [298] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [309] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [320] FALSE FALSE FALSE FALSE FALSE FALSE
# now let's use this object to index out elements of df$IQ which
# fulfil the condition
df$IQ[condition]
## [1] 20.00000 132.35251 58.85790 132.98465 58.24403 135.92026 132.92835
## [8] 130.98682 61.39313 67.00030 135.09655 55.77873 68.37715
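Since subsetting works with either logicals or indices, it may help to see the two side by side. The sketch below uses a made-up vector (x, cond and idx are illustrative names only) and the base R functions sum() and which() to count the TRUEs and to turn them into positions:

```r
# toy vector and a logical condition, invented for illustration
x <- c(12, 55, 7, 48, 3)
cond <- x < 10

# TRUE counts as 1 and FALSE as 0, so sum() tells us how many
# elements meet the condition
sum(cond)
## [1] 2

# which() returns the positions of the TRUEs
idx <- which(cond)
idx
## [1] 3 5

# subsetting with the logical vector or with the positions
# picks out the same elements
identical(x[cond], x[idx])
## [1] TRUE
```

Applied to our example, sum(condition) would be a quick way to check how many values are about to be replaced.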
Finally, we want to replace these values with NAs. That’s easy, right? All we need to do is to put this vector into []s next to df$IQ (or df[["IQ"]], df[ , "IQ"], df[ , 4], or, different still, df[[4]]) and assign the value of NA to them:
df$IQ[condition] <- NA
# see the result (only rows with NAs in IQ)
df[is.na(df[[4]]), ]
## ID Agegroup ExGroup IQ
## 10 10 1 1 NA
## 11 11 1 1 NA
## 71 71 1 2 NA
## 77 77 1 2 NA
## 122 122 1 2 NA
## 125 125 1 2 NA
## 133 133 1 2 NA
## 135 135 1 2 NA
## 171 171 2 1 NA
## 212 212 2 1 NA
## 247 247 2 2 NA
## 270 270 2 2 NA
## 288 288 2 2 NA
SUCCESS!
We replaced outlying values of IQ with NAs. Or, to be pedantic (and that is a virtue when talking to computers), we took the labels identifying the outlying elements of the df$IQ vector, put those labels on a bunch of NAs and burned the original elements. All that because you cannot really change an R object.
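One rough way to convince yourself of this is the small sketch below, which uses a made-up vector (x and y are illustrative names). It assumes nothing beyond base R: after copying x into y and "changing" y, the original x is untouched, because in this situation R builds a modified copy rather than altering the values that x is attached to.

```r
# toy vector, invented for illustration
x <- c(1, 2, 3)
y <- x          # y gets the same values as x

y[2] <- NA      # "changing" y attaches y to a new, modified vector

x               # the original is untouched
## [1] 1 2 3
y
## [1]  1 NA  3
```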
4.1 Are there quicker ways?
You might be wondering if there are other ways of achieving the same outcome, perhaps with fewer steps. Well, aren’t you lucky, there are indeed! For instance, you can put all of the code above in a single command, like this:
# IQ[(IQ is smaller than mean - 2SD) OR (IQ is larger than mean + 2SD)] <- NA
df$IQ[(df$IQ < mean(df$IQ, na.rm = TRUE) - 2 * sd(df$IQ, na.rm = TRUE)) |
        (df$IQ > mean(df$IQ, na.rm = TRUE) + 2 * sd(df$IQ, na.rm = TRUE))] <- NA
Of course, fewer commands doesn’t necessarily mean better code. The above has the benefit of not creating any additional objects (m_iq, condition_lo, etc.) and not cluttering your environment. However, it may be less intelligible to a novice R user (the annotation does help though).
A particularly smart and elegant way would be to realise that the condition above is the same as saying we want all the points \(x_i\) for which \(|x_i - Mean(x)| > 2 \times 15.34\). The \(x_i - Mean(x)\) has the effect of centring x so that its mean is zero, and the absolute value (\(|...|\)) disregards the sign. For example, \(|x| > 1\) is the same as \(x < -1\) OR \(x > 1\).
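The equivalence between the absolute-value condition and the two-sided condition can be checked on a small made-up vector. This is only a sketch: x, cond_two_sided and cond_abs are illustrative names, and the threshold of 1 is the one from the example above.

```r
# toy vector, invented for illustration
x <- c(-3, -1, 0, 0.5, 2)

# two-sided version: smaller than -1 OR larger than 1
cond_two_sided <- (x < -1) | (x > 1)

# absolute-value version: the sign is disregarded
cond_abs <- abs(x) > 1

cond_abs
## [1]  TRUE FALSE FALSE FALSE  TRUE

# both versions flag exactly the same elements
identical(cond_two_sided, cond_abs)
## [1] TRUE
```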
Good, so the condition we want to apply to subset the IQ variable of df is abs(df$IQ - mean(df$IQ, na.rm = TRUE)) > 2 * sd(df$IQ, na.rm = TRUE). The rest is the same:
df$IQ[abs(df$IQ - mean(df$IQ, na.rm = TRUE)) > 2 * sd(df$IQ, na.rm = TRUE)] <- NA
This is quite a neat way of replacing outliers with NAs, and code like this shows a desire to make things elegant and efficient. However, all three approaches discussed above (and potentially others) are correct. If it works, it’s fine!
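If you find yourself doing this for several variables, one option is to wrap the absolute-value approach in a function. The sketch below is just one possible way of doing it: replace_outliers() is a hypothetical helper (it is not part of base R or of this chapter's original code), k is an assumed argument for the number of standard deviations, and the iq vector is made up.

```r
# hypothetical helper: replace values more than k SDs away from the mean with NA
replace_outliers <- function(x, k = 2) {
  m <- mean(x, na.rm = TRUE)   # mean, ignoring existing NAs
  s <- sd(x, na.rm = TRUE)     # standard deviation, ignoring existing NAs
  x[abs(x - m) > k * s] <- NA  # same condition as in the one-liner above
  x                            # return the modified copy
}

# toy data invented for illustration: 20 is far away from the rest
iq <- c(98, 101, 103, 99, 100, 102, 97, 20)
replace_outliers(iq)
## [1]  98 101 103  99 100 102  97  NA
```

With our data frame, the call would then be df$IQ <- replace_outliers(df$IQ). Note that the function returns a new vector, so the result has to be assigned back for the change to be kept.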
Elementary, my dear Watson! ;)