Chapter 4 Missing data in R
In R, missing values are denoted by NA, which means “Not Available”. If we open the same dataset as above in R, we get the following result.
library(haven) # provides read_sav() for reading SPSS .sav files
dataset <- read_sav("data/CH2 example.sav")
head(dataset, 10) # Data of the first 10 patients are shown
## # A tibble: 10 x 7
##       ID  Pain Tampascale Disability Radiation Gender    GA
##    <dbl> <dbl>      <dbl>      <dbl>     <dbl>  <dbl> <dbl>
##  1     1     9         45         20         1      1     8
##  2     2     6         43         10         0      0    36
##  3     3     1         36          1         0      1     8
##  4     4     5         38         14         0      0    NA
##  5     5     6         44         14         1      1     8
##  6     6     7         43         11         1      0    29
##  7     7     8         43         18         0      0    NA
##  8     8     6         43         11         1      0    34
##  9     9     2         37         11         1      1     8
## 10    10     4         36          3         0      0    38
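As a brief illustration of how NA behaves, the sketch below uses a hypothetical vector `ga` that holds the GA values of the 10 patients shown above (the name is an assumption made for this example only). It suggests detecting missing values with `is.na()` rather than with `==`, because a comparison with NA is itself NA.

```r
# Hypothetical vector with the GA values of the first 10 patients shown above
ga <- c(8, 36, 8, NA, 8, 29, NA, 34, 8, 38)

# is.na() returns TRUE for missing entries and FALSE otherwise
missing_flags <- is.na(ga)
missing_flags

# Summing the logical vector counts the missing values
n_missing <- sum(missing_flags)
n_missing

# A comparison with NA yields NA, not TRUE or FALSE
na_comparison <- NA == NA
na_comparison
```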
The variable Gestational Age (GA) contains the observed values for GA (e.g. 36, 29), the value 8 for males, and NAs. In R the value 8 is treated as a real value, so it has to be recoded to NA for the males. After recoding, the first 10 rows look as follows.
## # A tibble: 10 x 7
##       ID  Pain Tampascale Disability Radiation Gender    GA
##    <dbl> <dbl>      <dbl>      <dbl>     <dbl>  <dbl> <dbl>
##  1     1     9         45         20         1      1    NA
##  2     2     6         43         10         0      0    36
##  3     3     1         36          1         0      1    NA
##  4     4     5         38         14         0      0    NA
##  5     5     6         44         14         1      1    NA
##  6     6     7         43         11         1      0    29
##  7     7     8         43         18         0      0    NA
##  8     8     6         43         11         1      0    34
##  9     9     2         37         11         1      1    NA
## 10    10     4         36          3         0      0    38
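The recoding that leads to the output above could be done in several ways. The following is a minimal sketch, not necessarily the code used to produce that output. It works on a hypothetical data frame `dataset_small` that contains only the Gender and GA values of the 10 patients shown; both the name and the reduced data frame are assumptions made to keep the example self-contained.

```r
# Hypothetical data frame with Gender and GA for the 10 patients shown above
dataset_small <- data.frame(
  Gender = c(1, 0, 1, 0, 1, 0, 0, 0, 1, 0),
  GA     = c(8, 36, 8, NA, 8, 29, NA, 34, 8, 38)
)

# Recode the value 8 (used for males) to NA.
# %in% returns FALSE instead of NA for entries that are already missing,
# so the logical index itself never contains NA.
dataset_small$GA[dataset_small$GA %in% 8] <- NA

dataset_small$GA
```

If Gender is coded 1 for males, as the rows shown suggest, indexing on `dataset_small$Gender == 1` instead would give the same result for these rows and would not depend on 8 never occurring as a genuine GA value.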
In many R functions the handling of NA values has to be specified explicitly; otherwise the function returns NA as a result. For example, calling mean() on Gestational Age without further arguments results in NA.
## [1] NA
To obtain the mean of the observed data, the argument na.rm = TRUE has to be added.
## [1] 35.09524
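The two calls can be sketched as follows. The example uses a hypothetical vector `ga` with the recoded GA values of only the first 10 patients, so its mean differs from the full-dataset value shown above.

```r
# Hypothetical vector: recoded GA values of the first 10 patients only
ga <- c(NA, 36, NA, NA, NA, 29, NA, 34, NA, 38)

# Without NA handling the result is NA
mean_default <- mean(ga)
mean_default

# With na.rm = TRUE the NA values are dropped before the mean is computed
mean_observed <- mean(ga, na.rm = TRUE)
mean_observed  # (36 + 29 + 34 + 38) / 4 = 34.25
```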
The na.rm = TRUE argument in the mean() function indicates that NA values are removed before the analysis is executed. Another NA handling mechanism that is regularly used in functions is the na.action argument, with the options na.fail, na.omit, NULL (no action) and na.exclude. For more information about the na.action options, see the help file by typing ?na.action in the Console window.
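To give an impression of how the na.action options differ, the sketch below fits a linear regression with `lm()`, chosen here only as one example of a function that accepts na.action. The data frame `d` is hypothetical and holds the Pain and recoded GA values of the first 10 patients shown earlier; with so few complete rows the fit itself is not meaningful.

```r
# Hypothetical data: Pain and recoded GA for the 10 patients shown earlier
d <- data.frame(
  Pain = c(9, 6, 1, 5, 6, 7, 8, 6, 2, 4),
  GA   = c(NA, 36, NA, NA, NA, 29, NA, 34, NA, 38)
)

# na.omit: rows with NA are dropped; residuals cover the complete rows only
fit_omit <- lm(Pain ~ GA, data = d, na.action = na.omit)
length(residuals(fit_omit))

# na.exclude: rows with NA are dropped for fitting, but residuals are
# padded with NA so that they line up with the original rows
fit_exclude <- lm(Pain ~ GA, data = d, na.action = na.exclude)
length(residuals(fit_exclude))

# na.fail: the call stops with an error as soon as any NA is present
fail_result <- tryCatch(
  lm(Pain ~ GA, data = d, na.action = na.fail),
  error = function(e) "error"
)
fail_result
```

In this sketch na.omit returns residuals for the 4 complete rows only, whereas na.exclude returns 10 residuals of which 6 are NA, which is convenient when residuals or predictions have to be merged back into the original data.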