1 Overview & examining data in R: viewing, summarising data, R data-types and converting between them; converting errors to NA, sorting, restricting variables/ to subsets

Data Wrangling Recipes in R: Hilary Watt

Oh yes, this book saves LOADS of time, by its clarity and focus on simple methods that readily achieve results.

1.1 Data sets for this book are available on github

The main dataset is called anaemia. Other data sets are used in latter chapters, to teach merging and restructuring datasets. https://github.com/hcwatt/data_wrangling_open.

1.2 Know your data and research question: essential knowledge before preparing data for analysis

You should receive information on any data set that is available to you, including how variables are coded, how the data was collected and how different variables were measured. Refer to this to better understand your data. Otherwise, check where you can obtain this information.

To prepare data for analysis, you need to be clear about the purpose of your analyses. You need clear research questions and a protocol to determine how you will answer these questions. Such protocols should be reviewed by supervisors/ colleagues/ other researchers/ ethic committee/ grant-rewarding committees before you start. Some people also publish detailed analysis plans, which adds extra robustness. How you prepare your data may depend on your research question.

1.3 Tidyverse packages invaluable for data management: it needs to be installed now if not already installed

This guide is intended for those relatively new to R, but absolute beginners would do well to start with a beginners course in R. The appendix gives a few core concepts of R.

Tidyverse suite of packages for data management are of consistent high-quality, make preparing data easier and work well together The tidyverse package is also valuable for tabulating data and for producing publication-quality highly-flexible graphs. You need to install it to use many of the commands in this book, as follows:

install.packages(“tidyverse”)     # downloads tidyverse onto computer – do this only once

Each time you use R you need to run the following:

library(tidyverse)      # makes the relevant packages available – required each time R/ RStudio is opened

When you load the tidyverse, the following packages are loaded and many more: ggplot2, dplyr, tidyr, readr, purr, tibble, stringr, and forecats. For more information on the packages tidyverse loads see https://www.tidyverse.org/packages/.

For any internet searches for data management in R, it is worthwhile including “tidyverse” as a search term. This might tend to give simpler and more powerful solutions than using base R.

1.4 Keep dataset as given to us, whilst aiming to overwrite different amended versions as much as possible

Always keep a copy of the dataset in the form given to us. If we save an amended “analysis” data set, we never overwrite our original data set. However, to avoid confusion it is good practice to over-write copies of our datasets that we no-longer need. Keeping a tidy well-annotated data-management R script is invaluable. This makes it easy to add in new variables/ data and create an amended analysis data set. It may help us to have confidence in overwriting previous versions.

1.5 View dataset including excel-style format & finding its dimensions

To view your data in a separate scrollable window, click on the name of the dataset in Environment window or use View() command. You can use your mouse to scroll right-to-left and up-and-down in the dataset:

View(anaemia)   ## Shows output in scrollable window SIMILAR to format below

Viewing data on chosen variables allows us to readily compare the values of the selected variables. This uses df_name[row, column] format: nothing before comma implies keep all rows - then after the comma 2 variables= columns are specified:

## Output NOT shown for any of these:
View(anaemia[, c("hb_pre", "hb_post")])  ## view specified variables only 
View(anaemia[anaemia$sex=="Female", c( "hb_post", "hb_pre")]) ## only show female (rows), change the order of vars to view accordingly
View(anaemia[c("hb_pre", "hb_post")])  ## view specified variables only - shorthand version

The following shows the first or last few rows, printing results into the console window.

head(anaemia)    ## show first few lines (default 6 lines)
tail(anaemia, n=2)  ## shows last few lines (argument requests 2 lines)

We might also want to view the dimensions of our data-base: how many rows (= observations) and columns (= variables) it has. We can also check for the length of any variables. Note that all variables in a data-frame must necessarily be of the same length (=number of observations in the data-frame). That is one motivation to keep variables within data-frames, to avoid error from variables being of different lengths and to avoid the need for extra code required to overcome this issue (issues with stand-alone variables).

dim(anaemia)  ## print number of rows and columns in data-frame
nrow(anaemia)  ## print number of rows/ observations in data-frame
ncol(anaemia)  ## print number of columns/ variables in data-frame
length(anaemia$hb_pre) ## print length of variable specified: must be same as number of rows in data-frame, for variables within data-frames

1.6 Too many variables? Create & focus on a subset of required variables

With very many variables, we don’t want to spend hours cleaning irrelevant variables. Therefore we initially identify ones that we need. Don’t worry that you might miss some - it is fine to add new variables later and clean them as we recognise that we need them. With a tidy R script file, the original script file can be amended to include more variables. Similarly, if we only want certain types of people for our analysis, we might be able to select only that subset at the outset - although we might need to clean the relevant variable first. Such decisions require a protocol that describes your requirements alongside your research questions, that others have reviewed.

# keep only variables listed: using df_name[row, column] format: blank for rows implies to keep all rows, then only named columns=variables
anaemia2 <- anaemia[ ,c("id2","hb_pre", "hb_post")]  ## base R robust option
# Alternative that does the same thing
anaemia2 <- subset(anaemia, select=c("id2", "hb_pre", "hb_post"))  ## alternative though warning for lack of robustness in complex situations

# drop a variable & data-frame
anaemia$ht <- NULL # drops a variable from within data-frame
anaemia2 <- anaemia # copy data-frame
rm(anaemia2)  ## drops data-frame

1.7 R data-types: Check what data-type R currently recognises each variable to be

We firstly check what datatypes R is currently recognising variables to be. If R is not correctly recognising their type, we need to correct this prior to analysis. Code is given here to achieve this quickly and easily, in simple situations. However, there is a lot of detail on how to do this, whilst overcoming different types of messiness, for each different type of data in the sections that follow.

The following R datatypes may be useful
Data Type Code in Environment window & str() function Appearance in R data window Examples where this is appropriate datatype
Numeric num Numbers weight (kg), height (m)
Integer int Integers Number of people in household, number of hospital admissions
Logicals/ Binary logi TRUE/FALSE Has diabetes? Has emigrated?
String/ Character chr Text: can include letters, numbers, special characters & spaces Person ID variables, name, text responses
Categorical/ Factor/ Binary Factor Categories are often named, but can be represented by numbers death (died or not), urgency of operation (elective, urgent, emergency or salvage)
Ordered Factor/ Ordered categorical Ord.factor Categories are often named, but can be represented by numbers Cancer stage, Walking speed (slow, medium, fast, very fast)
Date Date, format Date Date of birth
Date & time POSIXct, format Date & time Date & time of entry into study

Environment window (top right in RStudio) shows each variable and which data-type they are currently recognised as, within R.

str(anaemia) # this command gives similar information to environment window, when applied to a data-frame
## 'data.frame':    1040 obs. of  24 variables:
##  $ id2            : chr  "AB001" "AB002" "AB003" "AB004" ...
##  $ sex            : chr  "Female" "Male" "Male" "Male" ...
##  $ operat         : chr  "Urgent" "Urgent" "Elective" "Elective" ...
##  $ hb_pre         : int  134 97 115 138 135 156 115 103 137 121 ...
##  $ hb_post        : int  92 104 91 105 100 108 95 90 104 94 ...
##  $ death          : chr  "alive" "alive" "alive" "alive" ...
##  $ deathICU       : chr  "alive" "alive" "alive" "alive" ...
##  $ return         : chr  "no" " yes" "no" "no" ...
##  $ hpt            : chr  "no" " yes" " yes" " yes" ...
##  $ weight         : num  61 79 103 117 82 ...
##  $ height         : num  1.67 1.82 1.76 1.82 1.67 ...
##  $ FFP            : int  7 5 0 0 0 0 7 5 0 0 ...
##  $ plts           : int  0 2 0 0 0 0 1 1 0 0 ...
##  $ rbc            : int  5 15 2 1 0 0 2 8 1 0 ...
##  $ los_pre        : int  8 23 2 1 1 1 1 12 0 20 ...
##  $ los_post       : int  7 52 7 6 151 7 6 34 6 50 ...
##  $ los            : int  15 75 9 7 152 8 7 46 6 70 ...
##  $ futime         : int  93 103 101 106 151 100 109 107 91 101 ...
##  $ death_fu       : int  0 0 0 1 1 0 1 0 1 0 ...
##  $ date_operat    : chr  "07-Sep-22" "09-Apr-22" "12-Jul-21" "25-Sep-22" ...
##  $ datetime_end_fu: chr  "09/12/2022 23:10" "21/07/2022 00:47" "21/10/2021 06:21" "09/01/2023 13:21" ...
##  $ wgt            : chr  "61kg" "79kg" "103kg" "117kg" ...
##  $ height2        : chr  "1.67m" "182mm" "1.76" "1.8200001" ...
##  $ factorx        : chr  "54,024" "10,480" "18,556" "65,273" ...

Interpretation:

  • Stored as dataframe
  • The output shows how many observations (often representing people) there are and how many variables/ columns there are.
  • Variable names: storage type (see table below) and first few observations.

1.8 Converting between R data types (simple stuff plus tips and hints: worked examples and info to cope with many types of messiness are given in following sections by intended data-type)

The following R datatypes may be useful
Data Type Code in Environment window & str() function Test for this Converting to this data-type R treated this as
Numeric num is.numeric() as.numeric() OR as.numeric(as.character()) to convert from numerically coded factor variable retaining numerical values * Number
Integer int is.integer() as.integer() OR as.integer(as.character()) to convert from integer-coded factor variable retaining numerical values * Integer
Logicals/ Binary logi is.logical() as.logical() TRUE=1, FALSE=0
String/ Character chr is.character() as.character() Strings of letters, numbers, special characters
Categorical/ Factor/ Binary Factor is.factor() as.factor() OR factor() ** Categorical (or numbers 1, 2, 3,… according to their order)
Ordered Factor/ Ordered categorical Ord.factor is.ordered() as.ordered() (often better to specify as simple factor above ***) Ordinal (or numbers 1, 2, 3,… according to their order)
Date Date, format is.Date() as.Date() from POXIT OR from character: mdy(), dmy(), ymd(),… (order of d, m, y in function reflects order of day, month, year ****) Number of days since 1 jan 1970.
Date & time POSIXct, format is.POSIXt()/ is.POSITct() as.POSIXct(), ymd_hms(), hms() for time alone****) Number of milliseconds from 1 jan 1970.

Note: * converting directly from factor to number when factors appear as numbers, the resulting numerical variable is often different from observed numbers, which can confuse. To avoid this, use as.numeric( as.character( ) ) or as.integer( as.character( ) ) to retain number values of numerically coded factor variables. To code factors where categories include text, use as.numeric( ) or as.integer( ) alone.

Note: ** Specifying an (intended predictor) variable to be a factor variable will generally mean that it is fitted as a categorical predictor variable within regression analyses. This implies that R will generate appropriate indicator variables.

Note: *** Specifying a variable as an ordered factor often implies that it will be analysed as a linear trend across ordered categories. We often prefer to estimate the effect in each level separately, so may be better to avoid this option and specify as (unordered) factor. Because R is written by many different users, the treatment of such variables varies by command.

Note: **** More details on section on date/ times. Many of these require library(tidyverse) or the subset library(lubridate)

class() function: shows data-type/ object-type:

class(anaemia) # names data type – here dataframe
## [1] "data.frame"
class(anaemia$sex) # names data-format – here factor
## [1] "character"

str() structure function sometimes gives the same information, but gives far more detail for data-frames.

Once you have converted to appropriate data-types, use the summary() command (below) on the new & old variables, and check that the number of missing observations (NAs) has not accidentally increased. The following sections give far more detail on coping with messy data to avoid this happening. Refer to relevant chapter according to appropriate data type for variable concerned.

1.9 Summarising variables: to quickly assess what we have and to check for errors (more detail in following sections)

It is useful to look at tables/ summaries of each variable, to better understand what data we have. This is part of the necessary process to check and clean variables. This is taught in far more detail in the sections that follow, by data-type.

summary() summarises variables, with summary presented depending on R data-type.

For integer/numerical variables, summary() outputs the mean, median, range (min / max), intequartile range, and number of missing (NA) values. For categorical (factor) variables, summary() produces the frequency of each factor level, including NAs. For character variables, it gives their (maximum) length. ?summary gives further details.

summary(anaemia[c("id2", "operat", "weight")]) #  gives summary for each selected var in turn
##      id2               operat              weight      
##  Length:1040        Length:1040        Min.   : 38.00  
##  Class :character   Class :character   1st Qu.: 70.00  
##  Mode  :character   Mode  :character   Median : 80.00  
##                                        Mean   : 83.21  
##                                        3rd Qu.: 92.00  
##                                        Max.   :999.00  
##                                        NA's   :11
summary(anaemia$sex)
##    Length     Class      Mode 
##      1040 character character

Watch out for “messy” categories – different spellings of the same thing and unnecessary poorly defined categories. “Blank” (empty quotes) should be NA, so R recognises it as missing and treats it accordingly. The two “spellings” of male should be merged into one group. See next sub-heading on recoding to NA.

For numeric data, look out for outliers and impossible values, as well as for shapes of distributions. Missing data is often coded as 9, 99 or 999, or perhaps -1, -2, (or blank/ empty quotes “” for character data-type). Hence 999 looks like missing data code and it is an impossible value for weight (of man in kg). Ideally, we should have a code book giving this information. These need to be recoded to NA, so that the missing data code values are not included into calculated mean values and similar.

For numeric/ integer data which takes very few values, it can be more useful to see a table of the data, rather than the information given by summary().

table(anaemia$plts, exclude=NULL) # useful to keep track of missing data here too with this exclude=NULL option
## 
##    0    1    2    3    4    5    6    7   11   13   16 <NA> 
##  725  123  108   36   22    4    6    3    1    2    1    9

We might also want to know how many unique values a variable takes: code follows.

length(anaemia$sex) # shows number of rows(=observation)
## [1] 1040
unique(anaemia$sex) # shows unique values (without frequencies)
## [1] "Female" "Male"   ""       "male"
length(unique(anaemia$sex)) # reports NUMBER of unique values
## [1] 4

The sections that follow give far more information on appropriate checks and considerations, according to datatype.

1.10 Recoding to NA any impossible values & missing data codes: so that R recognises them as missing

The missing data code in R is NA for all data-types, including character and numeric and date/ times.

anaemia$weight[anaemia$weight == 999] <- NA # Set weight = NA where weight is 999
anaemia$sex[anaemia$sex==""] <- NA   # Change from blank sex to missing data code NA

Detail of code: This evaluates whether or not: anaemia$weight == 999 with double equals required to evaluate to see whether or not equal to. Then applies the code only when this expression is TRUE, which sets value to be NA (R missing data code). The code anaemia$weight <- NA would set all weight values to missing NA, which would rarely be useful.

1.11 Sorting dataset to peruse observations according to characteristics/ high/ low values

We might want to sort the data, so that we can look over the data more efficiently. This can allow easy inspection of patients with the lowest anaemia prior to operation (lowest hb_pre), where we can scroll through their values on all variables:

Using base R, use syntax df_name [row, column] or else use Tidyverse syntax which may be easier. The following sorts the rows, and selects all columns, so there is a blank after the comma in column position:

anaemia <- anaemia[order(anaemia$hb_pre), ]

To look at each sex separately, since hb_pre is generally lower in females, sort firstly by sex and then by hb_pre within gender:

anaemia[order(anaemia$sex, anaemia$hb_pre), ]   # base R version: sort by firstly sex and then hb_pre within each value of sex.
anaemia %>% 
   arrange(sex, hb_pre)  # Tidyverse version: sort by firstly sex and then hb_pre within each value of sex.

By default, we sort lowest to highest. To reverse this order (highest to lowest), use the optional argument decreasing = TRUE.

anaemia <- anaemia[order(anaemia$hb_pre, decreasing = TRUE), ]  # base R version
anaemia %>%
 arrange(desc(hb_pre)) # tidyverse version

1.12 Restricting to subsets: so that you can check data by subsets, or restrict whole analysis data-set to a subset

The following shows how you can save a subset of variables to a dataframe; the following code retains the original dataframe (anaemia) with all the variables in addition to the version with few variables (named aneamia2). Remember to keep any ID variables, which enables more variables to be added in as required. The following shows you how to save only selected rows/ observations into a new dataframe; the following code retains the original dataframe (anaemia) with all rows in addition to the modified version (named aneamia2). Remember you might need to clean the relevant variable(s) first.

# View data on anyone you plan to drop – only two people here
print(subset(anaemia, anaemia$operat=="Salvage"), font_size = 6)
# Exclude the only person with “salvage”
anaemia.NOTsalvage <- subset(anaemia, anaemia$operat!="Salvage")

# drop those with Elective operations, only those with other operation types
# (or missing operation type) will remain, using df_name[row, column] syntax
anaemia.NOTelective <- anaemia[anaemia$operat != "Elective", ]
# keeps only those with elective operations, using df_name[row, column] syntax
anaemia.elective <- anaemia[anaemia$operat == "Elective", ]

Using dataframe[row, column] notation, condition indicate which rows to include, so are before the comma. Blank in column position indicates all columns=variables are selected. Appropriate data-frame names are chosen.

Keep everyone in the dataset, unless there is strong justification to do otherwise. Declare how many were dropped and your rationale in any reports/ articles/ dissertations.

To restrict commands to subsets, we can use [conditions in brackets]. Note that such conditions are often specified on variables, which have just one dimension.

hist(anaemia$hb_pre[anaemia$operat == "Elective"],
     main="Histogram of hb_pre where operation was elective",
     xlab="hb pre (g/L)", col="navy", border="white")
table(anaemia$sex[anaemia$return == "no"], exclude = NULL)
## 
## Female   Male   <NA> 
##    290    614     31

The following is more sophisticated – this has conditions which are evaluated to TRUE/FALSE/NA (for each row of data), and then tabulated (giving total number of TRUE, FALSE and NA across all rows). Logical operators can be combined, using brackets as necessary. Without brackets, conditions are combined with AND (&) before they are combined with OR (|). We cannot use syntax with [conditions] within Tidyverse commands.

table(anaemia$sex == "Female" & anaemia$hb_pre > 130)
## 
## FALSE  TRUE 
##   906   124
# count if operation is [either emergency(=2) or urgent(=4)]
# AND hb_pre is between >110 and <140
table((anaemia$operat == "Emergency"|anaemia$operat == "Urgent") &
(anaemia$hb_pre > 110 & anaemia$hb_pre < 140))
## 
## FALSE  TRUE 
##   876   164
# count if operation is emergency(=2) (and hb_pre can be anything) or
# operation is urgent(=4) AND hb_pre is between >110 and <140
table(anaemia$operat == "Emergency"|anaemia$operat == "Urgent" &
anaemia$hb_pre > 110 & anaemia$hb_pre < 140)
## 
## FALSE  TRUE 
##   849   191

The main dataset is called anaemia, available here: https://github.com/hcwatt/data_wrangling_open.

Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London.