1 Overview & examining data in R: viewing, summarising data, R data-types and converting between them; converting errors to NA, sorting, restricting variables/ to subsets
Data Wrangling Recipes in R: Hilary Watt
Oh yes, this book saves LOADS of time, by its clarity and focus on simple methods that readily achieve results.
1.1 Data sets for this book are available on github
The main dataset is called anaemia. Other data sets are used in latter chapters, to teach merging and restructuring datasets. https://github.com/hcwatt/data_wrangling_open.
1.2 Know your data and research question: essential knowledge before preparing data for analysis
You should receive information on any data set that is available to you, including how variables are coded, how the data was collected and how different variables were measured. Refer to this to better understand your data. Otherwise, check where you can obtain this information.
To prepare data for analysis, you need to be clear about the purpose of your analyses. You need clear research questions and a protocol to determine how you will answer these questions. Such protocols should be reviewed by supervisors/ colleagues/ other researchers/ ethic committee/ grant-rewarding committees before you start. Some people also publish detailed analysis plans, which adds extra robustness. How you prepare your data may depend on your research question.
1.3 Tidyverse packages invaluable for data management: it needs to be installed now if not already installed
This guide is intended for those relatively new to R, but absolute beginners would do well to start with a beginners course in R. The appendix gives a few core concepts of R.
Tidyverse suite of packages for data management are of consistent high-quality, make preparing data easier and work well together The tidyverse package is also valuable for tabulating data and for producing publication-quality highly-flexible graphs. You need to install it to use many of the commands in this book, as follows:
Each time you use R you need to run the following:
library(tidyverse) # makes the relevant packages available – required each time R/ RStudio is opened
When you load the tidyverse
, the following packages are loaded and many more: ggplot2
, dplyr
, tidyr
, readr
, purr
, tibble
, stringr
, and forecats
. For more information on the packages tidyverse
loads see https://www.tidyverse.org/packages/.
For any internet searches for data management in R, it is worthwhile including “tidyverse” as a search term. This might tend to give simpler and more powerful solutions than using base R.
1.4 Keep dataset as given to us, whilst aiming to overwrite different amended versions as much as possible
Always keep a copy of the dataset in the form given to us. If we save an amended “analysis” data set, we never overwrite our original data set. However, to avoid confusion it is good practice to over-write copies of our datasets that we no-longer need. Keeping a tidy well-annotated data-management R script is invaluable. This makes it easy to add in new variables/ data and create an amended analysis data set. It may help us to have confidence in overwriting previous versions.
1.5 View dataset including excel-style format & finding its dimensions
To view your data in a separate scrollable window, click on the name of the dataset in Environment window or use View()
command. You can use your mouse to scroll right-to-left and up-and-down in the dataset:
Viewing data on chosen variables allows us to readily compare the values of the selected variables. This uses df_name[row, column] format: nothing before comma implies keep all rows - then after the comma 2 variables= columns are specified:
## Output NOT shown for any of these:
View(anaemia[, c("hb_pre", "hb_post")]) ## view specified variables only
View(anaemia[anaemia$sex=="Female", c( "hb_post", "hb_pre")]) ## only show female (rows), change the order of vars to view accordingly
View(anaemia[c("hb_pre", "hb_post")]) ## view specified variables only - shorthand version
The following shows the first or last few rows, printing results into the console window.
head(anaemia) ## show first few lines (default 6 lines)
tail(anaemia, n=2) ## shows last few lines (argument requests 2 lines)
We might also want to view the dimensions of our data-base: how many rows (= observations) and columns (= variables) it has. We can also check for the length of any variables. Note that all variables in a data-frame must necessarily be of the same length (=number of observations in the data-frame). That is one motivation to keep variables within data-frames, to avoid error from variables being of different lengths and to avoid the need for extra code required to overcome this issue (issues with stand-alone variables).
dim(anaemia) ## print number of rows and columns in data-frame
nrow(anaemia) ## print number of rows/ observations in data-frame
ncol(anaemia) ## print number of columns/ variables in data-frame
length(anaemia$hb_pre) ## print length of variable specified: must be same as number of rows in data-frame, for variables within data-frames
1.6 Too many variables? Create & focus on a subset of required variables
With very many variables, we don’t want to spend hours cleaning irrelevant variables. Therefore we initially identify ones that we need. Don’t worry that you might miss some - it is fine to add new variables later and clean them as we recognise that we need them. With a tidy R script file, the original script file can be amended to include more variables. Similarly, if we only want certain types of people for our analysis, we might be able to select only that subset at the outset - although we might need to clean the relevant variable first. Such decisions require a protocol that describes your requirements alongside your research questions, that others have reviewed.
# keep only variables listed: using df_name[row, column] format: blank for rows implies to keep all rows, then only named columns=variables
anaemia2 <- anaemia[ ,c("id2","hb_pre", "hb_post")] ## base R robust option
# Alternative that does the same thing
anaemia2 <- subset(anaemia, select=c("id2", "hb_pre", "hb_post")) ## alternative though warning for lack of robustness in complex situations
# drop a variable & data-frame
anaemia$ht <- NULL # drops a variable from within data-frame
anaemia2 <- anaemia # copy data-frame
rm(anaemia2) ## drops data-frame
1.7 R data-types: Check what data-type R currently recognises each variable to be
We firstly check what datatypes R is currently recognising variables to be. If R is not correctly recognising their type, we need to correct this prior to analysis. Code is given here to achieve this quickly and easily, in simple situations. However, there is a lot of detail on how to do this, whilst overcoming different types of messiness, for each different type of data in the sections that follow.
The following R datatypes may be usefulData Type | Code in Environment window & str() function | Appearance in R data window | Examples where this is appropriate datatype |
---|---|---|---|
Numeric | num | Numbers | weight (kg), height (m) |
Integer | int | Integers | Number of people in household, number of hospital admissions |
Logicals/ Binary | logi | TRUE/FALSE | Has diabetes? Has emigrated? |
String/ Character | chr | Text: can include letters, numbers, special characters & spaces | Person ID variables, name, text responses |
Categorical/ Factor/ Binary | Factor | Categories are often named, but can be represented by numbers | death (died or not), urgency of operation (elective, urgent, emergency or salvage) |
Ordered Factor/ Ordered categorical | Ord.factor | Categories are often named, but can be represented by numbers | Cancer stage, Walking speed (slow, medium, fast, very fast) |
Date | Date, format | Date | Date of birth |
Date & time | POSIXct, format | Date & time | Date & time of entry into study |
Environment window (top right in RStudio) shows each variable and which data-type they are currently recognised as, within R.
str(anaemia) # this command gives similar information to environment window, when applied to a data-frame
## 'data.frame': 1040 obs. of 24 variables:
## $ id2 : chr "AB001" "AB002" "AB003" "AB004" ...
## $ sex : chr "Female" "Male" "Male" "Male" ...
## $ operat : chr "Urgent" "Urgent" "Elective" "Elective" ...
## $ hb_pre : int 134 97 115 138 135 156 115 103 137 121 ...
## $ hb_post : int 92 104 91 105 100 108 95 90 104 94 ...
## $ death : chr "alive" "alive" "alive" "alive" ...
## $ deathICU : chr "alive" "alive" "alive" "alive" ...
## $ return : chr "no" " yes" "no" "no" ...
## $ hpt : chr "no" " yes" " yes" " yes" ...
## $ weight : num 61 79 103 117 82 ...
## $ height : num 1.67 1.82 1.76 1.82 1.67 ...
## $ FFP : int 7 5 0 0 0 0 7 5 0 0 ...
## $ plts : int 0 2 0 0 0 0 1 1 0 0 ...
## $ rbc : int 5 15 2 1 0 0 2 8 1 0 ...
## $ los_pre : int 8 23 2 1 1 1 1 12 0 20 ...
## $ los_post : int 7 52 7 6 151 7 6 34 6 50 ...
## $ los : int 15 75 9 7 152 8 7 46 6 70 ...
## $ futime : int 93 103 101 106 151 100 109 107 91 101 ...
## $ death_fu : int 0 0 0 1 1 0 1 0 1 0 ...
## $ date_operat : chr "07-Sep-22" "09-Apr-22" "12-Jul-21" "25-Sep-22" ...
## $ datetime_end_fu: chr "09/12/2022 23:10" "21/07/2022 00:47" "21/10/2021 06:21" "09/01/2023 13:21" ...
## $ wgt : chr "61kg" "79kg" "103kg" "117kg" ...
## $ height2 : chr "1.67m" "182mm" "1.76" "1.8200001" ...
## $ factorx : chr "54,024" "10,480" "18,556" "65,273" ...
Interpretation:
- Stored as dataframe
- The output shows how many observations (often representing people) there are and how many variables/ columns there are.
- Variable names: storage type (see table below) and first few observations.
1.8 Converting between R data types (simple stuff plus tips and hints: worked examples and info to cope with many types of messiness are given in following sections by intended data-type)
The following R datatypes may be usefulData Type | Code in Environment window & str() function | Test for this | Converting to this data-type | R treated this as |
---|---|---|---|---|
Numeric | num | is.numeric() | as.numeric() OR as.numeric(as.character()) to convert from numerically coded factor variable retaining numerical values * | Number |
Integer | int | is.integer() | as.integer() OR as.integer(as.character()) to convert from integer-coded factor variable retaining numerical values * | Integer |
Logicals/ Binary | logi | is.logical() | as.logical() | TRUE=1, FALSE=0 |
String/ Character | chr | is.character() | as.character() | Strings of letters, numbers, special characters |
Categorical/ Factor/ Binary | Factor | is.factor() | as.factor() OR factor() ** | Categorical (or numbers 1, 2, 3,… according to their order) |
Ordered Factor/ Ordered categorical | Ord.factor | is.ordered() | as.ordered() (often better to specify as simple factor above ***) | Ordinal (or numbers 1, 2, 3,… according to their order) |
Date | Date, format | is.Date() | as.Date() from POXIT OR from character: mdy(), dmy(), ymd(),… (order of d, m, y in function reflects order of day, month, year ****) | Number of days since 1 jan 1970. |
Date & time | POSIXct, format | is.POSIXt()/ is.POSITct() | as.POSIXct(), ymd_hms(), hms() for time alone****) | Number of milliseconds from 1 jan 1970. |
Note: * converting directly from factor to number when factors appear as numbers, the resulting numerical variable is often different from observed numbers, which can confuse. To avoid this, use as.numeric( as.character( ) ) or as.integer( as.character( ) ) to retain number values of numerically coded factor variables. To code factors where categories include text, use as.numeric( ) or as.integer( ) alone.
Note: ** Specifying an (intended predictor) variable to be a factor variable will generally mean that it is fitted as a categorical predictor variable within regression analyses. This implies that R will generate appropriate indicator variables.
Note: *** Specifying a variable as an ordered factor often implies that it will be analysed as a linear trend across ordered categories. We often prefer to estimate the effect in each level separately, so may be better to avoid this option and specify as (unordered) factor. Because R is written by many different users, the treatment of such variables varies by command.
Note: **** More details on section on date/ times. Many of these require library(tidyverse) or the subset library(lubridate)
class()
function: shows data-type/ object-type:
## [1] "data.frame"
## [1] "character"
str()
structure function sometimes gives the same information, but gives far more detail for data-frames.
Once you have converted to appropriate data-types, use the summary() command (below) on the new & old variables, and check that the number of missing observations (NAs) has not accidentally increased. The following sections give far more detail on coping with messy data to avoid this happening. Refer to relevant chapter according to appropriate data type for variable concerned.
1.9 Summarising variables: to quickly assess what we have and to check for errors (more detail in following sections)
It is useful to look at tables/ summaries of each variable, to better understand what data we have. This is part of the necessary process to check and clean variables. This is taught in far more detail in the sections that follow, by data-type.
summary()
summarises variables, with summary presented depending on R data-type.
For integer/numerical variables, summary()
outputs the mean, median, range (min / max), intequartile range, and number of missing (NA
) values. For categorical (factor
) variables, summary
() produces the frequency of each factor level, including NAs. For character variables, it gives their (maximum) length.
?summary
gives further details.
## id2 operat weight
## Length:1040 Length:1040 Min. : 38.00
## Class :character Class :character 1st Qu.: 70.00
## Mode :character Mode :character Median : 80.00
## Mean : 83.21
## 3rd Qu.: 92.00
## Max. :999.00
## NA's :11
## Length Class Mode
## 1040 character character
Watch out for “messy” categories – different spellings of the same thing and unnecessary poorly defined categories. “Blank” (empty quotes) should be NA, so R recognises it as missing and treats it accordingly. The two “spellings” of male should be merged into one group. See next sub-heading on recoding to NA.
For numeric data, look out for outliers and impossible values, as well as for shapes of distributions. Missing data is often coded as 9, 99 or 999, or perhaps -1, -2, (or blank/ empty quotes “” for character data-type). Hence 999 looks like missing data code and it is an impossible value for weight (of man in kg). Ideally, we should have a code book giving this information. These need to be recoded to NA, so that the missing data code values are not included into calculated mean values and similar.
For numeric/ integer data which takes very few values, it can be more useful to see a table of the data, rather than the information given by summary()
.
table(anaemia$plts, exclude=NULL) # useful to keep track of missing data here too with this exclude=NULL option
##
## 0 1 2 3 4 5 6 7 11 13 16 <NA>
## 725 123 108 36 22 4 6 3 1 2 1 9
We might also want to know how many unique values a variable takes: code follows.
## [1] 1040
## [1] "Female" "Male" "" "male"
## [1] 4
The sections that follow give far more information on appropriate checks and considerations, according to datatype.
1.10 Recoding to NA any impossible values & missing data codes: so that R recognises them as missing
The missing data code in R is NA for all data-types, including character and numeric and date/ times.
anaemia$weight[anaemia$weight == 999] <- NA # Set weight = NA where weight is 999
anaemia$sex[anaemia$sex==""] <- NA # Change from blank sex to missing data code NA
Detail of code: This evaluates whether or not: anaemia$weight == 999
with double equals required to evaluate to see whether or not equal to. Then applies the code only when this expression is TRUE, which sets value to be NA (R missing data code). The code anaemia$weight <- NA
would set all weight values to missing NA, which would rarely be useful.
1.11 Sorting dataset to peruse observations according to characteristics/ high/ low values
We might want to sort the data, so that we can look over the data more efficiently. This can allow easy inspection of patients with the lowest anaemia prior to operation (lowest hb_pre
), where we can scroll through their values on all variables:
Using base R, use syntax df_name [row, column] or else use Tidyverse syntax which may be easier. The following sorts the rows, and selects all columns, so there is a blank after the comma in column position:
To look at each sex
separately, since hb_pre
is generally lower in females, sort firstly by sex
and then by hb_pre
within gender:
anaemia[order(anaemia$sex, anaemia$hb_pre), ] # base R version: sort by firstly sex and then hb_pre within each value of sex.
anaemia %>%
arrange(sex, hb_pre) # Tidyverse version: sort by firstly sex and then hb_pre within each value of sex.
By default, we sort lowest to highest. To reverse this order (highest to lowest), use the optional argument decreasing = TRUE
.
1.12 Restricting to subsets: so that you can check data by subsets, or restrict whole analysis data-set to a subset
The following shows how you can save a subset of variables to a dataframe; the following code retains the original dataframe (anaemia) with all the variables in addition to the version with few variables (named aneamia2). Remember to keep any ID variables, which enables more variables to be added in as required. The following shows you how to save only selected rows/ observations into a new dataframe; the following code retains the original dataframe (anaemia) with all rows in addition to the modified version (named aneamia2). Remember you might need to clean the relevant variable(s) first.
# View data on anyone you plan to drop – only two people here
print(subset(anaemia, anaemia$operat=="Salvage"), font_size = 6)
# Exclude the only person with “salvage”
anaemia.NOTsalvage <- subset(anaemia, anaemia$operat!="Salvage")
# drop those with Elective operations, only those with other operation types
# (or missing operation type) will remain, using df_name[row, column] syntax
anaemia.NOTelective <- anaemia[anaemia$operat != "Elective", ]
# keeps only those with elective operations, using df_name[row, column] syntax
anaemia.elective <- anaemia[anaemia$operat == "Elective", ]
Using dataframe[row, column]
notation, condition indicate which rows to include, so are before the comma. Blank in column position indicates all columns=variables
are selected. Appropriate data-frame names are chosen.
Keep everyone in the dataset, unless there is strong justification to do otherwise. Declare how many were dropped and your rationale in any reports/ articles/ dissertations.
To restrict commands to subsets, we can use [conditions in brackets]
. Note that such conditions are often specified on variables, which have just one dimension.
hist(anaemia$hb_pre[anaemia$operat == "Elective"],
main="Histogram of hb_pre where operation was elective",
xlab="hb pre (g/L)", col="navy", border="white")
##
## Female Male <NA>
## 290 614 31
The following is more sophisticated – this has conditions which are evaluated to TRUE
/FALSE
/NA
(for each row of data), and then tabulated (giving total number of TRUE
, FALSE
and NA
across all rows). Logical operators can be combined, using brackets as necessary. Without brackets, conditions are combined with AND (&
) before they are combined with OR (|
). We cannot use syntax with [conditions] within Tidyverse commands.
##
## FALSE TRUE
## 906 124
# count if operation is [either emergency(=2) or urgent(=4)]
# AND hb_pre is between >110 and <140
table((anaemia$operat == "Emergency"|anaemia$operat == "Urgent") &
(anaemia$hb_pre > 110 & anaemia$hb_pre < 140))
##
## FALSE TRUE
## 876 164
# count if operation is emergency(=2) (and hb_pre can be anything) or
# operation is urgent(=4) AND hb_pre is between >110 and <140
table(anaemia$operat == "Emergency"|anaemia$operat == "Urgent" &
anaemia$hb_pre > 110 & anaemia$hb_pre < 140)
##
## FALSE TRUE
## 849 191
The main dataset is called anaemia, available here: https://github.com/hcwatt/data_wrangling_open.
Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London.