Chapter 2 Data exploration

2.1 Functions

Besides the data structures you learned about in the last chapter, there is another important concept you need to know when using R: the function. In principle, you can imagine a function as a little machine that takes some input (usually some kind of data), processes that input in a certain way and gives back the result as output. There are functions for almost every computation and statistical test you might want to do; there are functions to read and write data, to shape and manipulate it, to produce plots and even to write books (this document is written completely in R)! The function mean() for example takes a numeric vector as input and computes the mean of the numbers in that vector:

mean(c(2, 4, 6))
[1] 4

2.1.1 Structure of a function

The information that goes into a function is called an argument; the output is called the result. A function can have an arbitrary number of arguments, which are named to tell them apart. The function log() for example takes two arguments: a numeric vector x with the numbers you want to take the logarithm of and a single number base with respect to which the logarithm should be taken:

log(x = c(10, 20, 30), base = 10)
[1] 1.000000 1.301030 1.477121

2.1.2 How to use a function

To find out how a function is used (i.e. what arguments it takes and what kind of result it returns) you can use R’s help. Just put ? in front of the function name (without brackets after the function name). If you run this code, the help page appears in the lower right window in RStudio.

?log

As you can see, the help page gives information about a couple of functions, one of which is log(). Besides the description of the arguments, you should have a look at the information under Usage. Here you can see that the default value for base is exp(1) (approximately 2.72, i.e. Euler’s number), whereas there is no default value for x. All arguments that appear under Usage solely with their name and without a default value (like x in this case) are mandatory when you call the function. Not providing these arguments will throw an error. All arguments that have a default value (like base in this case) can be omitted, in which case R assumes the default value for that argument:

log(x = c(20,30,40)) #argument base can be omitted
[1] 2.995732 3.401197 3.688879
log(base=3) #argument x cannot be omitted
Error in eval(expr, envir, enclos): Argument "x" fehlt (ohne Standardwert)

(Translates to Error: Argument "x" is missing (without a default value).) If you omit the names of the arguments in the function call, R will match the objects to the arguments by their position:

log(c(10, 20, 30), 10)
[1] 1.000000 1.301030 1.477121

2.2 Packages

A basic set of functions is already included in base R, i.e. the software you downloaded when installing R. But since there is a huge community worldwide constantly developing new functions and features for R, and since the entirety of all existing R functions is way too big to install at once, most functions are bundled into so-called packages. A package is a bundle of functions you can download and install from the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/). If you visit the site, you can also get an overview of all available packages. You can install a package by using the function install.packages(), which takes the package name as a string (i.e. in quotes) as its argument:

install.packages("lubridate")

If you run this line of code, R goes online and downloads the package lubridate (Grolemund and Wickham 2011), which contains a number of useful functions for dealing with dates. This operation has to be done only once, so it is one of the rare cases where it makes sense to copy the code directly into the console. If you write it in your script window, it is advisable to comment out the code with a # after you’ve run it once to avoid unnecessarily running it again when you rerun the rest of your script.
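In a script, this could look like the following after the first run (a minimal sketch):

#install.packages("lubridate") #already installed, commented out so it does not run again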

Once you have installed the package, its functions are stored on your computer but are not accessible yet, because the package has to be activated first. If you try to call a function from a package that is not activated yet (e.g. the function today() from lubridate), you get this error:

today()
Error in today(): konnte Funktion "today" nicht finden

(Translates to Error in today(): Could not find function "today").

To activate the package, you use the function library(). This function activates the package for your current R session, so you have to do this once per session (a session starts when you open R/Rstudio and ends when you close the window).

library(lubridate)
today()
[1] "2024-01-09"

As you can see, the function today() is an example of a function that doesn’t need any arguments. Nevertheless, you have to call it with brackets () to indicate that you’re calling a function rather than a variable called today. Most packages print some sort of information into the console when they are loaded with library(). Don’t be alarmed by the red color: all of R’s messages, warnings and errors are printed in red. The only messages you should be worried about for this course are the ones starting with Error in:, the rest can safely be ignored for now. However, warning messages can be informative if they appear.
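By the way, if you type the function name without the brackets, R does not call the function but prints its definition instead:

today #without brackets R shows the code of the function instead of the current date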

2.3 Loading data into R

Most of the time you will want to use R on your own data, so the first thing you’ll usually do when starting an analysis is to load your data into R. There are generally two ways of loading data into R: either your data is available in an R format such as an .RData file, or your data comes in some non-R format. The former mostly happens when you have saved your R workspace for later; the latter will be needed more frequently. For almost every data format there is a dedicated importing function in R. We’ll start by showing you how to read in non-R formats and then show you how to save and load your R workspace. For demonstration purposes, we have provided the NINDS data set (Marler et al. 1995) you have encountered in the lecture in two formats: NINDS.csv and NINDS.xlsx.

2.3.1 The working directory

Before we can show you how to import data, you have to get to know another important concept: the working directory. The working directory is basically the folder on your computer where R looks for files to import and where R will create files if you save something. To find out what the current working directory is, use:

getwd() #no arguments needed

R should now print the path to your current working directory to the console. To change it you can use RStudio. Click Session > Set Working Directory > Choose Directory… in the toolbar in the upper left of your window. You can then navigate to the folder of your choice and click Open.

Now you will see that R prints setwd("<Path to the chosen directory>") to the console. This shows you how you can set your working directory without clicking: you use the function setwd() and put the correct path in it. Note that R uses / to separate folders, unlike Windows, which uses \.
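As a minimal sketch (the path is just a placeholder, use a folder that actually exists on your computer):

setwd("C:/Users/yourname/Documents/R_course") #hypothetical path, adjust to your system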

Check if it worked by rerunning getwd(). You should now put the data files NINDS.csv and NINDS.xlsx in the folder you have chosen as your working directory.

2.3.2 Reading data

The comma-separated values (.csv) format is probably the most widely used format in the open-source community. CSV files can be read into R using the read.csv() function, which expects a csv file that uses , as a separator and . as a decimal point. If you happen to have a German file that uses ; and , instead, you have to use read.csv2(). Here, however, we will use the standard csv format:

read.csv("NINDS.csv")

We haven’t printed the result in this document because it is too long, but if you execute the code yourself you can see that the read.csv() function prints the entire data set (possibly truncated) into the console. If you want to work with the data, it makes sense to store it in a variable:

NINDS_csv <- read.csv("NINDS.csv")

You can now see the data.frame in the environment window. To show you at least one other importing function, we have provided the exact same data set as an Excel file. To read this file, you first need to install a package with functions for Excel files, for example the package openxlsx (Schauberger and Walker 2023):

install.packages("openxlsx") #only do this once
library(openxlsx)
NINDS_xlsx <- read.xlsx("NINDS.xlsx")

If you have another kind of file, just search online for "read R" plus your file type and you will most likely find an R package for just that.

You should now be able to see NINDS_xlsx and NINDS_csv, two identical data.frames, in your environment. Since they are identical, we will work with NINDS_csv from here on.

2.3.3 Saving and loading R data

Sometimes you have worked on some data and want to be able to use your R objects in a later session. In this case, you can save your workspace (the objects listed under Environment) using save() or save.image(). save() takes the names of the objects you want to save and a name for the file they are saved in. save.image() simply saves all of the R objects in your workspace, so you only have to provide the file name:

save(NINDS_csv, file="NINDS.RData") #saves only NINDS_csv
save.image(file="my_workspace.RData") #saves entire workspace

When you open a new R session later and want to pick up where you left off, you can load the data with load():

load("my_workspace.RData")

If you want to save a data.frame in some non-R format, almost every read function has a corresponding write function. The most versatile is write.table(), which will write a text-based format, like a tab-separated file or a csv, depending on what you supply in the sep argument.
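For example, a sketch of writing NINDS_csv back to disk in both formats (the file names are just examples):

write.csv(NINDS_csv, file = "NINDS_copy.csv", row.names = FALSE) #comma-separated
write.table(NINDS_csv, file = "NINDS_copy.txt", sep = "\t", row.names = FALSE) #tab-separated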

2.4 Descriptive analysis

Now that we’ve loaded data into R, let’s start with some actual statistics. To get an overview of your data.frame, you can first have a look at it using the View() function:

View(NINDS_csv)

The data set contains the following variables:

  • record A registration number
  • AGE Age in years
  • BGENDER Patient’s gender
  • TDX Presumptive Diagnosis at time of treatment
  • BDIAB History of diabetes at baseline
  • BNINETY Barthel index at 90 days
  • BHYPER History of hypertension at baseline
  • BRACE Patient’s race
  • DCENSOR Indicates if patient died during trial follow-up
  • GOS6M Glasgow at 6 months
  • HOUR2 NIH stroke scale at 2 hours
  • HOUR24 NIH stroke scale at 24 hours
  • SEVENTEN NIH stroke scale at 7-10 days
  • NINETY NIH stroke scale at 90 days
  • NIHSSB NIH stroke scale at baseline
  • PART Designation of Trial (Part 1 or 2)
  • SURDAYS Days from randomization to death/censored
  • TREATCD Treatment code
  • TWEIGHT Estimated weight at randomization
  • WEIGHT Actual weight measured after randomization
  • STATUS24 Dichotomised NIH stroke scale at 24 hours
  • STATUS2 Dichotomised NIH stroke scale at 2 hours

2.4.1 Data types in data.frames

In the lecture you have learned about different data types that variables can have. Here is an overview of the data types R uses to represent these variable types:

Variable type   R data type                    Example variable in NINDS_csv
Metric          Numeric/Integer                WEIGHT/SURDAYS
Ordinal         Integer/Ordered factor         GOS6M
Nominal         Unordered factor/character     TREATCD

You are already familiar with the numeric and the character. The integer is a special case of a numeric that only contains whole numbers. The factor, on the other hand, is a data type that is used to represent categorical variables. It is similar to a character but has only a limited number of values, the so-called factor levels. The patient’s name, for example, would be a character because there is a potentially infinite number of values this variable could take. The variable TREATCD, on the other hand, can be used as a factor, because the only two values it can take in our data set are Placebo and t-PA.

TREATCD would be an unordered factor, that is, a nominal variable. To represent ordinal variables, you can use the ordered factor that implies an ordering of the factor levels.

You can get an overview of the variable types in the NINDS_csv data.frame by clicking the little blue icon with the triangle next to the NINDS_csv object in the environment window in the upper right corner of RStudio.

When you read the data into R without specifying the data type of every column, R will try to guess them, usually ending up with numeric or integer for all variables containing only numbers and character for variables containing letters or other symbols.

If you want to specify the classes for the read-in process, you can usually pass the argument colClasses with a character vector containing the types for all your variables to the reading function. Because that can be quite a long vector when you have a lot of variables, it is often easier to just let R guess the types and correct them later if necessary.
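As a sketch, assuming a hypothetical file with three columns (an ID, an age and a treatment group):

read.csv("my_data.csv", colClasses = c("character", "numeric", "factor")) #hypothetical file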

TREATCD for example has been read in as a character as you can see using the class() function:

class(NINDS_csv$TREATCD)
[1] "character"

You can turn it into a factor like this:

NINDS_csv$TREATCD <- factor(NINDS_csv$TREATCD)

The factor levels (i.e. the values your newly built factor variable can take) can be extracted with levels():

levels(NINDS_csv$TREATCD)
[1] "Placebo" "t-PA"   

Similarly, one could argue that GOS6M should be an ordered factor with Good < Mod. Dis < Sev. Dis < Veget < Dead, but currently it is represented as character.

We can fix that like this:

NINDS_csv$GOS6M <- factor(NINDS_csv$GOS6M, ordered = TRUE, 
                          levels = c("Good", "Mod. Dis", 
                                     "Sev. Dis", "Veget", "Dead"))

factor() creates a factor variable from a character vector or an existing factor, ordered = TRUE tells the function to make the factor ordered, and the levels= argument specifies the correct order of the levels. With the assignment operator <- we overwrite the old versions of GOS6M and TREATCD with the new ones. The line breaks are just there to make the code easier to read. They don’t change the functionality; just make sure to mark all lines when executing the code.
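You can check that the conversion worked with the functions you already know:

class(NINDS_csv$GOS6M)  #should now return "ordered" "factor"
levels(NINDS_csv$GOS6M) #returns the levels in the order specified above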

If you know beforehand that most of the character variables in your data.frame should actually be factors, you can specify this when reading the data in using the argument stringsAsFactors = TRUE:

NINDS_csv <- read.csv("NINDS.csv", stringsAsFactors = TRUE)

Have a look at how the description of the data.frame in the environment window changes after running this line of code!

2.4.2 Missing values

When scrolling through your data, you might have noticed that some cells contain NA as a value. NA stands for Not Available and is the value R uses to represent missing values. If you have read in your data from other formats, you might have to check how missing values were coded there and give that information to the read-in function to make sure they are turned into NA.
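For read.csv(), for example, this can be done with the na.strings argument. A sketch, assuming a hypothetical file in which missing values were coded as -99 or the word missing:

read.csv("my_data.csv", na.strings = c("-99", "missing")) #hypothetical file and missing value codes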

In most computations you’ll have to tell R explicitly how to deal with these values (e.g. remove them before computation), else you’ll get NA as a result. For example, if you want to compute the mean of the variable TWEIGHT, which contains missing values, you have to set the argument na.rm=TRUE, where na.rm stands for NA remove:

#default for na.rm is FALSE so NA's are not removed
mean(NINDS_csv$TWEIGHT) 
[1] NA
#this way, NA's are removed before computation
mean(NINDS_csv$TWEIGHT, na.rm=TRUE)
[1] 78.37432
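If you want to know how many values are missing, you can combine is.na() (which returns TRUE for every NA) with sum():

sum(is.na(NINDS_csv$TWEIGHT)) #number of missing values in TWEIGHT
[1] 1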

2.4.3 Numerical description

There are functions for many descriptive measures you might want to compute. Most of the time the function name gives a good clue about what the function does. We’ll go through the most common ones in the following paragraphs.

Measures of central tendency

Measures of central tendency tell us where the majority of values in the distribution are located. Let’s compute the mean, median and all quartiles of the AGE variable:

mean(NINDS_csv$AGE) #mean
[1] 66.94177
median(NINDS_csv$AGE) #median
[1] 68.69141
quantile(NINDS_csv$AGE) #gives all quartiles
      0%      25%      50%      75%     100% 
26.48927 60.12106 68.69141 75.37464 89.00000 
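The quantile() function can also compute other quantiles via its probs argument, e.g. the 10th and 90th percentile:

quantile(NINDS_csv$AGE, probs = c(0.1, 0.9)) #10th and 90th percentile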

Measures of dispersion

Measures of dispersion describe the spread of the values around a central value. Here you can see how to compute the variance, the standard deviation and the range of a variable. To get the interquartile range, just pick the 25th and 75th percentile from the quantile() function above!

var(NINDS_csv$AGE) #variance
[1] 135.7818
sd(NINDS_csv$AGE) #standard deviation
[1] 11.65254
range(NINDS_csv$AGE) #range
[1] 26.48927 89.00000
min(NINDS_csv$AGE) #minimum
[1] 26.48927
max(NINDS_csv$AGE) #maximum
[1] 89
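Base R also provides the function IQR(), which computes the interquartile range (the difference between the 75th and the 25th percentile) directly:

IQR(NINDS_csv$AGE) #interquartile range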

Measures of association

Measures of association describe the relationship between two or more variables. In this case there is more than one way to deal with missing values, so instead of the na.rm argument, here we have the argument use= to specify which values to use in case of missing values. For the simple case of looking at correlations between two variables, you can set this argument to use = "complete.obs" which means that only cases without missing values go into the computation. If an observation (i.e. a patient) has NA on at least one of the two variables, this observation is excluded from the computation.

We can compute the Pearson and the Spearman correlation of the actual and the estimated weight of the NINDS patients using the cor() function:

cor(NINDS_csv$WEIGHT, NINDS_csv$TWEIGHT, use = "complete.obs",
    method = "pearson")
[1] 0.9313269
cor(NINDS_csv$WEIGHT, NINDS_csv$TWEIGHT, use = "complete.obs",
    method = "spearman")
[1] 0.9362007

For the association of categorical variables, you’ll mostly want to look at the frequency tables of the categories. A frequency table for a single variable is produced like this:

table(NINDS_csv$TREATCD) #gives absolute frequencies

Placebo    t-PA 
    312     312 
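If you are interested in relative instead of absolute frequencies, you can wrap the table in prop.table():

prop.table(table(NINDS_csv$TREATCD)) #gives relative frequencies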

But you can also use table() to generate cross tables for two variables:

table(NINDS_csv$BRACE, NINDS_csv$BGENDER)
                     
                      female male
  Asian                    5    3
  Black                   75   94
  Hispanic                16   21
  Other                    1    6
  White, non-Hispanic    165  238

General overview

If you want to get an overview of your entire data.frame, the summary() function is convenient. This function can be used for a lot of different kinds of R objects and gives a summary appropriate for whatever the input is. If you give it a data.frame, summary() will give the minimum and maximum value, the 1st and 3rd quartile, the mean and the median for every quantitative (i.e. numeric/integer) variable, and a frequency table for every factor, as well as the number of missing values:

summary(NINDS_csv)
     record             AGE          BGENDER                        TDX     
 Min.   :  33656   Min.   :26.49   female:262   Cardioembolic         :273  
 1st Qu.:2689566   1st Qu.:60.12   male  :362   Large vessel occlusive:252  
 Median :4968180   Median :68.69                Other                 : 18  
 Mean   :4980392   Mean   :66.94                Small vessel occlusive: 81  
 3rd Qu.:7276841   3rd Qu.:75.37                                            
 Max.   :9987047   Max.   :89.00                                            
                                                                            
  BDIAB        BNINETY        BHYPER                    BRACE     DCENSOR  
 No  :489   Min.   :  0.00   No  :209   Asian              :  8   No :456  
 Yes :131   1st Qu.: 10.00   Yes :408   Black              :169   Yes:168  
 NA's:  4   Median : 85.00   NA's:  7   Hispanic           : 37            
            Mean   : 62.59              Other              :  7            
            3rd Qu.:100.00              White, non-Hispanic:403            
            Max.   :100.00                                                 
                                                                           
      GOS6M         HOUR2           HOUR24         SEVENTEN         NINETY     
 Dead    :153   Min.   : 0.00   Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
 Good    :231   1st Qu.: 6.00   1st Qu.: 4.00   1st Qu.: 2.00   1st Qu.: 1.00  
 Mod. Dis:133   Median :12.00   Median :11.00   Median : 8.00   Median : 6.00  
 Sev. Dis:105   Mean   :12.76   Mean   :12.33   Mean   :12.08   Mean   :12.85  
 Veget   :  2   3rd Qu.:18.00   3rd Qu.:19.00   3rd Qu.:18.00   3rd Qu.:18.00  
                Max.   :38.00   Max.   :42.00   Max.   :42.00   Max.   :42.00  
                NA's   :1       NA's   :1       NA's   :9                      
     NIHSSB           PART          SURDAYS          TREATCD   
 Min.   : 1.00   Min.   :1.000   Min.   :   0.0   Placebo:312  
 1st Qu.: 9.00   1st Qu.:1.000   1st Qu.: 242.8   t-PA   :312  
 Median :14.00   Median :2.000   Median : 366.0                
 Mean   :14.79   Mean   :1.534   Mean   : 359.3                
 3rd Qu.:20.00   3rd Qu.:2.000   3rd Qu.: 378.0                
 Max.   :37.00   Max.   :2.000   Max.   :1970.0                
                                                               
    TWEIGHT           WEIGHT       STATUS24   STATUS2   
 Min.   : 39.00   Min.   : 41.18   high:300   high:129  
 1st Qu.: 68.15   1st Qu.: 65.88   low :283   low :494  
 Median : 78.00   Median : 77.36   NA's: 41   NA's:  1  
 Mean   : 78.37   Mean   : 78.07                        
 3rd Qu.: 86.40   3rd Qu.: 88.00                        
 Max.   :179.50   Max.   :179.59                        
 NA's   :1                                              

2.4.4 Graphical description

For exploration it is also useful to plot the data. R has a number of options for plotting, ranging from simple plots which are quickly made to more elaborate functions and packages that allow you to produce more complex, publication ready plots with just a little more effort.

For this chapter we’ll start with the quick and easy ones, going from the most broadly applicable plots, which can be used for all types of data, to more specialised ones that can only be used for data with a certain scaling.

We’ll start by introducing the basic plot functions without any customization of labels or axes to give you an overview. When you create plots you want to share, you should of course improve them as, e.g., shown in the last paragraph of the chapter.

Barplot

A barplot can technically be used on every variable with a finite set of values. The barplot() function takes a frequency table and produces a barplot from it.

barplot(table(NINDS_csv$BRACE))

If you give it a cross table, you get stacked barplots:

barplot(table(NINDS_csv$BGENDER, NINDS_csv$BRACE))
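By default the bars are stacked. Setting the argument beside = TRUE produces grouped bars instead, and legend.text = TRUE adds a legend for the splitting variable:

barplot(table(NINDS_csv$BGENDER, NINDS_csv$BRACE), beside = TRUE, legend.text = TRUE)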

Histogram

If you have a metric variable, you can also use the histogram:

hist(NINDS_csv$AGE)

Boxplot

If you have data that is at least ordinal, you can use the boxplot() function:

boxplot(NINDS_csv$SURDAYS)

You can also split the boxplot by another (categorical) variable using the ~ sign:

boxplot(NINDS_csv$SURDAYS ~ NINDS_csv$BGENDER)
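The same plot can be written a bit more compactly by using the data argument of boxplot(), so you don’t have to repeat the name of the data.frame:

boxplot(SURDAYS ~ BGENDER, data = NINDS_csv)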

Scatterplot

And we can use scatterplots to get an idea about the relationship of two metric variables:

plot(NINDS_csv$TWEIGHT, NINDS_csv$WEIGHT)

Customisation

Even these basic plots come with a whole lot of customisation options. As an example, we’ll show you a couple of them for the histogram. You can find out about all possible options by going to the help page of the respective function (e.g. ?hist).

#change the number of breaks
hist(NINDS_csv$AGE, breaks = 5) 

#Add customized x axis and Title
hist(NINDS_csv$AGE, xlab = "Age of subjects in years", main = "My Title") 

#change the color
hist(NINDS_csv$AGE, col = "blue") 
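These options can of course be combined in a single call:

hist(NINDS_csv$AGE, breaks = 5, col = "blue",
     xlab = "Age of subjects in years", main = "My Title")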

All the plotting functions we have just shown you are useful because they are easy to use. In a later chapter we will introduce the package ggplot2 (Wickham et al. 2023), which allows you to produce more complex, publication-ready plots.

2.5 Troubleshooting

  1. Error in file(file, “rt”) : cannot open the connection […] No such file or directory : The file you are trying to open probably doesn’t exist. Check if you spelled the file name correctly. Also check if the working directory actually contains the file you are trying to read.

  2. Error in library(“xy”) : there is no package called ‘xy’ : You either misspelled the package name or you haven’t installed the package yet.

  3. Error in install.packages: object ‘xy’ not found : Have you forgotten to put quotation marks around the package name?

  4. Error in install.packages: package ‘xy’ is not available (for R version x.x.x): Either you misspelled the package name or the package does not exist, or it does not exist for your R version.

  5. Error in plot.new() : figure margins too large : The plot window in the lower right corner of RStudio is too small to display the plot. Make it bigger by dragging its left margin further to the left and rerun the plotting function.

References

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Marler, J. R., T. Brott, J. Broderick, R. Kothari, M. Odonoghue, W. Barsan, and et al. 1995. “Tissue Plasminogen Activator for Acute Ischemic Stroke.” New England Journal of Medicine 333 (24): 1581–88. https://doi.org/10.1056/NEJM199512143332401.
Schauberger, Philipp, and Alexander Walker. 2023. Openxlsx: Read, Write and Edit Xlsx Files. https://ycphs.github.io/openxlsx/index.html.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2023. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.