Chapter 2 Getting Started With Data in R


2.1 Loading Data

There are many functions in R for loading data of various formats:

  • read.csv() for csv files (comma separated values)
  • read.delim() for tsv files (tab separated values)
  • read_excel() for Excel files (from the readxl package)
  • read.dta13() for dta files (Stata data, from the readstata13 package)
  • read.table() for general delimited text files (read.csv() is a wrapper around it)
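As a minimal sketch of how the base R readers relate to each other, the snippet below writes a tiny tab-separated file to a temporary location and reads it back twice, once with read.delim() and once with read.table(). The file contents and the object names (tmp, tsv_a, tsv_b) are made up for illustration.

```r
# Write a tiny tab-separated file to a temporary path (illustrative data only)
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tscore", "1\t3.5", "2\t4.0"), tmp)

# read.delim() assumes tab separators and a header row by default
tsv_a <- read.delim(tmp)

# read.table() is the general reader; separator and header are stated explicitly
tsv_b <- read.table(tmp, sep = "\t", header = TRUE)

# Compare the two results; for this simple file they should match
all.equal(tsv_a, tsv_b)
```

For simple files like this one the choice is mostly a matter of convenience: the specialised readers just preset the separator and header arguments.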

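When loading data you must pass the file path as the first argument (see below). A relative path, i.e. one starting with ./, is resolved against R's current working directory, which you can print with getwd(). If you don’t know how to find your file path, give it a google. Remember to name your dataset (i.e. assign it to an object).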
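If you are unsure whether R can see your file, a quick check along the following lines may help. This is only a sketch: csv_path is a name chosen here, and the data/normtemp.csv location is assumed to match the example used in this chapter.

```r
# Show the directory that relative paths are resolved against
getwd()

# Build the path piece by piece (assumed location; adjust to your own setup)
csv_path <- file.path("data", "normtemp.csv")

# Only try to read the file if it is actually there
if (file.exists(csv_path)) {
  cardiacdata <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path)
}
```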

The following code loads a csv file with data on the body temperature, heart rate, and gender of 130 subjects.[1] Locally the file is called ‘normtemp.csv’ and it is assigned to an object called cardiacdata:

cardiacdata <- read.csv('./data/normtemp.csv', header = TRUE)

Note the extra argument, header = TRUE, which specifies that the first row of the dataset is a header. This is already the default for read.csv(), so it is written out here only for clarity. If your dataset has no header you should specify header = FALSE.[2]
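To see what header = FALSE does, here is a small sketch that parses two header-less lines laid out like the cardiac data. It uses the text argument of read.csv() so that no file is needed; the object names (raw_lines, no_header, named) are made up for this example.

```r
# Two lines of comma separated values with no header row
raw_lines <- c("2,96.3,70", "2,96.7,71")

# text = lets read.csv() parse a character vector instead of a file
no_header <- read.csv(text = raw_lines, header = FALSE)
colnames(no_header)  # R falls back to the default names V1, V2, V3

# Column names can be supplied explicitly with col.names
named <- read.csv(text = raw_lines, header = FALSE,
                  col.names = c("gender", "bodytemp", "heartrate"))
colnames(named)
```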

 


2.2 Viewing Data

To view the entire dataset, use the View() command in the console. A table view of the dataset will open as a new tab. Don’t use the View() command for large datasets as it is very memory intensive.

Another way to see your data is to print the first or last few rows using the head() or tail() function. You can specify exactly how many rows as an additional argument (by default it will print six):

head(cardiacdata, n = 5)
##   gender bodytemp heartrate
## 1      2     96.3        70
## 2      2     96.7        71
## 3      2     96.9        74
## 4      2     97.0        80
## 5      2     97.1        73
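tail() works the same way from the other end. The sketch below uses a small stand-in data frame called demo, built from the five rows printed above, so that it runs without the csv file; with the real data you would pass cardiacdata instead.

```r
# Stand-in for cardiacdata: the five rows printed above
demo <- data.frame(
  gender    = c(2L, 2L, 2L, 2L, 2L),
  bodytemp  = c(96.3, 96.7, 96.9, 97.0, 97.1),
  heartrate = c(70L, 71L, 74L, 80L, 73L)
)

first_two <- head(demo, n = 2)   # first two rows
last_two  <- tail(demo, n = 2)   # last two rows

# A negative n means "everything except": here, all rows but the last one
all_but_last <- head(demo, n = -1)
```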

To check the column names of your dataset, use colnames():

colnames(cardiacdata)
## [1] "gender"    "bodytemp"  "heartrate"

To check the dimensions of your dataset (number of rows and columns), use dim():

dim(cardiacdata)
## [1] 130   3
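A few related helpers are worth knowing. The sketch below uses a three-row stand-in data frame (demo, a name chosen here) rather than the full dataset, so the numbers it reports differ from the 130 rows above.

```r
# Stand-in for cardiacdata: the first three rows shown earlier
demo <- data.frame(
  gender    = c(2L, 2L, 2L),
  bodytemp  = c(96.3, 96.7, 96.9),
  heartrate = c(70L, 71L, 74L)
)

n_rows <- nrow(demo)  # number of rows only
n_cols <- ncol(demo)  # number of columns only

# str() gives a compact overview: dimensions, column names, types, first values
str(demo)
```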

 


2.3 Basic Data Structures in R

In statistics a variable usually refers to something that is measured. Data are observations of some variable; in the cardiac dataset, each column contains data pertaining to a specific variable. Data can be continuous, discrete, or categorical.

  • Continuous data can take on any value within a range (real numbers). E.g. the bodytemp variable in the cardiac dataset.
  • Discrete data can take on only separate, countable values (integer numbers). E.g. the heartrate variable in the cardiac dataset.
  • Categorical data fall into a finite number of categories or distinct groups. E.g. the gender variable in the cardiac dataset.[3]
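In R, categorical data is usually stored as a factor. As a hedged sketch, the snippet below converts a made-up vector of numeric codes into a factor. The labels "group 1" and "group 2" are placeholders, since this chapter does not say which gender each code stands for.

```r
# Made-up numeric codes in the style of the gender column
gender_codes <- c(2, 2, 1, 1, 2)

# factor() turns the codes into categories; the labels here are placeholders
gender_factor <- factor(gender_codes,
                        levels = c(1, 2),
                        labels = c("group 1", "group 2"))

levels(gender_factor)  # the distinct categories
table(gender_factor)   # how many observations fall into each category
```

Storing the column as a factor tells R not to treat the codes as quantities, so it will not, for example, compute a meaningless average of them by accident.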

Everything in R is an object. When you load data into R, it is parsed into objects. Every object has a data type, and different forms of data are parsed into different data types.

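Below are the five most commonly used elementary data types in R (a sixth, raw, holds bytes and is rarely needed):

  • character – text strings, e.g. 'abcd'
  • integer – integer numbers, e.g. 2L (the L suffix marks an integer; a bare 2 is stored as numeric)
  • numeric – decimal numbers, e.g. 2.21
  • complex – complex numbers, e.g. 2+2i
  • logical – either TRUE or FALSE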
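You can confirm these types yourself. The sketch below collects one example value of each type in a list and asks R for the class of each; the names values and types are chosen for this example only.

```r
# One example value per type; note that 2L is an integer but a bare 2 is numeric
values <- list("abcd", 2L, 2, 2.21, 2 + 2i, TRUE)

# Apply class() to every element and collect the results in a character vector
types <- sapply(values, class)
types
## [1] "character" "integer"   "numeric"   "numeric"   "complex"   "logical"
```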

Objects may be combined to form larger data structures. Some common ones:

  • vector – a one-dimensional array; there are two kinds of vectors:
    • atomic vector – holds data of a single data type
    • list – holds data of multiple data types
  • matrix – a two-dimensional array; all elements have the same data type
  • data frame – a two-dimensional array; columns may have different data types
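The structures listed above can be built by hand as follows. This is a minimal sketch with made-up values and arbitrary object names.

```r
# Atomic vector: every element has the same type
v <- c(1.5, 2.5, 3.5)

# Mixing types in c() does not fail; R silently converts everything to one type
mixed <- c(1, "a", TRUE)   # all three elements become character strings

# List: elements keep their own types
l <- list(1.5, "a", TRUE)

# Matrix: two dimensions, one type throughout
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: two dimensions, each column may have its own type
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Each structure has a matching test function
is.atomic(v); is.list(l); is.matrix(m); is.data.frame(df)
```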

You can check an object’s class (its data structure or, for an atomic vector, its data type) using class():

class(cardiacdata)
## [1] "data.frame"

The data frame is indeed a common structure for tabular data. To check the data type(s) in the column bodytemp:

class(cardiacdata$bodytemp)
## [1] "numeric"

Similarly for the column heartrate:

class(cardiacdata$heartrate)
## [1] "integer"
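Rather than checking columns one at a time, you can ask for the class of every column at once. The sketch below does this on a three-row stand-in data frame (demo, a name chosen here) so that it runs without the csv file; with the real data you would pass cardiacdata instead.

```r
# Stand-in for cardiacdata: the first three rows shown earlier
demo <- data.frame(
  gender    = c(2L, 2L, 2L),
  bodytemp  = c(96.3, 96.7, 96.9),
  heartrate = c(70L, 71L, 74L)
)

# A data frame is a list of columns, so sapply() visits each column in turn
col_classes <- sapply(demo, class)
col_classes

# typeof() reports the underlying storage type; numeric columns are stored as "double"
typeof(demo$bodytemp)
```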

 


 


  1. Download the data here. Excuse its mediocrity; there will (hopefully) be more interesting examples in chapters to come. To see where the data actually comes from, click here.

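  2. Here the = operator is not used for variable assignment, but rather to pass a value to the header argument of the read.csv() function. Inside a function call, = always names an argument, whereas <- would perform an assignment (this is the main difference between the <- and = operators).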

  3. Even though gender is numerically coded in the cardiac dataset, it is still categorical: the two values, 1 and 2, are labels for groups rather than quantities.