Chapter 2 Getting Started With Data in R
2.1 Loading Data
There are many functions in R for loading data of various formats:
- read.csv() for csv files (comma separated values)
- read.delim() for tsv files (tab separated values)
- read_excel() for Excel files (from the readxl package)
- read.dta13() for Stata dta files (from the readstata13 package)
- read.table() for general delimited text files (read.csv() and read.delim() are wrappers around it)
When loading data you must specify the exact file path in the argument (see below). If you don’t know how to find your file path, give it a google. Remember to name your dataset (i.e. assign your dataset to an object).
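As a rough, self-contained sketch of loading a file from a path and assigning it to an object, the code below writes a small made-up tab-separated file to a temporary location and reads it back with base R. The file contents and the names tsv_path, scores, and scores_tbl are illustrative assumptions only.

```r
# Create a small, made-up tab-separated file in a temporary location
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tscore", "1\t3.5", "2\t4.0"), tsv_path)

# read.delim() assumes tab separators and a header row by default
scores <- read.delim(tsv_path)

# read.table() reads the same file once sep and header are stated explicitly
scores_tbl <- read.table(tsv_path, sep = "\t", header = TRUE)

# Compare the two results
all.equal(scores, scores_tbl)
```

With your own data, replace tsv_path with the path to your file, for example a relative path such as './data/myfile.tsv' (a hypothetical name).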
The following code loads a csv file with data on the body temperature, heart rate, and gender of 130 subjects.[1] Locally the file is called ‘normtemp.csv’ and it is assigned to an object called cardiacdata:
cardiacdata <- read.csv('./data/normtemp.csv', header = TRUE)
Note the extra argument, header = TRUE, which specifies that the first row of the dataset is a header. If your dataset has no header you should specify header = FALSE.[2]
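To see what the header argument changes, here is a minimal sketch that writes a tiny made-up csv to a temporary file and reads it back both ways. The data values and the names tmp, with_header, and without_header are assumptions for illustration, not part of the normtemp data.

```r
# Write a tiny, made-up csv to a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("gender,bodytemp,heartrate",
             "1,98.6,72",
             "2,98.2,75"), tmp)

# header = TRUE: the first row supplies the column names
with_header <- read.csv(tmp, header = TRUE)
colnames(with_header)
dim(with_header)

# header = FALSE: the first row is treated as data,
# and R names the columns V1, V2, V3
without_header <- read.csv(tmp, header = FALSE)
colnames(without_header)
dim(without_header)
```

Reading a file that does have a header with header = FALSE adds an extra row of text, which also forces every column to be read as character.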
2.2 Viewing Data
To view the entire dataset, use the View() command in the console. A table view of the dataset will open as a new tab. Don’t use the View() command for large datasets as it is very memory intensive.
Another way to see your data is to print the first or last few rows using the head() or tail() function. You can specify exactly how many rows as an additional argument (by default it will print six):
head(cardiacdata, n = 5)
##   gender bodytemp heartrate
## 1      2     96.3        70
## 2      2     96.7        71
## 3      2     96.9        74
## 4      2     97.0        80
## 5      2     97.1        73
To check the column names of your dataset, use colnames():
colnames(cardiacdata)
## [1] "gender" "bodytemp" "heartrate"
To check the dimensions of your dataset (number of rows and columns), use dim():
dim(cardiacdata)
## [1] 130 3
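A few related base R functions are handy alongside the ones above. The sketch below uses a small made-up data frame called toy (an assumption for illustration) in place of a loaded dataset.

```r
# A small made-up data frame standing in for a loaded dataset
toy <- data.frame(id = 1:10, value = seq(0.5, 5, by = 0.5))

tail(toy, n = 3)   # last three rows
nrow(toy)          # number of rows, same as dim(toy)[1]
ncol(toy)          # number of columns, same as dim(toy)[2]
str(toy)           # compact summary: dimensions, column names and types
```

str() is often a cheaper alternative to View() for a first look at a large dataset, since it prints only a short summary.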
2.3 Basic Data Structures in R
In statistics a variable usually refers to something that is measured. Data are observations of a variable. In the cardiac dataset, each column contains data pertaining to a specific variable. Data can be continuous, discrete, or categorical.
- Continuous data can take on any value within a range (real numbers). E.g. the bodytemp variable in the cardiac dataset.
- Discrete data can take on countable values only (integer numbers). E.g. the heartrate variable in the cardiac dataset.
- Categorical data fall into a finite number of categories or distinct groups. E.g. the gender variable in the cardiac dataset.[3]
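Numerically coded categories can be marked as categorical in R with factor(). The following is a minimal sketch using made-up codes (gender_codes is a hypothetical vector, not the actual column from the dataset):

```r
# Made-up numeric codes standing in for a column like gender
gender_codes <- c(1, 2, 2, 1, 2)

# factor() tells R to treat the codes as categories rather than quantities
gender_factor <- factor(gender_codes)

levels(gender_factor)   # the distinct categories
table(gender_factor)    # number of observations in each category
class(gender_factor)
```

Treating such a column as a factor prevents it from being averaged or otherwise handled as a quantity by mistake.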
Everything in R is an object. When you load data into R, it is parsed into objects. Every object has a data type, and different forms of data are parsed into different data types.
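Below are five elementary data types in R (a sixth, raw, is rarely needed in data analysis):
- character – e.g. 'abcd'
- integer – integer numbers, e.g. 2L (the L suffix marks the value as an integer; a bare 2 is stored as numeric)
- numeric – decimal numbers, e.g. 2.21
- complex – complex numbers, e.g. 2+2i
- logical – either TRUE or FALSE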
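These types can be checked directly in the console by passing literal values to class(). Note that only 'abcd' is quoted; putting quotes around a number would make it a character.

```r
class('abcd')   # "character"
class(2L)       # "integer" (the L suffix requests an integer)
class(2)        # "numeric" (numbers without L are stored as doubles)
class(2.21)     # "numeric"
class(2+2i)     # "complex"
class(TRUE)     # "logical"
class('2')      # "character" (quotes make it text, not a number)
```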
Objects may be combined to form larger data structures. Some common ones:
- vector – a one-dimensional array; there are two kinds of vectors:
  - atomic vector – holds data of a single data type
  - list – holds data of multiple data types
- matrix – a two-dimensional array; all elements have the same data type
- data frame – a two-dimensional array; columns may have different data types
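The structures above can be built by hand as follows. This is only a sketch; the object names (v, l, m, df, mixed) and values are made up for illustration.

```r
# Atomic vector: every element has the same type
v <- c(1.5, 2.5, 3.5)

# List: elements may have different types
l <- list(name = "a", count = 2L, flag = TRUE)

# Matrix: two-dimensional, all elements share one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: two-dimensional, each column may have its own type
df <- data.frame(label = c("x", "y"), value = c(1.1, 2.2))

# Mixing types in an atomic vector coerces everything to one type
mixed <- c(1, "a", TRUE)
class(mixed)   # "character"

dim(m)    # rows and columns of the matrix
dim(df)   # rows and columns of the data frame
```

The coercion in mixed shows why a list or data frame is needed when values of different types must be kept together.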
You can check an object’s data type using class():
class(cardiacdata)
## [1] "data.frame"
The data frame is the most common structure for tabular data. To check the data type of the column bodytemp:
class(cardiacdata$bodytemp)
## [1] "numeric"
Similarly for the column heartrate:
class(cardiacdata$heartrate)
## [1] "integer"
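To check every column at once rather than one at a time, one option is to apply class() across the data frame with sapply(). The sketch below uses a made-up two-row data frame (toy) with the same column names as the example dataset; the values are illustrative only.

```r
# Made-up data frame with the same column names as the example dataset
toy <- data.frame(gender = c(1L, 2L),
                  bodytemp = c(98.6, 98.2),
                  heartrate = c(72L, 75L))

# Apply class() to every column; the result is a named character vector
sapply(toy, class)
```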
[1] Download the data here. Excuse its mediocrity; there will (hopefully) be more interesting examples in chapters to come. To see where the data actually comes from, click here.
[2] Here the = operator is not used for variable assignment, but rather to specify a parameter for the read.csv() function (this is the main difference between the <- and = operators).
[3] Even though gender is numerically coded in the cardiac dataset, it is still categorical: the values 1 and 2 are labels for two groups rather than quantities.