Chapter 10 Loading Data

For most analyses that you conduct in R, the first step involves importing a data set into R. There are lots of different ways to load data into R, and many different types of data it can read too.

The datasets used in this section can be downloaded here as a zip folder. Save it somewhere you can easily find it.

We will be using the read.csv() function to do this, as our data is stored in CSV (comma-separated values) files. One thing to be mindful of here is the path to the file.

If you have saved the file within your current working directory, you can simply write:

books <- read.csv("books.csv", header = TRUE)

Note that the books dataset has now appeared in the Global Environment.
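
If the file is not in your working directory, read.csv() needs the path to it. The following is a minimal sketch under stated assumptions: the folder (a temporary directory) and the file name books_demo.csv are made up for illustration, and a tiny data set is written on the fly so that the code runs anywhere.

```r
# Where is R currently looking for files?
getwd()

# Illustration only: a temporary folder stands in for wherever you saved the data
data_dir <- tempdir()
csv_path <- file.path(data_dir, "books_demo.csv")  # hypothetical file name

# Write a two-row CSV so that this example is self-contained
write.csv(data.frame(comic = c(-44, 20), statbook = c(16, -14)),
          csv_path, row.names = FALSE)

# Check that the file exists, then read it using the full path
file.exists(csv_path)
books_demo <- read.csv(csv_path, header = TRUE)
books_demo
```

Building the path with file.path() avoids having to type the path separators by hand.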

There are other ways that you can do this too. For example, you can use the more general read.table() function, and for SPSS files you can use read.spss() from the foreign package or read_sav() from the haven package. Note that to use the SPSS import functions, you will need to load the relevant package first (more on this later).
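
As a rough illustration of how read.table() relates to read.csv(), the sketch below reads the same small file both ways. The file is a temporary one created only for this example; with read.table() the header and separator have to be stated explicitly, because its defaults differ from those of read.csv().

```r
# Create a small comma-separated file for illustration only
tmp <- tempfile(fileext = ".csv")
writeLines(c("comic,statbook", "-44,16", "20,-14"), tmp)

# read.csv() assumes a header row and a comma separator
via_csv <- read.csv(tmp)

# read.table() needs both stated explicitly
via_table <- read.table(tmp, header = TRUE, sep = ",")

# Compare the two results
all.equal(via_csv, via_table)
```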

Now that we have our data read into R, let's have a look at it. We might first want to see a breakdown of the data frame. We can do this by using the str() function.

str(books)

## 'data.frame':    500 obs. of  2 variables:
##  $ comic   : int  -44 20 0 -18 -19 13 16 14 -6 11 ...
##  $ statbook: int  16 -14 6 -13 7 -33 3 -7 -6 -3 ...

We can see that we have 2 variables, one called comic and one called statbook. Both are stored as integers (a numeric type), and there are 500 observations of each. We could also extract each piece of information with a single command:

ncol(books) #number of columns

## [1] 2

nrow(books) #number of rows

## [1] 500

colnames(books) #column names

## [1] "comic"    "statbook"

If you want a quick glance at the data, you can use the head() or tail() functions. If you really want to see all of your data, you can use the print() function.

head(books) #first 6 rows

##   comic statbook
## 1   -44       16
## 2    20      -14
## 3     0        6
## 4   -18      -13
## 5   -19        7
## 6    13      -33

tail(books) #last 6 rows

##     comic statbook
## 495    11        3
## 496     9       13
## 497    10      -10
## 498    11      -22
## 499    10      -10
## 500    26       10
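
Both head() and tail() take an n argument if six rows is not what you want. The sketch below uses a small made-up data frame (demo is not one of the chapter's data sets) so that it runs on its own.

```r
# Small illustrative data frame, not the real books data
demo <- data.frame(comic    = c(-44, 20, 0, -18, -19, 13, 16, 14),
                   statbook = c(16, -14, 6, -13, 7, -33, 3, -7))

head(demo, n = 3)  # first 3 rows
tail(demo, n = 2)  # last 2 rows
dim(demo)          # number of rows and columns in one call
```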

10.1 Practical Example

(Example is partially adapted from A. Field, “Discovering statistics using R”, Sage, chapter 10, p. 400)

The example contains data on what contributes to pain relief for patients. It compares the effects of administering a sugar pill (placebo condition, dose code = 1), a low dose of a drug such as ibuprofen (dose code = 2), or a high dose of the same drug (dose code = 3).

We surveyed 15 participants and thus have two main variables. Condition: 1 (placebo), 2 (low dose of ibuprofen), 3 (high dose of ibuprofen). Pain level (effect): measured on a scale of 1 to 10.

Now it's over to you: read in the ‘dose.csv’ file, check the type of data you have, and set levels and labels where appropriate. You should also explore the data using the commands that we used above.

#Read in data
exp <- read.csv("dose.csv", header = TRUE)
exp

##    ID dose effect
## 1   1    1      3
## 2   2    1      2
## 3   3    1      1
## 4   4    1      1
## 5   5    1      4
## 6   6    2      5
## 7   7    2      2
## 8   8    2      4
## 9   9    2      2
## 10 10    2      3
## 11 11    3      7
## 12 12    3      4
## 13 13    3      5
## 14 14    3      3
## 15 15    3      6

str(exp)

## 'data.frame':    15 obs. of  3 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dose  : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ effect: int  3 2 1 1 4 5 2 4 2 3 ...

exp$dose <- factor(exp$dose, levels=c(1,2,3), labels = c("Placebo", "Low_dose", "High_dose"))
is.factor(exp$dose)

## [1] TRUE

head(exp)

##   ID     dose effect
## 1  1  Placebo      3
## 2  2  Placebo      2
## 3  3  Placebo      1
## 4  4  Placebo      1
## 5  5  Placebo      4
## 6  6 Low_dose      5
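
To check that the labels were attached to the right codes, levels() and table() are handy. This is a minimal sketch that rebuilds the dose codes by hand (dose_codes and dose_f are illustrative names) rather than reading dose.csv.

```r
# Dose codes typed in by hand so the example is self-contained
dose_codes <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)

# Same conversion as above: codes 1, 2, 3 become labelled factor levels
dose_f <- factor(dose_codes,
                 levels = c(1, 2, 3),
                 labels = c("Placebo", "Low_dose", "High_dose"))

levels(dose_f)  # the labels, in level order
table(dose_f)   # how many observations fall in each level
```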

10.2 Subsetting Dataframes

R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations. The subset() function is the easiest way to select variables and observations.

# using subset function to extract ID and effect columns from exp dataframe for Placebo dose only
exp_placebo <- subset(exp,
                  dose == "Placebo",
                  select=c(ID, effect)
                  )

# using subset function to extract ID, dose and effect columns from exp dataframe for scores greater than or equal to 7, or less than 4.
exp_lowscore <- subset(exp, 
                    effect >= 7 | effect < 4, 
                    select=c(ID, dose, effect)
                    )
exp_lowscore

##    ID      dose effect
## 1   1   Placebo      3
## 2   2   Placebo      2
## 3   3   Placebo      1
## 4   4   Placebo      1
## 7   7  Low_dose      2
## 9   9  Low_dose      2
## 10 10  Low_dose      3
## 11 11 High_dose      7
## 14 14 High_dose      3
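
The same selection can also be written with square-bracket indexing. The sketch below rebuilds the dose data by hand under the illustrative name dose_df and compares the two approaches. One difference worth knowing: subset() drops rows where the condition is NA, whereas bracket indexing keeps them as rows of NAs.

```r
# Dose data typed in by hand (values taken from the printout above)
dose_df <- data.frame(
  ID     = 1:15,
  dose   = factor(rep(1:3, each = 5), levels = 1:3,
                  labels = c("Placebo", "Low_dose", "High_dose")),
  effect = c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
)

# Selection with subset()
via_subset <- subset(dose_df, effect >= 7 | effect < 4,
                     select = c(ID, dose, effect))

# Equivalent selection with [rows, columns] indexing
via_index <- dose_df[dose_df$effect >= 7 | dose_df$effect < 4,
                     c("ID", "dose", "effect")]

# Compare the two results
all.equal(via_subset, via_index)
```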

10.2.1 Replacing Values & NAs

Now that we’ve subset our data, it turns out that we don’t want any 7s in our exp_lowscore data frame. We need to replace any values of 7 with NA. There are lots of ways of replacing values and setting specific values to NA, but we’ll cover the basics below.

exp_lowscore[exp_lowscore == 7] <- NA
exp_lowscore

##    ID      dose effect
## 1   1   Placebo      3
## 2   2   Placebo      2
## 3   3   Placebo      1
## 4   4   Placebo      1
## 7  NA  Low_dose      2
## 9   9  Low_dose      2
## 10 10  Low_dose      3
## 11 11 High_dose     NA
## 14 14 High_dose      3

What about replacing NAs with a specific value?

exp_lowscore[is.na(exp_lowscore)] <- 0
exp_lowscore

##    ID      dose effect
## 1   1   Placebo      3
## 2   2   Placebo      2
## 3   3   Placebo      1
## 4   4   Placebo      1
## 7   0  Low_dose      2
## 9   9  Low_dose      2
## 10 10  Low_dose      3
## 11 11 High_dose      0
## 14 14 High_dose      3

Another way to do this is with the replace() function. Suppose we start again from the version of exp_lowscore that still contains the NAs:

##    ID      dose effect
## 1   1   Placebo      3
## 2   2   Placebo      2
## 3   3   Placebo      1
## 4   4   Placebo      1
## 7  NA  Low_dose      2
## 9   9  Low_dose      2
## 10 10  Low_dose      3
## 11 11 High_dose     NA
## 14 14 High_dose      3

exp_lowscore <- replace(exp_lowscore, is.na(exp_lowscore), 0)
exp_lowscore

##    ID      dose effect
## 1   1   Placebo      3
## 2   2   Placebo      2
## 3   3   Placebo      1
## 4   4   Placebo      1
## 7   0  Low_dose      2
## 9   9  Low_dose      2
## 10 10  Low_dose      3
## 11 11 High_dose      0
## 14 14 High_dose      3

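Can you spot the problem with these two approaches? Look at the ID column: because the replacement was applied to the whole data frame, the participant whose ID is 7 was changed as well. To avoid this, index the specific column you want to change: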

exp_lowscore$effect[exp_lowscore$effect == 7] <- NA
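
A small sketch of column-specific replacement, using a made-up three-row data frame (scores and scores_filled are illustrative names) in which an ID of 7 and an effect of 7 both occur:

```r
# Tiny illustrative data frame: note that both ID and effect contain a 7
scores <- data.frame(ID = c(7, 9, 11), effect = c(2, 2, 7))

# Replace 7 with NA in the effect column only; ID is left alone
scores$effect[scores$effect == 7] <- NA
scores

# replace() works on a single column in the same way
scores_filled <- scores
scores_filled$effect <- replace(scores_filled$effect,
                                is.na(scores_filled$effect), 0)
scores_filled
```

Because both steps index the effect column directly, the participant with ID 7 keeps their ID.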

10.3 Indexing Data Frames

To index a data frame, you can use similar syntax to what we have done previously. This time, you need to use data[row, column] syntax.

How would you retrieve the data from the 2nd row of the 3rd column of the dose data?

exp[2, 3]

## [1] 2

What about if we wanted to retrieve data from the entire row or column?

exp[2, ] #2nd row

##   ID    dose effect
## 2  2 Placebo      2

exp[, 2] #2nd column

##  [1] Placebo   Placebo   Placebo   Placebo   Placebo   Low_dose  Low_dose 
##  [8] Low_dose  Low_dose  Low_dose  High_dose High_dose High_dose High_dose
## [15] High_dose
## Levels: Placebo Low_dose High_dose

exp[, "dose"] #same as above

##  [1] Placebo   Placebo   Placebo   Placebo   Placebo   Low_dose  Low_dose 
##  [8] Low_dose  Low_dose  Low_dose  High_dose High_dose High_dose High_dose
## [15] High_dose
## Levels: Placebo Low_dose High_dose
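
The same [row, column] syntax accepts vectors of positions or names, and negative numbers to exclude rows. A minimal sketch on a hand-built six-row data frame (dose_small is an illustrative name):

```r
# First six rows of the dose data, typed in by hand
dose_small <- data.frame(
  ID     = 1:6,
  dose   = factor(c(1, 1, 1, 1, 1, 2), levels = 1:3,
                  labels = c("Placebo", "Low_dose", "High_dose")),
  effect = c(3, 2, 1, 1, 4, 5)
)

dose_small[1:3, ]                      # rows 1 to 3, all columns
dose_small[, c("ID", "effect")]        # all rows, two named columns
dose_small[-1, ]                       # everything except the first row
dose_small[, "effect"]                 # a single column comes back as a vector
dose_small[, "effect", drop = FALSE]   # drop = FALSE keeps it as a data frame
```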

We can use indexing to get other information about the data, such as the means, standard deviations, medians etc.

mean(exp$effect)

## [1] 3.466667

median(exp$effect)

## [1] 3

sd(exp[, 3])

## [1] 1.76743
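
Indexing can also be combined with a condition, for example to get the mean effect within each dose group. The sketch below rebuilds the dose data by hand under the illustrative name dose_df, using the values printed earlier.

```r
# Dose data typed in by hand (values taken from the printout earlier)
dose_df <- data.frame(
  ID     = 1:15,
  dose   = factor(rep(1:3, each = 5), levels = 1:3,
                  labels = c("Placebo", "Low_dose", "High_dose")),
  effect = c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
)

# Keep only the effect scores whose dose matches, then summarise them
mean_placebo <- mean(dose_df$effect[dose_df$dose == "Placebo"])
mean_low     <- mean(dose_df$effect[dose_df$dose == "Low_dose"])
mean_high    <- mean(dose_df$effect[dose_df$dose == "High_dose"])

c(Placebo = mean_placebo, Low_dose = mean_low, High_dose = mean_high)
```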

It’s a little tedious having to write all of these separate lines of code though, don’t you think? Surely there must be an easier way to get these statistics, maybe with the help of a function…?