Chapter 7 Reading and data inspection

7.1 Reading data into R

7.1.1 The basics

There are several function in base R that exist to read data in, and the function in R you use will depend on the file format being read in. Below we have a table with the base R functions that can be used for importing some common text data types (plain text).

Data type	Extension	Function
Comma separated values	csv	`read.csv()`
Tab separated values	tsv	`read.delim`
Other delimited formats	txt	`read.table()`

For example, if we have text file where the columns are separated by commas (comma-delimited), you could use the function read.csv. However, if the data are separated by a different delimiter in a text file (e.g. “:”, “;”, ” “), you could use the generic read.table function and specify the delimiter (sep = " ") as an argument in the function.

7.1.2 Metadata

When working with large datasets, you will very likely be working with a “metadata” file which contains the information about each sample in your dataset.

Alt text

7.1.3 The `read.csv()` function

Let’s bring in the metadata file in our data folder (mouse_exp_design.csv) using the read.csv function.

First, check the arguments for the function using the ? to ensure that you are entering all the information appropriately:

?read.csv

Alt text

The first item on the documentation page is the function Description, which specifies that the output of this set of functions is going to be a data frame.

In usage, all of the arguments listed for read.table() are the default values for all of the family members unless otherwise specified for a given function. Let’s take a look at 3 examples:

The separator

for read.table() sep = "" (space or tab)
for read.csv() sep = "," (a comma).

The header - This argument refers to the column headers that may (TRUE) or may not (FALSE) exist in the plain text file you are reading in.

for read.table() header = FALSE (by default, it assumes you do not have column names)
for read.csv() header = TRUE (by default, it assumes that all your columns have names listed).

The row.names - This argument refers to the rownames.

for read.table() row.names by default assumes that your rownames are not in the first column.
for read.csv() header = TRUE (by default, it assumes that your rownames are in the first column.

Note: this one is tricky because the default isn’t listed as such in the help file.

The take-home from the “Usage” section for read.csv() is that it has one mandatory argument, the path to the file and filename in quotations; in our case that is data/mouse_exp_design.csv

7.1.4 Create a data frame by reading in the file

Let’s read in the mouse_exp_design.csv file and create a new data frame called metadata.

metadata <- read.csv(file="data/mouse_exp_design.csv")

We can see if it has successfully been read in by running:

metadata

##          genotype celltype replicate
## sample1        Wt    typeA         1
## sample2        Wt    typeA         2
## sample3        Wt    typeA         3
## sample4        KO    typeA         1
## sample5        KO    typeA         2
## sample6        KO    typeA         3
## sample7        Wt    typeB         1
## sample8        Wt    typeB         2
## sample9        Wt    typeB         3
## sample10       KO    typeB         1
## sample11       KO    typeB         2
## sample12       KO    typeB         3

Exercise 1

Read “project-summary.txt” in to R using read.table() with the appropriate arguments and store it as the variable proj_summary. To figure out the appropriate arguments to use with read.table(), keep the following in mind:

all the columns in the input text file have column name/headers
you want the first column of the text file to be used as row names
hint: look up the input for the row.names = argument in read.table()

Display the contents of proj_summary in your console

7.2 Inspecting data structures

There are a wide selection of base functions in R that are useful for inspecting your data and summarizing it. Below is a non-exhaustive list of these functions:

The list has been divided into functions that work on all types of objects, some that work only on vectors/factors (1 dimensional objects), and others that work on data frames and matrices (2 dimensional objects).

All data structures - content display:

str(): compact display of data contents (similar to what you see in the Global environment)
class(): displays the data type for vectors (e.g. character, numeric, etc.) and data structure for dataframes, matrices
summary(): detailed display of the contents of a given object, including descriptive statistics, frequencies
head(): prints the first 6 entries (elements for 1-D objects, rows for 2-D objects)
tail(): prints the last 6 entries (elements for 1-D objects, rows for 2-D objects)

Vector and factor variables:

length(): returns the number of elements in a vector or factor

Dataframe and matrix variables:

dim(): returns dimensions of the dataset (number_of_rows, number_of_columns) [Note, row numbers will always be displayed before column numbers in R]
nrow(): returns the number of rows in the dataset
ncol(): returns the number of columns in the dataset
rownames(): returns the row names in the dataset
colnames(): returns the column names in the dataset

Let’s use the metadata file that we created to test out data inspection functions.

head(metadata)

##         genotype celltype replicate
## sample1       Wt    typeA         1
## sample2       Wt    typeA         2
## sample3       Wt    typeA         3
## sample4       KO    typeA         1
## sample5       KO    typeA         2
## sample6       KO    typeA         3

str(metadata)

## 'data.frame':    12 obs. of  3 variables:
##  $ genotype : chr  "Wt" "Wt" "Wt" "KO" ...
##  $ celltype : chr  "typeA" "typeA" "typeA" "typeA" ...
##  $ replicate: int  1 2 3 1 2 3 1 2 3 1 ...

dim(metadata)

## [1] 12  3

nrow(metadata)

## [1] 12

ncol(metadata)

## [1] 3

class(metadata)

## [1] "data.frame"

colnames(metadata)

## [1] "genotype"  "celltype"  "replicate"

Exercise 2

What is the class of each column in metadata (use one command)?
What is the median of the replicates in metadata (use one command)?

Exercise 3

Use the class() function on glengths and metadata, how does the output differ between the two?
Use the summary() function on the proj_summary dataframe, what is the median “rRNA_rate”?
How long is the samplegroup factor?
What are the dimensions of the proj_summary dataframe?
When you use the rownames() function on metadata, what is the data structure of the output?
[Optional] How many elements in (how long is) the output of colnames(proj_summary)? Don’t count, but use another function to determine this.