Chapter 7 Reading and data inspection

7.1 Reading data into R

7.1.1 The basics

Base R has several functions for reading data in, and the one you use depends on the format of the file being read. The table below lists the base R functions for importing some common plain-text data types.

Data type	Extension	Function
Comma separated values	csv	`read.csv()`
Tab separated values	tsv	`read.delim()`
Other delimited formats	txt	`read.table()`

For example, if you have a text file where the columns are separated by commas (comma-delimited), you could use the function read.csv(). However, if the data are separated by a different delimiter (e.g. ":", ";", " "), you could use the generic read.table() function and specify the delimiter as an argument (e.g. sep = ";").
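As a minimal sketch of the generic case, the snippet below writes a tiny semicolon-delimited file to a temporary location and reads it back with read.table(). The file contents and the object name genes are made up for illustration and are not part of the course data.

```r
# Write a small semicolon-delimited file to a temporary location
# (hypothetical contents, used only for this illustration).
tmp <- tempfile(fileext = ".txt")
writeLines(c("gene;length", "geneA;1200", "geneB;8812"), tmp)

# read.table() must be told the delimiter, and that the first line
# holds the column names.
genes <- read.table(tmp, sep = ";", header = TRUE)
genes
```

Leaving out sep = ";" here would make read.table() split on whitespace instead, so each line would be read as a single field.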

7.1.2 Metadata

When working with large datasets, you will very likely be working with a "metadata" file which contains the information about each sample in your dataset.

7.1.3 The `read.csv()` function

Let's bring in the metadata file in our data folder (mouse_exp_design.csv) using the read.csv function.

First, check the arguments for the function using the ? to ensure that you are entering all the information appropriately:

?read.csv

The first item on the documentation page is the function Description, which specifies that the output of this set of functions is going to be a data frame.

In the Usage section, the arguments listed for read.table() are the default values for all of the family members unless otherwise specified for a given function. Let's take a look at three examples:

The separator

for read.table() sep = "" (any whitespace: one or more spaces or tabs)
for read.csv() sep = "," (a comma)

The header - This argument refers to the column headers that may (TRUE) or may not (FALSE) exist in the plain text file you are reading in.

for read.table() header = FALSE (by default, it assumes you do not have column names)
for read.csv() header = TRUE (by default, it assumes that the first line holds the column names)
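The difference in the header default can be seen on a small example. This is a sketch using a made-up two-column file written to a temporary location; the object names with_header and no_header are only illustrative.

```r
# A tiny comma-separated file with a header line (hypothetical contents).
tmp <- tempfile(fileext = ".csv")
writeLines(c("sample,replicate", "s1,1", "s2,2"), tmp)

# read.csv() uses the first line as column names (header = TRUE by default).
with_header <- read.csv(tmp)

# read.table() with only the separator set keeps its default header = FALSE,
# so the first line is read as data and columns are named V1, V2.
no_header <- read.table(tmp, sep = ",")

colnames(with_header)
colnames(no_header)
nrow(with_header)
nrow(no_header)
```

The second data frame has one extra row because the header line was read in as data.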

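The row.names - This argument refers to the row names.

for read.table() and read.csv() alike, row.names has no default value listed in the usage. If the header line has one fewer field than the data rows, the first column is used as the row names; otherwise the rows are simply numbered.
to use a particular column as row names regardless of the header, specify it explicitly (e.g. row.names = 1).

Note: this one is tricky because the default isn't listed as such in the Usage section of the help file.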
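The following sketch shows how row names are picked up in practice. Both files are made up and written to temporary locations; their contents only loosely imitate a metadata file and are not the course data.

```r
# Case 1: the header line has one fewer field than the data rows,
# so the first column is used as row names automatically.
tmp1 <- tempfile(fileext = ".csv")
writeLines(c("genotype,celltype",
             "sample1,Wt,typeA",
             "sample2,KO,typeB"), tmp1)
auto_rn <- read.csv(tmp1)
rownames(auto_rn)

# Case 2: the header names every column, so rows are simply numbered...
tmp2 <- tempfile(fileext = ".csv")
writeLines(c("sample,genotype,celltype",
             "sample1,Wt,typeA",
             "sample2,KO,typeB"), tmp2)
numbered <- read.csv(tmp2)
rownames(numbered)

# ...unless the row-name column is requested explicitly.
explicit_rn <- read.csv(tmp2, row.names = 1)
rownames(explicit_rn)
```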

The take-home from the "Usage" section for read.csv() is that it has one mandatory argument: the path to the file, including the filename, in quotations. In our case that is "data/mouse_exp_design.csv".

7.1.4 Create a data frame by reading in the file

Let's read in the mouse_exp_design.csv file and create a new data frame called metadata.

metadata <- read.csv(file = "data/mouse_exp_design.csv")

We can see if it has successfully been read in by running:

metadata

Exercise 1

Read "project-summary.txt" into R using read.table() with the appropriate arguments and store it as the variable proj_summary. To figure out the appropriate arguments to use with read.table(), keep the following in mind:
- all the columns in the input text file have column names/headers
- you want the first column of the text file to be used as row names (hint: look up the input for the row.names = argument in read.table())
Display the contents of proj_summary in your console.

7.2 Inspecting data structures

There is a wide selection of base functions in R that are useful for inspecting and summarizing your data. Below is a non-exhaustive list of these functions.

The list has been divided into functions that work on all types of objects, some that work only on vectors/factors (1 dimensional objects), and others that work on data frames and matrices (2 dimensional objects).

All data structures - content display:

str(): compact display of data contents (similar to what you see in the Global environment)
class(): displays the data type for vectors (e.g. character, numeric, etc.) and data structure for dataframes, matrices
summary(): detailed display of the contents of a given object, including descriptive statistics, frequencies
head(): prints the first 6 entries by default (elements for 1-D objects, rows for 2-D objects)
tail(): prints the last 6 entries by default (elements for 1-D objects, rows for 2-D objects)

Vector and factor variables:

length(): returns the number of elements in a vector or factor

Dataframe and matrix variables:

dim(): returns dimensions of the dataset (number_of_rows, number_of_columns) [Note, row numbers will always be displayed before column numbers in R]
nrow(): returns the number of rows in the dataset
ncol(): returns the number of columns in the dataset
rownames(): returns the row names in the dataset
colnames(): returns the column names in the dataset
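To see what these functions return without needing any files, here is a small sketch on a toy data frame. The object toy and its values are invented for illustration.

```r
# A toy data frame with hypothetical values (not the course data).
toy <- data.frame(
  genotype  = c("Wt", "Wt", "KO", "KO"),
  replicate = c(1, 2, 1, 2),
  row.names = c("s1", "s2", "s3", "s4")
)

# Functions that work on any object
str(toy)          # compact overview of columns and their types
class(toy)        # the data structure
summary(toy)      # per-column summaries
head(toy, n = 2)  # first 2 rows instead of the default 6

# Functions for 2-dimensional objects
dim(toy)          # number of rows, then number of columns
nrow(toy)
ncol(toy)
rownames(toy)
colnames(toy)

# A single column is a vector, so length() applies
length(toy$replicate)
```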

Let's use the metadata data frame that we created to test out the data inspection functions.

head(metadata)
str(metadata)
dim(metadata)
nrow(metadata)
ncol(metadata)
class(metadata)
colnames(metadata)

Exercise 2

What is the class of each column in metadata (use one command)?
What is the median of the replicates in metadata (use one command)?

Exercise 3

Use the class() function on glengths and metadata. How does the output differ between the two?
Use the summary() function on the proj_summary dataframe. What is the median "rRNA_rate"?
How long is the samplegroup factor?
What are the dimensions of the proj_summary dataframe?
When you use the rownames() function on metadata, what is the data structure of the output?
[Optional] How many elements are in (how long is) the output of colnames(proj_summary)? Don't count, but use another function to determine this.