3.1 A quick intro to data visualization

FIGURE 3.1: Watch the first minute of this video about getting started with ggplot2 (7 min and 17 sec), from STAT 545

Recall that as biostatisticians, we bring data to bear on critical biological questions and communicate our results to interested folks. A key component of this process is visualizing our data.

They say “a picture is worth a thousand words.” Similarly, a clear graph can communicate complex patterns in our data.

3.1.1 Exploratory and explanatory visualizations

We generally think of the goals of data visualization as falling between two extremes:

  • In exploratory visualizations we aim to identify any interesting patterns in the data. We also conduct quality control, looking for patterns that indicate mistakes or biases in our data, and think about appropriate transformations of the data. On the whole, our goal in exploratory data analysis is to understand the stories in the data.

  • In explanatory visualizations we aim to communicate our results to a broader audience. Here our goals are communication and persuasion. When developing explanatory plots we consider our audience (scientists? consumers? experts?) and how we are communicating (talk? website? paper?).

The ggplot2 package in R is well suited for both purposes. Today we focus on exploratory visualizations in ggplot2 because:

  1. They are the starting point of all statistical analyses.
  2. You can do them with less ggplot2 knowledge.
  3. They take less time to make than explanatory plots.

Later in the term we will show how to use ggplot2 to make high-quality explanatory plots.

3.1.2 Centering plots on biology

Whether developing an explanatory or exploratory plot, you should think hard about the biology you hope to convey before jumping into a plot. Ask yourself:

  • What do you hope to learn from this plot?
  • Which is the response variable (we usually place that on the y-axis)?
  • Are the data numeric or categorical?
    • If they are categorical, are they ordinal, and if so, what order should they be in?

The answers to these questions should guide our data visualization strategy, as this is a key step in our statistical analysis of a dataset. The best plots should evoke an immediate understanding of the (potentially complex) data. Put another way, a plot should highlight both the biological question and its answer.
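As a small illustration of how those answers can shape a plot, here is a minimal sketch in R. The dataset (`plants`, with columns `water` and `height_cm`) is made up for this example and is not from the course materials, and the ggplot2 calls themselves are explained later. The sketch only shows two choices: putting the response variable on the y-axis, and telling R the order of an ordinal variable.

```r
library(ggplot2)

# Hypothetical data (an assumption for illustration only):
# plant height is the numeric response, watering level is an ordinal category.
plants <- data.frame(
  water     = c("low", "medium", "high", "low", "medium", "high"),
  height_cm = c(10.2, 14.8, 18.1, 9.7, 15.3, 17.4)
)

# Without this step, R would order the categories alphabetically
# (high, low, medium). Setting the levels gives the biological order.
plants$water <- factor(plants$water, levels = c("low", "medium", "high"))

# Explanatory variable on the x-axis, response variable on the y-axis.
p <- ggplot(plants, aes(x = water, y = height_cm)) +
  geom_point()
p
```

Setting the factor levels before plotting keeps the ordering decision in the data, so any later plot of `water` would use the same order.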

Before jumping into making a plot in R, it is often useful to take a step back, think about your main biological question, and take a pencil and paper to sketch some ideas and potential outcomes. I do this to prepare my mind to interpret different results, and to ensure that I’m using R to answer my questions, rather than getting sucked into so much R-ing that I forget why I even started. With this in mind, we’re ready to be introduced to ggplotting!