Chapter 5 Data Visualization

Visualizing information can give us a very quick solution to problems. We can get clarity or the answer to a simple problem very quickly. - David McCandless

This chapter uses the iris dataset. Here is a summary of the data:

library(tidyverse)
data("iris")
iris %>% summary()

The iris dataset consists of 5 variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first four variables are numeric variables, while the Species variable is a categorical variable.
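The variable types described above can be checked directly in R. A minimal sketch, using only base R functions on the built-in iris data:

```r
# Load the built-in iris data
data("iris")

# Class of each column: four numeric columns and one factor
sapply(iris, class)

# The categories (levels) of the Species factor
levels(iris$Species)

# Number of observations (rows) and variables (columns)
dim(iris)
```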

The data can be represented in either wide or long table format. The wide table format, with one column per measurement, is shown below:

iris %>% head(5)

Transformation from wide to long table format:

iris_long <- iris %>% gather(fitur, value, Sepal.Length:Petal.Width)
iris_long %>% head(5)
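gather() still works, but the tidyr documentation marks it as superseded in favor of pivot_longer(). As a sketch, the same reshaping could be written as follows. The name iris_long2 is only illustrative, and the row order differs from the gather() result because pivot_longer() works through the data row by row:

```r
library(tidyr)

data("iris")

# Reshape the four measurement columns into name/value pairs,
# keeping Species as an identifier column
iris_long2 <- pivot_longer(
  iris,
  cols = Sepal.Length:Petal.Width,
  names_to = "fitur",    # column holding the original variable names
  values_to = "value"    # column holding the measurements
)

head(iris_long2, 5)

# 150 rows x 4 measurement columns = 600 rows in long format
nrow(iris_long2)
```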

5.1 Base R

5.1.1 plot()

The plot() function is the simplest way to create a plot. Commonly used parameters:

  • type: type of plot.
  • xlab: label for the X-axis.
  • ylab: label for the Y-axis.
  • main: title of the plot.
  • col: color used in the plot, given as a color name or as a variable (for example a factor) that assigns a color to each point.

5.1.1.1 plot(x)

The plot() function can be used with a single parameter x, where x is a numeric variable. The resulting visualization is a scatter plot where the x-axis represents the index of each observation and the y-axis represents its value. The type parameter sets the plot style, for example "p" for points, "l" for lines, or "b" for both; see ?plot.default for the full list.

plot(iris$Sepal.Length, type = 'b')
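To compare plot styles side by side, the sketch below draws the same values once per type. It uses only the first 30 observations so that the differences stay visible; that subset and the panel titles are illustrative choices:

```r
data("iris")

# A few values accepted by the type parameter of plot()
types <- c(p = "points", l = "lines", b = "both",
           h = "vertical lines", s = "steps")

# Draw the first 30 values once per type in a 2 x 3 grid
old_par <- par(mfrow = c(2, 3))
for (code in names(types)) {
  plot(iris$Sepal.Length[1:30], type = code,
       main = paste0("type = '", code, "' (", types[[code]], ")"),
       xlab = "Index", ylab = "Sepal Length")
}
par(old_par)  # restore the previous layout
```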

5.1.1.2 plot(x,y)

The plot() function can be used with two parameters, where both x and y are numeric variables. The resulting visualization is a scatter plot where the x-axis represents the first variable and the y-axis represents the second.

plot(iris$Sepal.Length, iris$Petal.Length, main='Correlation between Petal Length and Sepal Length', xlab='Sepal Length', ylab='Petal Length')

# Same plot, with points colored by species
plot(iris$Sepal.Length, iris$Petal.Length, main='Correlation between Petal Length and Sepal Length', xlab='Sepal Length', ylab='Petal Length', col=iris$Species)
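When col is given a factor such as iris$Species, base R uses the factor's integer codes as indices into the current palette(), so the first level gets the first palette color, and so on. The following sketch makes that mapping explicit and adds a legend. The legend position and the pch value are illustrative choices:

```r
data("iris")

# Integer codes behind the factor: setosa = 1, versicolor = 2, virginica = 3
species_codes <- as.integer(iris$Species)

# Colors that plot() picks from the current palette for each level
species_cols <- palette()[seq_len(nlevels(iris$Species))]

plot(iris$Sepal.Length, iris$Petal.Length,
     col = species_cols[species_codes], pch = 19,
     xlab = "Sepal Length", ylab = "Petal Length")

# Legend linking each color to its species
legend("topleft", legend = levels(iris$Species),
       col = species_cols, pch = 19)
```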

5.1.1.3 plot(data)

The plot() function can also be used with a single parameter that is a data frame. The resulting visualization is a scatter plot matrix that shows one scatter plot for every pair of variables in the dataset.

plot(iris)
plot(iris[,1:4], col=iris$Species)
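For a data frame with more than two columns, plot() produces the same kind of figure as pairs(). Calling pairs() directly, as sketched here, makes further options available; the pch value and the empty lower panel are illustrative choices:

```r
data("iris")

# Scatter plot matrix of the four numeric variables, colored by species;
# lower.panel = NULL leaves the redundant lower triangle empty
pairs(iris[, 1:4], col = iris$Species, pch = 19, lower.panel = NULL)
```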

5.1.2 barplot()

The barplot() function draws a bar chart, typically from a table of counts of a categorical variable. Commonly used parameters:

  • xlab: label for the X-axis
  • ylab: label for the Y-axis
  • main: plot title
  • col: color used in the plot, which can be a color name or variable
  • horiz: a boolean parameter; TRUE draws horizontal bars, FALSE (the default) draws vertical bars

t <- table(iris$Species)
barplot(t)
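The parameters listed above can be combined in one call. The sketch below also uses the value returned by barplot(), the bar midpoints, to write each count on its bar. The variable names, colors, and labels are illustrative:

```r
data("iris")

# Count the observations in each species
species_counts <- table(iris$Species)
counts <- as.vector(species_counts)

# Horizontal bar chart with a title, an axis label, and one color per bar
bar_mids <- barplot(species_counts, horiz = TRUE,
                    main = "Observations per species",
                    xlab = "Count",
                    col = c("tomato", "steelblue", "seagreen"))

# barplot() returns the bar midpoints, useful for adding text labels
text(x = counts / 2, y = bar_mids, labels = counts)
```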

5.1.3 hist()

The hist() function draws a histogram: the range of a continuous variable is split into bins, and the height of each bar shows how many observations fall into that bin. Commonly used parameters:

  • xlab: label for the X-axis
  • ylab: label for the Y-axis
  • main: plot title
  • col: color used in the plot, which can be a color name or variable
  • breaks: a suggested number of bins (R may adjust it to obtain round break points), or a vector of explicit break points

hist(iris$Sepal.Length, breaks = 20)
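Because a single number passed to breaks is only a suggestion, it can help to inspect the bins that R actually chose. A minimal sketch, where plot = FALSE returns the histogram object without drawing it, and the explicit break points in the second call are an illustrative choice:

```r
data("iris")

# Compute the histogram without drawing it
h <- hist(iris$Sepal.Length, breaks = 20, plot = FALSE)

h$breaks   # bin boundaries actually chosen by R
h$counts   # number of observations per bin

# Explicit break points give full control over the bins
hist(iris$Sepal.Length, breaks = seq(4, 8, by = 0.5),
     main = "Sepal Length", xlab = "Sepal Length (cm)", col = "lightblue")
```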

5.1.4 boxplot()

A box plot (or box-and-whisker plot) summarizes the distribution of numerical data through its median and quartiles, and can also be used to identify outliers. Box plots can be used for one or more continuous variables.

boxplot(iris$Sepal.Width)
boxplot(iris[,1:4])
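A box plot can also compare one variable across groups, and the numbers behind a box can be retrieved with boxplot.stats(). The following is a sketch; the title and axis label are illustrative:

```r
data("iris")

# One box per species, using the formula interface
boxplot(Sepal.Width ~ Species, data = iris,
        main = "Sepal Width by species", ylab = "Sepal Width")

# Numbers behind a single box: lower whisker, lower hinge, median,
# upper hinge, upper whisker
bs <- boxplot.stats(iris$Sepal.Width)
bs$stats

# Values drawn as individual points beyond the whiskers
bs$out
```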

5.1.5 heatmap()

The heatmap() function displays a numeric matrix as a grid of colored cells, which makes it possible to compare the values of multiple variables across observations. The visualization also includes dendrograms (tree diagrams) that show clusters, where similar observations are placed close together in the same branch.

heatmap(as.matrix(iris[1:30,1:3]))
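The dendrogram comes from hierarchical clustering. By default heatmap() uses dist() and hclust() for this and then reorders the branches, so the sketch below shows the idea for the rows rather than reproducing the exact ordering of the heatmap. The number of groups passed to cutree() is an arbitrary choice for illustration:

```r
data("iris")

# Same 30 x 3 matrix as in the heatmap example
m <- as.matrix(iris[1:30, 1:3])

# Euclidean distances between observations, then hierarchical clustering
row_clust <- hclust(dist(m))

# Tree diagram of the 30 observations
plot(row_clust, main = "Row clustering", xlab = "Observation", sub = "")

# Split the tree into 3 groups
groups <- cutree(row_clust, k = 3)
table(groups)
```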

5.1.6 par()

The par() function sets graphical parameters and makes it possible to show multiple plots in one figure. Commonly used parameter:

  • mfrow: the number of rows and columns of the plot grid. mfrow=c(2,3) arranges 6 plots in 2 rows and 3 columns, filled row by row.

par(mfrow=c(2,3))

plot(iris$Sepal.Length)
plot(iris$Petal.Width, col=iris$Species)
hist(iris$Sepal.Length, breaks = 50, main='')
boxplot(iris$Sepal.Width)
barplot(t)
boxplot(iris$Petal.Length)
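Settings made with par() stay active for later plots on the same device. A common pattern, sketched below, is to store the previous settings returned by par() and restore them afterwards:

```r
data("iris")

# par() returns the previous values of the parameters it changes
old_par <- par(mfrow = c(1, 2))

hist(iris$Petal.Length, main = "", xlab = "Petal Length")
boxplot(iris$Petal.Length, ylab = "Petal Length")

# Restore the earlier layout so that later plots fill the whole device
par(old_par)
```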

5.2 Correlation Plot

A correlation plot visualizes the correlation between numerical variables. The corrplot() function is provided by the corrplot package. The method parameter specifies the type of visualization, which can be “circle” (default), “square”, “ellipse”, “number”, “pie”, “shade”, or “color”.

library(corrplot)
corrplot(cor(iris[,1:4]))
corrplot(cor(iris[,1:4]), method = "pie")
corrplot.mixed(cor(iris[,1:4]), lower = "number", upper = "shade")
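The input to corrplot() is an ordinary correlation matrix, so it can be useful to look at the numbers first. The sketch below prints the rounded matrix and then draws only the upper triangle; the variable name M and the chosen options are illustrative, and the plot is skipped if the corrplot package is not installed:

```r
data("iris")

# Pearson correlation matrix of the four numeric variables
M <- cor(iris[, 1:4])
round(M, 2)

# Upper triangle only, drawn as ellipses, variables ordered by clustering
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(M, method = "ellipse", type = "upper", order = "hclust")
}
```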

5.3 ggplot2

This section is in progress.

5.3.1 Introduction to ggplot2

5.3.2 Basic Plot using ggplot2

5.3.3 Customizing Aesthetics

5.3.4 Adding Layers and Geometries

5.3.5 Faceting