Before you start, make sure that you have tidyverse
loaded.
#Load tidyverse
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
For your first practice, we want you to build a plot based on the example you have seen in the tutorial. We will work with the diamonds dataset again.
#Get diamonds dataset
data <- diamonds
Can you check which variables we have?
#Glimpse at your data
glimpse(data)
## Observations: 53,940
## Variables: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Goo…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1,…
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, …
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 3…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.…
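If you only need the variable names, the number of rows, or the categories of a factor, a few base R helpers complement glimpse(). A small sketch; the object names var_names, n_rows and cut_levels are only illustrative and are not used elsewhere in this practice:

```r
library(ggplot2)  # provides the diamonds dataset

data <- diamonds

# Column names only
var_names <- names(data)
print(var_names)

# Number of observations
n_rows <- nrow(data)
print(n_rows)

# Ordered categories of the cut variable
cut_levels <- levels(data$cut)
print(cut_levels)
```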
Select the variable carat.
#Check carat
select(data, carat)
## # A tibble: 53,940 x 1
## carat
## <dbl>
## 1 0.23
## 2 0.21
## 3 0.23
## 4 0.290
## 5 0.31
## 6 0.24
## 7 0.24
## 8 0.26
## 9 0.22
## 10 0.23
## # … with 53,930 more rows
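select() returns carat as a one-column tibble. If you want the values themselves, for example to summarise them before plotting, pull() from dplyr returns a plain numeric vector. A minimal sketch; the names carat_values and carat_summary are made up for this example:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

data <- diamonds

# Extract carat as a numeric vector rather than a tibble
carat_values <- pull(data, carat)

# Quick numeric summary of the variable we are about to plot
carat_summary <- summarise(
  data,
  min_carat = min(carat),
  median_carat = median(carat),
  max_carat = max(carat)
)
print(carat_summary)
```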
Produce a simple histogram (finish the expression below):
#ggplot of carat
ggplot(data = data, aes(x = carat)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
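The message above is ggplot2 suggesting that you choose the bin width yourself instead of relying on the default 30 bins. A possible way to do that is sketched below; the value 0.1 carat is only an assumption picked for illustration, so try a few values and see which one shows the shape of the data best:

```r
library(ggplot2)

data <- diamonds

# Histogram with an explicit bin width of 0.1 carat (an arbitrary choice)
hist_binwidth <- ggplot(data = data, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

print(hist_binwidth)
```

With binwidth set, the `stat_bin()` message is no longer printed.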
Once you are happy with what you see above, try to add labels and change colours.
#Complete ggplot for carat
ggplot(data = data, aes(x = carat)) +
  # Plot density instead of counts so the density curve shares the y scale
  geom_histogram(aes(y = ..density..), color = "cadetblue", fill = "bisque") +
  geom_density() +
  labs(x = "Weight of the diamond (carat)",
       title = "Histogram of diamond weight (carat)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If you wanted to put both plots side by side:
#Assign your plots to specific objects
#plot1
plot1 <- ggplot(data = data, aes(x = carat)) + geom_histogram()
#plot2
plot2 <- ggplot(data = data, aes(x = carat)) +
  geom_histogram(aes(y = ..density..), color = "cadetblue", fill = "bisque") +
  geom_density() +
  labs(x = "Weight of the diamond (carat)",
       title = "Histogram of diamond weight (carat)")
You can then run install.packages("cowplot") once to install the package
and use the following code:
#Load package
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
#Set theme
theme_set(theme_grey())
#Put plots side by side
plot_grid(plot1,plot2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
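plot_grid() also takes arguments that control the layout and add panel labels. The sketch below assumes cowplot is installed and uses two simplified histograms so that it runs on its own; the object names and the two binwidth values are illustrative choices, not part of the exercise:

```r
library(ggplot2)
library(cowplot)

data <- diamonds

# Two versions of the same histogram with different (arbitrary) bin widths
p_coarse <- ggplot(data = data, aes(x = carat)) +
  geom_histogram(binwidth = 0.5) +
  labs(title = "binwidth = 0.5")

p_fine <- ggplot(data = data, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  labs(title = "binwidth = 0.1")

# ncol sets the number of columns; labels adds a tag to each panel
combined <- plot_grid(p_coarse, p_fine, ncol = 2, labels = c("A", "B"))
print(combined)
```

Setting ncol = 1 instead would stack the two plots on top of each other.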
Now I am keen to look at the categorical variable cut again and, ideally, to plot the differences in diamond weight by cut.
Let's first produce a plot for cut on its own; you need to use geom_bar
here:
#ggplot for cut
ggplot(data = data, aes(x = cut)) + geom_bar()
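geom_bar() counts the rows in each category for you. To see the numbers behind the bars, you could tabulate the same variable with count() from dplyr. A short sketch; cut_counts is a name made up for this example:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

data <- diamonds

# One row per cut category; the column n holds the height of each bar
cut_counts <- count(data, cut)
print(cut_counts)
```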
Try to make it a little nicer.
#ggplot for cut (with labels, title, colours)
# Simple bar plot for variable 'cut'
ggplot(data = data, aes(x = cut)) +
  geom_bar(color = "grey", fill = "blue") +
  labs(x = "Quality of the diamond cut",
       title = "Bar plot of diamond cut quality") +
  theme_minimal()
Finally, get the visualisation of carat by cut:
#ggplot for carat by cut (add labels, title, colours) - I started this one for you
ggplot(data = data, aes(x = cut, y = carat, fill = cut)) +
  geom_boxplot() +
  theme_minimal() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(x = "Quality of the diamond cut", y = "Weight of the diamond (carat)",
       title = "Box plot of diamond weight (carat) by cut")
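The thick line inside each box is the median carat for that cut. If you would like to check the boxes against numbers, a grouped summary along these lines may help; carat_by_cut and its column names are only illustrative:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

data <- diamonds

# Median and quartiles of carat within each cut category
carat_by_cut <- data %>%
  group_by(cut) %>%
  summarise(
    q1_carat = quantile(carat, 0.25),
    median_carat = median(carat),
    q3_carat = quantile(carat, 0.75)
  )
print(carat_by_cut)
```

The q1 and q3 columns correspond to the bottom and top edges of each box.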
Here is a plot based on slightly different data. For this task you will need to get the data and explore it yourself. You will then want to work with the key variables that are visible on the plot below. Your task is to recreate the plots.
#Load data
data2<-iris
#Glimpse
glimpse(data2)
## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, set…
Check what each variable means:
?iris
#Simple plot of sepal length
ggplot(data = data2, aes(y = Sepal.Length)) + geom_boxplot()
#Sepal length by species
ggplot(data = data2, aes(x = Species, y = Sepal.Length)) + geom_boxplot()
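As with the diamonds example, it can help to look at the numbers behind the boxes. The sketch below summarises sepal length by species; sepal_by_species and its column names are made up for this example:

```r
library(dplyr)

data2 <- iris

# Number of flowers and typical sepal length for each species
sepal_by_species <- data2 %>%
  group_by(Species) %>%
  summarise(
    n = n(),
    mean_sepal_length = mean(Sepal.Length),
    median_sepal_length = median(Sepal.Length)
  )
print(sepal_by_species)
```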
Note that here we also adjust the direction of the legend and the theme.
#Advanced plot
ggplot(data = data2, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(fill = Species)) +
  ylab("Sepal Length") +
  ggtitle("Iris Data Boxplot: Sepal Length by Species") +
  scale_fill_brewer(palette = "YlGn") +
  # theme_dark() is a complete theme, so it must come before theme() tweaks
  theme_dark() +
  theme(legend.direction = "horizontal")
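One detail worth checking in a chain like this is the order of the theme calls. A complete theme such as theme_dark() replaces any theme() settings that were added before it, so individual tweaks normally go last. The sketch below compares both orders by looking at the theme stored in the plot object; the object names are illustrative, and reading the stored theme via `$theme` is assumed to work in your ggplot2 version:

```r
library(ggplot2)

base_plot <- ggplot(data = iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(fill = Species))

# theme() first, complete theme second: the legend setting is replaced
tweak_lost <- base_plot +
  theme(legend.direction = "horizontal") +
  theme_dark()

# Complete theme first, theme() second: the legend setting is kept
tweak_kept <- base_plot +
  theme_dark() +
  theme(legend.direction = "horizontal")

# Inspect the legend direction stored in each plot's theme
print(tweak_lost$theme$legend.direction)
print(tweak_kept$theme$legend.direction)
```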