Chapter 5 Exploratory Data Analysis, part 2
5.1 Lesson 1: Basic Statistics in R
This is a lesson that will recap the main ways to do descriptive statistics in R. This is not a lesson in how to do statistics, as this has already been covered in earlier modules. However, if you’d like a deep dive into statistics, I’d recommend: https://www.coursera.org/instructor/~20199472.
With a lot of things you can do in the R programming language, there are a number of ways to do the same thing. Hence, here, we will look at a number of different coding styles, too, drawing on the ‘tidy’ way, among base R across a number of packages.
For this, we will use the iris dataset. PLEASE BE CAREFUL OF WHICH PACKAGE YOU ARE CALLING DUE TO COMMANDS WITH THE SAME NAME. This is a common issue that will break your code. For example, if you want to use describe() – do you want the psych package or Hmisc? Ensure you know your packages and how they overlap in command names and also whether one package masks another (e.g., sometimes the order of package loading is important…)
5.1.1 Descriptive Statistics
One of the simplest ways to capture descriptives from a dataframe uses the summary() function in base R. This provides: mean, median, 25th and 75th quartiles, min, and max values:
#loading typical packages just in case you've not already for todays lesson
library('dplyr')
library('magrittr')
library('ggplot2')
library('tidyverse')
head(iris) #ensure you've loaded packages and iris is accessible
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
#lets get the means for variables in iris
# we will exclude missing values
::summary(iris) base
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
## Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
This of course provides a variety of statistics that may be of use. Similarly, the Hmisc package provides some useful and more detailed statistics that includes: n, number of missing values, number of unique values (disctinct), mean, 5, 10, 25, 50, 75, 90, 95th percentiles, 5 lowest and 5 highest scores:
library('Hmisc')
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
::describe(iris) #make sure this command is coming from the Hmisc package Hmisc
## iris
##
## 5 Variables 150 Observations
## ------------------------------------------------------------------------------------------------------------
## Sepal.Length
## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90
## 150 0 35 0.998 5.843 0.9462 4.600 4.800 5.100 5.800 6.400 6.900
## .95
## 7.255
##
## lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9
## ------------------------------------------------------------------------------------------------------------
## Sepal.Width
## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90
## 150 0 23 0.992 3.057 0.4872 2.345 2.500 2.800 3.000 3.300 3.610
## .95
## 3.800
##
## lowest : 2.0 2.2 2.3 2.4 2.5, highest: 3.9 4.0 4.1 4.2 4.4
## ------------------------------------------------------------------------------------------------------------
## Petal.Length
## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90
## 150 0 43 0.998 3.758 1.979 1.30 1.40 1.60 4.35 5.10 5.80
## .95
## 6.10
##
## lowest : 1.0 1.1 1.2 1.3 1.4, highest: 6.3 6.4 6.6 6.7 6.9
## ------------------------------------------------------------------------------------------------------------
## Petal.Width
## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90
## 150 0 22 0.99 1.199 0.8676 0.2 0.2 0.3 1.3 1.8 2.2
## .95
## 2.3
##
## lowest : 0.1 0.2 0.3 0.4 0.5, highest: 2.1 2.2 2.3 2.4 2.5
## ------------------------------------------------------------------------------------------------------------
## Species
## n missing distinct
## 150 0 3
##
## Value setosa versicolor virginica
## Frequency 50 50 50
## Proportion 0.333 0.333 0.333
## ------------------------------------------------------------------------------------------------------------
Similarly, we’ll look at the pastecs package, which gives you nbr.val, nbr.null, nbr.na, min max, range, sum, median, mean, SE.mean, CI.mean, var, std.dev, and coef.var as a series of more advanced statistics.
library('pastecs')
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:magrittr':
##
## extract
## The following objects are masked from 'package:dplyr':
##
## first, last
## The following object is masked from 'package:tidyr':
##
## extract
stat.desc(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## nbr.val 150.00000000 150.00000000 150.0000000 150.00000000 NA
## nbr.null 0.00000000 0.00000000 0.0000000 0.00000000 NA
## nbr.na 0.00000000 0.00000000 0.0000000 0.00000000 NA
## min 4.30000000 2.00000000 1.0000000 0.10000000 NA
## max 7.90000000 4.40000000 6.9000000 2.50000000 NA
## range 3.60000000 2.40000000 5.9000000 2.40000000 NA
## sum 876.50000000 458.60000000 563.7000000 179.90000000 NA
## median 5.80000000 3.00000000 4.3500000 1.30000000 NA
## mean 5.84333333 3.05733333 3.7580000 1.19933333 NA
## SE.mean 0.06761132 0.03558833 0.1441360 0.06223645 NA
## CI.mean.0.95 0.13360085 0.07032302 0.2848146 0.12298004 NA
## var 0.68569351 0.18997942 3.1162779 0.58100626 NA
## std.dev 0.82806613 0.43586628 1.7652982 0.76223767 NA
## coef.var 0.14171126 0.14256420 0.4697441 0.63555114 NA
Finally, the psych package is very common for social science scholars, which provides another series of descriptive outputs that include: item name, item number, nvalid, mean, sd, median, mad, min, max, skew, kurtosis, and se:
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following object is masked from 'package:randomForest':
##
## outlier
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
::describe(iris) #calling psych:: to ensure we are using the correct package psych
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.3 7.9 3.6 0.31 -0.61 0.07
## Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.0 4.4 2.4 0.31 0.14 0.04
## Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.0 6.9 5.9 -0.27 -1.42 0.14
## Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.1 2.5 2.4 -0.10 -1.36 0.06
## Species* 5 150 2.00 0.82 2.00 2.00 1.48 1.0 3.0 2.0 0.00 -1.52 0.07
Another useful element of the psych package is that you can describe by group, whereby we can get the same descriptive statistics as above, but specifically for each species:
describeBy(iris, group = 'Species')
##
## Descriptive statistics by group
## Species: setosa
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Sepal.Length 1 50 5.01 0.35 5.0 5.00 0.30 4.3 5.8 1.5 0.11 -0.45 0.05
## Sepal.Width 2 50 3.43 0.38 3.4 3.42 0.37 2.3 4.4 2.1 0.04 0.60 0.05
## Petal.Length 3 50 1.46 0.17 1.5 1.46 0.15 1.0 1.9 0.9 0.10 0.65 0.02
## Petal.Width 4 50 0.25 0.11 0.2 0.24 0.00 0.1 0.6 0.5 1.18 1.26 0.01
## Species* 5 50 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN 0.00
## ---------------------------------------------------------------------------------
## Species: versicolor
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Sepal.Length 1 50 5.94 0.52 5.90 5.94 0.52 4.9 7.0 2.1 0.10 -0.69 0.07
## Sepal.Width 2 50 2.77 0.31 2.80 2.78 0.30 2.0 3.4 1.4 -0.34 -0.55 0.04
## Petal.Length 3 50 4.26 0.47 4.35 4.29 0.52 3.0 5.1 2.1 -0.57 -0.19 0.07
## Petal.Width 4 50 1.33 0.20 1.30 1.32 0.22 1.0 1.8 0.8 -0.03 -0.59 0.03
## Species* 5 50 2.00 0.00 2.00 2.00 0.00 2.0 2.0 0.0 NaN NaN 0.00
## ---------------------------------------------------------------------------------
## Species: virginica
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Sepal.Length 1 50 6.59 0.64 6.50 6.57 0.59 4.9 7.9 3.0 0.11 -0.20 0.09
## Sepal.Width 2 50 2.97 0.32 3.00 2.96 0.30 2.2 3.8 1.6 0.34 0.38 0.05
## Petal.Length 3 50 5.55 0.55 5.55 5.51 0.67 4.5 6.9 2.4 0.52 -0.37 0.08
## Petal.Width 4 50 2.03 0.27 2.00 2.03 0.30 1.4 2.5 1.1 -0.12 -0.75 0.04
## Species* 5 50 3.00 0.00 3.00 3.00 0.00 3.0 3.0 0.0 NaN NaN 0.00
5.1.1.1 Activity
Part 1: Use all of the packages and commands above that we have discussed and use the mpg package. For the group descriptives, use any of the grouping variables such as class or manufacturer.
Part 2: From last week, add in some missing values (10-15%), and rerun all of your code from part 1 with na.rm = TRUE and again na.rm = FALSE and see the differences.
5.1.2 Correlations
Correlation or dependence is any statistical relationship, whether causal or not (this is important!), between two random variables or bivariate data. In the broadest sense, a correlation is any statistical association between two variables, though it commonly refers to the degree to which a pair of variables are linearly related.
Thre are a number of ways we can capture a correlation both numerically or visually. For instance, in base R we can use cor() to get the correlation. You can define the method: perason, spearman, or kendall. However, if you’d like to see the p-value, you’d need to use the cor.test() function to get the correlation coefficient. Similarly, using the Hmisc package, the rcorr() command produces correlations, covariances, and significance levels for pearson and spearman correlations. Lets have a look:
cor(mtcars, use = "everything", method ="pearson")
## mpg cyl disp hp drat wt qsec vs am
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594 0.41868403 0.6640389 0.59983243
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958 -0.59124207 -0.8108118 -0.52260705
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799 -0.43369788 -0.7104159 -0.59122704
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479 -0.70822339 -0.7230967 -0.24320426
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406 0.09120476 0.4402785 0.71271113
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000 -0.17471588 -0.5549157 -0.69249526
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159 1.00000000 0.7445354 -0.22986086
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157 0.74453544 1.0000000 0.16834512
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953 -0.22986086 0.1683451 1.00000000
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870 -0.21268223 0.2060233 0.79405876
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059 -0.65624923 -0.5696071 0.05753435
## gear carb
## mpg 0.4802848 -0.55092507
## cyl -0.4926866 0.52698829
## disp -0.5555692 0.39497686
## hp -0.1257043 0.74981247
## drat 0.6996101 -0.09078980
## wt -0.5832870 0.42760594
## qsec -0.2126822 -0.65624923
## vs 0.2060233 -0.56960714
## am 0.7940588 0.05753435
## gear 1.0000000 0.27407284
## carb 0.2740728 1.00000000
#please note: under use, this allows you to specify how to use missing data, where all.obs = assumes no missing data
#and if there is missing data this will error; complete.obs will perform listwise deletion, etc.
#As above, if we wanted to see the correlation coefficients, using cor.test() is one way to do this.
#this provides correlation statistics for a SINGLE COEFFICIENT (see below)
cor.test(mtcars$mpg, mtcars$cyl)
##
## Pearson's product-moment correlation
##
## data: mtcars$mpg and mtcars$cyl
## t = -8.9197, df = 30, p-value = 6.113e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9257694 -0.7163171
## sample estimates:
## cor
## -0.852162
#using Hmisc package
#note: input MUST be a matrix and pairwise deletion must have been useed
::rcorr(as.matrix(mtcars)) Hmisc
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
## cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
## disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
## hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
## drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
## wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43
## qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 0.74 -0.23 -0.21 -0.66
## vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57
## am 0.60 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1.00 0.79 0.06
## gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 0.21 0.79 1.00 0.27
## carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1.00
##
## n= 32
##
##
## P
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 0.0000 0.0000 0.0000 0.0000 0.0000 0.0171 0.0000 0.0003 0.0054 0.0011
## cyl 0.0000 0.0000 0.0000 0.0000 0.0000 0.0004 0.0000 0.0022 0.0042 0.0019
## disp 0.0000 0.0000 0.0000 0.0000 0.0000 0.0131 0.0000 0.0004 0.0010 0.0253
## hp 0.0000 0.0000 0.0000 0.0100 0.0000 0.0000 0.0000 0.1798 0.4930 0.0000
## drat 0.0000 0.0000 0.0000 0.0100 0.0000 0.6196 0.0117 0.0000 0.0000 0.6212
## wt 0.0000 0.0000 0.0000 0.0000 0.0000 0.3389 0.0010 0.0000 0.0005 0.0146
## qsec 0.0171 0.0004 0.0131 0.0000 0.6196 0.3389 0.0000 0.2057 0.2425 0.0000
## vs 0.0000 0.0000 0.0000 0.0000 0.0117 0.0010 0.0000 0.3570 0.2579 0.0007
## am 0.0003 0.0022 0.0004 0.1798 0.0000 0.0000 0.2057 0.3570 0.0000 0.7545
## gear 0.0054 0.0042 0.0010 0.4930 0.0000 0.0005 0.2425 0.2579 0.0000 0.1290
## carb 0.0011 0.0019 0.0253 0.0000 0.6212 0.0146 0.0000 0.0007 0.7545 0.1290
5.1.3 Visual Correlations using corrplot
<- cor(mtcars) #need to have the correlations to plot
mtcars_mat
::corrplot(mtcars_mat, method = 'ellipse', type = 'upper') corrplot
5.1.3.1 Activity
Read through this tutorial: (https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) and (https://cran.r-project.org/web/packages/corrplot/corrplot.pdf) and experiment with all of the various parameters (e.g., method = circle, square, ellispse, number, shade…) – use the iris, diamonds, and mpg datasets.
5.1.4 t-tests
t-tests compare two groups, for instance, control group and experimental groups. There are different types of t-tests, where the most common you’ll be looking at are independent t-tests or paired t-tests.
Briefly, an independent or unpaired t-test is when you’re comparing groups that are independent of one another, aka, each observation or participant was exposed to one of the groups only. For instance, in an exeriemtn you are only in the CONTROL or the EXPERIMENTAL condition.
In contrast, paired t-tests or repeated measures t-tests is where the participants are in BOTH groups, typically this is where participants are tested on one day and tested again at a later time. For example, you might be analyzing anxiety at the start of the week (Monday) versus the end of the week (Friday), so you ask participant to fill out the GAD7 on Monday and again on Friday and compare the difference.
5.1.4.1 Independent Samples t-test
Let’s create a dataset to use for an INDEPENDENT samples t-test:
# Data in two numeric vectors
<- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
women_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
men_weight
# Create a data frame
<- data.frame(
my_data group = rep(c("Woman", "Man"), each = 9),
weight = c(women_weight, men_weight)
)
#check
print(my_data)
## group weight
## 1 Woman 38.9
## 2 Woman 61.2
## 3 Woman 73.3
## 4 Woman 21.8
## 5 Woman 63.4
## 6 Woman 64.6
## 7 Woman 48.4
## 8 Woman 48.8
## 9 Woman 48.5
## 10 Man 67.8
## 11 Man 60.0
## 12 Man 63.4
## 13 Man 76.0
## 14 Man 89.4
## 15 Man 73.3
## 16 Man 67.3
## 17 Man 61.3
## 18 Man 62.4
It is always important to do some descriptive summary statistics across groups:
group_by(my_data, group) %>%
::summarise(
dplyrcount = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)
## # A tibble: 2 × 4
## group count mean sd
## <chr> <int> <dbl> <dbl>
## 1 Man 9 69.0 9.38
## 2 Woman 9 52.1 15.6
#or we can of course use describeBy
describeBy(my_data, group = 'group')
##
## Descriptive statistics by group
## group: Man
## vars n mean sd median trimmed mad min max range skew kurtosis se
## group* 1 9 1.00 0.00 1.0 1.00 0.0 1 1.0 0.0 NaN NaN 0.00
## weight 2 9 68.99 9.38 67.3 68.99 8.9 60 89.4 29.4 0.98 -0.28 3.13
## ---------------------------------------------------------------------------------
## group: Woman
## vars n mean sd median trimmed mad min max range skew kurtosis se
## group* 1 9 1.0 0.0 1.0 1.0 0.00 1.0 1.0 0.0 NaN NaN 0.0
## weight 2 9 52.1 15.6 48.8 52.1 18.38 21.8 73.3 51.5 -0.49 -0.89 5.2
As you will know, there are a number of assumptions that need to be checked before doing t-tests (or any linear tests, e.g., regression). For instance:
are your data normally distributed?
do the two sampls have the same variances? (e.g., looking for homogenetity, where we do not want significant differences between variances)
Let’s check those assumptions. We will use the Shapiro-Wilk normality test to test for normality in each group, where the null hypothesis is that the data are normally distributed and the alternative hypothesis is that the data are not normally distributed.
We’ll use the functions with() and shapiro.test() to compute Shapiro-Wilk test for each group of samples (without splitting data into two using subset() and then using the shapiro.test() function).
# Shapiro-Wilk normality test for Men's weights
with(my_data, shapiro.test(weight[group == "Man"]))
##
## Shapiro-Wilk normality test
##
## data: weight[group == "Man"]
## W = 0.86425, p-value = 0.1066
# Shapiro-Wilk normality test for Women's weights
with(my_data, shapiro.test(weight[group == "Woman"]))
##
## Shapiro-Wilk normality test
##
## data: weight[group == "Woman"]
## W = 0.94266, p-value = 0.6101
Here we can see that both groups have a p-value that are greater than 0.05 whereby we interpret that the distributions are nor significantly different from a normal distribution. We can assume normality.
TO check for homogeneity in variances, we’ll use var.test(), which is an F-test:
<- var.test(weight ~ group, data = my_data)
res.ftest res.ftest
##
## F test to compare two variances
##
## data: weight by group
## F = 0.36134, num df = 8, denom df = 8, p-value = 0.1714
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.08150656 1.60191315
## sample estimates:
## ratio of variances
## 0.3613398
Here, the p-value of F-test is p = 0.1713596. It is greater than the significance level alpha = 0.05, whereby here is no significant difference between the variances of the two sets of data. Therefore, we can use the classic t-test witch assume equality of the two variances.
Now, we can go ahead and do a t-test:
#there are multiple ways you can write this code
# Compute t-test method one:
<- t.test(weight ~ group, data = my_data, var.equal = TRUE)
res res
##
## Two Sample t-test
##
## data: weight by group
## t = 2.7842, df = 16, p-value = 0.01327
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.029759 29.748019
## sample estimates:
## mean in group Man mean in group Woman
## 68.98889 52.10000
# Compute t-test method two
<- t.test(women_weight, men_weight, var.equal = TRUE)
res res
##
## Two Sample t-test
##
## data: women_weight and men_weight
## t = -2.7842, df = 16, p-value = 0.01327
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -29.748019 -4.029759
## sample estimates:
## mean of x mean of y
## 52.10000 68.98889
Briefly, the p-value of the test is 0.01327, which is less than the significance level alpha = 0.05. Hence, we can conclude that men’s average weight is significantly different from women’s average weight.
5.1.4.2 Paired Sample t-test
Here, we will look at the paired sample t-test. Lets create a dataset:
# Weight of the mice before treatment
<-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
before # Weight of the mice after treatment
<-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
after # Create a data frame
<- data.frame(
my_data_paired group = rep(c("before", "after"), each = 10),
weight = c(before, after)
)
print(my_data_paired)
## group weight
## 1 before 200.1
## 2 before 190.9
## 3 before 192.7
## 4 before 213.0
## 5 before 241.4
## 6 before 196.9
## 7 before 172.2
## 8 before 185.5
## 9 before 205.2
## 10 before 193.7
## 11 after 392.9
## 12 after 393.2
## 13 after 345.1
## 14 after 393.0
## 15 after 434.0
## 16 after 427.9
## 17 after 422.0
## 18 after 383.9
## 19 after 392.3
## 20 after 352.2
group_by(my_data_paired, group) %>%
::summarise(
dplyrcount = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)
## # A tibble: 2 × 4
## group count mean sd
## <chr> <int> <dbl> <dbl>
## 1 after 10 394. 29.4
## 2 before 10 199. 18.5
#or we can of course use describeBy
describeBy(my_data_paired, group = 'group')
##
## Descriptive statistics by group
## group: after
## vars n mean sd median trimmed mad min max range skew kurtosis se
## group* 1 10 1.00 0.0 1.00 1.00 0.00 1.0 1 0.0 NaN NaN 0.0
## weight 2 10 393.65 29.4 392.95 394.68 28.24 345.1 434 88.9 -0.23 -1.23 9.3
## ---------------------------------------------------------------------------------
## group: before
## vars n mean sd median trimmed mad min max range skew kurtosis se
## group* 1 10 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN 0.00
## weight 2 10 199.16 18.47 195.3 197.25 10.82 172.2 241.4 69.2 0.87 0.26 5.84
You can also visualize this, where there are specific packages for paired data (boxplots are another useful method):
library('PairedData')
# Subset weight data before treatment
<- subset(my_data_paired, group == "before", weight,
before drop = TRUE)
# subset weight data after treatment
<- subset(my_data_paired, group == "after", weight,
after drop = TRUE)
# Plot paired data
<- paired(before, after)
pd plot(pd, type = "profile") + theme_bw()
Due to the small sample size we will want to check again for normality:
# compute the difference between the before and after
<- with(my_data_paired,
d == "before"] - weight[group == "after"])
weight[group
# Shapiro-Wilk normality test for the differences
shapiro.test(d)
##
## Shapiro-Wilk normality test
##
## data: d
## W = 0.94536, p-value = 0.6141
Here, as the Shapiro-Wilk normality test p-value is greater than 0.05, we can assume normality.
Let’s so the paired sample t-test:
# Compute t-test
<- t.test(weight ~ group, data = my_data_paired, paired = TRUE)
res res
##
## Paired t-test
##
## data: weight by group
## t = 20.883, df = 9, p-value = 6.2e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 173.4219 215.5581
## sample estimates:
## mean of the differences
## 194.49
So the conclusion here is that the p-value of the test is less than the significance level alpha = 0.05, so we can then reject null hypothesis and conclude that the average weight of the mice before treatment is significantly different from the average weight after treatment.
Please also note: if your data are not normally distributed, there are non-parametric tests that you can use instead. For independent t-tests we can use the Mann-Whitney test, which is the wilcox.test() function and for paired t-tests we can use Wilcoxon Rank tests, by using the wilcox.test() function with the paired = TRUE command.
5.2 Lesson 2: Data Visualization
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs and maps, data visualization tools provide an accessible way to see and understand trends, outliers, or patterns in data. Data visualization is an incredibly powerful tool and one that it worth learning about. Sometimes this can be trail and error to see what type of visual works best with data, however, there are some excellent ‘rules of thumb’, that we will look into. In the world of big data, data visualization tools and technologies are even more essential for analyzing various types of data, information, and formulating insights from them.
The history of data visualization can be argued to be the history of many millennia, this often consisted of start constellations, maps for navigation—where the Egyptians were using early coordinates to pay out towns, earthly and heavenly positions as early as 200 B.C. (Friendly, 2008). In other kinds of visualization, some of the earliest artwork of hands drawn by humans even as early as 40,000 years ago. It was not until the 14th Century that the concept of ‘plotting’ came into fruition, where the link between tabulate data and the ability to visualize this as a graphical plot appeared by Nicole Oresme (1323-1382). Please read more about this [here] (https://link.springer.com/chapter/10.1007/978-3-540-33037-0_2).
Brefly, it is important to remember that data visualization is far more than simply visualizing numbers. For instance, we can visualize ‘artistic data’, such as da Vinci’s famous work spanning both art and science is arguably another highly influential example of data visualization. For instance, da Vinci’s masterfully depicted human and animal bodies, demonstrating great detail and annotation to visualize biological structures from skeletal to muscular, which are incredibly beautiful piece of art, but also science. There is a great creative side to data visualization, so when going through this lesson – experiment!
5.2.1 The ggplot way
In earlier modules, you’ve had some interactions with data visualization, so this will be a recap for the most part in R using ggplot2.
Let’s focus on the mpg dataset, and we will just start with a simple plot:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Here, as with all ggplots – we follow the usual template, of calling data, and then stating which type of plot (e.g., point, bar, smooth), and defining the aesthetics (x, y, and grouping variables, potentially color and shape types).
The same plot above can be coded differently, too. This is a short hand version, but also we are defining the aesthetics in the overall plot call rather than specifically in the geom_point. There are reasons why you might do one over the other, for instance, if you are doing highly complex plots with multiple layers, the first method of defining aesthetics in each plot layer (e.g., geom_point) is more sensible.
ggplot(mpg, aes(displ, hwy)) +
geom_point()
We can add more information into this plot by coloring the points by a categorical variable, like class.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
There is a lot we can continue to edit, where we might want the size of the circles to change according to number of cyliners…
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cyl))
You can force size to use unordered variables like class, but this is not a great idea as you should want the size of bubbles to be meaningful and to not cause confusion.
We could also use the alpha aesthetic, which increases transparency, which is especially useful when we have a lot of points and it is difficult to distinguish and see density. Simialrly, we can use different shapes, too. However, please be careful when using shapes, as sometimes ggplot has limits of 6, where the additional groups are unplotted (here thats suv).
#looking at alpha and transparency
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
#looking at different shapes
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes
## difficult to discriminate; you have 7. Consider specifying shapes manually if you must have
## them.
## Warning: Removed 62 rows containing missing values (geom_point).
You can easily edit and adapt your plots in terms of color, too, which is very simple, but remember brackets are extremely important here. When you are defining the aesthetic arguments, aka defining which variables are used where and how remain in bracket. Anything you manually change typically goes outside of the aes() bracket… this includes color (string), size of points (in mm), and shapes. See here for colors, linetypes, and shape choices in [ggplot2] (http://sape.inf.usi.ch/quick-reference/ggplot2/colour)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "darkorchid3")
Note: when using color, note when you need to use color = ‘red’ and fill = ‘red’. Certain shapes that are ‘hollow’ need to be filled and if you use color it will just change the color of the outline. For example, R has 25 built in shapes that are identified by numbers (see http://sape.inf.usi.ch/quick-reference/ggplot2/shape). Some may look like duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0–14) have a border determined by colour; the solid shapes (15–20) are filled with color; the filled shapes (21–24) have a border of color and are filled with fill.
5.2.1.1 Activity
Using the mpg dataset and plot various variables and change the color, shape, and size of the points.
Specifically then also:
Use shape 23 and ensure it is completely blue.
Use shape 14 and color it purple.
Plot displ and hwy and make the points bigger than the usual ggplot size.
Fix this code to ensure the points are green:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "green"))
Map a continuous variable to color, size, and shape. What happens? How do these aesthetics behave differently for categorical vs. continuous variables?
What happens if you map the same variable to multiple aesthetics?
What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)? Note, you’ll also need to specify x and y.
5.2.2 Facets
Another way we can add in additional variables rather than just using aesthetics is using facets. These are sub-plots that display different parts of the data, therefore each sub-plot plots a subset of the dataset using the facet_wrap() command:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
You can also facet plots on the combination of two variables, where you use facet_grid() instead.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
5.2.2.1 Activity
Using the iris dataset, create a facet plot using “Species” to facet the plots (which other variables you use does not matter).
What happens if you facet on a continuous variable?
What does nrow do? Try changing it using either dataset.
Now try ncol, what does that do?
5.3 Lesson 3: More Advanced Data Visualization
There are a number of ways we can visualize the same variables. For instance, the two plots below, both contain the same x variable, the same y variable, and both describe the same data. However, each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms, specifically plot A uses geom_point() as we used above, and plot B uses geom_smooth():
library('cowplot')
<- ggplot(data = mpg) +
a geom_point(mapping = aes(x = displ, y = hwy))
<- ggplot(data = mpg) +
b geom_smooth(mapping = aes(x = displ, y = hwy))
plot_grid(a, b, labels = c('A', 'B'))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms: geom_bar(), line charts use line geoms: geom_line(), boxplots use boxplot geoms: geom_boxplot(), and so on. Scatterplots break the trend; they use the point geom. As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.
As we looked at briefly above, we can change the aesthetics… Here geom_smooth() separates the cars into three lines based on their drv value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive. Hence, there are 3 separate lines describing each drive type.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You can add color and actually plot the points, too, as this might make the relationships clearer:
ggplot(data = mpg) +
geom_point(aes(displ, hwy, color=drv)) +
geom_smooth(aes(displ, hwy, linetype = drv, color = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You can see in the previous code that you can build up a number of layers and mappings using color, linetypes, shapes, sizes… You can control whether you see the legend too, which is another line of code to remove this. However, think wisely as to whether this is a good idea or not.
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE
)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Do not forget you can label your axes better and also add titles…
ggplot(data = mpg) +
geom_point(aes(displ, hwy, color=drv)) +
geom_smooth(aes(displ, hwy, linetype = drv, color = drv)) +
xlab("A nice x axes label") +
ylab("Another nice label") +
ggtitle("A very sensible title")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There are a lot of additional ways you can edit your plots, for example, using the ggthemes package, you have access to a number of additional overall aesthetics on your plots… for example:
library('ggthemes')
##
## Attaching package: 'ggthemes'
## The following object is masked from 'package:cowplot':
##
## theme_map
#here we are changing the last line of code to try
#different themes that change the overall look (look up the package to see more)
<- ggplot(data = mpg) +
a geom_point(mapping = aes(x = displ, y = hwy)) +
theme_bw()
<- ggplot(data = mpg) +
b geom_point(mapping = aes(x = displ, y = hwy)) +
theme_fivethirtyeight()
<- ggplot(data = mpg) +
c geom_point(mapping = aes(x = displ, y = hwy)) +
theme_gray()
<- ggplot(data = mpg) +
d geom_point(mapping = aes(x = displ, y = hwy)) +
theme_economist()
<- ggplot(data = mpg) +
e geom_point(mapping = aes(x = displ, y = hwy)) +
theme_solarized_2()
<- ggplot(data = mpg) +
f geom_point(mapping = aes(x = displ, y = hwy)) +
theme_foundation()
plot_grid(a,b,c,d,e,f)
5.3.0.1 Activity
What geom would you use to draw a line chart? A boxplot? A histogram? An area chart? Try each of these using the iris or mpg dataset.
What does the se argument to geom_smooth() do? Try changing it and see what happens.
Recreate the following plots:
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Using size for a discrete variable is not advised.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
5.3.1 Bar Plots
While bar plots in theory appear to be a simple kind of plot, they can actually become quite complex to plot how you want them. Lets look at the diamonds dataset and look at cut. gpglot is clever in that it will use count here, despite it is not in the dataset itself.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
This is not the only type of plot that will create it’s own variables in orer to plot…While many graphs, like scatterplots, plot the raw values of your dataset… other graphs, like bar charts, calculate new values to plot:
bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
smoothers fit a model to your data and then plot predictions from the model.
boxplots compute a robust summary of the distribution and then display a specially formatted box.
For geom_bar(), the default behavior is to count the rows for each x value. It doesn’t expect a y-value, since it’s going to count itself – in fact, it will flag a warning if you give it one, since it thinks you’re confused. How aggregation is to be performed is specified as an argument to geom_bar(), which is stat = “count” for the default value. Hence, if you explicitly say stat = “identity” in geom_bar(), you’re telling ggplot2 to skip the aggregation and that you’ll provide the y values. This mirrors the natural behavior of geom_col(), which is a very similar geom_object, see below for the example:
#notice the additional line to rotate axis labels as they overlapped.
<- ggplot(mpg) +
a geom_bar(aes(x = class, y = hwy), stat = "identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
<- ggplot(mpg) +
bgeom_col(aes(x = class, y = hwy)) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
plot_grid(a,b, labels = c("geom_bar", "geom_col"))
Let’s look some more at how we can display more information. Remember earlier we looked at the difference between fill and color? Let’s look again here to show how these color arguments effect var plots:
<- ggplot(data = diamonds) +
ageom_bar(mapping = aes(x = cut, colour = cut))+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
<- ggplot(data = diamonds) +
bgeom_bar(mapping = aes(x = cut, fill = cut))+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
plot_grid(a,b, labels = c("color", "fill"))
We can also add grouping variables, too… and these can be displayed differently (e.g., identity, dodge, or fill). Essentially, dodge is a grouping effect around a specific variable, and here it is cut; essentially it places overlapping objects directly beside one another. This makes it easier to compare individual values. Whereas, fill will give you a % barplot; and works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
#lets look at the different positions...
<-ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
ageom_bar(position = "dodge")+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
<- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
bgeom_bar(position = "fill")+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
<- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
cgeom_bar(position = "identity")+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
plot_grid(a,b,c, labels = c("dogde", "fill", 'identity'))
5.3.1.1 Activity
Go to this [website] (https://datavizcatalogue.com/search.html) and read up about different types of data visualization.
Spend at least 60 minutes experimenting with various types of data visualizations in R and use any of the datasets, iris, mpg, diamonds, or nycflights.