Chapter 20 FAQs
20.1 How do I convert my survey scores into percentages?
In the example below we have to arrays of data. One provides the survey score for
each of 12 Haggises for burrowing activity, and the other is the score for the use of
taxis by these individuals.
Since the scores have different denominators, it is advisable the the scores be converted into percentages before comparing the two variables.
<- c(2,3,4,1,3,4,3,5,3,2,3,4)
burrowing <- c(9,9,8,6,9,4,3,2,8,9,8,7)
taxi_use <- 100*(burrowing/5)
burrow_perc <-100*(taxi_use/10) taxi_use_perc
20.2 How do I test my data for Normality?
Testing for Normality of a variable can be described as testing to see if the data follows the Normal distribution. The most commonly used way to assess this is by visual inspection, either by plotting a Histogram (from the data) or by plotting a Q-Q plot (or Quantile-Quantile plot). The formal methods commonly used are the Shapiro-Wilk test and the Kolmogorov-Smirnov test.
20.2.1 Visual methods
When the data is plotted as a histogram, Normality is assessed by inspecting the shape of the histogram. Typically, if the histogram exhibits a bell-shape, the data is assumed to be Normal. The shape of the histogram can also indicate whether the distribution is skewed. For the Q-Q plot, Normality is assumed if the points on the plot follow (approximately) the diagonal line. In this case, the Q-Q plot is a scatterplot of the quantiles from the data plotted against theoretical, Normally distributed quantiles.
Try the following code to test the Haggis data (from the section on T-tests) for Normality.
library(ggplot2)
<- data.frame(c(before,after))
ba $timeframe <- c(rep("before",length(before)),rep("after",length(after)))
bacolnames(ba)<-c("number","timeframe")
# Histogram
ggplot(ba, aes(x = number)) + facet_wrap(. ~timeframe,ncol=1)+
geom_histogram(binwidth=1,aes(fill = ..count..)) +
scale_x_continuous(name = "Percentage of limpers per household") +
scale_y_continuous(name = "Count") +
ggtitle("Percentage of limpers per Haggis household",
subtitle="before and after physiotherapy")+
theme(text = element_text(size=40))
# Q-Q plot
qqnorm(before,cex=4,pch=18,cex.main=3,cex.axis=3,cex.lab=2)
qqnorm(after,cex=4,pch=18,cex.main=3,cex.axis=3,cex.lab=2)
20.2.2 Formal methods
For both of the abovementioned formal methods, a p-value greater than 0.05 indicates that the data is Normally distributed.
Try the following code to test the Haggis data for Normality.
# Shapiro-Wilk test
shapiro.test(before)
##
## Shapiro-Wilk normality test
##
## data: before
## W = 0.96498, p-value = 0.009255
# Kolgomorov test
ks.test(after, 'pnorm')
##
## One-sample Kolmogorov-Smirnov test
##
## data: after
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
20.3 How do I conduct multiple correlation tests?
For this topic, we’ve created some data to illustrate the method. The data will represent three variables obtained from a survey of 1000 individuals. Lets imagine that the researcher at hand is interested in assessing the correlation between the variables as an exploratory exercise. Lets give the variables the names: “limping”, “burrowing”,“symbiosis”. I guess you’ve figured out that I like the Haggis data example - the variables above refer to limping score, burrowing activity, and symbiotic activity (engaging for mutual benefit with other species/genera) for each of 1000 Haggises.
First we’ll inspect the names of the variables in the dataset We’ll then visualize the correlations, then print out the matrix of correlations (correlation matrix).
names(dataf)
## [1] "limping" "burrowing" "symbiosis"
ggpairs(
dataf, upper = list(
continuous = wrap("cor", size = 10, alignPercent = 1)
) )
<-cor(dataf)
Mat corrplot(Mat, method="circle")
Mat
## limping burrowing symbiosis
## limping 1.00000000 -0.6039400 -0.09197149
## burrowing -0.60393996 1.0000000 0.23888882
## symbiosis -0.09197149 0.2388888 1.00000000
20.4 How do I deal with missing data?
One common approach to this is data imputation. There are various methods available including replacing an NA in a given variable with the mean value (or median,or mode) observed for that variable. There are formal approaches to characterize the missingness and it is worth examining the data to see if there are any patterns in missingness. More information will be provided here using the chocolate biscuit case study.