3 Plots in 2.2 with base R

This section shows one way to generate the plots in Slide 2.2 with base R.

3.1 Set up

We need the Hollywood movies dataset from the Lock5withR package and the dotPlot() function from the mosaic package to draw the dotplot.

library(Lock5withR)
library(mosaic)

3.2 World Gross of all Hollywood Movies in 2011

The HollywoodMovies2011 [1] dataset contains data on Hollywood movies released in 2011, with 136 cases and 14 variables. In this section, the quantitative variable of interest is WorldGross [2], the gross income for all viewers (in millions of dollars). Note: this is a population because it contains all Hollywood movies released in 2011.

3.2.1 Dotplot

movieDot <- mosaic::dotPlot(Lock5withR::HollywoodMovies2011$WorldGross,
                    # we use 25 as bin width here.
                    breaks=seq(0, 1400, by=25),right=FALSE,
                    main='World gross for all 2011 Hollywood movies',
                    xlab = "World Gross (in millions of dollars)", col='thistle4')
movieDot$x.scales$tick.number= 7
movieDot$y.scales$tick.number= 5
movieDot$x.limits=c(-100,1500)
movieDot$y.limits=c(-2,32)

movieDot

In this dotplot, each dot corresponds to one case (movie) in the dataset. If there are multiple cases with the same (or similar) values, we stack the dots vertically [3]. This lets us see the shape of the data’s distribution. In this example, world gross ranges from 0 to around 1.3 billion dollars. Most of the cases are piled up on the left, at relatively low world gross values (this is called skewed to the right), and there are three cases with unusually large values, greater than 1 billion dollars (these are called outliers). We can also see that a large proportion of Hollywood movies make less than 300 million dollars (gross). You can obtain a lot of useful information by simply plotting the data. Always plot your data!
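The stacking idea can also be sketched with base R alone, using stripchart(). This is only a minimal sketch and not the method used above: the values in toyGross are simulated (hypothetical) rather than taken from HollywoodMovies2011, and the names toyGross, binWidth and binned are introduced here purely for illustration.

```r
# Hypothetical right-skewed values standing in for world gross (not real data).
set.seed(42)
toyGross <- rexp(130, rate = 1 / 150)

# Snap each value to the centre of its bin so that similar values stack.
binWidth <- 25
binned <- floor(toyGross / binWidth) * binWidth + binWidth / 2

# Base R dotplot: one dot per case, stacked vertically within each bin.
stripchart(binned, method = "stack", offset = 0.5, at = 0.15, pch = 19,
           col = "thistle4", xlab = "Simulated value (bin width 25)",
           main = "Stacked dotplot with base R")

# Height of the tallest stack, i.e. the largest bin count.
max(table(binned))
```

Snapping values to bin centres before plotting is what makes "similar" values share a stack; a smaller binWidth gives more, shorter stacks.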

3.2.2 With median and mean

movieMean=mean(Lock5withR::HollywoodMovies2011$WorldGross, na.rm=TRUE)
movieMedian=median(Lock5withR::HollywoodMovies2011$WorldGross, na.rm=TRUE)
movieDot$panel = function(x,y,...){
  panel.abline(v = c(movieMean,movieMedian), col=c('red', 'blue'))
  panel.dotPlot(Lock5withR::HollywoodMovies2011$WorldGross,
                breaks=seq(0, 1400, by=25),col='thistle4')
  panel.text(280,25,
             labels=paste('Mean', round(movieMean,2),sep='='), 
             col='red', cex= 0.8)
  panel.text(-20,30,
             labels=paste('Median', round(movieMedian,2),sep='=\n'),
             col='blue', cex= 0.8)}
movieDot

A few very large world gross values pull the population mean up, in the direction of the skew. These large values do not affect the median, because the median only splits the data in half (67 values above the median and 67 values below). In this example, the two measures of centre differ substantially, and almost 2/3 of the cases lie below the mean.
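The same effect can be seen on a much smaller scale. The sketch below uses made-up toy values (x and xOutlier are hypothetical, not taken from the dataset) to show how a single extreme value moves the mean but leaves the median where it was.

```r
# Hypothetical toy values (not from HollywoodMovies2011).
x <- c(10, 20, 30, 40, 50)
xOutlier <- c(10, 20, 30, 40, 5000)  # same data with one extreme value

mean(x)            # 30
median(x)          # 30

mean(xOutlier)     # 1020: pulled towards the extreme value
median(xOutlier)   # 30: unchanged, still the middle value

# Proportion of cases that lie below the mean.
mean(xOutlier < mean(xOutlier))   # 0.8
```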

3.2.3 Histogram

To produce a histogram, we must choose a bin width or specify the total number of bins. In this example, we set the bin width to 25, the same as in the dotplot above, but other choices are possible.

movieHist <- hist(Lock5withR::HollywoodMovies2011$WorldGross,breaks=seq(0, 1400, by=25),
                  main='World gross for all 2011 Hollywood movies',right=FALSE,
                  xlab = "World Gross (in millions of dollars)",
                  col='thistle4',ylim=c(0,30))
abline(h=seq(0,30,5), v=seq(0,1400,200), col="gray", lty=3)
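To see what the breaks and right arguments do to the bin counts, it can help to call hist() with plot=FALSE on a few values. This is a small sketch on hypothetical toy values; the names x, leftClosed, rightClosed and wideBins are introduced here for illustration.

```r
# Hypothetical toy values chosen to sit on the bin edges.
x <- c(0, 25, 25, 49, 50)
breaks <- seq(0, 75, by = 25)

# right = FALSE: bins are [0,25), [25,50), [50,75]
leftClosed <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts
leftClosed    # 1 3 1

# right = TRUE (the default): bins are [0,25], (25,50], (50,75]
rightClosed <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)$counts
rightClosed   # 3 2 0

# A wider bin width (50) merges neighbouring bins: [0,50), [50,100]
wideBins <- hist(x, breaks = seq(0, 100, by = 50), right = FALSE, plot = FALSE)$counts
wideBins      # 4 1
```

Values that fall exactly on a break move to a different bin depending on right, which is why the dotplot and the histogram above both pass right=FALSE.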

3.3 Shapes

3.3.1 Bell shaped

First, let’s take a look at a bell-shaped distribution. This is the most important symmetric distribution in science, and we’ll see a lot of bell-shaped distributions in this course!

set.seed(1)
meanC = 0
sdC = 0.25
groupCenter<-rnorm(1000,mean=meanC,sd=sdC)
hist(groupCenter, breaks = 20, main='mean=0, std=0.25, n=1000', 
     xlim = c(-1,1),   ylim = c(0,200), right=FALSE,
     xlab = '', ylab = 'Frequency')
abline(h=seq(0,200,50), v=seq(-1,1,0.5), col="gray", lty=3)

If we use a bigger standard deviation (3 times as large in this case), we get a flatter, more spread-out distribution.

set.seed(1)

par(mar = c(2, 2, 1, 0), mfrow=c(2,1))

histTall <- hist(groupCenter, breaks=20,
     xlim=c(-2.5,2.5),ylim=c(0,200),
     main='mean=0, std=0.25, n=1000', xlab='',ylab = '')

multiplier <- histTall$counts / histTall$density
mydensity <- density(groupCenter)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col='green', lwd=3)

abline(h=seq(0,200,50), v=seq(-3,3,0.5), col="gray", lty=3)

# std * 3
groupCenterFlat <- rnorm(1000,mean=meanC,sd=sdC*3)
histFlat <- hist(groupCenterFlat,breaks=20,
     xlim=c(-2.5, 2.5),ylim=c(0,200),
     main='mean=0, std=0.25*3, n=1000', xlab='',ylab = '')
multiplier <- histFlat$counts / histFlat$density
mydensity <- density(groupCenterFlat)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col='green', lwd=3)
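abline(h=seq(0,200,50), v=seq(-2.5,2.5,0.5), col="gray", lty=3)

# restore the default single-panel layout and margins for the later plots
par(mar = c(5, 4, 4, 2) + 0.1, mfrow=c(1,1))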
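The multiplier used above rescales the density curve from the density scale to the count scale. For equal-width bins, counts / density equals n times the bin width, which the following sketch checks numerically. The names x, h, binWidth, scaleFactor and ratio are introduced here for illustration; empty bins are skipped because 0 / 0 is NaN.

```r
set.seed(1)
x <- rnorm(1000, mean = 0, sd = 0.25)
h <- hist(x, breaks = 20, plot = FALSE)

# For equal-width bins: density = counts / (n * binWidth)
binWidth <- diff(h$breaks)[1]
scaleFactor <- length(x) * binWidth

# Compare with counts / density, using non-empty bins only.
nonEmpty <- h$counts > 0
ratio <- h$counts[nonEmpty] / h$density[nonEmpty]
all.equal(ratio, rep(scaleFactor, length(ratio)))
```

Computing the factor as length(x) * binWidth avoids relying on the first bin being non-empty, which multiplier[1] implicitly assumes.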

3.3.2 Skewness

3.3.2.1 Right-skewed

A distribution is called skewed to the right if the data are piled up on the left and the tail extends relatively far out to the right.

set.seed(11)
skewRight <- rbeta(1000,2,5)
histSkewR <- hist(skewRight, breaks=20, main= 'Right-skewed', xlab='', ylab='')
multiplier <- histSkewR$counts / histSkewR$density
mydensity <- density(skewRight)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col='green', lwd=3)
abline(v = c(mean(skewRight),median(skewRight)),  
       col=c("blue", "red"), lty=c(2,2), lwd=c(3, 3))
legend('topright',
       legend = c(paste('Mean', round(mean(skewRight),3),sep='='), 
                  paste('Median', round(median(skewRight),3),sep='=')),
       col=c("blue", "red"), lty=c(2,2))

3.3.2.2 Left-skewed

A distribution is called skewed to the left if the data are piled up on the right and the tail extends relatively far out to the left.

set.seed(111)
skewLeft <- rbeta(1000,5,2)
histSkewL <- hist(skewLeft, breaks=20, main= 'Left-skewed', xlab='', ylab='')
multiplier <- histSkewL$counts / histSkewL$density
mydensity <- density(skewLeft)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col='green', lwd=3)
abline(v = c(mean(skewLeft),median(skewLeft)),  
       col=c("blue", "red"), lty=c(2,2), lwd=c(3, 3))
legend('topleft',
       legend = c(paste('Mean', round(mean(skewLeft),3),sep='='), 
                  paste('Median', round(median(skewLeft),3),sep='=')),
       col=c("blue", "red"), lty=c(2,2))

3.3.2.3 Not skewed

set.seed(1111)
skewSymmetric <- rnorm(1000,mean=0.5,sd=1/6)
histSymmetric <- hist(skewSymmetric, breaks=20, main= 'Symmetrical', xlab='', ylab='')
multiplier <- histSymmetric$counts / histSymmetric$density
mydensity <- density(skewSymmetric)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col='green', lwd=3)
abline(v = c(mean(skewSymmetric),median(skewSymmetric)),  
       col=c("blue", "red"), lty=c(2,2), lwd=c(3, 3))
legend('topleft',
       legend = c(paste('Mean', round(mean(skewSymmetric),3),sep='='), 
                  paste('Median', round(median(skewSymmetric),3),sep='=')),
       col=c("blue", "red"), lty=c(2,2))
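The three plots in this subsection use samples, so the plotted mean and median vary with the seed. As a rough cross-check, the sketch below compares the theoretical mean and median of the distributions the samples were drawn from: Beta(2, 5), Beta(5, 2) and a normal with mean 0.5. The variable names are introduced here for illustration only.

```r
# Right-skewed: Beta(2, 5). The mean of Beta(a, b) is a / (a + b).
meanRight <- 2 / (2 + 5)
medianRight <- qbeta(0.5, 2, 5)
meanRight > medianRight      # TRUE: the mean is pulled towards the right tail

# Left-skewed: Beta(5, 2), the mirror image of Beta(2, 5).
meanLeft <- 5 / (5 + 2)
medianLeft <- qbeta(0.5, 5, 2)
meanLeft < medianLeft        # TRUE: the mean is pulled towards the left tail

# Symmetric: normal with mean 0.5 and sd 1/6; the median equals the mean.
medianSym <- qnorm(0.5, mean = 0.5, sd = 1/6)
medianSym                    # 0.5
```

In each case the mean sits on the side of the longer tail relative to the median, which matches the pattern described for the world gross data.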


  1. You can find this dataset (and many others) on the textbook’s website: https://www.lock5stat.com/datapage.html

  2. Slide 2.2, page 3

  3. It is hard to tell which bin width StatKey uses for its dotplot or histogram, so we use a bin width of 25 (in millions of dollars). The result differs from the plot in the slides, but the distribution has the same shape.