8 Day 7

  • Office hours today

  • Exam 1 is in 3 weeks

Review

  • The Empirical Rule

  • For a population that has an approximately bell-shaped distribution:

  • 68% of the data is within ONE standard deviation of the mean

  • 95%$ of the data is within TWO standard deviations of the mean

  • All or almost all of the data is within THREE standard deviations of the mean


  • z-scores

  • Let x be a value from a population with mean μ

    • The z-score is:

    z=xμσ

    • For a sample:

    z=xˉxs

    • A z-score data value x is the number of standard deviations x is from the mean of the data set

    • z<0 the value of x is less than the mean

    • z=0 the value of x is equal to the mean

    • z>0 the value of x is greater than the mean

Z-Scores and the Empirical Rule:

  • 68% of the data will be between z=1 and z=1

  • 95% of the data will be between z=2 and z=2

  • 100% of the data will be between z=3 and z=3

  • Quartiles

Every data set has three quartiles:

  • 1st quartile, denoted Q1 separates the lowest 25% of the data from the highest 75%

  • 2nd quartile, denoted Q2 separates the lowest 50% of the data from the highest 50% (Q2=Median)

  • 3rd quartile, denoted Q3 separates the lowest 75% of the data from the highest 25%

  • Percentiles:

For a number p between 1 and 99, the pth percentile separates the lowest p% of the data from the highest (100p)%

  • Quartiles separate data into 4 parts

    • Each part is 25% of the data
  • Percentiles divide the data set into 100 parts

  • Five-Number Summary

The five-number summary is a set of five measures of position computed from a data set. The summary consists of:

Min     Q1     Median     Q3     Max

  • Outliers:

    • An outlier is a value that is considerably large or smaller than most of the values in a data set
  • Interquartile Range (IQR)

    • The IQR is a measure of spread that is often used to detect outliers

    • Take the difference between Q1 and Q3:

IQR=Q3Q1

  • Finding outliers:
  1. Define outlier boundaries:

Lower Outlier Boundary=Q11.5IQR Upper Outlier Boundary=Q3+1.5IQR

  1. Check to see if any data is outside of these boundaries:

Upper Boundary<x<Lower Boundary

Boxplots

  • A boxplot or “box-and-whiskers” plot is a graphical display of a five number summary

    • Recall our five-number summary of the exam data:
(#tab:d6iqrex1.7)Five Number Summary
Metric Min Q1 Median Q3 Max
Value 53 70 78 82 91
  • The boxplot for this:

How to Construct a Boxplot

  1. Find the 5 values in the five number summary
  1. Compute the IQR

  2. Find the upper & lower bounds for outliers

  1. Draw a number line to represent the scale

  2. Above the number line, draw a box with one end at Q1 and the other at Q3

  1. Draw a verticle line across the box at the median
  1. Draw horizontal lines (“whiskers”) from the box to the smallest and largest values within the upper & lower outlier bounds

  2. Plot observations outside the bounds with a “star” (*) to identify them as outliers


Example: Boxplot from 5 Number Summary

Recall our five-number summar for Jamie’s commute time:

Table 8.1: Five Number Summary
Metric Min Q1 Median Q3 Max
Value 15 19 21 22 39

The outliers were 31, 36, 38, and 39. Use the information to construct a boxplot:





Skewness and Boxplots

  • Showing the skew of data with a boxplot is relatively intuitive and mimics histograms:

  • This would be negatively-skewed

  • Which is the higher value in these plots? Median or Mean


  • Positively-skewed is the opposite:

  • Median or mean, which is higher?


  • Approximately symmetric (otherwise known as?)

  • What are the stars?



Comparative Boxplots

  • Boxplots are extremely useful for comparing data sets on the same scale

  • Below is annual rainfall data (in inches) in LA during February: 19301974

Year Rainfall Year Rainfall Year Rainfall Year Rainfall Year Rainfall
1930 0.45 1939 1.13 1948 1.29 1957 1.47 1966 1.51
1931 3.25 1940 5.43 1949 1.41 1958 6.46 1967 0.11
1932 5.33 1941 12.42 1950 1.67 1959 3.32 1968 0.49
1933 0.00 1942 1.05 1951 1.48 1960 2.26 1969 8.03
1934 2.04 1943 3.07 1952 0.63 1961 0.15 1970 2.58
1935 2.23 1944 8.65 1953 0.33 1962 11.57 1971 0.67
1936 7.25 1945 3.34 1954 2.98 1963 2.88 1972 0.13
1937 7.87 1946 1.52 1955 0.68 1964 0.00 1973 7.89
1938 9.81 1947 0.86 1956 0.59 1965 0.23 1974 0.14
  • We can compare the data from 19301974 with the daya from 19752019 using boxplots:

  • What can you say about the shape of each dataset?

  • In which time period was the amount of rainfall generally greater?

  • On the whole, the rainfall was more variable in which time period?




  • Take a break



Correlation and Scatterplots

The most basic goal of statistics is to describe the relationship between two variables measured on a sample of individuals from a given population

  • We know what to do with one quantitative variable

    • How about two?

Example: MHK Houses

  • We have a sample of 14 homes for sale in west Manhattan, Kansas (n=14)

  • Each home has a value for its asking price (in dollars) and another for the size of its living space (in square feet)

    • Two variables for each individual in the sample
  • x= size of the living space

  • y= asking price of the home

  • For the ith home, we’ll denote it’s observated values as:

    • xi= the size of the ith home in ft2

    • yi= the asking price of the ith home in dollars

Square Feet Asking Price (in USD)
2761 349900
1824 219000
3362 385000
4048 350000
3016 325000
3768 399900
3072 305000
3815 307500
3213 325000
1963 257500
3507 310000
3386 389000
1896 240000
3206 369900
  • Our data consist of ordered pairs:

(x1,y1)=(2761,349900),...,(x14,y14)=(3206,369900)

  • Data that consist of ordered pairs are called bivariate data


  • How are x and y related in this data?

    • Does size of a house tend to effect price of a house?

    • In our head we have an idea, but the way we can see this visually is called a scatterplot

  • What do we think of the relationship between x and y?

  • We could even describe this relationship with a line

    • We call that a linear association

Scatterplot Definitions

For any two variables we can define their relationship as a:

  • Positive association if large values of one variable are associated with large values of another

  • Negative association if large values of one variable are associated with small values of another


  • Two variables can have a linear relationship if the data tend to cluster around a straight line when plotted on a scatterplot

  • Go away