Chapter 1 Stats Starts Here

These notes are meant to supplement, not replace, your textbook. I will occasionally cover topics not in your textbook, and I will stress those topics I feel are most important.

1.1 Types of Data

My definition of statistics:

Statistics is the attempt to use qualitative and quantitative data in order to:

  • describe a situation
  • make comparisons between groups
  • discover a truth

When we collect data, we often organize it into a table. The rows represent cases or units (called subjects or respondents when they are humans), and the columns represent variables that were measured on each case. Examples are the tables given on pages 3 and 7 of your textbook. Another example is given below:

##    Exercise  SAT  GPA Pulse Piercings CodedSex
## 1        10 1210 3.13    54         0        1
## 2         4 1150 2.50    66         3        0
## 3        14 1110 2.55   130         0        1
## 4         3 1120 3.10    78         0        1
## 5         3 1170 2.70    40         6        0
## 6         5 1150 3.20    80         4        0
## 7        10 1320 2.77    94         8        0
## 8        13 1370 3.30    77         0        1
## 9         3 1100 2.80    60         7        0
## 10       12 1370 3.70    94         2        0

This example has both categorical variables (sometimes called qualitative) and quantitative variables. Notice that the variable CodedSex, although it uses numbers, is actually categorical, as the use of the number 0 for female and 1 for male is arbitrary.

The other variables are all quantitative: the students’ GPA, SAT score, Pulse rate, number of hours per week of Exercise, and number of body Piercings.
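As a rough sketch of how such a table might be stored in software, the Python below keeps each variable as a column of values. Python itself, the names `data` and `sex_labels`, and the choice to store columns in a dictionary are my own assumptions for illustration. Only the numbers and the 0 = female, 1 = male coding come from the table above.

```python
# The ten cases from the table above, stored one column per variable.
data = {
    "Exercise":  [10, 4, 14, 3, 3, 5, 10, 13, 3, 12],
    "SAT":       [1210, 1150, 1110, 1120, 1170, 1150, 1320, 1370, 1100, 1370],
    "GPA":       [3.13, 2.50, 2.55, 3.10, 2.70, 3.20, 2.77, 3.30, 2.80, 3.70],
    "Pulse":     [54, 66, 130, 78, 40, 80, 94, 77, 60, 94],
    "Piercings": [0, 3, 0, 0, 6, 4, 8, 0, 7, 2],
    "CodedSex":  [1, 0, 1, 1, 0, 0, 0, 1, 0, 0],
}

# CodedSex is categorical: the numbers are only labels (0 = female, 1 = male).
sex_labels = {0: "female", 1: "male"}
sex = [sex_labels[code] for code in data["CodedSex"]]

# For a categorical variable, the natural summary is a count per category.
counts = {label: sex.count(label) for label in sex_labels.values()}
print(counts)

# Every column has one value per case (row).
n_cases = len(data["GPA"])
print(n_cases)
```

Replacing the codes with the words "female" and "male" loses no information, which is one way to see that CodedSex is categorical even though it is stored as numbers.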

Beyond just qualitative vs quantitative variables, we sometimes talk about four Levels of Measurement, where types of data are arranged from weakest to strongest, in the sense of what sort of statistics can be computed. (NOTE: Much of this is NOT in the textbook)

  • Nominal level: this is the weakest type of data, where the variable is categorical or qualitative in nature. Examples include your sex and your favorite color.

  • Ordinal level: this is the next strongest form of data, and is the first one involving a quantitative variable. Ordinal data involves “ranking”. Examples include when you rank your love of chocolate on a 1-to-10 scale, when you evaluate your instructors on a 1-to-5 scale (often called a Likert scale), and when a football coach ranks his three quarterbacks from 1 to 3, with 1 being the best player, 2 the second best, and 3 the worst.

  • Interval level: this is numerical data that goes beyond just ranking, but where there is no true zero point, so the ratio between two numbers does not make sense. The standard example is temperature. For instance, suppose it is 30 degrees C in one city and 10 degrees C in another. \[\frac{30}{10}=3\] But it does not make sense to say the first city is three times as hot, because if we use Fahrenheit instead, then the two cities are 86 degrees and 50 degrees, respectively, and the ratio is now \[\frac{86}{50}=1.72 \neq 3\]

  • Ratio level: similar to interval level and usually treated the same in computations. Here, there is a fixed zero point and the ratio between two numbers does make sense. For example, suppose the heights of two individuals are 6 feet and 5 feet. We can convert to inches, getting 72 and 60 inches, respectively. Multiply inches by 2.54 to get centimeters, getting 182.88 cm and 152.4 cm. The ratio is the same no matter which units you use.

\[\frac{6}{5}=\frac{72}{60}=\frac{182.88}{152.4}=1.2\]
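The arithmetic behind the interval and ratio examples can be checked with a short Python sketch. The helper name `c_to_f` is my own. The conversion formulas (F = 9C/5 + 32, 12 inches per foot, 2.54 cm per inch) are the standard ones.

```python
import math

def c_to_f(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Interval level: the ratio changes when the units change.
ratio_celsius = 30 / 10
ratio_fahrenheit = c_to_f(30) / c_to_f(10)   # 86 / 50
print(ratio_celsius, ratio_fahrenheit)

# Ratio level: the ratio is the same in feet, inches, and centimeters.
feet = (6, 5)
inches = tuple(f * 12 for f in feet)          # 72 and 60
centimeters = tuple(i * 2.54 for i in inches) # about 182.88 and 152.4

ratios = [feet[0] / feet[1],
          inches[0] / inches[1],
          centimeters[0] / centimeters[1]]
print(ratios)

# Compare with a tolerance, since floating-point division is not exact.
same_ratio = all(math.isclose(r, 1.2) for r in ratios)
print(same_ratio)
```

The temperature ratio changes because converting Celsius to Fahrenheit adds 32, which moves the zero point. Converting feet to inches or centimeters only multiplies, so the zero point stays put and the ratio survives.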

When we use a variable to help understand or predict values of another variable, we call the former the explanatory variable (sometimes called the independent variable, often denoted as \(X\)) and the latter the response variable (sometimes called the dependent variable, \(Y\)).

In the GPAbySex data set, a college admissions officer might wish to predict the college GPA of new students using the SAT scores they earned in high school. The explanatory variable is SAT and the response variable is GPA.

We can also use a categorical variable as the explanatory variable. If the response variable I want to understand is the number of Piercings, I might notice that the female students tend to have a higher number (probably because most women have pierced ears, whereas most men do not) and I could use CodedSex as the explanatory variable.
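To see what using CodedSex as the explanatory variable might look like in practice, here is a minimal Python sketch that compares the mean number of Piercings for the two groups among the ten students in the table from Section 1.1. The variable names are my own, and ten cases are far too few to support any general conclusion.

```python
# Columns copied from the ten-student table (0 = female, 1 = male).
coded_sex = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
piercings = [0, 3, 0, 0, 6, 4, 8, 0, 7, 2]

# Split the response variable (Piercings) by the explanatory variable (CodedSex).
female = [p for s, p in zip(coded_sex, piercings) if s == 0]
male = [p for s, p in zip(coded_sex, piercings) if s == 1]

mean_female = sum(female) / len(female)
mean_male = sum(male) / len(male)

print("female mean:", mean_female)
print("male mean:", mean_male)
```

Among these ten students the two group means are clearly different, which is the pattern described above.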

1.2 Populations and Samples

A population includes all individuals or objects of interest. A census is the collection of data from an entire population. It is usually not possible to conduct a census.

A sample is a subset of a population that we collect data from. We hope to be able to make generalizations about the population based on the sample. If the sample is collected properly, methods of statistical inference can help us with such conclusions. For example, we could determine if a drug is effective in lowering cholesterol or what percent of the population will vote for a political candidate.

A parameter is a numerical characteristic of a population. It is usually not known, so we estimate it by collecting a sample and calculating a statistic.

It is customary (although there are exceptions) to use Greek letters to describe population parameters and Roman letters for sample statistics. The most common example is the mean (average). The population mean is denoted as \(\mu\). Since we typically do not have access to the entire population, we collect data from a sample and estimate the unknown parameter \(\mu\) with a reasonable statistic, such as the sample mean \(\bar{x}\) or “x-bar”, which is just the arithmetic average of the numbers.
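As an illustration, the sample mean \(\bar{x}\) of the ten GPA values in the table from Section 1.1 can be computed as below. Treating those ten students as a sample from some larger population of students is my assumption for the sake of the example. The population mean \(\mu\) remains unknown.

```python
import math

# GPA values of the ten students in the table (treated here as a sample).
gpa = [3.13, 2.50, 2.55, 3.10, 2.70, 3.20, 2.77, 3.30, 2.80, 3.70]

n = len(gpa)          # sample size
x_bar = sum(gpa) / n  # sample mean: add the values, divide by how many there are

print(round(x_bar, 3))
```

The value `x_bar` is a statistic: it would change if a different ten students were sampled, whereas the parameter \(\mu\) is a fixed number.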

Another parameter of interest is the proportion of a population, denoted in some books as \(\pi\) but in other books, such as ours, as \(p\). The sample proportion is \(\hat{p}\) or “p-hat”, which is just the proportion observed in the sample.
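A sample proportion is computed the same way as any other proportion: count the cases of interest and divide by the sample size. The sketch below does this for the proportion of female students among the ten cases in the table from Section 1.1. The variable names are my own, and treating the ten students as a sample is an assumption for illustration.

```python
# CodedSex column from the ten-student table (0 = female, 1 = male).
coded_sex = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

n = len(coded_sex)              # sample size
n_female = coded_sex.count(0)   # number of cases coded as female
p_hat = n_female / n            # sample proportion of females

print(p_hat)
```

Here `p_hat` estimates the unknown proportion \(p\) of females in whatever population the students were drawn from.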