# STA 135 Notes (Murray State)

*2023-04-07*

# Chapter 1 Stats Starts Here

These notes are meant to *supplement*, not *replace* your textbook. I will occasionally cover topics not in your textbook, and I will stress those topics I feel are most important.

## 1.1 Types of Data

My definition of *statistics*:

Statistics is the attempt to use qualitative and quantitative data in order to:

- describe a situation
- make comparisons between groups
- discover a truth

When we collect data, we will often organize the data into a table where the *rows* represent *cases* or *units*, which are called *subjects* or *respondents* when they are humans, and the *columns* represent *variables* that were measured on each case. Examples are the tables given on pages 3 and 7 of your textbook. Another example is given below:

```
##    Exercise  SAT  GPA Pulse Piercings CodedSex
## 1        10 1210 3.13    54         0        1
## 2         4 1150 2.50    66         3        0
## 3        14 1110 2.55   130         0        1
## 4         3 1120 3.10    78         0        1
## 5         3 1170 2.70    40         6        0
## 6         5 1150 3.20    80         4        0
## 7        10 1320 2.77    94         8        0
## 8        13 1370 3.30    77         0        1
## 9         3 1100 2.80    60         7        0
## 10       12 1370 3.70    94         2        0
```

This example has both **categorical variables** (sometimes called **qualitative**) and **quantitative** variables. Notice that the variable `CodedSex`, although it uses numbers, is actually categorical, as the use of the number 0 for `female` and 1 for `male` is arbitrary.

The other variables are all quantitative: the students’ `GPA`, `SAT` score, `Pulse` rate, number of hours per week of `Exercise`, and number of body `Piercings`.
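As a rough illustration of the difference between the two kinds of variables, the sketch below stores the ten cases from the table in plain Python (chosen here only for illustration). The names `cases`, `mean_pulse`, and `sex_counts` are made up for this example. Averaging makes sense for a quantitative variable such as `Pulse`, while a categorical variable such as `CodedSex` is better summarized by counting how many cases fall in each category.

```python
from collections import Counter
from statistics import mean

# Column names and the ten rows (cases) copied from the table above.
columns = ["Exercise", "SAT", "GPA", "Pulse", "Piercings", "CodedSex"]
rows = [
    (10, 1210, 3.13, 54, 0, 1),
    (4, 1150, 2.50, 66, 3, 0),
    (14, 1110, 2.55, 130, 0, 1),
    (3, 1120, 3.10, 78, 0, 1),
    (3, 1170, 2.70, 40, 6, 0),
    (5, 1150, 3.20, 80, 4, 0),
    (10, 1320, 2.77, 94, 8, 0),
    (13, 1370, 3.30, 77, 0, 1),
    (3, 1100, 2.80, 60, 7, 0),
    (12, 1370, 3.70, 94, 2, 0),
]

# Each case becomes a dictionary mapping variable name -> measured value.
cases = [dict(zip(columns, row)) for row in rows]

# Quantitative variable: an average is meaningful.
mean_pulse = mean(case["Pulse"] for case in cases)

# Categorical variable: the codes 0 and 1 are only labels, so we count them.
sex_labels = {0: "female", 1: "male"}
sex_counts = Counter(sex_labels[case["CodedSex"]] for case in cases)

print(mean_pulse)
print(sex_counts)
```

Averaging the `CodedSex` column would run without error, but the result would depend on the arbitrary choice of 0 and 1 as codes, which is why it is treated as categorical.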

Beyond just qualitative vs. quantitative variables, we sometimes talk about four **Levels of Measurement**, where types of data are arranged from weakest to strongest, in the sense of what sort of statistics can be computed. (NOTE: Much of this is NOT in the textbook.)

**Nominal level**: this is the weakest type of data, where the variable is categorical or qualitative in nature. Examples include your sex and your favorite color.

**Ordinal level**: this is the next strongest form of data, and is the first one involving a quantitative variable. Ordinal data involves “ranking”. Examples include when you rank your love of chocolate on a 1-to-10 scale, when you evaluate your instructors on a 1-to-5 scale (often called a Likert scale), and when a football coach ranks his three quarterbacks from 1 to 3, with 1 being the best player, 2 the second best, and 3 the worst.

**Interval level**: this is numerical data that goes beyond just ranking, but where there is no fixed zero point and the ratio between two numbers does not make sense. The standard example is temperature. For instance, suppose it is 30 degrees C in one city and 10 degrees C in another. \[\frac{30}{10}=3\] But it does not make sense to say the first city is three times hotter, because if we use Fahrenheit instead, then the two cities are 86 degrees and 50 degrees, respectively, and the ratio is now \[\frac{86}{50}=1.72 \neq 3\]

**Ratio level**: similar to interval level and usually treated the same in computations. Here, there is a fixed zero point and the ratio between two numbers does make sense. For example, suppose the heights of two individuals are 6 feet and 5 feet. We can convert to inches, getting 72 and 60 inches, respectively. Multiply inches by 2.54 to get centimeters, getting 182.88 cm and 152.4 cm. The ratio is the same no matter which units you use.

\[\frac{6}{5}=\frac{72}{60}=\frac{182.88}{152.4}=1.2\]
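The arithmetic behind the interval vs. ratio distinction can be checked with a short sketch. The helper names (`celsius_to_fahrenheit`, `feet_to_inches`, `inches_to_cm`) are made up for this illustration; the numbers are the ones used in the temperature and height examples.

```python
import math

def celsius_to_fahrenheit(c):
    # Standard conversion: F = (9/5) * C + 32
    return c * 9 / 5 + 32

def feet_to_inches(feet):
    return feet * 12

def inches_to_cm(inches):
    return inches * 2.54

# Interval level: the ratio changes when the units change.
ratio_celsius = 30 / 10
ratio_fahrenheit = celsius_to_fahrenheit(30) / celsius_to_fahrenheit(10)
print(ratio_celsius, ratio_fahrenheit)

# Ratio level: the ratio is the same in feet, inches, and centimeters.
ratio_feet = 6 / 5
ratio_inches = feet_to_inches(6) / feet_to_inches(5)
ratio_cm = inches_to_cm(feet_to_inches(6)) / inches_to_cm(feet_to_inches(5))

# Compare with a tolerance, since floating-point division may differ in the last digit.
print(math.isclose(ratio_feet, ratio_inches), math.isclose(ratio_feet, ratio_cm))
```

The Fahrenheit conversion adds 32, which shifts the zero point, and that shift is what breaks the ratio. The height conversions only multiply by a constant, so the ratio survives.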

When we use a variable to help understand or predict values of another variable, we call the former the **explanatory variable** (sometimes called the independent variable, often denoted as \(X\)) and the latter the **response variable** (sometimes called the dependent variable, \(Y\)).

In the `GPAbySex` data set, a college admissions officer might wish to predict the college `GPA` of new students, using the `SAT` score that they got when they took this test in high school. The explanatory variable is `SAT` and the response variable is `GPA`.

We can also use a categorical variable as the explanatory variable. If the response variable I want to understand is number of `Piercings`, I might notice that the female students tend to have a higher number (probably because most women have pierced ears, whereas most men do not) and I could use `CodedSex` as the explanatory variable.
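One simple way to see a categorical explanatory variable at work is to compare the average response in each group. The sketch below does this for the ten students shown in the table in Section 1.1, using only the `CodedSex` and `Piercings` columns. The names `data`, `groups`, and `mean_piercings` are made up for this illustration, and ten cases are far too few to draw real conclusions.

```python
from statistics import mean

# (CodedSex, Piercings) pairs for the ten students; 0 = female, 1 = male.
data = [(1, 0), (0, 3), (1, 0), (1, 0), (0, 6),
        (0, 4), (0, 8), (1, 0), (0, 7), (0, 2)]

# Split the response values (Piercings) by the explanatory variable (CodedSex).
groups = {}
for sex, piercings in data:
    groups.setdefault(sex, []).append(piercings)

# Average number of piercings within each group.
mean_piercings = {sex: mean(values) for sex, values in groups.items()}
print(mean_piercings)
```

In these ten rows the six female students average 5 piercings and the four male students average 0, which matches the pattern described above.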

## 1.2 Populations and Samples

A **population** includes all individuals or objects of interest. A **census** is the collection of data from an entire population. It is usually not possible to conduct a census.

A **sample** is a subset of a population that we collect data from. We hope to be able to make generalizations about the population based on the sample. If the sample is collected properly, methods of statistical inference can help us with such conclusions. For example, we could determine if a drug is effective in lowering cholesterol or what percent of the population will vote for a political candidate.

A **parameter** is a numerical characteristic of a population. It is usually not known, so we estimate it by collecting a sample and calculating a **statistic**.

It is customary (although there are exceptions) to use Greek letters to describe population parameters and Roman letters for sample statistics. The most common example is the mean (average). The population mean is denoted as \(\mu\). Since we typically do not have access to the entire population, we collect data from a sample and estimate the unknown parameter \(\mu\) with a reasonable statistic, such as the sample mean \(\bar{x}\) or “x-bar”, which is just the arithmetic average of the numbers.

Another parameter of interest is the proportion of a population, denoted in some books as \(\pi\) but in other books, such as ours, as \(p\). The sample proportion is \(\hat{p}\) or “p-hat”, which is just the proportion computed from the sample.
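To make the two statistics concrete, here is a minimal sketch that computes \(\bar{x}\) and \(\hat{p}\) from the ten students shown in the table in Section 1.1. Treating those ten students as a sample from some larger population of students is an assumption made only for this illustration, and the variable names `gpa`, `coded_sex`, `x_bar`, and `p_hat` are made up here.

```python
from statistics import mean

# GPA and CodedSex columns for the ten students; 0 = female, 1 = male.
gpa = [3.13, 2.50, 2.55, 3.10, 2.70, 3.20, 2.77, 3.30, 2.80, 3.70]
coded_sex = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

# Sample mean: add up the values and divide by the sample size n.
x_bar = mean(gpa)

# Sample proportion of females: number of females divided by n.
p_hat = coded_sex.count(0) / len(coded_sex)

print(round(x_bar, 3), p_hat)
```

Here \(\bar{x}\) is a statistic that estimates the unknown population mean \(\mu\), and \(\hat{p}\) is a statistic that estimates the unknown population proportion \(p\). A different sample of ten students would generally give different values.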