# Chapter 1 Introduction to Data

These notes are meant to supplement, not replace, your textbook. I will occasionally cover topics not in your textbook, and I will stress the topics I feel are most important.

## 1.1 Types of Data

My definition of statistics:

Statistics is the attempt to use qualitative and quantitative data in order to:

• describe a situation
• make comparisons between groups
• discover a truth

When we collect data, we will often organize or structure the data into a table (called a data matrix in your book and often referred to as a data frame) where the rows represent cases or units, which are called subjects or respondents when they are humans, and the columns represent variables that were measured on each case. An example is given below:

##    Exercise  SAT  GPA Pulse Piercings CodedSex
## 1        10 1210 3.13    54         0        1
## 2         4 1150 2.50    66         3        0
## 3        14 1110 2.55   130         0        1
## 4         3 1120 3.10    78         0        1
## 5         3 1170 2.70    40         6        0
## 6         5 1150 3.20    80         4        0
## 7        10 1320 2.77    94         8        0
## 8        13 1370 3.30    77         0        1
## 9         3 1100 2.80    60         7        0
## 10       12 1370 3.70    94         2        0

This example has both categorical variables (sometimes called qualitative) and quantitative variables. Notice that the variable CodedSex, although it uses numbers, is actually categorical, as the use of the number 0 for Male and 1 for Female is arbitrary. The categories of a categorical variable are often referred to as levels. Also note that this data set uses a traditional binary classification of gender and is not inclusive of all possible categories.

The other variables are all quantitative: the students’ GPA, SAT score, Pulse rate, number of hours per week of Exercise, and number of body Piercings.
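As a minimal sketch (in Python, which is my choice for illustration and not the software that produced the output above), the first three rows of the table can be stored as a list of records, with each variable tagged by type. The names `SEX_LABELS` and `VARIABLE_TYPES` are illustrative helpers, not part of the data set.

```python
# First three cases from the table above; each dict is one row (case).
rows = [
    {"Exercise": 10, "SAT": 1210, "GPA": 3.13, "Pulse": 54, "Piercings": 0, "CodedSex": 1},
    {"Exercise": 4, "SAT": 1150, "GPA": 2.50, "Pulse": 66, "Piercings": 3, "CodedSex": 0},
    {"Exercise": 14, "SAT": 1110, "GPA": 2.55, "Pulse": 130, "Piercings": 0, "CodedSex": 1},
]

# Illustrative labels: the 0/1 codes are arbitrary, so arithmetic on them is meaningless.
SEX_LABELS = {0: "Male", 1: "Female"}

# Every column is quantitative except the coded categorical variable.
VARIABLE_TYPES = {name: "quantitative" for name in rows[0]}
VARIABLE_TYPES["CodedSex"] = "categorical"

# Replace each numeric code with its level name.
decoded = [SEX_LABELS[row["CodedSex"]] for row in rows]
print(VARIABLE_TYPES)
print(decoded)
```

Decoding the numbers back into level names makes it harder to accidentally average a categorical variable.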

Beyond just qualitative vs quantitative variables, we sometimes talk about four Levels of Measurement, where types of data are arranged from weakest to strongest, in the sense of what sort of statistics can be computed. (NOTE: Much of this is NOT in the textbook)

• Nominal level: this is the weakest type of data, where the variable is categorical or qualitative in nature. Examples include your sex and your favorite color.

• Ordinal level: this is the next strongest form of data, and is the first one involving a quantitative variable. Ordinal data involves “ranking”. Examples include when you ranked your love of chocolate on a 1-to-10 scale, when you evaluate your instructors on a 1-to-5 scale (often called a Likert scale), and when a football coach ranks his three quarterbacks from 1 to 3, with 1 being the best player, 2 the second best, and 3 the worst.

• Interval level: this is numerical data that goes beyond just ranking, but where there is no fixed zero point and the ratio between two numbers does not make sense. The standard example is temperature. For instance, suppose it is 30 degrees C in one city and 10 degrees C in another. The ratio is $\frac{30}{10}=3$, but it does not make sense to say the first city is three times hotter, because if we use Fahrenheit instead, then the two cities are 86 degrees and 50 degrees, respectively, and the ratio is now $\frac{86}{50}=1.72 \neq 3$.

• Ratio level: similar to interval level and usually treated the same in computations. Here, there is a fixed zero point and the ratio between two numbers does make sense. For example, suppose the heights of two individuals are 6 feet and 5 feet. We can convert to inches, getting 72 and 60 inches, respectively. Multiply inches by 2.54 to get centimeters, getting 182.88 cm and 152.4 cm. The ratio is the same no matter which units you use.

$\frac{6}{5}=\frac{72}{60}=\frac{182.88}{152.4}=1.2$
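The interval versus ratio distinction can be checked numerically. The short Python sketch below redoes the temperature and height arithmetic from the two bullets above; the function name `c_to_f` is my own.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Interval level: the ratio changes with the unit, so it carries no meaning.
ratio_c = 30 / 10
ratio_f = c_to_f(30) / c_to_f(10)

# Ratio level: the ratio is the same in feet, inches, and centimeters.
ratio_feet = 6 / 5
ratio_inches = (6 * 12) / (5 * 12)
ratio_cm = (6 * 12 * 2.54) / (5 * 12 * 2.54)

print(ratio_c, round(ratio_f, 2))
print(round(ratio_feet, 2), round(ratio_inches, 2), round(ratio_cm, 2))
```

The Fahrenheit conversion adds 32, which moves the zero point; the height conversions only multiply, which is why the ratio survives.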

When we use a variable to help understand or predict values of another variable, we call the former the explanatory variable (sometimes called the independent variable, often denoted as $$X$$) and the latter the response variable (sometimes called the dependent variable, $$Y$$).

In the GPAbySex data set, a college admissions officer might wish to predict the college GPA of new students, using the SAT score that they got when they took this test in high school. The explanatory variable is SAT and the response variable is GPA. We would probably expect a positive relationship in this situation, with higher SAT scores being associated with higher GPA.

We can also use a categorical variable as the explanatory variable. If the response variable I want to understand is number of Piercings, I might notice that the female students tend to have a higher number (probably because most women have pierced ears, whereas most men do not) and I could use Sex as the explanatory variable. If we had measured the Height of the students, we would probably expect a negative relationship between Height and Piercings, as women tend to be shorter but have more piercings.

## 1.2 Sampling From a Population

https://news.gallup.com/poll/262166/americans-converse-family-matters-politics.aspx

This article “Americans Converse More About Family Matters Than Politics” discusses the results of a recent poll conducted on people’s behaviors involving their conversations with family and friends.

A population includes all individuals or objects of interest. A census is the collection of data from an entire population. It is usually not possible to conduct a census.

A sample is a subset of a population that we collect data from. We hope to be able to make generalizations about the population based on the sample. If the sample is collected properly, methods of statistical inference can help us with such conclusions. For example, we could determine if a drug is effective in lowering cholesterol or what percent of the population will vote for a political candidate.

A parameter is a numerical characteristic of a population. It is usually not known, so we estimate it by collecting a sample and calculating a statistic.

It is customary (although there are exceptions) to use Greek letters to describe population parameters and Roman letters for sample statistics. The most common example is the mean (average). The population mean is denoted as $$\mu$$. Since we typically do not have access to the entire population, we collect data from a sample and estimate the unknown parameter $$\mu$$ with a reasonable statistic, such as the sample mean $$\bar{x}$$ or “x-bar”, which is just the arithmetic average of the numbers.

Another parameter of interest is the proportion of a population, denoted in some books as $$\pi$$ but in other books, such as ours, with $$p$$. The sample proportion is $$\hat{p}$$ or “p-hat”, which is just the proportion observed in the sample.
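To make the notation concrete, here is a small Python sketch that treats the ten students in the Section 1.1 table as a sample and computes a sample mean and a sample proportion. Choosing GPA for the mean and "has at least one piercing" for the proportion is my own illustration.

```python
from statistics import mean

# GPA and Piercings columns for the ten students in the Section 1.1 table.
gpa = [3.13, 2.50, 2.55, 3.10, 2.70, 3.20, 2.77, 3.30, 2.80, 3.70]
piercings = [0, 3, 0, 0, 6, 4, 8, 0, 7, 2]

# x-bar: the arithmetic average of the sampled values.
x_bar = mean(gpa)

# p-hat: the fraction of sampled students with at least one piercing.
p_hat = sum(1 for count in piercings if count > 0) / len(piercings)

print(round(x_bar, 3), p_hat)
```

Both numbers are statistics: a different sample of ten students would give somewhat different values.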

A sample is a portion or subset of a larger population, where the population is the collection of all people/objects/things we would want to make conclusions about. Suppose I wanted to know what proportion of Americans discuss politics with their family. Since it is impossible to survey all Americans, we use a smaller group (often around 1000 Americans) as a sample on which to base any conclusions or inferences that we draw.

Numerical characteristics of samples are called statistics and we generally use Latin letters to represent them. For example, the article says that only 24% of Americans had discussed politics with family/friends in the last week. This is based on a sample, so we say $\hat{p}=0.24$.

If we knew the proportion for an entire population, we often (but not always) will use a Greek letter. From the article https://thriftytraveler.com/us-citizens-passport/, 42% of Americans have a passport. Since the total number of passport holders and the total American population are (more or less) known, it is a population parameter rather than a sample statistic, and we say $\pi=0.42$ or often $p=0.42$ (we often don’t use the Greek letter here since $$\pi$$ has a special meaning as a mathematical constant).

Another common instance of sample statistics and population parameters involves the mean (average). If I ask a sample of $$n=100$$ college students how many texts they sent yesterday, the sample mean could be $\bar{x}=17.2$. If I compute a mean (or average) for an entire population, we would use the Greek letter $$\mu$$. For instance, the mean salary of major league baseball players (a small, known population) is $\mu=4.3$, or \$4.3 million! Source:

https://www.statista.com/statistics/236213/mean-salaray-of-players-in-majpr-league-baseball/ intervals and margins-of-error.

## 1.3 Why Collect a Sample?

Typically in a research setting, it is necessary to collect data. It is crucial to use proper methodology for data collection, or the effort can be in vain.

Data will often be collected with surveys or with an experiment. We will look at a couple of typical scenarios before starting a formal study of the methodology and terminology.

Collecting a Sample of Teachers or a Sample of Voters

Question: In the state of Kentucky, there are about $$N=40000$$ public school teachers at the K-12 level. Suppose it is necessary to collect a sample of $$n=400$$ of those teachers. Explain how you would go about collecting such a sample. Discuss what sort of real-life logistical problems you might encounter in attempting to collect data from this sample.

Question: You are trying to estimate what percent of voters thought that Hillary Clinton or Donald Trump won the first presidential debate. How would you go about collecting such a sample? Again, discuss what sort of real-life logistical problems you might encounter in attempting to collect data from this sample.

## 1.4 Population

Let us reconsider the scenario where we were ‘brainstorming’ on how to collect a sample of size $$n=400$$ from a population of $$N=40000$$ public school teachers.

Here, we have defined our target population as consisting of all individuals that are employed as K-12 public school teachers in the state of Kentucky. This is a large population, and it is unlikely that we would be able to conduct a census and collect data on the entire population. This is why we take a subset of the population, or sample, and compute a statistic to estimate a parameter of interest. For example, maybe we want to know what percentage of public school teachers are aware of Senate Bill 1.

If my target population was even larger, such as a national poll conducted by the Gallup Organization or other company, to see what percentage of the public approves of the President’s job performance, conducting a census is impossible. These companies generally rely on a sample of several hundred to a few thousand registered voters to make a conclusion about the value of this parameter for the nation in general.

It may seem amazing to you, but a relatively small sample, if collected using the methods of probability sampling, can accurately estimate these parameters when a much larger sample collected in a sloppy fashion can be biased and therefore completely worthless.

An excellent article in Scientific American explains how a relatively small sample can be highly accurate, while a larger sample can be worthless.

http://www.scientificamerican.com/article/howcan-a-poll-of-only-100/

## 1.5 Non-Probability Sampling

If a sample is collected in a non-random or nonscientific manner, then we do not know the probability of a member of the population being selected. The data will be biased and may overrepresent some segments of the population and underrepresent others.

A famous historical example was the Literary Digest poll in 1936. This magazine, popular at the time, was trying to determine if President Franklin D. Roosevelt (Democrat) would win re-election, or if he would be defeated by Republican Alf Landon. They included a stamped addressed postcard in each issue of the magazine and had readers mail them in. Several hundred thousand postcards were returned. In addition, many thousands of people were contacted by telephone.

When the results were compiled, about 60% of readers supported Alf Landon. Since you’ve probably never heard of Alf Landon (who was the governor of Kansas), you know that he did not win the election in 1936. Why did this sample give such an inaccurate result?

It turned out that the readership of Literary Digest was mostly fairly wealthy people, especially during the Great Depression. The survey was biased and overrepresented wealthy Americans, who were more likely to be Republicans, and underrepresented the working class and unemployed, who were more likely to be Democrats.
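The contrast between a small probability sample and a large biased one can be illustrated with a simulation. The Python sketch below uses a made-up population of 100,000 voters; the group sizes and support rates are hypothetical numbers chosen for illustration and are not the 1936 figures.

```python
import random

random.seed(1936)  # fixed seed so the sketch is reproducible

# Hypothetical population: each voter is (is_wealthy, supports_incumbent).
population = (
    [(True, True)] * 6_000 + [(True, False)] * 14_000
    + [(False, True)] * 56_000 + [(False, False)] * 24_000
)
true_p = sum(supports for _, supports in population) / len(population)

def support_rate(sample):
    """Proportion of a sample that supports the incumbent."""
    return sum(supports for _, supports in sample) / len(sample)

# Small probability sample: every voter has the same chance of selection.
small_random = random.sample(population, 1_000)

# Large biased sample: only wealthy voters can end up in the sample.
wealthy_only = [voter for voter in population if voter[0]]
large_biased = random.sample(wealthy_only, 10_000)

print(true_p, support_rate(small_random), support_rate(large_biased))
```

Under these assumptions the sample of 1,000 should land close to the true 62% support, while the sample of 10,000 sits near 30% no matter how large it gets, because size does not repair a bad selection method.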

It is also important to consider who responds and who fails to respond (non-response) in a survey. This was an issue in the Literary Digest survey, and the next few slides will show a 21st century example.

It is very common for websites to have surveys to ask the visitors some sort of question. For example, there is usually a poll asking some sort of sports-related question (e.g. who will win the Super Bowl?) at http://www.espn.go.com. These surveys are fun and are meant strictly for entertainment, but no serious conclusions should be drawn from them. They will not have a margin of error reported.

## 1.6 Sampling Mistakes

1. Sample volunteers (voluntary response bias)

2. Sample “Conveniently” (convenience sampling)

3. Use a poor sampling frame (maybe one that misses parts of the population)

4. Undercoverage (portions of the population are undersampled or not sampled at all)

5. Nonresponse bias (I won’t take your call or mail back your survey)

6. Response bias (leading questions; questions about illegal or embarrassing or private activities one doesn’t want to admit)

## 1.7 Probability Sampling

How do we avoid selection bias and end up with a good scientific sample? By using a method based on probability sampling, where each member of the population has a known chance of being included in the sample. There are four main methods of probability sampling, and you likely described some of them when answering the question about how you would take a sample from the population of teachers.

1. Simple Random Sampling

2. Systematic Sampling

3. Stratified Sampling

4. Cluster Sampling

## 1.8 Simple Random Sampling

To conduct a simple random sample, we need to have a sampling frame. A sampling frame is a list of all of the elements of a population; for instance, the names of all 40000 public school teachers.

Each individual is assigned an integer between 1 and $$N$$. A total of $$n$$ distinct random integers between 1 and $$N$$ are drawn, and those individuals make up the simple random sample. If one of the randomly drawn numbers was #33769, and that was Mr. B. Stein, a high school economics teacher, then he would be in the sample.
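A minimal Python sketch of this procedure for the teachers example, assuming the sampling frame has already been numbered from 1 to $$N$$:

```python
import random

N = 40_000  # population size (teachers in the frame)
n = 400     # desired sample size

# Draw n distinct ID numbers from 1..N without replacement;
# every teacher is equally likely to be chosen.
sample_ids = sorted(random.sample(range(1, N + 1), n))

print(sample_ids[:5])  # the first few selected ID numbers
```

Here `random.sample` draws without replacement, so no teacher can be selected twice. Running the code again gives a different set of IDs.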

If we drew many such simple random samples, we would get somewhat different results–this is called sampling variability.

A major advantage of simple random sampling is that instead of allowing participants to select (or not select) themselves, such as in the Literary Digest or Drudge Report examples, impersonal chance determines who is chosen, removing selection bias.

One disadvantage is if a sampling frame is flawed or unavailable. For example, it might be difficult or impossible to compile an accurate list of all teachers in Kentucky. It might be easier to obtain a sampling frame listing all school districts. Another disadvantage is that it can be cumbersome to draw many random numbers, but computer software is typically used to draw random samples.

## 1.9 Systematic Sampling

Instead of drawing a simple random sample, we can draw a 1-in-$$k$$ sample, usually called a systematic sample. We compute $$k=\frac{N}{n}$$, choose a random integer between 1 and $$k$$, and use that individual and every $$k^{th}$$ individual after as the sample.

For the teachers problem, $$k=\frac{40000}{400}=100$$. Choose a random number between 1 and 100. Suppose it is 37; we will sample the 37th teacher, 137th teacher, 237th teacher, and so on.
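The 1-in-$$k$$ rule can be written as a short Python function. This is a sketch that assumes $$N$$ is an exact multiple of $$n$$, as in the teachers example; the function name and the optional `start` argument are my own.

```python
import random

def systematic_sample(N, n, start=None):
    """Return the ID numbers of a 1-in-k systematic sample, where k = N // n."""
    k = N // n
    if start is None:
        start = random.randint(1, k)  # random starting point between 1 and k
    # Take the starting individual and every k-th individual after it.
    return list(range(start, N + 1, k))[:n]

ids = systematic_sample(40_000, 400, start=37)
print(ids[:3], len(ids))
```

Passing `start=37` reproduces the example above; leaving it out picks the starting point at random.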

This can be very convenient in an industrial setting. We might choose to sample every $$200^{th}$$ computer off the assembly line to check for quality control purposes.

Although systematic sampling is convenient to use, it can be flawed if there is a cyclical pattern in the sampling frame. For example, suppose a systematic sample ended up sampling the number of absent students or patients admitted to a hospital in such a way that the entire sample was from the same day of the week (say Friday). I think you can see the potential for inadvertent bias.

## 1.10 Stratified Sampling

Sometimes it makes sense to divide the population into homogeneous groups, which are called strata (the singular is stratum), before the stratified random sample is selected.

For example, in the teachers example, I might divide the population into three groups: elementary teachers (K-5), middle school teachers (6-8), and high school teachers (9-12). Suppose that 50% of the population are elementary teachers, 20% middle school, and 30% high school.

Then three separate simple random samples would be collected. With the desired total sample size of $$n=400$$, we would select $$n_1=200$$ elementary, $$n_2=80$$ middle school, and $$n_3=120$$ high school teachers.
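The allocation above is proportional: each stratum gets the same share of the sample that it has of the population. A Python sketch follows, where the stratum sizes are derived from the stated percentages of $$N=40000$$ and the dictionary keys are my own labels:

```python
import random

n = 400
# Stratum sizes implied by 50% / 20% / 30% of 40,000 teachers.
strata_sizes = {"elementary": 20_000, "middle": 8_000, "high": 12_000}
N = sum(strata_sizes.values())

# Proportional allocation: n_h = n * N_h / N for each stratum h.
allocation = {name: round(n * size / N) for name, size in strata_sizes.items()}

# A separate simple random sample of ID numbers within each stratum.
stratified_sample = {
    name: random.sample(range(1, strata_sizes[name] + 1), allocation[name])
    for name in strata_sizes
}

print(allocation)
```

Other allocation rules exist; proportional allocation is simply the one used in the example above.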

Other categorical variables, such as sex, race, age group, etc. can be used as the stratification variable.

As you can see, collecting a stratified sample as opposed to a simple random sample is a bit more involved. Why do we bother going through the extra trouble, especially since both samples are random?

A big advantage of stratified sampling is that it can reduce the total variability of the sample statistics. In a political poll or opinion survey, this means that we could either have a smaller margin of error (desirable), or we can use a smaller total sample size to obtain a particular margin of error (also desirable).

In addition, we would often be interested in reporting the statistics by strata: for example, if there are differences in opinion between elementary/middle school/high school teachers, between men and women, between different racial groups, etc.

## 1.11 Cluster Sampling

Another sampling technique, which is often confused with stratified sampling, is cluster sampling.

In this method, the population is divided into representative subgroups, which are heterogeneous rather than homogeneous. These subgroups are called clusters. Then, instead of taking a simple random sample of individuals in the population, a simple random sample of the clusters is taken. In the purest form of cluster sampling, everyone in the chosen clusters is used, although in practice it is common to take a simple random sample or stratified sample within the chosen clusters.

In the teachers example, it might be difficult or impossible to get a sampling frame listing all teachers in the state. A frame of all school districts in Kentucky would be easier to obtain. We could randomly select a certain number of school districts and use the teachers at those districts.

Students often confuse stratified and cluster sampling. They appear similar because both methods involve dividing the population into subgroups.

In stratified sampling, the idea is to divide into strata where everyone within a stratum is similar on a characteristic of interest (e.g. teach the same grade level, same sex, same race, etc.). Often this is done to make sure the sample ‘fairly’ represents the population.

In cluster sampling, the idea is to divide into clusters where ideally the individual clusters are microcosms of the population. Then, we can reduce the time and money needed to collect the sample by sampling clusters rather than individuals.
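Here is a Python sketch of the purest form of cluster sampling for the teachers example. The frame of 200 districts with 200 teachers each is entirely hypothetical (real districts differ in size), and the district and teacher labels are made up.

```python
import random

# Hypothetical frame: 200 districts with 200 teachers each (40,000 teachers).
districts = {
    f"district_{d:03d}": [f"teacher_{d:03d}_{t:03d}" for t in range(200)]
    for d in range(200)
}

# Randomly select whole districts (clusters), not individual teachers.
chosen = random.sample(sorted(districts), 2)

# Purest form: every teacher in each chosen district is in the sample.
cluster_sample = [teacher for d in chosen for teacher in districts[d]]

print(chosen, len(cluster_sample))
```

Only a list of districts is needed to draw the sample, which is the practical appeal when a list of all teachers is unavailable.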

## 1.12 Multistage Sampling

Most companies (Gallup, Rasmussen, news organizations, etc.) that take large surveys regularly combine several of these techniques into what is called a multistage sample.

First, cluster sampling might be used to randomly select certain geographic areas; it is common to use census tracts or neighborhood blocks, which are subdivisions used by the U.S. Census Bureau. Then the company or researcher might choose to use stratified sampling within the chosen census tracts on a stratification variable(s) of interest.

These sample designs can be quite complicated, but the intention is to try to gain the advantages of the various methods, especially when a simple random sample is not feasible.

## 1.13 Association

Two variables are associated if values of one variable tend to be related to the values of the other variable. For example, I might notice that students who report studying for more hours per week also tend to be the students with the highest G.P.A.

The variables are causally associated if changing the value of one variable influences the value of the other variable.

This means if we manipulate one variable, we can change the other variable. For example, a farmer might notice that if they increase the amount of fertilizer used on a field, the yield of corn will increase. (Which is the explanatory variable and which is the response variable?)

Of course, this trend might not hold forever. If the farmer uses too much fertilizer, at some point it will actually cause the yield to decrease, as the farmer might actually kill the crop.

Just because two variables are associated doesn’t mean that there is a causal relationship. For example, suppose you own an ice cream store near the beach. You might notice that the months of the year that have the highest ice cream sales are also the months of the year with the highest number of shark attacks! This doesn’t mean eating ice cream before swimming will make sharks more likely to attack. Here, there is a confounding factor, or lurking variable that is associated with both the explanatory variable (shark attacks) and response variable (ice cream sales), in this case the time of year or the temperature. You will sell more ice cream in the summer, when more people go swimming, and thus there is a greater chance of a shark attack.

## 1.14 The Physicians’ Aspirin Study

In the 1980s, a study was designed to see if taking aspirin would be beneficial in reducing the chance of suffering a heart attack. A sample of approximately 22000 middle-aged male physicians was collected. Half were randomly assigned to take aspirin, and the other half a placebo, or inert substance that did not contain aspirin. Both the participants in the study and the doctors & nurses that they interacted with were unaware of whether they were in the treatment group (aspirin) or the control group (placebo).

After the data was collected and analyzed statistically, it was shown that the physicians in the treatment group were significantly less likely to have suffered a heart attack during the time period of the study than those in the control group.

| Condition | Heart Attack | No Heart Attack | Attacks per 1000 |
|---|---|---|---|
| Aspirin | 104 | 10,933 | 9.42 |
| Placebo | 189 | 10,845 | 17.13 |

The rate of heart attacks in the aspirin group was only about 55% of what was observed in the placebo group, since $\frac{9.42}{17.13}=0.55$.
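The rates in the table and their ratio can be recomputed from the raw counts. A short Python sketch, where the variable names are my own:

```python
# (heart attacks, no heart attacks) for each group, from the table above.
counts = {"Aspirin": (104, 10_933), "Placebo": (189, 10_845)}

# Heart attacks per 1000 subjects in each group.
rate_per_1000 = {
    group: 1000 * attacks / (attacks + no_attacks)
    for group, (attacks, no_attacks) in counts.items()
}

# Ratio of the aspirin rate to the placebo rate.
rate_ratio = rate_per_1000["Aspirin"] / rate_per_1000["Placebo"]

print({group: round(rate, 2) for group, rate in rate_per_1000.items()})
print(round(rate_ratio, 2))
```

Each rate divides by its own group total (11,037 and 11,034), since the two groups are not exactly the same size.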

This study was the basis of the common recommendation of taking a low-dose aspirin tablet to reduce one’s chance of having a heart attack. You may have seen the commercials that some aspirin companies have on TV that advertise this fact and encourage viewers to see their doctor to see if they should go on an ‘aspirin regimen’.

A medical journal would include more sophisticated statistics like the risk ratio, odds ratio, confidence intervals with margins of error for these statistics, or the results of a statistical procedure such as the chi-square test. See vassarstats.net/odds2x2.html for an online calculator of these, if you like. Why might it be unlikely to include this information in an article meant for the general public?

The difference in heart attack rates between the two groups would have occurred by chance only about 1 in 100,000 times (this is a statistic called the $$p$$-value, which we cover later).

More details of the study are found here: http://phs.bwh.harvard.edu/phs1.htm

1. Do you think the sample used in this study was collected in an appropriate fashion?

2. If you were going to redo this study, would you change anything about how the sample was collected?

3. Do you have any other concerns about how the study was conducted?

## 1.15 Designed Experiments: Single Group Design

Let us reconsider the Physicians’ Aspirin Study example. How come the researchers didn’t select a single group design for this experiment?

In other words, how come the experiment wasn’t just: gather a large sample of subjects (in that experiment, middle-aged male physicians), give the subjects the treatment (aspirin), and measure to see if the desired outcome (reduction in rate of heart attacks) occurs?

The problem is that the effect of the independent variable (aspirin) on the dependent variable (rate of heart attacks) cannot be separated from other extraneous variable(s). These extraneous variable(s) may or may not even be known to the researchers and are called confounding variables.

The presence of confounding variables would make it impossible to attribute a reduction in heart attack rate to the aspirin. Maybe everyone in the study decided to exercise more or improve their eating habits.

## 1.16 Observational Study

In an observational study, no control is placed on the independent variables. Common examples of observational studies are sample surveys. Even though no control of the independent variables is used, this is often an appropriate design. For example, if I want to know the opinions of the American public on health insurance, I would not want to manipulate the subjects into answering a certain way.

In a planned experiment, the researchers do manipulate, or control, the values of the independent variable(s). In the aspirin study, this was done by randomly assigning the subjects to receive either aspirin or a placebo. If there is a difference in the response between the two comparison groups, we can attribute the difference to the treatment. In the aspirin study, there was a statistically significant decrease in the rate of heart attacks when aspirin was taken.

## 1.17 Single Factor Design

Randomization is typically used in planned experiments to assign experimental units to different levels of an independent variable, or factor. This random selection used to be done with random number tables, but now is typically done via random number generators on computers.

Consider a single factor design. We are studying the effectiveness of a ‘patch’ to help people quit smoking. Suppose we have three levels of the factor: a patch with a high dose of the active agent, a patch with a low dose of the active agent, and a patch with no dose of the active agent, which serves as a placebo. If we had $$n=300$$ subjects, we could randomly assign 100 to each level of the dose and determine what level of dose is most effective.
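One common way to carry out this random assignment on a computer is to shuffle the subject list and cut it into equal pieces. A Python sketch, where the subject ID numbers and level labels are placeholders:

```python
import random

subjects = list(range(1, 301))  # placeholder ID numbers for n = 300 subjects
levels = ["high dose", "low dose", "placebo"]

# Shuffle, then give the first 100 subjects to the first level, and so on.
random.shuffle(subjects)
group_size = len(subjects) // len(levels)
groups = {
    level: subjects[i * group_size:(i + 1) * group_size]
    for i, level in enumerate(levels)
}

print({level: len(members) for level, members in groups.items()})
```

Because the shuffle is random, neither the researcher nor the subject influences who ends up at which dose.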

## 1.18 Types of Randomized Experiments

In a Randomized comparative experiment, we randomly assign cases to different treatment groups and then compare results on the response variable.

In a Matched Pairs experiment, each case gets both treatments in random order and we look at individual differences in the response variable between the two treatments. Sometimes cases get “paired up”, such as pairing you with the person in the study most similar to you. Twins are ideal for this design, with one twin getting one treatment and the other the second treatment.

Now consider a more complicated two factor design, where we want to study both the dose (high/low/placebo) and method of delivery (patch/gum). We have 3 levels of dose (what level of dose is best?) and 2 levels of method (is the patch better/worse than the gum?), so we have $$3 \times 2=6$$ different treatments. With a sample of $$n=300$$, we would randomly assign 50 subjects to each of: patch-high, patch-low, patch-placebo, gum-high, gum-low, gum-placebo. We would determine what combination of dose and method is most effective.

We could have three or more factors as well in such a factorial design.
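The treatments in a factorial design are all combinations of the factor levels, which is easy to enumerate in code. A Python sketch for the dose-by-method example, with placeholder subject IDs:

```python
import random
from itertools import product

methods = ["patch", "gum"]
doses = ["high", "low", "placebo"]

# Every method paired with every dose: 2 x 3 = 6 treatments.
treatments = [f"{method}-{dose}" for method, dose in product(methods, doses)]

# Shuffle the subjects and split them evenly across the treatments.
subjects = list(range(1, 301))
random.shuffle(subjects)
per_group = len(subjects) // len(treatments)
assignment = {
    treatment: subjects[i * per_group:(i + 1) * per_group]
    for i, treatment in enumerate(treatments)
}

print(treatments)
```

Adding a third factor would only mean passing one more list to `product`.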

The designs we have considered so far are completely randomized designs, where the randomization of subjects to groups happens without any restriction. In my smoking example, the researcher would have complete control over whether a subject received the patch or the gum and what level of dose they received.

Often there are extraneous variables, or factors, of interest that we cannot control via randomization. In the smoking example, we might want to compare heavy smokers (more than one pack/day) versus light smokers (less than one pack/day). It would not be appropriate (or possible) to randomly assign subjects to smoke a certain number of cigarettes, so we would use the technique called blocking.

If I had 120 heavy smokers and 180 light smokers in my study, looking at a patch with either a high/low/no dose, I would use the level of smoking as a block, randomly assign 40 heavy smokers to each dosage group, and randomly assign 60 light smokers to each dosage group.
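Blocking means the randomization is done separately inside each block. A Python sketch of the heavy/light smoker example, where the function name and subject labels are my own:

```python
import random

def assign_within_block(block_ids, levels):
    """Randomly split one block's subjects evenly across the treatment levels."""
    ids = list(block_ids)
    random.shuffle(ids)
    size = len(ids) // len(levels)
    return {level: ids[i * size:(i + 1) * size] for i, level in enumerate(levels)}

levels = ["high dose", "low dose", "placebo"]
heavy = [f"heavy_{i}" for i in range(120)]  # placeholder labels
light = [f"light_{i}" for i in range(180)]

# Smoking level is the block; dose is randomized within each block.
design = {
    "heavy": assign_within_block(heavy, levels),
    "light": assign_within_block(light, levels),
}

print({block: {lvl: len(ids) for lvl, ids in groups.items()} for block, groups in design.items()})
```

Every dosage group ends up with 40 heavy and 60 light smokers, so smoking level cannot be confused with the effect of the dose.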

Variables such as sex, age, etc. which cannot be randomly controlled are commonly used as blocking variables. I could randomly assign Emma & Josh to the treatment group and Jenni & Kyle to the control group, but I cannot randomly assign them to a gender!

## 1.19 Single- and Double-Blind Studies

Often in medical studies, particularly clinical trials involving a placebo, a single-blind or double-blind experiment is used.

In a single-blind study, the subjects do not know if they are receiving the treatment or placebo (i.e. they do not know if they are taking real or fake medicine). Knowledge of whether you are in the treatment group or not could affect the accuracy and the integrity of the trial.

A double-blind study extends this concept to keeping the treatment providers (such as nurses and physicians) blind as well. This is to prevent any conscious or subconscious bias in the way these providers deal with the subjects in the study.

The Physicians’ Aspirin Study was a double-blind study; the aspirin/placebo was mailed to the subjects by a third party that never directly interacted with the subjects.

It is not always desirable or possible to utilize single- or double-blindness in a study. Several years ago, I had appendicitis and had surgery to remove my appendix. I underwent laparoscopic surgery, as opposed to the more invasive ‘open’ appendectomy.

There were studies published when the laparoscopic technique was first developed, comparing it to the traditional open appendectomy. To my knowledge, these studies were based on observational studies and not on a randomized experiment.

Theoretically, it would be possible, if probably not ethical, to have a randomized experiment. I could have agreed to let randomization decide which technique was used for my surgery, therefore having a single-blind experiment. I doubt I would have agreed to such a plan, whereas I could imagine myself participating in a clinical trial for a drug (such as the aspirin study).

Double-blindness would not be possible. The surgical team would need to know what type of surgery they would perform.

## 1.20 Summarizing and Displaying Categorical Variables

Descriptive statistics is the branch of statistics that involves ‘describing a situation’. Most of the methods used in descriptive statistics are relatively simple, such as finding averages or constructing a graph.

A distribution consists of the values that a random variable takes on and how often it takes those values. We will use three techniques to describe a distribution:

1. Table
2. Graph
3. Function (i.e. mathematical formula)

We will look at a number of ways to display data in a table or graph.

1. Frequency Table
2. Bar Chart & Pie Chart
3. Histogram
4. Stemplot
5. Scatterplot

A one-way frequency table shows the tabulated results for each value of a variable. It is often used for data at the nominal level. The table below shows the home state of Murray State students (from the 2015-2016 Murray State Fact Book, page 55 https://www.murraystate.edu/Libraries/Institutional_Research/factbook2015.pdf ).

| State | Frequency | Relative Frequency |
|---|---|---|
| Kentucky | 7430 | 67.6% |
| Tennessee | 824 | 7.5% |
| Illinois | 791 | 7.2% |
| Other U.S. | 1179 | 10.7% |
| International | 775 | 7.0% |
| Total | 10998 | 100.0% |
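The relative frequency column is just each count divided by the total. Here is a minimal sketch in Python (used only for illustration; the variable names are my own). One caveat: the five counts as printed add up to 10999, one more than the printed total of 10998. The rounded percentages come out the same either way.

```python
# Counts copied from the home-state table above.
counts = {
    "Kentucky": 7430,
    "Tennessee": 824,
    "Illinois": 791,
    "Other U.S.": 1179,
    "International": 775,
}

# Sum of the printed counts (10999, not the printed total of 10998).
total = sum(counts.values())

# Relative frequency = count / total, expressed as a percentage.
relative = {state: 100 * n / total for state, n in counts.items()}

for state, pct in relative.items():
    print(f"{state:<14}{counts[state]:>6}{pct:>7.1f}%")
```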

Part of the data table from the class survey is presented below. Each row represents a student, and each column is a variable (one of the questions you were asked).

##   Class    Sex  Color Texts Chocolate Height Temperature Applebee Jasmine
## 1  9:30   Male  Green    10         3     70          86       NA      NA
## 2  9:30 Female Purple     2         3     67          76        1       3
## 3  9:30 Female  Green     2        10     67          90        3       1
## 4  9:30 Female   Blue     4         7     66          85        3       2
## 5  9:30   Male Purple    30         6     67          90        3       1
## 6  9:30 Female    Red     4         7     67          83        3       2
##   LosPortales
## 1          NA
## 2           2
## 3           2
## 4           1
## 5           2
## 6           1
## [1] "Class Time"
##
## 10:30  9:30
##    33    38
## [1] "Favorite Color"
##
##   BabyBlue      Black       Blue       Gray      Green     Maroon     Orange
##          1          2         24          1          8          1          2
## Periwinkle       Pink     Purple        Red       Teal     Yellow
##          1          4         13          6          1          7
## [1] "Chocolate"
##
##  2  3  5  6  7  8  9 10
##  2  5  5 13 12 12  7 15
## [1] "Applebee's"
##
##  1  2  3
##  8 18 33
## [1] "Jasmine"
##
##  1  2  3
## 24 13 22
## [1] "Los Portales"
##
##  1  2  3
## 27 28  4

Los Portales seems to be your favorite restaurant, and very few of you ranked it as your least favorite of the three choices. I used the value NA to represent missing data values; there was one temperature I could not read, and several people misinterpreted the restaurant question and told me their 3 favorites (like Cracker Barrel, Mister B’s, Cookout). Height is given in inches and temperature in degrees Fahrenheit; I converted if you gave me an answer in a different unit.

How might the results have been different if I had asked you to rate each of the three restaurants on a 1-to-10 scale, like I did with chocolate?

The data from the frequency table can be displayed graphically with either a bar chart or a pie chart. By the way, experts in statistical graphics hate pie charts!

Suppose I want to draw a graph based on how many text messages students sent. The usual way to do this is a variation of a bar chart called a histogram. We choose a bin width (the width of each interval), usually keeping the bins equally wide and aiming for roughly 5 to 10 bars. Below are a few examples, with different choices for the bin width or number of bars.

For the first histogram, I let my software choose the bins. It chose to use 30 intervals, so each interval is $$100/30=3.33$$ texts wide, which is inconvenient.

The second histogram is one where I specified the binwidth to be 10, probably a better choice.
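To make the idea of a bin width concrete, here is a rough Python sketch that counts how many values fall in each interval of width 10. The data are the text message counts read off the stemplot that appears later in this section. The variable names and the text-based display are my own choices, not how the histograms in these notes were drawn.

```python
from collections import Counter

# Text message counts, entered as stem -> string of leaves
# (copied from the stemplot of text messages in these notes).
stemplot = {
    0: "00000011112222223344444555556667788889999",
    1: "0000122236777",
    2: "00000333788",
    3: "0234",
    8: "7",
    10: "0",
}
texts = [10 * stem + int(leaf) for stem, leaves in stemplot.items() for leaf in leaves]

bin_width = 10

# Each value belongs to the bin starting at the largest multiple of
# bin_width that does not exceed it: 0-9, 10-19, 20-29, and so on.
bin_counts = Counter((value // bin_width) * bin_width for value in texts)

for start in range(0, max(texts) + bin_width, bin_width):
    count = bin_counts.get(start, 0)
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * count} ({count})")
```

Changing `bin_width` to another value shows how much the picture depends on that choice.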

A common way to display a quantitative data set is with a stem-and-leaf plot, also known as a stemplot. Each data value is divided into two pieces. The leaf consists of the final significant digit, and the stem the remaining digits.

If we were going to do a stemplot of ages and we had a 42 year-old man, the stem would be the ‘4’ and the leaf the ‘2’. If we had a 9 year-old girl, the stem would be ‘0’ and the leaf the ‘9’.
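The stem/leaf split can be sketched in a few lines of Python. The ages below are hypothetical (only the 42 and the 9 come from the example above), and `stem_and_leaf` is a helper name I made up for illustration. It assumes non-negative whole numbers with a leaf unit of 1.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit),
    collecting the leaves (last digits) in increasing order."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # 42 -> stem 4, leaf 2
        groups[stem].append(leaf)
    return dict(groups)

# Hypothetical ages, including the 42 year-old and the 9 year-old.
ages = [42, 9, 35, 47, 41, 18, 9, 63]
plot = stem_and_leaf(ages)

# Print every stem from smallest to largest, even the empty ones,
# so that gaps in the data remain visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```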

Here is a stemplot for the number of text messages you sent.

## 1 | 2: represents 12
##  leaf unit: 1
##             n: 71
##   (41)    0 | 00000011112222223344444555556667788889999
##    30     1 | 0000122236777
##    17     2 | 00000333788
##     6     3 | 0234
##           4 |
##           5 |
##           6 |
##           7 |
##     2     8 | 7
##           9 |
##     1    10 | 0
## ________________________________________________________________
##   1 | 2: represents 12, leaf unit: 1
##          Texts[Sex == "Male"]      Texts[Sex == "Female"]
## ________________________________________________________________
##   (20)   98765554332221100000|  0 |011222444455667888999  (21)
##    12                   21000|  1 |02236777                18
##     7                    7300|  2 |0003388                 10
##     3                       0|  3 |234                      3
##                              |  4 |
##                              |  5 |
##                              |  6 |
##                              |  7 |
##     2                       7|  8 |
##                              |  9 |
##     1                       0| 10 |
## ________________________________________________________________
## n:                         32      39
## ________________________________________________________________

## 1.21 Shape

In addition to the center and variability of a distribution, we are interested in shape. A distribution that has the property that the part of the distribution below the median matches the part above the median is said to be symmetric.

The mean will equal the median when the distribution is symmetric. The well-known normal distribution is symmetric.

A distribution that is not symmetric, but has the property that the mean is greater than the median and has a ‘tail’ of low probability to the right is said to be right-skewed, or positively skewed.

The income of a sample of Americans would probably be right-skewed, as a small percentage of people make very large salaries.

A distribution that is not symmetric, but has the property that the mean is less than the median and has a ‘tail’ of low probability to the left is said to be left-skewed, or negatively skewed.

The ages of patients suffering from Alzheimer’s disease would be left-skewed, as a small percentage of younger people have an early onset of the disorder.
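As a quick illustration of how a long right tail pulls the mean above the median, here is a small Python sketch. The incomes are made-up numbers (in thousands of dollars), chosen only so that one very large value sits far to the right of the rest.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars; the 450 is the right 'tail'.
incomes = [28, 31, 35, 40, 42, 47, 52, 60, 75, 450]

income_mean = mean(incomes)      # pulled upward by the single large value
income_median = median(incomes)  # average of the two middle values

print("mean:  ", income_mean)
print("median:", income_median)
print("mean > median:", income_mean > income_median)
```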

## 1.22 Measures of Central Tendency

We often are interested in the central value, or ‘average’, of a distribution. Common statistics used to measure the center of a distribution include:

1. mean
2. median
3. mode

The mean (or arithmetic mean) of a dataset is just the average, computed as you would expect (i.e. add up the values and divide by sample size).

The mean of a sample is referred to as a statistic (a characteristic of a sample) and is computed as $\bar{x}=\frac{\sum x}{n}$

The mean of a population is referred to as a parameter (a characteristic of a population) and is computed as $\mu=\frac{\sum x}{n}$

Notice we are using Greek letters for population parameters. This is the usual convention (although exceptions do exist).

The median is the middle value in an ordered data set. The median is often used if outliers (i.e. unusually small or large values) exist in the dataset. Outliers will affect the mean more than the median. We say the median is more resistant, or robust, to the effect of outliers.

The median is computed by taking the $$\frac{n+1}{2}$$th ordered data value, averaging the two middle values if $$\frac{n+1}{2}$$ is not an integer.

The mode is the most frequent value in the data set. A data set may have multiple modes, and the mode may not necessarily be found in the center of the distribution.

Here, instead of using data I collected from the class, the data below are hypothetical exam scores that a class with $$n=35$$ students might have earned.

## 1 | 2: represents 12
##  leaf unit: 1
##             n: 35
##    1     2 | 9
##          3 |
##    4     4 | 579
##    5     5 | 3
##   11     6 | 233578
##   (8)    7 | 11124578
##   16     8 | 011236677789
##    4     9 | 367
##    1    10 | 0

Mean: $$\bar{x}=\frac{2603}{35}=74.4$$

Median: $$M=77$$ (the $$\frac{35+1}{2}=18$$th ordered value)

Mode: 71 and 87 (both occur three times)
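These three values can be checked with Python's standard `statistics` module. This is only a sketch for verification; the scores are typed in from the stemplot above.

```python
from statistics import mean, median, multimode

# The 35 hypothetical exam scores, read off the stemplot.
scores = [29, 45, 47, 49, 53, 62, 63, 63, 65, 67, 68,
          71, 71, 71, 72, 74, 75, 77, 78,
          80, 81, 81, 82, 83, 86, 86, 87, 87, 87, 88, 89,
          93, 96, 97, 100]

n = len(scores)
print("n      =", n)
print("sum    =", sum(scores))
print("mean   =", round(mean(scores), 1))
print("median =", median(scores))    # the (n + 1) / 2 = 18th ordered value
print("modes  =", multimode(scores)) # every value tied for most frequent
```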

## 1.23 Measures of Location and Variability

In addition to central tendency, we are also interested in the amount of spread, or variability, in the data. Are the data clustered close to the mean or is there a wide range?

Statistics for measuring variability include:

1. Range
2. Five-Number Summary and IQR
3. Variance and Standard Deviation

The five-number summary is a set of five statistics that gives information about the center, spread, and shape of a distribution. It consists of the following values:

1. Minimum ($$Min$$)

2. First Quartile ($$Q_1$$)

3. Median ($$M$$)

4. Third Quartile ($$Q_3$$)

5. Maximum ($$Max$$)

We will compute the quartiles by hand using the simplest method, which matches how a TI-83 or TI-84 calculator computes them. Some software will compute the quartiles with a more complex method and get slightly different answers.

We already know how to compute the median, and the minimum and maximum are just the smallest and largest values in the data set.

The first quartile, $$Q_1$$, is the 25th percentile of the data set. We will compute it by taking the median of the lower half of the data set. For example, the exam scores data set has $$n=35$$ data values. The median, $$M=77$$, was the 18th data value. So $$Q_1$$ will be the median of the 17 values below 77.

Similarly, the third quartile, $$Q_3$$, is the 75th percentile of the data set. It is the median of the data values greater than the median.

The five-number summary of the exam scores data set is:

1. $$Min=29$$ (the 1st or minimum value)

2. $$Q_1=65$$ (the 9th value)

3. $$M=77$$ (the 18th or middle value)

4. $$Q_3=87$$ (the 27th value)

5. $$Max=100$$ (the 35th or largest value)

The range is a simple measure of variability that is $$Range=Max-Min$$. The interquartile range is $$IQR=Q_3-Q_1$$, which measures the variability of the middle 50% of the distribution. Here, $$Range=100-29=71$$ and $$IQR=87-65=22$$.
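Here is a rough Python sketch of the by-hand quartile method described above: split the ordered data at the median, leaving the median itself out when $$n$$ is odd, and take the median of each half. The function name `five_number_summary` is my own. Other software may use a different quartile rule and give slightly different answers.

```python
from statistics import median

def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + (n % 2):]  # values above it (skip the median if n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]

# The 35 hypothetical exam scores, read off the stemplot.
scores = [29, 45, 47, 49, 53, 62, 63, 63, 65, 67, 68,
          71, 71, 71, 72, 74, 75, 77, 78,
          80, 81, 81, 82, 83, 86, 86, 87, 87, 87, 88, 89,
          93, 96, 97, 100]

minimum, q1, m, q3, maximum = five_number_summary(scores)
data_range = maximum - minimum
iqr = q3 - q1

print("five-number summary:", minimum, q1, m, q3, maximum)
print("range =", data_range, " IQR =", iqr)
```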

The famous statistician John Tukey proposed a simple rule for determining whether a point in a data set is small or large enough to be an outlier.

First, compute the Step, where $$Step=1.5 \times IQR$$. In the exam scores example, $$Step=1.5 \times 22=33$$.

Next, subtract the step from the first quartile. $$Q_1-Step=65-33=32$$. Any exam scores below 32 are outliers. In our problem, the score of 29 is an outlier on the low end.

Finally, add the step to the third quartile. $$Q_3+Step=87+33=120$$. Any exam scores above 120 are outliers. In our problem, no points qualify as outliers on the high end. These values 32 and 120 are sometimes called fences; outliers are ‘outside’ the fences.
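The fence calculation is easy to automate. In this minimal Python sketch, the quartiles $$Q_1=65$$ and $$Q_3=87$$ are taken from the five-number summary above, and `tukey_fences` is a helper name I made up.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences: one step of k * IQR beyond each quartile."""
    step = k * (q3 - q1)
    return q1 - step, q3 + step

# The 35 hypothetical exam scores, read off the stemplot.
scores = [29, 45, 47, 49, 53, 62, 63, 63, 65, 67, 68,
          71, 71, 71, 72, 74, 75, 77, 78,
          80, 81, 81, 82, 83, 86, 86, 87, 87, 87, 88, 89,
          93, 96, 97, 100]

lower_fence, upper_fence = tukey_fences(65, 87)

# Outliers are the values that fall outside the fences.
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```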

The boxplot is a graph that displays information from the five-number summary, along with outliers. The vertical, or y-axis, has the range of data values. Horizontal lines are drawn at the first quartile, median, and third quartile, and are connected with vertical lines to form a ‘box’. Sometimes the boxplot is oriented such that the x-axis is used to display the range of values rather than the y-axis.

‘Whiskers’ are vertical lines that are drawn from the quartiles to the smallest/largest values that are NOT outliers.

Points that are outliers are displayed with a symbol such as an asterisk or circle to clearly identify their outlier status.

The student who got a very low score of 29 is indicated as an outlier.

A weakness of the range is that it only uses the two most extreme values in the data set. The IQR is better, but it would be preferable to have a statistic that uses all values in the data set in an effort to measure the ‘average deviation’ or distance from the mean.

The deviation is defined as $$x-\bar{x}$$. For example, if Dr. X is 42 and the average college professor is 48, the deviation is 42-48=-6, or Dr. X is six years younger than average.

Unfortunately, the sum of deviations, $$\sum (x-\bar{x})$$, will equal zero for all data sets. Therefore, we cannot just compute the ‘average deviation’.
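To see why, recall that $$\bar{x}=\frac{\sum x}{n}$$, so $$\sum x=n\bar{x}$$. Adding the constant $$\bar{x}$$ a total of $$n$$ times also gives $$n\bar{x}$$, so:

$\sum (x-\bar{x})=\sum x-\sum \bar{x}=n\bar{x}-n\bar{x}=0$

The positive and negative deviations always cancel exactly, no matter what the data look like.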

Occasionally we take the absolute value of the deviations, but the standard method for computing variability is based on squared deviations.

If we have a sample of data (i.e. a portion of a larger population), the variance is computed as: $s^2=\frac{\sum (x-\bar{x})^2}{n-1}$

Less commonly, if our data represents an entire population, the variance is: $\sigma^2=\frac{\sum (x-\mu)^2}{n}$

The standard deviation (either $$s$$ or $$\sigma$$) is the square root of variance.

Here is an example of computing the variance for a sample of $$n=5$$ ages, where $$\bar{x}=44.2$$ years.

| $$x$$ | $$x-\bar{x}$$ | $$(x-\bar{x})^2$$ |
|---|---|---|
| 29 | -15.2 | 231.04 |
| 35 | -9.2 | 84.64 |
| 42 | -2.2 | 4.84 |
| 50 | 5.8 | 33.64 |
| 65 | 20.8 | 432.64 |
| $$\sum$$: 221 | 0 | 786.80 |

So the variance is $s^2=\frac{786.8}{5-1}=196.7$ where the unit is years squared.

It is usually more convenient to take the square root. The standard deviation is $$s=\sqrt{196.7}=14.02$$ years.
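The same calculation can be followed step by step in Python. This is just a sketch that mirrors the columns of the table; the variable names are mine, and the last two lines compare against the built-in sample variance and standard deviation from the `statistics` module.

```python
from math import sqrt
from statistics import stdev, variance

ages = [29, 35, 42, 50, 65]
n = len(ages)

x_bar = sum(ages) / n                       # sample mean
deviations = [x - x_bar for x in ages]      # the x - x_bar column
squared = [d ** 2 for d in deviations]      # the (x - x_bar)^2 column

s2 = sum(squared) / (n - 1)                 # sample variance, in years squared
s = sqrt(s2)                                # sample standard deviation, in years

print("mean =", x_bar)
print("sum of deviations =", round(sum(deviations), 10))
print("variance =", round(s2, 1), " sd =", round(s, 2))

# Cross-check against the library versions, which also divide by n - 1.
print("statistics.variance =", round(variance(ages), 1))
print("statistics.stdev    =", round(stdev(ages), 2))
```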

Obviously we want to use technology for large data sets.

## 1.24 Contingency Tables

A contingency table or cross-tabulation (shortened to cross-tab) is a frequency distribution table that displays information about two variables simultaneously. Usually these variables are categorical factors but can be numerical variables that have been grouped together. For example, we might have one variable represent the sex of a customer in the store and the second variable be age, where age groups such as 18-29, 30-44, 45-64, 65+ are used.

Our hypothetical example looks at the ice cream preferences of a sample of people from two different states. I constructed a stacked bar chart to try to show the differences in ice cream preference between the states. I think you can see that the blue portion (for Dippin’ Dots) is bigger in Kentucky than Vermont, and the red portion (for Ben and Jerry’s) is larger for Vermont than Kentucky.

I also included pie charts for both states, although I think they are an inferior choice to the stacked bar charts.

##          Ben and Jerry Dairy Queen Dippin' Dots
## Kentucky            25          38           57
## Vermont             48          42           36
## # A tibble: 6 × 4
## # Groups:   State [2]
##   State    Store          Freq  Prop
##   <chr>    <chr>         <dbl> <dbl>
## 1 Kentucky Ben and Jerry    25 0.208
## 2 Vermont  Ben and Jerry    48 0.381
## 3 Kentucky Dairy Queen      38 0.317
## 4 Vermont  Dairy Queen      42 0.333
## 5 Kentucky Dippin' Dots     57 0.475
## 6 Vermont  Dippin' Dots     36 0.286

We can get marginal totals and percentages for both the rows (state) and columns (ice cream). We can also get conditional percentages for ice cream preference based on state; if these sets of percentages are the same for the two states, then we would say that state and ice cream preference are independent (they aren’t here).

Here’s an example regarding the gender of STA 135 students from a previous semester and whether Los Portales was your favorite (number one choice) of the three restaurants, given as TRUE if it was your favorite and FALSE if not.

##
##          FALSE TRUE
##   Female    19   17
##   Male      13   10

This table is much closer to being statistically independent than the ice cream table.

## 1.25 Marginal Distributions

A marginal distribution is just another term for the distribution of a single variable at a time. Let’s take the contingency table for whether or not Los Portales was a student’s favorite restaurant by the student’s reported gender.

Let’s look at the row totals and the marginal distribution for gender (the row variable) first. There are $$19+17=36$$ females and $$13+10=23$$ males in the sample, with a total of $$n=59$$ students. The marginal distribution is:

$\text{Female: } \frac{36}{59}=0.610=61.0\%$

$\text{Male: } \frac{23}{59}=0.390=39.0\%$

Do the same with the columns to get the marginal distribution based on whether Los Portales was your favorite restaurant in town. $$19+13=32$$ students said it was not their favorite (FALSE), while $$17+10=27$$ said that it was their favorite (TRUE).

$\text{Not favorite (FALSE):} \frac{32}{59}=54.2\%$

$\text{Favorite (TRUE):} \frac{27}{59}=45.8\%$
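As a minimal Python sketch, the marginal distributions come from the row totals and column totals of the contingency table, each divided by $$n$$. The nested-dictionary layout and variable names are my own choices for illustration.

```python
# Contingency table: gender (rows) by whether Los Portales was the favorite (columns).
table = {
    "Female": {"FALSE": 19, "TRUE": 17},
    "Male": {"FALSE": 13, "TRUE": 10},
}

# Row totals: add across each row.
row_totals = {gender: sum(row.values()) for gender, row in table.items()}

# Column totals: add down each column.
col_totals = {}
for row in table.values():
    for answer, count in row.items():
        col_totals[answer] = col_totals.get(answer, 0) + count

n = sum(row_totals.values())

# Marginal distributions: each total divided by the sample size.
row_marginal = {gender: total / n for gender, total in row_totals.items()}
col_marginal = {answer: total / n for answer, total in col_totals.items()}

for gender, p in row_marginal.items():
    print(f"{gender}: {100 * p:.1f}%")
for answer, p in col_marginal.items():
    print(f"{answer}: {100 * p:.1f}%")
```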

## 1.26 Conditional Distributions

A conditional distribution looks at the percentages for a variable GIVEN that the other variable takes on a specific value. In political polling, it is very common to look at conditional distributions to see who is the favorite candidate based on a variable such as gender, race, region of country, etc. It is often of interest to know, for example, if one candidate is more or less popular when comparing men versus women, whites versus people of color, etc.

Here, we’ll see if there is a difference between female and male students in whether they ranked Los Portales as their favorite restaurant.

$\text{Favorite, given Female: } \frac{17}{36}=47.2\%$

$\text{Favorite, given Male: } \frac{10}{23}=43.5\%$

If the two variables are statistically independent, these two percentages would be exactly the same. They aren’t exactly equal, but are pretty close, so it doesn’t seem like there’s a meaningful difference between female and male students in terms of liking Los Portales.
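A conditional distribution divides each cell by its own row total rather than by the overall $$n$$. Here is a short Python sketch using the same table; the layout and names are again my own.

```python
# Contingency table: gender (rows) by whether Los Portales was the favorite (columns).
table = {
    "Female": {"FALSE": 19, "TRUE": 17},
    "Male": {"FALSE": 13, "TRUE": 10},
}

# Condition on gender: divide each count by the total for that gender only.
conditional = {
    gender: {answer: count / sum(row.values()) for answer, count in row.items()}
    for gender, row in table.items()
}

for gender, dist in conditional.items():
    print(f"Favorite, given {gender}: {100 * dist['TRUE']:.1f}%")

# Difference between the two conditional percentages, in percentage points.
gap = 100 * (conditional["Female"]["TRUE"] - conditional["Male"]["TRUE"])
print(f"Difference: {gap:.1f} percentage points")
```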