Module 1 Introduction to Data

1.1 Module Overview

What is statistics? There are two ways to think about this:

  1. Facts and data, organized or summarized in such a way that they provide useful information about something.
  2. The science of analyzing, organizing, and summarizing data.

As a field, Statistics provides tools for scientists, practitioners, and laypeople to better understand data. You may find yourself using knowledge from this course in a research lab, while reading a research report, or even while watching the news!

Module Learning Objectives/Outcomes

After completing Module 1, you will:

  1. Understand basic statistical terminology.
  2. Produce data using sampling and experimental design techniques.
  3. Organize and visualize data using techniques for exploratory data analysis.
  4. Identify the shape of a data set.
  5. Understand and interpret graphical displays.

This module’s outcomes correspond to course outcomes (1) organize, summarize, and interpret data in tabular, graphical, and pictorial formats and (2) organize and interpret bivariate data and learn simple linear regression and correlation.

1.2 Statistics Terminology

Statistical methods fall into two broad categories:

  1. Descriptive statistics are methods for describing information.

For example, 66% of eligible voters voted in the 2020 presidential election (the highest turnout since 1900!).

  2. Inferential statistics are methods for drawing inferences (making decisions about something we are uncertain about).

For example, a poll suggests that 75% of voters will select Candidate A. People haven’t voted yet, so we don’t know what will happen, but we could reasonably conclude that Candidate A will win the election.

Data is factual information. We collect data from a population, the collection of all individuals or items a researcher is interested in.

  • Collecting data from an entire population is called a census.
    • This is complicated and expensive! There’s a reason the United States only does a census every 10 years.
  • We can also take a sample, a subset of the population we get data from.
    • If you think of the population as a pie, the sample is a small slice. If it’s a good pie, the small slice will tell you that.

1.3 Data Basics

Data are often organized in what we call a data matrix. If you’ve ever seen data in a spreadsheet, that’s a data matrix!
           Age   Gender   Smoker   Marital Status
Person 1   45    Male     yes      married
Person 2   23    Female   no       single
Person 3   36    Other    no       married
Person 4   29    Female   no       single

Each row represents one observation (also called observational units, cases, or subjects). These are the individuals or items in the sample.

Each column represents a variable, the characteristic or thing being measured. Think of variables as measurements that can vary from one observation to the next.
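As a rough illustration, the data matrix above could be stored in Python as a list of rows. This is only a sketch: the names data_matrix, ages, n_observations, and variables are chosen here for illustration, and the "person" key is assumed to be a row label rather than a measured variable.

```python
# Each dictionary is one row (one observation).
# Each key other than "person" is one column (one variable).
data_matrix = [
    {"person": "Person 1", "age": 45, "gender": "Male", "smoker": "yes", "marital_status": "married"},
    {"person": "Person 2", "age": 23, "gender": "Female", "smoker": "no", "marital_status": "single"},
    {"person": "Person 3", "age": 36, "gender": "Other", "smoker": "no", "marital_status": "married"},
    {"person": "Person 4", "age": 29, "gender": "Female", "smoker": "no", "marital_status": "single"},
]

# Reading down one column collects the values of a single variable.
ages = [row["age"] for row in data_matrix]
print(ages)  # [45, 23, 36, 29]

# The number of rows is the number of observations in the sample.
n_observations = len(data_matrix)

# The variables are the column names, leaving out the row label.
variables = [name for name in data_matrix[0] if name != "person"]
print(n_observations, variables)
```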

There are two types of variables:

  1. Numeric or quantitative variables take numeric values AND it is sensible to do math with those values.
    1. Discrete numeric variables take numeric values with jumps. Typically, this means they can only take whole number values. A count of something is often discrete - counting the number of pets you have, for example.
    2. Continuous numeric variables take values “between the jumps.” Typically, this means they can take decimal values.
  2. Categorical or qualitative variables take values that are categories.

The “Does it make sense?” test:

  • Sometimes, categories can be represented by numbers. Ask yourself if it makes sense to do math with those numbers. If it doesn’t make sense, it’s probably a categorical variable. (Ex: zip codes)
  • If you’re unsure whether a variable is discrete or continuous, pick a number with some decimal places - like 1.83 - and ask yourself if that value makes sense. If it doesn’t, it’s probably discrete. (Ex: number of siblings)

1.4 Sampling

How do we get samples? We want a sample that represents our population. Representative samples reflect the relevant characteristics of our population.

In general, we get representative samples by selecting our samples at random and with an adequate sample size.

A non-representative sample is said to be biased. For example, if we used a sample of chihuahuas to represent all dogs, the sample would be biased.

Biased samples can result from convenience sampling, which means choosing a sample based on ease of access rather than at random.

In our daily lives, common sources of biases are anecdotal evidence and availability bias. Anecdotal evidence is data based on personal experience or observation. Typically this consists of only one or two observations and is NOT representative of the population.

Example: anecdotal evidence. A friend tells you their grandpa smoked a pack of cigarettes a day and lived to be 100. Does this mean that cigarettes will help you live to 100? No!

Availability bias is your brain’s tendency to think that examples of things that come readily to mind are more representative than is actually the case.

Example: availability bias. Shark attacks. Shark attacks are actually extremely uncommon, but the media tends to report on extreme anecdotes, making us more prone to this kind of bias!

We avoid bias by taking random samples. One type of random sample is a simple random sample. We can think of this as “raffle sampling,” like drawing names out of a hat. Each case (or each possible sample) has an equal chance of being selected. Knowing that A is selected doesn’t tell us anything about whether B is selected. Instead of drawing from a hat, we usually use a random number generator on a computer.
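Here is a minimal sketch of “raffle sampling” on a computer, using Python’s built-in random module. The population of 100 numbered people, the sample size of 10, and the seed value are all made up for illustration.

```python
import random

# Hypothetical population: 100 people labeled Person 1 through Person 100.
population = [f"Person {i}" for i in range(1, 101)]

# A seeded generator makes the draw reproducible; the seed value is arbitrary.
rng = random.Random(42)

# Draw 10 people without replacement.
# Every possible group of 10 is equally likely to be chosen.
sample = rng.sample(population, k=10)
print(sample)
```

Because the draw is made without replacement, no one can appear in the sample twice, just as a name drawn from a hat is not put back.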

1.5 Experimental Design

When we do research, we have two options:

  1. Conduct an experiment, where researchers assign treatments to cases.
    1. Treatments are experimental conditions.
    2. In an experiment, cases may also be called experimental units (items or individuals on which the experiment is performed).
  2. Conduct an observational study, where no conditions are assigned. Observational studies are often used when assigning a treatment would be unethical, as when examining the impacts of smoking cigarettes. Can you think of another example?

Experiments allow us to infer causation. Observational studies do not.

Experimental design principles:

  • Control: two or more treatments are compared.
  • Randomization: experimental units are assigned to treatment groups at random.
  • Replication: a large enough sample size is used to test each treatment many times (on many different experimental units).
  • Blocking: if variables other than treatment are likely to have an impact on study outcome, we use blocks.
    • For example, I might separate patients in a medical study into “high risk” and “low risk” blocks. Within the high risk block, I would randomly assign each patient to one of the treatment groups. I would then do the same within the low risk block.
    • This helps ensure an even distribution of high/low risk patients in each treatment group.

An experiment without blocking has a completely randomized design; an experiment with blocking has a randomized block design.
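One way a randomized block design might be carried out is sketched below. The function assign_within_blocks, the patient labels, and the two treatments ("drug" and "placebo") are all hypothetical. Shuffling and then dealing out treatments in turn is just one simple way to randomize within each block.

```python
import random

def assign_within_blocks(units_by_block, treatments, seed=0):
    """Randomly assign units to treatments separately within each block."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    assignment = {}
    for block, units in units_by_block.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # random order within this block only
        # Deal treatments out in turn so each block is split as evenly as possible.
        for i, unit in enumerate(shuffled):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical patients, already separated into blocks by risk level.
patients = {
    "high risk": ["H1", "H2", "H3", "H4"],
    "low risk": ["L1", "L2", "L3", "L4", "L5", "L6"],
}

assignment = assign_within_blocks(patients, ["drug", "placebo"])
for patient, treatment in sorted(assignment.items()):
    print(patient, treatment)
```

With these made-up numbers, each treatment group ends up with two high risk and three low risk patients. A completely randomized design would instead shuffle all ten patients together, so the split by risk level could come out uneven.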

In an experimental setting, we talk about

  • Response variable: the characteristic of the experimental outcome being measured or observed.
  • Factor: a variable whose impact on the response variable is of interest in the experiment.
  • Levels: the possible values of a factor.
  • Treatment: each experimental condition (based on combinations of factor levels).

In human subjects research, we do a little extra work:

  • If subjects do not know what treatment group they are in, the study is called blind.
    • We use a placebo (fake treatment) to achieve this.
  • If neither the subjects nor the researchers who interact with them know the treatment group, it is double blind.

This helps avoid bias caused by the placebo effect, the doctor’s expectations for the outcome, etc.!

1.6 Frequency Distributions

1.6.1 Qualitative Variables

Frequency (count): the number of times a particular value occurs.

A frequency distribution lists each distinct value with its frequency.

Class       Frequency
freshman    12
sophomore   10
junior      3
senior      5

A bar plot is a graphical representation of a frequency distribution. Each bar’s height is based on the frequency of the corresponding category.

The bar plot above shows the class level breakdown for students in an Introductory Statistics course. Take a moment to notice how the bars match up with the frequency distribution above.

Relative frequency is the ratio of the frequency to the total number of observations.

\[ \text{relative frequency} = \frac{\text{frequency}}{\text{number of observations}} \]

This is also called the proportion. The percentage can be obtained by multiplying the proportion by 100.

A relative frequency distribution lists each distinct value with its relative frequency.

Class       Frequency   Relative Frequency
freshman    12          \(12/30 = 0.4\)
sophomore   10          \(10/30 \approx 0.3333\)
junior      3           \(3/30 = 0.1\)
senior      5           \(5/30 \approx 0.1667\)
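The same frequency and relative frequency distributions can be computed with a few lines of Python. This is a sketch: the list classes is rebuilt from the counts in the table, since the original list of students is not given.

```python
from collections import Counter

# Rebuild the 30 observations from the frequencies in the table.
classes = ["freshman"] * 12 + ["sophomore"] * 10 + ["junior"] * 3 + ["senior"] * 5

# Frequency: how many times each distinct value occurs.
frequency = Counter(classes)

# Relative frequency: frequency divided by the number of observations.
n = len(classes)
relative_frequency = {value: count / n for value, count in frequency.items()}

for value, count in frequency.items():
    print(f"{value:<10} {count:>3} {relative_frequency[value]:.4f}")
```

The relative frequencies always add up to 1 (or 100% when written as percentages), which is a quick way to check the arithmetic.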

1.6.2 Quantitative Variables

We can also apply this concept to numeric data. A dot plot is one graphical representation of this. A dot plot shows a number line with dots drawn above the line. Each dot represents a single point.

For example, the dot plot above shows a sample where the number 1 appears 3 times, the number 5 appears 6 times, etc.

We would also like to be able to visualize larger, more complex data sets. We can do this using bins, which group numeric data into equal-width consecutive intervals.

Example: A random sample of weights from 10 men, ages 18-24:

\[\quad 218.1 \quad 151.3 \quad 178.7 \quad 187.0 \quad 165.8 \quad 188.7 \quad 175.4 \quad 182.5 \quad 187.5 \quad 165.0\]

The minimum (smallest value) is 151.3 and the maximum (largest value) is 218.1. There are lots of ways to break these into “bins,” but what about…

  • 150 - 170
  • 170 - 190
  • 190 - 210
  • 210 - 230

Each bin has an equal width of 20, but if someone has a weight of 190, would I use the second or third bin? We need there to be no overlap. Instead, we can use:

Weight        Count
150 - <170    3
170 - <190    6
190 - <210    0
210 - <230    1
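The binning above can be checked with a short sketch. The starting value of 150 and the width of 20 come from the bins in the example; the variable names are chosen here for illustration.

```python
# The 10 weights from the example.
weights = [218.1, 151.3, 178.7, 187.0, 165.8, 188.7, 175.4, 182.5, 187.5, 165.0]

start = 150   # lower edge of the first bin
width = 20    # every bin has the same width
n_bins = 4

counts = [0] * n_bins
for w in weights:
    # Floor division puts a value sitting exactly on an edge into the higher bin,
    # so a weight of 190 would fall in "190 - <210", not "170 - <190".
    index = int((w - start) // width)
    counts[index] += 1

for i, count in enumerate(counts):
    lower = start + i * width
    print(f"{lower} - <{lower + width}: {count}")

print(counts)  # [3, 6, 0, 1]
```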

We will visualize this using a histogram, which is a lot like a bar plot but for numeric data:

This is what we call a frequency histogram because each bar height reflects the frequency of that bin. We can also create a relative frequency histogram which displays the relative frequency instead of the frequency:

Notice that these last two histograms look the same except for the numbers on the vertical axis! This gives us insight into the shape of the data distribution, literally how the values are distributed across the bins. The part of the distribution that “trails off” to one or both sides is called a tail of the distribution.

When a histogram trails off to one side, we say it is skewed (right-skewed if it trails off to the right, left-skewed if it trails off to the left). Data sets with roughly equal tails are symmetric.

We can also use a histogram to identify modes. For numeric data, especially continuous variables, we think of modes as prominent peaks.

  • Unimodal: one prominent peak.
  • Bimodal: two prominent peaks.
  • Multimodal: three or more prominent peaks.

Finally, we can also “smooth out” these histograms and use a smooth curve to examine the shape of the distribution. Below are the smooth curve versions of the distributions shown in the four histograms used to demonstrate skew and symmetry.