7 Midterm Review

So far in this class we have covered the basics of probability theory, random variables, and principles of sampling.

What was the point of all this? What’s the big picture?

We started the class talking through the relationship between samples and the population. The population is an unknowable thing, and we want to use a single sample of data to make inferences about that population. At first blush that seems impossible. I get a survey of 5000 people where 43% support the President… what can I possibly say about every American?

Our journey through probability helped us to re-frame that question.

Critically, we learned how to use the rules and axioms of probability to model variation due to random chance. These rules let us generate random variables, which are descriptions of the various events that can happen, each with an attached probability. Once we understand the potential variation due to random chance, we can assess the likelihood of observing our data given an assumption about the population.

For example, we looked at the expected number of left-handed people in a group of 14, assuming 10% of the population is left-handed:

plot(0:14,dbinom(0:14, 14, prob=.1), pch=16, xlab="Number of Left Handed People", ylab="Probability")

We compared this probability to our observation that 6/14 people were left-handed. In this case the probability of that occurring due to random chance was very low. We may decide to reject the assumption that this distribution is what generated our data.

To take a more relevant example: the Atlanta Braves, who won 64.2% of their games in the regular season, just lost 3 of 4 games to the Philadelphia Phillies in the National League Division Series. Certainly this will lead to all sorts of re-workings and re-toolings of their franchise. But what is likely to happen by random chance to a team that wins 64.2% of its games over the course of 4 games:

dbinom(0:4, 4, .642)
#> [1] 0.01642601 0.11782680 0.31694752 0.37892050 0.16987916

1.6% of the time they win no games, and 11.8% of the time they win only 1 game (which is what happened). This is by no means a rare outcome! Things that happen 10% of the time happen all the time. The Braves were just unlucky (but are also trash and deserved to lose to our Fightin’ Phils).

All of this requires us to know the shape of the random variable in the population. That’s easy for things like the binomial distribution. It was more difficult once we started to think about having a single sample mean in hand without knowing the random variable which produced it.

Through the LLN and CLT we discovered that the sampling distribution of the sample mean (the random variable which produces means) is a very predictable thing. Because of this, we can go through a similar process of observing a sample, postulating something about the population, and then determining the likelihood of our sample given that postulate.

If I was to sum up what we’ve learned so far in a few sentences I would say:

Populations/Data Generating Processes are unknowable. The best we can do is to make an assumption about the population, use our knowledge of random variables to model the uncertainty and range of possibilities that assumption would generate, and then estimate the likelihood that assumption is true given our sample of data.

The following goes section by section, highlighting some of the key lessons from each.

7.1 Probability

7.1.1 Basics of probability:

Probability denotes the expected long-term frequency of events. A definition that I find helpful:

The chance of something gives the percentage of time it is expected to happen, when the basic process is done over and over again, independently and under the same conditions.

To formalize our look into probability, two key terms are events and a sample space. Events are simply outcomes of a particular thing: the outcome of the roll of a die or the flip of a coin; the outcome of a presidential election; or whether a randomly sampled individual is favorable towards the President or not. A sample space is the set of all possible outcomes of the thing we are considering.

When all outcomes are equally likely, we can define the probability of an event occurring by dividing the number of outcomes in the event by the total number of outcomes in the sample space.

There are a couple of “rules” (axioms) about probability that are helpful to keep in mind.

  1. The probability of any event A is non-negative: \(P(A)\geq 0\).

  2. The probability that one of the outcomes in the sample space occurs is 1: \(P(\Omega)=1\).

  3. \(P(A \text{ or } B) = P(A) + P(B) - P(A \& B)\) (What does this reduce to if A and B are mutually exclusive?)

7.1.2 Simulating probability:

Basic simulation of a roll of the dice:

# Roll a fair die 10,000 times and tabulate how often each face comes up
set.seed(19104)
dice <- c(1, 2, 3, 4, 5, 6)
result <- rep(NA, 10000)
for (i in 1:10000) {
  result[i] <- sample(dice, 1, replace = TRUE)
}
table(result)
#> result
#>    1    2    3    4    5    6 
#> 1684 1653 1710 1682 1676 1595

We know that each number on a fair die will come up 1/6 of the time, so our expectation would be that each of these numbers would be drawn \(10000/6 \approx 1667\) times. Obviously with random chance we don’t get that number exactly, but each is quite close.

7.1.3 Permutations and Combinations

Permutations are sequences where the order matters. The formula is \(_nP_k = \frac{n!}{(n-k)!}\), which gives the number of permutations where \(n\) is the number of options and \(k\) is the number of selections.

A combination differs from a permutation because with a combination the order does not matter. A lottery is a great example of this. You win the lottery if the numbers you selected come out of the bin, regardless of the order of the balls. The formula is \(_nC_k = \frac{n!}{k!(n-k)!}\).

What is the difference between these two formulas, mathematically, and what does that difference do?

Calculating combinations in R:

choose(70,5)
#> [1] 12103014

Calculating permutations in R (why does this work?):

(choose(365,20)*factorial(20))
#> [1] 1.036691e+51

7.2 Probability 2

Conditional probability is about sub-setting. If we are theoretically interested in P(A|B=b), we want to think about sub-setting to the cases for B where that’s true, and recalculating the probability of A with that new denominator.

The formula for a conditional probability is \(P(A|B) = \frac{P(A\&B)}{P(B)}\).
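
To see the sub-setting logic in code, here is a minimal sketch (the two-dice events are my own illustrative choice, not an example from class):

# P(A|B) where A = the two dice sum to at least 10, B = the first die is a 6
set.seed(19104)
die1 <- sample(1:6, 10000, replace = TRUE)
die2 <- sample(1:6, 10000, replace = TRUE)

# Subset to the rolls where B occurred, then recalculate the probability of A
mean((die1 + die2 >= 10)[die1 == 6])
# The formula agrees: P(A&B)/P(B) = (3/36)/(6/36) = .5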

Two events are independent if:

\[ P(A|B)=P(A)\\ \& \\ P(B|A)=P(B) \]

7.3 Discrete RVs

All of the probability up to this point has assumed equal probability between events. We have also mostly dealt with things that were fairly countable.

We used the example of determining whether Mahomes’s performance in the Super Bowl was unusual or not, with the background that he had a certain known success rate (say 63%) and could throw any number of passes (20, 21, 22, 23…). We want to be able to determine, for any combination of success rate and number of attempts, the probability of every possible number of successes.

To determine this we derived the binomial formula:

\[ P(K=k) = _nC_k * \pi^k * (1-\pi)^{n-k} \]

The first part tells us the number of ways to have k successes in n trials. The second part tells us, for each of those ways to have k successes, what the probability is.

To get the second part we also derived that when two events are independent:

\[ P(A|B) = \frac{P(A\&B)}{P(B)}\\ P(A|B) = P(A) \text{ iff } A \perp B\\ \therefore P(A) = \frac{P(A\&B)}{P(B)}\\ P(A\&B) = P(A) * P(B) \text{ iff } A \perp B \]
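
We can confirm in R that multiplying the two pieces of the binomial formula together reproduces the built-in binomial PMF (a quick sketch; the numbers here are illustrative, not from class):

n <- 20; k <- 15; p <- .63
choose(n, k) * p^k * (1 - p)^(n - k)  # the formula by hand, roughly 0.105
dbinom(k, n, p)                       # R's built-in PMF gives the same value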

7.4 Discrete RVs 2

We moved on from there to formally define a RV: when we talk about “random variables” this is a different thing from the variables in a dataset (which are samples of data). When we talk about random variables we are talking about population-level distributions that produce samples of data, with each of those samples being a little bit different, yet following the same underlying truth.

In short, Random Variables are helpful because they allow us to compare a sample of data to the theoretical truth of what should happen. In other words, we can compare a sample of data to the population.

Formally, a random variable assigns a number and a probability to each event. Because we are defining events with a number and a probability, this allows us to do math! Which is good!

7.4.1 PMF and CDF

The probability mass function and cumulative distribution function are counterparts of one another. With one you can derive the other.

So for this PMF

\[ f(c) = P(C=c) = \begin{cases} .1 \text{ if } c = 0 \\ .25 \text{ if } c = 1 \\ .5 \text{ if } c = 2 \\ .15 \text{ if } c = 3 \\ \end{cases} \]

We can derive this CDF:

\[ F(c) = P(C \leq c) = \begin{cases} 0 \text{ if } c < 0\\ .1 \text{ if } 0 \leq c < 1 \\ .35 \text{ if } 1 \leq c < 2 \\ .85 \text{ if } 2 \leq c < 3 \\ 1 \text{ if } c \geq 3 \\ \end{cases} \]
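
Because the CDF is just the running total of the PMF, one line of R derives one from the other (a small sketch):

# The PMF above as a vector: P(C=0), P(C=1), P(C=2), P(C=3)
pmf <- c(.1, .25, .5, .15)
cumsum(pmf)
#> [1] 0.10 0.35 0.85 1.00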

7.4.2 Expectation and Variance

For any random variable we can describe the expectation and variance. These are population features of random variables and do not apply to samples of data.

The expected value of a random variable is: \(E[X] = \sum_x xp(x)\)

The variance of a random variable is: \(V[X] = E[(X-E(X))^2]\) or \(V[X] = E(X^2) - E(X)^2\)
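
Applying these definitions to the PMF from the previous section (a quick sketch):

c_vals <- 0:3
pmf <- c(.1, .25, .5, .15)
sum(c_vals * pmf)                          # E[X] = 1.7
sum(c_vals^2 * pmf) - sum(c_vals * pmf)^2  # V[X] = E[X^2] - E[X]^2 = 0.71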

7.4.3 Key Discrete RVs

A Bernoulli RV describes a single success or failure. It has one parameter, \(\pi\) the probability of success. The expected value is \(\pi\) and the variance is \(\pi*(1-\pi)\). What is an example of a Bernoulli RV?

A binomial RV describes a sequence of Bernoulli RVs. It has two parameters. \(\pi\) the probability of success of each Bernoulli, and \(n\), the number of Bernoulli draws. The expected value is \(n*\pi\) and the variance is \(n*\pi*(1-\pi)\). What is an example of a Binomial RV?
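
We can check these formulas by simulation (a sketch re-using the left-handedness numbers from above):

# 10,000 draws from a binomial with n = 14 and pi = .1
set.seed(19104)
draws <- rbinom(10000, 14, .1)
mean(draws)  # should be close to n * pi = 1.4
var(draws)   # should be close to n * pi * (1 - pi) = 1.26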

7.4.4 Continuous RVs

A continuous RV can take on an infinite number of values within a given interval. As such, the probability of a continuous RV taking on any particular value is 0: if it were anything greater than 0, the total probability would not approach 1 but infinity.

Instead of Probability Mass Functions, continuous variables have Probability Density Functions, which give probabilities for ranges of values rather than for single points. To get a CDF from a PDF we integrate from negative infinity up to any given value. To get a PDF from a CDF we take the derivative at any given value.

The first important continuous distribution we considered was the uniform distribution, which has equal probability density over a given interval defined by end-points \(a\) and \(b\). The PDF of a uniform RV is given by:

\[ f(x) = \begin{cases} \frac{1}{b-a} \text{ if } a \leq x \leq b\\ 0 \text{ otherwise} \end{cases} \]

The expected value of a uniform RV is given by: \(E[X] = \int_a^b x*f(x) \,dx = \text{Math!} = \frac{a+b}{2}\).

The variance of a uniform RV is given by: \(V[X] = E[X^2] - E[X]^2 = \text{Math!} = \frac{1}{12}(b-a)^2\).
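
R can do the “Math!” step for us numerically (a sketch for an arbitrarily chosen uniform with \(a=2\) and \(b=10\)):

# E[X] by numerical integration; the formula says (a + b)/2 = 6
a <- 2; b <- 10
integrate(function(x) x * dunif(x, a, b), lower = a, upper = b)$value

# V[X] as E[X^2] - E[X]^2; the formula says (1/12)*(b - a)^2 = 16/3
ex2 <- integrate(function(x) x^2 * dunif(x, a, b), lower = a, upper = b)$value
ex2 - 6^2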

The normal random variable is the most important random variable. It has two parameters: \(\mu\), the central tendency of the distribution, and \(\sigma\), the standard deviation of the distribution. The normal distribution is always symmetrical. The PDF of the normal distribution is defined by:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{- \frac{1}{2\sigma^2}(x-\mu)^2} \]

Given this, if we have a \(\mu\) and a \(\sigma\) we can conclude anything we want about that normal distribution.

7.4.5 The standard normal.

We discussed how the properties of the normal distribution meant that if we make the following translation to any normal distribution:

\[ Z = \frac{X-\mu}{\sigma} \]

We are left with the standard normal distribution: \(Z \sim N(0,1)\).

The standard normal distribution is helpful because it allows us to talk about normal distributions in “standard deviation units”. If you tell me that ±2.576 units around zero in the standard normal contains 99% of all probability mass, that means that ±2.576 standard deviations from the mean contains 99% of the probability mass in any normal distribution.
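
We can recover that cutoff directly in R (a quick check):

# The boundaries containing the central 99% of the standard normal
qnorm(c(.005, .995))
#> [1] -2.575829  2.575829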

7.4.6 Using R to calculate probability using random variables

We learned that we can use R to calculate various probabilities using built-in functions. The key thing here is that we have already worked out the shape of these probability distributions, and if you know the parameters that define them, R can give you a large number of answers. See notes!
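
As a brief reminder of how the d/p/q/r family of functions fits together, here is a sketch using the binomial:

dbinom(6, 14, .1)    # d: the PMF, P(X = 6)
pbinom(5, 14, .1)    # p: the CDF, P(X <= 5)
qbinom(.99, 14, .1)  # q: the inverse CDF, smallest k with P(X <= k) >= .99
rbinom(5, 14, .1)    # r: 5 random draws from this distribution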

7.5 Sampling

The random variables discussed thus far are population-level distributions that we could sample from. We did not, at this point, have tools to determine how good a sample we will have as we increase the sample size from 1 to 2 to 3… to infinity.

We learned three key principles of sampling:

  1. The larger the sample, the more likely it is that our sample mean is close to the expected value of the random variable we are sampling from (LLN).
  2. For any sample size, the expected value of the sample mean is the expected value of the random variable we are sampling from.
  3. The distribution of all possible estimates – the sampling distribution – is normally distributed, with its mean centered on the RV’s expected value and variance determined by the variance of the RV and the sample size. \(\bar{X_n} \leadsto N(\mu=E[X],\sigma^2 = V[X]/n)\)

To be clear: these are principles which are about the sample mean specifically, but will eventually apply to all estimates that we form from a sample.

Critically: the sample mean itself is drawn from a random variable with a known shape (above). That distribution is the sampling distribution, and the standard deviation of that sampling distribution is the standard error. Because that shape is known, we can project with a great deal of confidence what a large number of sample means will look like if we understand the distribution from which they are being pulled.

To be clear about the above: X can be any distribution. It could be binomial, Bernoulli, uniform, normal, totally crazy. It doesn’t matter. If we know the expected value and variance of that distribution, then we can project with confidence what sample means estimated from it will look like.
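
A simulation makes this concrete (a sketch; the uniform is an arbitrary choice of a decidedly non-normal RV):

# Sampling distribution of the mean of n = 100 draws from a uniform(0, 1)
set.seed(19104)
means <- rep(NA, 10000)
for (i in 1:10000) {
  means[i] <- mean(runif(100, 0, 1))
}
mean(means)  # close to E[X] = .5
var(means)   # close to V[X]/n = (1/12)/100, about .00083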

7.5.1 What makes a good estimator?

Population parameters are true, fixed values. We use an estimator (a method) to produce an estimate (a single number computed from our sample), which is one of an infinite number of estimates that could be drawn from an infinite number of samples.

An estimator is unbiased if:

\[ E[estimate-truth]=0 \]

We can determine unbiasedness through sampling, but also through mathematical proof.

For the latter, we wish to know whether:

\[ E[\phi] -E[X]=0 \] Where \(\phi\) is any estimator.

The key features of the expectation operator were that:

\[ E[cX] = c*E[X] \]

Where \(c\) is a constant.

\[ E[X_1] = E[X] \]

The expected value of any observation is equal to the overall expected value of the random variable.

More than unbiasedness, we want our estimators to be consistent, which is when they are unbiased and the variance of the estimator gets smaller as \(n\) increases.

We can take the variance of our estimator to determine this through mathematical proof.

For the variance operator:

\[ V[cX] = c^2*V[X] \]

If we take a constant out of the variance operator it gets squared.

\[ V[X_1]=V[X] \] The variance of any given observation is also equal to the overall variance of the random variable.

We can measure whether the estimator is consistent by whether the MSE/RMSE goes to 0 as the sample size goes to infinity, where:

\[ MSE = Variance +Bias^2 = V[\hat{\theta}] + (E[\hat{\theta}- \theta])^2 \]
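
To close, a simulation sketch of consistency: the MSE of the sample mean of rolls of a fair die (where the truth is \(E[X]=3.5\)) shrinks toward 0 as \(n\) grows.

# Estimate the MSE of the sample mean at several sample sizes
set.seed(19104)
mse <- function(n) {
  estimates <- replicate(5000, mean(sample(1:6, n, replace = TRUE)))
  mean((estimates - 3.5)^2)
}
sapply(c(10, 100, 1000), mse)  # each is roughly V[X]/n, heading toward 0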