2.3 General Math

Chebyshev’s Inequality

Let X be a random variable with mean \(\mu\) and standard deviation \(\sigma\). Then for any positive number k:

\[ P(|X-\mu| < k\sigma) \ge 1 - \frac{1}{k^2} \]

Chebyshev’s Inequality does not require that X be normally distributed; it holds for any distribution with a finite mean and variance.
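A quick numerical check can make the bound concrete. The sketch below is an illustration only: the exponential distribution, the sample size, the seed, and the helper name `chebyshev_check` are all assumptions chosen for the example. It compares the observed share of draws within \(k\sigma\) of the mean against \(1 - 1/k^2\).

```python
import random
import statistics

def chebyshev_check(sample, k):
    """Return (share of points with |x - mean| < k * sd, bound 1 - 1/k^2)."""
    mu = statistics.fmean(sample)
    sigma = statistics.pstdev(sample)
    inside = sum(abs(x - mu) < k * sigma for x in sample)
    return inside / len(sample), 1 - 1 / k**2

rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
# Exponential(1) draws: a skewed, clearly non-normal distribution (assumed example)
sample = [rng.expovariate(1.0) for _ in range(100_000)]

for k in (1.5, 2, 3):
    empirical, bound = chebyshev_check(sample, k)
    print(f"k={k}: empirical={empirical:.4f}, bound={bound:.4f}")
```

The empirical share should sit well above the bound for this distribution, which is typical: Chebyshev’s Inequality is a worst-case guarantee rather than a tight approximation.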

Maclaurin series expansion for \(e^z\):

\[ e^z = 1 + z + \frac{z^2}{2!} + \frac{z^3}{3!} + ... \]
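As a rough illustration of how quickly the partial sums approach \(e^z\), the sketch below adds the first few terms of the series and compares them with `math.exp`. The function name `exp_maclaurin` and the test point \(z = 1.5\) are assumptions made for the example.

```python
import math

def exp_maclaurin(z, n_terms):
    """Sum the first n_terms terms of 1 + z + z^2/2! + z^3/3! + ..."""
    total, term = 0.0, 1.0  # term starts at z^0 / 0! = 1
    for j in range(n_terms):
        total += term
        term *= z / (j + 1)  # turn z^j / j! into z^(j+1) / (j+1)!
    return total

z = 1.5  # arbitrary evaluation point
for n_terms in (2, 5, 10, 20):
    approx = exp_maclaurin(z, n_terms)
    print(n_terms, approx, abs(approx - math.exp(z)))
```

Updating each term from the previous one avoids computing large factorials explicitly.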

Geometric series: for \(r \neq 1\),

\[ s_n=\sum_{k=1}^{n}ar^{k-1}=\frac{a(1-r^n)}{1-r} \]

If \(|r| < 1\), the series converges as \(n \rightarrow \infty\):

\[ s=\sum_{k=1}^{\infty}ar^{k-1}=\frac{a}{1-r} \]
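The two formulas can be checked numerically. This is a minimal sketch with assumed values \(a = 3\) and \(r = 0.5\); the function names are hypothetical.

```python
def geometric_partial_sum(a, r, n):
    """Add the n terms a * r^(k-1), k = 1..n, directly."""
    return sum(a * r ** (k - 1) for k in range(1, n + 1))

def geometric_closed_form(a, r, n):
    """Closed form a(1 - r^n) / (1 - r), valid for r != 1."""
    return a * (1 - r**n) / (1 - r)

a, r = 3.0, 0.5  # assumed example values with |r| < 1
for n in (1, 5, 20, 60):
    print(n, geometric_partial_sum(a, r, n), geometric_closed_form(a, r, n))
print("limit a / (1 - r):", a / (1 - r))
```

With \(|r| < 1\) the partial sums settle toward \(a/(1-r)\) as n grows.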

2.3.1 Law of large numbers

Let \(X_1,X_2,...\) be an infinite sequence of independent and identically distributed (i.i.d.) random variables with expected value \(\mu\) and variance \(\sigma^2\). Then the sample average

\[ \bar{X}_n =\frac{1}{n} (X_1 + ... + X_n) \]

converges to the expected value (\(\bar{X}_n \rightarrow \mu\)) as \(n \rightarrow \infty\). Its variance shrinks at rate \(1/n\):

\[ Var(\bar{X}_n) = Var(\frac{1}{n}(X_1 + ... + X_n)) = \frac{1}{n^2}Var(X_1 + ... + X_n)= \frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n} \]
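The \(\sigma^2/n\) result can be checked by simulation. The sketch below assumes Uniform(0, 1) draws (so \(\sigma^2 = 1/12\)), 5000 replications, and an arbitrary seed; none of these choices come from the text.

```python
import random
import statistics

def variance_of_sample_mean(n, reps, rng):
    """Variance across `reps` simulated sample means, each from n Uniform(0, 1) draws."""
    means = [statistics.fmean([rng.random() for _ in range(n)]) for _ in range(reps)]
    return statistics.pvariance(means)

rng = random.Random(0)  # fixed seed, chosen arbitrarily
sigma2 = 1 / 12  # variance of a single Uniform(0, 1) draw

results = {}
for n in (5, 20, 80):
    results[n] = variance_of_sample_mean(n, reps=5000, rng=rng)
    print(f"n={n}: simulated={results[n]:.6f}, sigma^2/n={sigma2 / n:.6f}")
```

The simulated values should track \(\sigma^2/n\) up to Monte Carlo noise, falling by roughly a factor of four each time n is quadrupled.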

The difference between the Weak Law and the Strong Law is the mode of convergence.

Weak Law

The sample average converges in probability towards the expected value

\[ \bar{X}_n \rightarrow^{p} \mu \]

when \(n \rightarrow \infty\)

\[ \lim_{n\to \infty}P(|\bar{X}_n - \mu| > \epsilon) = 0 \]
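To see this probability shrink, one can estimate it by Monte Carlo for increasing n. The sketch below assumes fair coin flips (\(\mu = 0.5\)), \(\epsilon = 0.1\), and 1000 replications per sample size; the function name `exceed_probability` is hypothetical.

```python
import random

def exceed_probability(n, eps, reps, rng):
    """Monte Carlo estimate of P(|mean of n fair coin flips - 0.5| > eps)."""
    hits = 0
    for _ in range(reps):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        hits += abs(mean - 0.5) > eps
    return hits / reps

rng = random.Random(1)  # fixed seed, chosen arbitrarily
probs = {n: exceed_probability(n, eps=0.1, reps=1000, rng=rng) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n={n}: estimated P(|mean - 0.5| > 0.1) = {p:.3f}")
```

The estimates should fall toward zero as n grows, which is what convergence in probability asserts for every fixed \(\epsilon\).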

The sample mean from an i.i.d. random sample (\(\{ x_i \}_{i=1}^n\)) from any population with a finite mean \(\mu\) and finite variance \(\sigma^2\) is a consistent estimator of the population mean \(\mu\)

\[ plim(\bar{x})=plim(n^{-1}\sum_{i=1}^{n}x_i) =\mu \]

Strong Law

The sample average converges almost surely to the expected value

\[ \bar{X}_n \rightarrow^{a.s} \mu \]

when \(n \rightarrow \infty\)


\[ P(\lim_{n\to \infty}\bar{X}_n =\mu) =1 \]

2.3.2 Law of Iterated Expectation

Let X, Y be random variables. Then,

\[ E(X) = E(E(X|Y)) \]

This means that the expected value of X can be calculated from the conditional distribution of X|Y and the marginal distribution of Y.
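A small exact example may help. The setup below is made up for illustration: Y is a fair coin, and given Y the variable X is uniform on a set that depends on Y. The sketch computes \(E(E(X|Y))\) and compares it with \(E(X)\) computed directly from the marginal distribution of X.

```python
from fractions import Fraction

# Hypothetical setup: Y is a fair coin; X | Y = y is uniform on support_x[y]
p_y = {0: Fraction(1, 2), 1: Fraction(1, 2)}
support_x = {0: [1, 2], 1: [1, 2, 3, 4, 5, 6]}

def cond_mean(y):
    """E(X | Y = y) for a uniform conditional distribution."""
    xs = support_x[y]
    return Fraction(sum(xs), len(xs))

# E(E(X | Y)): average the conditional means over the distribution of Y
iterated = sum(p_y[y] * cond_mean(y) for y in p_y)

# E(X) directly: build the marginal pmf of X, then take its mean
marginal = {}
for y, xs in support_x.items():
    for x in xs:
        marginal[x] = marginal.get(x, 0) + p_y[y] * Fraction(1, len(xs))
direct = sum(x * p for x, p in marginal.items())

print(iterated, direct)  # prints: 5/2 5/2
```

Using `Fraction` keeps the arithmetic exact, so the two routes agree without rounding error.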

2.3.3 Convergence

Convergence in Probability

  • As \(n \rightarrow \infty\), an estimator (a random variable) becomes arbitrarily close to the true value with probability approaching one.
  • The random variable \(\theta_n\) converges in probability to a constant c if

\[ \lim_{n\to \infty}P(|\theta_n - c| \ge \epsilon) = 0 \]

for any positive \(\epsilon\). This is written as

\[ plim(\theta_n)=c \]

or

\[ \theta_n \rightarrow^p c \]

Properties of Convergence in Probability

  • Slutsky’s Theorem: for a continuous function g(.), if \(plim(\theta_n)= \theta\) then \(plim(g(\theta_n)) = g(\theta)\)

  • if \(\theta_n \rightarrow^p \theta\) and \(\gamma_n \rightarrow^p \gamma\) then

    • \(plim(\theta_n + \gamma_n)=\theta + \gamma\)
    • \(plim(\theta_n \gamma_n) = \theta \gamma\)
    • \(plim(\theta_n/\gamma_n) = \theta/\gamma\) if \(\gamma \neq 0\)
  • These properties also hold for random vectors/matrices

Convergence in Distribution

  • As \(n \rightarrow \infty\), the distribution of a random variable may converge towards another (“fixed”) distribution.
  • The random variable \(X_n\) with CDF \(F_n(x)\) converges in distribution to a random variable X with CDF \(F(x)\) if

\[ \lim_{n\to \infty}|F_n(x) - F(x)| = 0 \]

at all points of continuity of \(F(x)\)

Notation: \(X_n \rightarrow^d X\), where \(F(x)\) is the limiting distribution of \(X_n\)

  • E(X) is the limiting mean (asymptotic mean)
  • Var(X) is the limiting variance (asymptotic variance)


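The limiting moments need not equal the limits of the moments. In general,

\[ \begin{aligned} E(X) &\neq \lim_{n\to \infty}E(X_n) \\ Avar(X_n) &\neq \lim_{n\to \infty}Var(X_n) \end{aligned} \]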
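A standard textbook-style example, not taken from this section, shows how the limiting mean can differ from the limit of the means. Suppose \(X_n\) equals n with probability \(1/n\) and 0 otherwise. Then \(X_n\) converges in distribution to the constant 0, so the limiting mean is 0, yet \(E(X_n) = 1\) for every n. The sketch below computes these quantities exactly; the helper names are hypothetical.

```python
from fractions import Fraction

def pmf_x_n(n):
    """X_n equals n with probability 1/n and 0 otherwise (assumed example)."""
    return {0: 1 - Fraction(1, n), n: Fraction(1, n)}

def mean(pmf):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def cdf(pmf, x):
    """P(X <= x) for a discrete distribution given as {value: probability}."""
    return sum(p for value, p in pmf.items() if value <= x)

for n in (2, 10, 1000):
    pmf = pmf_x_n(n)
    print(n, "F_n(0.5) =", cdf(pmf, Fraction(1, 2)), " E(X_n) =", mean(pmf))
```

The CDF at any fixed positive point tends to 1 (the CDF of the constant 0), while the mean stays at 1 because the shrinking probability mass sits at an ever larger value.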

Properties of Convergence in Distribution

  • Continuous Mapping Theorem: for a continuous function g(.), if \(X_n \to^{d} X\) then \(g(X_n) \to^{d} g(X)\)

  • if \(X_n \to^{d} X\) and \(Y_n\to^{d} c\) for a constant c, then

    • \(X_n + Y_n \to^{d} X + c\)

    • \(Y_nX_n \to^{d} cX\)

    • \(X_n/Y_n \to^{d} X/c\) if \(c \neq 0\)

  • These properties also hold for random vectors/matrices

Summary

Properties of Convergence

| Probability | Distribution |
|---|---|
| Slutsky’s Theorem: for a continuous function g(.), if \(plim(\theta_n)= \theta\) then \(plim(g(\theta_n)) = g(\theta)\) | Continuous Mapping Theorem: for a continuous function g(.), if \(X_n \to^{d} X\) then \(g(X_n) \to^{d} g(X)\) |
| if \(\theta_n \rightarrow^p \theta\) and \(\gamma_n \rightarrow^p \gamma\) then | if \(X_n \to^{d} X\) and \(Y_n\to^{d} c\), then |
| \(plim(\theta_n + \gamma_n)=\theta + \gamma\) | \(X_n + Y_n \to^{d} X + c\) |
| \(plim(\theta_n \gamma_n) = \theta \gamma\) | \(Y_nX_n \to^{d} cX\) |
| \(plim(\theta_n/\gamma_n) = \theta/\gamma\) if \(\gamma \neq 0\) | \(X_n/Y_n \to^{d} X/c\) if \(c \neq 0\) |

Convergence in Probability is stronger than Convergence in Distribution: the former implies the latter, but Convergence in Distribution does not guarantee Convergence in Probability (unless the limit is a constant).
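A classic counterexample, added here as an illustration, takes \(X \sim N(0,1)\) and sets \(X_n = -X\) for every n. By symmetry each \(X_n\) has the same distribution as X, so \(X_n \rightarrow^d X\) trivially, yet \(|X_n - X| = 2|X|\) never shrinks. The simulation below is a sketch under these assumptions; the sample size, seed, and \(\epsilon = 0.5\) are arbitrary.

```python
import random

rng = random.Random(2)  # fixed seed, chosen arbitrarily
x = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
x_n = [-v for v in x]  # X_n = -X: same N(0, 1) distribution as X

def ecdf(sample, t):
    """Empirical CDF of `sample` evaluated at t."""
    return sum(v <= t for v in sample) / len(sample)

# The two empirical CDFs nearly coincide (same distribution) ...
max_cdf_gap = max(abs(ecdf(x, t) - ecdf(x_n, t)) for t in (-2, -1, 0, 1, 2))

# ... but X_n is rarely close to X itself
eps = 0.5
far_apart = sum(abs(a - b) > eps for a, b in zip(x_n, x)) / len(x)

print("largest CDF gap on the grid:", max_cdf_gap)
print("share with |X_n - X| > 0.5:", far_apart)
```

The CDF gap should be near zero while the share of draws with \(|X_n - X| > \epsilon\) stays large, so the distributions match even though the random variables themselves do not get close.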

2.3.4 Sufficient Statistics


Likelihood

  • The likelihood describes the extent to which the sample provides support for any particular parameter value.
  • Higher support corresponds to a higher value for the likelihood.
  • The exact value of any likelihood is meaningless.
  • The relative value (i.e., comparing two values of \(\theta\)) is informative.

\[ L(\theta_0; y) = P(Y = y | \theta = \theta_0) = f_Y(y;\theta_0) \]

Likelihood Ratio

\[ \frac{L(\theta_0;y)}{L(\theta_1;y)} \]

Likelihood Function

For a given sample, you can compute the likelihood at every possible value of \(\theta\); the result, viewed as a function of \(\theta\), is called the likelihood function

\[ L(\theta) = L(\theta; y) = f_Y(y;\theta) \]

In a sample of n independent observations, the likelihood function takes the form of a product

\[ L(\theta) = \prod_{i=1}^{n}f_i (y_i;\theta) \]

Equivalently, the log likelihood function

\[ l(\theta) = \sum_{i=1}^{n} \log f_i(y_i;\theta) \]
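As a concrete sketch, the likelihood, log likelihood, and likelihood ratio can be computed for a Bernoulli sample. The data below are made up, and the Bernoulli model with \(f(y;\theta) = \theta^y(1-\theta)^{1-y}\) is an assumption chosen for simplicity.

```python
import math

y = [1, 0, 1, 1, 0, 1, 1, 1]  # made-up sample: 6 successes out of 8

def likelihood(theta, y):
    """Product of Bernoulli densities theta^y * (1 - theta)^(1 - y)."""
    return math.prod(theta if yi == 1 else 1 - theta for yi in y)

def log_likelihood(theta, y):
    """Sum of log Bernoulli densities; equals log(likelihood(theta, y))."""
    return sum(math.log(theta if yi == 1 else 1 - theta) for yi in y)

# Likelihood ratio comparing theta_0 = 0.75 against theta_1 = 0.5
ratio = likelihood(0.75, y) / likelihood(0.5, y)
print("L(0.75) =", likelihood(0.75, y))
print("l(0.75) =", log_likelihood(0.75, y))
print("L(0.75) / L(0.5) =", ratio)
```

The individual likelihood values are tiny and hard to interpret on their own, but a ratio above 1 indicates that the sample supports \(\theta = 0.75\) more than \(\theta = 0.5\).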

Sufficient statistics

  • A statistic, T(y), is any quantity that can be calculated purely from a sample (it does not depend on \(\theta\))
  • A statistic is sufficient if it conveys all the available information about the parameter, that is, if the likelihood factors as

\[ L(\theta; y) = c(y)L^*(\theta;T(y)) \]
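For a Bernoulli sample, the number of successes \(T(y) = \sum_i y_i\) is a sufficient statistic, and the factorization holds with \(c(y) = 1\). The sketch below uses two made-up samples with the same T(y) to show that they produce the same likelihood at every \(\theta\); the function names are hypothetical.

```python
import math

def bernoulli_likelihood(theta, y):
    """Full-sample likelihood L(theta; y) for Bernoulli data."""
    return math.prod(theta if yi == 1 else 1 - theta for yi in y)

def likelihood_from_statistic(theta, t, n):
    """L*(theta; T(y)): depends on the data only through t = sum(y) and n."""
    return theta**t * (1 - theta) ** (n - t)

y_a = [1, 1, 0, 0, 1]
y_b = [0, 1, 1, 1, 0]  # different ordering, same T(y) = 3

for theta in (0.2, 0.5, 0.8):
    print(
        theta,
        bernoulli_likelihood(theta, y_a),
        bernoulli_likelihood(theta, y_b),
        likelihood_from_statistic(theta, sum(y_a), len(y_a)),
    )
```

All three columns should agree up to floating-point rounding, which is the sense in which T(y) carries all the information about \(\theta\) in this model.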

Nuisance parameters

If we are interested in one parameter (e.g., the mean), the other parameters that require estimation (e.g., the standard deviation) are nuisance parameters. We can replace nuisance parameters in the likelihood function with their estimates to create a profile likelihood.
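One common way to build a profile likelihood, sketched here under the assumption of normally distributed data, is to fix the parameter of interest \(\mu\) and plug in the maximum likelihood estimate of the variance for that \(\mu\), namely \(\hat{\sigma}^2(\mu) = n^{-1}\sum_i (y_i - \mu)^2\). The data, the grid, and the function name are made up for illustration.

```python
import math

y = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]  # made-up data with sample mean 5.0

def profile_log_likelihood(mu, y):
    """Normal log likelihood in mu with sigma^2 replaced by its estimate given mu."""
    n = len(y)
    sigma2_hat = sum((yi - mu) ** 2 for yi in y) / n  # nuisance parameter estimate
    return -0.5 * n * math.log(2 * math.pi * sigma2_hat) - 0.5 * n

# Evaluate on a grid of candidate means (grid bounds are an arbitrary choice)
grid = [4.0 + 0.01 * i for i in range(201)]
mu_best = max(grid, key=lambda mu: profile_log_likelihood(mu, y))
print("grid value maximizing the profile log likelihood:", mu_best)
```

The profile is a function of \(\mu\) alone, and in this model its maximizer lands at the sample mean up to the grid resolution.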

2.3.5 Parameter transformations

log-odds transformation

\[ \text{Log odds} = g(\theta)= \ln\left[\frac{\theta}{1-\theta}\right] \]
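The log-odds transformation maps a probability in (0, 1) to the whole real line. The sketch below implements it together with its inverse, the logistic function \(\theta = 1/(1+e^{-g})\), which is not stated in the text but follows from solving the formula for \(\theta\). The function names are hypothetical.

```python
import math

def log_odds(theta):
    """g(theta) = ln(theta / (1 - theta)) for 0 < theta < 1."""
    return math.log(theta / (1 - theta))

def inverse_log_odds(g):
    """Recover theta from its log odds: theta = 1 / (1 + exp(-g))."""
    return 1 / (1 + math.exp(-g))

for theta in (0.1, 0.5, 0.8):
    g = log_odds(theta)
    print(theta, g, inverse_log_odds(g))
```

A probability of 0.5 maps to log odds of 0, and values above or below 0.5 map to positive or negative log odds respectively.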

log transformation