2.3 General Math

2.3.1 Number Sets

| Notation | Denotes | Examples |
|---|---|---|
| \(\emptyset\) | Empty set | No members |
| \(\mathbb{N}\) | Natural numbers | \(\{1, 2, ...\}\) |
| \(\mathbb{Z}\) | Integers | \(\{ ..., -1, 0, 1, ...\}\) |
| \(\mathbb{Q}\) | Rational numbers | including fractions |
| \(\mathbb{R}\) | Real numbers | |
| \(\mathbb{C}\) | Complex numbers | |

2.3.2 Summation Notation and Series

Chebyshev’s Inequality

Let \(X\) be a random variable with mean \(\mu\) and standard deviation \(\sigma\). Then for any positive number \(k\):

\[ P(|X-\mu| < k\sigma) \ge 1 - \frac{1}{k^2} \]

Chebyshev’s Inequality does not require that \(X\) be normally distributed.
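As a rough numerical illustration of the bound, the sketch below draws from an exponential distribution (an arbitrary non-normal choice made here, not one taken from the text) and compares the empirical share of draws within \(k\sigma\) of the mean against \(1 - 1/k^2\). The function name `chebyshev_check` and the seed are illustrative assumptions.

```python
import random

def chebyshev_check(samples, k):
    """Return (empirical P(|X - mu| < k*sigma), Chebyshev lower bound)."""
    n = len(samples)
    mu = sum(samples) / n
    # Population-style standard deviation of the sample (divisor n)
    sigma = (sum((x - mu) ** 2 for x in samples) / n) ** 0.5
    inside = sum(1 for x in samples if abs(x - mu) < k * sigma)
    return inside / n, 1 - 1 / k**2

rng = random.Random(0)  # fixed seed so the run is reproducible
# Exponential draws: a skewed, clearly non-normal distribution
draws = [rng.expovariate(1.0) for _ in range(100_000)]

for k in (1.5, 2, 3):
    empirical, bound = chebyshev_check(draws, k)
    print(f"k={k}: empirical={empirical:.4f}, bound={bound:.4f}")
```

The empirical share should sit at or above the bound for every \(k\); the bound is typically loose because it must hold for every distribution with finite variance.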

Geometric sum

\[ \sum_{k=0}^{n-1} ar^k = a\frac{1-r^n}{1-r} \]

where \(r \neq 1\)

Geometric series

\[ \sum_{k=0}^\infty ar^k = \frac{a}{1-r} \]

where \(|r| <1\)
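A quick numerical check of both formulas can be sketched as follows. The values of `a`, `r`, and `n` are arbitrary illustrative choices, and `geometric_sum` is a helper name introduced here.

```python
def geometric_sum(a, r, n):
    """Closed form of sum_{k=0}^{n-1} a * r**k, valid for r != 1."""
    return a * (1 - r**n) / (1 - r)

a, r, n = 3.0, 0.5, 10

# Finite sum: term-by-term addition versus the closed form
direct = sum(a * r**k for k in range(n))
closed = geometric_sum(a, r, n)

# Infinite series: with |r| < 1 the finite sum approaches a / (1 - r)
series_limit = a / (1 - r)
long_partial = geometric_sum(a, r, 200)

print(direct, closed)
print(long_partial, series_limit)
```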

Binomial theorem

\[ (x + y)^n = \sum_{k=0}^n \binom{n}{k} x^{n-k} y^k \]

where \(n \ge 0\)

Binomial series

\[ \sum_k \binom{\alpha}{k} x^k = (1 +x)^\alpha \]

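where \(|x| < 1\) is required unless \(\alpha\) is a nonnegative integer \(n\), in which case the sum is finite and reduces to the binomial theorem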
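The binomial series can be checked numerically with a generalized binomial coefficient, \(\binom{\alpha}{k} = \alpha(\alpha-1)\cdots(\alpha-k+1)/k!\). The sketch below is a minimal illustration; the helper names and the test values \(\alpha = 0.5\), \(x = 0.3\) are assumptions made for the example.

```python
def generalized_binomial(alpha, k):
    """binom(alpha, k) = alpha (alpha-1) ... (alpha-k+1) / k! for real alpha."""
    result = 1.0
    for j in range(k):
        result *= (alpha - j) / (j + 1)
    return result

def binomial_series(alpha, x, terms):
    """Partial sum of sum_k binom(alpha, k) * x**k using the first `terms` terms."""
    return sum(generalized_binomial(alpha, k) * x**k for k in range(terms))

# Non-integer alpha: needs |x| < 1 and converges to (1 + x)**alpha
print(binomial_series(0.5, 0.3, 60), (1 + 0.3) ** 0.5)

# Nonnegative integer alpha: coefficients vanish for k > alpha, so any x works
print(binomial_series(3, 2.0, 10), (1 + 2.0) ** 3)
```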

Telescoping sum

When terms of a sum cancel each other out, leaving one term (i.e., it collapses like a telescope), we call it a telescoping sum

\[ \sum_{a \le k < b} \Delta F(k) = F(b) - F(a) \]

where \(\Delta F(k) = F(k+1) - F(k)\), \(a \le b\), and \(a, b \in \mathbb{Z}\)
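As a small worked example, take \(\Delta F(k) = F(k+1) - F(k)\) (the usual forward difference) and choose \(F(k) = k^2\), an illustrative choice. Then \(\Delta F(k) = (k+1)^2 - k^2 = 2k + 1\), and every intermediate term cancels:

\[ \sum_{a \le k < b} (2k + 1) = \sum_{a \le k < b} \left[ (k+1)^2 - k^2 \right] = b^2 - a^2 \]

With \(a = 0\) and \(b = n\), this recovers the familiar fact that the sum of the first \(n\) odd numbers is \(n^2\).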

Vandermonde convolution

\[ \sum_k \binom{r}{k} \binom{s}{n-k} = \binom{r+s}{n} \]

where \(n \in \mathbb{Z}\)
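For nonnegative integers \(r\), \(s\), and \(n\) (a restriction assumed here so that the standard library can be used), the identity can be verified directly. `vandermonde_lhs` is a helper name introduced for this sketch.

```python
from math import comb

def vandermonde_lhs(r, s, n):
    """Left-hand side sum_k C(r, k) * C(s, n - k) for nonnegative integers."""
    # math.comb(m, k) returns 0 when k > m, so out-of-range terms vanish
    return sum(comb(r, k) * comb(s, n - k) for k in range(n + 1))

r, s, n = 5, 7, 4
print(vandermonde_lhs(r, s, n), comb(r + s, n))
```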

Exponential series

\[ \sum_{k=0}^\infty \frac{x^k}{k!} = e^x \]

where \(x \in \mathbb{C}\)

Taylor series

\[ \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k = f(x) \]

where \(|x-a| < R =\) radius of convergence

When \(a = 0\), the Taylor series is called a Maclaurin series. For example, the Maclaurin series expansion for \(e^z\) is

\[ e^z = 1 + z + \frac{z^2}{2!} + \frac{z^3}{3!} + ... \]
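The partial sums of this expansion can be compared with a library exponential, including for complex arguments. This is a minimal sketch; `exp_partial_sum` and the sample point `z` are illustrative names and values.

```python
import cmath
from math import factorial

def exp_partial_sum(z, terms):
    """Sum of z**k / k! for k = 0, ..., terms - 1."""
    return sum(z**k / factorial(k) for k in range(terms))

z = 1 + 2j
for terms in (5, 10, 40):
    approx = exp_partial_sum(z, terms)
    # The error shrinks rapidly as more terms are included
    print(terms, abs(approx - cmath.exp(z)))
```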

Euler’s summation formula

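\[ \begin{aligned} \sum_{a \le k < b} f(k) = {} & \int_a^b f(x) dx + \sum_{k=1}^m\frac{B_k}{k!} f^{(k-1)}(x) \Big|_a^b \\ & + (-1)^{m+1} \int^b_a \frac{B_m (x-\lfloor x \rfloor)}{m!} f^{(m)}(x)dx \end{aligned} \]

where \(a, b, m \in \mathbb{Z}\), \(a \le b\), and \(m \ge 1\). Here \(B_k\) are the Bernoulli numbers and \(B_m(\cdot)\) is the Bernoulli polynomial of degree \(m\).

When \(m = 1\), dropping the remainder integral gives the trapezoidal rule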

\[ \sum_{a \le k < b} f(k) \approx \int_a^b f(x) dx - \frac{1}{2} (f(b) - f(a)) \]
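To see how close the \(m = 1\) approximation is, the sketch below applies it to \(f(x) = x^2\) on \([0, 100]\), an illustrative choice for which the integral has a simple closed form. The helper name `euler_m1_approx` is an assumption of this example.

```python
def euler_m1_approx(f, integral, a, b):
    """Approximate sum_{a <= k < b} f(k) by integral - (f(b) - f(a)) / 2."""
    return integral - 0.5 * (f(b) - f(a))

def f(x):
    return x**2

a, b = 0, 100
exact_sum = sum(f(k) for k in range(a, b))
integral = (b**3 - a**3) / 3  # closed-form integral of x^2 over [a, b]
approx = euler_m1_approx(f, integral, a, b)

print(exact_sum, approx, exact_sum - approx)
```

For this \(f\), the gap between the exact sum and the approximation equals \((b - a)/6\), which is small relative to the sum itself.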

2.3.3 Taylor Expansion

An infinitely differentiable function \(G(x)\) can, within its radius of convergence, be written as an infinite sum of terms built from its derivatives at a single point.

More specifically, an infinitely differentiable \(G(x)\) expanded about the point \(a\) is

\[ G(x) = G(a) + \frac{G'(a)}{1!} (x-a) + \frac{G''(a)}{2!}(x-a)^2 + \frac{G'''(a)}{3!}(x-a)^3 + \dots \]
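As a brief worked example (the function is chosen here for illustration), take \(G(x) = \ln x\) and \(a = 1\). Then \(G(1) = 0\), \(G'(1) = 1\), \(G''(1) = -1\), and \(G'''(1) = 2\), so the expansion begins

\[ \ln x = (x-1) - \frac{(x-1)^2}{2} + \frac{(x-1)^3}{3} - \dots \]

which converges for \(|x - 1| < 1\).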

2.3.4 Law of large numbers

Let \(X_1,X_2,...\) be an infinite sequence of independent and identically distributed (i.i.d.) random variables with expected value \(\mu\).

Then the sample average

\[ \bar{X}_n =\frac{1}{n} (X_1 + ... + X_n) \]

converges to the expected value (\(\bar{X}_n \rightarrow \mu\)) as \(n \rightarrow \infty\).

If the variance \(\sigma^2\) of each \(X_i\) is finite, the variance of the sample average shrinks at rate \(1/n\):

\[ Var(\bar{X}_n) = Var(\frac{1}{n}(X_1 + ... + X_n)) = \frac{1}{n^2}Var(X_1 + ... + X_n)= \frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n} \]
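The \(\sigma^2/n\) rate for the variance of the sample average can be checked by simulation. The sketch below uses Uniform(0, 1) draws, for which \(\sigma^2 = 1/12\); the distribution, seed, and function name are illustrative assumptions.

```python
import random

def variance_of_sample_mean(n, reps, rng):
    """Empirical variance of the mean of n Uniform(0, 1) draws over `reps` replications."""
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(reps)]
    grand_mean = sum(means) / reps
    return sum((m - grand_mean) ** 2 for m in means) / reps

rng = random.Random(42)  # fixed seed for reproducibility
sigma2 = 1 / 12          # variance of a single Uniform(0, 1) draw

for n in (5, 20, 80):
    estimate = variance_of_sample_mean(n, 20_000, rng)
    print(f"n={n}: simulated={estimate:.6f}, sigma^2/n={sigma2 / n:.6f}")
```

Each fourfold increase in \(n\) should cut the simulated variance by roughly a factor of four, up to simulation noise.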

Note: The connection between the LLN and the normal distribution lies in the Central Limit Theorem (CLT). The CLT states that, for i.i.d. draws from any distribution with finite variance, the distribution of the suitably standardized sample mean approaches a normal distribution as the sample size grows.

The difference between the Weak Law and the Strong Law lies in the mode of convergence.

Weak Law

The sample average converges in probability towards the expected value

\[ \bar{X}_n \rightarrow^{p} \mu \]

when \(n \rightarrow \infty\)

\[ \lim_{n\to \infty}P(|\bar{X}_n - \mu| > \epsilon) = 0 \]

The sample mean of an i.i.d. random sample (\(\{ x_i \}_{i=1}^n\)) from any population with a finite mean and finite variance \(\sigma^2\) is a consistent estimator of the population mean \(\mu\)

\[ plim(\bar{x})=plim(n^{-1}\sum_{i=1}^{n}x_i) =\mu \]

Strong Law

The sample average converges almost surely to the expected value

\[ \bar{X}_n \rightarrow^{a.s} \mu \]

when \(n \rightarrow \infty\)


\[ P(\lim_{n\to \infty}\bar{X}_n =\mu) =1 \]

2.3.5 Law of Iterated Expectation

Let \(X, Y\) be random variables. Then,

\[ E(X) = E(E(X|Y)) \]

This means that the expected value of \(X\) can be calculated from the conditional distribution of \(X|Y\) together with the marginal distribution of \(Y\).
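The identity can be verified exactly on a small discrete example. The joint probabilities below are hypothetical values chosen only for illustration, and the helper names are introduced here.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y); keys are (x, y) pairs
joint = {
    (0, 0): F(1, 8), (1, 0): F(3, 8),
    (0, 1): F(1, 4), (2, 1): F(1, 4),
}

def marginal_y(joint):
    """Marginal pmf of Y obtained by summing the joint pmf over x."""
    p = {}
    for (_, y), pr in joint.items():
        p[y] = p.get(y, 0) + pr
    return p

def cond_exp_x_given_y(joint, y):
    """E(X | Y = y) = sum_x x * P(X = x, Y = y) / P(Y = y)."""
    p_y = marginal_y(joint)[y]
    return sum(x * pr for (x, yy), pr in joint.items() if yy == y) / p_y

# Direct expectation of X from the joint distribution
e_x = sum(x * pr for (x, _), pr in joint.items())

# Iterated expectation: average E(X | Y = y) over the distribution of Y
e_of_cond = sum(cond_exp_x_given_y(joint, y) * p for y, p in marginal_y(joint).items())

print(e_x, e_of_cond)
```

Using exact fractions avoids rounding, so the two quantities match exactly rather than approximately.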

2.3.6 Convergence

Convergence in Probability

  • As \(n \rightarrow \infty\), an estimator (a random variable) becomes arbitrarily close to the true value with probability approaching one.
  • The random variable \(\theta_n\) converges in probability to a constant \(c\) if

\[ \lim_{n\to \infty}P(|\theta_n - c| \ge \epsilon) = 0 \]

for any positive \(\epsilon\)


\[ plim(\theta_n)=c \]


\[ \theta_n \rightarrow^p c \]
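To make the definition concrete, the probability \(P(|\theta_n - c| \ge \epsilon)\) can be computed exactly when \(\theta_n\) is the mean of \(n\) Bernoulli(\(p\)) draws and \(c = p\). This is a minimal sketch; the Bernoulli setting, the values of `p` and `eps`, and the function name are assumptions made for illustration.

```python
from math import comb

def prob_far_from_mean(n, p, eps):
    """Exact P(|Xbar_n - p| >= eps) for the mean of n Bernoulli(p) draws."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k / n - p) >= eps
    )

p, eps = 0.5, 0.1
for n in (10, 100, 1000):
    # The probability of a deviation of at least eps shrinks toward 0 as n grows
    print(n, prob_far_from_mean(n, p, eps))
```

The sketch keeps \(n \le 1000\) because the binomial coefficients are converted to floating point, which would overflow for much larger \(n\).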

Properties of Convergence in Probability

  • Slutsky’s Theorem: for a continuous function g(.), if \(plim(\theta_n)= \theta\) then \(plim(g(\theta_n)) = g(\theta)\)

  • if \(\theta_n \rightarrow^p \theta\) and \(\gamma_n \rightarrow^p \gamma\) then

    • \(plim(\theta_n + \gamma_n)=\theta + \gamma\)

    • \(plim(\theta_n \gamma_n) = \theta \gamma\)

    • \(plim(\theta_n/\gamma_n) = \theta/\gamma\) if \(\gamma \neq 0\)

  • These properties also hold for random vectors/matrices

Convergence in Distribution

  • As \(n \rightarrow \infty\), the distribution of a random variable may converge towards another (“fixed”) distribution.
  • The random variable \(X_n\) with CDF \(F_n(x)\) converges in distribution to a random variable \(X\) with CDF \(F(x)\) if

\[ \lim_{n\to \infty}|F_n(x) - F(x)| = 0 \]

at all points of continuity of \(F(x)\)

Notation: \(F(x)\) is the limiting distribution of \(X_n\), written \(X_n \rightarrow^d X\)

  • \(E(X)\) is the limiting mean (asymptotic mean)
  • \(Var(X)\) is the limiting variance (asymptotic variance)


In general, the limiting moments need not equal the limits of the moments:

\[ \begin{aligned} E(X) &\neq \lim_{n\to \infty}E(X_n) \\ Avar(X_n) &\neq \lim_{n\to \infty}Var(X_n) \end{aligned} \]
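A standard construction shows how the limiting mean can differ from the limit of the means (the sequence below is an illustrative example, not one taken from the text above). Let

\[ X_n = \begin{cases} n & \text{with probability } 1/n \\ 0 & \text{with probability } 1 - 1/n \end{cases} \]

Since \(P(X_n \neq 0) = 1/n \rightarrow 0\), \(X_n\) converges in distribution to the constant \(X = 0\), so the limiting mean is \(E(X) = 0\). Yet \(E(X_n) = n \cdot \frac{1}{n} = 1\) for every \(n\), so \(\lim_{n\to \infty}E(X_n) = 1\). Likewise \(Var(X_n) = n - 1 \rightarrow \infty\), while \(Var(X) = 0\).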

Properties of Convergence in Distribution

  • Continuous Mapping Theorem: for a continuous function \(g(\cdot)\), if \(X_n \to^{d} X\) then \(g(X_n) \to^{d} g(X)\)

  • If \(X_n \to^{d} X\) and \(Y_n\to^{d} c\) for a constant \(c\), then

    • \(X_n + Y_n \to^{d} X + c\)

    • \(Y_nX_n \to^{d} cX\)

    • \(X_n/Y_n \to^{d} X/c\) if \(c \neq 0\)

  • These properties also hold for random vectors/matrices

Summary

Properties of Convergence

| Probability | Distribution |
|---|---|
| Slutsky’s Theorem: for a continuous function \(g(\cdot)\), if \(plim(\theta_n)= \theta\) then \(plim(g(\theta_n)) = g(\theta)\) | Continuous Mapping Theorem: for a continuous function \(g(\cdot)\), if \(X_n \to^{d} X\) then \(g(X_n) \to^{d} g(X)\) |
| if \(\theta_n \rightarrow^p \theta\) and \(\gamma_n \rightarrow^p \gamma\) then | if \(X_n \to^{d} X\) and \(Y_n\to^{d} c\), then |
| \(plim(\theta_n + \gamma_n)=\theta + \gamma\) | \(X_n + Y_n \to^{d} X + c\) |
| \(plim(\theta_n \gamma_n) = \theta \gamma\) | \(Y_nX_n \to^{d} cX\) |
| \(plim(\theta_n/\gamma_n) = \theta/\gamma\) if \(\gamma \neq 0\) | \(X_n/Y_n \to^{d} X/c\) if \(c \neq 0\) |

Convergence in Probability is stronger than Convergence in Distribution.

Hence, Convergence in Distribution does not guarantee Convergence in Probability
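A short counterexample illustrates the gap (the construction is a standard one, chosen here for illustration). Let \(X \sim N(0, 1)\) and set \(X_n = -X\) for every \(n\). By symmetry, each \(X_n\) has the same distribution as \(X\), so \(X_n \to^{d} X\) trivially. However,

\[ P(|X_n - X| \ge \epsilon) = P(2|X| \ge \epsilon) \]

does not depend on \(n\) and is strictly positive for every \(\epsilon > 0\), so \(X_n\) does not converge in probability to \(X\).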

2.3.7 Sufficient Statistics


Likelihood

  • The likelihood describes the extent to which the sample provides support for any particular parameter value.
  • Higher support corresponds to a higher value of the likelihood.
  • The exact value of any likelihood is meaningless.
  • The relative value (i.e., comparing two values of \(\theta\)) is informative.

\[ L(\theta_0; y) = P(Y = y | \theta = \theta_0) = f_Y(y;\theta_0) \]

Likelihood Ratio

\[ \frac{L(\theta_0;y)}{L(\theta_1;y)} \]

Likelihood Function

For a given sample, you can compute the likelihood for every possible value of \(\theta\); viewed as a function of \(\theta\), this is called the likelihood function

\[ L(\theta) = L(\theta; y) = f_Y(y;\theta) \]

In a sample of \(n\) independent observations, the likelihood function takes the form of a product

\[ L(\theta) = \prod_{i=1}^{n}f_i (y_i;\theta) \]

Equivalently, the log-likelihood function is a sum

\[ l(\theta) = \sum_{i=1}^{n} \log f_i(y_i;\theta) \]
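As a minimal sketch of these definitions, the code below evaluates the log-likelihood of independent Bernoulli(\(\theta\)) observations at two candidate values of \(\theta\) and forms their likelihood ratio. The sample `ys`, the two candidate values, and the function name are hypothetical choices for illustration.

```python
from math import exp, log

def bernoulli_log_likelihood(theta, ys):
    """l(theta) = sum_i log f(y_i; theta) for independent Bernoulli(theta) data."""
    return sum(log(theta) if y == 1 else log(1 - theta) for y in ys)

ys = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical sample: 6 successes in 8 trials

ll_a = bernoulli_log_likelihood(0.75, ys)
ll_b = bernoulli_log_likelihood(0.50, ys)

# Likelihood ratio L(0.75; y) / L(0.50; y), computed on the log scale
ratio = exp(ll_a - ll_b)
print(ll_a, ll_b, ratio)
```

A ratio above 1 indicates that the sample supports \(\theta = 0.75\) more than \(\theta = 0.50\); neither likelihood value is meaningful on its own.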

Sufficient statistics

  • A statistic, \(T(y)\), is any quantity that can be calculated purely from a sample (independent of \(\theta\))
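  • A statistic is sufficient if it conveys all the available information about the parameter, that is, if the likelihood depends on the sample only through \(T(y)\), up to a factor \(c(y)\) that does not involve \(\theta\):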

\[ L(\theta; y) = c(y)L^*(\theta;T(y)) \]
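As a worked example of this factorization (the Bernoulli model is an illustrative choice), let \(y_1, ..., y_n\) be independent Bernoulli(\(\theta\)) observations. Then

\[ L(\theta; y) = \prod_{i=1}^{n} \theta^{y_i} (1-\theta)^{1-y_i} = \theta^{T(y)} (1-\theta)^{n - T(y)}, \quad T(y) = \sum_{i=1}^{n} y_i \]

so \(c(y) = 1\) and \(L^*(\theta; T(y)) = \theta^{T(y)} (1-\theta)^{n - T(y)}\). The number of successes \(T(y)\) is therefore sufficient for \(\theta\): two samples with the same number of successes yield the same likelihood function, regardless of the order of the observations.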

Nuisance parameters

If we are interested in one parameter (e.g., the mean), the other parameters that still require estimation (e.g., the standard deviation) are nuisance parameters. We can replace the nuisance parameters in the likelihood function with their estimates to create a profile likelihood.

2.3.8 Parameter transformations

Log-odds transformation

\[ \text{Log odds} = g(\theta)= \ln\left[\frac{\theta}{1-\theta}\right] \]
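The transformation maps a probability in \((0, 1)\) to the whole real line, and it can be inverted with the logistic function. The sketch below is illustrative; the function names are introduced here, and the inverse is not stated in the text above.

```python
from math import exp, log

def log_odds(theta):
    """g(theta) = ln(theta / (1 - theta)) for 0 < theta < 1."""
    return log(theta / (1 - theta))

def inverse_log_odds(g):
    """Logistic function: maps a log-odds value back to a probability."""
    return 1 / (1 + exp(-g))

for theta in (0.1, 0.5, 0.8):
    g = log_odds(theta)
    # Round trip: transforming and inverting recovers theta
    print(theta, g, inverse_log_odds(g))
```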