2.1 Matrix Theory

Matrix \mathbf{A} represents the original matrix. It’s a 2 \times 2 matrix with elements a_{ij}, where i indexes the row and j indexes the column.

\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}

\mathbf{A}' is the transpose of \mathbf{A}. The transpose of a matrix flips its rows and columns:

\mathbf{A}' = \begin{bmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{bmatrix}

Fundamental properties and rules of matrices, essential for understanding operations in linear algebra:

  • (\mathbf{ABC})' = \mathbf{C'B'A'} (transpose reverses order in a product)
  • \mathbf{A(B + C)} = \mathbf{AB} + \mathbf{AC} (distributive property)
  • \mathbf{AB} \neq \mathbf{BA} (multiplication is not commutative)
  • (\mathbf{A}')' = \mathbf{A} (double transpose is the original matrix)
  • (\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}' (transpose of a sum is the sum of transposes)
  • (\mathbf{AB})' = \mathbf{B'A'} (transpose reverses order in a product)
  • (\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1} (inverse reverses order in a product)
  • \mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A} (addition is commutative)
  • \mathbf{A}\mathbf{A}^{-1} = \mathbf{I} (matrix times its inverse is the identity)

These properties are critical in solving systems of equations, optimizing models, and performing data transformations.

If a matrix A has an inverse, it is called invertible. If A does not have an inverse, it is referred to as singular.
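As a quick numerical check of these rules, the sketch below (Python with NumPy; the matrices A, B, and S are arbitrary illustrative choices, not taken from the text) verifies the transpose and inverse properties and flags a singular matrix through its zero determinant.

```python
import numpy as np

# Two arbitrary invertible 2 x 2 matrices (illustrative values only)
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
B = np.array([[1.0, 4.0],
              [2.0, 1.0]])

# Transpose reverses order in a product: (AB)' = B'A'
print(np.allclose((A @ B).T, B.T @ A.T))                    # True

# Inverse reverses order in a product: (AB)^{-1} = B^{-1} A^{-1}
print(np.allclose(np.linalg.inv(A @ B),
                  np.linalg.inv(B) @ np.linalg.inv(A)))     # True

# Multiplication is generally not commutative: AB != BA
print(np.allclose(A @ B, B @ A))                            # False

# A matrix with determinant 0 is singular and has no inverse
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])    # second row is twice the first
print(np.isclose(np.linalg.det(S), 0.0))                    # True
```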

The product of two matrices A and B is computed as:

\mathbf{AB} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31} & \sum_{i=1}^{3} a_{1i}b_{i2} & \sum_{i=1}^{3} a_{1i}b_{i3} \\ \sum_{i=1}^{3} a_{2i}b_{i1} & \sum_{i=1}^{3} a_{2i}b_{i2} & \sum_{i=1}^{3} a_{2i}b_{i3} \end{bmatrix}
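To make the summation formula concrete, the following sketch multiplies an arbitrary 2 \times 3 matrix by a 3 \times 3 matrix and rebuilds each entry of the product from the sum \sum_i a_{ri} b_{ij}; the numeric values are illustrative assumptions.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])           # 2 x 3
B = np.array([[7.0,  8.0,  9.0],
              [10.0, 11.0, 12.0],
              [13.0, 14.0, 15.0]])        # 3 x 3

AB = A @ B                                # 2 x 3 product

# Rebuild each entry from the summation formula: (AB)_{rj} = sum_i a_{ri} b_{ij}
manual = np.array([[sum(A[r, i] * B[i, j] for i in range(3)) for j in range(3)]
                   for r in range(2)])

print(AB.shape)                           # (2, 3): a (2x3)(3x3) product is 2x3
print(np.allclose(AB, manual))            # True
```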

Quadratic Form

Let a be a 3×1 vector. The quadratic form involving a matrix B is given by:

\mathbf{a'Ba} = \sum_{i=1}^{3} \sum_{j=1}^{3} a_i b_{ij} a_j

Length of a Vector

The length (or 2-norm) of a vector a, denoted as ||a||, is defined as the square root of the inner product of the vector with itself:

||\mathbf{a}|| = \sqrt{\mathbf{a'a}}
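Both definitions are easy to verify numerically; in the sketch below, the vector a and matrix B are arbitrary illustrative values.

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0])            # 3 x 1 vector
B = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 4.0]])           # 3 x 3 matrix

# Quadratic form a'Ba: matrix products vs. the double sum
quad_matrix = a @ B @ a
quad_sum = sum(a[i] * B[i, j] * a[j] for i in range(3) for j in range(3))
print(np.isclose(quad_matrix, quad_sum))                 # True

# 2-norm: ||a|| = sqrt(a'a)
print(np.isclose(np.sqrt(a @ a), np.linalg.norm(a)))     # True
```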

2.1.1 Rank of a Matrix

The rank of a matrix refers to:

  • The dimension of the space spanned by its columns (or rows).
  • The number of linearly independent columns or rows.

For an n×k matrix A and a k×k matrix B, the following properties hold:

  • \text{rank}(\mathbf{A}) \leq \min(n, k)
  • \text{rank}(\mathbf{A}) = \text{rank}(\mathbf{A'}) = \text{rank}(\mathbf{A'A}) = \text{rank}(\mathbf{AA'})
  • \text{rank}(\mathbf{AB}) \leq \min(\text{rank}(\mathbf{A}), \text{rank}(\mathbf{B})); if \mathbf{B} is non-singular, then \text{rank}(\mathbf{AB}) = \text{rank}(\mathbf{A}).
  • \mathbf{B} is invertible (non-singular) if and only if \text{rank}(\mathbf{B}) = k.
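A quick numerical check of these rank properties, using numpy.linalg.matrix_rank with an arbitrary 4 \times 2 matrix A and a non-singular 2 \times 2 matrix B (illustrative values only):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 1.0],
              [1.0, 0.0]])                # n x k with n = 4, k = 2
B = np.array([[1.0, 1.0],
              [0.0, 2.0]])                # k x k, non-singular

r = np.linalg.matrix_rank
print(r(A) <= min(A.shape))                          # rank(A) <= min(n, k)
print(r(A) == r(A.T) == r(A.T @ A) == r(A @ A.T))    # all equal
print(r(A @ B) == r(A))                              # non-singular B preserves rank
print(r(B) == B.shape[0])                            # full rank <=> invertible
```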

2.1.2 Inverse of a Matrix

In scalar algebra, if a = 0, then 1/a does not exist.

In matrix algebra, a matrix is invertible if it is non-singular, meaning it has a non-zero determinant and its inverse exists. A square matrix \mathbf{A} is invertible if there exists another square matrix \mathbf{B} such that:

\mathbf{AB} = \mathbf{I} \quad \text{(Identity Matrix)}.

In this case, \mathbf{A}^{-1} = \mathbf{B}.

For a 2 \times 2 matrix:

\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}

The inverse is:

\mathbf{A}^{-1} = \frac{1}{ad-bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}

This inverse exists only if ad - bc \neq 0, where ad - bc is the determinant of \mathbf{A}.
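The closed-form 2 \times 2 inverse can be checked against a library routine; the entries below are arbitrary, chosen so that ad - bc \neq 0.

```python
import numpy as np

a, b, c, d = 3.0, 1.0, 4.0, 2.0           # arbitrary entries with ad - bc = 2
A = np.array([[a, b],
              [c, d]])

det = a * d - b * c
A_inv = (1.0 / det) * np.array([[d, -b],
                                [-c, a]])

print(np.allclose(A_inv, np.linalg.inv(A)))     # formula matches the library inverse
print(np.allclose(A @ A_inv, np.eye(2)))        # A A^{-1} = I
```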

For a partitioned block matrix:

\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{(A-BD^{-1}C)^{-1}} & \mathbf{-(A-BD^{-1}C)^{-1}BD^{-1}} \\ \mathbf{-D^{-1}C(A-BD^{-1}C)^{-1}} & \mathbf{D^{-1}+D^{-1}C(A-BD^{-1}C)^{-1}BD^{-1}} \end{bmatrix}

This formula assumes that \mathbf{D} and \mathbf{A - BD^{-1}C} are invertible.
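Because the block-inverse formula is easy to mistype, a numerical spot check is useful. The sketch below uses arbitrary 2 \times 2 blocks (chosen so that \mathbf{D} and the Schur complement \mathbf{A - BD^{-1}C} are invertible) and compares the formula with a direct inverse.

```python
import numpy as np

# Arbitrary 2 x 2 blocks; D and the Schur complement A - B D^{-1} C are invertible here
A = np.array([[4.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0, 2.0], [0.0, 1.0]])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
D = np.array([[5.0, 1.0], [1.0, 4.0]])

inv = np.linalg.inv
S = inv(A - B @ inv(D) @ C)               # inverse of the Schur complement of D

block_inverse = np.block([
    [S,                -S @ B @ inv(D)],
    [-inv(D) @ C @ S,   inv(D) + inv(D) @ C @ S @ B @ inv(D)],
])

M = np.block([[A, B],
              [C, D]])
print(np.allclose(block_inverse, inv(M)))  # True
```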

Properties of the Inverse for Non-Singular Matrices

  1. \mathbf{(A^{-1})^{-1}} = \mathbf{A}
  2. For a non-zero scalar b, \mathbf{(bA)^{-1} = b^{-1}A^{-1}}
  3. For a non-singular matrix \mathbf{B} of the same dimension, \mathbf{(AB)^{-1} = B^{-1}A^{-1}} (the inverse reverses the order of the product).
  4. \mathbf{(A^{-1})' = (A')^{-1}} (the transpose of the inverse equals the inverse of the transpose).
  5. Never notate \mathbf{1/A}; use \mathbf{A^{-1}} instead.

Notes:

  • The determinant of a matrix determines whether it is invertible. For square matrices, a determinant of 0 means the matrix is singular and has no inverse.
  • Always verify the conditions for invertibility, particularly when dealing with partitioned or block matrices.

2.1.3 Definiteness of a Matrix

A symmetric square k \times k matrix \mathbf{A} is classified based on the following conditions:

  • Positive Semi-Definite (PSD): \mathbf{A} is PSD if, for any non-zero k \times 1 vector \mathbf{x}: \mathbf{x'Ax \geq 0}.

  • Positive Definite (PD): \mathbf{A} is PD if, for any non-zero k \times 1 vector \mathbf{x}: \mathbf{x'Ax > 0}.

  • Negative Semi-Definite (NSD): \mathbf{A} is NSD if, for any non-zero k \times 1 vector \mathbf{x}: \mathbf{x'Ax \leq 0}.

  • Negative Definite (ND): \mathbf{A} is ND if, for any non-zero k \times 1 vector \mathbf{x}: \mathbf{x'Ax < 0}.

  • Indefinite: \mathbf{A} is indefinite if it is neither PSD nor NSD.

The identity matrix is always positive definite (PD).

Example

Let \mathbf{x} = (x_1, x_2)', and consider a 2 \times 2 identity matrix \mathbf{I}:

\begin{aligned} \mathbf{x'Ix} &= (x_1, x_2) \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ &= (x_1, x_2) \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ &= x_1^2 + x_2^2 \geq 0. \end{aligned}

Thus, \mathbf{I} is PD because \mathbf{x'Ix} > 0 for all non-zero \mathbf{x}.

Properties of Definiteness

  1. Any variance-covariance matrix is PSD.
  2. A matrix \mathbf{A} is PSD if and only if there exists a matrix \mathbf{B} such that: \mathbf{A = B'B}.
  3. If \mathbf{A} is PSD, then \mathbf{B'AB} is also PSD for any conformable matrix \mathbf{B}.
  4. If \mathbf{A} and \mathbf{C} are positive definite (hence non-singular), then \mathbf{A - C} is PSD if and only if \mathbf{C^{-1} - A^{-1}} is PSD.
  5. If \mathbf{A} is PD (or ND), then \mathbf{A^{-1}} is also PD (or ND).

Notes

  • An indefinite matrix \mathbf{A} is neither PSD nor NSD. This concept does not have a direct counterpart in scalar algebra.
  • If a square matrix is PSD and invertible, then it is PD.

Examples of Definiteness

  1. Invertible / Indefinite: \begin{bmatrix} -1 & 0 \\ 0 & 10 \end{bmatrix}

  2. Non-Invertible / Indefinite: \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}

  3. Invertible / PSD: \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

  4. Non-Invertible / PSD: \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}
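Since \mathbf{x'Ax} = \mathbf{x'}\left(\frac{\mathbf{A} + \mathbf{A'}}{2}\right)\mathbf{x}, the definiteness of the quadratic form can be read off the eigenvalues of the symmetric part of the matrix. A rough sketch that classifies the four examples above (the classify helper and its tolerance are illustrative choices, not standard library functions):

```python
import numpy as np

def classify(A, tol=1e-10):
    """Classify the quadratic form x'Ax using eigenvalues of the symmetric part of A."""
    eig = np.linalg.eigvalsh((A + A.T) / 2)    # real eigenvalues of the symmetric part
    if np.all(eig > tol):
        return "PD"
    if np.all(eig >= -tol):
        return "PSD"
    if np.all(eig < -tol):
        return "ND"
    if np.all(eig <= tol):
        return "NSD"
    return "indefinite"

examples = {
    "invertible / indefinite":       np.array([[-1.0, 0.0], [0.0, 10.0]]),
    "non-invertible / indefinite":   np.array([[0.0, 1.0], [0.0, 0.0]]),
    "invertible / PSD (in fact PD)": np.eye(2),
    "non-invertible / PSD":          np.array([[0.0, 0.0], [0.0, 1.0]]),
}

for name, A in examples.items():
    print(f"{name}: {classify(A)}, det = {np.linalg.det(A):.1f}")
```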

2.1.4 Matrix Calculus

Consider a scalar function y = f(x_1, x_2, \dots, x_k) = f(x), where x = (x_1, x_2, \dots, x_k)' is a k \times 1 vector.

2.1.4.1 Gradient (First-Order Derivative)

The gradient, or the first-order derivative of f(x) with respect to the vector x, is given by:

\frac{\partial f(x)}{\partial x} = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_k} \end{bmatrix}

2.1.4.2 Hessian (Second-Order Derivative)

The Hessian, or the second-order derivative of f(x) with respect to x, is a symmetric matrix defined as:

\frac{\partial^2 f(x)}{\partial x \partial x'} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_k} \\ \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_k \partial x_1} & \frac{\partial^2 f(x)}{\partial x_k \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_k^2} \end{bmatrix}
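As a concrete illustration, the sketch below computes the gradient and Hessian of the hypothetical function f(x_1, x_2) = x_1^2 x_2 + e^{x_2} analytically and confirms both against central finite differences; the function, evaluation point, and step size are assumptions made only for this example.

```python
import numpy as np

def f(x):
    # Illustrative scalar function: f(x1, x2) = x1^2 * x2 + exp(x2)
    return x[0] ** 2 * x[1] + np.exp(x[1])

def grad_f(x):
    # Analytic gradient, stacked as in the definition above
    return np.array([2 * x[0] * x[1],
                     x[0] ** 2 + np.exp(x[1])])

def hess_f(x):
    # Analytic Hessian (symmetric)
    return np.array([[2 * x[1], 2 * x[0]],
                     [2 * x[0], np.exp(x[1])]])

def num_jac(g, x, h=1e-6):
    # Central-difference Jacobian of a vector-valued function g evaluated at x
    J = np.zeros((len(g(x)), len(x)))
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        J[:, i] = (g(x + e) - g(x - e)) / (2 * h)
    return J

x0 = np.array([1.5, -0.5])
print(np.allclose(grad_f(x0), num_jac(lambda x: np.array([f(x)]), x0)[0], atol=1e-5))  # gradient check
print(np.allclose(hess_f(x0), num_jac(grad_f, x0), atol=1e-5))                         # Hessian check
```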

2.1.4.3 Derivative of a Scalar Function with Respect to a Matrix

Let f(\mathbf{X}) be a scalar function, where \mathbf{X} is an n \times p matrix. The derivative is:

\frac{\partial f(\mathbf{X})}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial f(\mathbf{X})}{\partial x_{11}} & \cdots & \frac{\partial f(\mathbf{X})}{\partial x_{1p}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\mathbf{X})}{\partial x_{n1}} & \cdots & \frac{\partial f(\mathbf{X})}{\partial x_{np}} \end{bmatrix}

2.1.4.4 Common Matrix Derivatives

  1. If \mathbf{a} is a vector and \mathbf{A} is a matrix independent of \mathbf{y}:
    • \frac{\partial \mathbf{a'y}}{\partial \mathbf{y}} = \mathbf{a}
    • \frac{\partial \mathbf{y'y}}{\partial \mathbf{y}} = 2\mathbf{y}
    • \frac{\partial \mathbf{y'Ay}}{\partial \mathbf{y}} = (\mathbf{A} + \mathbf{A'})\mathbf{y}
  2. If \mathbf{X} is symmetric:
    • \frac{\partial |\mathbf{X}|}{\partial x_{ij}} = \begin{cases} X_{ii}, & i = j \\ 2X_{ij}, & i \neq j \end{cases} where X_{ij} is the (i,j)-th cofactor of \mathbf{X}.
  3. If \mathbf{X} is symmetric and \mathbf{A} is a matrix independent of \mathbf{X}:
    • \frac{\partial \text{tr}(\mathbf{XA})}{\partial \mathbf{X}} = \mathbf{A} + \mathbf{A'} - \text{diag}(\mathbf{A}).
  4. If \mathbf{X} is symmetric, let \mathbf{J}_{ij} be a matrix with 1 at the (i,j)-th position and 0 elsewhere:
    • \frac{\partial \mathbf{X}^{-1}}{\partial x_{ij}} = \begin{cases} -\mathbf{X}^{-1}\mathbf{J}_{ii}\mathbf{X}^{-1}, & i = j \\ -\mathbf{X}^{-1}(\mathbf{J}_{ij} + \mathbf{J}_{ji})\mathbf{X}^{-1}, & i \neq j \end{cases}.
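These identities lend themselves to finite-difference spot checks; in the sketch below, a, A, and y are arbitrary random draws, and the three vector-derivative rules in item 1 are verified numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=3)
A = rng.normal(size=(3, 3))
y = rng.normal(size=3)

def num_grad(f, y, h=1e-6):
    # Central-difference gradient of a scalar function of a vector
    g = np.zeros_like(y)
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = h
        g[i] = (f(y + e) - f(y - e)) / (2 * h)
    return g

print(np.allclose(num_grad(lambda v: a @ v, y), a, atol=1e-6))                  # d(a'y)/dy = a
print(np.allclose(num_grad(lambda v: v @ v, y), 2 * y, atol=1e-6))              # d(y'y)/dy = 2y
print(np.allclose(num_grad(lambda v: v @ A @ v, y), (A + A.T) @ y, atol=1e-5))  # d(y'Ay)/dy = (A + A')y
```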

2.1.5 Optimization in Scalar and Vector Spaces

Optimization is the process of finding the minimum or maximum of a function. The conditions for optimization differ depending on whether the function involves a scalar or a vector. Below is a comparison of scalar and vector optimization:

  • First-Order Condition: in the scalar case, \frac{\partial f(x_0)}{\partial x} = 0; in the vector case, \frac{\partial f(x_0)}{\partial x} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} (every element of the gradient is zero).

  • Second-Order Condition for a minimum (convex function): in the scalar case, \frac{\partial^2 f(x_0)}{\partial x^2} > 0; in the vector case, the Hessian \frac{\partial^2 f(x_0)}{\partial x \partial x'} is positive definite.

  • Second-Order Condition for a maximum (concave function): in the scalar case, \frac{\partial^2 f(x_0)}{\partial x^2} < 0; in the vector case, the Hessian \frac{\partial^2 f(x_0)}{\partial x \partial x'} is negative definite.

Key Concepts

  1. First-Order Condition:
    • The first-order derivative of the function must equal zero at a critical point. This holds for both scalar and vector functions:
      • In the scalar case, \frac{\partial f(x)}{\partial x} = 0 identifies critical points.
      • In the vector case, \frac{\partial f(x)}{\partial x} is a gradient vector, and the condition is satisfied when all elements of the gradient are zero.
  2. Second-Order Condition:
    • The second-order derivative determines whether the critical point is a minimum, maximum, or saddle point:
      • For scalar functions, \frac{\partial^2 f(x)}{\partial x^2} > 0 implies a local minimum, while \frac{\partial^2 f(x)}{\partial x^2} < 0 implies a local maximum.
      • For vector functions, the Hessian matrix \frac{\partial^2 f(x)}{\partial x \partial x'} must be:
        • Positive Definite: For a minimum (convex function).
        • Negative Definite: For a maximum (concave function).
        • Indefinite: For a saddle point (neither minimum nor maximum).
  3. Convex and Concave Functions:
    • A function f(x) is:
      • Convex if \frac{\partial^2 f(x)}{\partial x^2} > 0 or the Hessian \frac{\partial^2 f(x)}{\partial x \partial x'} is positive definite.
      • Concave if \frac{\partial^2 f(x)}{\partial x^2} < 0 or the Hessian is negative definite.
    • Convexity ensures global optimization for minimization problems, while concavity ensures global optimization for maximization problems.
  4. Hessian Matrix:
    • In vector optimization, the Hessian \frac{\partial^2 f(x)}{\partial x \partial x'} plays a crucial role in determining the nature of critical points:
      • Positive definite Hessian: All eigenvalues are positive.
      • Negative definite Hessian: All eigenvalues are negative.
      • Indefinite Hessian: Eigenvalues have mixed signs.
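To tie these conditions together, consider the convex quadratic f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x'Ax} - \mathbf{b'x} with \mathbf{A} positive definite: the first-order condition \mathbf{Ax} - \mathbf{b} = \mathbf{0} gives the minimizer, and the Hessian is \mathbf{A} itself. A minimal sketch with illustrative \mathbf{A} and \mathbf{b}:

```python
import numpy as np

# Illustrative convex quadratic: f(x) = 0.5 x'Ax - b'x with A positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 4.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b                     # gradient of f

x_star = np.linalg.solve(A, b)                 # first-order condition: A x = b
print(np.allclose(grad(x_star), 0.0))          # gradient vanishes at the critical point

# Second-order condition: the Hessian (here A itself) is positive definite -> minimum
print(np.all(np.linalg.eigvalsh(A) > 0))       # True

# The critical point beats a nearby point, as expected for a global minimum
print(f(x_star) < f(x_star + np.array([0.1, -0.2])))   # True
```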

2.1.6 Cholesky Decomposition

In statistical analysis and numerical linear algebra, decomposing matrices into more tractable forms is crucial for efficient computation. One such important factorization is the Cholesky Decomposition. It applies to Hermitian (in the complex case) or symmetric (in the real case), positive-definite matrices.

Given an n \times n positive-definite matrix A, the Cholesky Decomposition states:

A = L L^{*},

where:

  • L is a lower-triangular matrix with strictly positive diagonal entries.
  • L^{*} denotes the conjugate transpose of L (simply the transpose L^{T} for real matrices).

Cholesky Decomposition is both computationally efficient and numerically stable, making it a go-to technique for many applications—particularly in statistics where we deal extensively with covariance matrices, linear systems, and probability distributions.

Before diving into how we compute a Cholesky Decomposition, we need to clarify what it means for a matrix to be positive-definite. For a real symmetric matrix A:

  • A is positive-definite if for every nonzero vector x, we have x^T A \, x > 0.
  • Alternatively, you can characterize positive-definiteness by noting that all eigenvalues of A are strictly positive.

Many important matrices in statistics—particularly covariance or precision matrices—are both symmetric and positive-definite.


2.1.6.1 Existence

A real n \times n matrix A that is symmetric and positive-definite always admits a Cholesky Decomposition A = L L^T. This theorem guarantees that for any covariance matrix in statistics—assuming it is valid (i.e., positive-definite)—we can decompose it via Cholesky.

2.1.6.2 Uniqueness

If we additionally require that the diagonal entries of L are strictly positive, then L is unique. That is, no other lower-triangular matrix with strictly positive diagonal entries will produce the same factorization. This uniqueness is helpful for ensuring consistent numerical outputs in software implementations.

2.1.6.3 Constructing the Cholesky Factor L

Given a real, symmetric, positive-definite matrix A \in \mathbb{R}^{n \times n}, we want to find the lower-triangular matrix L such that A = LL^T. One way to do this is by using a simple step-by-step procedure (often part of standard linear algebra libraries):

  1. Initialize L to be an n \times n zero matrix.
  2. Iterate through the rows i = 1, 2, \dots, n:
    1. For each row i, compute L_{ii} = \sqrt{A_{ii} - \sum_{k=1}^{i-1} L_{ik}^2}.
    2. For each column j = i+1, i+2, \dots, n: L_{ji} = \frac{1}{L_{ii}} \Bigl(A_{ji} - \sum_{k=1}^{i-1} L_{jk} L_{ik}\Bigr).
    3. All other entries of L remain zero or are computed in subsequent steps.
  3. Result: L is lower-triangular, and L^T is its transpose.
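One possible direct translation of this procedure into Python is sketched below; it is a teaching implementation only, and in practice a library routine such as numpy.linalg.cholesky should be preferred. The 2 \times 2 test matrix is an arbitrary positive-definite example.

```python
import numpy as np

def cholesky_lower(A):
    """Return the lower-triangular L with A = L L^T for a symmetric positive-definite A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for i in range(n):
        # Diagonal entry: subtract the squares of entries already placed in row i
        L[i, i] = np.sqrt(A[i, i] - np.sum(L[i, :i] ** 2))
        # Entries below the diagonal in column i
        for j in range(i + 1, n):
            L[j, i] = (A[j, i] - np.sum(L[j, :i] * L[i, :i])) / L[i, i]
    return L

A = np.array([[9.0, 3.0],
              [3.0, 5.0]])                     # arbitrary positive-definite test matrix
L = cholesky_lower(A)
print(np.allclose(L @ L.T, A))                 # True
print(np.allclose(L, np.linalg.cholesky(A)))   # matches the library routine
```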

Cholesky Decomposition is roughly half the computational cost of a more general LU Decomposition. Specifically, it requires on the order of \frac{1}{3} n^3 floating-point operations (flops), making it significantly more efficient in practice than other decompositions for positive-definite systems.


2.1.6.4 Illustrative Example

Consider a small 3 \times 3 positive-definite matrix:

A = \begin{pmatrix} 4 & 2 & 4 \\ 2 & 5 & 6 \\ 4 & 6 & 20 \end{pmatrix}.

We claim A is positive-definite (one could check by calculating principal minors or verifying x^T A x > 0 for all x \neq 0). We find L step-by-step:

  1. Compute L_{11}:
    L_{11} = \sqrt{A_{11}} = \sqrt{4} = 2.
  2. Compute L_{21} and L_{31}:
    • L_{21} = \frac{A_{21}}{L_{11}} = \frac{2}{2} = 1.
    • L_{31} = \frac{A_{31}}{L_{11}} = \frac{4}{2} = 2.
  3. Compute L_{22}:
    L_{22} = \sqrt{A_{22} - L_{21}^2} = \sqrt{5 - 1^2} = \sqrt{4} = 2.
  4. Compute L_{32}:
    L_{32} = \frac{A_{32} - L_{31} L_{21}}{L_{22}} = \frac{6 - (2)(1)}{2} = \frac{4}{2} = 2.
  5. Compute L_{33}:
    L_{33} = \sqrt{A_{33} - (L_{31}^2 + L_{32}^2)} = \sqrt{20 - (2^2 + 2^2)} = \sqrt{20 - 8} = \sqrt{12} = 2\sqrt{3}.

Thus,

L = \begin{pmatrix} 2 & 0 & 0 \\ 1 & 2 & 0 \\ 2 & 2 & 2\sqrt{3} \end{pmatrix}.

One can verify L L^T = A.
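The hand computation can also be confirmed with a library routine:

```python
import numpy as np

A = np.array([[4.0, 2.0, 4.0],
              [2.0, 5.0, 6.0],
              [4.0, 6.0, 20.0]])

L = np.linalg.cholesky(A)        # lower-triangular factor
print(L)                         # rows: [2, 0, 0], [1, 2, 0], [2, 2, 2*sqrt(3) = 3.4641...]
print(np.allclose(L @ L.T, A))   # True
```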


2.1.6.5 Applications in Statistics

2.1.6.5.1 Solving Linear Systems

A common statistical problem is solving A x = b for x. For instance, in regression or in computing Bayesian posterior modes, we often need to solve linear equations with covariance or precision matrices. With A = LL^T:

  1. Forward Substitution: Solve L \, y = b.
  2. Backward Substitution: Solve L^T x = y.

This two-step process is more stable and efficient than directly inverting A (which is typically discouraged due to numerical issues).
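One way to write this two-step solve in Python is sketched below, using SciPy's triangular solvers (the right-hand side b is an arbitrary illustrative vector); scipy.linalg.cho_factor and cho_solve package the same forward and backward substitutions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, cholesky, solve_triangular

A = np.array([[4.0, 2.0, 4.0],
              [2.0, 5.0, 6.0],
              [4.0, 6.0, 20.0]])
b = np.array([1.0, 2.0, 3.0])         # arbitrary right-hand side

# Explicit two-step solve: L y = b (forward), then L' x = y (backward)
L = cholesky(A, lower=True)
y = solve_triangular(L, b, lower=True)
x = solve_triangular(L.T, y, lower=False)
print(np.allclose(A @ x, b))          # True

# The same solve, packaged by SciPy
c, low = cho_factor(A, lower=True)
print(np.allclose(cho_solve((c, low), b), x))   # True
```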

2.1.6.5.2 Generating Correlated Random Vectors

In simulation-based statistics (e.g., Monte Carlo methods), we often need to generate random draws from a multivariate normal distribution \mathcal{N}(\mu, \Sigma), where \Sigma is the covariance matrix. The steps are:

  1. Generate a vector z \sim \mathcal{N}(0, I) of independent standard normal variables.
  2. Compute x = \mu + Lz, where \Sigma = LL^T.

Then x has the desired covariance structure \Sigma. This technique is widely used in Bayesian statistics (e.g., MCMC sampling) and financial modeling (e.g., portfolio simulations).
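A minimal sketch of this recipe, assuming an illustrative mean vector and covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(42)

mu = np.array([1.0, -2.0])                     # illustrative mean vector
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                 # illustrative covariance matrix (PD)

L = np.linalg.cholesky(Sigma)                  # Sigma = L L^T

# Step 1: z ~ N(0, I); Step 2: x = mu + L z (here applied to many draws at once)
n_draws = 100_000
Z = rng.standard_normal(size=(n_draws, 2))
X = mu + Z @ L.T                               # each row is one draw x = mu + L z

print(X.mean(axis=0))                          # close to mu
print(np.cov(X, rowvar=False))                 # close to Sigma
```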

2.1.6.5.3 Gaussian Processes and Kriging

In Gaussian Process modeling (common in spatial statistics, machine learning, and geostatistics), we frequently work with large covariance matrices that describe the correlations between observed data points:

\Sigma = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix},

where k(\cdot, \cdot) is a covariance (kernel) function. We may need to invert or factorize \Sigma repeatedly to evaluate the log-likelihood:

\log \mathcal{L}(\theta) \sim - \tfrac{1}{2} \bigl( y - m(\theta) \bigr)^T \Sigma^{-1} \bigl( y - m(\theta) \bigr) - \tfrac{1}{2} \log \bigl| \Sigma \bigr|,

where m(\theta) is the mean function and \theta are parameters. Using the Cholesky factor L of \Sigma helps:

  • Terms involving \Sigma^{-1} can be evaluated by solving triangular systems with L instead of explicitly computing the inverse.
  • \log|\Sigma| can be computed as 2 \sum_{i=1}^n \log L_{ii}.

Hence, Cholesky Decomposition becomes the backbone of Gaussian Process computations.
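A hedged sketch of this pattern for a zero-mean Gaussian log-likelihood is given below; the squared-exponential kernel, the noise term, and the simulated data are assumptions made only for illustration.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_log_likelihood(x, y, noise=0.1):
    """Zero-mean Gaussian log-likelihood evaluated through the Cholesky factor of Sigma."""
    Sigma = rbf_kernel(x, x) + noise ** 2 * np.eye(len(x))   # noise term keeps Sigma PD
    L = cholesky(Sigma, lower=True)
    alpha = solve_triangular(L, y, lower=True)               # solve L alpha = y
    quad = alpha @ alpha                                     # y' Sigma^{-1} y without inverting
    logdet = 2.0 * np.sum(np.log(np.diag(L)))                # log|Sigma| from the factor
    return -0.5 * quad - 0.5 * logdet - 0.5 * len(x) * np.log(2.0 * np.pi)

x = np.linspace(0.0, 5.0, 20)
y = np.sin(x) + 0.1 * np.random.default_rng(0).standard_normal(20)
print(gp_log_likelihood(x, y))
```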

2.1.6.5.4 Bayesian Inference with Covariance Matrices

Many Bayesian models—especially hierarchical models—assume a multivariate normal prior on parameters. Cholesky Decomposition is used to:

  • Sample from these priors or from posterior distributions.
  • Regularize large covariance matrices.
  • Speed up Markov Chain Monte Carlo (MCMC) computations by factorizing covariance structures.

2.1.6.6 Other Notes

  1. Numerical Stability Considerations

Cholesky Decomposition is considered more stable than a general LU Decomposition when applied to positive-definite matrices. Since no row or column pivots are required, rounding errors can be smaller. Of course, in practice, software implementations can vary, and extremely ill-conditioned matrices can still pose numerical challenges.

  2. Why We Don’t Usually Compute \mathbf{A}^{-1}

It is common in statistics (especially in older texts) to see formulas involving \Sigma^{-1}. However, computing an inverse explicitly is often discouraged because:

  • It is numerically less stable.
  • It requires more computations.
  • Many tasks that appear to need \Sigma^{-1} can be done more efficiently by solving systems via the Cholesky factor L.

Hence, “solve, don’t invert” is a common mantra. If you see an expression like \Sigma^{-1} b, you can use the Cholesky factors L and L^T to solve \Sigma x = b by forward and backward substitution, bypassing the direct inverse calculation.

  3. Further Variants and Extensions
  • Incomplete Cholesky: Sometimes used in iterative solvers where a full Cholesky factorization is too expensive, especially for large sparse systems.
  • LDL^T Decomposition: A variant that avoids taking square roots; used for positive semi-definite or indefinite systems, but with caution about pivoting strategies.