Chapter 2 Types of Matrices

“I, the Triangle divine, work out God’s will within the square and serve my fellowmen.” -Mantram


Matrices are versatile structures that form the backbone of numerous mathematical operations. Among them, square, diagonal, identity, and zero matrices stand out due to their unique properties and applications.

2.1 Square Matrices

A square matrix has an equal number of rows and columns, denoted as an \(n \times n\) matrix. This structure permits operations such as computing determinants and eigenvalues, which are not defined for non-square matrices.

2.2 Diagonal Matrices

A diagonal matrix is a special kind of square matrix where all off-diagonal elements are zero. Only the elements on the main diagonal may be non-zero. Diagonal matrices simplify many computations, as their determinant is simply the product of the diagonal elements, and they are easy to invert when all diagonal elements are non-zero.
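
For example, for a \(2 \times 2\) diagonal matrix the determinant and inverse can be read off directly:

\[ D = \begin{bmatrix} 2 & 0 \\ 0 & 5 \end{bmatrix}, \qquad \det(D) = 2 \cdot 5 = 10, \qquad D^{-1} = \begin{bmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{5} \end{bmatrix} \]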

2.3 Identity Matrices

The identity matrix is a diagonal matrix where all the diagonal elements are equal to 1. It serves as the multiplicative identity in matrix operations, much like the number 1 in arithmetic. Multiplying any square matrix by an identity matrix of compatible size leaves the original matrix unchanged.
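
For instance, for any \(2 \times 2\) matrix:

\[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \]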

2.4 Zero Matrices

A zero matrix is one where all elements are zero. It acts as the additive identity in matrix operations, meaning that adding a zero matrix to any other matrix leaves the original matrix unchanged.

2.5 Properties of Matrices

Matrices exhibit several key properties that are crucial for their manipulation and application in linear algebra:

  • Dimensions: Matrices can be characterized by their dimensions, expressed as \(m \times n\), where \(m\) is the number of rows and \(n\) is the number of columns.
  • Determinant: A square matrix has a determinant, a scalar value that provides important information about the matrix, such as whether it is invertible.
  • Invertibility: A matrix is invertible if there exists another matrix that yields the identity matrix when multiplied with it. Only square matrices can be invertible.
  • Rank: The rank of a matrix is the maximum number of linearly independent row or column vectors in the matrix. It indicates the dimension of the vector space spanned by its rows or columns.
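
These properties can be checked numerically; below is a minimal NumPy sketch (the matrix entries are arbitrary illustrative values):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # a 2 x 2 square matrix

print(A.shape)                      # dimensions: (2, 2)
print(np.linalg.det(A))             # determinant: 2*3 - 1*1 = 5.0
print(np.linalg.matrix_rank(A))     # rank: 2 (rows are linearly independent)
print(np.linalg.inv(A))             # inverse exists because det(A) != 0
```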

Square matrices of order 2, 3, 4, and 5 (that is, \(2 \times 2\) through \(5 \times 5\)):

\[\begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \begin{bmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{bmatrix} \begin{bmatrix} a & b & c & d & e \\ f & g & h & i & j \\ k & l & m & n & o \\ p & q & r & s & t \\ u & v & w & x & y \end{bmatrix}\]

Identity matrices of order 2, 3, 4, and 5:

\[\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}\]

Diagonal matrices of order 2, 3, 4, and 5:

\[\begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix} \begin{bmatrix} a & 0 & 0 \\ 0 & b & 0 \\ 0 & 0 & c \end{bmatrix} \begin{bmatrix} a & 0 & 0 & 0 \\ 0 & b & 0 & 0 \\ 0 & 0 & c & 0 \\ 0 & 0 & 0 & d \end{bmatrix} \begin{bmatrix} a & 0 & 0 & 0 & 0 \\ 0 & b & 0 & 0 & 0 \\ 0 & 0 & c & 0 & 0 \\ 0 & 0 & 0 & d & 0 \\ 0 & 0 & 0 & 0 & e \end{bmatrix}\]

2.6 Groups of Matrices

There are groups of matrices in the algebraic sense, and they play a significant role in various areas of mathematics, particularly in linear algebra and group theory. Here’s how matrices can form groups:

  1. Matrix Groups: A matrix group is a set of matrices that satisfies the group axioms under matrix multiplication. These axioms are:

    • Closure: If \(A\) and \(B\) are matrices in the group, then the product \(AB\) is also in the group.
    • Associativity: Matrix multiplication is associative, meaning for any matrices \(A\), \(B\), and \(C\), the equation \((AB)C = A(BC)\) holds.
    • Identity Element: There exists an identity matrix \(I\) in the group such that for any matrix \(A\) in the group, \(AI = IA = A\).
    • Inverse Element: For each matrix \(A\) in the group, there exists an inverse matrix \(A^{-1}\) such that \(AA^{-1} = A^{-1}A = I\).
  2. Non-Commutativity: While matrix multiplication is associative, it is generally not commutative. This means that for matrices \(A\) and \(B\), \(AB\) may not equal \(BA\). However, non-commutativity does not prevent matrices from forming groups; it simply means that these groups are non-abelian (i.e., not commutative), as illustrated in the sketch after this list.

  3. Examples of Matrix Groups:

    • General Linear Group \(GL(n, \mathbb{R})\): This is the group of all invertible \(n \times n\) matrices with real entries. It is a fundamental example of a matrix group.
    • Special Linear Group \(SL(n, \mathbb{R})\): This consists of all \(n \times n\) matrices with determinant 1. It is a subgroup of the general linear group.
    • Orthogonal Group \(O(n)\): This group includes all \(n \times n\) orthogonal matrices, which satisfy \(A^TA = I\).
    • Unitary Group \(U(n)\): This consists of all \(n \times n\) unitary matrices, which satisfy \(A^*A = I\), where \(A^*\) is the conjugate transpose of \(A\).
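
The sketch below (an illustrative NumPy check, with arbitrarily chosen matrices) verifies that a rotation matrix belongs to both \(SL(2, \mathbb{R})\) and \(O(2)\), and shows that matrix multiplication is not commutative:

```python
import numpy as np

theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation matrix

print(np.isclose(np.linalg.det(R), 1.0))          # True: det(R) = 1, so R is in SL(2, R)
print(np.allclose(R.T @ R, np.eye(2)))            # True: R^T R = I, so R is in O(2)

A = np.array([[1.0, 2.0], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [3.0, 1.0]])
print(np.allclose(A @ B, B @ A))                  # False: multiplication is not commutative
```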

2.7 Optional Reading

Example of a conjugate transpose

Let’s work through an example of finding the conjugate transpose (also known as the Hermitian transpose) of a matrix.

Consider the following complex matrix \(A\):

\[ A = \begin{pmatrix} 1 + 2i & 3 - i \\ 2 + i & 4 \end{pmatrix} \]

To find the conjugate transpose \(A^*\), you need to:

  1. Take the complex conjugate of each element in the matrix. This means changing the sign of the imaginary part.
  2. Transpose the matrix, which involves swapping rows and columns.

Let’s apply these steps:

  1. Complex Conjugate:
    • The complex conjugate of \(1 + 2i\) is \(1 - 2i\).
    • The complex conjugate of \(3 - i\) is \(3 + i\).
    • The complex conjugate of \(2 + i\) is \(2 - i\).
    • The complex conjugate of \(4\) (a real number) is \(4\).
  2. Transpose:
    • Swap the rows and columns of the matrix.

So, the conjugate transpose \(A^*\) is:

\[ A^* = \begin{pmatrix} 1 - 2i & 2 - i \\ 3 + i & 4 \end{pmatrix} \]

This matrix \(A^*\) is the conjugate transpose of the original matrix \(A\).
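
The same computation can be verified with NumPy (a minimal sketch):

```python
import numpy as np

A = np.array([[1 + 2j, 3 - 1j],
              [2 + 1j, 4 + 0j]])

A_star = A.conj().T        # complex conjugate of each entry, then transpose
print(A_star)
# [[1.-2.j 2.-1.j]
#  [3.+1.j 4.-0.j]]
```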

When dealing with real numbers, the concept of a conjugate transpose simplifies to just the transpose of the matrix, since the complex conjugate of a real number is the number itself. Let’s go through an example using a real matrix.

Consider the following real matrix \(B\):

\[ B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \]

To find the transpose of this matrix (which is the same as the conjugate transpose for real matrices), you swap the rows and columns:

The transpose \(B^T\) is:

\[ B^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix} \]

This matrix \(B^T\) is the transpose of the original matrix \(B\). For real matrices, the conjugate transpose is simply the transpose.

In summary, matrices can indeed form groups, and these groups are essential in understanding the structure and symmetries of various mathematical systems. The lack of commutativity in matrix multiplication simply means that many matrix groups are non-abelian, which adds to their richness and complexity.

2.7.1 GLG and GLM

General linear groups (GLG) and generalized linear models (GLM) are related in that they both involve linear algebra concepts, but they serve different purposes and are used in different contexts.

  1. General Linear Group (GL(n, R)):
    • This is a mathematical concept from group theory and linear algebra.
    • It consists of all invertible \(n \times n\) matrices with real (or complex) entries.
    • The group is defined under the operation of matrix multiplication.
    • It is used in various areas of mathematics to study symmetries and transformations, particularly in geometry and algebra.
  2. Generalized Linear Models (GLMs):
    • This is a statistical concept used in data analysis and modeling.
    • GLMs extend linear regression models to allow for response variables that have error distribution models other than a normal distribution.
    • They are used to model relationships between a dependent variable and one or more independent variables.
    • GLMs include models like logistic regression, Poisson regression, and others, which are used for different types of data (e.g., binary, count data).

While both involve linear algebra, the general linear group is a theoretical construct used in pure mathematics, whereas generalized linear models are practical tools used in statistics for data analysis. They are not directly related but share a common foundation in linear algebra.

Generalized Linear Models (GLMs) use matrix algebra extensively to solve problems. Here’s how matrix algebra is involved in GLMs:

  1. Model Representation: GLMs are often represented in matrix form, which allows for efficient computation and manipulation. In the Gaussian (ordinary linear regression) case, the model can be expressed as:

    \[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

    where:

    • \(\mathbf{Y}\) is the vector of observed responses.
    • \(\mathbf{X}\) is the matrix of predictor variables (design matrix).
    • \(\boldsymbol{\beta}\) is the vector of coefficients to be estimated.
    • \(\boldsymbol{\epsilon}\) is the vector of errors.
  2. Estimation of Coefficients: The coefficients \(\boldsymbol{\beta}\) are typically estimated using methods like maximum likelihood estimation, which involves solving equations that are often expressed and solved using matrix operations.

  3. Matrix Operations: Operations such as matrix multiplication, transposition, and inversion are commonly used in the estimation process and in deriving statistical properties of the estimators.

  4. Efficiency: Using matrix algebra allows for efficient computation, especially when dealing with large datasets and multiple predictors, as it leverages optimized numerical algorithms.

In summary, matrix algebra is a fundamental tool in the implementation and computation of GLMs, enabling the handling of complex models and large datasets efficiently.
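
As a concrete illustration, the Gaussian special case of a GLM (ordinary least squares) has the closed-form solution \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\), which uses exactly the operations listed above. A minimal NumPy sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix with an intercept column
beta_true = np.array([1.0, 2.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)        # simulated responses with Gaussian errors

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y              # (X'X)^{-1} X'Y: multiplication, transpose, inversion
print(beta_hat)                                          # approximately [1.0, 2.0]
```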


Example of a Generalized Linear Model (GLM) Analysis: Logistic Regression

2.7.2 Dataset Description

Let’s consider a dataset from a healthcare study aimed at understanding the factors influencing the likelihood of developing a particular disease. The dataset contains the following variables for 500 patients:

  • Age: The age of the patient.
  • Gender: Male or Female.
  • BMI: Body Mass Index.
  • Smoking Status: Categorical variable indicating whether the patient is a current smoker, former smoker, or never smoked.
  • Disease Status: A binary outcome variable where 1 indicates the presence of the disease and 0 indicates its absence.

2.7.3 Type of GLM Used

For this analysis, we’ll use a logistic regression model, which is suitable for predicting binary outcomes. Logistic regression models the probability that a binary dependent variable equals one of the categories (e.g., disease presence).

2.7.4 Steps Involved in the Analysis

  1. Preprocessing the Data:
    • Convert categorical variables such as Gender and Smoking Status into numerical format using techniques like one-hot encoding.
  2. Model Setup:
    • Define the logistic regression model with Disease Status as the dependent variable and Age, Gender, BMI, and Smoking Status as independent variables.
  3. Estimation of Coefficients:
    • The log-odds of having the disease are modeled as a linear function of the predictors (the probability itself is recovered by applying the logistic function to the log-odds): \[ \text{log-odds} = \beta_0 + \beta_1(\text{Age}) + \beta_2(\text{Gender}) + \beta_3(\text{BMI}) + \beta_4(\text{Smoking Status}) \]
    • Use maximum likelihood estimation to find the coefficients \(\beta_0, \beta_1, \beta_2, \beta_3,\) and \(\beta_4\).
  4. Model Fitting:
    • Fit the logistic regression model to the dataset using statistical software or programming libraries such as R, Python’s statsmodels, or scikit-learn (a minimal Python sketch follows this list).
  5. Validation:
    • Check the model’s goodness-of-fit using metrics like the Akaike Information Criterion (AIC) and perform cross-validation to assess predictive accuracy.
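
A minimal Python sketch of these steps using pandas and statsmodels (the file name and column names are hypothetical; adapt them to the actual dataset):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical dataset with columns: age, gender, bmi, smoking_status, disease
df = pd.read_csv("disease_study.csv")

# 1. Preprocessing: one-hot encode the categorical predictors
X = pd.get_dummies(df[["age", "gender", "bmi", "smoking_status"]],
                   drop_first=True, dtype=float)
X = sm.add_constant(X)                                   # add the intercept column
y = df["disease"]

# 2-4. Fit the logistic regression (a binomial GLM with logit link) by maximum likelihood
result = sm.GLM(y, X, family=sm.families.Binomial()).fit()

# 5. Inspect coefficients, standard errors, and the AIC
print(result.summary())
print("AIC:", result.aic)
```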

2.7.5 Interpretation of Results

Assume the logistic regression yields the following coefficients:

  • \(\beta_0 = -2.5\)
  • \(\beta_1 = 0.03\) for Age
  • \(\beta_2 = 0.5\) for Gender (Male)
  • \(\beta_3 = 0.1\) for BMI
  • \(\beta_4 = 1.2\) for Smoking Status (Current Smoker)

Interpretation:

  • Intercept (\(\beta_0\)): The baseline log-odds of having the disease is \(-2.5\) when all predictors are zero.
  • Age: For each additional year of age, the log-odds of disease presence increases by 0.03, indicating older age is associated with a higher likelihood of the disease.
  • Gender: Males have 0.5 higher log-odds of having the disease compared to females, suggesting a gender effect.
  • BMI: Each unit increase in BMI is associated with a 0.1 increase in the log-odds of the disease, indicating a positive association with body mass.
  • Smoking Status: Current smokers have 1.2 higher log-odds of disease presence compared to the reference category (never smokers), highlighting smoking as a significant risk factor.

This GLM analysis provides insights into how various demographic and lifestyle factors contribute to the risk of developing the disease, offering valuable guidance for targeted prevention strategies.
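
To see how these coefficients translate into a predicted probability, consider a hypothetical patient: a 45-year-old male current smoker with a BMI of 28. Plugging the values into the fitted log-odds equation and applying the logistic (inverse-logit) function:

```python
import math

# Hypothetical patient: age 45, male, BMI 28, current smoker
log_odds = -2.5 + 0.03 * 45 + 0.5 * 1 + 0.1 * 28 + 1.2 * 1   # = 3.35
prob = 1 / (1 + math.exp(-log_odds))                          # logistic (inverse-logit) transform
print(round(log_odds, 2), round(prob, 3))                     # 3.35, 0.966
```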

2.7.6 Matrix algebra in GLM

Matrix algebra plays a pivotal role in estimating coefficients for a Generalized Linear Model (GLM), particularly in logistic regression. Here’s a step-by-step breakdown of the process:

2.7.6.1 Design Matrix and Vectors

  1. Design Matrix (\(\mathbf{X}\)):
    • The design matrix \(\mathbf{X}\) contains the predictor variables for all observations. Each row represents an observation, and each column represents a predictor variable, including a column for the intercept.
  2. Response Vector (\(\mathbf{Y}\)):
    • The response vector \(\mathbf{Y}\) contains the observed binary outcomes for each observation, typically coded as 0 or 1.
  3. Coefficient Vector (\(\boldsymbol{\beta}\)):
    • The coefficient vector \(\boldsymbol{\beta}\) contains the parameters to be estimated, including the intercept and coefficients for each predictor variable.

2.7.6.2 Estimation via Maximum Likelihood

In logistic regression, we model the log-odds of the probability of the outcome as a linear combination of the predictors:

\[ \text{logit}(p) = \mathbf{X}\boldsymbol{\beta} \]

where \(p\) is the probability of the outcome being 1.
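
Writing the logit out explicitly and inverting it gives the probability as a function of the linear predictor:

\[ \log\frac{p}{1-p} = \mathbf{X}\boldsymbol{\beta} \quad\Longleftrightarrow\quad p = \frac{1}{1 + e^{-\mathbf{X}\boldsymbol{\beta}}} \]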

2.7.6.3 Maximum Likelihood Estimation (MLE)

The coefficients \(\boldsymbol{\beta}\) are estimated by maximizing the likelihood function, which is often done using iterative algorithms due to the non-linear nature of the logistic function. Here’s how matrix operations are involved:

  1. Likelihood Function:
    • The likelihood function is based on the probability distribution of the data. For logistic regression, it’s derived from the Bernoulli distribution.
  2. Iterative Optimization:
    • Iterative methods such as the Newton-Raphson or Fisher Scoring algorithm are employed to find the optimal coefficients. These methods involve matrix operations like multiplication and inversion.

2.7.7 Newton-Raphson Method

This method involves the following steps:

  1. Compute the Score Vector (\(\mathbf{U}\)):
    • The score vector \(\mathbf{U} = \mathbf{X}^T(\mathbf{Y} - \boldsymbol{\mu})\) measures the gradient of the log-likelihood function, where \(\boldsymbol{\mu}\) is the expected response vector based on the current estimates of \(\boldsymbol{\beta}\).
  2. Compute the Hessian Matrix (\(\mathbf{H}\)):
    • The Hessian matrix \(\mathbf{H} = -\mathbf{X}^T\mathbf{W}\mathbf{X}\) represents the second derivative of the log-likelihood function, where \(\mathbf{W}\) is a diagonal matrix of weights based on the variance of the observations.
  3. Update Coefficients:
    • The coefficients are updated iteratively using: \[ \boldsymbol{\beta}_{\text{new}} = \boldsymbol{\beta}_{\text{old}} - \mathbf{H}^{-1}\mathbf{U} \]
    • This involves matrix inversion (\(\mathbf{H}^{-1}\)) and multiplication, and the update is repeated until the estimates converge (see the sketch after this list).
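
A minimal NumPy sketch of this update loop for logistic regression (the data are simulated, and the starting values, tolerance, and iteration cap are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # design matrix with an intercept column
beta_true = np.array([-0.5, 1.0])
Y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))    # simulated binary responses

beta = np.zeros(X.shape[1])                              # starting values
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))                     # expected responses under current beta
    U = X.T @ (Y - mu)                                   # score vector
    W = np.diag(mu * (1 - mu))                           # diagonal weight matrix (Bernoulli variances)
    H = -X.T @ W @ X                                     # Hessian of the log-likelihood
    step = np.linalg.solve(H, U)                         # equivalent to H^{-1} U
    beta = beta - step                                   # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-8:
        break

print(beta)                                              # approximately [-0.5, 1.0]
```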

2.7.7.1 Role of Matrix Inversion and Multiplication

Matrix inversion is crucial in solving systems of equations that define the update steps in iterative methods. Efficient computation of these operations ensures that the algorithm converges quickly and accurately to the optimal coefficient estimates.

Thus, matrix algebra provides a structured and efficient framework to estimate the coefficients in GLMs, enabling the handling of complex models and large datasets with precision and speed.

2.7.8 Contrasting GLM and GLG

Let’s break down the relationship between the matrix operations used in Generalized Linear Models (GLMs) and those in the General Linear Group \(\text{GL}(n, \mathbb{R})\).

2.7.8.1 Matrix Operations in GLMs

  1. Matrix Multiplication: Used extensively in GLMs for operations like calculating predicted values and updating coefficients.
  2. Matrix Inversion: Often used in the estimation of coefficients, especially in methods like the Newton-Raphson algorithm.
  3. Transpose: Used in various calculations, such as computing the score vector or Hessian matrix.
  4. Addition and Subtraction: Used in iterative methods to update estimates.

2.7.8.2 Operations in the General Linear Group \(\text{GL}(n, \mathbb{R})\)

  1. Matrix Multiplication: The primary operation defining the group, focusing on invertible matrices.
  2. Inversion: Every element (matrix) in the group has an inverse within the group.
  3. Identity Matrix: Serves as the identity element in the group.

2.7.8.3 Differences and Overlaps

  • Inclusion in GLMs but not in \(\text{GL}(n, \mathbb{R})\):
    • Addition and Subtraction: These operations are not part of the group structure in \(\text{GL}(n, \mathbb{R})\), which focuses on multiplication and inversion.
    • Non-Invertible Matrices: GLMs may involve non-invertible matrices in intermediate calculations, whereas \(\text{GL}(n, \mathbb{R})\) only includes invertible matrices.
  • Inclusion in \(\text{GL}(n, \mathbb{R})\) but not specifically highlighted in GLMs:
    • Group Structure: The focus on group properties like closure, associativity, and the existence of inverses is specific to \(\text{GL}(n, \mathbb{R})\) and not a direct concern in GLM computations.

In summary, while GLMs use many matrix operations that are also relevant to the general linear group, they do not strictly adhere to the group structure of \(\text{GL}(n, \mathbb{R})\). The operations in GLMs are more about practical computation for statistical estimation, whereas \(\text{GL}(n, \mathbb{R})\) is about the theoretical properties of invertible matrices under multiplication.

2.7.9 GLG for complex numbers

General Linear Groups can be defined over other fields besides the real numbers, including the complex numbers. The notation for the General Linear Group over the complex numbers is \(\text{GL}(n, \mathbb{C})\), where \(\mathbb{C}\) represents the field of complex numbers.

2.7.10 General Linear Groups Over Different Fields

  1. \(\text{GL}(n, \mathbb{C})\):
    • This group consists of all \(n \times n\) invertible matrices with complex number entries.
    • It is used in contexts where complex numbers are natural, such as quantum mechanics and certain areas of engineering and physics.
  2. Other Fields:
    • General Linear Groups can be defined over any field, not just \(\mathbb{R}\) or \(\mathbb{C}\). This includes finite fields, which are used in coding theory and cryptography.

2.7.11 Why Use \(\text{GL}(n, \mathbb{R})\)?

  • Real Numbers: Many applications in statistics, economics, and natural sciences naturally involve real numbers, making \(\text{GL}(n, \mathbb{R})\) a practical choice.
  • Simplicity: Real numbers are often simpler to work with computationally compared to complex numbers, especially in applications where imaginary components are not needed.
  • Specific Applications: Some problems are inherently real-valued, such as those involving measurements or probabilities, which are naturally modeled using real numbers.

2.7.12 Why Use \(\text{GL}(n, \mathbb{C})\)?

  • Complex Analysis: In fields like electrical engineering and quantum physics, complex numbers are essential, making \(\text{GL}(n, \mathbb{C})\) more appropriate.
  • Mathematical Completeness: Complex numbers provide a more complete field, where every non-constant polynomial has a root, which can be advantageous in theoretical mathematics.

In summary, the choice of field for a General Linear Group depends on the context and requirements of the problem at hand. While \(\text{GL}(n, \mathbb{R})\) is common due to the prevalence of real numbers in many applications, \(\text{GL}(n, \mathbb{C})\) and other fields are equally valid and useful in their respective domains.

2.7.13 Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a method used to estimate the parameters of a statistical model. It is primarily associated with frequentist statistical inference, but it also plays a role in Bayesian inference. Here’s a breakdown of its use in both contexts:

  • Definition: MLE seeks to find the parameter values that maximize the likelihood function, which measures how well the model explains the observed data. The likelihood function is constructed based on the probability of the observed data given the parameters.

  • Process:

    1. Define the Likelihood Function: Based on the probability distribution of the data.
    2. Maximize the Likelihood: Use calculus or numerical methods to find the parameter values that maximize this function (see the sketch after this list).
  • Applications: MLE is widely used in various fields, including economics, biology, and machine learning, for parameter estimation in models like linear regression, logistic regression, and more complex models.
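
As a simple concrete case, consider a Bernoulli model with hypothetical coin-flip data; the likelihood can be maximized numerically over a grid, and the maximizer agrees with the closed-form answer (the sample mean):

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])     # hypothetical 0/1 observations (7 successes in 10)

def log_likelihood(p):
    # Bernoulli log-likelihood: log p for each success, log(1 - p) for each failure
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 981)
p_hat = grid[np.argmax([log_likelihood(p) for p in grid])]
print(p_hat, data.mean())                            # both approximately 0.7
```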

2.7.13.1 Use in Frequentist Inference

  • Primary Role: MLE is a cornerstone of frequentist inference. It provides point estimates of parameters without incorporating prior beliefs or distributions.
  • Properties: Under certain conditions, MLEs are consistent (converge to the true parameter value as sample size increases), asymptotically efficient (achieve the lowest possible asymptotic variance), and asymptotically normal (their distribution approaches normality as sample size grows).

2.7.13.2 Use in Bayesian Inference

  • Role in Bayesian Inference: While MLE itself is a frequentist method, it can be used within Bayesian frameworks.
    • Prior Distribution: In Bayesian inference, prior beliefs about parameters are combined with the likelihood (often derived from MLE) to form the posterior distribution.
    • MAP Estimation: Maximum A Posteriori (MAP) estimation is a Bayesian counterpart to MLE, where the mode of the posterior distribution is found. It incorporates both the likelihood and the prior.
  • Integration with Bayesian Methods: MLE can serve as a starting point for Bayesian analysis, providing initial estimates that can be refined with prior information.

In Bayesian inference, the posterior distribution is a fundamental concept that combines prior beliefs with new evidence from data. The likelihood component is a crucial part of this process. Here’s how it fits into the Bayesian framework:

2.7.13.3 Components of the Posterior Distribution

The posterior distribution is given by Bayes’ theorem:

\[ P(\theta | \text{data}) = \frac{P(\text{data} | \theta) \cdot P(\theta)}{P(\text{data})} \]

Where:

  • \(P(\theta | \text{data})\): The posterior distribution, representing the updated beliefs about the parameter \(\theta\) after observing the data.
  • \(P(\text{data} | \theta)\): The likelihood, which measures how probable the observed data is, given the parameter \(\theta\).
  • \(P(\theta)\): The prior distribution, representing the initial beliefs about the parameter before observing the data.
  • \(P(\text{data})\): The marginal likelihood or evidence, a normalizing constant ensuring the posterior distribution integrates (or sums) to one.
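
A small worked sketch of these components (the prior and data below are hypothetical): with a uniform Beta(1, 1) prior on a success probability \(\theta\) and 7 successes observed in 10 Bernoulli trials, the posterior is proportional to the likelihood times the prior, normalized by the evidence. A grid approximation in NumPy:

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)             # grid of candidate parameter values
prior = np.ones_like(theta)                        # Beta(1, 1) prior: uniform over (0, 1)
k, n = 7, 10                                       # hypothetical data: 7 successes in 10 trials
likelihood = theta**k * (1 - theta)**(n - k)       # P(data | theta), up to a constant factor

unnormalized = likelihood * prior                  # numerator of Bayes' theorem
posterior = unnormalized / unnormalized.sum()      # divide by the evidence (grid approximation)

print(theta[np.argmax(posterior)])                 # posterior mode, approximately 0.7
print(np.sum(theta * posterior))                   # posterior mean, approximately 8/12 = 0.667
```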

2.7.13.4 The Likelihood Component

  • Definition: The likelihood \(P(\text{data} | \theta)\) quantifies the probability of the observed data under different parameter values. It is derived from the statistical model assumed for the data.

  • Role in Bayesian Inference:

    • Evidence from Data: The likelihood reflects the information provided by the data about the parameter \(\theta\).
    • Combining with Prior: In the Bayesian framework, the likelihood is combined with the prior distribution to update beliefs and form the posterior distribution.
  • Calculation: The likelihood is often the same function used in Maximum Likelihood Estimation (MLE) in frequentist statistics, but in Bayesian inference, it is used in conjunction with the prior.

2.7.13.5 Importance

  • Influence on Posterior: The shape and location of the likelihood function significantly influence the posterior distribution, especially when the data is informative.
  • Balancing Prior and Data: The likelihood helps balance prior beliefs with new data, allowing for a dynamic update of parameter estimates.

In the context of Bayesian inference, “informative data” refers to data that provides a significant amount of information about the parameters of interest, allowing for a meaningful update of prior beliefs. Here’s what it means in more detail:

2.7.13.5.1 Informative Data
  • High Impact on Posterior: When data is informative, it has a strong influence on the posterior distribution. This means the likelihood function is sharply peaked or distinct, indicating that the data strongly supports certain parameter values over others.

  • Contrast with Prior: Informative data can significantly shift the posterior distribution away from the prior distribution, especially if the prior is vague or non-informative. This results in a posterior that is more reflective of the data than the prior beliefs.

2.7.13.5.2 Characteristics of Informative Data
  1. Large Sample Size: More data points generally provide more information, leading to a more precise estimation of parameters.

  2. High Quality: Data with low noise and high accuracy can provide clearer insights into the parameter values.

  3. Strong Signal: Data that clearly distinguishes between different parameter values (e.g., clear trends or patterns) is considered informative.

2.7.13.5.3 Impact on Bayesian Analysis
  • Posterior Concentration: With informative data, the posterior distribution tends to be more concentrated around the true parameter values, reducing uncertainty.

  • Reduced Influence of Prior: When data is highly informative, the choice of prior has less impact on the posterior, as the data provides strong evidence.

In short, informative data provide clear, strong evidence about the parameters being estimated, allowing the likelihood to play a dominant role in shaping the posterior distribution. This leads to more confident and precise parameter estimates.

Therefore, the likelihood component of the posterior distribution is a measure of how well different parameter values explain the observed data. It plays a critical role in updating prior beliefs to form the posterior distribution in Bayesian inference.

In summary, MLE is primarily a frequentist tool for parameter estimation, but it also has applications in Bayesian inference, particularly in forming the likelihood component of the posterior distribution. In Bayesian contexts, it is often complemented by prior information to provide a more comprehensive parameter estimation approach.

2.7.14 On likelihood and probability

The terms “likelihood” and “probability” are often used interchangeably in casual conversation, but they have distinct technical meanings in statistics, particularly in the context of statistical inference. Here’s a breakdown of the differences:

2.7.14.1 Probability

  • Definition: Probability refers to the chance of a specific outcome or event occurring, given a set of parameters or conditions. It is a measure of uncertainty associated with random events.

  • Context: Probability is used to describe the distribution of outcomes in a random process. For example, the probability of rolling a six on a fair die is \(\frac{1}{6}\).

  • Function: In a probability distribution, the probability function assigns probabilities to different outcomes or events. For example, in a normal distribution, integrating the probability density function (PDF) over an interval gives the probability of observing a value within that range.

2.7.14.2 Likelihood

  • Definition: Likelihood is a function of the parameters of a statistical model, given specific observed data. It measures how plausible a particular set of parameter values is, given the data.

  • Context: Likelihood is used in parameter estimation and model fitting. It is central to methods like Maximum Likelihood Estimation (MLE), where the goal is to find the parameter values that maximize the likelihood function.

  • Function: Unlike probability, which is a function of outcomes given parameters, likelihood is a function of parameters given outcomes. For example, in a normal distribution, the likelihood function evaluates how likely different mean and variance values are, given the observed data.

2.7.14.3 Key Differences

  1. Directionality:
    • Probability: \(P(\text{data} | \theta)\) — the probability of data given parameters.
    • Likelihood: \(L(\theta | \text{data})\) — the likelihood of parameters given data.
  2. Normalization:
    • Probability: Probabilities sum to one over all possible outcomes.
    • Likelihood: Likelihoods do not sum to one; they are relative measures used for comparison.
  3. Use in Inference:
    • Probability: Used to predict future events or assess the chance of outcomes.
    • Likelihood: Used to estimate model parameters and assess model fit.

In summary, while both concepts deal with uncertainty and data, probability is about predicting outcomes given parameters, whereas likelihood is about evaluating parameters given observed data.
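
To make the distinction concrete, here is a small sketch with hypothetical coin-flip data: for a fixed parameter, the probabilities of all possible outcomes sum to one, whereas the likelihood evaluated at different parameter values does not.

```python
import numpy as np
from math import comb

n, p = 10, 0.5
# Probability: fix the parameter p and vary the outcome k = 0, ..., 10
probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(sum(probs))                                   # 1.0 (up to rounding): probabilities sum to one

# Likelihood: fix the observed outcome (k = 7 heads) and vary the parameter value
k = 7
thetas = np.linspace(0.05, 0.95, 19)
likelihoods = [comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas]
print(sum(likelihoods))                             # not 1: likelihoods are only relative measures
print(thetas[np.argmax(likelihoods)])               # 0.7, the maximum likelihood estimate
```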