5.3 Maximum Likelihood

Maximum Likelihood Estimation (MLE) is a statistical method that estimates the parameters of a model by finding the parameter values under which the observed data are most probable, that is, the values that maximize the likelihood of the data.

The likelihood function, denoted as $L(\theta)$ , is expressed as:

$L(\theta) = \prod_{i=1}^{n} f(y_i|\theta)$

where:

$f(y|\theta)$ is the probability density or mass function of observing a single value of $Y$ given the parameter $\theta$ .
The product runs over all $n$ observations.

For different types of data, $f(y|\theta)$ can take different forms. For example, if $y$ is dichotomous (e.g., success/failure, coded 1/0) with success probability $\theta$, then the likelihood function becomes:

$L(\theta) = \prod_{i=1}^{n} \theta^{y_i} (1-\theta)^{1-y_i}$

Here, $\hat{\theta}$ is the Maximum Likelihood Estimator (MLE) if:

$L(\hat{\theta}) \geq L(\theta), \quad \forall \theta \text{ in the parameter space.}$
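
As a small illustration of this definition, the sketch below maximizes the dichotomous likelihood in two ways: with the closed-form answer (the sample proportion of successes) and with a brute-force grid search. The sample `ys` and the grid resolution are made up for the example.

```python
import math

def bernoulli_loglik(theta, ys):
    """Log of L(theta) = prod theta^y * (1 - theta)^(1 - y)."""
    return sum(y * math.log(theta) + (1 - y) * math.log(1 - theta) for y in ys)

# Hypothetical sample of successes (1) and failures (0).
ys = [1, 0, 1, 1, 0, 1, 0, 1]

# Closed-form maximizer: the sample proportion of successes.
theta_closed = sum(ys) / len(ys)

# Brute-force check: evaluate the log-likelihood on a grid inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_loglik(t, ys))

print(theta_closed, theta_grid)  # both 0.625 for this sample
```

The grid search only confirms the closed form here; it is not a recommended estimation method.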

See Distributions for a review on variable distributions.

5.3.1 Motivation for MLE

Suppose we know the conditional distribution of $Y$ given $X$ , denoted as:

$f_{Y|X}(y, x; \theta)$

where $\theta$ is an unknown parameter of the distribution. Sometimes, we are only concerned with the unconditional distribution $f_Y(y; \theta)$ .

For a sample of independent and identically distributed (i.i.d.) data, the joint probability of the sample is:

$f_{Y_1, \ldots, Y_n|X_1, \ldots, X_n}(y_1, \ldots, y_n, x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f_{Y|X}(y_i, x_i; \theta)$

The joint distribution, evaluated at the observed data, defines the likelihood function. The goal of MLE is to find the parameter $\theta$ that maximizes this likelihood.

To estimate $\theta$ , we maximize the likelihood function:

$\max_{\theta} \prod_{i=1}^{n} f_{Y|X}(y_i, x_i; \theta)$

In practice, it is easier to work with the natural logarithm of the likelihood (log-likelihood), as it transforms the product into a sum:

$\max_{\theta} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \theta))$
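
Beyond algebraic convenience, the logarithm also matters numerically. The sketch below assumes a simulated standard normal sample of 2000 points, chosen only for illustration: the raw product of densities underflows to zero in floating point, while the sum of log-densities stays finite.

```python
import math
import random

random.seed(0)
# Hypothetical sample: 2000 draws from a standard normal distribution.
ys = [random.gauss(0.0, 1.0) for _ in range(2000)]

def normal_pdf(y, mu):
    """Density of N(mu, 1) evaluated at y."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

# Raw likelihood: a product of 2000 numbers below 0.4 underflows to 0.0.
likelihood = 1.0
for y in ys:
    likelihood *= normal_pdf(y, 0.0)

# Log-likelihood: the same information as a sum of moderate-size terms.
loglik = sum(math.log(normal_pdf(y, 0.0)) for y in ys)

print(likelihood)  # 0.0 because of floating-point underflow
print(loglik)      # a finite negative number
```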

Solving for the Maximum Likelihood Estimator

First-Order Condition: Solve the first derivative of the log-likelihood function with respect to $\theta$ :

$\frac{\partial}{\partial \theta} \ell(\theta) \;=\; \frac{\partial}{\partial \theta} \ln L(\theta) \;=\; \frac{\partial}{\partial \theta} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \hat{\theta}_{MLE})) = 0$

Solving this equation yields the critical points of the log-likelihood, which are the candidates for a maximum. This derivative, sometimes written as $U(\theta)$, is called the score. Intuitively, the log-likelihood’s “peak” indicates the parameter value(s) that make the observed data “most likely.”
Second-Order Condition: Verify that the second derivative of the log-likelihood function is negative at the critical point:

$\frac{\partial^2}{\partial \theta^2} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \hat{\theta}_{MLE})) < 0$

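This ensures that the critical point is a (local) maximum rather than a minimum or a saddle point. When $\theta$ is a vector, the condition is that the matrix of second derivatives (the Hessian) is negative definite.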

Examples of Likelihood Functions

Unconditional Poisson Distribution

The Poisson distribution models count data, such as the number of website visits in a day or product orders per hour. Its likelihood function is:

$L(\theta) = \prod_{i=1}^{n} \frac{\theta^{y_i} e^{-\theta}}{y_i!}$
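
As a worked instance of the first- and second-order conditions above, the Poisson likelihood can be maximized by hand. The steps below are the standard derivation, written in this section's notation:

$\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \left( y_i \ln \theta - \theta - \ln(y_i!) \right)$

$\frac{\partial \ell(\theta)}{\partial \theta} = \frac{1}{\theta} \sum_{i=1}^{n} y_i - n = 0 \quad \Longrightarrow \quad \hat{\theta}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}$

$\frac{\partial^2 \ell(\theta)}{\partial \theta^2} = -\frac{1}{\theta^2} \sum_{i=1}^{n} y_i < 0 \quad \text{whenever at least one } y_i > 0,$

so the critical point $\bar{y}$ is a maximum.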

Exponential Distribution

The exponential distribution is often used to model the time between events, such as the time until a machine fails. Its probability density function (PDF) is:

$f_{Y|X}(y, x; \theta) = \frac{\exp(-y / (x \theta))}{x \theta}$

The joint likelihood for $n$ observations is:

$L(\theta) = \prod_{i=1}^{n} \frac{\exp(-y_i / (x_i \theta))}{x_i \theta}$

By taking the logarithm, we obtain the log-likelihood for ease of maximization.
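
For this exponential model the maximization can be carried out explicitly. Differentiating the log-likelihood gives the score $\sum_i y_i/(x_i\theta^2) - n/\theta$, which is zero at $\hat{\theta} = \frac{1}{n}\sum_i y_i/x_i$. The sketch below checks that result on a small made-up data set; `xs` and `ys` are illustrative values, not data from the text.

```python
import math

# Hypothetical data: x_i scales the mean of y_i, so E(y_i | x_i) = x_i * theta.
xs = [1.0, 2.0, 4.0, 5.0]
ys = [2.0, 6.0, 4.0, 15.0]

def loglik(theta):
    """Sum of ln f(y_i | x_i; theta) for f = exp(-y / (x theta)) / (x theta)."""
    return sum(-y / (x * theta) - math.log(x * theta) for x, y in zip(xs, ys))

def score(theta):
    """Derivative of loglik with respect to theta."""
    return sum(y / (x * theta ** 2) - 1.0 / theta for x, y in zip(xs, ys))

# Setting the score to zero gives theta_hat = mean(y_i / x_i).
theta_hat = sum(y / x for x, y in zip(xs, ys)) / len(xs)

print(theta_hat)         # 2.25
print(score(theta_hat))  # approximately 0
```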

5.3.2 Key Quantities for Inference

Score Function
The score is given by
$U(\theta) \;=\; \frac{d}{d\theta} \ell(\theta).$
Setting $U(\hat{\theta}_{\mathrm{MLE}}) = 0$ yields the critical points of the log-likelihood, from which we can find $\hat{\theta}_{\mathrm{MLE}}$ .
Observed Information
The negative of the second derivative of the log-likelihood, evaluated at the MLE, is called the observed information:

$I_O(\theta) \;=\; - \frac{d^2}{d\theta^2} \ell(\theta).$

(The negative sign is often included so that $I_O(\theta)$ is positive if $\ell(\theta)$ is concave near its maximum. In some texts, you will see it defined without the negative sign, but the idea is the same: it measures the “pointedness” or curvature of $\ell(\theta)$ at its maximum.)
Fisher Information
The Fisher Information (or expected information) is the expectation of the observed information over the distribution of the data:

$I(\theta) \;=\; \mathbb{E}\left[I_O(\theta)\right].$

It quantifies how much information the data carry about the parameter $\theta$ . A larger Fisher information suggests that you can estimate $\theta$ more precisely.
Approximate Variance of $\hat{\theta}_{\mathrm{MLE}}$
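One of the key results from standard asymptotic theory is that, for large $n$, the variance of $\hat{\theta}_{\mathrm{MLE}}$ can be approximated by the inverse of the Fisher information of the full sample (here $I(\theta)$ is built from the log-likelihood $\ell(\theta)$ of all $n$ observations, so it grows with $n$):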

$\mathrm{Var}\left(\hat{\theta}_{\mathrm{MLE}}\right) \;\approx\; I(\theta)^{-1}.$

This also lays the groundwork for constructing confidence intervals for $\theta$ in large samples.
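
To make these quantities concrete, the following sketch evaluates them for an i.i.d. Poisson model, where the MLE is the sample mean, the observed information is $\sum_i y_i/\theta^2$, and the Fisher information is $n/\theta$. The counts in `ys` are invented, and the 95% interval is the usual normal-approximation (Wald) interval with the Fisher information evaluated at the estimate.

```python
import math

# Hypothetical count data, modeled as i.i.d. Poisson(theta).
ys = [2, 3, 1, 4, 0, 2, 3, 1]
n = len(ys)

theta_hat = sum(ys) / n  # closed-form Poisson MLE: the sample mean

def score(theta):
    """U(theta): first derivative of the Poisson log-likelihood."""
    return sum(ys) / theta - n

def observed_information(theta):
    """I_O(theta): minus the second derivative of the log-likelihood."""
    return sum(ys) / theta ** 2

def fisher_information(theta):
    """I(theta) = E[I_O(theta)] = n / theta, because E(y_i) = theta."""
    return n / theta

variance = 1.0 / fisher_information(theta_hat)  # plug-in approximation
std_error = math.sqrt(variance)
ci_95 = (theta_hat - 1.96 * std_error, theta_hat + 1.96 * std_error)

print(theta_hat, score(theta_hat))      # 2.0 0.0
print(observed_information(theta_hat))  # 4.0
print(std_error, ci_95)                 # 0.5 and roughly (1.02, 2.98)
```

For the Poisson model the observed and expected information coincide at the MLE; in general they differ.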

5.3.3 Assumptions of MLE

MLE has desirable properties—consistency, asymptotic normality, and efficiency—but these do not come “for free.” Instead, they rely on certain assumptions. Below is a breakdown of the main regularity conditions. These conditions are typically mild in many practical settings (for example, in exponential families, such as the normal distribution), but need to be checked in more complex models.

High-Level Regularity Assumptions

Independence and Identical Distribution (iid)
The sample $\{(x_i, y_i)\}$ is usually assumed to be composed of independent and identically distributed observations. This independence assumption simplifies the likelihood to a product of individual densities: $L(\theta) = \prod_{i=1}^n f_{Y\mid X}(y_i, x_i; \theta).$ In practice, if you have dependent data (e.g., time series, spatial data), modifications are required in the likelihood function.
Same Density Function
All observations must come from the same conditional probability density function $f_{Y\mid X}(\cdot,\cdot;\theta)$ . If the model changes across observations, you cannot simply multiply all of them together in one unified likelihood.
Multivariate Normality (for certain models)
In many practical cases—especially for continuous outcomes—you might assume (multivariate) normal distributions with finite second or fourth moments (Little 1988). Under these assumptions, the MLE for the mean vector and covariance matrix is consistent and (under further conditions) asymptotically normal. This assumption is quite common in regression, ANOVA, and other classical statistical frameworks.

5.3.3.1 Large Sample Properties of MLE

5.3.3.1.1 Consistency of MLE

Definition: An estimator $\hat{\theta}_n$ is consistent if it converges in probability to the true parameter value $\theta_0$ as the sample size $n \to \infty$ :

$\hat{\theta}_n \;\to^p\; \theta_0.$

For the MLE, a set of regularity conditions $R1$ – $R4$ is commonly used to ensure consistency:

R1
If $\theta \neq \theta_0$ , then
$f_{Y\mid X}(y_i, x_i; \theta) \;\neq\; f_{Y\mid X}(y_i, x_i; \theta_0).$

In simpler terms, the model is identifiable: no two distinct parameter values generate the exact same distribution for the data.
R2
The parameter space $\Theta$ is compact (closed and bounded), and it contains the true parameter $\theta_0$ . This ensures that $\theta$ lies in a “nice” region (no parameter going to infinity, etc.), making it easier to prove that a maximum in that space indeed exists.
R3
The log-likelihood function $\ln(f_{Y\mid X}(y_i, x_i; \theta))$ is continuous in $\theta$ with probability $1$ . Continuity is important so that we can apply theorems (like the Continuous Mapping Theorem or the Extreme Value Theorem) to find maxima.
R4
The expected supremum of the absolute value of the log-likelihood is finite:

$\mathbb{E}\left(\sup_{\theta \in \Theta} \left|\ln(f_{Y\mid X}(y_i, x_i; \theta))\right|\right) < \infty.$

This is a technical dominance condition: it bounds the log-likelihood uniformly over $\Theta$, which is what allows a uniform law of large numbers to be applied in many consistency proofs.

When these conditions are satisfied, you can show via standard arguments (e.g., the Law of Large Numbers, uniform convergence of the log-likelihood) that:

$\hat{\theta}_{\mathrm{MLE}} \;\to^p\; \theta_0 \quad (\text{consistency}).$
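
Consistency can be illustrated, though not proved, by simulation. The sketch below assumes exponential data with true mean `THETA_0 = 2.0` (an arbitrary choice), for which the MLE is the sample mean, and reports the absolute estimation error at increasing sample sizes.

```python
import random

random.seed(42)
THETA_0 = 2.0  # assumed true mean of the exponential distribution

def exponential_mle(n):
    """MLE of the exponential mean from n simulated draws (the sample mean)."""
    draws = [random.expovariate(1.0 / THETA_0) for _ in range(n)]
    return sum(draws) / n

# The absolute estimation error should tend to shrink as n grows.
errors = {}
for n in (100, 10_000, 200_000):
    errors[n] = abs(exponential_mle(n) - THETA_0)
    print(n, round(errors[n], 4))
```

A single run is only suggestive: convergence in probability is a statement about the distribution of the error, not about one simulated path.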

5.3.3.1.2 Asymptotic Normality of MLE

Definition: An estimator $\hat{\theta}_n$ is asymptotically normal if

$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \;\to^d\; \mathcal{N}\left(0,\Sigma\right),$

where $\to^d$ denotes convergence in distribution and $\Sigma$ is some covariance matrix. For the MLE, $\Sigma$ is typically $I(\theta_0)^{-1}$, where $I(\theta_0)$ is the Fisher information of a single observation evaluated at the true parameter (the $\sqrt{n}$ scaling accounts for the sample size).

Beyond $R1$ – $R4$ , we need the following additional assumptions:

R5
The true parameter $\theta_0$ is in the interior of the parameter space $\Theta$ . If $\theta_0$ sits on the boundary, different arguments are required to handle edge effects.
R6
The pdf $f_{Y\mid X}(y_i, x_i; \theta)$ is twice continuously differentiable (in $\theta$ ) and strictly positive in a neighborhood $N$ of $\theta_0$ . This allows us to use second-order Taylor expansions around $\theta_0$ to get the approximate distribution of $\hat{\theta}_{\mathrm{MLE}}$ .
R7
The following integrals are finite in some neighborhood $N$ of $\theta_0$ :
- $\int \sup_{\theta \in N} \left\|\frac{\partial f_{Y\mid X}(y_i, x_i; \theta)}{\partial \theta} \right\|\, d(y,x) < \infty$ .
- $\int \sup_{\theta \in N} \left\|\frac{\partial^2 f_{Y\mid X}(y_i, x_i; \theta)}{\partial \theta \partial \theta'} \right\|\, d(y,x) < \infty$ .
- $\mathbb{E}\left(\sup_{\theta \in N} \left\|\frac{\partial^2 \ln(f_{Y\mid X}(y_i, x_i; \theta))}{\partial \theta \partial \theta'} \right\|\right) < \infty$ .
These conditions ensure that differentiating inside integrals is justified (via the dominated convergence theorem) and that we can expand the log-likelihood in a Taylor series safely.
R8
The information matrix $I(\theta_0)$ exists and is nonsingular:

$I(\theta_0) = \mathrm{Var}\left(\frac{\partial}{\partial \theta} \ln\left(f_{Y\mid X}(y_i, x_i; \theta_0)\right)\right), \qquad \det I(\theta_0) \neq 0.$

Nonsingularity implies there is enough information in the data to estimate $\theta$ uniquely.

Under $R1$ – $R8$ , you can show that

$\sqrt{n}\,(\hat{\theta}_{\mathrm{MLE}} - \theta_0) \to^d \mathcal{N}\left(0,\,I(\theta_0)^{-1}\right).$

This result is central to frequentist inference, allowing you to construct approximate confidence intervals and hypothesis tests using the normal approximation for large $n$ .
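
The limiting distribution can be checked by Monte Carlo in a simple case. The sketch below assumes Bernoulli data with true success probability `P_0 = 0.3`, for which the per-observation Fisher information is $1/(p(1-p))$, so the asymptotic variance is $p_0(1-p_0) = 0.21$. The sample size and number of replications are arbitrary illustrative choices.

```python
import math
import random
import statistics

random.seed(7)
P_0 = 0.3    # assumed true Bernoulli success probability
N = 400      # observations per simulated sample
REPS = 2000  # number of simulated samples

def scaled_error():
    """sqrt(n) * (p_hat - p_0) for one simulated Bernoulli sample."""
    successes = sum(1 for _ in range(N) if random.random() < P_0)
    return math.sqrt(N) * (successes / N - P_0)

zs = [scaled_error() for _ in range(REPS)]

# Inverse of the per-observation Fisher information: p (1 - p).
asymptotic_variance = P_0 * (1 - P_0)

print(statistics.mean(zs))      # should be near 0
print(statistics.variance(zs))  # should be near 0.21
```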

5.3.4 Properties of MLE

Having established in earlier sections that Maximum Likelihood Estimators (MLEs) are consistent (Consistency of MLE) and asymptotically normal (Asymptotic Normality of MLE) under standard regularity conditions, we now highlight additional properties that make MLE a powerful estimation technique.

Asymptotic Efficiency

Definition: An estimator is asymptotically efficient if it attains the smallest possible asymptotic variance among well-behaved (regular) consistent estimators, i.e., its asymptotic variance achieves the Cramér-Rao Lower Bound.
Interpretation: In large samples, MLE typically has smaller standard errors than other consistent estimators that do not fully use the assumed distributional form.
Implication: When the true model is correctly specified, MLE is the most efficient among a broad class of estimators, leading to more precise inference for $\theta$ .
- Cramér-Rao Lower Bound (CRLB): A theoretical lower limit on the variance of any unbiased (or asymptotically unbiased) estimator (Rao 1992).
- When MLE Meets CRLB: Under correct specification and standard regularity conditions, the asymptotic variance of the MLE matches the CRLB, making it asymptotically efficient.
- Interpretation: Achieving CRLB means no other unbiased estimator can consistently outperform MLE in terms of variance for large $n$ .

Invariance

Core Idea: If $\hat{\theta}$ is the MLE for $\theta$, then for any transformation $g(\theta)$, the MLE for $g(\theta)$ is simply $g(\hat{\theta})$.
Example: If $\theta$ is the mean of an exponential distribution, whose variance is $\theta^2$, the MLE for the variance is obtained by squaring the MLE for $\theta$.
Key Point: This invariance property saves considerable effort—there is no need to re-derive a new likelihood for the transformed parameter.
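
The invariance property can be checked numerically. In the sketch below, an exponential likelihood is rewritten in terms of the rate $\lambda = 1/\theta$ and maximized directly with a simple ternary search; the maximizer should agree with $1/\hat{\theta}$. The data and the helper `argmax_unimodal` are illustrative assumptions.

```python
import math

# Hypothetical exponential sample; the MLE of the mean theta is the sample mean.
ys = [1.0, 3.0, 2.0, 2.0]
theta_hat = sum(ys) / len(ys)

def loglik_rate(lam):
    """Exponential log-likelihood written in terms of the rate lam = 1 / theta."""
    return len(ys) * math.log(lam) - lam * sum(ys)

def argmax_unimodal(f, lo, hi, iters=200):
    """Ternary search for the maximizer of a single-peaked function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Maximize directly over the rate, without using theta_hat.
lam_hat = argmax_unimodal(loglik_rate, 0.01, 5.0)

print(lam_hat, 1.0 / theta_hat)  # both approximately 0.5
```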

Explicit vs. Implicit MLE

Explicit MLE:
Occurs when the score equation can be solved in closed form. A classic example is the MLE for the mean and variance in a normal distribution.
Implicit MLE:
Happens when no closed-form solution exists. Iterative numerical methods, such as Newton-Raphson, Expectation-Maximization (EM), or other optimization algorithms, are used to find $\hat{\theta}$ .
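
As one example of an implicit MLE, consider a zero-truncated Poisson model (counts observed only when positive). This model is not discussed in the text and is used here purely as an illustration. Its score equation, $\bar{y} = \theta/(1 - e^{-\theta})$, cannot be solved in closed form, so the sketch applies Newton-Raphson with made-up data.

```python
import math

# Hypothetical counts from a zero-truncated Poisson model (zeros are never observed).
ys = [1, 2, 1, 3, 1, 2, 4, 1, 2, 2]
n, total = len(ys), sum(ys)

def score(theta):
    """First derivative of the zero-truncated Poisson log-likelihood."""
    return total / theta - n / (1.0 - math.exp(-theta))

def hessian(theta):
    """Second derivative of the same log-likelihood."""
    e = math.exp(-theta)
    return -total / theta ** 2 + n * e / (1.0 - e) ** 2

theta = total / n  # starting value: the sample mean
for iteration in range(50):
    step = score(theta) / hessian(theta)
    theta -= step  # Newton-Raphson update
    if abs(step) < 1e-10:
        break

print(theta, score(theta))  # the score should be approximately 0
```

Starting from the sample mean is a convenience, not a guarantee; for less well-behaved likelihoods the starting value can matter a great deal.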

Distributional Mis-Specification

Definition: If you assume a distribution for $f_{Y|X}(\cdot;\theta)$ that does not reflect the true data-generating process, the MLE is generally inconsistent for the parameters of interest, and the usual likelihood-based standard errors are no longer valid.
Quasi-MLE:
- A strategy to handle certain forms of mis-specification.
- If the chosen distribution belongs to the linear exponential family (e.g., normal, Poisson, Bernoulli) and the conditional mean is correctly specified, the estimates of the mean parameters remain consistent even when other features of the distribution are wrong. Inference then relies on robust (sandwich) standard errors.
Nonparametric & Semiparametric Approaches:
- Require minimal or no distributional assumptions.
- More robust to mis-specification but can be harder to implement and may exhibit higher variance or require larger sample sizes to achieve comparable precision.

5.3.5 Practical Considerations

Use Cases
- MLE is extremely popular for:
  - Binary Outcomes (logistic regression)
  - Count Data (Poisson regression)
  - Strictly Positive Outcomes (Gamma regression)
  - Heteroskedastic Settings (models with variance related to mean, e.g., GLM)
Distributional Assumptions
- The efficiency gains of MLE stem from using a specific probability model.
- If the assumed model closely reflects the data-generating process, MLE gives accurate parameter estimates and reliable standard errors.
- MLE assumes knowledge of the conditional distribution of the outcome variable. This assumption parallels the normality assumption in linear regression models (e.g., A6 Normal Distribution).
- If severely mis-specified, consider robust or semi-/nonparametric methods.
Comparison with OLS: See Comparison of MLE and OLS for more details.
- Ordinary Least Squares is a special case of MLE when errors are normally distributed and homoscedastic.
- In more general settings (e.g., non-Gaussian or heteroskedastic data), MLE can outperform OLS in terms of smaller standard errors and better inference.
Numerical Stability & Computation
- For complex likelihoods, iterative methods can fail to converge or converge to local maxima.
- Proper initialization and diagnostics (e.g., checking multiple start points) are crucial.

5.3.6 Comparison of MLE and OLS

While Maximum Likelihood Estimation is a powerful estimation method, it does not solve all of the challenges associated with Ordinary Least Squares. Below is a detailed comparison highlighting similarities, differences, and limitations.

Key Points of Comparison

Inference Methods:
- MLE:
  - Joint inference is typically conducted using log-likelihood calculations, such as likelihood ratio tests or information criteria (e.g., AIC, BIC).
  - These methods replace the use of F-statistics commonly associated with OLS.
- OLS:
  - Relies on the F-statistic for hypothesis testing and joint inference.
Sensitivity to Functional Form:
- Both MLE and OLS are sensitive to the functional form of the model. Incorrect specification (e.g., linear vs. nonlinear relationships) can lead to biased or inefficient estimates in both cases.
Perfect Collinearity and Multicollinearity:
- Both methods are affected by collinearity:
  - Perfect collinearity (e.g., two identical predictors) makes parameter estimation impossible.
  - Multicollinearity (highly correlated predictors) inflates standard errors, reducing the precision of estimates.
- Neither MLE nor OLS directly resolves these issues without additional measures, such as regularization or variable selection.
Endogeneity:
- Problems like omitted variable bias or simultaneous equations affect both MLE and OLS:
  - If relevant predictors are omitted, estimates from both methods are likely to be biased and inconsistent.
  - Similarly, in systems of simultaneous equations, both methods yield biased results unless endogeneity is addressed through instrumental variables or other approaches.
- MLE, while efficient under correct model specification, does not inherently address endogeneity.

Situations Where MLE and OLS Differ

Comparative Summary of MLE and OLS Across Estimation, Assumptions, and Interpretation
Aspect	MLE	OLS
Estimator Efficiency	Efficient for correctly specified distributions.	Efficient under Gauss-Markov assumptions.
Assumptions about Errors	Requires specifying a distribution (e.g., normal, binomial).	Requires only mean-zero errors and homoscedasticity.
Use of Likelihood	Based on maximizing the likelihood function for parameter estimation.	Based on minimizing the sum of squared residuals.
Model Flexibility	More flexible (supports various distributions, non-linear models).	Primarily linear models (extensions for non-linear exist).
Interpretation	Log-likelihood values guide model comparison (AIC/BIC).	R-squared and adjusted R-squared measure fit.

Practical Considerations

When to Use MLE:
- Situations where the dependent variable is:
  - Binary (e.g., logistic regression)
  - Count data (e.g., Poisson regression)
  - Skewed or bounded (e.g., survival models)
- When the model naturally arises from a probabilistic framework.
When to Use OLS:
- Suitable for continuous dependent variables with approximately linear relationships between predictors and outcomes.
- Simpler to implement and interpret when the assumptions of linear regression are reasonably met.

5.3.7 Applications of MLE

MLE is widely used across various applications to estimate parameters in models tailored for specific data structures. Below are key applications of MLE, categorized by problem type and estimation method.

Applications of Maximum Likelihood Estimation in Nonlinear and Limited Dependent Variable Models
Model Type	Examples	Key Characteristics	Common Estimation Methods	Additional Notes
Corner Solution Models	Hours worked; donations to charity; household consumption of a good	Dependent variable is often censored at zero (or another threshold). Large fraction of observations at the corner (e.g., 0 hours, 0 donations).	Tobit regression (latent variable approach with censoring)	Useful when a continuous outcome has a mass point at zero but also positive values (e.g., 30% of individuals donate $0, the rest donate > $0).
Non-Negative Count Models	Number of arrests; number of cigarettes smoked; doctor visits per year	Dependent variable consists of non-negative integer counts. Possible overdispersion (variance > mean).	Poisson regression, Negative Binomial regression	Poisson assumes mean = variance, so often Negative Binomial is preferred for real data. Zero-inflated models (ZIP/ZINB) may be used for data with excess zeros.
Multinomial Choice Models	Demand for different car brands; votes in a primary election; choice of travel mode	Dependent variable is a categorical choice among 3+ alternatives. Each category is distinct, with no inherent ordering (e.g., brand A, B, or C).	Multinomial logit, Multinomial probit	Extension of binary choice (logit/probit) to multiple categories. Independence of Irrelevant Alternatives (IIA) can be a concern for the multinomial logit.
Ordinal Choice Models	Self-reported happiness (low/medium/high); income level brackets; Likert-scale surveys	Dependent variable is ordered (e.g., low < medium < high). Distances between categories are not necessarily equal.	Ordered logit, Ordered probit	Probit/logit framework adapted to preserve ordinal information. Interprets latent continuous variable mapped to discrete ordered categories.

5.3.7.1 Binary Response Models

A binary response variable ( $y_i$ ) follows a Bernoulli distribution:

$f_Y(y_i; p) = p^{y_i}(1-p)^{(1-y_i)}$

where $p$ is the probability of success. For conditional models, the likelihood becomes:

$f_{Y|X}(y_i, x_i; p(.)) = p(x_i)^{y_i}(1 - p(x_i))^{(1-y_i)}$

To model $p(x_i)$ , we use a function of $x_i$ and unknown parameters $\theta$ . A common approach involves a latent variable model:

$\begin{aligned} y_i &= 1\{y_i^* > 0 \}, \\ y_i^* &= x_i \beta - \epsilon_i, \end{aligned}$

where:

$y_i^*$ is an unobserved (latent) variable.
$\epsilon_i$ is a random variable with mean 0, representing unobserved noise.

Rewriting in terms of observed data:

$y_i = 1\{x_i \beta > \epsilon_i\}.$

The probability function becomes:

$\begin{aligned} p(x_i) &= P(y_i = 1 | x_i) \\ &= P(x_i \beta > \epsilon_i | x_i) \\ &= F_{\epsilon|X}(x_i \beta | x_i), \end{aligned}$

where $F_{\epsilon|X}(.)$ is the cumulative distribution function (CDF) of $\epsilon_i$ . Assuming independence of $\epsilon_i$ and $x_i$ , the probability function simplifies to:

$p(x_i) = F_\epsilon(x_i \beta).$

The conditional expectation function is equivalent:

$E(y_i | x_i) = P(y_i = 1 | x_i) = F_\epsilon(x_i \beta).$

Common Distributional Assumptions

Probit Model:
- Assumes $\epsilon_i$ follows a standard normal distribution.
- $F_\epsilon(.) = \Phi(.)$ , where $\Phi(.)$ is the standard normal CDF.
Logit Model:
- Assumes $\epsilon_i$ follows a standard logistic distribution.
- $F_\epsilon(.) = \Lambda(.)$ , where $\Lambda(.)$ is the logistic CDF.

Steps to Derive MLE for Binary Models

Specify the Log-Likelihood:
- For a chosen distribution (e.g., normal for Probit or logistic for Logit), the log-likelihood is:
  
  $\ln(f_{Y|X}(y_i, x_i; \beta)) = y_i \ln(F_\epsilon(x_i \beta)) + (1 - y_i) \ln(1 - F_\epsilon(x_i \beta)).$
Maximize the Log-Likelihood:
- Find the parameter estimates that maximize the log-likelihood:
  
  $\hat{\beta}_{MLE} = \underset{\beta}{\text{argmax}} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \beta)).$
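
The two steps above can be sketched for a logit model with an intercept and one regressor. The example simulates data from assumed coefficients `BETA_TRUE`, then maximizes the log-likelihood by Newton-Raphson using the analytic score $\sum_i (y_i - p_i)x_i$ and Hessian $-\sum_i p_i(1-p_i)x_i'x_i$ of the logit model. It is a minimal illustration in plain Python; applied work would normally rely on an established statistics library.

```python
import math
import random

random.seed(1)
BETA_TRUE = (-0.5, 1.0)  # assumed (intercept, slope), used only to simulate data

def logistic_cdf(z):
    """Lambda(z), the standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

# Simulate (x_i, y_i) with P(y_i = 1 | x_i) = Lambda(b0 + b1 * x_i).
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
ys = [1 if random.random() < logistic_cdf(BETA_TRUE[0] + BETA_TRUE[1] * x) else 0
      for x in xs]

def gradient_and_hessian(b0, b1):
    """Score vector and Hessian of the logit log-likelihood at (b0, b1)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = logistic_cdf(b0 + b1 * x)
        g0 += y - p
        g1 += (y - p) * x
        w = p * (1.0 - p)
        h00 -= w
        h01 -= w * x
        h11 -= w * x * x
    return (g0, g1), (h00, h01, h11)

b0, b1 = 0.0, 0.0
for _ in range(25):
    (g0, g1), (h00, h01, h11) = gradient_and_hessian(b0, b1)
    det = h00 * h11 - h01 * h01
    # Newton-Raphson step: beta <- beta - H^{-1} g, with the 2x2 inverse written out.
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    b0, b1 = b0 - d0, b1 - d1
    if max(abs(d0), abs(d1)) < 1e-10:
        break

print(b0, b1)  # estimates; they should land near BETA_TRUE in this simulation
```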

Properties of Probit and Logit Estimators

Consistency and Asymptotic Normality:
- Probit and Logit estimators are consistent and asymptotically normal if:
  - A2 Full Rank: $E(x_i' x_i)$ exists and is non-singular.
  - A5 Data Generation (Random Sampling): $\{y_i, x_i\}$ are iid (or stationary and weakly dependent).
  - Distributional assumptions on $\epsilon_i$ hold (e.g., normal or logistic, independent of $x_i$ ).
Asymptotic Efficiency:
- Under these assumptions, Probit and Logit estimators are asymptotically efficient with variance:
  
  $I(\beta_0)^{-1} = \left[E\left(\frac{(f_\epsilon(x_i \beta_0))^2}{F_\epsilon(x_i \beta_0)(1 - F_\epsilon(x_i \beta_0))} x_i' x_i \right)\right]^{-1},$
  
  where $f_\epsilon(x_i \beta_0)$ is the PDF (derivative of the CDF).

Interpretation of Binary Response Models

Binary response models, such as Probit and Logit, estimate the probability of an event occurring ( $y_i = 1$ ) given predictor variables $x_i$ . However, interpreting the estimated coefficients ( $\beta$ ) in these models differs significantly from linear models. Below, we explore how to interpret these coefficients and the concept of partial effects.

Interpreting $\beta$ in Binary Response Models

In binary response models, the coefficient $\beta_j$ represents the average change in the latent variable $y_i^*$ (an unobserved variable) for a one-unit change in $x_{ij}$ . While this provides insight into the direction of the relationship:

Magnitudes of $\beta_j$ do not have a direct, meaningful interpretation in terms of $y_i$ .
Direction of $\beta_j$ is meaningful:
- $\beta_j > 0$ : A positive association between $x_{ij}$ and the probability of $y_i = 1$ .
- $\beta_j < 0$ : A negative association between $x_{ij}$ and the probability of $y_i = 1$ .

Partial Effects in Nonlinear Binary Models

To interpret the effect of a change in a predictor $x_{ij}$ on the probability of an event occurring ( $P(y_i = 1|x_i)$ ), we start from the conditional expectation:

$E(y_i | x_i) = F_\epsilon(x_i \beta),$

where $F_\epsilon(.)$ is the cumulative distribution function (CDF) of the error term $\epsilon_i$ (e.g., standard normal for Probit, logistic for Logit). The partial effect is the derivative of the expected probability with respect to $x_{ij}$ :

$PE(x_{ij}) = \frac{\partial E(y_i | x_i)}{\partial x_{ij}} = f_\epsilon(x_i \beta) \beta_j,$

where:

$f_\epsilon(.)$ is the probability density function (PDF) of the error term $\epsilon_i$ .
$\beta_j$ is the coefficient associated with $x_{ij}$ .

Key Characteristics of Partial Effects

Scaling Factor:
- The partial effect depends on a scaling factor, $f_\epsilon(x_i \beta)$ , which is derived from the density function $f_\epsilon(.)$ .
- The scaling factor varies depending on the values of $x_i$ , making the partial effect nonlinear and context-dependent.
Non-Constant Partial Effects:
- Unlike linear models where coefficients directly represent constant marginal effects, the partial effect in binary models changes based on $x_i$ .
- For example, in a Logit model, the partial effect is largest when $P(y_i = 1 | x_i)$ is around 0.5 (the midpoint of the S-shaped logistic curve) and smaller at the extremes (close to 0 or 1).

Single Values for Partial Effects

In practice, researchers often summarize partial effects using either:

Partial Effect at the Average (PEA):
- The partial effect is calculated for an “average individual,” where $x_i = \bar{x}$ (the sample mean of predictors): $PEA = f_\epsilon(\bar{x}\hat{\beta}) \hat{\beta}_j.$
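- This provides a single, interpretable value, but the “average individual” may not resemble any actual observation (for example, when predictors are binary), and the value ignores how the effect varies across individuals.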
Average Partial Effect (APE):
- The average of all individual-level partial effects across the sample: $APE = \frac{1}{n} \sum_{i=1}^{n} f_\epsilon(x_i \hat{\beta}) \hat{\beta}_j.$
- This accounts for the nonlinearity of the partial effects and provides a more accurate summary of the marginal effect in the population.
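
The difference between the two summaries is easy to see in a small logit example. The coefficients `b0_hat`, `b1_hat` and the regressor values `xs` below are invented for illustration; the logistic density is $f_\epsilon(z) = \Lambda(z)(1 - \Lambda(z))$.

```python
import math

def logistic_cdf(z):
    """Lambda(z), the standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_pdf(z):
    """f(z) = Lambda(z) * (1 - Lambda(z)), the derivative of the logistic CDF."""
    p = logistic_cdf(z)
    return p * (1.0 - p)

# Hypothetical fitted logit: index = b0 + b1 * x, with illustrative values.
b0_hat, b1_hat = 0.0, 1.0
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]

x_bar = sum(xs) / len(xs)

# Partial effect at the average: evaluate the density once, at x_bar.
pea = logistic_pdf(b0_hat + b1_hat * x_bar) * b1_hat

# Average partial effect: evaluate the density at every x_i, then average.
ape = sum(logistic_pdf(b0_hat + b1_hat * x) * b1_hat for x in xs) / len(xs)

print(pea)  # 0.25
print(ape)  # about 0.1706
```

Here the average index is zero, where the logistic density peaks, so the PEA overstates the typical effect relative to the APE.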

Comparing Partial Effects in Linear and Nonlinear Models

Linear Models:
- Partial effects are constant: $APE = PEA$ .
- The coefficients directly represent the marginal effects on $E(y_i | x_i)$ .
Nonlinear Models:
- Partial effects are not constant due to the dependence on $f_\epsilon(x_i \beta)$ .
- As a result, $APE \neq PEA$ in general.

References

Little, Roderick JA. 1988. “A Test of Missing Completely at Random for Multivariate Data with Missing Values.” Journal of the American Statistical Association 83 (404): 1198–1202.

Rao, C Radhakrishna. 1992. “Information and the Accuracy Attainable in the Estimation of Statistical Parameters.” In Breakthroughs in Statistics: Foundations and Basic Theory, 235–47. Springer.