5.3 Maximum Likelihood
Maximum Likelihood Estimation (MLE) is a statistical method for estimating the parameters of a model. The premise is to find the parameter values that maximize the probability (or likelihood) of the observed data.
The likelihood function, denoted as L(θ), is expressed as:
L(\theta) = \prod_{i=1}^{n} f(y_i \mid \theta)
where:
- f(y|θ) is the probability density or mass function of observing a single value of Y given the parameter θ.
- The product runs over all n observations.
For different types of data, f(y|θ) can take different forms. For example, if y is dichotomous (e.g., success/failure), then the likelihood function becomes:
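L(\theta) = \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}
Here, \hat{\theta} is the Maximum Likelihood Estimator (MLE) if:
L(\hat{\theta}) \geq L(\theta_0), \quad \forall\, \theta_0 \text{ in the parameter space.}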
See Distributions for a review on variable distributions.
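As a minimal sketch of the dichotomous case, the snippet below evaluates the Bernoulli log-likelihood on a grid and compares the maximizer with the sample proportion, which is the closed-form MLE for this model. The data and the helper name `bernoulli_loglik` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical binary sample (1 = success, 0 = failure); not from the text.
y = [1, 0, 1, 1, 0, 1, 0, 1]

def bernoulli_loglik(theta, data):
    """Log of L(theta) = prod_i theta^y_i * (1 - theta)^(1 - y_i)."""
    s = sum(data)
    n = len(data)
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)

# Evaluate the log-likelihood on interior grid points 0.001, ..., 0.999.
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_loglik(t, y))

# Closed-form MLE for the Bernoulli model: the sample proportion.
theta_closed = sum(y) / len(y)

print(theta_grid, theta_closed)
```

A grid search is used here only because it makes the "maximize over \theta" step visible; it is not how the MLE would usually be computed.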
5.3.1 Motivation for MLE
Suppose we know the conditional distribution of Y given X, denoted as:
f_{Y|X}(y, x; \theta)
where \theta is an unknown parameter of the distribution. Sometimes, we are only concerned with the unconditional distribution f_Y(y; \theta).
For a sample of independent and identically distributed (i.i.d.) data, the joint probability of the sample is:
f_{Y_1, \ldots, Y_n | X_1, \ldots, X_n}(y_1, \ldots, y_n, x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f_{Y|X}(y_i, x_i; \theta)
The joint distribution, evaluated at the observed data, defines the likelihood function. The goal of MLE is to find the parameter θ that maximizes this likelihood.
To estimate θ, we maximize the likelihood function:
\max_{\theta} \prod_{i=1}^{n} f_{Y|X}(y_i, x_i; \theta)
In practice, it is easier to work with the natural logarithm of the likelihood (log-likelihood), as it transforms the product into a sum:
\max_{\theta} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \theta))
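Beyond algebraic convenience, the log transform also helps numerically: a product of many values below one can underflow in floating point, while the sum of logs stays finite. The sketch below illustrates this with a hypothetical sample in which every observation has density value 0.5; the numbers are assumptions made for the illustration.

```python
import math

# Hypothetical setup: 2000 i.i.d. observations, each with density value 0.5.
n = 2000
densities = [0.5] * n

# The raw likelihood is a product of n small numbers and underflows.
likelihood = math.prod(densities)

# The log-likelihood is a sum and remains representable.
loglik = sum(math.log(d) for d in densities)

print(likelihood)
print(loglik)
```

Because the logarithm is strictly increasing, the maximizer of the log-likelihood is the same as the maximizer of the likelihood.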
Solving for the Maximum Likelihood Estimator
First-Order Condition: Set the first derivative of the log-likelihood function \ell(\theta) = \ln L(\theta) with respect to \theta equal to zero and solve for \hat{\theta}_{MLE}:
\frac{\partial}{\partial \theta} \ell(\theta) \Big|_{\theta = \hat{\theta}_{MLE}} \;=\; \frac{\partial}{\partial \theta} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \theta)) \Big|_{\theta = \hat{\theta}_{MLE}} = 0
This yields the critical points, which are the candidates for a maximum of the likelihood. This derivative, sometimes written as U(\theta), is called the score. Intuitively, the log-likelihood’s “peak” indicates the parameter value(s) that make the observed data “most likely.”
Second-Order Condition: Verify that the second derivative of the log-likelihood function is negative at the critical point:
\frac{\partial^2}{\partial \theta^2} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \theta)) \Big|_{\theta = \hat{\theta}_{MLE}} < 0
This ensures that the solution corresponds to a (local) maximum rather than a minimum or saddle point.
Examples of Likelihood Functions
- Unconditional Poisson Distribution
The Poisson distribution models count data, such as the number of website visits in a day or product orders per hour. Its likelihood function is:
L(\theta) = \prod_{i=1}^{n} \frac{\theta^{y_i} e^{-\theta}}{y_i!}
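For the Poisson likelihood above, the first- and second-order conditions can be checked directly. Taking logs gives \ell(\theta) = \sum_i [y_i \ln\theta - \theta - \ln(y_i!)], so the score is \sum_i y_i / \theta - n and the second derivative is -\sum_i y_i / \theta^2. The sketch below verifies both conditions at the sample mean; the count data are hypothetical.

```python
# Hypothetical count data (e.g., orders per hour); not from the text.
y = [2, 0, 3, 1, 4, 2, 1, 3]
n = len(y)

def score(theta):
    """First derivative of sum_i [y_i ln(theta) - theta - ln(y_i!)]."""
    return sum(y) / theta - n

def second_derivative(theta):
    """Second derivative of the Poisson log-likelihood."""
    return -sum(y) / theta ** 2

# Solving score(theta) = 0 gives the sample mean.
theta_hat = sum(y) / n

print(theta_hat, score(theta_hat), second_derivative(theta_hat))
```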
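- Conditional Exponential Distribution
The exponential distribution is often used to model the time between events, such as the time until a machine fails. In the conditional version below, the mean of Y given x is x\theta. Its probability density function (PDF) is: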
f_{Y|X}(y, x; \theta) = \frac{\exp(-y / (x \theta))}{x \theta}
The joint likelihood for n observations is:
L(\theta) = \prod_{i=1}^{n} \frac{\exp(-y_i / (x_i \theta))}{x_i \theta}
By taking the logarithm, we obtain the log-likelihood for ease of maximization.
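To make the last step concrete: the log-likelihood is \ell(\theta) = \sum_i [-y_i / (x_i \theta) - \ln x_i - \ln \theta], and setting its derivative \sum_i y_i / (x_i \theta^2) - n / \theta to zero gives \hat{\theta} = \frac{1}{n} \sum_i y_i / x_i. The sketch below checks this closed form numerically on hypothetical data.

```python
import math

# Hypothetical data: x_i scales the mean of y_i, so E[y_i | x_i] = x_i * theta.
x = [1.0, 2.0, 1.5, 0.5]
y = [2.0, 3.0, 4.5, 0.5]

def loglik(theta):
    """Sum over i of -y_i / (x_i * theta) - ln(x_i) - ln(theta)."""
    return sum(-yi / (xi * theta) - math.log(xi) - math.log(theta)
               for xi, yi in zip(x, y))

# Closed-form solution of the first-order condition.
theta_hat = sum(yi / xi for xi, yi in zip(x, y)) / len(y)

# The log-likelihood should be lower on either side of theta_hat.
print(theta_hat, loglik(theta_hat - 0.05), loglik(theta_hat), loglik(theta_hat + 0.05))
```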
5.3.2 Key Quantities for Inference
Score Function
The score is given by
U(\theta) \;=\; \frac{d}{d\theta} \ell(\theta).
Setting U(\hat{\theta}_{\mathrm{MLE}}) = 0 yields the critical points of the log-likelihood, from which we can find \hat{\theta}_{\mathrm{MLE}}.
Observed Information
The negative of the second derivative of the log-likelihood, evaluated at the MLE, is called the observed information:
I_O(\theta) \;=\; - \frac{d^2}{d\theta^2} \ell(\theta).
(The negative sign is often included so that I_O(\theta) is positive if \ell(\theta) is concave near its maximum. In some texts, you will see it defined without the negative sign, but the idea is the same: it measures the “pointedness” or curvature of \ell(\theta) at its maximum.)
Fisher Information
The Fisher Information (or expected information) is the expectation of the observed information over the distribution of the data:
I(\theta) \;=\; \mathbb{E}\bigl[I_O(\theta)\bigr].
It quantifies how much information the data carry about the parameter \theta. A larger Fisher information suggests that you can estimate \theta more precisely.
Approximate Variance of \hat{\theta}_{\mathrm{MLE}}
One of the key results from standard asymptotic theory is that, for large n, the variance of \hat{\theta}_{\mathrm{MLE}} can be approximated by the inverse of the Fisher information:
\mathrm{Var}\bigl(\hat{\theta}_{\mathrm{MLE}}\bigr) \;\approx\; I(\theta)^{-1}.
This also lays the groundwork for constructing confidence intervals for \theta in large samples.
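The quantities in this subsection can be traced through a small Bernoulli example. The sketch below assumes a hypothetical sample of 40 successes in 100 trials and plugs the observed information at \hat{\theta} into the variance approximation (a common substitute, since the true \theta is unknown). The 1.96 multiplier assumes a 95% normal-approximation (Wald) interval.

```python
import math

# Hypothetical Bernoulli sample: 40 successes in 100 trials.
n, s = 100, 40
theta_hat = s / n   # root of the score equation s/theta - (n - s)/(1 - theta) = 0

def observed_information(theta):
    """Negative second derivative of the Bernoulli log-likelihood."""
    return s / theta ** 2 + (n - s) / (1.0 - theta) ** 2

info = observed_information(theta_hat)
variance = 1.0 / info             # approximate Var(theta_hat)
std_error = math.sqrt(variance)

# Approximate 95% Wald interval from the normal approximation.
z = 1.96
ci = (theta_hat - z * std_error, theta_hat + z * std_error)

print(info, std_error, ci)
```

For this model the observed information at \hat{\theta} equals n / (\hat{\theta}(1 - \hat{\theta})), which is also the Fisher information evaluated at \hat{\theta}.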
5.3.3 Assumptions of MLE
MLE has desirable properties—consistency, asymptotic normality, and efficiency—but these do not come “for free.” Instead, they rely on certain assumptions. Below is a breakdown of the main regularity conditions. These conditions are typically mild in many practical settings (for example, in exponential families, such as the normal distribution), but need to be checked in more complex models.
High-Level Regularity Assumptions
Independence and Identical Distribution (iid)
The sample \{(x_i, y_i)\} is usually assumed to be composed of independent and identically distributed observations. This independence assumption simplifies the likelihood to a product of individual densities: L(\theta) = \prod_{i=1}^n f_{Y\mid X}(y_i, x_i; \theta). In practice, if you have dependent data (e.g., time series, spatial data), modifications are required in the likelihood function.
Same Density Function
All observations must come from the same conditional probability density function f_{Y\mid X}(\cdot,\cdot;\theta). If the model changes across observations, you cannot simply multiply all of them together in one unified likelihood.
Multivariate Normality (for certain models)
In many practical cases, especially for continuous outcomes, you might assume (multivariate) normal distributions with finite second or fourth moments (Little 1988). Under these assumptions, the MLE for the mean vector and covariance matrix is consistent and (under further conditions) asymptotically normal. This assumption is quite common in regression, ANOVA, and other classical statistical frameworks.
5.3.3.1 Large Sample Properties of MLE
5.3.3.1.1 Consistency of MLE
Definition: An estimator \hat{\theta}_n is consistent if it converges in probability to the true parameter value \theta_0 as the sample size n \to \infty:
\hat{\theta}_n \;\to^p\; \theta_0.
For the MLE, a set of regularity conditions R1–R4 is commonly used to ensure consistency:
R1
If \theta \neq \theta_0, then
f_{Y\mid X}(y_i, x_i; \theta) \;\neq\; f_{Y\mid X}(y_i, x_i; \theta_0).
In simpler terms, the model is identifiable: no two distinct parameter values generate the exact same distribution for the data.
R2
The parameter space \Theta is compact (closed and bounded), and it contains the true parameter \theta_0. This ensures that \theta lies in a “nice” region (no parameter going to infinity, etc.), making it easier to prove that a maximum in that space indeed exists.
R3
The log-likelihood function \ln(f_{Y\mid X}(y_i, x_i; \theta)) is continuous in \theta with probability 1. Continuity is important so that we can apply theorems (like the Continuous Mapping Theorem or the Extreme Value Theorem) to find maxima.
R4
The expected supremum of the absolute value of the log-likelihood is finite:
\mathbb{E}\!\Bigl(\sup_{\theta \in \Theta} \bigl|\ln(f_{Y\mid X}(y_i, x_i; \theta))\bigr|\Bigr) \;<\;\infty.
This is a technical condition that helps ensure we can “exchange” expectations and suprema, a step needed in many consistency proofs.
When these conditions are satisfied, you can show via standard arguments (e.g., the Law of Large Numbers, uniform convergence of the log-likelihood) that:
\hat{\theta}_{\mathrm{MLE}} \;\to^p\; \theta_0 \quad (\text{consistency}).
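Consistency can be illustrated (though of course not proved) by simulation. The sketch below draws Bernoulli samples of increasing size with a hypothetical true value \theta_0 = 0.3 and reports how far the MLE, the sample proportion, lies from \theta_0. The seed and the sample sizes are arbitrary choices.

```python
import random

random.seed(0)
theta_0 = 0.3   # hypothetical true success probability

def bernoulli_mle(n):
    """Sample proportion of n simulated Bernoulli(theta_0) draws."""
    successes = sum(1 for _ in range(n) if random.random() < theta_0)
    return successes / n

sample_sizes = [100, 1_000, 10_000, 100_000]
estimates = {n: bernoulli_mle(n) for n in sample_sizes}

for n, est in estimates.items():
    print(n, est, abs(est - theta_0))
```

The error need not shrink monotonically in any single run; convergence in probability only says that large errors become increasingly unlikely as n grows.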
5.3.3.1.2 Asymptotic Normality of MLE
Definition: An estimator \hat{\theta}_n is asymptotically normal if
\sqrt{n}\,(\hat{\theta}_n - \theta_0) \;\to^d\; \mathcal{N}\bigl(0,\Sigma\bigr),
where \to^d denotes convergence in distribution and \Sigma is some covariance matrix. For the MLE, \Sigma is typically I(\theta_0)^{-1}, where I(\theta_0) is the Fisher information of a single observation evaluated at the true parameter (see R8).
Beyond R1–R4, we need the following additional assumptions:
R5
The true parameter \theta_0 is in the interior of the parameter space \Theta. If \theta_0 sits on the boundary, different arguments are required to handle edge effects.
R6
The pdf f_{Y\mid X}(y_i, x_i; \theta) is twice continuously differentiable (in \theta) and strictly positive in a neighborhood N of \theta_0. This allows us to use second-order Taylor expansions around \theta_0 to get the approximate distribution of \hat{\theta}_{\mathrm{MLE}}.
R7
The following quantities are finite in some neighborhood N of \theta_0:
- \displaystyle \int \sup_{\theta \in N} \Bigl\|\partial f_{Y\mid X}(y_i, x_i; \theta)/\partial \theta \Bigr\|\; d(y,x) < \infty.
- \displaystyle \int \sup_{\theta \in N} \Bigl\|\partial^2 f_{Y\mid X}(y_i, x_i; \theta)/\partial \theta \partial \theta' \Bigr\|\; d(y,x) < \infty.
- \displaystyle \mathbb{E}\Bigl(\sup_{\theta \in N} \Bigl\|\partial^2 \ln(f_{Y\mid X}(y_i, x_i; \theta))/\partial \theta \partial \theta' \Bigr\|\Bigr) < \infty.
These conditions ensure that differentiating inside integrals is justified (via the dominated convergence theorem) and that we can expand the log-likelihood in a Taylor series safely.
R8
The information matrix I(\theta_0) exists and is nonsingular:
I(\theta_0) \;=\; \mathrm{Var}\Bigl(\frac{\partial}{\partial \theta} \ln\bigl(f_{Y\mid X}(y_i, x_i; \theta_0)\bigr)\Bigr) \;\neq\; 0.
Nonsingularity implies there is enough information in the data to estimate \theta uniquely.
Under R1–R8, you can show that
\sqrt{n}\,(\hat{\theta}_{\mathrm{MLE}} - \theta_0) \;\to^d\; \mathcal{N}\Bigl(0,\,I(\theta_0)^{-1}\Bigr).
This result is central to frequentist inference, allowing you to construct approximate confidence intervals and hypothesis tests using the normal approximation for large n.
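A small Monte Carlo experiment can make this limit tangible. The sketch below assumes an exponential model with mean \theta (so the MLE is the sample mean and the per-observation information is I(\theta) = 1/\theta^2) and compares the simulated variance of \sqrt{n}(\hat{\theta} - \theta_0) with I(\theta_0)^{-1} = \theta_0^2. The true value, sample size, number of replications, and seed are all hypothetical choices.

```python
import math
import random
import statistics

random.seed(1)
theta_0 = 2.0      # hypothetical true mean of an exponential distribution
n = 200            # sample size per replication
reps = 2000        # number of simulated samples

def scaled_error():
    """sqrt(n) * (theta_hat - theta_0), where theta_hat is the sample mean."""
    # random.expovariate takes the rate, i.e. 1 / mean.
    sample = [random.expovariate(1.0 / theta_0) for _ in range(n)]
    theta_hat = sum(sample) / n
    return math.sqrt(n) * (theta_hat - theta_0)

draws = [scaled_error() for _ in range(reps)]

simulated_variance = statistics.variance(draws)
inverse_information = theta_0 ** 2   # I(theta)^{-1} = theta^2 for this model

print(statistics.mean(draws), simulated_variance, inverse_information)
```

The simulated mean should be near zero and the simulated variance near \theta_0^2, up to Monte Carlo noise.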
5.3.4 Properties of MLE
Having established in earlier sections that Maximum Likelihood Estimators (MLEs) are consistent (Consistency of MLE) and asymptotically normal (Asymptotic Normality of MLE) under standard regularity conditions, we now highlight additional properties that make MLE a powerful estimation technique.
- Asymptotic Efficiency
- Definition: An estimator is asymptotically efficient if it attains the smallest possible asymptotic variance among all consistent estimators (i.e., it achieves the Cramér-Rao Lower Bound).
- Interpretation: In large samples, MLE typically has smaller standard errors than other consistent estimators that do not fully use the assumed distributional form.
- Implication: When the true model is correctly specified, MLE is the most efficient among a broad class of estimators, leading to more precise inference for \theta.
- Cramér-Rao Lower Bound (CRLB): A theoretical lower limit on the variance of any unbiased (or asymptotically unbiased) estimator (C. R. Rao 1992).
- When MLE Meets CRLB: Under correct specification and standard regularity conditions, the asymptotic variance of the MLE matches the CRLB, making it asymptotically efficient.
- Interpretation: Achieving CRLB means no other unbiased estimator can consistently outperform MLE in terms of variance for large n.
- Invariance
- Core Idea: If \hat{\theta} is the MLE for \theta, then for any smooth transformation g(\theta), the MLE for g(\theta) is simply g(\hat{\theta}).
- Example: If \theta is a standard deviation parameter and you want the MLE for the variance \theta^2, you can just square the MLE for \theta.
- Key Point: This invariance property saves considerable effort—there is no need to re-derive a new likelihood for the transformed parameter.
- Explicit vs. Implicit MLE
- Explicit MLE:
Occurs when the score equation can be solved in closed form. A classic example is the MLE for the mean and variance in a normal distribution.
- Implicit MLE:
Happens when no closed-form solution exists. Iterative numerical methods, such as Newton-Raphson, Expectation-Maximization (EM), or other optimization algorithms, are used to find \hat{\theta}.
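The invariance property and the iterative approach can be seen side by side in a small sketch. It reparameterizes a Bernoulli likelihood by the log-odds \eta = \ln(\theta / (1 - \theta)), finds the maximizer in \eta by Newton-Raphson, and compares it with g(\hat{\theta}) computed from the explicit MLE \hat{\theta} = s/n. This particular problem has a closed form, which is exactly what makes the iterative answer easy to check; the data are hypothetical.

```python
import math

# Hypothetical binary sample: s successes out of n trials.
n, s = 8, 5

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Log-likelihood in terms of the log-odds eta:
#   l(eta) = s * eta - n * ln(1 + exp(eta))
eta = 0.0                                   # starting value
for _ in range(25):
    p = sigmoid(eta)
    score = s - n * p                       # first derivative of l(eta)
    hessian = -n * p * (1.0 - p)            # second derivative of l(eta)
    step = score / hessian
    eta -= step                             # Newton-Raphson update
    if abs(step) < 1e-12:
        break

theta_hat = s / n                           # explicit MLE of theta
eta_from_invariance = math.log(theta_hat / (1.0 - theta_hat))

print(eta, eta_from_invariance)
```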
Distributional Mis-Specification
- Definition: If you assume a distribution for f_{Y|X}(\cdot;\theta) that does not reflect the true data-generating process, the MLE may become inconsistent or biased in finite samples.
- Quasi-MLE:
- A strategy to handle certain forms of mis-specification.
- If the chosen distribution belongs to a suitable class and part of the model is still correct (e.g., a linear exponential family distribution with a correctly specified conditional mean, as in Poisson quasi-MLE), the resulting parameter estimates can remain consistent for some parameters of interest.
- Nonparametric & Semiparametric Approaches:
- Require minimal or no distributional assumptions.
- More robust to mis-specification but can be harder to implement and may exhibit higher variance or require larger sample sizes to achieve comparable precision.
5.3.5 Practical Considerations
- Use Cases
- MLE is extremely popular for:
- Binary Outcomes (logistic regression)
- Count Data (Poisson regression)
- Strictly Positive Outcomes (Gamma regression)
- Heteroskedastic Settings (models with variance related to mean, e.g., GLMs)
- Distributional Assumptions
- The efficiency gains of MLE stem from using a specific probability model.
- If the assumed model closely reflects the data-generating process, MLE gives accurate parameter estimates and reliable standard errors.
- MLE assumes knowledge of the conditional distribution of the outcome variable. This assumption parallels the normality assumption in linear regression models (e.g., A6 Normal Distribution).
- If severely mis-specified, consider robust or semi-/nonparametric methods.
- Comparison with OLS: See Comparison of MLE and OLS for more details.
- Ordinary Least Squares is a special case of MLE when errors are normally distributed and homoscedastic.
- In more general settings (e.g., non-Gaussian or heteroskedastic data), MLE can outperform OLS in terms of smaller standard errors and better inference.
- Numerical Stability & Computation
- For complex likelihoods, iterative methods can fail to converge or converge to local maxima.
- Proper initialization and diagnostics (e.g., checking multiple start points) are crucial.
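The point about local maxima and multiple start points can be sketched with a Cauchy location model, whose log-likelihood can have several local maxima when the data form separated clusters. The data, the start points, and the fixed-step gradient ascent routine (a stand-in for any local optimizer) are all hypothetical choices for illustration.

```python
import math

# Hypothetical data with two separated clusters; Cauchy scale fixed at 1.
y = [-10.0, -9.5, 10.0, 10.5, 11.0]

def loglik(mu):
    """Cauchy log-likelihood in mu, dropping the constant -n * ln(pi)."""
    return -sum(math.log(1.0 + (yi - mu) ** 2) for yi in y)

def score(mu):
    """First derivative of loglik with respect to mu."""
    return sum(2.0 * (yi - mu) / (1.0 + (yi - mu) ** 2) for yi in y)

def gradient_ascent(start, step_size=0.1, iterations=2000):
    """Plain fixed-step gradient ascent from a given start point."""
    mu = start
    for _ in range(iterations):
        mu += step_size * score(mu)
    return mu

# Run the same local optimizer from several start points and keep the best.
starts = [-8.0, 0.0, 8.0]
solutions = [gradient_ascent(s) for s in starts]
best = max(solutions, key=loglik)

for s, sol in zip(starts, solutions):
    print(s, sol, loglik(sol))
```

The run started at -8.0 settles on a local maximum near the smaller cluster, while the other starts reach the higher peak near the larger cluster. Comparing log-likelihood values across starts is what reveals the difference.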
5.3.6 Comparison of MLE and OLS
While Maximum Likelihood Estimation is a powerful estimation method, it does not solve all of the challenges associated with Ordinary Least Squares. Below is a detailed comparison highlighting similarities, differences, and limitations.
Key Points of Comparison
- Inference Methods:
- MLE:
- Joint inference is typically conducted using log-likelihood calculations, such as likelihood ratio tests or information criteria (e.g., AIC, BIC).
- These methods replace the use of F-statistics commonly associated with OLS.
- OLS:
- Relies on the F-statistic for hypothesis testing and joint inference.
- Sensitivity to Functional Form:
- Both MLE and OLS are sensitive to the functional form of the model. Incorrect specification (e.g., linear vs. nonlinear relationships) can lead to biased or inefficient estimates in both cases.
- Perfect Collinearity and Multicollinearity:
- Both methods are affected by collinearity:
- Perfect collinearity (e.g., two identical predictors) makes parameter estimation impossible.
- Multicollinearity (highly correlated predictors) inflates standard errors, reducing the precision of estimates.
- Neither MLE nor OLS directly resolves these issues without additional measures, such as regularization or variable selection.
- Endogeneity:
- Problems like omitted variable bias or simultaneous equations affect both MLE and OLS:
- If relevant predictors are omitted, estimates from both methods are likely to be biased and inconsistent.
- Similarly, in systems of simultaneous equations, both methods yield biased results unless endogeneity is addressed through instrumental variables or other approaches.
- MLE, while efficient under correct model specification, does not inherently address endogeneity.
Situations Where MLE and OLS Differ
Aspect | MLE | OLS |
---|---|---|
Estimator Efficiency | Efficient for correctly specified distributions. | Efficient under Gauss-Markov assumptions. |
Assumptions about Errors | Requires specifying a distribution (e.g., normal, binomial). | Requires only mean-zero errors and homoscedasticity. |
Use of Likelihood | Based on maximizing the likelihood function for parameter estimation. | Based on minimizing the sum of squared residuals. |
Model Flexibility | More flexible (supports various distributions, non-linear models). | Primarily linear models (extensions for non-linear exist). |
Interpretation | Log-likelihood values guide model comparison (AIC/BIC). | R-squared and adjusted R-squared measure fit. |
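The "Use of Likelihood" row can be made concrete for the Gaussian linear model, where the two criteria coincide: for any fixed error variance, the normal log-likelihood is a decreasing function of the sum of squared residuals, so the OLS coefficients also maximize it. The sketch below checks this on hypothetical data by perturbing the OLS coefficients; the variable names are illustrative.

```python
import math

# Hypothetical data for a simple linear regression y = b0 + b1 * x + error.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

def ssr(b0, b1):
    """Sum of squared residuals, the OLS objective."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def gaussian_loglik(b0, b1, sigma2):
    """Log-likelihood under i.i.d. normal errors with variance sigma2."""
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - ssr(b0, b1) / (2.0 * sigma2)

# Closed-form OLS estimates.
x_bar, y_bar = sum(x) / n, sum(y) / n
b1_ols = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
b0_ols = y_bar - b1_ols * x_bar

# The Gaussian MLE of the error variance divides by n (not n - 2).
sigma2_mle = ssr(b0_ols, b1_ols) / n

# Perturbing the coefficients away from OLS lowers the Gaussian log-likelihood.
perturbations = [(d0, d1) for d0 in (-0.1, 0.0, 0.1) for d1 in (-0.1, 0.0, 0.1)
                 if (d0, d1) != (0.0, 0.0)]
ols_is_best = all(
    gaussian_loglik(b0_ols + d0, b1_ols + d1, sigma2_mle)
    < gaussian_loglik(b0_ols, b1_ols, sigma2_mle)
    for d0, d1 in perturbations
)

print(b0_ols, b1_ols, sigma2_mle, ols_is_best)
```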
Practical Considerations
- When to Use MLE:
- Situations where the dependent variable is:
- Binary (e.g., logistic regression)
- Count data (e.g., Poisson regression)
- Skewed or bounded (e.g., survival models)
- When the model naturally arises from a probabilistic framework.
- When to Use OLS:
- Suitable for continuous dependent variables with approximately linear relationships between predictors and outcomes.
- Simpler to implement and interpret when the assumptions of linear regression are reasonably met.
5.3.7 Applications of MLE
MLE is widely used across various applications to estimate parameters in models tailored for specific data structures. Below are key applications of MLE, categorized by problem type and estimation method.
Model Type | Examples | Key Characteristics | Common Estimation Methods | Additional Notes |
---|---|---|---|---|
Corner Solution Models | Hours worked; Donations to charity; Household consumption of a good | Dependent variable is often censored at zero (or another threshold). Large fraction of observations at the corner (e.g., 0 hours, 0 donations). | Tobit regression (latent variable approach with censoring) | Useful when a continuous outcome has a mass point at zero but also positive values (e.g., 30% of individuals donate $0, the rest donate > $0). |
Non-Negative Count Models | Number of arrests; Number of cigarettes smoked; Doctor visits per year | Dependent variable consists of non-negative integer counts. Possible overdispersion (variance > mean). | Poisson regression, Negative Binomial regression | Poisson assumes mean = variance, so often Negative Binomial is preferred for real data. Zero-inflated models (ZIP/ZINB) may be used for data with excess zeros. |
Multinomial Choice Models | Demand for different car brands; Votes in a primary election; Choice of travel mode | Dependent variable is a categorical choice among 3+ alternatives. Each category is distinct, with no inherent ordering (e.g., brand A, B, or C). | Multinomial logit, Multinomial probit | Extension of binary choice (logit/probit) to multiple categories. Independence of Irrelevant Alternatives (IIA) can be a concern for the multinomial logit. |
Ordinal Choice Models | Self-reported happiness (low/medium/high); Income level brackets; Likert-scale surveys | Dependent variable is ordered (e.g., low < medium < high). Distances between categories are not necessarily equal. | Ordered logit, Ordered probit | Probit/logit framework adapted to preserve ordinal information. Interprets latent continuous variable mapped to discrete ordered categories. |
5.3.7.1 Binary Response Models
A binary response variable (y_i) follows a Bernoulli distribution:
f_Y(y_i; p) = p^{y_i}(1-p)^{(1-y_i)}
where p is the probability of success. For conditional models, the likelihood becomes:
f_{Y|X}(y_i, x_i; p(.)) = p(x_i)^{y_i}(1 - p(x_i))^{(1-y_i)}
To model p(x_i), we use a function of x_i and unknown parameters \theta. A common approach involves a latent variable model:
\begin{aligned} y_i &= 1\{y_i^* > 0 \}, \\ y_i^* &= x_i \beta - \epsilon_i, \end{aligned}
where:
- y_i^* is an unobserved (latent) variable.
- \epsilon_i is a random variable with mean 0, representing unobserved noise.
Rewriting in terms of observed data:
y_i = 1\{x_i \beta > \epsilon_i\}.
The probability function becomes:
\begin{aligned} p(x_i) &= P(y_i = 1 | x_i) \\ &= P(x_i \beta > \epsilon_i | x_i) \\ &= F_{\epsilon|X}(x_i \beta | x_i), \end{aligned}
where F_{\epsilon|X}(.) is the cumulative distribution function (CDF) of \epsilon_i. Assuming independence of \epsilon_i and x_i, the probability function simplifies to:
p(x_i) = F_\epsilon(x_i \beta).
The conditional expectation function is equivalent:
E(y_i | x_i) = P(y_i = 1 | x_i) = F_\epsilon(x_i \beta).
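The link between the latent variable model and the response probability can be checked by simulation. The sketch below assumes a standard logistic error, fixes a hypothetical index value x_i \beta = 0.5, generates y_i = 1\{x_i \beta > \epsilon_i\}, and compares the share of ones with F_\epsilon(x_i \beta). The seed, the index value, and the helper names are assumptions for illustration.

```python
import math
import random

random.seed(2)

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def draw_standard_logistic():
    """Inverse-CDF draw: if U ~ Uniform(0, 1), ln(U / (1 - U)) is logistic."""
    u = random.random()
    while u == 0.0:            # random() can return 0.0; avoid log(0)
        u = random.random()
    return math.log(u / (1.0 - u))

index = 0.5       # hypothetical value of x_i * beta
n = 100_000

# y_i = 1{x_i * beta > epsilon_i}
share_of_ones = sum(1 for _ in range(n) if index > draw_standard_logistic()) / n

print(share_of_ones, logistic_cdf(index))
```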
Common Distributional Assumptions
- Probit Model:
- Assumes \epsilon_i follows a standard normal distribution.
- F_\epsilon(.) = \Phi(.), where \Phi(.) is the standard normal CDF.
- Logit Model:
- Assumes \epsilon_i follows a standard logistic distribution.
- F_\epsilon(.) = \Lambda(.), where \Lambda(.) is the logistic CDF.
Steps to Derive MLE for Binary Models
- Specify the Log-Likelihood:
For a chosen distribution (e.g., normal for Probit or logistic for Logit), the log-likelihood is:
\ln(f_{Y|X}(y_i, x_i; \beta)) = y_i \ln(F_\epsilon(x_i \beta)) + (1 - y_i) \ln(1 - F_\epsilon(x_i \beta)).
- Maximize the Log-Likelihood:
Find the parameter estimates that maximize the log-likelihood:
\hat{\beta}_{MLE} = \underset{\beta}{\text{argmax}} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \beta)).
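The two steps above can be sketched for a logit model with an intercept and one regressor. Because the score equations have no closed-form solution, the sketch uses Newton-Raphson with a hand-coded 2 x 2 inverse. The data are hypothetical, and this is an illustrative implementation rather than the routine a statistical package would use.

```python
import math

# Hypothetical data: outcomes overlap in x, so the logit MLE is finite.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def loglik(b0, b1):
    """Sum of y_i ln F(x_i beta) + (1 - y_i) ln(1 - F(x_i beta)) for the logit."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = logistic_cdf(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def score_and_information(b0, b1):
    """Gradient of the log-likelihood and the negative Hessian (2 x 2)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = logistic_cdf(b0 + b1 * xi)
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return (g0, g1), (h00, h01, h11)

b0, b1 = 0.0, 0.0
for _ in range(50):
    (g0, g1), (h00, h01, h11) = score_and_information(b0, b1)
    det = h00 * h11 - h01 * h01
    # Newton-Raphson step: beta + information^{-1} * score
    step0 = (h11 * g0 - h01 * g1) / det
    step1 = (-h01 * g0 + h00 * g1) / det
    b0, b1 = b0 + step0, b1 + step1
    if abs(step0) + abs(step1) < 1e-12:
        break

print(b0, b1, loglik(b0, b1))
```

At convergence the score is numerically zero, which is the first-order condition for \hat{\beta}_{MLE}.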
Properties of Probit and Logit Estimators
- Consistency and Asymptotic Normality:
- Probit and Logit estimators are consistent and asymptotically normal if:
- A2 Full Rank: E(x_i' x_i) exists and is non-singular.
- A5 Data Generation (Random Sampling): \{y_i, x_i\} are iid (or stationary and weakly dependent).
- Distributional assumptions on \epsilon_i hold (e.g., normal or logistic, independent of x_i).
- Asymptotic Efficiency:
Under these assumptions, Probit and Logit estimators are asymptotically efficient with variance:
I(\beta_0)^{-1} = \left[E\left(\frac{(f_\epsilon(x_i \beta_0))^2}{F_\epsilon(x_i \beta_0)(1 - F_\epsilon(x_i \beta_0))} x_i' x_i \right)\right]^{-1},
where f_\epsilon(x_i \beta_0) is the PDF (derivative of the CDF).
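The scalar weight f_\epsilon^2 / (F_\epsilon (1 - F_\epsilon)) inside the expectation takes a simple form for the logit: since the logistic PDF is \Lambda(1 - \Lambda), the weight reduces to \Lambda(1 - \Lambda). The sketch below evaluates the weight for both models at a few hypothetical index values; `NormalDist` comes from the Python standard library.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()   # standard normal, used for the probit

def logit_weight(z):
    """f(z)^2 / (F(z) * (1 - F(z))) with F the logistic CDF."""
    F = 1.0 / (1.0 + math.exp(-z))
    f = F * (1.0 - F)                 # logistic PDF
    return f ** 2 / (F * (1.0 - F))

def probit_weight(z):
    """f(z)^2 / (F(z) * (1 - F(z))) with F the standard normal CDF."""
    F = std_normal.cdf(z)
    f = std_normal.pdf(z)
    return f ** 2 / (F * (1.0 - F))

for z in (-2.0, 0.0, 2.0):
    print(z, logit_weight(z), probit_weight(z))
```

Both weights are largest near an index of zero, so observations with predicted probabilities near 0.5 contribute the most information about \beta.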
Interpretation of Binary Response Models
Binary response models, such as Probit and Logit, estimate the probability of an event occurring (y_i = 1) given predictor variables x_i. However, interpreting the estimated coefficients (\beta) in these models differs significantly from linear models. Below, we explore how to interpret these coefficients and the concept of partial effects.
- Interpreting \beta in Binary Response Models
In binary response models, the coefficient \beta_j represents the average change in the latent variable y_i^* (an unobserved variable) for a one-unit change in x_{ij}. While this provides insight into the direction of the relationship:
- Magnitudes of \beta_j do not have a direct, meaningful interpretation in terms of y_i.
- Direction of \beta_j is meaningful:
- \beta_j > 0: A positive association between x_{ij} and the probability of y_i = 1.
- \beta_j < 0: A negative association between x_{ij} and the probability of y_i = 1.
- Partial Effects in Nonlinear Binary Models
To interpret the effect of a change in a predictor x_{ij} on the probability of an event occurring (P(y_i = 1|x_i)), we use the partial effect. The starting point is the conditional expectation:
E(y_i | x_i) = F_\epsilon(x_i \beta),
where F_\epsilon(.) is the cumulative distribution function (CDF) of the error term \epsilon_i (e.g., standard normal for Probit, logistic for Logit). The partial effect is the derivative of the expected probability with respect to x_{ij}:
PE(x_{ij}) = \frac{\partial E(y_i | x_i)}{\partial x_{ij}} = f_\epsilon(x_i \beta) \beta_j,
where:
- f_\epsilon(.) is the probability density function (PDF) of the error term \epsilon_i.
- \beta_j is the coefficient associated with x_{ij}.
- Key Characteristics of Partial Effects
- Scaling Factor:
- The partial effect depends on a scaling factor, f_\epsilon(x_i \beta), which is derived from the density function f_\epsilon(.).
- The scaling factor varies depending on the values of x_i, making the partial effect nonlinear and context-dependent.
- Non-Constant Partial Effects:
- Unlike linear models where coefficients directly represent constant marginal effects, the partial effect in binary models changes based on x_i.
- For example, in a Logit model, the partial effect is largest when P(y_i = 1 | x_i) is around 0.5 (the midpoint of the S-shaped logistic curve) and smaller at the extremes (close to 0 or 1).
- Single Values for Partial Effects
In practice, researchers often summarize partial effects using either:
- Partial Effect at the Average (PEA):
- The partial effect is calculated for an “average individual,” where x_i = \bar{x} (the sample mean of predictors): PEA = f_\epsilon(\bar{x}\hat{\beta}) \hat{\beta}_j.
- This provides a single, interpretable value but assumes the average effect applies to all individuals.
- Average Partial Effect (APE):
- The average of all individual-level partial effects across the sample: APE = \frac{1}{n} \sum_{i=1}^{n} f_\epsilon(x_i \hat{\beta}) \hat{\beta}_j.
- This accounts for the nonlinearity of the partial effects and provides a more accurate summary of the marginal effect in the population.
- Comparing Partial Effects in Linear and Nonlinear Models
- Linear Models:
- Partial effects are constant: APE = PEA.
- The coefficients directly represent the marginal effects on E(y_i | x_i).
- Nonlinear Models:
- Partial effects are not constant due to the dependence on f_\epsilon(x_i \beta).
- As a result, APE \neq PEA in general.
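The difference between the two summaries can be seen in a short logit sketch. The coefficient values and regressor values below are hypothetical (they are not estimates from any data in the text); the code simply applies the PEA and APE formulas given earlier.

```python
import math

# Hypothetical logit coefficients and regressor values, for illustration only.
b0_hat, b1_hat = -2.0, 1.0
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

def logistic_pdf(z):
    """Logistic density f(z) = F(z) * (1 - F(z))."""
    F = 1.0 / (1.0 + math.exp(-z))
    return F * (1.0 - F)

# Partial effect at the average: evaluate the density at the mean of x.
x_bar = sum(x) / len(x)
pea = logistic_pdf(b0_hat + b1_hat * x_bar) * b1_hat

# Average partial effect: average the individual partial effects.
ape = sum(logistic_pdf(b0_hat + b1_hat * xi) * b1_hat for xi in x) / len(x)

print(pea, ape)
```

In this example the average individual sits close to the steepest part of the logistic curve, so the PEA exceeds the APE. With other data the ordering can differ.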