Chapter 4 Maximum Likelihood

4.1 Motivation

We’ve discovered MoM estimation, which is (usually) pretty simple but often not a realistic type of estimator. In this section, we’ll take on another estimation method: Maximum Likelihood Estimation, or MLE for short, one of the biggest concepts in Inference. You’ll find this to be more intricate but also more relevant than the MoM approach; indeed, MLE shows up in many, many statistical adventures.

4.2 Maximum Likelihood Estimation

If you wanted to sum up Method of Moments (MoM) estimators in one sentence, you would say “estimates for parameters in terms of the sample moments.” For MLEs (Maximum Likelihood Estimators), you would say “estimators for a parameter that maximize the likelihood, or probability, of the observed data.” That is, based on the data you collect, an MLE is the value of the parameter that makes the data you observed most likely.

This is pretty intuitive. We collect data, and then, based on that data, there are likely values for the true parameter. For example, if we sampled 100 college men and found an average weight of 140 pounds, it’s more likely that the true average weight is 142 pounds than the true average weight being 195 pounds or 103 pounds. Of course, this is qualitative; MLEs will allow us to completely quantify this approach.

The MLE approach centers around the likelihood function. Basically, this is the likelihood, or PDF, of the data given the parameters. That’s the definition in English; in statistics, we would write:

\[f(X_1,X_2,...,X_n|\theta)\]

Where, remember, $X_1,X_2,...,X_n$ are the $n$ observations that we sample, and each is a random variable governed by the underlying model with $\theta$ as a parameter. Essentially, then, this function gives us the probability (or density) of observing the data we observed, $X_1,X_2,...,X_n$, for different values of the parameter $\theta$. Intuitively, we want to maximize this; that is, find the value of $\theta$ for which this is the highest. That makes sense, right? The value of $\theta$ that maximizes this function is the value of the parameter that makes the data we observed the most likely. If we find this value for $\theta$, we find the MLE, written as $\hat{\theta}_{MLE}$.

So, that’s the likelihood function. Remember in the intro to this book how we mentioned that often we make the assumption of i.i.d. observations; that is, every individual we sample is independent and follows the same distribution? Well, if we employ that assumption here, the likelihood function simplifies quite a bit. Remember that if two random variables are independent, their joint PDF is the product of the two marginal PDFs. So, if $X_1,X_2,...,X_n$ are i.i.d., then the joint PDF is one big product:

\[f(X_1,X_2,...,X_n|\theta) = f(X_1|\theta)f(X_2|\theta)...f(X_n|\theta) = \prod_{i=1}^n f(X_i|\theta)\]

If you haven’t seen the $\prod$ notation before, it’s the exact same as a $\sum$, but instead of taking the sum of the values you take their product. That is, this product means multiply all of the PDFs from $X_1$ to $X_n$.

Neat! This is starting to look like something relatively easy to work with. However, let’s think about what sort of value we would get if we actually tried this for a specific distribution. Let’s think about a specific $f(X_i|\theta)$; say, $f(X_7|\theta)$. If we actually ran the experiment, this $f(X_7|\theta)$ is basically saying “what’s the probability (or density, in the continuous case) that the $7^{th}$ person had the value that they did (maybe we’re measuring weight, and the $7^{th}$ person weighed 150 pounds) given some parameter (maybe we’re assuming that the mean weight is 145)?”

It’s important to note that many of the $f(X_i|\theta)$ are likely going to be pretty small, and we’re multiplying all of them together! What if $n = 100$? Then we’re multiplying a bunch of really small things together, which of course makes something that’s really, really small!

Remember, the whole point was to maximize this function by picking the correct $\theta$; however, if we have a bunch of output that’s really, really small for every $\theta$ we try, it may be computationally impossible to compare them (too small for a human or even a computer to reasonably work with). So, this gives us a reason to introduce the log likelihood function.

The log-likelihood function is just what it sounds like; it’s the log (base $e$, or natural log) of the likelihood function!

\[l(\theta) = log\Big(\prod_{i=1}^n f(X_i|\theta)\Big) = \sum_{i=1}^n log\big(f(X_i|\theta)\big)\]

Where $l(\theta)$ is notation that means ‘log-likelihood function.’

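You can see that this essentially solves our smallness problem: the log of a tiny positive number is just a moderately sized negative number, so instead of multiplying a lot of small things and getting something vanishingly small, we are adding up ordinary-sized numbers that a computer handles easily!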
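To see the smallness problem concretely, here is a minimal sketch in R. The simulated standard Normal data and the sample size of 1000 are assumptions chosen purely for illustration; they are not part of the examples in this chapter.

```r
# illustrative data (assumed): 1000 draws from a standard Normal
set.seed(1)
x <- rnorm(1000)

# likelihood: a product of 1000 densities, each below 0.4
lik <- prod(dnorm(x))

# log likelihood: the sum of the log densities
loglik <- sum(dnorm(x, log = TRUE))

lik      # underflows to 0 in double precision
loglik   # a finite negative number
```

The product is too small for the computer to represent, so every candidate parameter would look identical (all zeros), while the sum of logs stays perfectly usable.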

Another cool feature of the log likelihood function is that the value of $\theta$ that maximizes the log likelihood is the same value that maximizes the likelihood (thanks to the fact that the log function is strictly increasing: a bigger likelihood always means a bigger log likelihood). The point is that we can maximize the log likelihood function and get the correct maximum likelihood estimator.
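A quick way to convince yourself of this is to evaluate both functions on a grid of candidate parameter values and check where each one peaks. The sketch below assumes Normal data with a known standard deviation of 1 and an unknown mean; the data, the grid, and the variable names are all illustrative choices.

```r
# illustrative data (assumed): 10 draws from a Normal with mean 3, sd 1
set.seed(1)
x <- rnorm(10, mean = 3)

# candidate values for the unknown mean
mu_grid <- seq(0, 6, by = 0.01)

# likelihood and log likelihood at each candidate value
lik <- sapply(mu_grid, function(m) prod(dnorm(x, mean = m)))
loglik <- sapply(mu_grid, function(m) sum(dnorm(x, mean = m, log = TRUE)))

# both curves peak at the same candidate value
mu_grid[which.max(lik)]
mu_grid[which.max(loglik)]
```

With only 10 observations the raw likelihood is still representable, which is what lets us compare the two directly here.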

That’s about it! Now, we just have to maximize the log-likelihood function, and this gives us $\hat{\theta}_{MLE}$, or the maximum likelihood estimator. In this book, we’ll (generally) use one of the two following methods to maximize these functions:

- Differentiate the function, set the derivative equal to zero, and solve for the parameter.
- Maximize it in R.

Either of these is a totally valid way to get $\hat{\theta}_{MLE}$. In general, the steps you take for finding the MLE are:

- Find the likelihood function (generally just a big product of some PDF you’re familiar with).
- Take the log of the likelihood function to get the log likelihood function.
- Maximize the log likelihood function with respect to the parameters you are looking for.

This probably seems a little hairy, so we’re going to do an example. We’ll work with the Exponential distribution. Recall that if $X \sim Expo(\lambda)$, the PDF is given by:

\[f(x) = \lambda e^{-x\lambda}\]

So, basically, here’s what’s going to happen. Let’s say we’re interested in the survival time after a cancer diagnosis. You believe that the survival times are Exponentially distributed (many die quickly, some live for a long time). You decide to carry out inference for this problem; that is, you are going to estimate the parameter, $\lambda$, that would govern the Exponential distribution that you think underlies cancer survival. You might get the data (a sample) from a hospital or something. Assume that the observations are i.i.d.

You decide that you want to find the Maximum Likelihood Estimator for $\lambda$. In our notation, that’s $\hat{\lambda}_{MLE}$.

How do we start in the process of finding the MLE? Well, first we need the likelihood function. Recall that this is given by:

\[f(X_1,X_2,...,X_n|\theta)\]

And here, since $\lambda$ is the only parameter we’re interested in, $\lambda$ is $\theta$. Also, we know that the observations (by assumption) are i.i.d., so we can write the joint PDF as a product of the marginals:

\[\prod_{i=1}^n f(X_i|\lambda)\]

Great. So, what is $f(X_i|\lambda)$? Well, we already said that we believe the $X$s to be Exponential, so conditional on the parameter, $\lambda$, the PDF is just the PDF of an exponential with $\lambda$ as the parameter, or $f(x) = \lambda e^{-x\lambda}$. Let’s plug this into the above product:

\[\prod_{i=1}^n f(X_i|\lambda) = \prod_{i=1}^n\lambda e^{-x_i\lambda} = \lambda^n e^{-\lambda \sum_{i=1}^n x_i}\]

Make sure to convince yourself of that last step; working with products and sums is a key part of inference. We recommend writing out the first few terms in the sequence, then simplifying, until you see a pattern. To break it down further, the $\lambda$ leading coefficient next to the $e$ has nothing to do with the changing $x_i$ values in the product, so we just end up multiplying by $\lambda$ a total of $n$ times, hence the term $\lambda^n$ coming out. Then, we have to deal with a product of exponentials as the $x_i$’s change. Of course, we know that $e^a e^b = e^{a+b}$, so you can see how the $x_i$’s in the exponent turn into a sum. Again, try with the first few terms first if you’re unclear.
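If writing out terms by hand doesn’t fully convince you, a numerical check can help. The sketch below uses five made-up observations and a made-up rate of 2 (both are assumptions for illustration) and compares the term-by-term product with the simplified form.

```r
# illustrative values (assumed): five observations and a rate of 2
x <- c(0.3, 1.2, 0.7, 2.5, 0.1)
lambda <- 2
n <- length(x)

# left side: product of the individual Exponential PDFs
lhs <- prod(lambda * exp(-lambda * x))

# right side: the simplified form lambda^n * exp(-lambda * sum(x))
rhs <- lambda^n * exp(-lambda * sum(x))

# the same product using R's built-in Exponential density
builtin <- prod(dexp(x, rate = lambda))

all.equal(lhs, rhs)
all.equal(lhs, builtin)
```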

Now that we have the likelihood function, what’s the next step? Well, remember that if we actually did this for data, we would get a very small number, and thus it’s important for us to take the log of this function before working with it further. So, we take the log. Remember your log rules: that is, $log(a^b) = b*log(a)$, $log(ab) = log(a) + log(b)$, and $log(e^a) = a$, since we’re always working in base $e$, or $ln$, even though we’re too lazy to write it.

\[log(\lambda^n e^{-\lambda \sum_{i=1}^n x_i}) = n\cdot log(\lambda) - \lambda \sum_{i=1}^n x_i\]

That’s our log likelihood function! The only thing left to do is maximize this for $\lambda$, which will give us $\hat{\lambda}_{MLE}$. How do we maximize? We take the derivative first, which yields:

\[\frac{n}{\lambda} - \sum_{i=1}^n x_i\]

And then we set equal to 0 and solve for $\lambda$.

\[\frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 \rightarrow \lambda = \frac{n}{\sum_{i=1}^n x_i }\]

That looks kind of ugly, but recall that the sample mean $\bar{X}$ is $\frac{1}{n} \sum_{i=1}^n x_i$. Hey, that’s just the reciprocal of $\frac{n}{\sum_{i=1}^n x_i }$. What that means is $\frac{n}{\sum_{i=1}^n x_i } = \frac{1}{\bar{X}}$, so we have finally found that:

\[\hat{\lambda}_{MLE} = \frac{1}{\bar{X}}\]

Or the MLE for $\lambda$ is 1 divided by the sample mean.

We could also maximize this function in R. We will define a function FUN_loglik that takes in data x and parameter lambda and optimize for the value of lambda.

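# set the seed so the results replicate
set.seed(0)

# generate data
lambda <- 5
n <- 20
x <- rexp(n, lambda)

# log likelihood for the Exponential; the sample size is taken from the data
FUN_loglik <- function(lambda, x) {
  return(length(x) * log(lambda) - lambda * sum(x))
}

# maximize (fnscale = -1 turns optim's default minimization into maximization)
optim(par = c(.5), fn = FUN_loglik, x = x, control = list(fnscale = -1))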

## Warning in optim(par = c(0.5), fn = FUN_loglik, x = x, control = list(fnscale = -1)): one-dimensional optimization by Nelder-Mead is unreliable:
## use "Brent" or optimize() directly

## $par
## [1] 4.464844
## 
## $value
## [1] 9.923761
## 
## $counts
## function gradient 
##       36       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

1 / mean(x)

## [1] 4.464638

We see that the optimization point, par = 4.46, is close to the MLE, or the reciprocal of the sample mean.
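The warning in the output above suggests using optimize() for one-dimensional problems. Here is a minimal sketch of that alternative; the function name loglik_expo and the search interval from 0.001 to 100 are assumptions made for this illustration, and the interval just needs to be wide enough to contain the answer.

```r
# regenerate data with the same seed and parameters as above
set.seed(0)
x <- rexp(20, rate = 5)

# log likelihood for the Exponential; the sample size is taken from the data
loglik_expo <- function(lambda, x) {
  length(x) * log(lambda) - lambda * sum(x)
}

# one-dimensional maximization; the search interval is an assumption
fit <- optimize(loglik_expo, interval = c(0.001, 100), x = x, maximum = TRUE)

fit$maximum   # numerical maximizer
1 / mean(x)   # closed-form MLE
```

Since optimize() is built for a single parameter, it avoids the Nelder-Mead warning and should land very close to the reciprocal of the sample mean.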

Phew! Does that make sense? We know that generally, if we want to estimate a population mean, we just use the sample mean as our estimator. However, remember that the mean of the Exponential is $\frac{1}{\lambda}$, or, rather, the parameter is the inverse of the mean of the distribution. It makes sense, then, that we take the inverse of the sample mean to get the MLE for the parameter $\lambda$.

Long story short, this means that (back to our cancer survival example), if you took a sample of survival times and found an average (sample mean) survival time of 10 months, then the MLE for $\lambda$ would be $\frac{1}{10}$.

Hopefully this clears up some of the confusion surrounding MLEs. The Exponential is definitely one of the easier examples, and while the process stays relatively constant, you’ll have to do some messy algebra with sums and products as you get into the more advanced distributions. However, if you have a strong base understanding of MLEs, you’re in a good spot.

4.3 Maximum Likelihood Confidence Intervals

Recall that Confidence Intervals give a plausible interval estimate for something we’re interested in. Kind of seems weird to think of an interval with MLEs, but remember that the MLE is just an estimator for a true parameter, and thus we will have some interval of uncertainty around our estimate.

Let’s recall the generic form for a Confidence Interval:

\[estimator \; \pm \; (z-value)*(SD \; of \; estimator)\]

This works for the Normal distribution; basically, we’re plucking off the extreme quantiles of the distribution and taking everything in the middle. Remember that we saw a couple of different variations of this: sometimes the estimator does not follow a Normal distribution; in that case, we used a value from the t distribution instead of a z-value for the Normal. Other times, we didn’t know the standard deviation of an estimator and had to approximate that (with the standard deviation of the sample).

We know for the MLE confidence interval our estimator is, well, the MLE, or $\frac{1}{\bar{X}}$ for the example we just did (yes, that’s a $\bar{X}$ in the denominator, it’s just hard to see). However, what’s the standard deviation? What’s the distribution? Are we allowed to use a z-value, or do we have to do a t-value? To answer these questions, we need to first visit another topic of this section.

4.3.1 Fisher’s Information

We already found out how to find MLEs, but we don’t really know their uncertainty, or variance. Could we find a way to calculate how reliable these estimators are? In the example we did earlier, how could we find the variance of $\frac{1}{\bar{X}}$, our MLE? Is there a general form that we can follow?

Turns out that Fisher’s Information will be useful to us. Remember how the MLE is just the maximum of the log likelihood function? Well, Fisher’s Information is also found from the log likelihood function. Specifically, it’s the negative Expectation of the second derivative of the log likelihood function, where you differentiate with respect to the parameter in question. In stat terms, we get:

\[I_n(\theta) = -E\big(\frac{\partial^2}{\partial \theta^2} l(\theta)\big)\]

Where again $l(\theta)$ is shorthand for the log likelihood function and $I_n(\theta)$ is shorthand for Fisher’s information. Here, $\theta$ is the parameter in question.

This is a very useful result. Let’s find Fisher’s Information for the Exponential example we did earlier. We already found that the first derivative of the log likelihood function is (note that we use $\lambda$ as the parameter in question, instead of the $\theta$ from above):

\[l'(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n x_i\]

So we just have to differentiate again to get the second derivative, take the expectation, and multiply by -1. Taking another derivative yields:

\[\frac{-n}{\lambda^2}\]

What’s the expectation of this? Well, $n$ is a constant, and $\lambda$ is a parameter. Neither is a random variable, so the expectation of this expression is just the expression itself. Then, we multiply by -1 to get Fisher’s Information:

\[I_n(\lambda) = \frac{n}{\lambda^2}\]

Neat!
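As a sanity check on the calculus, we can approximate the second derivative of the log likelihood numerically and compare it with $\frac{n}{\lambda^2}$. This is only a sketch: the simulated data, the step size h, and the function name loglik_expo are assumptions for illustration. Because the second derivative here does not involve the data, no expectation needs to be simulated.

```r
# illustrative values (assumed)
set.seed(1)
lambda <- 5
n <- 20
x <- rexp(n, rate = lambda)

# Exponential log likelihood
loglik_expo <- function(lambda, x) {
  length(x) * log(lambda) - lambda * sum(x)
}

# central finite-difference approximation to the second derivative
h <- 1e-3
second_deriv <- (loglik_expo(lambda + h, x) -
                 2 * loglik_expo(lambda, x) +
                 loglik_expo(lambda - h, x)) / h^2

-second_deriv   # numerical version of Fisher's Information
n / lambda^2    # closed form
```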

You might be saying that we don’t actually know $\lambda$ (and we use $\lambda$ in the calculation above), since it’s the true parameter that we’re actually after. That’s right, so generally we’ll use an estimator for $\lambda$ if we see it; more on this later.

You might be wondering why Fisher’s information is useful; well, we can use it to find the distribution of the MLE! That is, we can find an MLE, but also find its distribution. It’s given by:

\[\hat{\theta}_{MLE} \sim N\big(\theta_0, \frac{1}{I_n(\theta_0)}\big)\]

Where $\theta_0$ is the true value of the parameter, the thing that we’re trying to make inferences about (we won’t go over the proof here). This result holds approximately if $n$ is large, there aren’t too many crazy outliers and you have i.i.d. observations (again, apologies if this is vague, but we won’t be reviewing the proof here).

This is huge, because it allows us to find the confidence interval based on the MLE for our parameters. We know that the MLE has a Normal distribution, so the z-value should work out. We now know the standard deviation of the estimator as well; it’s just the square root of the variance we found above. So, our confidence interval becomes:

\[\hat{\theta}_{MLE} \pm \frac{z^*_{1 - \frac{\alpha}{2}}}{\sqrt{I_n(\theta_0)}}\]

Where $z^*_{1 - \frac{\alpha}{2}}$ is just the z-value that puts $\frac{\alpha}{2}$ weight in each tail (that is, the $1 - \frac{\alpha}{2}$ quantile of the standard Normal). Usually $\alpha = .05$, which gives a 95$\%$ confidence interval. Remember, this is a confidence interval for $\theta_0$, the true parameter that we’re actually trying to learn something about. However, like we mentioned earlier, we don’t actually know the true parameter value, so we can’t plug $\theta_0$ into the Fisher Information here. Instead, we’ll plug in our best estimate for the parameter, or $\hat{\theta}_{MLE}$. So, our confidence interval becomes:

\[\hat{\theta}_{MLE} \pm \frac{z^*_{1 - \frac{\alpha}{2}}}{\sqrt{I_n(\hat{\theta}_{MLE})}}\]

Is it ok just to throw in $I_n(\hat{\theta}_{MLE})$ for $I_n(\theta_0)$? The answer for this book: if $n$ is large enough, no big deal, since things become Normal in the long run.

Confused? Probably. No worries; let’s do an example. Remember, this is an MLE based confidence interval for a true parameter. Let’s return to our cancer-survival time example; we guessed that survival times are distributed $Expo(\lambda)$, and we want to make inferences on $\lambda$. We already found an estimate by using the MLE method: we found that $\hat{\lambda}_{MLE}$, the Maximum Likelihood Estimator, was the inverse of the sample mean ($1$ divided by the sample mean). So, if we found a sample survival average time of 10 months, then we would estimate that the survival times are distributed $Expo(\frac{1}{10})$.

However, we’d like to find out how good this estimate is, which we can do by finding a confidence interval. Well, let’s take a look back at the formula for an MLE confidence interval. We already have the $\hat{\theta}_{MLE}$, the MLE estimate; it’s 1 divided by the sample mean. Let’s say that we want a 95$\%$ confidence interval, so our z-value is 1.96, because it’s the z-value that puts a weight of 2.5$\%$ in each tail. Now, what about Fisher’s Information? We found earlier that Fisher’s Information for an Exponential is:

\[I_n(\lambda) = \frac{n}{\lambda^2}\]

However, again, we don’t know $\lambda$, so we use

\[I_n(\hat{\lambda}_{MLE}) = \frac{n}{\hat{\lambda}_{MLE}^2}\]

or the MLE estimate of $\lambda$. Well, we know the MLE estimate for $\lambda$ is 1 divided by the sample mean, so this whole thing becomes:

\[\frac{n}{\hat{\lambda}_{MLE}^2} = n \bar{X}^2\]

You could simplify this further, but we will keep this expression for now. Now, we can just go ahead and write our confidence interval:

\[\frac{1}{\bar{X}} \pm \frac{1.96}{\bar{X}\sqrt{n}}\]

So, for example, we said we got a sample mean of 10 months of survival. That means our estimate for $\lambda$ is $\frac{1}{10}$. What’s the confidence interval? Well, say we collected a sample of 36 people. We can plug in to our confidence interval formula:

\[\frac{1}{10} \pm \frac{1.96}{60} = (.067,.133)\]

And we have created an MLE based confidence interval for our true parameter! You can see how if we collected more observations, holding all else constant (larger $n$) the interval would get tighter and tighter, as the Variance of the MLE gets smaller and smaller.
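The arithmetic above can be wrapped in a small function. This is a sketch only: the name mle_ci_expo and its arguments are hypothetical, and it uses qnorm() to get the z-value (about 1.96 for a 95$\%$ interval) rather than typing it in.

```r
# hypothetical helper: MLE-based confidence interval for an Exponential rate
mle_ci_expo <- function(xbar, n, conf_level = 0.95) {
  lambda_hat <- 1 / xbar                  # MLE: reciprocal of the sample mean
  z <- qnorm(1 - (1 - conf_level) / 2)    # about 1.96 when conf_level = 0.95
  se <- lambda_hat / sqrt(n)              # 1 / sqrt(I_n(lambda_hat))
  c(lower = lambda_hat - z * se, upper = lambda_hat + z * se)
}

# sample mean of 10 months from 36 people, as in the example
round(mle_ci_expo(xbar = 10, n = 36), 3)
```

Rounded to three decimals, this reproduces the interval (.067, .133) from the worked example.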

We can generate these confidence intervals in R. In the simulation below, 96% of the 100 intervals contain the true lambda value, which is close to the nominal 95% (we are building intervals at the 95% confidence level, which is where the 1.96 term comes from). This gives us another interpretation of the confidence level: in the long run, about 95% of intervals built this way will include the true parameter!

library(data.table)

# set the seed so the results replicate
set.seed(0)
nsims <- 100

# parameters and generate sample means
lambda <- 5
n <- 30
samp_mean <- replicate(nsims, mean(rexp(n, lambda)))

# calculate the confidence intervals
data <- data.table(x = 1:nsims,
                   lower = 1 / samp_mean - 1.96 / (samp_mean * sqrt(n)),
                   upper = 1 / samp_mean + 1.96 / (samp_mean * sqrt(n)))

# proportion of intervals that include the true value
mean(data$lower < lambda & data$upper > lambda)

## [1] 0.96

Anyways, there are two major takeaways from this section:

- Fisher’s Information, or the negative of the expectation of the second derivative of the log likelihood function with respect to the parameters, allows us to find the distribution of the MLE.
- You can build a confidence interval based on your MLE estimate using the MLE and Fisher’s Information, thanks to the distribution of the MLE (Normal with Fisher’s Information as the reciprocal of the Variance).

This likely feels like a lot now, but don’t worry. These things come with practice.

A final point on MLEs and why we like them. We say that MLEs are asymptotically efficient. That means two major things. First, the Bias goes to 0 as $n$, the sample size, goes to infinity. Remember, Bias is $E(\hat{\theta} - \theta)$, or how far off our estimator is, on average, from the parameter we’re actually trying to estimate. The ideal is an unbiased estimator, or an estimator whose expectation is the parameter in question, and thus has bias of 0. Second, asymptotic efficiency means that the Variance of the MLE goes to the reciprocal of Fisher’s Information; that is:

\[Var(\hat{\theta}) \approx \frac{1}{I_n(\theta)}\]

As $n$, the sample size, goes to infinity. This is a result we used freely earlier, but remember that we made the assumption that $n$ was large. We can test this result in R:

# parameters
lambda <- 5
n <- 40

# generate the MLE 100 times and compare its variance to 1 / I_n(lambda) = lambda^2 / n
mle <- replicate(100, 1 / mean(rexp(n, lambda)))
var(mle); lambda ^ 2 / n

## [1] 0.7192224

## [1] 0.625

Asymptotic efficiency is a good thing. Remember that we had two basic principles of good estimators: low bias and low variance. Obviously, 0 bias is ideal, and asymptotic efficiency says that MLEs approach 0 bias. How about low variance? We’ll see soon that the inverse of Fisher’s Information, or what the variance of the MLE approaches, is actually very good.

4.4 Miscellaneous Estimation

The heavy lifting of this chapter is done; we’re simply going to define a few concepts that are important with Maximum Likelihood.

- Mean Square Error:

This is another measure of how good an estimator is. Remember that we defined bias as $E(\hat{\theta} - \theta)$, or how far off our estimator is from the true parameter on average. Mean Square Error takes this a little further:

\[MSE(\hat{\theta}) = E\big((\hat{\theta} - \theta)^2\big)\]

Where $MSE(\hat{\theta})$ is the notation for ‘Mean Squared Error of the Estimator $\hat{\theta}$.’

Why would this interest us? Well, if you did the algebra that we won’t do here and expanded it, you would actually get:

\[E\big((\hat{\theta} - \theta)^2\big) = Var(\hat{\theta}) + \big(E(\hat{\theta}) - \theta \big)^2\]

So, what’s neat about this is it includes the Variance of our estimator, $Var(\hat{\theta})$, and the bias: $\big(E(\hat{\theta}) - \theta \big)^2$ (yes, here it’s the bias squared). So, minimizing mean square error overall is valuable because it accounts for bias and variance of the estimator, two things that we want to be small.
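For readers who do want to see the algebra, here is a short sketch. The trick is to add and subtract $E(\hat{\theta})$ inside the square and expand; the cross term vanishes because $E\big(\hat{\theta} - E(\hat{\theta})\big) = 0$, while $E(\hat{\theta}) - \theta$ is a constant and can be pulled out of the expectation:

\[E\big((\hat{\theta} - \theta)^2\big) = E\Big(\big(\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta\big)^2\Big)\]

\[= E\Big(\big(\hat{\theta} - E(\hat{\theta})\big)^2\Big) + 2\big(E(\hat{\theta}) - \theta\big)E\big(\hat{\theta} - E(\hat{\theta})\big) + \big(E(\hat{\theta}) - \theta\big)^2\]

\[= Var(\hat{\theta}) + 0 + \big(E(\hat{\theta}) - \theta \big)^2\]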

- Efficiency of Two Estimators:

This is just the ratio of the Mean Square Errors of the two estimators. That is, say you have two estimators for $\theta$: they are $\hat{\theta}_A$ and $\hat{\theta}_B$. Maybe one is the MoM estimator, and another is the MLE. Whatever. The efficiency of the two estimators is given by:

\[eff(\hat{\theta}_A, \hat{\theta}_B) = \frac{MSE(\hat{\theta}_A)}{MSE(\hat{\theta}_B)}\]

Where $eff(\hat{\theta}_A, \hat{\theta}_B)$ is just notation for “relative efficiency of estimator $\hat{\theta}_A$ to estimator $\hat{\theta}_B$.”

Obviously, a better estimator has a lower Mean Square Error, so you can check whether the ratio is above or below 1 to see which estimator is better: a ratio below 1 favors $\hat{\theta}_A$, and a ratio above 1 favors $\hat{\theta}_B$.
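To make this concrete, here is a small simulation sketch. The setup is an assumption chosen for illustration and is not one of this chapter’s examples: we estimate the mean of a Normal distribution using either the sample mean (estimator A) or the sample median (estimator B), and approximate each Mean Square Error by averaging over many simulated samples.

```r
# illustrative setup (assumed): Normal data with true mean 0
set.seed(1)
theta <- 0
n <- 25
nsims <- 10000

# each column of this n-by-nsims matrix is one simulated sample
samples <- replicate(nsims, rnorm(n, mean = theta))

# two competing estimators computed on the same samples
est_mean <- colMeans(samples)
est_median <- apply(samples, 2, median)

# simulated Mean Square Errors
mse_mean <- mean((est_mean - theta)^2)
mse_median <- mean((est_median - theta)^2)

# relative efficiency of the sample mean to the sample median
eff <- mse_mean / mse_median
eff
```

A ratio below 1 here says the sample mean has the lower Mean Square Error for Normal data in this setup.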

- Cramer-Rao Lower Bound

What sounds like a business venture from a Seinfeld stalwart is a bound for the variance of an estimator. This says that, if $\hat{\theta}$ is an unbiased estimator (its expectation is the parameter it’s trying to estimate), then:

\[Var(\hat{\theta}) \geq \frac{1}{I_n(\theta)}\]

That is, the smallest a variance of an estimator could possibly be is the inverse of the Fisher Information for the parameter. If the Variance achieves this lower bound, that is, $Var(\hat{\theta}) = \frac{1}{I_n(\theta)}$, then the estimator is said to be efficient. Recall that we said the MLE is asymptotically efficient; that was because the Variance of the MLE approaches the inverse of the Fisher Information.

There’s a lot in this chapter, but that’s because it’s an important one. It’s really, really important to make sure you’re comfortable with this stuff; don’t be afraid to go through it a couple of times to understand where the MLE comes from, how to find it, how to use it in a Confidence Interval, etc.

4.5 Practice

Let $X_i \sim N(\mu,\sigma^2)$ be a sample of i.i.d. random variables from a population for $i = 1, ..., n$. Find the MLEs for $\mu$ and $\sigma^2$.

(For a practice problem involving Fisher’s Information, check out the next chapter)

Solution: This is a little bit of algebra, but it’s a good example to really hammer down MLE understanding. We can start by writing down the likelihood function, which is, recall, just the product of all the marginal PDFs:

\[lik(\mu,\sigma^2) = \prod_{i = 1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{\frac{-(x_i - \mu)^2}{2\sigma^2}} = \big(\frac{1}{\sqrt{2 \pi \sigma^2}}\big)^n e^{\sum_{i=1}^n \frac{-(x_i-\mu)^2}{2\sigma^2}}\]

We can take the log of this to get the log-likelihood function:

\[l(\mu,\sigma^2) = -\frac{n}{2}log(2\pi) - \frac{n}{2} log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\]

Remember your log rules here: $log(a^b) = b \cdot log(a)$, and $log(\frac{1}{a}) = -log(a)$, and $log(ab) = log(a) + log(b)$.

Now that we’ve got the log-likelihood function, we can (in theory) solve for the MLEs. How do we do that? Well, we differentiate the log likelihood function with respect to each of the two parameters, set these derivatives equal to 0, and solve the system of equations for our parameters $\mu$ and $\sigma^2$. Let’s first take our derivative with respect to $\mu$. We can notate $\frac{\partial l}{\partial \mu}$ as ‘the derivative of the log likelihood with respect to $\mu$.’

\[\frac{\partial l}{\partial \mu} = \frac{-1}{2\sigma^2} \big(-2 \sum_{i=1}^n x_i + 2n \mu\big)\]

And the derivative with respect to $\sigma^2$.

\[\frac{\partial l}{\partial \sigma^2} = \frac{-n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2\]

Now we can set these equal to 0 and solve. We actually don’t need any substitution for $\mu$, we can just set it equal to 0 and go:

\[\frac{-1}{2\sigma^2} \big(-2 \sum_{i=1}^n x_i + 2n \mu\big) = 0 \rightarrow \hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i\]

So, intuitively, the MLE estimate for $\mu$ is just the sample mean. That makes sense! Now, let’s plug in the sample mean for $\mu$ in the partial derivative with respect to $\sigma^2$.

\[\frac{-n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \frac{1}{n} \sum_{i=1}^n x_i)^2 = 0\]

\[\frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \frac{1}{n} \sum_{i=1}^n x_i)^2 = \frac{n}{2\sigma^2}\]

\[ \sum_{i=1}^n (x_i - \frac{1}{n} \sum_{i=1}^n x_i)^2 = n\sigma^2\]

\[ \frac{1}{n}\sum_{i=1}^n (x_i - \frac{1}{n} \sum_{i=1}^n x_i)^2 = \hat{\sigma^2}_{MLE}\]

That’s it! This is actually pretty intuitive; it’s the average of the squared differences for each value from the mean. So, we got that:

\[\hat{\mu}_{MLE} = \bar{X}, \; \hat{\sigma^2}_{MLE} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{X})^2\]
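We can check this result numerically. The sketch below simulates Normal data (the sample size, true mean, and true standard deviation are assumptions for illustration), computes the closed-form MLEs, and confirms that no point on a surrounding grid of $(\mu, \sigma^2)$ values has a higher log likelihood. The function name loglik_norm is hypothetical. It also uses the fact that R’s var() divides by $n - 1$, so the MLE for $\sigma^2$ is var(x) scaled by $\frac{n-1}{n}$.

```r
# illustrative data (assumed): 50 draws from a Normal with mean 3, sd 2
set.seed(1)
x <- rnorm(50, mean = 3, sd = 2)
n <- length(x)

# Normal log likelihood as a function of mu and sigma^2
loglik_norm <- function(mu, sigma2, x) {
  sum(dnorm(x, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

# closed-form MLEs from the derivation above
mu_hat <- mean(x)
sigma2_hat <- mean((x - mu_hat)^2)

# evaluate the log likelihood on a grid around the closed-form answer
grid <- expand.grid(mu = mu_hat + seq(-1, 1, by = 0.1),
                    sigma2 = sigma2_hat * seq(0.5, 1.5, by = 0.1))
grid$loglik <- mapply(loglik_norm, grid$mu, grid$sigma2,
                      MoreArgs = list(x = x))

# no grid point beats the closed-form MLEs
max(grid$loglik) <= loglik_norm(mu_hat, sigma2_hat, x)

# the MLE for sigma^2 is R's var() rescaled by (n - 1) / n
all.equal(sigma2_hat, (n - 1) / n * var(x))
```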