Chapter 3 Method of Moments

3.1 Motivation

We started working with basic estimators for parameters in Chapter 1 (sample mean, sample parameter). Method of Moments, or MoM for short, provides the first type of ‘Inference’ estimators that we will look at in this course. While these aren’t used often in practice because of their relative simplicity, they are a good tool to introduce more intricate estimation theory.

3.2 Method of Moments

Let’s discuss a new type of estimator. This is the first ‘new’ estimator learned in Inference, and, like a lot of the concepts in the book, really relies on a solid understanding of the jargon from the first chapter to nail down.

Recall from probability theory hat the moments of a distribution are given by:

\[\mu^k = E(X^k)\]

Where \(\mu^k\) is just our notation for the \(k^{th}\) moment. So, the first moment, or \(\mu\), is just \(E(X)\), as we know, and the second moment, or \(\mu^2\), is \(E(X^2)\). Recall that we could make use of MGFs (moment generating functions) to summarize these moments; don’t worry, we won’t really deal with MGFs much here.

Instead, we are all about inference here; when we see the \(k^{th}\) moment of a distribution, we should think “how could I estimate that?” For example, \(\mu^3\) is just the average value of an individual observation cubed, just like \(\mu\) is the average value of an individual observation. How would we then estimate these values?

In inference, we’re going to use something called sample moments. The \(k^{th}\) sample moment is defined as:

\[\hat{\mu}_k = \frac{1}{n} \sum_{i=1}^n X^k_i\]

Where \(\hat{\mu}_k\) is the \(k^{th}\) sample moment (remember, we put a hat on things when we mean they are estimating something else. Here, \(\hat{\mu}_k\) is estimating the same thing without a hat: \(\mu_k\), or the \(k^{th}\) moment).

Let’s break this down. This is basically saying that if we want \(\mu_k\), or \(E(X^k)\) (they are the same thing), just take a sample of \(n\) people, raise each of their values to the \(k\), add them up and divide by the number of individuals in the sample (\(n\)). If you wanted to estimate the fourth moment for the weight of college males, you would take a sample of some college males, raise each of their weights to the power of 4 and divide by the number of people you sampled.

Does this make sense? Well, consider the case where \(k = 1\), or \(\mu\). This is, of course, just the mean of a distribution. The sample moment for this first moment is given by:

\[\frac{1}{n} \sum_{i=1}^n X_i\]

Which we got by just plugging in \(k=1\) to the above formula for sample moments. Hey-o, that’s the sample mean, or what we’ve long-established is the natural estimator for the true mean! You can see how this builds, then, as we get to higher and higher moments.

Anyways, the takeaway here is that we use sample moments to estimate the actual moments of a distribution. That’s great, and we would be finished if we were asking you to estimate moments of a distribution. However, you’re rarely asked to estimate actual moments; instead, as you’ve seen, you’re generally asked to estimate parameters of a distribution.

So, we know that we can estimate moments with sample moments, and we know that we want to estimate parameters. How can we use these two facts to get what we want; a solid estimate for the parameters? Well, if we can write the parameters of a distribution in terms of that distribution’s moments, and then simply estimate those moments in terms of the sample moments, then we have created an estimator for the parameter in terms of the sample moment.

Whoa…that’s a little crazy, and probably too much of a mouthful right now. Let’s try and learn this with a solid example of the most famous statistical distribution: the Normal.

If we’re doing estimation for a Normal, that means that we believe the underlying model for some real world data is Normal. For example, we might believe that eyelash length for men in Massachusetts is normally distributed. If we want to carry out inference, we have to estimate the parameters; here, the parameters of a Normal distribution are the mean and the variance. So, for this inferential exercise, we have to estimate the mean and the variance.

We already know, from what we learned earlier, that we have natural estimates for the moments of the Normal distribution. What we’re going to do, then, is try and write the moments of a Normal in terms of the parameters (the mean and variance). We only need to write out the first two moments, \(E(X)\) and \(E(X^2)\), since we have to parameters (in general, if you have \(k\) parameters that you want to estimate, you need to write out \(k\) moments).

Let’s go ahead and do that. How do we write \(E(X)\) in terms of \(\mu\) and \(\sigma^2\)? Well, you know quite well that \(E(X)\) is just \(\mu\), since they are both the mean for a Normal distribution. What about writing \(E(X^2)\) in terms of \(\mu\) and \(\sigma\)? Well, this takes a little bit more cleverness. Recall that \(Var(X) = E(X^2) - E(X)^2\). Re-writing this yields \(Var(X) + E(X)^2 = E(X^2)\). We know that, for a Normal distribution, \(Var(X) = \sigma^2\), and \(E(X)^2 = \mu^2\). So, we can write \(E(X^2) = \mu^2 + \sigma^2\), and this yields the system of equations:

\[\mu_1 = \mu\] \[\mu_2 = \sigma^2 + \mu^2\]

Where \(\mu_1\) is notation for the first moment, \(\mu_2\) notation for the second moment, etc.

Well now, we’ve written our moments in terms of the parameters that we’re trying to estimate. We know that we have good estimators (the sample moments) for our moments \(\mu_1\) and \(\mu_2\), so let’s try and solve this system of equations for the parameters in terms of the moments. Well, we now that \(\mu = \mu_1\), so we can plug in \(\mu_1\) for \(\mu\) in the second equation and then solve for \(\sigma^2\). We get that:

\[\mu_2 = \sigma^2 + \mu^2 \rightarrow \sigma^2 = \mu_2 - \mu_1^2\]

So now our two equations for the parameters in terms of the moments are:

\[\mu = \mu_1\] \[\sigma^2 = \mu_2 - \mu_1^2\]

That is, the first parameter, the mean \(\mu\), is equal to the first moment of the distribution, and the second parameter, the variance \(\sigma^2\), is equal to the second moment of the distribution minus the first moment of the distribution squared.

Why did we go through all of that work? Well, recall the ultimate goal of all of this: to estimate the parameters of a distribution. Recall also that we know how to estimate the moments of a distribution; with the sample moments! That is, a good estimate for \(\mu_k\) is \(\frac{1}{n} \sum_{i=1}^n X_i^k\). So, now that we know the parameters in terms of the moments, estimating the parameters is the same as estimating the moments. We can plug in our estimates for the moments and get good estimates for the parameters \(\mu\) and \(\sigma^2\)!

So, the sample moment for \(\mu_1\), by formula, is just \(\frac{1}{n} \sum_{i=1}^n X_i\), and the sample moment for \(\mu_2\) is, again by formula, \(\frac{1}{n} \sum_{i=1}^n X_i^2\). Plugging these in for \(\mu_1\) and \(\mu_2\) yields:

\[\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i\] \[\hat{\sigma^2} = \frac{1}{n} \sum_{i=1}^n X_i^2 - \big(\frac{1}{n} \sum_{i=1}^n X_i\big)^2\]

Where \(\hat{\mu}\) and \(\hat{\sigma^2}\) are just estimates for the mean and variance, respectively (remember, we put hats on things to indicate that it’s an estimator). We can test this in R by generating data from, say, a \(N(5, 9)\) distribution and using the above MoM estimators to see if we can get close to the original parameters (mean of \(5\) and variance of \(9\)).

# replicate
set.seed(0)
n <- 20
mu <- 5
sigma <- 3

# generate
samples <- replicate(rnorm(n, mu, sigma), n = 100)

# calculate estimates
sample_means <- apply(samples, 2, mean)
sample_var <- apply(samples, 2, function(x) {
  return((1 / n) * sum(x ^ 2) - sum(x / n) ^ 2)
})

# check if estimates are close:
mean(sample_means); mean(sample_var)

## [1] 4.939076

## [1] 8.958363

Looks like we got back to the original parameters.

This is magical! We’ve just written great estimators for our parameters using just data that we’ve sampled. Hopefully you followed what we did here…if not, here’s a checklist that summarizes the process:

Write the moments of the distribution in terms of the parameters (if you have \(k\) parameters, you will have to write out \(k\) moments).
Solve for the parameters in terms of the moments.
Plug in the sample moments for the moments.

Go back and make sure you can follow these steps for what we did here with the Normal. A quick caveat: you may have noticed that we could immediately written the second parameter, \(\sigma^2\), in terms of the first and second moments because we know \(Var(X) = E(X^2) - E(X)^2\). Yes, we did do an extra step here by first writing it backwards and then solving it, but that extra step will come in handy in more advanced situations, so do be sure to follow it in general. Also, although that estimator the second parameter looks ugly, it simplifies nicely to \(\big(\frac{n-1}{n}\big)s^2\), where \(s^2\) is the sample variance. If you want to try and show this on your own, recall that the sample standard deviation is given by:

\[s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\]

So, why do we like MoM estimators? Well, generally, they’re pretty easy to find. They’re also generally consistent, a term which means that the estimator converges to the true value of the parameter as the sample gets larger and larger (this is the qualitative definition only). In short, they are easy and pretty good estimators!

If the method of moments still isn’t clicking, here is a video review of these estimators:

Click here to watch this video in your browser

And, finally, it might be helpful to try this calculation with a couple of other distributions…

3.3 Practice

Let \(X \sim Gamma(a, \lambda)\). Find the MOM estimators for \(a\) and \(\lambda\).

Solution: This is a classic MoM question. We have two parameters, so we need two moments. We write:

\[\mu_1 = \frac{a}{\lambda}\] \[\mu_2 = \frac{a}{\lambda^2} + \frac{a^2}{\lambda^2}\]

And we solve for \(a\) and \(\lambda\) in terms of \(\mu_1\) and \(\mu_2\). We know that \(a = \lambda \mu_1\) from the first equation, so we can plug this in to the second equation:

\[\mu_2 = \frac{\lambda \mu_1}{\lambda^2} + \frac{\lambda^2 \mu_1^2}{\lambda^2}\]

\[\mu_2 =\frac{\mu_1}{\lambda} + \mu_1^2\] \[\mu_2 - \mu_1^2 = \frac{\mu_1}{\lambda}\] \[\lambda = \frac{\mu_1}{\mu_2 - \mu_1^2}\]

And plugging this in to \(a = \lambda \mu_1\), we get:

\[a = \frac{\mu_1^2}{\mu_2 - \mu_1^2}\]

We plug in the sample moments to estimate \(a\) and \(\lambda\). Our estimator for \(\mu_1\) is \(\hat{\mu_1} = \frac{1}{n} \sum_{i=1}^n X_i\), and our estimator for \(\mu_2\) is \(\hat{\mu_2} = \frac{1}{n} \sum_{i=1}^n X_i^2\). So:

\[\hat{a}_{MOM}= \frac{\hat{\mu}_1^2}{\hat{\mu}_2 - \hat{\mu}_1^2}\]

\[\hat{\lambda}_{MOM} = \frac{\hat{\mu}_1}{\hat{\mu}_2 - \hat{\mu}_1^2}\]

We can see how our estimates do by running some simple R code for a \(Gamma(5, 7)\) distribution.

# replicate
set.seed(0)
x <- rgamma(1000, 5, 7)

# calculate MoM estimates
mu_1 <- mean(x)
mu_2 <- mean(x ^ 2)

mu_1 ^ 2 / (mu_2 - mu_1 ^ 2)

## [1] 4.885506

mu_1 / (mu_2 - mu_1 ^ 2)

## [1] 7.01947

It looks like our MoM estimators get close to the original parameters of \(5\) and \(7\).