Chapter 5 Parameter Transformations
5.1 Motivation
We’ve estimated parameters (in a couple of different ways), but sometimes we don’t want to estimate the parameter itself; we want to estimate a function of a parameter (e.g., the parameter squared). This section walks through the methodology for this specific type of inference.
5.2 Transformations and Invariance
Often, we’re interested in making inferences on parameters; one example, which we may have seen before, is inferring the mean height of a group of college students. Generally, we can infer the average or variance by estimating the parameters of a distribution. For example, if the distribution was Normal, then the first parameter, \(\mu\), would be the mean, and the second parameter \(\sigma^2\) would be the variance. However, what if the heights follow an Exponential distribution - i.e., the underlying data follows \(Expo(\lambda)\)? Then if you estimated the parameter \(\lambda\), you wouldn’t be estimating the mean. Remember from probability (and earlier in this book) that the mean of an exponential is \(\frac{1}{\lambda}\), so if we estimated \(\lambda\), we would be estimating the inverse of the mean.
The point here is that estimating the parameter might not be quite what we wanted. What’s the intuitive way to go about this? Well, for the exponential case, let’s say you estimated \(\hat{\lambda} = \frac{1}{3}\). It makes sense to say, then, that you would estimate the mean of the distribution to be \(3\), since the mean is the inverse of the parameter.
Anyways, the idea is that we sometimes want to estimate a ‘transformed’ parameter.
Invariance of MLEs - So, we’re comfortable with the idea that we’re trying to estimate the function of a parameter instead of the parameter itself. Let’s consider this topic of Invariance now; specifically, Invariance with maximum likelihood estimators. Put formally, if you have some MLE \(\hat{\theta}_{MLE}\) for some parameter \(\theta\), and \(g(\theta)\) is a function of \(\theta\), then \(g(\hat{\theta}_{MLE})\) is the MLE of \(g(\theta)\).
Let’s take a quick step back and think about this. Consider the problem we mulled over in the intro of this part; we estimated a \(\lambda\) of \(\frac{1}{3}\) for an Exponential distribution, and then intuitively estimated that the mean of the distribution is the inverse of that, or 3. We just did this example for intuition’s sake, but this actually holds for MLEs. That’s right, you can just take a function of an MLE and it’s still an MLE for something else. In the case we just did, our MLE is \(\hat{\lambda}_{MLE} = \frac{1}{3}\) and our function is \(g(\lambda) = \frac{1}{\lambda}\), because we’re trying to get the mean of the Exponential and recall that the mean is given by \(\frac{1}{\lambda}\). So, we plug \(\hat{\lambda}_{MLE}\) into \(g\) to get \(g(\hat{\lambda}_{MLE}) = \frac{1}{\frac{1}{3}} = 3\).
Basically, invariance of the MLE just says that what you would intuitively do is correct. If you’re trying to estimate a function of a parameter, then that function of the MLE is a natural estimator. This is a pretty sweet principle (and is kind of similar in spirit to LOTUS, or the Law of the Unconscious Statistician, from probability), and actually doesn’t hold for all estimators. For example, Bayesian estimators such as the posterior mean don’t generally have the invariance property. You can’t just apply the function to a Bayesian estimator and expect it to work out.
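A quick simulation can make invariance concrete. The sketch below assumes a true rate of \(\lambda = \frac{1}{3}\), a sample size of 1000 and a fixed seed, all of which are illustrative choices rather than values from the text. It uses the standard result that the MLE of an Exponential rate is \(1/\bar{x}\).

```r
# Illustrative settings (assumptions, not values from the text)
set.seed(110)
lambda_true <- 1 / 3
n <- 1000

# Draw Exponential data with the assumed rate
x <- rexp(n, rate = lambda_true)

# MLE of the rate parameter: one over the sample mean
lambda_hat <- 1 / mean(x)

# By invariance, the MLE of the mean g(lambda) = 1 / lambda is g(lambda_hat)
mean_hat <- 1 / lambda_hat

lambda_hat  # should land near 1/3
mean_hat    # should land near 3
```

Note that `mean_hat` is just the sample mean again, which is what we would have guessed as an estimate of the mean in the first place.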
5.3 The Delta Method
We’ve talked about estimating functions of parameters, or, in general, \(g(\theta)\); the Delta Method gives a nifty way of working with it. Specifically, it says that if:
\[\hat{\theta} \rightarrow^{D} N(\theta,\frac{\sigma^2}{n})\]
Then for any \(g(\theta)\) where the derivative \(g^\prime(\theta)\) exists and is non-zero:
\[g(\hat{\theta}) \rightarrow^D N\Big(g(\theta), \frac{\sigma^2}{n} \big(g^\prime(\theta)\big)^2\Big)\]
Whoa, whoa whoa. Let’s break this down. First, let’s review the notation. Remember that \(\hat{\theta}\) is just an estimator for a parameter \(\theta\). The notation \(\rightarrow^D\) here means “converges in distribution to” as the sample size, \(n\) here, grows. So, both of the above converge in distribution to Normals with the given parameters. Recall that \(g^\prime(\theta)\) is just the derivative of \(g(\theta)\) with respect to \(\theta\). For example, if \(g(\theta) = 2\theta\), then \(g^\prime(\theta) = 2\).
Notation aside, what on earth is going on here? Let’s focus on the first part: \(\hat{\theta} \rightarrow^{D} N(\theta,\frac{\sigma^2}{n})\). This is basically saying that your estimator, \(\hat{\theta}\), is approximately Normal (once you have a big enough sample size \(n\)), where the mean is the true parameter \(\theta\) and the variance is \(\frac{\sigma^2}{n}\), which clearly decreases as \(n\), the sample size, grows. If this condition holds, then we get the second result: \(g(\hat{\theta}) \rightarrow^D N\Big(g(\theta), \frac{\sigma^2}{n} \big(g^\prime(\theta)\big)^2\Big)\). Right away, we know that \(g(\hat{\theta})\) is just a function of the estimator that we’ve found. We know that for MLEs, this function has the invariance property, in that it is the MLE of the same function of the parameter, \(g(\theta)\).
However, now we know that this function of the estimator also has this cool distribution; that is, it’s Normal with the mean of the function of the original parameter, and that nasty variance. What is that variance? Well, it’s the variance that the estimator converges to, \(\frac{\sigma^2}{n}\), multiplied by the derivative of \(g(\theta)\) squared (take the derivative and then square).
Why would this ever be useful? Well, what if we needed a confidence interval for a transformed parameter? That is, in the Exponential example where we estimate \(\lambda\), what if we wanted a confidence interval for the mean, which is of course a transformation of the parameter, or \(\frac{1}{\lambda}\)? Well, now that we know the Delta Method, we could find the distribution of this transformation (assuming it passes the original condition) and then pluck off the respective quantiles (for example, if we wanted a \(95\%\) confidence interval, we’d find the \(2.5^{th}\) percentile and the \(97.5^{th}\) percentile).
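As a sketch of how this could look for the Exponential example: the MLE \(\hat{\lambda} = 1/\bar{x}\) is approximately \(N(\lambda, \frac{\lambda^2}{n})\) (the variance is the inverse of Fisher’s information), and with \(g(\lambda) = \frac{1}{\lambda}\) we have \(g^\prime(\lambda) = -\frac{1}{\lambda^2}\). The Delta Method then gives an approximate variance of \(\frac{\lambda^2}{n} \cdot \frac{1}{\lambda^4} = \frac{1}{n\lambda^2}\) for \(\frac{1}{\hat{\lambda}}\). The code below plugs in \(\hat{\lambda}\) for the unknown \(\lambda\); the true rate, sample size and seed are assumed purely for illustration.

```r
# Illustrative settings (assumptions, not values from the text)
set.seed(110)
lambda_true <- 1 / 3
n <- 500
x <- rexp(n, rate = lambda_true)

# MLE of the rate and, by invariance, of the mean 1 / lambda
lambda_hat <- 1 / mean(x)
g_hat <- 1 / lambda_hat

# Plug-in asymptotic variance of lambda_hat: lambda^2 / n
var_lambda_hat <- lambda_hat^2 / n

# Derivative of g(lambda) = 1 / lambda, evaluated at the estimate
g_prime <- -1 / lambda_hat^2

# Delta Method standard error of g(lambda_hat)
se_g <- sqrt(var_lambda_hat * g_prime^2)

# Approximate 95% interval from the 2.5th and 97.5th Normal percentiles
ci <- g_hat + qnorm(c(0.025, 0.975)) * se_g
ci
```

In this particular case the standard error simplifies to \(\bar{x}/\sqrt{n}\), so the interval is the sample mean plus or minus about 1.96 of those.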
5.4 Practice
Let \(P\) be a sequence of arrivals in continuous time that follows a Poisson process with rate \(\lambda\).
- Let \(X\) be the number of arrivals in the process \(P\) from time 0 to \(t\). What is the distribution of \(X\)?
Solution: By the definition of a Poisson Process, \(X \sim Pois(\lambda t)\). If you are not familiar with Poisson Processes, please refer to this book.
- We run the process \(n\) times and observe \(X_1, X_2, \ldots, X_n\), where \(X_i\) is the number of arrivals from 0 to \(t\) on the \(i^{th}\) simulation. What is the MLE for the parameter of \(X\)?
Solution: We can find the MLE in the usual way. For notation’s sake, let \(\lambda t = \theta\). We can find the likelihood:
\[f(X | \theta) = \prod_{i=1}^n \frac{\theta^{x_i} e^{-\theta}}{x_i!} = e^{-n\theta} \frac{\theta^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!}\]
Taking the log yields:
\[l(\theta) = -n\theta + \Big(\sum_{i=1}^n x_i\Big)\log(\theta) - \log\Big(\prod_{i=1}^n x_i!\Big)\]
Differentiate and set equal to 0:
\[l^{\prime}(\theta) = -n + \frac{\sum_{i=1}^n x_i}{\theta} = 0\]
\[\hat{\theta}_{MLE} = \frac{\sum_{i=1}^n x_i}{n} = \bar{X}\]
So the sample mean is the MLE. This is intuitive because, if \(Y \sim Pois(\lambda)\), then \(E(Y) = \lambda\).
Let’s see if our MLE gets us close to the original parameter in R:
## [1] 5.13
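The code behind this output is not shown here. A minimal sketch of such a check, assuming \(\lambda = 5\), \(t = 1\) and \(n = 100\) runs (hypothetical settings, so the printed value will generally differ from the one above), might look like:

```r
# Hypothetical settings; the original simulation settings are not shown
set.seed(110)
lambda <- 5
t_end <- 1
n <- 100

# Each run yields the number of arrivals in [0, t_end], a Pois(lambda * t_end) count
x <- rpois(n, lambda * t_end)

# The MLE of theta = lambda * t_end is the sample mean
theta_hat <- mean(x)
theta_hat
```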
- Suppose that there are two types of arrivals; let \(Y\) be the first type and \(Z\) be the second type. For any arrival, there is a probability \(p\) that it is a \(Y\) arrival and a probability \(1-p\) that it is a \(Z\) arrival. You run the process \(n\) times and observe \(Y_1, \ldots, Y_n\) and \(Z_1, \ldots, Z_n\), where again \(Y_i\) is the number of \(Y\) type arrivals on the \(i^{th}\) simulation. What are the MLEs for the parameters of \(Y\) and \(Z\)?
Solution: By the properties of a Poisson, we have that \(Y \sim Pois(p\lambda t)\) and, independently, \(Z \sim Pois((1-p)\lambda t)\). We found the MLE for a Poisson in the previous part: just the sample mean. Let \(p\lambda t = \theta_Y\) and \((1-p)\lambda t = \theta_Z\), so that we have:
\[\hat{\theta_Y}_{MLE} = \bar{Y}\] \[\hat{\theta_Z}_{MLE} = \bar{Z}\]
You also know that \(Y_i + Z_i = X_i\) for any simulation, so you could write:
\[\bar{Z} = \frac{\sum_{i=1}^n (x_i - y_i)}{n} = \bar{X} - \bar{Y}\]
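To see this splitting in action, here is a small simulation sketch. The values of \(\lambda\), \(t\), \(p\) and \(n\) are assumptions chosen for illustration; each arrival is labeled type \(Y\) with probability \(p\), which is done here by drawing a Binomial count out of each simulated total.

```r
# Hypothetical settings (assumptions, not values from the text)
set.seed(110)
lambda <- 5
t_end <- 2
p <- 0.3
n <- 2000

# Total arrivals per run, then split each total into Y-type and Z-type arrivals
x <- rpois(n, lambda * t_end)
y <- rbinom(n, size = x, prob = p)
z <- x - y

# MLEs are the sample means
theta_y_hat <- mean(y)  # target: p * lambda * t_end = 3
theta_z_hat <- mean(z)  # target: (1 - p) * lambda * t_end = 7

c(theta_y_hat, theta_z_hat)
```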
- Condition on \(X = x\); that is, there are \(x\) arrivals between time 0 and time \(t\). What are the marginal distributions of \(Y\) and \(Z\)?
Solution: Now we have a set number of trials, with a fixed probability \(p\) or \(1-p\) of success on each trial. This is the story of a Binomial. We know, then, that \(Y \sim Bin(x,p)\) and \(Z \sim Bin(x,1-p)\).
- Continue to condition on \(X = x\). Provide an uninformative prior for the unknown parameter for \(Y\) in part (d).
Solution: The unknown parameter is a probability, so a natural uninformative prior is \(p \sim Unif(0,1)\).
- Continue to condition on \(X = x\). You learn from a similar process that generally 25\(\%\) of the arrivals were type \(Y\). Use this knowledge to adjust your prior from (e) and form a scientifically sound prior for the unknown parameter for \(Y\) in part (d).
Solution: The question asks us to still go with a Uniform, but now we want to center it around a better value. We know the mean of \(Y\) is \(xp\) - we’re conditioning on \(x\) so it’s known - but we also know that the proportion of \(Y\) arrivals was .25. Therefore, we should center our Uniform around .25. A reasonable prior would be \(Unif(0,.5)\), which has mean \(.25\) (but there are other possible answers here, the key is that we center around \(.25\)).
- Continue to condition on \(X = x\). Assign a \(Beta(\alpha,\beta)\) prior distribution to the unknown parameter for \(Y\) in part (d). If we sample \(Y_1,...Y_n\), what is the posterior mean estimate for that parameter?
Solution: We’re essentially saying \(p \sim Beta(\alpha,\beta)\). This is a classic example of the Beta-Binomial conjugacy. We know the prior PDF is just the PDF of a \(Beta(\alpha,\beta)\), and the likelihood is just the likelihood of a Binomial, so we can write down from what we have seen:
\[f(p|Y) \propto f(p)f(Y|p) \]
\[\propto p^{\alpha - 1} (1 - p)^{\beta - 1} \prod_{i=1}^n {x \choose y_i} p^{y_i} (1 - p)^{x - y_i}\]
\[\propto p^{\alpha - 1} (1 - p)^{\beta - 1}p^{\sum_{i=1}^n y_i} (1 - p)^{\sum_{i=1}^n (x - y_i)} = p^{\alpha + (\sum_{i=1}^n y_i) - 1} (1-p)^{\beta + nx - (\sum_{i=1}^n y_i) - 1}\]
Which is proportional to the PDF of a \(Beta(\alpha + \sum_{i=1}^n y_i, \beta + nx - \sum_{i=1}^n y_i)\). So the posterior mean estimate is just the mean of this distribution. Recalling that the mean of a \(Beta(a, b)\) is \(\frac{a}{a+b}\), we get:
\[\frac{\alpha + \sum_{i=1}^n y_i}{\alpha + \beta + nx}\]
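A short numerical sketch of this update, using the fact that a \(Beta(a, b)\) has mean \(\frac{a}{a+b}\). The prior parameters, the value of \(x\), the true \(p\) and the sample size below are all hypothetical choices made for illustration.

```r
# Hypothetical settings (assumptions, not values from the text)
set.seed(110)
alpha_prior <- 2
beta_prior <- 6   # prior mean 2 / (2 + 6) = 0.25
x <- 10           # arrivals conditioned on in each run
p_true <- 0.25
n <- 50

# Observed Y counts, each Binomial(x, p)
y <- rbinom(n, size = x, prob = p_true)

# Conjugate update: add successes to alpha and failures to beta
alpha_post <- alpha_prior + sum(y)
beta_post <- beta_prior + n * x - sum(y)

# Posterior mean of p
post_mean <- alpha_post / (alpha_post + beta_post)

# Sanity check against draws from the posterior
draws <- rbeta(1e5, alpha_post, beta_post)
c(post_mean, mean(draws))
```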
- Imagine now that \(X\) is modeling the arrival of customers at a service center in the period 0 to \(t\). The company has a fixed cost of 100 of operating in this period, plus a variable cost equal to the number of customers that arrive squared. Let \(C\) be the total cost from 0 to \(t\). Imagine that we sampled from the process \(n\) times; find the MLE of \(C\).
Solution: First, let’s write the cost function in terms of our arrivals. We have a fixed cost of 100, and the number of customers squared added on, so we get:
\[C = 100 + X^2\]
By the invariance of the MLE, we just need to apply the same function to the MLE for \(\theta = \lambda t\), the parameter of \(X\), which we found earlier to be \(\bar{X}\).
\[\hat{C}_{MLE} = 100 + \bar{X}^2\]
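In code, this plug-in estimate is a one-liner once we have the sample mean. The sketch below assumes \(\lambda = 5\), \(t = 1\) and \(n = 100\), which are hypothetical values used only to generate some data.

```r
# Hypothetical settings (assumptions, not values from the text)
set.seed(110)
lambda <- 5
t_end <- 1
n <- 100

# Simulated arrival counts and the MLE of theta = lambda * t_end
x <- rpois(n, lambda * t_end)
theta_hat <- mean(x)

# Invariance: apply the cost function to the MLE
cost_hat <- 100 + theta_hat^2
cost_hat  # should land near 100 + 5^2 = 125
```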
- What’s the asymptotic distribution of the MLE for \(X\)?
Solution: We know by the asymptotic distribution of MLEs that this will be Normal with mean of the true parameter and variance of the inverse of Fisher’s information, so let’s find Fisher’s information. We already took one derivative of the log likelihood and got:
\[l^{\prime}(\theta) = -n + \frac{\sum_{i=1}^n x_i}{\theta} = 0\]
Differentiating again yields:
\[l^{\prime \prime}(\theta) = -\frac{\sum_{i=1}^n x_i}{\theta^2}\]
And now we take the negative expected value. Recall that \(\theta = \lambda t\), and each \(X_i\) is \(Pois(\lambda t)\).
\[-E(-\frac{\sum_{i=1}^n x_i}{\theta^2}) = \frac{n\lambda t}{(\lambda t)^2} = \frac{n}{\lambda t}\]
So we take the inverse to get the asymptotic variance:
\[\hat{\theta}_{MLE} \rightarrow^D N(\lambda t, \frac{\lambda t}{n})\]
Which looks pretty reasonable: the mean and variance of a \(Pois(\lambda t)\) are both \(\lambda t\), so the sample mean of \(n\) draws has mean \(\lambda t\) and variance \(\frac{\lambda t}{n}\).
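We can check this variance by simulation, and also tie it back to the Delta Method. The sketch below repeatedly simulates the MLE and compares its empirical variance to \(\frac{\lambda t}{n}\). It then does the same for the plug-in cost \(100 + \hat{\theta}^2\), where \(g(\theta) = 100 + \theta^2\) has \(g^\prime(\theta) = 2\theta\), so the Delta Method variance is \(\frac{\lambda t}{n}(2\lambda t)^2\). All numerical settings are assumptions chosen for illustration.

```r
# Hypothetical settings (assumptions, not values from the text)
set.seed(110)
lambda <- 5
t_end <- 1
n <- 200
reps <- 5000
theta <- lambda * t_end

# Each replication: n Poisson counts, then the MLE (the sample mean)
theta_hats <- replicate(reps, mean(rpois(n, theta)))

# Empirical variance of the MLE versus the asymptotic variance theta / n
var_mle_empirical <- var(theta_hats)
var_mle_asymptotic <- theta / n

# Delta Method for g(theta) = 100 + theta^2, with g'(theta) = 2 * theta
cost_hats <- 100 + theta_hats^2
var_cost_empirical <- var(cost_hats)
var_cost_delta <- (theta / n) * (2 * theta)^2

c(var_mle_empirical, var_mle_asymptotic)
c(var_cost_empirical, var_cost_delta)
```

The two numbers in each pair should be close, though not identical, since the asymptotic results are approximations for finite \(n\).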