Chapter 6 Parameter Transformations






6.1 Motivation


We’ve estimated parameters (in a couple of different ways), but sometimes we don’t want to estimate the parameter itself; we want to estimate a function of the parameter (e.g., the parameter squared). This section walks through methodology for this specific type of inference.




6.2 Transformations and Invariance


Often, we’re interested in making inferences on parameters; one example, which we may have seen before, is inferring the mean height of a group of college students. Generally, we can infer the average or variance by estimating the parameters of a distribution. For example, if the distribution were Normal, then the first parameter, \mu, would be the mean, and the second parameter, \sigma^2, would be the variance. However, what if the heights follow an Exponential distribution - i.e., the underlying data follows Expo(\lambda)? Then if you estimated the parameter \lambda, you wouldn’t be estimating the mean. Remember from probability (and earlier in this book) that the mean of an Exponential is \frac{1}{\lambda}, so if we estimated \lambda, we would be estimating the inverse of the mean.


The point here is that estimating the parameter might not be quite what we wanted. What’s the intuitive way to go about this? Well, for the Exponential case, let’s say you estimated \lambda = \frac{1}{3}. It makes sense to say, then, that you would estimate the mean of the distribution to be 3, since the mean is the inverse of the parameter.

Anyways, the idea is that sometimes we want to estimate a ‘transformed’ parameter.




Invariance of MLEs - So, we’re comfortable with the idea that we’re trying to estimate a function of a parameter instead of the parameter itself. Let’s consider the topic of invariance now; specifically, invariance with maximum likelihood estimators. Put formally, if you have the MLE \hat{\theta}_{MLE} for some parameter \theta, and g(\theta) is a function of \theta, then g(\hat{\theta}_{MLE}) is the MLE of g(\theta).


Let’s take a quick step back and think about this. Consider the problem we mulled over in the intro of this part; we estimated a \lambda of \frac{1}{3} for an Exponential distribution, and then intuitively estimated that the mean of the distribution is the inverse of that, or 3. We just did this example for intuition’s sake, but this actually holds for MLEs. That’s right, you can take a function of an MLE and it’s the MLE of something else. In the case we just did, our MLE is \frac{1}{3}, or \hat{\lambda}_{MLE} = \frac{1}{3}, and our function is g(\lambda) = \frac{1}{\lambda}, because we’re trying to get the mean of the Exponential and recall that the mean is given by \frac{1}{\lambda}. So, we plug our \hat{\lambda}_{MLE} into our function g to get \frac{1}{1/3} = 3.
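To see this in R, here is a minimal sketch (the true rate of 1/3 and the sample size are made-up values for illustration):

# simulate Exponential data with a true rate of lambda = 1/3 (so the true mean is 3)
set.seed(0)
x <- rexp(10000, rate = 1/3)

# the MLE of lambda for an Exponential is 1 / (sample mean)
lambda_mle <- 1 / mean(x)
lambda_mle    # should be close to 1/3

# by invariance, the MLE of the mean g(lambda) = 1 / lambda is just 1 / lambda_mle
1 / lambda_mle    # should be close to 3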


Basically, invariance of the MLE just says that what you would intuitively do is correct. If you’re trying to estimate a function of a parameter, then that function of the MLE is a natural estimator. This is a pretty sweet principle (and is kind of similar in spirit to LOTUS, the Law of the Unconscious Statistician, from probability), and it actually doesn’t hold for all estimators. For example, Bayesian estimators don’t have the invariance property: you can’t just apply the function to a Bayesian estimator and expect it to work out.




6.3 The Delta Method


We’ve talked about estimating functions of parameters, or, in general, g(\theta); the Delta Method gives a nifty way of working with them. Specifically, it says that if:

\hat{\theta} \rightarrow^D N(\theta, \frac{\sigma^2}{n})

Then for any g(\theta) where the derivative g^{\prime}(\theta) exists and is non-zero:

g(\hat{\theta}) \rightarrow^D N(g(\theta), \frac{\sigma^2}{n} (g^{\prime}(\theta))^2)

Whoa, whoa, whoa. Let’s break this down. First, let’s review the notation. Remember that \hat{\theta} is just an estimator for a parameter \theta. The notation \rightarrow^D here means “converges in distribution to” as the sample size, n here, grows. So, both of the above converge in distribution to Normals with the given parameters. Recall that g^{\prime}(\theta) is just the derivative of g(\theta) with respect to \theta. For example, if g(\theta) = 2\theta, then g^{\prime}(\theta) = 2.


Notation aside, what on earth is going on here? Let’s focus on the first part: \hat{\theta} \rightarrow^D N(\theta, \frac{\sigma^2}{n}). This is basically saying that your estimator, \hat{\theta}, approaches a Normal distribution (as you get a big enough sample size n) where the mean is the true parameter \theta and the variance is \frac{\sigma^2}{n}, which clearly decreases as n, the sample size, grows. If this condition holds, then we get the second result: g(\hat{\theta}) \rightarrow^D N(g(\theta), \frac{\sigma^2}{n} (g^{\prime}(\theta))^2). Right away, we know that g(\hat{\theta}) is just a function of the estimator that we’ve found. We know that for MLEs, this function has the invariance property, in that it is the MLE of the same function of the parameter, g(\theta).
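One heuristic way to see where this comes from (a sketch, not a formal proof) is a first-order Taylor expansion of g around the true parameter \theta:

g(\hat{\theta}) \approx g(\theta) + g^{\prime}(\theta)(\hat{\theta} - \theta)

Since \hat{\theta} - \theta is approximately Normal with mean 0 and variance \frac{\sigma^2}{n}, multiplying by the constant g^{\prime}(\theta) keeps things Normal, leaves the mean at g(\theta), and scales the variance by (g^{\prime}(\theta))^2, which is exactly the statement above.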


However, now we know that this function of the estimator also has this cool distribution; that is, it’s Normal with mean equal to the function of the original parameter, and that nasty variance. What is that variance? Well, it’s the variance that the estimator converges to, \frac{\sigma^2}{n}, multiplied by the square of the derivative g^{\prime}(\theta) (take the derivative and then square it).


Why would this ever be useful? Well, what if we needed a confidence interval for a transformed parameter? That is, in the Exponential example where we estimate \lambda, what if we wanted a confidence interval for the mean, which is of course a transformation of the parameter, or \frac{1}{\lambda}? Well, now that we know the Delta Method, we could find the distribution of this transformation (assuming it passes the original condition) and then pluck off the respective quantiles (for example, if we wanted a 95% confidence interval, we’d find the 2.5th percentile and the 97.5th percentile).
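To make that concrete, here is a minimal R sketch of a Delta Method confidence interval for the Exponential mean (the true rate of 1/3 and the sample size n = 500 are made-up values; recall that for an Exponential the MLE of \lambda is 1 / \bar{X}, with asymptotic variance \lambda^2 / n):

# simulate Exponential data with a made-up true rate of lambda = 1/3 (true mean = 3)
set.seed(0)
n <- 500
x <- rexp(n, rate = 1/3)

# MLE of lambda and its (estimated) asymptotic variance, lambda^2 / n
lambda_mle <- 1 / mean(x)
var_lambda <- lambda_mle^2 / n

# Delta Method for g(lambda) = 1 / lambda, so g'(lambda) = -1 / lambda^2
mean_mle <- 1 / lambda_mle
var_mean <- var_lambda * (1 / lambda_mle^2)^2

# 95% confidence interval: pluck off the 2.5th and 97.5th Normal quantiles
mean_mle + qnorm(c(0.025, 0.975)) * sqrt(var_mean)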




6.4 Practice



Let P be a sequence of arrivals in continuous time that follows a Poisson process with rate λ.



  a. Let X be the number of arrivals in the process P from time 0 to t. What is the distribution of X?


Solution: By the definition of a Poisson Process, X \sim Pois(\lambda t). If you are not familiar with Poisson Processes, please refer to this book.



  b. We run the process a couple of times and observe X_1, X_2, ..., X_n, where X_i is the number of arrivals from 0 to t on the ith simulation. What is the MLE for the parameter of X?


Solution: We can find the MLE in the usual way. For notation’s sake, let \lambda t = \theta. We can find the likelihood:

f(X|\theta) = \prod_{i=1}^n \frac{\theta^{x_i} e^{-\theta}}{x_i!} = \frac{e^{-n\theta} \theta^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!}

Taking the log yields:

l(\theta) = -n\theta + (\sum_{i=1}^n x_i) log(\theta) - log(\prod_{i=1}^n x_i!)

Differentiate and set equal to 0:

l^{\prime}(\theta) = -n + \frac{\sum_{i=1}^n x_i}{\theta} = 0

\hat{\theta}_{MLE} = \frac{\sum_{i=1}^n x_i}{n} = \bar{X}

So the sample mean is the MLE; this is intuitive because, if Y \sim Pois(\lambda), then E(Y) = \lambda.

Let’s see if our MLE gets us close to the original parameter in R:

# set the seed for reproducibility
set.seed(0)

# simulate 100 draws from a Pois(5) distribution, then take the sample mean (the MLE)
x <- rpois(100, 5)
mean(x)
## [1] 5.13



  c. Suppose that there are two types of arrivals; let Y be the first type and Z be the second type. For any arrival, there is probability p that it is a Y arrival and probability 1 - p that it is a Z arrival. You run the process n times and observe Y_1, ..., Y_n and Z_1, ..., Z_n, where again Y_i is the number of Y-type arrivals on the ith simulation. What are the MLEs for the parameters of Y and Z?


Solution: By the properties of a Poisson process, we have that Y \sim Pois(p\lambda t) and, independently, Z \sim Pois((1-p)\lambda t). We found the MLE for a Poisson parameter in the previous part: it’s just the sample mean. Let p\lambda t = \theta_Y and (1-p)\lambda t = \theta_Z, so that we have:

\hat{\theta}_{Y, MLE} = \bar{Y}, \quad \hat{\theta}_{Z, MLE} = \bar{Z}

You also know that Y_i + Z_i = X_i for any simulation, so you could write:

\bar{Z} = \frac{\sum_{i=1}^n (x_i - y_i)}{n}
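As a quick check in R, here is a minimal sketch (with made-up values \lambda t = 5 and p = 0.25) that simulates the split arrivals and computes the two sample means:

# made-up values: lambda * t = 5 expected total arrivals, p = 0.25 chance of a type Y arrival
set.seed(0)
n <- 1000
x <- rpois(n, 5)                        # total arrivals on each simulation
y <- rbinom(n, size = x, prob = 0.25)   # each arrival is independently type Y with probability 0.25
z <- x - y                              # the remaining arrivals are type Z

mean(y)  # MLE of p * lambda * t, should be near 1.25
mean(z)  # MLE of (1 - p) * lambda * t, should be near 3.75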



  d. Condition on X = x; that is, there are x arrivals between time 0 and time t. What are the marginal distributions of Y and Z?


Solution: Now we have a set number of trials, with a fixed probability p or 1 - p of success on each trial. This is the story of a Binomial. We know, then, that Y \sim Bin(x, p) and Z \sim Bin(x, 1 - p).


  e. Continue to condition on X = x. Provide an uninformative prior for the unknown parameter for Y in part (d).


Solution: It’s a probability, so a natural uninformative prior is p \sim Unif(0, 1).



  f. Continue to condition on X = x. You learn from a similar process that generally 25% of the arrivals were type Y. Use this knowledge to adjust your prior from (e) and form a scientifically sound prior for the unknown parameter for Y in part (d).


Solution: The question asks us to still go with a Uniform, but now we want to center it around a better value. We know the mean of Y is xp - we’re conditioning on x, so it’s known - but we also know that the proportion of Y arrivals was .25. Therefore, we should center our Uniform around .25. A reasonable prior would be Unif(0, .5), which has mean .25 (there are other possible answers here; the key is that we center around .25).



  g. Continue to condition on X = x. Assign a Beta(\alpha, \beta) prior distribution to the unknown parameter for Y in part (d). If we sample Y_1, ..., Y_n, what is the posterior mean estimate for that parameter?


Solution: We’re essentially saying p \sim Beta(\alpha, \beta). This is a classic example of the Beta-Binomial conjugacy. We know the prior PDF is just the PDF of a Beta(\alpha, \beta), and the likelihood is just the likelihood of a Binomial, so we can write down from what we have seen:

f(p|Y) \propto f(p) f(Y|p)

\propto p^{\alpha - 1} (1 - p)^{\beta - 1} \prod_{i=1}^n {x \choose y_i} p^{y_i} (1 - p)^{x - y_i}

\propto p^{\alpha - 1} (1 - p)^{\beta - 1} p^{\sum_{i=1}^n y_i} (1 - p)^{\sum_{i=1}^n (x - y_i)} = p^{\alpha + (\sum_{i=1}^n y_i) - 1} (1-p)^{\beta + nx - (\sum_{i=1}^n y_i) - 1}

This is proportional to the PDF of a Beta(\alpha + \sum_{i=1}^n y_i, \beta + nx - \sum_{i=1}^n y_i), so that is the posterior distribution, and the posterior mean estimate is just the mean of this distribution, or:

\frac{\alpha + \sum_{i=1}^n y_i}{\alpha + \beta + nx}
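Here is a minimal R sketch of this update (the values x = 5, the true p of .25, and the Beta(1, 1) prior are all just assumptions for illustration):

# hypothetical setup: condition on x = 5 arrivals per simulation, with a true p of 0.25
set.seed(0)
n_sims <- 100
x0 <- 5
y <- rbinom(n_sims, size = x0, prob = 0.25)

# Beta(a0, b0) prior; a0 = b0 = 1 recovers the Uniform prior from before
a0 <- 1
b0 <- 1

# posterior is Beta(a0 + sum(y), b0 + n * x - sum(y)); report its mean
post_a <- a0 + sum(y)
post_b <- b0 + n_sims * x0 - sum(y)
post_a / (post_a + post_b)  # should be close to the true p of 0.25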



  h. Imagine now that X is modeling the arrival of customers at a service center in the period 0 to t. The company has a fixed cost of 100 for operating in this period, plus a variable cost equal to the square of the number of customers that arrive. Let C be the total cost from 0 to t. Imagine that we sampled from the process n times; find the MLE of C.


Solution: First, let’s write the cost function in terms of our arrivals. We have a fixed cost of 100, with the square of the number of customers added on, so we get:

C = 100 + X^2

By the invariance of the MLE, we just need to apply the same function to the MLE for X, which we found earlier to be \bar{X}.

\hat{C}_{MLE} = 100 + \bar{X}^2
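Continuing the earlier R example, where x holds simulated Pois(5) counts, this is just a one-liner:

# plug the sample mean into the cost function, by invariance of the MLE
100 + mean(x)^2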


  i. What’s the asymptotic distribution of the MLE for the parameter of X?


Solution: We know by the asymptotic distribution of MLEs that this will be Normal, with mean equal to the true parameter and variance equal to the inverse of the Fisher information, so let’s find the Fisher information. We already took one derivative of the log-likelihood and got:

l^{\prime}(\theta) = -n + \frac{\sum_{i=1}^n x_i}{\theta} = 0

Differentiating again yields:

l^{\prime \prime}(\theta) = -\frac{\sum_{i=1}^n x_i}{\theta^2}

And now we take the negative expected value. Recall that \theta = \lambda t, and each X_i is Pois(\lambda t).

-E(-\frac{\sum_{i=1}^n x_i}{\theta^2}) = \frac{n\lambda t}{(\lambda t)^2} = \frac{n}{\lambda t}

So we take the inverse to get the asymptotic variance:

\hat{\theta}_{MLE} \rightarrow^D N(\lambda t, \frac{\lambda t}{n})

This looks pretty reasonable, since we know in general that the mean and variance of a Poisson are both equal to its rate parameter.
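We can sanity check this in R with a quick simulation (again using the made-up value \lambda t = 5): the MLE across many repetitions should look roughly Normal with mean 5 and variance 5 / n.

# simulate the sampling distribution of the MLE (the sample mean) when theta = lambda * t = 5
set.seed(0)
n <- 100
theta_mles <- replicate(10000, mean(rpois(n, 5)))

mean(theta_mles)  # should be close to theta = 5
var(theta_mles)   # should be close to theta / n = 0.05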