Chapter 7 Solutions
Chapter 1
Q1-1: Introductory Questions
- What is a model?
- What is a statistical model?
- What is an advanced statistical model?
- What is a variable?
- What is a random variable?
Answer
- Quite broad. The Cambridge Dictionary lists the following definitions of "model" (noun):
- something that a copy can be based on because it is an extremely good example of its type.
- a person who wears clothes so that they can be photographed or shown to possible buyers, or a person who is employed to be photographed or painted.
- a physical object, usually smaller than the real object, that is used to represent something.
- a simple description of a system or process that can be used in calculations or predictions of what might happen.
A statistical model is a set (also known as a family) of (probability) distributions.
Advanced is clearly a subjective word here…
- Can a model be advanced? If so, what does that mean?
- Perhaps it is the modelling that is advanced? Requiring more in-depth, expert knowledge, for example.
- Let’s look at the Aim of the Course: To provide advanced methodological and practical knowledge in the field of statistical modelling, covering a wide range of modelling techniques which are essential for the professional statistician.
- Regarding the last two questions: a variable in statistics is the same as a variable in mathematics more generally: an unspecified element of a given set whose elements are the possible values of the variable. The difference is that in statistics we place a probability distribution on this space of values. A variable with a probability distribution on its space of values is sometimes called a random variable (although the usual mathematical definition is not in these terms). However, I believe this terminology can be redundant and confusing, and would exercise caution in using it, for the following two reasons:
Firstly, nothing has happened to the variable. Rather, an extra structure (the probability distribution) has been added to the space of values, and it is this structure that we must focus on for statistics. As a result, just as in mathematics more generally, we can, and will, get away with not distinguishing between a variable and its value, using \(x\), for example, for both. Nevertheless, just as in mathematics more generally, such a distinction is sometimes useful. If it is useful, then a variable will be denoted in upper case: the usual notation \(x\) will be replaced by the notation \(X = x\), both of them indicating that the variable \(X\) has the value \(x\).
Secondly, the terminology encourages the notion that there are two types of quantity in the world: “random” quantities, to be represented by “random” variables; and “non-random” quantities, to be represented by “ordinary” variables. With the possible exception of quantum mechanics, which is irrelevant to our purposes, this is false, or, rather, meaningless: try thinking of an empirical test to decide whether a quantity is “random” or “non-random”; there are only quantities, represented by variables. The purpose of probability distributions is to describe our knowledge of the values of these variables.
Q1-2: Types of Variable
Decide whether the following variables are categorical or numerical, and classify further if possible:
- gender of the next lamb born at the local farm.
- Answer: categorical, unranked, binary.
- number of times a die needs to be thrown until the first \(6\) is observed.
- Answer: numerical, discrete.
- amount of fluid (in ounces) dispensed by a machine used to fill bottles with lemonade.
- Answer: numerical, continuous.
- thickness of penny coins in millimetres.
- Answer: numerical, continuous.
- assignment grades of a 3H Mathematics course (from A to D).
- Answer: categorical, ranked.
- marital status of some random sample of citizens.
- Answer: categorical, unranked.
Q1-3: Properties of Probability Distributions
- Prove the results for the expectation and covariance structure of the multinomial distribution stated in Section 1.4.3.1.
Answer: Consider first a Multinoulli random variable \(\boldsymbol{W}\), that is, a random variable following a Multinomial distribution with \(n = 1\), parameterised by probability vector \(\boldsymbol{\pi} = (\pi_1,...,\pi_k)\). Note that \(\boldsymbol{W}\) is a vector of \(k-1\) \(0\)’s and a single \(1\), with the \(W_j\) taking value \(1\) with probability \(\pi_j\).
Almost by definition, we have that \({\mathrm E}[\boldsymbol{W}] = \boldsymbol{\pi}\).
Now, for \(j = 1,...,k\), we have \[\begin{eqnarray} {\mathrm{Var}}[W_j] & = & {\mathrm E}[W_j^2] - ({\mathrm E}[W_j])^2 \\ & = & {\mathrm E}[W_j] - ({\mathrm E}[W_j])^2 \\ & = & \pi_j(1 - \pi_j) \end{eqnarray}\] For \(j \neq j'\): \[\begin{eqnarray} {\mathrm{Cov}}[W_j,W_{j'}] & = & {\mathrm E}[W_j W_{j'}] - {\mathrm E}[W_j] {\mathrm E}[W_{j'}] \\ & = & 0 - \pi_j \pi_{j'} \\ & = & - \pi_j \pi_{j'} \end{eqnarray}\] where we have used the fact that if \(j \neq j'\), at least one of \(W_j\) and \(W_{j'}\) is equal to \(0\). It therefore follows that \[\begin{equation} {\mathrm{Var}}[\boldsymbol{W}] = \Sigma = \textrm{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi} \boldsymbol{\pi}^T \end{equation}\]
Now, \(\boldsymbol{Y} = \sum_{l=1}^n \boldsymbol{Y}_l\), where \(\boldsymbol{Y}_l, l = 1,...,n\) are i.i.d. multinoulli. Therefore \[\begin{eqnarray} {\mathrm E}[\boldsymbol{Y}] & = & {\mathrm E}[\sum_{l=1}^n \boldsymbol{Y}_l] \\ & = & \sum_{l=1}^n {\mathrm E}[ \boldsymbol{Y}_l ] \\ & = & n \boldsymbol{\pi} \end{eqnarray}\] with the third line from the identically distributed nature of \(\boldsymbol{Y}_l\).
Also, \[\begin{eqnarray} {\mathrm{Var}}[\boldsymbol{Y}] & = & {\mathrm{Var}}[\sum_{l=1}^n \boldsymbol{Y}_l] \\ & = & \sum_{l=1}^n {\mathrm{Var}}[ \boldsymbol{Y}_l ] \\ & = & n \Sigma \end{eqnarray}\] with the second line coming from the independence of \(\boldsymbol{Y}_l\) and the third line from the identically distributed nature of \(\boldsymbol{Y}_l\).
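As a quick sanity check (not part of the original derivation), these moment results can be verified by simulation; the sketch below uses arbitrary illustrative choices of \(n = 10\) and \(\boldsymbol{\pi} = (0.2, 0.3, 0.5)\), assuming numpy is available.

```python
# Simulation check of the multinomial moments E[Y] = n*pi and
# Var[Y] = n*(diag(pi) - pi pi^T); n and pi are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n, pi = 10, np.array([0.2, 0.3, 0.5])

draws = rng.multinomial(n, pi, size=200_000)   # each row is one draw of Y

print(draws.mean(axis=0))                      # approx n * pi = [2, 3, 5]
print(np.cov(draws.T))                         # empirical covariance of Y
print(n * (np.diag(pi) - np.outer(pi, pi)))    # theoretical n * Sigma for comparison
```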
- Suppose \(X_1^2,...,X_m^2\) are \(m\) independent variables with chi-squared distributions \(X_i^2 \sim \chi^2(n_i)\). Show that \[\begin{equation} \sum_{i=1}^m X_i^2 \sim \chi^2\left(\sum_{i=1}^m n_i \right) \end{equation}\]
Answer: This follows from noticing that each \(X_i^2\) is a sum of \(n_i\) independent squared standard normal variables, so summing over \(i\) gives a sum of \(\sum_{i=1}^m n_i\) independent squared standard normal variables. \[\begin{equation} \sum_{i=1}^m X_i^2 = \sum_{i=1}^m \sum_{j=1}^{n_i} Z_{ij}^2 \sim \chi^2\left(\sum_{i=1}^m n_i \right) \end{equation}\]
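A brief Monte Carlo illustration of the additivity result (the degrees of freedom \(n_1 = 3\) and \(n_2 = 5\) are arbitrary choices, and numpy and scipy are assumed to be available):

```python
# Monte Carlo illustration that chi2(3) + chi2(5) behaves like chi2(8).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
s = rng.chisquare(df=3, size=100_000) + rng.chisquare(df=5, size=100_000)

print(s.mean(), s.var())                   # approx 8 and 16 (mean n, variance 2n)
print(stats.kstest(s, "chi2", args=(8,)))  # KS test against chi2(8): large p-value expected
```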
- Suppose \(X^2\) has the distribution \(\chi^2(n)\). Prove that \[\begin{align} {\mathrm E}[X^2] & = n \\ {\mathrm{Var}}[X^2] & = 2n \end{align}\] Hint: Your solution may require you to assume or show that \({\mathrm E}[Z^4] = 3\), where \(Z \sim{\mathcal N}(0,1)\).
Answer: Note that \(X^2 = \sum_{i=1}^n Z_i^2\) with \(Z_i \sim {\mathcal N}(0,1)\). Therefore \[\begin{equation} {\mathrm E}[X^2] = \sum_{i=1}^n {\mathrm E}[Z_i^2] = n \end{equation}\] since \({\mathrm E}[Z_i^2] = {\mathrm{Var}}[Z_i] + ({\mathrm E}[Z_i])^2 = 1\).
Also, \[\begin{equation} {\mathrm{Var}}[X^2] = {\mathrm{Var}}[\sum_{i=1}^n Z_i^2] = \sum_{i=1}^n {\mathrm{Var}}[Z_i^2] = n {\mathrm{Var}}[Z_i^2] \end{equation}\] with \[\begin{equation} {\mathrm{Var}}[Z_i^2] = {\mathrm E}[Z_i^4] - ({\mathrm E}[Z_i^2])^2 = 3 - 1^2 = 2 \end{equation}\]
To show that \({\mathrm E}[Z_i^4] = 3\), we establish the reduction formula \({\mathrm E}[Z^{n+1}] = n {\mathrm E}[Z^{n-1}]\), which with \(n = 3\) gives \({\mathrm E}[Z^4] = 3 {\mathrm E}[Z^2] = 3\). With \(\phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-z^2/2}\), we have \[\begin{eqnarray} {\mathrm E}[Z^{n+1}] & = & \int_{-\infty}^{\infty} z^{n+1} \phi(z) dz \\ & = & \int_{-\infty}^{\infty} z^{n} z \phi(z) dz \\ & = & - \int_{-\infty}^{\infty} z^{n} \phi'(z) dz \\ & = & -z^n \phi(z) \biggr|_{-\infty}^{\infty} + \int_{-\infty}^{\infty} n z^{n-1} \phi(z) dz \\ & = & 0 + n {\mathrm E}[Z^{n-1}] \end{eqnarray}\] where the third line comes from the fact that \(\phi'(z) = - z \phi(z)\), and the fourth line comes from integration by parts with \(u = z^n\) and \(v' = \phi'(z)\).
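These moments can also be checked numerically; a minimal sketch (the choice \(n = 7\) is arbitrary, numpy assumed available):

```python
# Numeric check of E[Z^4] = 3, E[X^2] = n and Var[X^2] = 2n (n = 7 is arbitrary).
import numpy as np

rng = np.random.default_rng(3)

z = rng.standard_normal(1_000_000)
print((z**4).mean())                     # approx 3

n = 7
x2 = rng.chisquare(df=n, size=1_000_000)
print(x2.mean(), x2.var())               # approx n = 7 and 2n = 14
```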
Chapter 2
Q2-1: Quickfire Questions
- What is the difference between a Poisson, Multinomial and product Multinomial sampling scheme?
Answer: See Sections 2.1.2 and 2.3.
- What does an odds of 1.8 mean relative to success probability \(\pi\)?
Answer: An odds of 1.8 means that the probability of success is 1.8 times the probability of failure: since the odds are \(\omega = \pi/(1-\pi) = 1.8\), the success probability is \(\pi = \omega/(1+\omega) = 1.8/2.8 \approx 0.643\).
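A tiny illustrative helper (not from the notes) for converting between odds and success probability:

```python
# Illustrative helpers for converting between odds and success probability.
def odds_to_prob(odds: float) -> float:
    """pi = omega / (1 + omega)."""
    return odds / (1.0 + odds)

def prob_to_odds(pi: float) -> float:
    """omega = pi / (1 - pi)."""
    return pi / (1.0 - pi)

print(odds_to_prob(1.8))    # approx 0.643
print(prob_to_odds(0.643))  # approx 1.8
```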
Q2-2: Sampling Schemes
Write down statements for the expectation and variance of a variable following a product Multinomial sampling scheme as described in Section 2.3.3.
Answer: Let \(\boldsymbol{N}_i = (N_{i1},...,N_{iJ})\) and \(\boldsymbol{\pi}_i^\star = (\pi_{1|i},...,\pi_{J|i})\) for \(i = 1,...,I\). Then \[\begin{eqnarray} {\mathrm E}[\boldsymbol{N}_i] & = & n_{i+} \boldsymbol{\pi}_i^\star \\ {\mathrm{Var}}[\boldsymbol{N}_i] & = & n_{i+} (\textrm{diag}(\boldsymbol{\pi}_i^\star) - \boldsymbol{\pi}_i^\star \boldsymbol{\pi}_i^{\star,T}) \\ {\mathrm{Cov}}[\boldsymbol{N}_i,\boldsymbol{N}_{i'}] & = & 0_M \qquad i \neq i' \end{eqnarray}\] where \(0_M\) is a matrix of zeroes.
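A short simulation sketch checking these moments (the row totals and conditional probabilities below are arbitrary illustrative choices): within each row the multinomial covariance structure holds, and counts in different rows are uncorrelated because the rows are sampled independently.

```python
# Simulation check of the product-multinomial moments: each row i is multinomial
# with its own total n_{i+} and conditional probabilities, and rows are independent.
import numpy as np

rng = np.random.default_rng(4)
row_totals = [50, 80]
probs = [np.array([0.1, 0.6, 0.3]), np.array([0.4, 0.4, 0.2])]

reps = 100_000
rows = [rng.multinomial(n, p, size=reps) for n, p in zip(row_totals, probs)]

print(np.cov(rows[0].T))                                                    # empirical Var[N_1]
print(row_totals[0] * (np.diag(probs[0]) - np.outer(probs[0], probs[0])))   # theoretical Var[N_1]
print(np.cov(rows[0][:, 0], rows[1][:, 0])[0, 1])                           # approx 0: rows uncorrelated
```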
Q2-3: Fatality of Road Traffic Accidents
Table 6.1 shows fatality results for drivers and passengers in road traffic accidents in Florida in 2015, according to whether the person was wearing a shoulder and lap belt restraint versus not using one. Find and interpret the odds ratio.
Answer: The sample odds ratio is calculated as \[\begin{equation} \frac{n_{11} n_{22}}{n_{12} n_{21}} = \frac{433 \times 554883}{8049 \times 570} = 52.37 \end{equation}\]
Thus, the odds of a road traffic accident being fatal are estimated to be 52.37 times greater if no restraint is used relative to if one is used.
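A short sketch reproducing this calculation, with the cell labels inferred from the interpretation above (row 1: no restraint used, row 2: restraint used; column 1: fatal, column 2: non-fatal); the approximate 95% interval uses the standard large-sample result for the log odds ratio and is an illustrative addition, not part of the original answer.

```python
# Sample odds ratio for the 2x2 table (cell labels inferred from the text).
import numpy as np

n11, n12 = 433, 8049      # no restraint: fatal, non-fatal
n21, n22 = 570, 554883    # restraint:    fatal, non-fatal

odds_ratio = (n11 * n22) / (n12 * n21)
print(round(odds_ratio, 2))                              # 52.37

# Approximate 95% CI via the usual large-sample interval for log(odds ratio).
se = np.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
lo, hi = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se)
print(round(lo, 2), round(hi, 2))
```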
Q2-4: Difference of Proportions or Odds Ratios?
A 20-year study of British male physicians (Doll and Peto, 1976) noted that the proportion who died from lung cancer was 0.00140 per year for cigarette smokers and 0.00010 per year for non-smokers. The proportion who died from heart disease was 0.00669 for smokers and 0.00413 for non-smokers.
- Describe and compare the association of smoking with lung cancer and with heart disease using the difference of proportions.
- Describe and compare the association of smoking with lung cancer and with heart disease using the odds ratio.
- Which response (lung cancer or heart disease) is more strongly related to cigarette smoking, in terms of increased proportional risk to the individual?
- Which response (lung cancer or heart disease) is more strongly related to cigarette smoking, in terms of the reduction in deaths that could occur with an absence of smoking?
Answer:
- Difference of proportions:
Lung cancer: 0.0014 - 0.0001 = 0.0013;
Heart disease: 0.00669 - 0.00413 = 0.00256.
Using the difference of proportions, the data suggests that cigarette smoking has a bigger impact on heart disease.
- Odds ratio:
Lung cancer (L): The sample odds for smokers (S) is given by \(\omega_{L,S} = 0.0014/0.9986\), and the sample odds for non-smokers (N) is \(\omega_{L,N} =0.0001/0.9999\). The sample odds ratio is therefore \(\omega_{L,S}/\omega_{L,N} = 14.02\).
Heart disease (H): The sample odds for smokers (S) is given by \(\omega_{H,S} = 0.00669/0.99331\), and the sample odds for non-smokers (N) is \(\omega_{H,N} =0.00413/0.99587\). The sample odds ratio is therefore \(\omega_{H,S}/\omega_{H,N} = 1.624\).
The odds of dying from lung cancer are estimated to be 14.02 times higher for smokers than for non-smokers, whilst the odds of dying from heart disease are estimated to be 1.624 times higher for smokers than non-smokers. Thus, using the sample odds ratio, the data suggests that cigarette smoking has a bigger impact on lung cancer.
- Increased proportional risk: we use the sample odds ratios above: lung cancer has an odds ratio of 14.02, compared with 1.624 for heart disease. Therefore, the increased proportional risk to the individual smoker, relative to a non-smoker, is much higher for lung cancer than for heart disease.
- Reduction in deaths: the difference of proportions describes excess deaths due to smoking. That is, if \(N\) is the number of smokers in the population, we predict there would be \(0.00130N\) fewer deaths per year from lung cancer if they had never smoked, and \(0.00256N\) fewer deaths per year from heart disease. Thus (based on this study), elimination of cigarette smoking is predicted to have the biggest impact on deaths due to heart disease.
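The two measures used in this answer can be reproduced directly from the quoted proportions; a minimal sketch:

```python
# Reproducing the difference of proportions and odds ratios from the quoted
# yearly death proportions (smokers vs non-smokers).
p_smoker = {"lung cancer": 0.00140, "heart disease": 0.00669}
p_nonsmoker = {"lung cancer": 0.00010, "heart disease": 0.00413}

for cause in p_smoker:
    p1, p0 = p_smoker[cause], p_nonsmoker[cause]
    diff = p1 - p0
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    print(f"{cause}: difference = {diff:.5f}, odds ratio = {odds_ratio:.2f}")
# lung cancer:   difference = 0.00130, odds ratio = 14.02
# heart disease: difference = 0.00256, odds ratio = 1.62
```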
References
Doll, R. and Peto, R. (1976). Mortality in relation to smoking: 20 years' observations on male British doctors. British Medical Journal, 2(6051), 1525–1536.