We have seen that there may be several competing estimators for a parameter. How can one choose between them? In this section we present some properties that may be considered desirable in an estimator. However, having these properties does not lead to an unambiguous choice of a single best estimator for a given problem.

The setting : Let $X_{1},\ldots ,X_{n}$ be i.i.d. random variables with a common density $f_{\theta}(x)$. The parameter $\theta$ is unknown and the goal is to estimate it. Let $T_{n}$ be an estimator for $\theta$; this just means that $T_{n}$ is a function of $X_{1},\ldots ,X_{n}$ (in words, if we have the data at hand, we should be able to compute the value of $T_{n}$).

Bias : Define the bias of the estimator as $\mbox{Bias}_{T_{n} }(\theta):=\mathbf{E}_{\theta}[T_{n}]-\theta$. If $\mbox{Bias}_{T_{n} }(\theta)=0$ for all values of the parameter $\theta$, then we say that $T_{n}$ is unbiased for $\theta$. Here we write $\theta$ in the subscript of $\mathbf{E}_{\theta}$ to remind ourselves that in computing the expectation we use the density $f_{\theta}$. However, we shall often omit the subscript for simplicity.

Mean-squared error : The mean squared error of $T_{n}$ is defined as $\mbox{m.s.e.}_{T_{n} }(\theta)=\mathbf{E}_{\theta}[(T_{n}-\theta)^{2}]$. This is a function of $\theta$. The smaller it is, the better our estimate.

In computing mean squared error, it is useful to observe the formula $$ \mbox{m.s.e.}_{T_{n} }(\theta) = \mbox{Var}_{T_{n} }(\theta) + \left(\mbox{Bias}_{T_{n} }(\theta)\right)^{2}. $$ To prove this, consider a random variable $Y$ with mean $\mu$ and observe that for any real number $a$ we have $$\begin{align*} \mathbf{E}[(Y-a)^{2}] &=\mathbf{E}[(Y-\mu+\mu-a)^{2}] = \mathbf{E}[(Y-\mu)^{2}]+(\mu-a)^{2}+2(\mu-a)\mathbf{E}[Y-\mu] \\ &= \mathbf{E}[(Y-\mu)^{2}]+(\mu-a)^{2} = \mbox{Var}(Y) + (\mu-a)^{2}. \end{align*}$$ Use this identity with $T_{n}$ in place of $Y$ and $\theta$ in place of $a$.
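The decomposition is easy to check numerically. Below is a minimal Python sketch (not part of the argument above; the deliberately biased estimator $T_{n}=\frac{1}{n+1}\sum_{k}X_{k}$ and all numerical values are chosen only for illustration): it estimates the bias, variance and m.s.e. of $T_{n}$ by Monte Carlo and checks that the identity holds up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 200_000

# Simulate `reps` datasets of size n from N(theta, 1) and apply the
# (deliberately biased) estimator T_n = (1/(n+1)) * sum(X_i).
X = rng.normal(theta, 1.0, size=(reps, n))
T = X.sum(axis=1) / (n + 1)

bias = T.mean() - theta          # Monte Carlo estimate of E[T_n] - theta
var = T.var()                    # Monte Carlo estimate of Var(T_n)
mse = ((T - theta) ** 2).mean()  # Monte Carlo estimate of E[(T_n - theta)^2]

print(mse, var + bias**2)        # the two numbers agree up to simulation error
```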

 

Remark 163
An analogy. Consider shooting with a rifle having a telescopic sight. A given target can be missed for two reasons. One, the marksman may be unskilled and shoot all over the place, sometimes a meter to the right of the target, sometimes a meter to the left, etc. In this case, the shots have a large variance. Another person may consistently hit a point 20 cm to the right of the target. Perhaps the telescopic sight is not set right, and this causes the systematic error. This is the bias. Both bias and variance contribute to missing the target.

 

Example 164
Let $X_{1},\ldots ,X_{n}$ be i.i.d. $N(\mu,{\sigma}^{2})$. Let $V_{n}=\frac{1}{n}\sum_{k=1}^{n}(X_{k}-\bar{X}_{n})^{2}$ be an estimate for ${\sigma}^{2}$. By expanding the squares we get $$ V_{n}=\bar{X}_{n}^{2}+\frac{1}{n}\sum_{k=1}^{n}X_{k}^{2} -\frac{2}{n}\bar{X}_{n}\sum_{k=1}^{n}X_{k} = \left(\frac{1}{n}\sum_{k=1}^{n}X_{k}^{2} \right)-\bar{X}_{n}^{2}. $$ It is given that $\mathbf{E}[X_{k}]=\mu$ and $\mbox{Var}(X_{k})={\sigma}^{2}$. Hence $\mathbf{E}[X_{k}^{2}]=\mu^{2}+{\sigma}^{2}$. We have seen before that $\mbox{Var}(\bar{X}_{n})=\frac{{\sigma}^{2} }{n}$ and $\mathbf{E}[\bar{X}_{n}]=\mu$. Hence $\mathbf{E}[\bar{X}_{n}^{2}]=\mu^{2}+\frac{{\sigma}^{2} }{n}$. Putting all this together, we get $$ \mathbf{E}\left[V_{n} \right] = \left( \frac{1}{n}\sum_{k=1}^{n}(\mu^{2}+{\sigma}^{2}) \right) - \left(\mu^{2}+\frac{{\sigma}^{2} }{n}\right) = \frac{n-1}{n}{\sigma}^{2}. $$ Thus, the bias of $V_{n}$ is $\frac{n-1}{n}{\sigma}^{2}-{\sigma}^{2}=-\frac{1}{n}{\sigma}^{2}$.
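A quick simulation illustrates this bias (a sketch only; the values $\mu=1$, $\sigma=2$, $n=5$ are arbitrary choices, not part of the example):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 1.0, 2.0, 5, 200_000

# V_n computed for each of `reps` simulated samples of size n
X = rng.normal(mu, sigma, size=(reps, n))
Vn = ((X - X.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

# Both numbers are close to (n-1)*sigma^2/n = 3.2, not sigma^2 = 4
print(Vn.mean(), (n - 1) / n * sigma**2)
```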

 

Example 165
For the same setting as the previous example, suppose $W_{n}=\frac{1}{n}\sum_{k=1}^{n}(X_{k}-\mu)^{2}$. Then it is easy to see that $\mathbf{E}[W_{n}]={\sigma}^{2}$. Can we say that $W_{n}$ is an unbiased estimate for ${\sigma}^{2}$? There is a hitch!

If the value of $\mu$ is unknown, then $W_{n}$ is not an estimate (we cannot compute it from $X_{1},\ldots ,X_{n}$ alone!). However, if $\mu$ is known, then it is an unbiased estimate. For example, if we knew that $\mu=0$, then $W_{n}=\frac{1}{n}\sum_{k=1}^{n}X_{k}^{2}$ is an unbiased estimate for ${\sigma}^{2}$.

When $\mu$ is unknown, we define $s_{n}^{2}=\frac{1}{n-1}\sum_{k=1}^{n}(X_{k}-\bar{X}_{n})^{2}$. Clearly $s_{n}^{2}=\frac{n}{n-1}V_{n}$ and hence $\mathbf{E}[s_{n}^{2}]=\frac{n}{n-1}\mathbf{E}[V_{n}]= {\sigma}^{2}$. Thus, $s_{n}^{2}$ is an unbiased estimate for ${\sigma}^{2}$. Note that $s_{n}^{2}$ depends only on the data and hence it is an estimate, whether $\mu$ is known or unknown.

All the remarks in the above two examples apply to any distribution, i.e.,
  1. The sample mean is unbiased for the population mean.
  2. The sample variance $s_{n}^{2}=\frac{1}{n-1}\sum_{k=1}^{n}(X_{k}-\bar{X}_{n})^{2}$ is unbiased for the population variance. But $V_{n}=\frac{1}{n}\sum_{k=1}^{n}(X_{k}-\bar{X}_{n})^{2}$ is not, in fact $\mathbf{E}[V_{n}]=\frac{n-1}{n}{\sigma}^{2}$.
It appears that $s_{n}^{2}$ is better, but the following remark says that one should be cautious in making such a statement.
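As an aside, numerical libraries expose both conventions. In NumPy, for instance, `np.var(x)` computes $V_{n}$ (divide by $n$) while `np.var(x, ddof=1)` computes $s_{n}^{2}$ (divide by $n-1$); the snippet below is only a reminder of this, with arbitrary illustrative data.

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 5.0, 3.0])
n = len(x)

Vn = np.var(x)            # divides by n   (biased estimator V_n)
sn2 = np.var(x, ddof=1)   # divides by n-1 (unbiased estimator s_n^2)

# s_n^2 = n/(n-1) * V_n, as in the text
assert np.isclose(sn2, n / (n - 1) * Vn)
```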

 

Remark 166
In the case of $N(\mu,{\sigma}^{2})$ data, it turns out that although $s_{n}^{2}$ is unbiased and $V_{n}$ is biased, the mean squared error of $V_{n}$ is smaller! Further, $V_{n}$ is the maximum likelihood estimate of ${\sigma}^{2}$! Overall, unbiasedness is not as important as having a small mean squared error, but for estimating the variance (when the mean is not known), we always use $s_{n}^{2}$. The computation of the m.s.e. is a bit tedious, so we skip it here.
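Although we skip the exact computation, the claim is easy to check by simulation. A minimal sketch (parameter values $\mu=0$, $\sigma^{2}=1$, $n=10$ are arbitrary), comparing the Monte Carlo mean squared errors of $V_{n}$ and $s_{n}^{2}$ for normal data:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 1.0, 10, 200_000

X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
Vn = X.var(axis=1)            # divide by n
sn2 = X.var(axis=1, ddof=1)   # divide by n-1

mse_V = ((Vn - sigma2) ** 2).mean()
mse_s = ((sn2 - sigma2) ** 2).mean()
print(mse_V, mse_s)           # mse_V comes out smaller, despite the bias
```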

 

Example 167
Let $X_{1},\ldots ,X_{n}$ be i.i.d. $\mbox{Ber}(p)$. Then $\bar{X}_{n}$ is an estimate for $p$. It is unbiased since $\mathbf{E}[\bar{X}_{n}]=p$. Hence, the m.s.e. of $\bar{X}_{n}$ is just its variance, which is equal to $p(1-p)/n$.

A puzzle : A coin $C_{1}$ has probability $p$ of turning up heads and a coin $C_{2}$ has probability $2p$ of turning up heads. All we know is that $0 < p < \frac{1}{2}$. You are given $20$ tosses. You can choose all tosses from $C_{1}$ or all tosses from $C_{2}$ or some tosses from each (the total is $20$). If the objective is to estimate $p$, what do you do?

Solution : If we choose to have all $n=20$ tosses from $C_{1}$, then we get $X_{1},\ldots ,X_{n}$ that are i.i.d. $\mbox{Ber}(p)$. An estimate for $p$ is $\bar{X}_{n}$, which is unbiased and hence $\mbox{MSE}_{\bar{X}_{n} }(p)=\mbox{Var}(\bar{X}_{n})=p(1-p)/n$. On the other hand, if we choose to have all $20$ tosses from $C_{2}$, then we get $Y_{1},\ldots ,Y_{n}$ that are i.i.d. $\mbox{Ber}(2p)$. The estimate for $p$ is now $\bar{Y}_{n}/2$, which is also unbiased and has $\mbox{MSE}_{\bar{Y}_{n}/2}(p)=\mbox{Var}(\bar{Y}_{n}/2)=\frac{2p(1-2p)}{4n} = \frac{p(1-2p)}{2n}$. It is not hard to see that for all $0<p < 1/2$, $\mbox{MSE}_{\bar{Y}_{n}/2}(p) < \mbox{MSE}_{\bar{X}_{n} }(p)$ (indeed $1-2p<2(1-p)$), and hence choosing $C_{2}$ is better, at least by the mean-squared criterion! It can be checked that if we choose to have $k$ tosses from $C_{1}$ and the rest from $C_{2}$, the MSE of the corresponding estimate will be between the two MSEs found above and hence not better than $\bar{Y}_{n}/2$.
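The inequality $\frac{p(1-2p)}{2n} < \frac{p(1-p)}{n}$ can also be seen in a quick simulation. A sketch with $n=20$ tosses as in the puzzle (the grid of $p$ values is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 20, 200_000

for p in (0.1, 0.25, 0.4):
    X = rng.binomial(1, p, size=(reps, n))        # all 20 tosses from C1
    Y = rng.binomial(1, 2 * p, size=(reps, n))    # all 20 tosses from C2
    mse_C1 = ((X.mean(axis=1) - p) ** 2).mean()       # estimate by X-bar
    mse_C2 = ((Y.mean(axis=1) / 2 - p) ** 2).mean()   # estimate by Y-bar / 2
    print(p, mse_C1, mse_C2)   # mse_C2 is smaller for every p in (0, 1/2)
```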

Another puzzle : A factory produces light bulbs having an exponential distribution with mean $\mu$. Another factory produces light bulbs having an exponential distribution with mean $2\mu$. Your goal is to estimate $\mu$. You are allowed to choose a total of $50$ light bulbs (all from the first or all from the second or some from each factory). What do you do?

Solution : If we pick all $n=50$ bulbs from the first factory, we see $X_{1},\ldots ,X_{n}$ i.i.d. $\mbox{Exp}(1/\mu)$. The estimate for $\mu$ is $\bar{X}_{n}$, which has $\mbox{MSE}_{\bar{X}_{n} }(\mu)=\mbox{Var}(\bar{X}_{n})=\mu^{2}/n$. If we choose all bulbs from the second factory, we get $Y_{1},\ldots ,Y_{n}$ i.i.d. $\mbox{Exp}(1/2\mu)$. The estimate for $\mu$ is $\bar{Y}_{n}/2$. But $\mbox{MSE}_{\bar{Y}_{n}/2}(\mu)=\mbox{Var}(\bar{Y}_{n}/2)=(2\mu)^{2}/(4n)=\mu^{2}/n$. The two mean-squared errors are exactly the same!
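A sketch confirming that the two mean-squared errors coincide (the value $\mu=3$ is arbitrary; note that NumPy's exponential sampler is parameterized by the mean, i.e. the scale, not the rate):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, n, reps = 3.0, 50, 200_000

X = rng.exponential(scale=mu, size=(reps, n))       # factory 1: mean mu
Y = rng.exponential(scale=2 * mu, size=(reps, n))   # factory 2: mean 2*mu

mse_1 = ((X.mean(axis=1) - mu) ** 2).mean()         # estimate by X-bar
mse_2 = ((Y.mean(axis=1) / 2 - mu) ** 2).mean()     # estimate by Y-bar / 2
print(mse_1, mse_2, mu**2 / n)                      # all three are close
```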

Probabilistic thinking : Is there any calculation-free explanation of why the answers to the two puzzles come out as above? Yes, and it is illustrative of what may be called probabilistic thinking. Take the second puzzle. Why are the two estimates the same by the mean-squared criterion? Is one better by some other criterion?

Recall that if $X\sim \mbox{Exp}(1/\mu)$ then $2X\sim \mbox{Exp}(1/2\mu)$ and, conversely, if $Y\sim \mbox{Exp}(1/2\mu)$ then $Y/2\sim \mbox{Exp}(1/\mu)$. Therefore, if we have data from the $\mbox{Exp}(1/\mu)$ distribution, then we can multiply all the numbers by $2$ and convert it into data from the $\mbox{Exp}(1/2\mu)$ distribution. Conversely, if we have data from the $\mbox{Exp}(1/2\mu)$ distribution, then we can convert it into data from the $\mbox{Exp}(1/\mu)$ distribution by dividing each number by $2$. Hence there should be no advantage in choosing either factory. We leave it to you to think in analogous ways about why in the first puzzle $C_{2}$ is better than $C_{1}$.

Chapter 31. Confidence intervals