We have seen that there may be several competing estimators for the same parameter. How can one choose between them? In this section we present some properties that may be considered desirable in an estimator. However, having these properties does not lead to an unambiguous choice of one estimate as the best for a given problem.
The setting : Let $X_{1},\ldots ,X_{n}$ be i.i.d. random variables with a common density $f_{\theta}(x)$. The parameter $\theta$ is unknown and the goal is to estimate it. Let $T_{n}$ be an estimator for $\theta$; this just means that $T_{n}$ is a function of $X_{1},\ldots ,X_{n}$ (in words, if we have the data at hand, we should be able to compute the value of $T_{n}$).
Bias : Define the bias of the estimator as $\mbox{Bias}_{T_{n} }(\theta):=\mathbf{E}_{\theta}[T_{n}]-\theta$. If $\mbox{Bias}_{T_{n} }(\theta)=0$ for all values of the parameter $\theta$ then we say that $T_{n}$ is unbiased for $\theta$. Here we write $\theta$ in the subscript of $\mathbf{E}_{\theta}$ to remind ourselves that in computing the expectation we use the density $f_{\theta}$. However, we shall often omit the subscript for simplicity.
Mean-squared error : The mean-squared error of $T_{n}$ is defined as $\mbox{m.s.e.}_{T_{n} }(\theta)=\mathbf{E}_{\theta}[(T_{n}-\theta)^{2}]$. This is a function of $\theta$. The smaller it is, the better our estimate.
In computing the mean-squared error, it is useful to observe the formula $$ \mbox{m.s.e.}_{T_{n} }(\theta) = \mbox{Var}_{T_{n} }(\theta) + \left(\mbox{Bias}_{T_{n} }(\theta)\right)^{2}. $$ To prove this, consider a random variable $Y$ with mean $\mu$ and observe that for any real number $a$ we have $$\begin{align*} \mathbf{E}[(Y-a)^{2}] &=\mathbf{E}[(Y-\mu+\mu-a)^{2}] = \mathbf{E}[(Y-\mu)^{2}]+(\mu-a)^{2}+2(\mu-a)\mathbf{E}[Y-\mu] \\ &= \mathbf{E}[(Y-\mu)^{2}]+(\mu-a)^{2} = \mbox{Var}(Y) + (\mu-a)^{2}. \end{align*}$$ Use this identity with $T_{n}$ in place of $Y$ and $\theta$ in place of $a$ (so that $\mu=\mathbf{E}_{\theta}[T_{n}]$).
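To see the identity in action, here is a small Monte Carlo sketch (not part of the original text; the shrinkage estimator $T_{n}=\frac{n}{n+1}\bar{X}_{n}$ is a hypothetical example, chosen only because it has nonzero bias):

```python
import numpy as np

# Monte Carlo check of  m.s.e. = Var + Bias^2  for the deliberately
# biased estimator T_n = (n/(n+1)) * Xbar_n of the mean mu.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 3.0, 2.0, 10, 200_000

X = rng.normal(mu, sigma, size=(reps, n))
T = (n / (n + 1)) * X.mean(axis=1)   # one value of T_n per repetition

mse = np.mean((T - mu) ** 2)         # empirical E[(T_n - theta)^2]
bias = T.mean() - mu                 # empirical E[T_n] - theta
var = T.var()                        # empirical Var(T_n)
print(mse, var + bias ** 2)          # the two numbers agree
```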
Estimating the variance : Let $X_{1},\ldots ,X_{n}$ be i.i.d. with mean $\mu$ and variance ${\sigma}^{2}$, and consider $W_{n}=\frac{1}{n}\sum_{k=1}^{n}(X_{k}-\mu)^{2}$ as a candidate estimate for ${\sigma}^{2}$. If the value of $\mu$ is unknown, then $W_{n}$ is not an estimate (we cannot compute it using $X_{1},\ldots ,X_{n}$!). However, if $\mu$ is known, then it is an unbiased estimate, since $\mathbf{E}[(X_{k}-\mu)^{2}]={\sigma}^{2}$. For example, if we knew that $\mu=0$, then $W_{n}=\frac{1}{n}\sum_{k=1}^{n}X_{k}^{2}$ is an unbiased estimate for ${\sigma}^{2}$.
When $\mu$ is unknown, we define $s_{n}^{2}=\frac{1}{n-1}\sum_{k=1}^{n}(X_{k}-\bar{X}_{n})^{2}$. Clearly $s_{n}^{2}=\frac{n}{n-1}V_{n}$, where $V_{n}=\frac{1}{n}\sum_{k=1}^{n}(X_{k}-\bar{X}_{n})^{2}$, and hence $\mathbf{E}[s_{n}^{2}]=\frac{n}{n-1}\mathbf{E}[V_{n}]=\frac{n}{n-1}\cdot \frac{n-1}{n}{\sigma}^{2}= {\sigma}^{2}$. Thus, $s_{n}^{2}$ is an unbiased estimate for ${\sigma}^{2}$. Note that $s_{n}^{2}$ depends only on the data, and hence it is an estimate whether $\mu$ is known or unknown.
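The following simulation (a sketch, assuming normal data for concreteness) illustrates the point: with the $\frac{1}{n}$ factor, $V_{n}$ underestimates ${\sigma}^{2}$ on average, while $s_{n}^{2}$ does not.

```python
import numpy as np

# Compare the average of V_n (divide by n) with that of s_n^2
# (divide by n-1) over many independent samples.
rng = np.random.default_rng(1)
sigma2, n, reps = 4.0, 5, 500_000

X = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
V = X.var(axis=1, ddof=0)    # V_n   = (1/n)     * sum (X_k - Xbar_n)^2
s2 = X.var(axis=1, ddof=1)   # s_n^2 = (1/(n-1)) * sum (X_k - Xbar_n)^2

print(V.mean())    # close to (n-1)/n * sigma^2 = 3.2  (biased low)
print(s2.mean())   # close to sigma^2 = 4.0            (unbiased)
```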
A puzzle : A coin $C_{1}$ has probability $p$ of turning up heads and a coin $C_{2}$ has probability $2p$ of turning up heads. All we know is that $0 < p < \frac{1}{2}$. You are given $20$ tosses. You can choose all tosses from $C_{1}$, all tosses from $C_{2}$, or some tosses from each (the total is $20$). If the objective is to estimate $p$, what do you do?
Solution : If we choose to have all $n=20$ tosses from $C_{1}$, then we get $X_{1},\ldots ,X_{n}$ that are i.i.d. $\mbox{Ber}(p)$. An estimate for $p$ is $\bar{X}_{n}$, which is unbiased and hence $\mbox{MSE}_{\bar{X}_{n} }(p)=\mbox{Var}(\bar{X}_{n})=p(1-p)/n$. On the other hand, if we choose to have all $20$ tosses from $C_{2}$, then we get $Y_{1},\ldots ,Y_{n}$ that are i.i.d. $\mbox{Ber}(2p)$. The estimate for $p$ is now $\bar{Y}_{n}/2$, which is also unbiased and has $\mbox{MSE}_{\bar{Y}_{n}/2}(p)=\mbox{Var}(\bar{Y}_{n}/2)=\frac{2p(1-2p)}{4n} = \frac{p(1-2p)}{2n}$. It is not hard to see that $\mbox{MSE}_{\bar{Y}_{n}/2}(p) < \mbox{MSE}_{\bar{X}_{n} }(p)$ for all $p < 1/2$ (indeed, $\frac{1-2p}{2} < 1-p$ always holds), and hence choosing $C_{2}$ is better, at least by the mean-squared criterion! It can be checked that if we choose to have $k$ tosses from $C_{1}$ and the rest from $C_{2}$, the MSE of the corresponding estimate will be between the two MSEs found above, and hence not better than $\bar{Y}_{n}/2$.
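A quick simulation (a sketch, not part of the original solution; the value $p=0.3$ is an arbitrary choice within $(0,\frac{1}{2})$) confirms the comparison:

```python
import numpy as np

# Empirical MSEs of the two strategies in the coin puzzle.
rng = np.random.default_rng(2)
p, n, reps = 0.3, 20, 200_000

X = rng.binomial(1, p, size=(reps, n))       # all 20 tosses from C1, Ber(p)
Y = rng.binomial(1, 2 * p, size=(reps, n))   # all 20 tosses from C2, Ber(2p)

mse_C1 = np.mean((X.mean(axis=1) - p) ** 2)
mse_C2 = np.mean((Y.mean(axis=1) / 2 - p) ** 2)
print(mse_C1, p * (1 - p) / n)            # both ~ 0.0105
print(mse_C2, p * (1 - 2 * p) / (2 * n))  # both ~ 0.0030, smaller
```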
Another puzzle : A factory produces light bulbs whose lifetimes have an exponential distribution with mean $\mu$. Another factory produces light bulbs whose lifetimes have an exponential distribution with mean $2\mu$. Your goal is to estimate $\mu$. You are allowed to choose a total of $50$ light bulbs (all from the first factory, all from the second, or some from each). What do you do?
Solution : If we pick all $n=50$ bulbs from the first factory, we observe $X_{1},\ldots ,X_{n}$ i.i.d. $\mbox{Exp}(1/\mu)$. The estimate for $\mu$ is $\bar{X}_{n}$, which has $\mbox{MSE}_{\bar{X}_{n} }(\mu)=\mbox{Var}(\bar{X}_{n})=\mu^{2}/n$. If we choose all bulbs from the second factory, we get $Y_{1},\ldots ,Y_{n}$ i.i.d. $\mbox{Exp}(1/(2\mu))$. The estimate for $\mu$ is $\bar{Y}_{n}/2$. But $\mbox{MSE}_{\bar{Y}_{n}/2}(\mu)=\mbox{Var}(\bar{Y}_{n}/2)=(2\mu)^{2}/(4n)=\mu^{2}/n$. The two mean-squared errors are exactly the same!
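Again a simulation (a sketch; $\mu=5$ is an arbitrary choice) bears out the equality of the two mean-squared errors:

```python
import numpy as np

# Empirical MSEs of the two strategies in the bulb puzzle.
rng = np.random.default_rng(3)
mu, n, reps = 5.0, 50, 200_000

X = rng.exponential(mu, size=(reps, n))       # factory 1, lifetimes of mean mu
Y = rng.exponential(2 * mu, size=(reps, n))   # factory 2, lifetimes of mean 2*mu

print(np.mean((X.mean(axis=1) - mu) ** 2))      # ~ mu^2/n = 0.5
print(np.mean((Y.mean(axis=1) / 2 - mu) ** 2))  # ~ mu^2/n = 0.5
```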
Probabilistic thinking : Is there any calculation-free explanation of why the answers to the two puzzles come out as above? Yes, and it is illustrative of what may be called probabilistic thinking. Take the second puzzle. Why are the two estimates the same by the mean-squared error criterion? Is one better by some other criterion?
Recall that if $X\sim \mbox{Exp}(1/\mu)$ then $2X\sim \mbox{Exp}(1/(2\mu))$ and vice versa. Therefore, if we have data from the $\mbox{Exp}(1/\mu)$ distribution, then we can multiply all the numbers by $2$ and convert it into data from the $\mbox{Exp}(1/(2\mu))$ distribution. Conversely, if we have data from the $\mbox{Exp}(1/(2\mu))$ distribution, then we can convert it into data from the $\mbox{Exp}(1/\mu)$ distribution by dividing each number by $2$. Hence there should be no advantage in choosing either factory. We leave it to you to think in analogous ways about why, in the first puzzle, $C_{2}$ is better than $C_{1}$.
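The scaling fact itself is easy to check numerically (a sketch; comparing a few quantiles is just one of many ways to see that the distributions match):

```python
import numpy as np

# Doubling Exp(1/mu) samples produces the Exp(1/(2*mu)) distribution,
# so either factory's data can be converted into the other's.
rng = np.random.default_rng(4)
mu, n = 5.0, 100_000

X = rng.exponential(mu, size=n)       # factory 1 lifetimes, mean mu
Y = rng.exponential(2 * mu, size=n)   # factory 2 lifetimes, mean 2*mu

qs = [0.25, 0.5, 0.75, 0.9]
print(np.quantile(2 * X, qs))         # quantiles of 2*X ...
print(np.quantile(Y, qs))             # ... closely match those of Y
```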