Consider the following examples.

  1. A coin has an unknown probability $p$ of turning up heads. We wish to determine the value of $p$. For this, we toss the coin $100$ times and observe the outcomes. How do we give a guess for the value of $p$ based on the data?
  2. A factory manufactures light bulbs whose lifetimes may be assumed to be exponential random variables with mean lifetime $\mu$. We take a sample of $50$ bulbs at random and measure their lifetimes $X_{1},\ldots ,X_{50}$. Based on this data, how can we present a reasonable guess for $\mu$? We may want to do this so that the specifications can be printed on the product when it is sold.
  3. Can we guess the average height $\mu$ of all people in India by taking a random sample of $100$ people and measuring their heights?
In such questions, there is an unknown parameter $\mu$ (there could be more than one unknown parameter too) whose value we are trying to guess based on the data. The data consists of i.i.d. random variables from a family of distributions. We assume that the family of distributions is known and that the only unknowns are the values of the parameter(s). Rather than present the ideas in the abstract, let us see a few examples.

 

Example 155
Let $X_{1},\ldots ,X_{n}$ be i.i.d. random variables with Exponential density $f_{\mu}(x)=\frac{1}{\mu}e^{-x/\mu}$ (for $x > 0$), where the value of $\mu > 0$ is unknown. How to estimate it using the data $X=(X_{1},\ldots ,X_{n})$?

This is the framework in which we would study the second example above, namely the lifetime distribution of light bulbs. Observe that we have parameterized the exponential family of distributions differently from usual. We could equivalently have considered $g_{\lambda}(x)=\lambda e^{-\lambda x}$, but then the interest is in estimating $1/\lambda$ (which is the expected value) rather than $\lambda$. Here are two methods.

Method of moments : We observe that $\mu=\mathbf{E}_{\mu}[X_{1}]$, the mean of the distribution (also called population mean). Hence it seems reasonable to take the sample mean $\bar{X}_{n}$ as an estimate. On second thought, we realize that $\mathbf{E}_{\mu}[X_{1}^{2}]=2\mu^{2}$ and hence $\mu=\sqrt{\frac{1}{2}\mathbf{E}_{\mu}[X_{1}^{2}]}$. Therefore it also seems reasonable to take the corresponding sample quantity, $T_{n}:=\sqrt{\frac{1}{2n}(X_{1}^{2}+\ldots +X_{n}^{2})}$ as an estimate for $\mu$. One can go further and write $\mu$ in various ways as $\mu=\sqrt{\mbox{Var}_{\mu}(X_{1})}$, $\mu=\sqrt[3]{\frac{1}{6}\mathbf{E}_{\mu}[X_{1}^{3}]}$ etc. Each such expression motivates an estimate, just by substituting sample moments for population moments.

This is called estimating by the method of moments because we are equating the sample moments to population moments to obtain the estimate.
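
To see these estimates in action, here is a minimal numerical sketch in Python (using numpy, which is not part of these notes; the sample size and the true value of $\mu$ are arbitrary choices made for the simulation). It computes the two moment-based estimates $\bar{X}_{n}$ and $T_{n}$ from simulated exponential data.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 2.0                                # true mean, arbitrary choice for the simulation
n = 50
X = rng.exponential(scale=mu_true, size=n)   # i.i.d. Exponential with mean mu_true

# First-moment estimate: mu = E[X1], with the expectation replaced by the sample mean
mu_hat_1 = X.mean()

# Second-moment estimate: mu = sqrt(E[X1^2]/2), with the sample second moment substituted
mu_hat_2 = np.sqrt((X**2).mean() / 2)

print(mu_hat_1, mu_hat_2)                    # both should be close to mu_true
```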

We can also use other features of the distribution, such as quantiles (we may call this the ''method of quantiles''). In other words, obtain estimates by equating the sample quantiles to population quantiles. For example, the median of $X_{1}$ is $\mu\log 2$, hence a reasonable estimate for $\mu$ is $M_{n}/\log 2$, where $M_{n}$ is a sample median. Alternately, the $25\%$ quantile of $\mbox{Exponential}(1/\mu)$ distribution is $\mu\log(4/3)$ and hence another estimate for $\mu$ is $Q_{n}/\log(4/3)$ where $Q_{n}$ is a $25\%$ sample quantile.
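
The quantile-based estimates can be computed just as easily. The following sketch (same assumptions as the previous one: numpy, arbitrary simulation settings) uses the sample median and the $25\%$ sample quantile.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, n = 2.0, 50                        # arbitrary values for the simulation
X = rng.exponential(scale=mu_true, size=n)

# Median of Exponential(mean mu) is mu*log(2); its 25% quantile is mu*log(4/3)
mu_hat_median   = np.median(X) / np.log(2)
mu_hat_quartile = np.quantile(X, 0.25) / np.log(4/3)

print(mu_hat_median, mu_hat_quartile)       # both should be close to mu_true
```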

Maximum likelihood method : The joint density of $X_{1},\ldots ,X_{n}$ is $$g_{\mu}(x_{1},\ldots ,x_{n})=\mu^{-n}e^{-(x_{1}+\ldots +x_{n})/\mu} \qquad \mbox{ if all }x_{i} > 0$$ (since the $X_{i}$ are independent, the joint density is a product). We evaluate the joint density at the observed data values. This is called the likelihood function. In other words, define $$ L_{X}(\mu) := \mu^{-n}e^{-\frac{1}{\mu}\sum_{i=1}^{n}X_{i} }. $$ Two points: this is the joint density of $X_{1},\ldots ,X_{n}$ evaluated at the observed data, and we like to think of it as a function of $\mu$ with $X:=(X_{1},\ldots ,X_{n})$ being fixed.

When $\mu$ is the actual value, $L_{X}(\mu)$ is the ''likelihood'' of seeing the data that we have actually observed. The maximum likelihood estimate is the value of $\mu$ that maximizes the likelihood function. In our case, by differentiating and setting equal to zero we get $$ 0 =\frac{d}{d\mu}L_{X}(\mu) = -n\mu^{-n-1}e^{-\frac{1}{\mu}\sum_{i=1}^{n}X_{i} }+\mu^{-n}\left(\frac{1}{\mu^{2} }\sum_{i=1}^{n}X_{i}\right)e^{-\frac{1}{\mu}\sum_{i=1}^{n}X_{i} }, $$ which is satisfied when $\mu=\frac{1}{n}\sum_{i=1}^{n}X_{i}=\bar{X}_{n}$. To distinguish this from the true value of $\mu$, which is unknown, it is customary to put a hat on the letter $\mu$. We write $\hat{\mu}_{MLE}=\bar{X}_{n}$. We should really verify whether $L_{X}(\mu)$ is maximized or minimized (or neither) at this point, but we leave the checking to you (e.g., by looking at the second derivative).
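
As a sanity check on this calculation, one can maximize the log-likelihood numerically and compare the result with $\bar{X}_{n}$. The sketch below assumes numpy and scipy are available and works with simulated data; it is an illustration of the idea, not part of the derivation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
mu_true, n = 2.0, 50                        # arbitrary simulation settings
X = rng.exponential(scale=mu_true, size=n)

# Negative log-likelihood of the Exponential with mean mu:
#   -log L_X(mu) = n*log(mu) + sum(X)/mu
def neg_log_lik(mu):
    return n * np.log(mu) + X.sum() / mu

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100), method="bounded")
print(res.x, X.mean())                      # numerical maximizer vs. the closed form X-bar
```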

Let us see the same methods at work in two more examples.

 

Example 156
Let $X_{1},\ldots ,X_{n}$ be i.i.d. Ber($p$) random variables where the value of $p$ is unknown. How to estimate it using the data $X=(X_{1},\ldots ,X_{n})$?

Method of moments : We observe that $p=\mathbf{E}_{p}[X_{1}]$, the mean of the distribution (also called population mean). Hence, a method of moments estimator would be the sample mean $\bar{X}_{n}$. In this case, $\mathbf{E}_{p}[X_{1}^{2}]=p$ again, but we don't get any new estimate because $X_{k}^{2}=X_{k}$ (as $X_{k}$ is $0$ or $1$).

Maximum likelihood method : Now we have a probability mass function instead of a density. The joint pmf of $X_{1},\ldots ,X_{n}$ is $f_{p}(x_{1},\ldots ,x_{n})=p^{\sum_{i=1}^{n}x_{i} }(1-p)^{n-\sum_{i=1}^{n}x_{i} }$ when each $x_{i}$ is $0$ or $1$. The likelihood function is $$ L_{X}(p) := p^{\sum_{i=1}^{n}X_{i} }(1-p)^{n-\sum_{i=1}^{n}X_{i} } = p^{n\bar{X}_{n} }(1-p)^{n(1-\bar{X}_{n})}. $$ We need to find the value of $p$ that maximizes $L_{X}(p)$. Here is a trick that almost always simplifies calculations (try it in the previous example too!). Instead of maximizing $L_{X}(p)$, maximize $\ell_{X}(p)=\log L_{X}(p)$ (called the log-likelihood function). Since ''$\log$'' is an increasing function, the maximizer remains the same. In our case, $$ \ell_{X}(p)=n\bar{X}_{n}\log p + n(1-\bar{X}_{n})\log (1-p). $$ Differentiating and setting equal to $0$, we get $\hat{p}_{MLE}=\bar{X}_{n}$. Again the sample mean is the maximum likelihood estimate.
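
Here is a similar sketch for Bernoulli data (again assuming numpy, with an arbitrary true value of $p$). It evaluates the log-likelihood on a grid of values of $p$ and checks that the grid maximizer agrees with the sample mean, which is also the method of moments estimate from above.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, n = 0.3, 100                        # arbitrary simulation settings
X = rng.binomial(1, p_true, size=n)         # i.i.d. Bernoulli(p_true)

# Log-likelihood: n*Xbar*log(p) + n*(1-Xbar)*log(1-p), evaluated on a grid in (0,1)
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = n * X.mean() * np.log(p_grid) + n * (1 - X.mean()) * np.log(1 - p_grid)

print(p_grid[np.argmax(log_lik)], X.mean())  # grid maximizer vs. the sample mean
```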

A last example.

 

Example 157
Consider the two-parameter Laplace-density $f_{\theta,\alpha}(x)=\frac{1}{2\alpha}e^{-\frac{|x-\theta|}{\alpha} }$ for all $x\in \mathbb{R}$. Check that $f_{\theta,\alpha}$ is indeed a density for all $\theta\in \mathbb{R}$ and $\alpha > 0$.

Now suppose we have data $X_{1},\ldots ,X_{n}$ i.i.d. from $f_{\theta,\alpha}$ where we do not know the values of $\theta$ and $\alpha$. How to estimate the parameters?

Method of moments : We compute $$\begin{align*} \mathbf{E}_{\theta,\alpha}[X_{1}]&=\frac{1}{2\alpha}\int\limits_{-\infty}^{+\infty}te^{-\frac{|t-\theta|}{\alpha} }dt = \frac{1}{2}\int\limits_{-\infty}^{+\infty}(\alpha s+\theta) e^{-|s|}ds = \theta.\\ \mathbf{E}_{\theta,\alpha}[X_{1}^{2}]&=\frac{1}{2\alpha}\int\limits_{-\infty}^{+\infty}t^{2}e^{-\frac{|t-\theta|}{\alpha} }dt = \frac{1}{2}\int\limits_{-\infty}^{+\infty}(\alpha s+\theta)^{2} e^{-|s|}ds = 2\alpha^{2}+\theta^{2}. \end{align*}$$ Thus the variance is $\mbox{Var}_{\theta,\alpha}(X_{1})=2\alpha^{2}$. Based on this, we can take the method of moments estimates to be $\hat{\theta}_{n}=\bar{X}_{n}$ (sample mean) and $\hat{\alpha}_{n}=\frac{1}{\sqrt{2} }s_{n}$ where $s_{n}^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\bar{X}_{n})^{2}$. At the moment the idea of defining the sample variance as $s_{n}^{2}$ (dividing by $n-1$) may look strange, and it might seem more natural to take $V_{n}:=\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\bar{X}_{n})^{2}$ as an estimate for the population variance. As we shall see later, $s_{n}^{2}$ has some desirable properties that $V_{n}$ lacks. Whenever we say sample variance, we mean $s_{n}^{2}$, unless stated otherwise.
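
A short sketch of these moment estimates on simulated data (assuming numpy; note that numpy's Laplace sampler uses exactly this $(\theta,\alpha)$ parameterization as its location and scale; the true values and sample size below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, alpha_true, n = 1.0, 2.0, 200   # arbitrary simulation settings
X = rng.laplace(loc=theta_true, scale=alpha_true, size=n)

theta_hat = X.mean()                        # matches E[X1] = theta
s2 = X.var(ddof=1)                          # sample variance s_n^2 (divides by n-1)
alpha_hat = np.sqrt(s2 / 2)                 # matches Var(X1) = 2*alpha^2

print(theta_hat, alpha_hat)                 # should be close to (theta_true, alpha_true)
```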

Maximum likelihood method : The likelihood function of the data is $$ L_{X}(\theta,\alpha)=\prod\limits_{k=1}^{n}\frac{1}{2\alpha}\exp\left\{-\frac{|X_{k}-\theta|}{\alpha}\right\}= 2^{-n}\alpha^{-n}\exp\left\{-\sum_{k=1}^{n}\frac{|X_{k}-\theta|}{\alpha}\right\}. $$ The log-likelihood function is $$\ell_{X}(\theta,\alpha)=\log L_{X}(\theta,\alpha)=-n\log 2 - n\log \alpha -\frac{1}{\alpha}\sum_{k=1}^{n}|X_{k}-\theta|.$$ We know that for fixed $X_{1},\ldots ,X_{n}$, the value of $\sum_{k=1}^{n}|X_{k}-\theta|$ is minimized when $\theta=M_{n}$, a median of $X_{1},\ldots ,X_{n}$ (strictly speaking the median may have several choices, all of them equally good). If you do not know this, here is an argument. Let $x_{1} < x_{2} < \ldots < x_{n}$ be $n$ distinct real numbers and let $a\in \mathbb{R}$. Rewrite $\sum_{k=1}^{n}|x_{k}-a|$ as $(|x_{1}-a|+|x_{n}-a|)+(|x_{2}-a|+|x_{n-1}-a|)+\ldots$. By the triangle inequality, we see that $$|x_{1}-a|+|x_{n}-a|\ge x_{n}-x_{1}, \quad |x_{2}-a|+|x_{n-1}-a|\ge x_{n-1}-x_{2}, \quad |x_{3}-a|+|x_{n-2}-a|\ge x_{n-2}-x_{3}, \ldots $$ Further, the first inequality is an equality if and only if $x_{1}\le a\le x_{n}$, the second is an equality if and only if $x_{2}\le a\le x_{n-1}$, etc. In particular, if $a$ is a median, then all these inequalities become equalities, which shows that a median minimizes the given sum. Thus we fix $\hat{\theta}=M_{n}$ and then maximize $\ell_{X}(\hat{\theta},\alpha)$ over $\alpha$ by differentiating. We get $\hat{\alpha}=\frac{1}{n}\sum_{k=1}^{n}|X_{k}-\hat{\theta}|$ (the sample mean absolute deviation about the median). Thus the MLE of $(\theta,\alpha)$ is $(\hat{\theta},\hat{\alpha})$.
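
The closed-form MLE takes only two lines to compute. The sketch below (numpy assumed, simulation settings arbitrary) computes the sample median and the mean absolute deviation about it.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, alpha_true, n = 1.0, 2.0, 200   # arbitrary simulation settings
X = rng.laplace(loc=theta_true, scale=alpha_true, size=n)

theta_mle = np.median(X)                    # the sample median minimizes sum |X_k - theta|
alpha_mle = np.abs(X - theta_mle).mean()    # mean absolute deviation about the median

print(theta_mle, alpha_mle)                 # should be close to (theta_true, alpha_true)
```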

In homeworks and tutorials you will see several other estimation problems, which we list in the exercises below.

 

Exercise 158
Find an estimate for the unknown parameters by the method of moments and the maximum likelihood method.
  1. $X_{1},\ldots, X_{n}$ are i.i.d. $N(\mu,1)$. Estimate $\mu$. How do your estimates change if the distribution is $N(\mu,2)$?
  2. $X_{1},\ldots, X_{n}$ are i.i.d. $N(0,{\sigma}^{2})$. Estimate ${\sigma}^{2}$. How do your estimates change if the distribution is $N(7,{\sigma}^{2})$?
  3. $X_{1},\ldots, X_{n}$ are i.i.d. $N(\mu,{\sigma}^{2})$. Estimate $\mu$ and ${\sigma}^{2}$.
[Note: The first case is the one where ${\sigma}^{2}$ is known and $\mu$ is unknown; the known value of ${\sigma}^{2}$ may be used in estimating $\mu$. The second case is similar, except that now $\mu$ is known and ${\sigma}^{2}$ is unknown. In the third case, both are unknown.]

 

Exercise 159
$X_{1},\ldots, X_{n}$ are i.i.d. $\mbox{Geo}(p)$. Estimate $\mu=1/p$.

 

Exercise 160
$X_{1},\ldots, X_{n}$ are i.i.d. $\mbox{Pois}(\lambda)$. Estimate $\lambda$.

 

Exercise 161
$X_{1},\ldots, X_{n}$ are i.i.d. $\mbox{Beta}(a,b)$. Estimate $a,b$.

The following exercise is approachable by the same methods but requires you to think a little.

 

Exercise 162
$X_{1},\ldots, X_{n}$ are i.i.d. $\mbox{Uniform}[a,b]$. Estimate $a,b$.

Chapter 30. Properties of estimates