Consider the following examples.
This is the framework in which we would study the second example above, namely the life-time distribution of light bulbs. Observe that we have parameterized the exponential family of distributions differently from usual. We could equivalently have considered $g_{\lambda}(x)=\lambda e^{-\lambda x}$, but then the interest is in estimating $1/\lambda$ (which is the expected value) rather than $\lambda$. Here are two methods.
Method of moments : We observe that $\mu=\mathbf{E}_{\mu}[X_{1}]$, the mean of the distribution (also called population mean). Hence it seems reasonable to take the sample mean $\bar{X}_{n}$ as an estimate. On second thought, we realize that $\mathbf{E}_{\mu}[X_{1}^{2}]=2\mu^{2}$ and hence $\mu=\sqrt{\frac{1}{2}\mathbf{E}_{\mu}[X_{1}^{2}]}$. Therefore it also seems reasonable to take the corresponding sample quantity, $T_{n}:=\sqrt{\frac{1}{2n}(X_{1}^{2}+\ldots +X_{n}^{2})}$ as an estimate for $\mu$. One can go further and write $\mu$ in various ways as $\mu=\sqrt{\mbox{Var}_{\mu}(X_{1})}$, $\mu=\sqrt[3]{\frac{1}{6}\mathbf{E}_{\mu}[X_{1}^{3}]}$ etc. Each such expression motivates an estimate, just by substituting sample moments for population moments.
This is called estimating by the method of moments because we are equating the sample moments to population moments to obtain the estimate.
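To make this concrete, here is a small numerical sketch (not part of the text above): it simulates exponentially distributed life-times with an assumed mean and computes the three moment-based estimates of $\mu$ described above. The chosen value of $\mu$, the sample size, and the variable names are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 2.0                                    # assumed "true" mean, for illustration only
x = rng.exponential(scale=mu_true, size=1000)    # NumPy's scale parameter is the mean

est_mean   = x.mean()                            # from mu = E[X]
est_second = np.sqrt(0.5 * np.mean(x**2))        # from mu = sqrt(E[X^2]/2)
est_third  = np.cbrt(np.mean(x**3) / 6.0)        # from mu = (E[X^3]/6)^(1/3)

print(est_mean, est_second, est_third)           # all three should be close to mu_true
```

For large $n$ all three estimates should be close to the assumed mean, though they need not coincide on any given sample.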
We can also use other features of the distribution, such as quantiles (we may call this the ''method of quantiles''). In other words, obtain estimates by equating the sample quantiles to population quantiles. For example, the median of $X_{1}$ is $\mu\log 2$, hence a reasonable estimate for $\mu$ is $M_{n}/\log 2$, where $M_{n}$ is a sample median. Alternatively, the $25\%$ quantile of the $\mbox{Exponential}(1/\mu)$ distribution is $\mu\log(4/3)$ and hence another estimate for $\mu$ is $Q_{n}/\log(4/3)$, where $Q_{n}$ is a $25\%$ sample quantile.
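The quantile-based estimates can be sketched in the same way; again the simulated data and the names are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)            # simulated life-times with mean 2.0

est_median   = np.median(x) / np.log(2)              # population median = mu * log 2
est_quartile = np.quantile(x, 0.25) / np.log(4 / 3)  # 25% quantile = mu * log(4/3)
print(est_median, est_quartile)                      # both should be close to 2.0
```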
Maximum likelihood method : The joint density of $X_{1},\ldots ,X_{n}$ is $$g_{\mu}(x_{1},\ldots ,x_{n})=\mu^{-n}e^{-\frac{1}{\mu}(x_{1}+\ldots +x_{n})} \qquad \mbox{ if all }x_{i} > 0$$ (since $X_{i}$ are independent, the joint density is a product). We evaluate the joint density at the observed data values. This is called the likelihood function. In other words, define, $$ L_{X}(\mu) := \mu^{-n}e^{-\frac{1}{\mu}\sum_{i=1}^{n}X_{i} }. $$ Two points are worth noting: first, this is the joint density of $X_{1},\ldots ,X_{n}$ evaluated at the observed data; second, we think of it as a function of $\mu$, with $X:=(X_{1},\ldots ,X_{n})$ held fixed.
If $\mu$ were the actual value, then $L_{X}(\mu)$ would be the ''likelihood'' of seeing the data that we have actually observed. The maximum likelihood estimate is that value of $\mu$ that maximizes the likelihood function. In our case, by differentiating and setting equal to zero we get, $$ 0 =\frac{d}{d\mu}L_{X}(\mu) = -n\mu^{-n-1}e^{-\frac{1}{\mu}\sum_{i=1}^{n}X_{i} }+\mu^{-n}\left(\frac{1}{\mu^{2} }\sum_{i=1}^{n}X_{i}\right)e^{-\frac{1}{\mu}\sum_{i=1}^{n}X_{i} } $$ which is satisfied when $\mu=\frac{1}{n}\sum_{i=1}^{n}X_{i}=\bar{X}_{n}$. To distinguish this from the true value of $\mu$, which is unknown, it is customary to put a hat on the letter $\mu$. We write $\hat{\mu}_{MLE}=\bar{X}_{n}$. We should really verify whether $L_{X}(\mu)$ is maximized or minimized (or neither) at this point, but we leave it to you to do the checking (eg., by looking at the second derivative).
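As a sanity check, one can also maximize the log-likelihood numerically and compare with $\bar{X}_{n}$. The sketch below does this with a crude grid search over $\mu$; the data, the grid range, and the function name are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)          # illustrative data with true mean 2.0

def log_lik(mu, x):
    # log-likelihood of the Exponential(mean = mu) model: -n*log(mu) - sum(x)/mu
    return -len(x) * np.log(mu) - x.sum() / mu

grid = np.linspace(0.1, 10.0, 2000)               # crude grid search over mu
mu_hat_grid = grid[np.argmax([log_lik(m, x) for m in grid])]

print(mu_hat_grid, x.mean())                      # the grid maximizer is close to the sample mean
```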
Let us see the same methods at work in two more examples.
Method of moments : We observe that $p=\mathbf{E}_{p}[X_{1}]$, the mean of the distribution (also called population mean). Hence, a method of moments estimator would be the sample mean $\bar{X}_{n}$. In this case, $\mathbf{E}_{p}[X_{1}^{2}]=p$ again, but we do not get any new estimate because $X_{k}^{2}=X_{k}$ (as $X_{k}$ is $0$ or $1$).
Maximum likelihood method : Now we have a probability mass function instead of a density. The joint pmf of $X_{1},\ldots ,X_{n}$ is $f_{p}(x_{1},\ldots ,x_{n})=p^{\sum_{i=1}^{n}x_{i} }(1-p)^{n-\sum_{i=1}^{n}x_{i} }$ when each $x_{i}$ is $0$ or $1$. The likelihood function is $$ L_{X}(p) := p^{\sum_{i=1}^{n}X_{i} }(1-p)^{n-\sum_{i=1}^{n}X_{i} } = p^{n\bar{X}_{n} }(1-p)^{n(1-\bar{X}_{n})}. $$ We need to find the value of $p$ that maximizes $L_{X}(p)$. Here is a trick that almost always simplifies calculations (try it in the previous example too!). Instead of maximizing $L_{X}(p)$, maximize $\ell_{X}(p)=\log L_{X}(p)$ (called the log-likelihood function). Since ''$\log$'' is an increasing function, the maximizer will remain the same. In our case, $$ \ell_{X}(p)=n\bar{X}_{n}\log p + n(1-\bar{X}_{n})\log (1-p). $$ Differentiating and setting equal to $0$, we get $\hat{p}_{MLE}=\bar{X}_{n}$. Again the sample mean is the maximum likelihood estimate.
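A short sketch, analogous to the previous one, confirming numerically that the Bernoulli log-likelihood is maximized at the sample mean; the simulated data, the true value of $p$, and the grid are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=1000)               # Bernoulli(p = 0.3) sample, illustrative

def log_lik(p, x):
    # log-likelihood: n*xbar*log(p) + n*(1 - xbar)*log(1 - p)
    n, xbar = len(x), x.mean()
    return n * xbar * np.log(p) + n * (1 - xbar) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 999)             # avoid the endpoints, where log blows up
p_hat = grid[np.argmax([log_lik(p, x) for p in grid])]
print(p_hat, x.mean())                            # both close to 0.3
```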
A last example.
Now suppose we have data $X_{1},\ldots ,X_{n}$ i.i.d. from $f_{\theta,\alpha}$ where we do not know the values of $\theta$ and $\alpha$. How do we estimate the parameters?
Method of moments : We compute $$\begin{align*} \mathbf{E}_{\theta,\alpha}[X_{1}]&=\frac{1}{2\alpha}\int\limits_{-\infty}^{+\infty}te^{-\frac{|t-\theta|}{\alpha} }dt = \frac{1}{2}\int\limits_{-\infty}^{+\infty}(\alpha s+\theta) e^{-|s|}ds = \theta.\\ \mathbf{E}_{\theta,\alpha}[X_{1}^{2}]&=\frac{1}{2\alpha}\int\limits_{-\infty}^{+\infty}t^{2}e^{-\frac{|t-\theta|}{\alpha} }dt = \frac{1}{2}\int\limits_{-\infty}^{+\infty}(\alpha s+\theta)^{2} e^{-|s|}ds = 2\alpha^{2}+\theta^{2}. \end{align*}$$ Thus the variance is $\mbox{Var}_{\theta,\alpha}(X_{1})=2\alpha^{2}$. Based on this, we can take the method of moments estimates to be $\hat{\theta}_{n}=\bar{X}_{n}$ (sample mean) and $\hat{\alpha}_{n}=\frac{1}{\sqrt{2} }s_{n}$, where $s_{n}^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\bar{X}_{n})^{2}$. At the moment, the idea of defining the sample variance as $s_{n}^{2}$ may look strange, and it might seem more natural to take $V_{n}:=\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\bar{X}_{n})^{2}$ as an estimate for the population variance. As we shall see later, $s_{n}^{2}$ has some desirable properties that $V_{n}$ lacks. Whenever we say sample variance, we mean $s_{n}^{2}$, unless stated otherwise.
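A sketch of these method of moments estimates on simulated double exponential data; the parameter values, the sample size, and the variable names are our assumptions, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true, alpha_true = 1.0, 2.0                       # illustrative parameter values
x = rng.laplace(loc=theta_true, scale=alpha_true, size=2000)

theta_hat = x.mean()                                    # from theta = E[X]
s2 = x.var(ddof=1)                                      # sample variance s_n^2 (divisor n-1)
alpha_hat = np.sqrt(s2 / 2)                             # from Var(X) = 2*alpha^2
print(theta_hat, alpha_hat)                             # close to (1.0, 2.0)
```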
Maximum likelihood method : The likelihood function of the data is $$ L_{X}(\theta,\alpha)=\prod\limits_{k=1}^{n}\frac{1}{2\alpha}\exp\left\{-\frac{|X_{k}-\theta|}{\alpha}\right\}= 2^{-n}\alpha^{-n}\exp\left\{-\sum_{k=1}^{n}\frac{|X_{k}-\theta|}{\alpha}\right\}. $$ The log-likelihood function is $$\ell_{X}(\theta,\alpha)=\log L_{X}(\theta,\alpha)=-n\log 2 - n\log \alpha -\frac{1}{\alpha}\sum_{k=1}^{n}|X_{k}-\theta|.$$ We know that for fixed $X_{1},\ldots ,X_{n}$, the value of $\sum_{k=1}^{n}|X_{k}-\theta|$ is minimized when $\theta=M_{n}$, the median of $X_{1},\ldots ,X_{n}$ (strictly speaking the median may have several choices, all of them equally good). If you do not know this fact, here is an argument. Let $x_{1} < x_{2} < \ldots < x_{n}$ be $n$ distinct real numbers and let $a\in \mathbb{R}$. Rewrite $\sum_{k=1}^{n}|x_{k}-a|$ as $(|x_{1}-a|+|x_{n}-a|)+(|x_{2}-a|+|x_{n-1}-a|)+\ldots$. By the triangle inequality, we see that $$|x_{1}-a|+|x_{n}-a|\ge x_{n}-x_{1}, \quad |x_{2}-a|+|x_{n-1}-a|\ge x_{n-1}-x_{2}, \quad |x_{3}-a|+|x_{n-2}-a|\ge x_{n-2}-x_{3},\ \ldots $$ Further, the first inequality is an equality if and only if $x_{1}\le a\le x_{n}$, the second is an equality if and only if $x_{2}\le a\le x_{n-1}$, etc. In particular, if $a$ is a median, then all these inequalities become equalities, which shows that a median minimizes the given sum. Thus we fix $\hat{\theta}=M_{n}$ and then maximize $\ell_{X}(\hat{\theta},\alpha)$ over $\alpha$ by differentiating. We get $\hat{\alpha}=\frac{1}{n}\sum_{k=1}^{n}|X_{k}-\hat{\theta}|$ (the sample mean absolute deviation about the median). Thus the MLE of $(\theta,\alpha)$ is $(\hat{\theta},\hat{\alpha})$.
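A matching sketch of the maximum likelihood estimates, using the sample median and the mean absolute deviation about it; the same illustrative assumptions as in the previous sketch apply.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.laplace(loc=1.0, scale=2.0, size=2000)    # illustrative double exponential data

theta_mle = np.median(x)                          # the sample median minimizes sum |X_k - theta|
alpha_mle = np.mean(np.abs(x - theta_mle))        # mean absolute deviation about the median
print(theta_mle, alpha_mle)                       # close to (1.0, 2.0)
```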
In homeworks and tutorials you will see several other estimation problems which we list in the exercise below.
The following exercise is approachable by the same methods but requires you to think a little.