Let $X$ be a random variable with distribution $F$. We shall assume that it has a pmf or a pdf, denoted by $f$.

 

Definition 132
The expected value (also called mean) of $X$ is defined as the quantity $\mathbf{E}[X]=\sum_{t}tf(t)$ if $f$ is a pmf and $\mathbf{E}[X]=\int_{-\infty}^{+\infty} t f(t)dt$ if $f$ is a pdf (provided the sum or the integral converges absolutely).
Note that this agrees with the definition we gave earlier for random variables with a pmf. It is possible to define the expected value for distributions without a pmf or pdf, but we shall not do it here.
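
For instance (a quick illustration using standard examples), if $X$ is the outcome of a fair die (pmf case) and $Y\sim \mbox{Exp}(\lambda)$ with density $\lambda e^{-\lambda t}$ for $t>0$ (pdf case), then $$\mathbf{E}[X]=\sum_{t=1}^{6}t\cdot \frac{1}{6}=\frac{7}{2}, \qquad \mathbf{E}[Y]=\int_{0}^{\infty}t\,\lambda e^{-\lambda t}\,dt=\frac{1}{\lambda}.$$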

Properties of expectation : Let $X,Y$ be random variables with pmfs (or pdfs) $f$ and $g$, respectively.

  1. Then, $\mathbf{E}[aX+bY]=a\mathbf{E}[X]+b\mathbf{E}[Y]$ for any $a,b\in \mathbb{R}$. In particular, for a constant random variable (i.e., $X=a$ with probability $1$ for some $a$), $\mathbf{E}[X]=a$. This is called linearity of expectation.
  2. If $X\ge Y$ (meaning, $X(\omega)\ge Y(\omega)$ for all $\omega$), then $\mathbf{E}[X]\ge \mathbf{E}[Y]$.
  3. If $\varphi:\mathbb{R}\rightarrow \mathbb{R}$, then $$\mathbf{E}[\varphi(X)]= \begin{cases} \sum\limits_{t}\varphi(t)f(t) & \mbox{ if $f$ is a pmf}. \\ \int_{-\infty}^{+\infty}\varphi(t)f(t) dt & \mbox{ if $f$ is a pdf}. \end{cases} $$
  4. More generally, if $(X_{1},\ldots ,X_{n})$ has joint pdf $f(t_{1},\ldots ,t_{n})$ and $V=T(X_{1},\ldots ,X_{n})$ (here $T:\mathbb{R}^{n}\rightarrow \mathbb{R}$), then $\mathbf{E}[V]=\int_{-\infty}^{\infty}\ldots \int_{-\infty}^{\infty}T(x_{1},\ldots ,x_{n})f(x_{1},\ldots ,x_{n})dx_{1}\ldots dx_{n}$.
For random variables on a discrete probability space (which then have a pmf), we have essentially proved all these properties (or you can easily do so). For random variables with a pdf, a proper proof requires a bit of work, so we shall just take these for granted.
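
As a quick illustration of property (3), take $X\sim \mbox{Unif}[0,1]$, so that $f(t)=1$ for $t\in[0,1]$, and $\varphi(t)=t^{2}$. Then $$\mathbf{E}[X^{2}]=\int_{0}^{1}t^{2}\,dt=\frac{1}{3},$$ without having to first work out the distribution of $X^{2}$.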

We state one more property of expectations, its relationship to independence.

 

Lemma 133
Let $X,Y$ be random variables on a common probability space. If $X$ and $Y$ are independent, then $\mathbf{E}[H_{1}(X)H_{2}(Y)]=\mathbf{E}[H_{1}(X)]\mathbf{E}[H_{2}(Y)]$ for any functions $H_{1},H_{2}:\mathbb{R}\rightarrow \mathbb{R}$ (for which the expectations make sense). In particular, $\mathbf{E}[XY]=\mathbf{E}[X]\mathbf{E}[Y]$.
Independence means that the joint density (analogous statements for the pmf case are omitted) of $(X,Y)$ is of the form $f(t,s)=g(t)h(s)$ where $g(t)$ is the density of $X$ and $h(s)$ is the density of $Y$. Hence, $$ \mathbf{E}[H_{1}(X)H_{2}(Y)]=\int\limits_{-\infty}^{\infty}\!\!\int\limits_{-\infty}^{\infty} H_{1}(t)H_{2}(s) f(t,s)\,dt\, ds = \left(\int\limits_{-\infty}^{\infty} H_{1}(t)g(t)\, dt \right)\left(\int\limits_{-\infty}^{\infty} H_{2}(s)h(s)\, ds \right) $$ which is precisely $\mathbf{E}[H_{1}(X)]\mathbf{E}[H_{2}(Y)]$.
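
For example, if $X$ and $Y$ are independent and each is uniform on $[0,1]$, then the joint density is $f(t,s)=1$ on the unit square and $$\mathbf{E}[XY]=\int_{0}^{1}\!\!\int_{0}^{1}ts\,dt\,ds=\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}=\mathbf{E}[X]\,\mathbf{E}[Y].$$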
Expectation is a very important quantity. Using it, we can define several other quantities of interest.

Discussion : For simplicity, let us take random variables to have densities in this discussion. You may adapt the remarks to the case of a pmf easily. The density contains all the information we need about a random variable. However, it is a function, which means that we have to know $f(t)$ for every $t$. In real life, we often have random variables whose pdf is unknown or impossible to determine. It would be better to summarize the main features of the distribution (i.e., the density) in a few numbers. That is what the quantities defined below try to do.

Mean : Mean is another term for expected value.

Quantiles : Let us assume that the CDF $F$ of $X$ is strictly increasing and continuous. Then $F^{-1}(t)$ is well defined for every $t\in (0,1)$. For each $t\in (0,1)$, the number $Q_{t}=F^{-1}(t)$ is called the $t$-quantile. For example, the $1/2$-quantile, also called the median, is the number $x$ such that $F(x)=\frac{1}{2}$ (unique when the CDF is strictly increasing and continuous). Similarly one defines the $1/4$-quantile and the $3/4$-quantile, and these are sometimes called quartiles. Another familiar quantity is the percentile, frequently used in reporting performance in competitive exams. For each $x$, the $x$-percentile is nothing but $F(x)$. For exam scores, it tells the proportion of exam-takers who scored less than or equal to $x$.
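
For a concrete example, let $X\sim \mbox{Exp}(\lambda)$, so that $F(x)=1-e^{-\lambda x}$ for $x\ge 0$, which is continuous and strictly increasing on $(0,\infty)$. Solving $F(Q_{t})=t$ gives $$Q_{t}=\frac{1}{\lambda}\log\frac{1}{1-t}, \qquad \mbox{ in particular } \mbox{med}(X)=Q_{1/2}=\frac{\log 2}{\lambda}.$$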

Moments : The quantity $\mathbf{E}[X^{k}]$ (if it exists) is called the $k^{\mbox{th} }$ moment of $X$.

Variance : Let $\mu=\mathbf{E}[X]$ and define ${\sigma}^{2}:=\mathbf{E}\left[ (X-\mu)^{2}\right]$. This is called the variance of $X$, also denoted by $\mbox{Var}(X)$. It can be written in other forms. For example, $$\begin{align*} {\sigma}^{2} &= \mathbf{E}[X^{2}+\mu^{2}-2\mu X] \qquad (\mbox{by expanding the square})\\ &= \mathbf{E}[X^{2}]+\mu^{2} - 2\mu \mathbf{E}[X] \qquad (\mbox{by property (1) above})\\ &= \mathbf{E}[X^{2}]-\mu^{2}. \end{align*}$$ That is $\mbox{Var}(X)=\mathbf{E}[X^{2}]-(\mathbf{E}[X])^{2}$.
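
This last formula is often the quickest way to compute a variance. For example, if $X$ takes the value $1$ with probability $p$ and $0$ with probability $1-p$ (a Bernoulli random variable), then $\mathbf{E}[X]=p$ and $\mathbf{E}[X^{2}]=p$, so $$\mbox{Var}(X)=\mathbf{E}[X^{2}]-(\mathbf{E}[X])^{2}=p-p^{2}=p(1-p).$$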

Standard deviation : The standard deviation of $X$ is defined as $\mbox{s.d.}(X):=\sqrt{\mbox{Var}(X)}$.

Mean absolute deviation : The mean absolute deviation of $X$ is defined as $\mathbf{E}[|X-\mbox{med}(X)|]$, where $\mbox{med}(X)$ denotes the median of $X$.

Coefficient of variation : The coefficient of variation of $X$ is defined as $\mbox{c.v.}(X)=\frac{\mbox{s.d.}(X)}{|{\bf E}[X]|}$.

Covariance : Let $X,Y$ be random variables on a common probability space. The covariance of $X$ and $Y$ is defined as $\mbox{Cov}(X,Y)=\mathbf{E}[(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])]$. It can also be written as $\mbox{Cov}(X,Y)=\mathbf{E}[XY]-\mathbf{E}[X]\mathbf{E}[Y]$.
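
The second form follows by expanding the product, just as in the computation of the variance above: $$\mathbf{E}\left[(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])\right]=\mathbf{E}[XY]-\mathbf{E}[X]\mathbf{E}[Y]-\mathbf{E}[Y]\mathbf{E}[X]+\mathbf{E}[X]\mathbf{E}[Y]=\mathbf{E}[XY]-\mathbf{E}[X]\mathbf{E}[Y].$$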

Correlation : Let $X,Y$ be random variables on a common probability space. Their correlation is defined as $\mbox{Corr}(X,Y)=\frac{\mbox{Cov}(X,Y)}{\sqrt{\mbox{Var}(X)}\sqrt{\mbox{Var}(Y)} }$.

Entropy : The entropy of a random variable $X$ is defined as $$ \mbox{Ent}(X) = \begin{cases} -\sum\limits_{i} f(t_{i}) \log(f(t_{i})) & \mbox{ if }X\mbox{ has pmf }f,\\ -\int\limits_{-\infty}^{+\infty} f(t) \log(f(t))\, dt & \mbox{ if }X \mbox{ has pdf }f. \end{cases} $$ If $\mathbf{X}=(X_{1},\ldots ,X_{n})$ is a random vector, we define its entropy by exactly the same expressions, except that we use the joint pmf or pdf of $\mathbf{X}$ and the sum or integral is over points in $\mathbb{R}^{n}$.
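
For example, if $X$ takes the values $t_{1},\ldots ,t_{n}$ with equal probability $1/n$, then $$\mbox{Ent}(X)=-\sum_{i=1}^{n}\frac{1}{n}\log\frac{1}{n}=\log n;$$ for a fair coin ($n=2$) this is $\log 2$.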

Discussion : What do these quantities mean?

Measures of central tendency : Mean and median try to summarize the distribution of $X$ by a single number. Of course one number cannot capture the whole distribution, so there are many densities and mass functions that have the same mean or median. Which is better, mean or median? This question has no unambiguous answer. The mean has excellent mathematical properties (mainly linearity) which the median lacks ($\mbox{med}(X+Y)$ bears no general relationship to $\mbox{med}(X)+\mbox{med}(Y)$). On the other hand, the mean is sensitive to outliers, while the median is far less so. For example, if the average income in a village of 50 people is $1000$ Rs. per month, the immigration of a multi-millionaire to the village will change the mean drastically, but the median remains about the same. This is good if, by giving one number, we are hoping to express the state of a typical individual in the population.

Measures of dispersion : Suppose the average height of people in a city is 160 cm. This could be because everyone is exactly 160 cm, or because half the people are 100 cm while the other half are 220 cm, or, alternatively, the heights could be uniformly spread over 150-170 cm, etc. How widely the distribution is spread is measured by the standard deviation and the mean absolute deviation. Since we want the deviation from the mean, $\mathbf{E}[X-\mathbf{E}[X]]$ looks natural, but this is zero because of cancellation between positive and negative deviations. To prevent cancellation, we may put absolute values (giving the $\mbox{m.a.d.}$, although that is usually taken around the median) or we may square the deviations before taking the expectation (giving the variance, and then the standard deviation). Variance and standard deviation have much better mathematical properties (as we shall see) and hence are usually preferred.

The standard deviation has the same units as the quantity itself. For example, if the mean height is 160 cm with a standard deviation of 10 cm, and the mean weight is 55 kg with a standard deviation of 5 kg, then we cannot say which of the two is less variable. To make such a comparison we need a dimension-free quantity (a pure number). The coefficient of variation is such a quantity, as it measures the standard deviation relative to the mean. For the height and weight data just described, the coefficients of variation are 1/16 and 1/11, respectively. Hence we may say that height is less variable than weight in this example.

Measures of association : The marginal distributions do not determine the joint distribution. For example, if $(X,Y)$ is a point chosen at random from the unit square (with vertices $(0,0),(1,0),(0,1),(1,1)$), then $X$ and $Y$ both have marginal distributions that are uniform on $[0,1]$. If $(U,V)$ is a point picked at random from the diagonal (the line segment from $(0,0)$ to $(1,1)$), then again $U$ and $V$ have marginals that are uniform on $[0,1]$. But the two joint distributions are completely different. In particular, giving the means and standard deviations of $X$ and $Y$ does not tell us anything about possible relationships between the two.

Covariance is the quantity that is used to measure the ''association'' of $Y$ and $X$. Correlation is a dimension free quantity that measures the same. For example, we shall see that if $Y=X$, then $\mbox{Corr}(X,Y)=+1$, if $Y=-X$ then $\mbox{Corr}(X,Y)=-1$. Further, if $X$ and $Y$ are independent, then $\mbox{Corr}(X,Y)=0$. In general, if an increase in $X$ is likely to mean an increase in $Y$, then the correlation is positive and if an increase in $X$ is likely to mean a decrease in $Y$ then the correlation is negative.
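
To see the two joint distributions described above (the point in the unit square and the point on the diagonal) side by side numerically, here is a minimal simulation sketch; it assumes NumPy is available, and the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# (X, Y): a point chosen uniformly at random from the unit square
x, y = rng.random(n), rng.random(n)

# (U, V): a point chosen uniformly at random from the diagonal segment,
# so U and V are equal by construction
u = rng.random(n)
v = u.copy()

def sample_corr(a, b):
    """Sample analogue of Cov(a, b) / (sd(a) * sd(b))."""
    return np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())

print(sample_corr(x, y))  # close to 0: X and Y are independent
print(sample_corr(u, v))  # essentially 1: V = U
```

Both pairs have the same uniform marginals on $[0,1]$, yet the estimated correlation is close to $0$ in the first case and equal to $1$ in the second.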

 

Example 134
Let $X\sim N(\mu,{\sigma}^{2})$. Recall that its density is $\frac{1}{{\sigma} \sqrt{2\pi} }e^{-\frac{(x-\mu)^{2} }{2{\sigma}^{2} }}$. We can compute $$\begin{align*} \mathbf{E}[X] &=\frac{1}{{\sigma} \sqrt{2\pi} }\int\limits_{-\infty}^{+\infty}xe^{-\frac{(x-\mu)^{2} }{2{\sigma}^{2} }} dx =\mu. \end{align*}$$ On the other hand $$\begin{align*} \mbox{Var}(X) &= \frac{1}{{\sigma} \sqrt{2\pi} }\int\limits_{-\infty}^{+\infty}(x-\mu)^{2}e^{-\frac{(x-\mu)^{2} }{2{\sigma}^{2} }} dx = {\sigma}^{2}\frac{1}{\sqrt{2\pi} } \int\limits_{-\infty}^{+\infty}u^{2}e^{-\frac{u^{2} }{2} } du \qquad (\mbox{substitute }x = \mu+{\sigma} u) \\ &= {\sigma}^{2}\frac{2}{\sqrt{2\pi} } \int\limits_{0}^{+\infty}u^{2}e^{-\frac{u^{2} }{2} } du = {\sigma}^{2}\frac{2\sqrt{2} }{\sqrt{2\pi} } \int\limits_{0}^{+\infty}\sqrt{t}e^{-t} dt \qquad (\mbox{substitute }t=u^{2}/2) \\ &= {\sigma}^{2}\frac{2\sqrt{2} }{\sqrt{2\pi} }\Gamma(3/2) = {\sigma}^{2}. \end{align*}$$ To get the last line, observe that $\Gamma(3/2)=\frac{1}{2}\Gamma(1/2)$ and $\Gamma(1/2)=\sqrt{\pi}$. Thus we now have a meaning for the parameters $\mu$ and ${\sigma}^{2}$ - they are the mean and variance of the $N(\mu,{\sigma}^{2})$ distribution. Again note that the mean is the same for all $N(0,{\sigma}^{2})$ distributions but the variances are different, capturing the spread of the distribution.
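
As a numerical sanity check of this computation, one may estimate the mean and variance of a large normal sample by simulation (a sketch, again assuming NumPy; the particular values of $\mu$ and $\sigma$ are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0

# a large sample from N(mu, sigma^2)
sample = rng.normal(loc=mu, scale=sigma, size=1_000_000)

print(sample.mean())  # approximately mu
print(sample.var())   # approximately sigma^2
```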

 

Exercise 135
Let $X\sim N(0,1)$. Show that $\mathbf{E}[X^{n}]=0$ if $n$ is odd and if $n$ is even then $\mathbf{E}[X^{n}]=(n-1)(n-3)\ldots (3)(1)$ (product of all odd numbers up to and including $n-1$). What happens if $X\sim N(0,{\sigma}^{2})$?

 

Exercise 136
Calculate the mean and variance for the following distributions.
  1. $X\sim \mbox{Geo}(p)$. $\mathbf{E}[X]=\frac{1}{p}$ and $\mbox{Var}(X)=\frac{q}{p^{2} }$.
  2. $X\sim \mbox{Bin}(n,p)$. $\mathbf{E}[X]=np$ and $\mbox{Var}(X)=npq$.
  3. $X\sim \mbox{Pois}(\lambda)$. $\mathbf{E}[X]=\lambda$ and $\mbox{Var}(X)=\lambda$.
  4. $X\sim \mbox{Hypergeo}(N_{1},N_{2},m)$. $\mathbf{E}[X]=\frac{mN_{1} }{N_{1}+N_{2} }$ and $\mbox{Var}(X)=??$.

 

Exercise 137
Calculate the mean and variance for the following distributions.
  1. $X\sim \mbox{Exp}(\lambda)$. $\mathbf{E}[X]=\frac{1}{\lambda}$ and $\mbox{Var}(X)=\frac{1}{\lambda^{2} }$.
  2. $X\sim \mbox{Gamma}(\nu,\lambda)$. $\mathbf{E}[X]=\frac{\nu}{\lambda}$ and $\mbox{Var}(X)=\frac{\nu}{\lambda^{2} }$.
  3. $X\sim \mbox{Unif}[0,1]$. $\mathbf{E}[X]=\frac{1}{2}$ and $\mbox{Var}(X)=\frac{1}{12}$.
  4. $X\sim \mbox{Beta}(p,q)$. $\mathbf{E}[X]=\frac{p}{p+q}$ and $\mbox{Var}(X)=\frac{pq}{(p+q)^{2}(p+q+1)}$.

Properties of covariance and variance : Let $X,Y,X_{i},Y_{i}$ be random variables on a common probability space. Lowercase letters $a,b,c$, etc., will denote scalars.

  1. (Bilinearity): $\mbox{ Cov}(aX_{1}+bX_{2},Y)=a\mbox{ Cov}(X_{1},Y)+b\mbox{ Cov}(X_{2},Y)$ and $\mbox{ Cov}(Y,aX_{1}+bX_{2})=a\mbox{ Cov}(Y,X_{1})+b\mbox{ Cov}(Y,X_{2})$.
  2. (Symmetry): $\mbox{ Cov}(X,Y)=\mbox{ Cov}(Y,X)$.
  3. (Positivity): $\mbox{ Cov}(X,X)\ge 0$ with equality if and only if $X$ is a constant random variable. Indeed, $\mbox{ Cov}(X,X)=\mbox{Var}(X)$.
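
As an illustration of how these properties are used, bilinearity and symmetry together give $$\mbox{Var}(X+Y)=\mbox{Cov}(X+Y,X+Y)=\mbox{Var}(X)+\mbox{Var}(Y)+2\,\mbox{Cov}(X,Y).$$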

 

Exercise 138
Show that $\mbox{Var}(cX)=c^{2}\mbox{Var}(X)$ (hence $\mbox{sd}(cX)=|c|\mbox{sd}(X)$). Further, if $X$ and $Y$ are independent, then $\mbox{Var}(X+Y)=\mbox{Var}(X)+\mbox{Var}(Y)$.

Note that the properties of covariance are very much like the properties of inner products in vector spaces. In particular, we have the following analogue of the well-known inequality for vectors, $(\mathbf{u}\cdot \mathbf{v})^{2}\le(\mathbf{u}\cdot \mathbf{u})(\mathbf{v}\cdot \mathbf{v})$.

Cauchy-Schwarz inequality : If $X$ and $Y$ are random variables with finite variances, then $(\mbox{ Cov}(X,Y))^{2}\le \mbox{Var}(X)\mbox{Var}(Y)$ with equality if and only if $Y=aX+b$ for some scalars $a,b$.

If not convinced, follow the proof of the Cauchy-Schwarz inequality that you have seen for vectors: start from the fact that $\mbox{Var}(X+tY)\ge 0$ for every scalar $t$, and choose an appropriate $t$ to get the Cauchy-Schwarz inequality.
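
In more detail (a sketch, using the properties of variance and covariance above): for every $t\in\mathbb{R}$, $$0\le \mbox{Var}(X+tY)=\mbox{Var}(X)+2t\,\mbox{Cov}(X,Y)+t^{2}\,\mbox{Var}(Y).$$ If $\mbox{Var}(Y)=0$, then $Y$ is constant, $\mbox{Cov}(X,Y)=0$, and the inequality is trivial. Otherwise, this is a quadratic in $t$ that never takes negative values, so its discriminant satisfies $4(\mbox{Cov}(X,Y))^{2}-4\,\mbox{Var}(X)\mbox{Var}(Y)\le 0$, which is the claimed inequality. Equality forces the discriminant to vanish, so $\mbox{Var}(X+t_{0}Y)=0$ for some $t_{0}$, i.e., $X+t_{0}Y$ is constant, which (apart from degenerate cases) is exactly the relation $Y=aX+b$.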

Chapter 22. Markov's and Chebyshev's inequalities