Suppose we have a bivariate sample $(X_{1},Y_{1}),(X_{2},Y_{2}),\ldots ,(X_{n},Y_{n})$ i.i.d. from a joint density (or joint pmf) $f(x,y)$. The question is to decide whether $X_{i}$ is independent of $Y_{i}$.

 

Example 180
There are many situations in which such a problem arises. For example, suppose a group of students is given two exams, one testing mathematical skills and the other testing verbal skills, so that $X_{i}$ and $Y_{i}$ are the two scores of the $i$th student. The underlying goal may be to investigate whether the human brain has distinct centers for verbal and quantitative thinking.

 

Example 181
As another example, say we want to investigate whether smoking causes lung cancer. In this case, for each person in the sample, we take two measurements - $X$ (equals $1$ if smoker and $0$ if not) and $Y$ (equals $1$ if the person has lung cancer, $0$ if not). The resulting data may be summarized in a two-way table as follows. $$ \begin{array}{c|cc|c} & X=0 & X=1 & \\ \hline Y=0 & n_{0,0} & n_{0,1} & n_{0\cdot}\\ Y=1 & n_{1,0} & n_{1,1} & n_{1\cdot} \\ \hline & n_{\cdot 0} & n_{\cdot 1} & n \end{array} $$ Here the total sample is of $n$ persons and $n_{i,j}$ denote the numbers in each of the four cells. The numbers $n_{0\cdot}$ etc. denote row or column sums. The statistical problem is to check whether smoking ($X$) and incidence of lung cancer ($Y$) are positively correlated.
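
For concreteness, here is a minimal Python sketch (the helper name `two_way_table` is ours, not part of the notes) showing how such a table could be tabulated from the raw $0$/$1$ data:

```python
import numpy as np

def two_way_table(x, y):
    """x[i] = 1 if person i smokes (0 otherwise); y[i] = 1 if person i has lung cancer.
    Returns the 2x2 table with rows indexed by Y and columns by X, as in the display above."""
    table = np.zeros((2, 2), dtype=int)
    for xi, yi in zip(x, y):
        table[yi, xi] += 1   # table[a, b] = n_{a,b} = #{i : Y_i = a, X_i = b}
    return table
```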

Testing independence in bivariate normal: We shall not discuss this problem in detail but instead quickly give some pointers and move on. Here we have $(X_{i},Y_{i})$ i.i.d. bivariate normal random variables with $\mathbf{E}[X]=\mu_{1}$, $\mathbf{E}[Y]=\mu_{2}$, $\mbox{Var}(X)={\sigma}_{1}^{2}$, $\mbox{Var}(Y)={\sigma}_{2}^{2}$ and $\mbox{Corr}(X,Y)=\rho$. The testing problem is $H_{0}: \rho=0$ versus $H_{1}: \rho\not=0$. (Recall that if $(X,Y)$ is bivariate normal, then $X$ and $Y$ are independent if and only if $X$ and $Y$ are uncorrelated.)

The natural statistic to consider is the sample correlation coefficient (Pearson's $r$ statistic) $$ r_{n}:=\frac{s_{X,Y} }{s_{X}\, s_{Y} } $$ where $s_{X}^{2},s_{Y}^{2}$ are the sample variances of $X$ and $Y$ and $s_{X,Y}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})$ is the sample covariance. It is clear that the test should reject the null hypothesis if $r_{n}$ is far from $0$. To decide the threshold we need the distribution of $r_{n}$ under the null hypothesis.

Fisher: Under the null hypothesis, $r_{n}^{2}$ has the $\mbox{Beta}(\frac{1}{2}, \frac{n-2}{2})$ distribution.

Using this result, we can compute the threshold for rejection from the Beta distribution (of course, the explicit threshold can only be computed numerically). If the assumption of normality of the data is not satisfied, then this test is invalid. However, for large $n$, as usual, we can obtain an asymptotically level-$\alpha$ test.
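
As an illustration, here is a minimal Python sketch of the test based on Fisher's result (the function name and the use of numpy/scipy are our own choices, not part of the notes):

```python
import numpy as np
from scipy import stats

def pearson_independence_test(x, y, alpha=0.05):
    """Test H0: rho = 0 for bivariate normal data, using Fisher's result
    that r_n^2 ~ Beta(1/2, (n-2)/2) under the null hypothesis."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]                        # sample correlation r_n
    threshold = stats.beta.ppf(1 - alpha, 0.5, (n - 2) / 2)
    return r, r**2 > threshold                         # reject H0 if r_n^2 exceeds the Beta quantile
```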

Testing for independence in contingency tables: Here the measurements $X$ and $Y$ take values in $\{x_{1},\ldots ,x_{k}\}$ and $\{y_{1},\ldots ,y_{\ell}\}$, respectively. These $x_{i},y_{j}$ are categories, not numerical values (such as ``smoking'' and ``non-smoking''). Let the total number of samples be $n$ and let $N_{i,j}$ be the number of samples with values $(x_{i},y_{j})$. Let $N_{i\cdot}=\sum_{j}N_{i,j}$ and let $N_{\cdot j}=\sum_{i}N_{i,j}$.

We want to test $$\begin{align*} H_{0}&: X \mbox{ and } Y \mbox{ are independent} \\ H_{1}&: X \mbox{ and } Y \mbox{ are not independent}. \end{align*}$$

Let $\mu(i,j)=\mathbf{P}\{X=x_{i},Y=y_{j}\}$ be the joint pmf of $(X,Y)$ and let $p(i)$, $q(j)$ be the marginal pmfs of $X$ and $Y$ respectively. From the sample, our estimates for these probabilities would be $\hat{\mu}(i,j)=N_{i,j}/n$ and $\hat{p}(i)=N_{i\cdot}/n$ and $\hat{q}(j)=N_{\cdot j}/n$ (which are consistent in the sense that $\sum_{j}\hat{\mu}(i,j)=\hat{p}(i)$, etc.).

Under the null hypothesis we must have $\mu(i,j)=p(i)q(j)$. We test if these equalities hold (approximately) for the estimates. That is, define $$ T=\sum_{i=1}^{k}\sum_{j=1}^{\ell}\frac{(N_{i,j}-n\hat{p}(i)\hat{q}(j))^{2} }{n\hat{p}(i)\hat{q}(j)}. $$ Note that this is in the usual form of a $\chi^{2}$ statistic (sum of $(\mbox{observed}-\mbox{expected})^{2}/\mbox{expected}$), and that the expected counts can be computed directly from the row and column sums, since $n\hat{p}(i)\hat{q}(j)=N_{i\cdot}N_{\cdot j}/n$.

The number of terms is $k\ell$. We lose one d.f. as usual, but in addition we estimate $(k-1)$ parameters $p(i)$ (the last one $p(k)$ can be got from the others) and $(\ell-1)$ parameters $q(j)$. Consequently, the total degrees of freedom is $k\ell-1-(k-1)-(\ell-1)=(k-1)(\ell-1)$. For instance, for the $2\times 2$ smoking table of Example 181, the statistic has $(2-1)(2-1)=1$ degree of freedom.

Hence, we reject the null hypothesis if $T > \chi_{(k-1)(\ell-1)}^{2}(\alpha)$ to get an (approximately) level-$\alpha$ test.
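
A minimal Python sketch of this $\chi^{2}$ test (again, the function name and library choices are ours):

```python
import numpy as np
from scipy import stats

def chi2_independence_test(counts, alpha=0.05):
    """counts: a k x l array with counts[i, j] = N_{i,j}.
    Returns the statistic T and whether H0 is rejected at level alpha."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    # expected counts under independence: n * p_hat(i) * q_hat(j) = N_{i.} N_{.j} / n
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
    T = ((counts - expected) ** 2 / expected).sum()
    df = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    return T, T > stats.chi2.ppf(1 - alpha, df)

# e.g. for a 2x2 table such as the one in Example 181:
# T, reject = chi2_independence_test([[n00, n01], [n10, n11]])
```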

Chapter 40. Regression and Linear regression