Earlier in the course we discussed the problem of how to test whether a ''psychic'' can make predictions better than a random guesser. This is a prototype of what are called testing problems. We start with this simple example and introduce various general terms and notions in the context of this problem.

 

Question 173
A ''psychic'' claims to be able to guess the order of cards in a deck. We shuffle a deck of cards, ask her to guess the cards in order, and count the number of correct guesses, say $X$.

One hypothesis (we call it the null hypothesis and denote it by $H_{0}$) is that the psychic is guessing randomly. The alternative hypothesis (denoted $H_{1}$) is that her guesses are better than random guessing (in itself this does not imply the existence of psychic powers; it could be that she has managed to see some of the cards, etc.). Can we decide between the two hypotheses based on $X$?

What we need is a rule for deciding which hypothesis to accept. A rule for deciding between the hypotheses is called a test. The following are examples of rules (the only condition is that a rule must depend only on the data at hand).

 

Example 174
We present three possible rules.
  1. If $X$ is an even number declare that $H_{1}$ is true. Else declare that $H_{1}$ is false.
  2. If $X\ge 5$, then accept $H_{1}$, else reject $H_{1}$.
  3. If $X\ge 8$, then accept $H_{1}$, else reject $H_{1}$.
The first rule does not make much sense as the parity (evenness or oddness) has little to do with either hypothesis. On the other hand, the other two rules make some sense. They rely on the fact that if $H_{1}$ is true then we expect $X$ to be larger than if $H_{0}$ is true. But the question still remains, should we draw the line at $5$ or at $8$ or somewhere else?
In testing problems the objective is to avoid, as far as possible, the following two types of mistakes. $$\begin{align*} \mbox{Type-I error:} & H_{0} \mbox{ is true but our rule concludes }H_{1}. \\ \mbox{Type-II error:} & H_{1} \mbox{ is true but our rule concludes }H_{0}. \end{align*}$$ The probability of a type-I error is called the significance level of the test and is usually denoted by $\alpha$. That is, $\alpha=\mathbf{P}_{H_{0} }\{\mbox{the test accepts }H_{1}\}$, where we write $\mathbf{P}_{H_{0} }$ to mean that the probability is calculated under the assumption that $H_{0}$ is true. Similarly, one defines the power of the test as $\beta=\mathbf{P}_{H_{1} }\{\mbox{the test accepts }H_{1}\}$. Note that $\beta$ is the probability of not making a type-II error, and hence we would like it to be close to $1$. Given two tests with the same level of significance, the one with higher power is better. Ideally we would like the probabilities of both types of error to be small, but that is not always achievable.
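To see these definitions in action, here is a minimal sketch (in Python) of how $\alpha$ and $\beta$ could be computed for the rule ''accept $H_{1}$ if $X\ge 5$''. It uses the Poisson approximation discussed below for $H_{0}$ and, purely as a hypothetical illustration, takes $X$ to be Poisson with mean $5$ under $H_{1}$ (as if the psychic reliably sees about five cards); the alternative model and the function name are our own assumptions, not part of the example.

```python
from math import exp, factorial

def poisson_tail(k0, lam):
    """P{X >= k0} when X has the Poisson(lam) distribution."""
    return 1.0 - sum(exp(-lam) * lam**j / factorial(j) for j in range(k0))

# Rule: accept H1 if X >= 5.
alpha = poisson_tail(5, lam=1.0)  # significance level: under H0, X is roughly Poisson(1)
beta = poisson_tail(5, lam=5.0)   # power under the hypothetical alternative X ~ Poisson(5)
print(alpha, beta)                # roughly 0.004 and 0.56
```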

We fix the desired level of significance, usually $\alpha=0.05$ or $0.1$, and only consider tests whose probability of type-I error is at most $\alpha$. It may seem surprising that we take $\alpha$ to be so small. Indeed, the two hypotheses are not treated equally. Usually $H_{0}$ is the default option, representing the traditional belief, while $H_{1}$ is a claim that must prove itself. As such, the burden of proof is on $H_{1}$.

To use an analogy with law: when a person stands trial, there are two hypotheses, one that the person is guilty and the other that he or she is not guilty. According to the maxim ''innocent until proven guilty'', the accused is not required to prove his or her innocence; on the other hand, guilt must be proved. Thus the null hypothesis is ''not guilty'' and the alternative hypothesis is ''guilty''.

In our example of card-guessing, assuming random guessing, we calculated the distribution of $X$ earlier in the course. Let $p_{k}=\mathbf{P}\{X=k\}$ for $k=0,1,\ldots ,52$. Now consider a test of the form ''accept $H_{1}$ if $X\ge k_{0}$ and reject otherwise''. Its level of significance is $$ \mathbf{P}_{H_{0} }\{\mbox{accept }H_{1}\} = \mathbf{P}_{H_{0} }\{X\ge k_{0}\} = \sum_{i=k_{0} }^{52}p_{i}. $$ For $k_{0}=0$, the right-hand side is $1$, while for $k_{0}=52$ it is $1/52!$, which is tiny. As we increase $k_{0}$, there is a first value at which the sum becomes less than or equal to $\alpha$. We take that $k_{0}$ as the cut-off threshold.
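As a concrete illustration, the exact distribution of $X$ (the number of fixed points of a uniformly random permutation of $52$ cards) can be computed from derangement counts: the number of permutations with exactly $k$ fixed points is $\binom{52}{k}D_{52-k}$, where $D_{n}$ is the number of derangements of $n$ objects. The sketch below (in Python; the variable and function names are ours) computes the $p_{k}$ and searches for the smallest cut-off $k_{0}$ meeting a given $\alpha$.

```python
from math import comb, factorial

N = 52

# Derangement numbers D_n via the recursion D_n = (n - 1) * (D_{n-1} + D_{n-2}).
D = [1, 0]
for n in range(2, N + 1):
    D.append((n - 1) * (D[n - 1] + D[n - 2]))

# p_k = P_{H0}{X = k}: choose the k correctly guessed positions, derange the rest.
p = [comb(N, k) * D[N - k] / factorial(N) for k in range(N + 1)]

def significance_level(k0):
    """P_{H0}{X >= k0} for the rule 'accept H1 if X >= k0'."""
    return sum(p[k0:])

alpha = 0.01
k0 = next(k for k in range(N + 1) if significance_level(k) <= alpha)
print(k0, significance_level(k0))  # k0 = 5, with level about 0.004
```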

In the same example of card-guessing, let $\alpha=0.01$. Let us also assume that the Poisson approximation holds, that is, $p_{j}\approx e^{-1}/j!$ for each $j$. Then we are looking for the smallest $k_{0}$ such that $\sum_{j=k_{0} }^{\infty}e^{-1}/j! \le 0.01$. For $k_{0}=4$ this sum is about $0.019$, while for $k_{0}=5$ it is about $0.004$. Hence we take $k_{0}=5$. In other words, accept $H_{1}$ if $X\ge 5$ and reject if $X < 5$. If we took $\alpha=0.0001$ we would get $k_{0}=7$, and so on.
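The arithmetic above is easy to reproduce. Here is a brief sketch (the function name is ours) that evaluates the Poisson tail sums and searches for the cut-off directly:

```python
from math import exp, factorial

def poisson_tail(k0):
    """Approximate P_{H0}{X >= k0} using p_j ~ e^{-1}/j! (Poisson with mean 1)."""
    return 1.0 - sum(exp(-1.0) / factorial(j) for j in range(k0))

print(poisson_tail(4))  # about 0.019, too large for alpha = 0.01
print(poisson_tail(5))  # about 0.004, so k0 = 5
print(next(k for k in range(53) if poisson_tail(k) <= 0.0001))  # gives k0 = 7
```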

Strength of evidence: Rather than merely saying that we accepted or rejected $H_{1}$, it would be better to say how strong the evidence is in favour of the alternative hypothesis. This is captured by the $p$-value, a central concept in decision making. It is defined as the probability that data generated under the null hypothesis would show at least as much agreement with the alternative hypothesis as the data we actually have at hand (read it five times!).

Before we compute it in our example, let us return to the analogy with law. Suppose a man stands trial for murder. Recall that $H_{0}$ is that he is not guilty and $H_{1}$ is that he is guilty. Suppose his fingerprints were found in the house of the murdered person. Does this prove his guilt? It is some evidence in favour of it, but not necessarily strong. For example, if the accused was a friend of the murdered person, then he might be innocent and simply have left his fingerprints on his visits to his friend. However, if the accused is a total stranger, then one wonders why, if he were innocent, his fingerprints were found there. The evidence for guilt is stronger. If bloodstains were found on his shirt, the evidence would be even stronger! In saying this, we are asking ourselves questions like ''if he were innocent, how likely is it that his shirt is blood-stained?''. That is the $p$-value. The smaller the $p$-value, the stronger the evidence for the alternative hypothesis.

Now we return to our example. Suppose the observed value is $X_{\mbox{obs} }=4$. Then the $p$-value is $\mathbf{P}_{H_{0} }\{X\ge 4\}=p_{4}+\ldots +p_{52}\approx 0.019$. If the observed value were $X_{\mbox{obs} }=6$, then the $p$-value would be $p_{6}+\ldots +p_{52}\approx 0.00059$. Note that the computation of the $p$-value does not depend on the level of significance; it depends only on the given hypotheses and the chosen test.
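As a check on these numbers, here is a small sketch (under the same Poisson approximation; the function name is ours) that computes the $p$-value for an observed count:

```python
from math import exp, factorial

def p_value(x_obs):
    """P_{H0}{X >= x_obs}, with p_j approximated by e^{-1}/j! (Poisson with mean 1)."""
    return 1.0 - sum(exp(-1.0) / factorial(j) for j in range(x_obs))

print(p_value(4))  # about 0.019
print(p_value(6))  # about 0.00059
```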

Chapter 35. Testing for the mean of a normal population