In statistics we are faced with data, which could be measurements in an experiment, responses in a survey, etc. There will be some randomness, which may be inherent in the problem or due to errors in measurement. The problem in statistics is to make various kinds of inferences about the underlying distribution from realizations of the random variables. We shall consider a few basic types of problems encountered in statistics. We shall mostly deal with examples, but sufficiently many that the general ideas should become clear too. It may be remarked that we stay with the simplest ''textbook-type problems'', although we shall also see some real data. Unfortunately we shall not touch upon the problems of current interest, which typically involve very large data sets. Here are the kinds of problems we study.

General setting : We shall have data (measurements, perhaps), usually of the form $X_{1},\ldots ,X_{n}$, which are realizations of independent random variables with a common distribution. The underlying distribution is not fully known. In the problems we consider, the form of the distribution is typically known, except for the values of a few parameters. Thus, we may write the data as $X_{1},\ldots ,X_{n}$ i.i.d. $f_{\theta}(x)$, where $f_{\theta}(x)$ is a pdf or pmf for each value of the parameter(s) $\theta$. For example, the density could be that of $N(\mu,{\sigma}^{2})$ (two unknown parameters, $\mu$ and ${\sigma}^{2}$) or of $\mbox{Pois}(\lambda)$ (one unknown parameter, $\lambda$).
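As a concrete illustration of this setting, here is a minimal Python sketch (not part of the text's development; the seed and the value $\lambda = 2.5$ are arbitrary choices made only for the simulation): we fix a ''true'' parameter, draw i.i.d. samples, and the statistician's task is then to say something about $\lambda$ from the sample alone.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# "True" parameter -- unknown to the statistician in practice.
lam = 2.5

# X_1, ..., X_n i.i.d. Pois(lambda): the data we actually get to see.
n = 100
X = rng.poisson(lam, size=n)

print(X[:10])  # the raw data; lam itself stays hidden from us
```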

(1) Estimation : Here, the question is to guess the value of the unknown $\theta$ from the sample $X_{1},\ldots ,X_{n}$. For example, if the $X_{i}$ are i.i.d. from the $\mbox{Ber}(p)$ distribution ($p$ is unknown), then a reasonable guess for $p$ would be the sample mean $\bar{X}_{n}$ (an estimator). Is this the only one? Is it the ''best'' one? Such questions are addressed in estimation.
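For instance, the following small simulation (a sketch; the seed and the true value $p=0.3$ are assumptions made only to test the estimator) shows the sample mean $\bar{X}_{n}$ tracking the unknown $p$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

p_true = 0.3  # unknown in practice; fixed here only to check the estimator
for n in (10, 100, 10_000):
    X = rng.binomial(1, p_true, size=n)  # X_1, ..., X_n i.i.d. Ber(p)
    print(n, X.mean())                   # the estimator X_bar_n of p
```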

(2) Confidence intervals : Here again the problem is that of estimating the value of a parameter, but instead of giving a single value as a guess, we give an interval and quantify how sure we are that the interval will contain the unknown parameter. For example, suppose a coin with unknown probability $p$ of turning up heads is tossed $n$ times. Then, a confidence interval for $p$ could be of the form \[\begin{aligned} \left[\bar{X}_{n}-\frac{3}{\sqrt{n} }\sqrt{\bar{X}_{n}(1-\bar{X}_{n})},\ \bar{X}_{n}+\frac{3}{\sqrt{n} }\sqrt{\bar{X}_{n}(1-\bar{X}_{n})}\right] \end{aligned}\] where $\bar{X}_{n}$ is the proportion of heads in $n$ tosses. The reason for such an interval will come later. It turns out that if $n$ is large, one can say that with probability at least $0.99$ (the ''confidence level''), this interval will contain the true value of the parameter.
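To make the formula concrete, here is a sketch that computes exactly the interval displayed above from simulated tosses (the seed and the true $p=0.6$ are assumptions; in practice only the tosses would be observed).

```python
import numpy as np

rng = np.random.default_rng(seed=2)

p_true = 0.6  # hidden from the statistician in a real experiment
n = 1000
X = rng.binomial(1, p_true, size=n)  # the n coin tosses

xbar = X.mean()  # proportion of heads
half = 3 / np.sqrt(n) * np.sqrt(xbar * (1 - xbar))
print(f"interval: [{xbar - half:.4f}, {xbar + half:.4f}]")
# By the CLT, for large n this interval covers the true p with
# probability about Phi(3) - Phi(-3) = 0.997, in particular above 0.99.
```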

(3) Hypothesis testing : In this type of problem we are required to decide between two competing choices (''hypotheses''). For example, it is claimed that one batch of students is better than a second batch at mathematics. One way to check this is to give the same exam to the students in both batches and record the scores. Based on the scores, we have to decide whether the first batch is better than the second (one hypothesis) or whether there is not much difference between the two (the other hypothesis). One can imagine that this can be done by comparing the sample means, etc., but that will come later.
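As a preview of how such a comparison might look in practice, here is a sketch using simulated scores and scipy's two-sample $t$-test (one standard way to compare two means; it is an assumption here, not necessarily the test developed later in the text).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Simulated exam scores for the two batches (made-up distributions).
batch1 = rng.normal(68, 10, size=40)
batch2 = rng.normal(62, 10, size=35)

# Two-sample t-test: H0 "the two means are equal" vs. H1 "they differ".
t_stat, p_value = stats.ttest_ind(batch1, batch2)
print(t_stat, p_value)  # a small p-value is evidence against H0
```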

A good analogy for testing problems comes from law, where the judge has to decide whether an accused is guilty or not guilty. Evidence presented by the lawyers takes the role of data (although of course one does not actually compute any probabilities quantitatively there!).

(4) Regression : Consider two measurements, such as height and weight. It is reasonable to say that weight and height are positively correlated (if the height is larger, the weight tends to be larger too), but is there a more quantitative relationship? Can we predict the weight (roughly) from the height? One could try to see if a linear function fits: $\mbox{wt.}=a\,\mbox{ht.}+b$ for some $a,b$. Or perhaps a more complicated fit, such as $\mbox{wt.}=a\,\mbox{ht.}+b\,\mbox{ht.}^{2}+c$. To see whether such a fit is a good one, and to know what values of $a,b,c$ to take, we need data. Thus, the problem is that we have data $(H_{i},W_{i})$, $i=1,2,\ldots ,n$, and based on this data we try to find the best linear fit (or the best quadratic fit), etc.
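For instance, a least-squares line can be fitted with numpy (a sketch on simulated heights and weights; the ''true'' line and noise level are assumptions, and np.polyfit is one standard tool for the job).

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Simulated (height, weight) pairs scattered around a "true" line.
H = rng.uniform(150, 190, size=50)            # heights in cm
W = 0.9 * H - 80 + rng.normal(0, 5, size=50)  # weights in kg, with noise

a, b = np.polyfit(H, W, deg=1)  # least-squares fit  W ~ a*H + b
print(a, b)
# For a quadratic fit of the form a*H + b*H**2 + c, use deg=2 instead.
```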

As another example, consider the approximate law that the resistivity of a material is proportional to its temperature. What is the constant of proportionality (for a given material)? Here we have a law that says $R=aT$, where $a$ is not known. By taking many measurements at various temperatures we get data $(T_{i},R_{i})$, $i=1,2,\ldots ,n$. From this we must find the best possible $a$ (if all the data points were to lie on a line $y=ax$, there would be no problem; in reality they never will, and that is why the choice is an issue!).
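One natural choice of ''best'' is least squares: minimizing $\sum_{i}(R_{i}-aT_{i})^{2}$ over $a$ (set the derivative in $a$ to zero) gives $\hat{a}=\sum_{i}T_{i}R_{i}\,/\sum_{i}T_{i}^{2}$. Here is a sketch on simulated measurements (the value of $a$ and the noise level are assumptions made only for the illustration).

```python
import numpy as np

rng = np.random.default_rng(seed=5)

# Simulated (temperature, resistivity) pairs around R = a*T, with noise.
a_true = 0.004
T = rng.uniform(250, 400, size=30)
R = a_true * T + rng.normal(0, 0.05, size=30)

# Least squares through the origin: minimize sum((R_i - a*T_i)^2),
# whose solution is a_hat = sum(T_i * R_i) / sum(T_i ** 2).
a_hat = np.sum(T * R) / np.sum(T ** 2)
print(a_hat)
```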

Chapter 29. Estimation problems