Let $(X_{i},Y_{i})$ be i.i.d. random variables. For example, we could pick people at random from a population and measure their height ($X$) and weight ($Y$). One question of interest is to predict the value of $Y$ from the value of $X$. This may be useful if $Y$ is difficult or expensive to measure directly while $X$ is not. In other words, we assume that there is an underlying relationship $Y=f(X)$ for an unknown function $f$ which we want to find. From a random sample $(X_{1},Y_{1}),\ldots ,(X_{n},Y_{n})$ we try to guess the function $f$. If we allow all possible functions, it is easy to find one that fits all the data points, i.e., there exists a function $f:\mathbb{R}\rightarrow \mathbb{R}$ (in fact we may take $f$ to be a polynomial of degree at most $n-1$) such that $f(X_{i})=Y_{i}$ for each $i\le n$ (this is true only if all the $X_{i}$ are distinct, which happens with probability one if $X$ has a continuous distribution). This is not a good predictor, because the next data point $(U,V)$ will typically fall way off the curve. We have found a function that ''predicts'' all the data we have perfectly, but not a future observation! Instead, we fix a class of functions, for example the collection of all linear functions $y=mx+c$ where $m,c\in \mathbb{R}$, and within this class find the best fitting function.
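To make the overfitting point concrete, here is a minimal numerical sketch (Python with NumPy; the simulated data, seed and degree are illustrative choices, not part of the text): an interpolating polynomial has essentially zero error on the observed points but can predict poorly at a new point.
\begin{verbatim}
# Sketch: a degree-(n-1) polynomial fits n points with distinct X exactly,
# but such an interpolant is a poor predictor for a new observation.
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = np.sort(rng.uniform(0.0, 1.0, size=n))        # distinct X values
y = 2.0 * x + rng.normal(scale=0.2, size=n)       # roughly linear relationship plus noise

coeffs = np.polyfit(x, y, deg=n - 1)              # interpolating polynomial
print(np.max(np.abs(np.polyval(coeffs, x) - y)))  # (numerically) zero error on the data

u = x[-1] + 0.2                                   # a new X value just beyond the data
print(np.polyval(coeffs, u), "vs roughly", 2.0 * u)  # typically far off the underlying line
\end{verbatim}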

 

Remark 182
One may wonder if linearity is too restrictive. It is, to some extent, but perhaps not as much as it sounds at first.
  1. Firstly, many relationships are linear in a reasonable range of the $X$ variable (for example, the resistance of a material versus its temperature).
  2. Secondly, we may sometimes transform the variables so that the relationship becomes linear. For example, if $Y=ae^{bX}$, then $\log(Y)=a'+b'X$ where $a'=\log(a)$ and $b'=b$, and hence in terms of the new variables $X$ and $\log(Y)$ we have a linear relationship.
  3. Lastly, as a slight extension of linear regression, one can study multiple linear regression, where one has several independent variables $X^{(1)},\ldots, X^{(p)}$ and tries to fit a linear function $Y=\beta_{1}X^{(1)}+\ldots +\beta_{p}X^{(p)}$. Once that is done, it increases the scope of curve fitting even more. For example, if we have two variables $X,Y$, then we can take $X^{(1)}=1$, $X^{(2)}=X$, $X^{(3)}=X^{2}$. Then, linear regression of $Y$ against $X^{(1)},X^{(2)},X^{(3)}$ is tantamount to fitting a quadratic polynomial curve for $X,Y$.
In short, with multiple linear regression together with non-linear transformations of the individual variables, the class of functions $f$ that we can fit is greatly extended (a short numerical sketch of the last point is given below).
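As a sketch of the last point in the remark (a quadratic fit obtained as a multiple linear regression on $1, X, X^{2}$), here is a minimal Python/NumPy illustration; the simulated data and coefficients are arbitrary choices, not from the text.
\begin{verbatim}
# Fitting a quadratic in X by multiple linear regression on X^(1)=1, X^(2)=X, X^(3)=X^2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=50)
y = 1.0 - 0.5 * x + 0.75 * x**2 + rng.normal(scale=0.3, size=50)

A = np.column_stack([np.ones_like(x), x, x**2])   # design matrix with columns 1, X, X^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)      # least squares for (beta_1, beta_2, beta_3)
print(beta)                                       # should be close to (1.0, -0.5, 0.75)
\end{verbatim}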
Finding the best linear fit : We need a criterion for deciding the ''best''. A basic one is the method of least squares, which recommends finding $\alpha,\beta$ such that the error sum of squares $\mathrm{SS}(\alpha,\beta):=\sum_{k=1}^{n}(Y_{k}-\alpha-\beta X_{k})^{2}$ is minimized. For fixed $X_{i},Y_{i}$ this is a simple problem in calculus. We get $$ \hat{\beta}=\frac{\sum_{k=1}^{n}(X_{k}-\bar{X}_{n})(Y_{k}-\bar{Y}_{n})}{\sum_{k=1}^{n}(X_{k}-\bar{X}_{n})^{2} }=\frac{s_{X,Y} }{s_{X}^{2} }, \qquad \hat{\alpha}=\bar{Y}_{n}-\hat{\beta}\bar{X}_{n} $$ where $s_{X,Y}$ is the sample covariance of $X,Y$ and $s_{X}^{2}$ is the sample variance of $X$.

We leave the derivation of the least squares estimators by calculus to you. Instead we present another approach. For a given choice of $\beta$, we know that the choice of $\alpha$ which minimizes $\mathrm{SS}(\alpha,\beta)$ is the sample mean of $Y_{i}-\beta X_{i}$, which is $\bar{Y}-\beta \bar{X}$. Thus, we only need to find $\hat{\beta}$ that minimizes $$\sum_{k=1}^{n}\left((Y_{k}-\bar{Y})-\beta(X_{k}-\bar{X})\right)^{2}$$ and then we simply set $\hat{\alpha}=\bar{Y}-\hat{\beta}\bar{X}$. Let $Z_{k}=\frac{Y_{k}-\bar{Y} }{X_{k}-\bar{X} }$ and $w_{k}=\frac{(X_{k}-\bar{X})^{2} }{\sum_{j=1}^{n}(X_{j}-\bar{X})^{2} }$. (We are dividing by $X_{k}-\bar{X}$; what if it is zero for some $k$? Note that in the expression $\sum_{k} \left((Y_{k}-\bar{Y})-\beta(X_{k}-\bar{X})\right)^{2}$ all such terms do not involve $\beta$ and hence can safely be left out of the summation. We leave the details for you to work out; the expressions at the end should involve all $X_{k},Y_{k}$.) Then, $$ \sum_{k=1}^{n}\left((Y_{k}-\bar{Y})-\beta(X_{k}-\bar{X})\right)^{2}=\Big(\sum_{j=1}^{n}(X_{j}-\bar{X})^{2}\Big)\sum_{k=1}^{n}w_{k}\left(Z_{k}-\beta \right)^{2}. $$ Since the $w_{k}$ are non-negative numbers that add to $1$, we can interpret them as a probability mass function, and hence the minimizing $\beta$ is the expectation with respect to this mass function. In other words, $$ \hat{\beta}=\sum_{k=1}^{n}w_{k}Z_{k} = \frac{s_{X,Y} }{s_{X}^{2} }. $$ Another way to write it is $\hat{\beta}=\frac{s_{Y} }{s_{X} }r_{X,Y}$ where $r_{X,Y}$ is the sample correlation coefficient.

A motivation for the least squares criterion : Suppose we make more detailed model assumptions as follows. Let $X$ be a control variable (i.e., not random, but we can tune it to any value, like temperature) and assume that $Y_{i}=\alpha+\beta X_{i}+\epsilon_{i}$ where $\epsilon_{i}$ are i.i.d. $N(0,{\sigma}^{2})$ ''errors''. Then the data is essentially $Y_{i}$ that are independent $N(\alpha+\beta X_{i},{\sigma}^{2})$ random variables. Now we can estimate $\alpha,\beta$ by the maximum likelihood method. Since the log-likelihood is $-\frac{n}{2}\log(2\pi {\sigma}^{2})-\frac{1}{2{\sigma}^{2} }\sum_{i=1}^{n}(Y_{i}-\alpha-\beta X_{i})^{2}$, maximizing it over $\alpha,\beta$ is the same as minimizing $\sum_{i=1}^{n}(Y_{i}-\alpha-\beta X_{i})^{2}$. In other words, under this model the maximum likelihood estimates coincide with the least squares estimates.
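The closed-form estimates above are easy to check numerically. Here is a minimal Python/NumPy sketch (simulated data; the seed and true coefficients are arbitrary choices) that computes $\hat{\alpha},\hat{\beta}$ from the formulas and verifies the weighted-mean representation $\hat{\beta}=\sum_{k}w_{k}Z_{k}$.
\begin{verbatim}
# Least squares line by the closed-form formulas derived above.
import numpy as np

def least_squares_line(x, y):
    xbar, ybar = x.mean(), y.mean()
    beta = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # s_{X,Y}/s_X^2
    alpha = ybar - beta * xbar
    return alpha, beta

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=30)

alpha_hat, beta_hat = least_squares_line(x, y)

# Weighted-mean representation: beta_hat = sum_k w_k Z_k
# (points with X_k equal to Xbar get weight zero and can be skipped).
mask = x != x.mean()
w = (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
z = (y[mask] - y.mean()) / (x[mask] - x.mean())
print(alpha_hat, beta_hat, np.sum(w[mask] * z))   # the last two numbers coincide
\end{verbatim}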

 

Example 183 (Hubble's 1929 experiment on the recession velocity of nebulae and their distance to earth)
Hubble collected the following data that I took from \href{http://lib.stat.cmu.edu/DASL/Datafiles/Hubble.html}{this website}. Here $X$ is the distance from the nebula to earth (in megaparsecs) and $Y$ is the observed recession velocity in $10^{3}$km/s. \[ \begin{array}{||r|c|c|c|c|c|c|c|c|c|c|c|c||} \hline X & 0.032 & 0.034 & 0.214 & 0.263 & 0.275 & 0.275 & 0.45 & 0.5 & 0.5 & 0.63 & 0.8 & 2 \\ \hline Y & 0.17 & 0.29 & -0.13 & -0.07 & -0.185 & -0.22 & 0.2 & 0.29 & 0.27 & 0.2 & 0.3 & 1.09\\ \hline \hline X & 0.9 & 0.9 & 0.9 & 0.9 & 1 & 1.1 & 1.1 & 1.4 & 1.7 & 2 & 2 & 2 \\ \hline Y & -0.03 & 0.65 & 0.15 & 0.5 & 0.92 & 0.45 & 0.5 & 0.5 & 0.96 & 0.5 & 0.85 & 0.8 \\ \hline \hline \end{array} \] We fit two straight lines to this data.
  1. Fit the line $Y=\alpha+\beta X$. The least squares estimators (as derived earlier) turn out to be $\hat{\alpha}=-0.04078$ and $\hat{\beta}=0.45416$. If $Z_{i}=\hat{\alpha}+\hat{\beta} X_{i}$ are the predicted values of the $Y_{i}$s, then the residual sum of squares is $\sum_{i}(Y_{i}-Z_{i})^{2}=1.1934$.
  2. Fit the line $Y=bX$. In this case we get $\hat{b}$ by minimizing $\sum_{i}(Y_{i}-bX_{i})^{2}$. This is slightly different from before, but the same methods (calculus, or the alternate argument we gave) work to give $$ \hat{b}=\frac{\sum_{i=1}^{n}Y_{i}X_{i} }{\sum_{i=1}^{n}X_{i}^{2} }= 0.42394. $$ The residual sum of squares $\sum_{i=1}^{n}(Y_{i}-\hat{b}X_{i})^{2}$ turns out to be $1.2064$. (Both fits are reproduced in the sketch following this example.)
The residual sum of squares is smaller for the first fit, so one may naively think that it is better. However, note that the reduction is due to an extra parameter. Purely statistically, introducing extra parameters will always reduce the residual sum of squares, for obvious reasons. But the question is whether the extra parameter is worth the reduction. More precisely, if we fit the data too closely, then the next data point to be discovered (which may be a nebula that is $10$ megaparsecs away) may fall way off the curve. More importantly, in this example physics tells us that the line must pass through zero (that is, there is no recession velocity when two objects are very close). Therefore it is the second line that we consider, not the first. This gives the Hubble constant to be about $424$ km/s/megaparsec (the currently accepted value appears to be about $70$, with data going up to distances of hundreds of megaparsecs; see \href{https://www.cfa.harvard.edu/~dfabricant/huchra/hubble.plot.dat}{this data}!).
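Both fits in this example are easy to reproduce. The following Python/NumPy sketch (with the data copied from the table above) computes the least squares line and the line through the origin; the numbers should agree with those quoted above up to rounding.
\begin{verbatim}
# Hubble's data: X = distance (megaparsecs), Y = recession velocity (10^3 km/s).
import numpy as np

X = np.array([0.032, 0.034, 0.214, 0.263, 0.275, 0.275, 0.45, 0.5, 0.5, 0.63, 0.8, 2,
              0.9, 0.9, 0.9, 0.9, 1, 1.1, 1.1, 1.4, 1.7, 2, 2, 2])
Y = np.array([0.17, 0.29, -0.13, -0.07, -0.185, -0.22, 0.2, 0.29, 0.27, 0.2, 0.3, 1.09,
              -0.03, 0.65, 0.15, 0.5, 0.92, 0.45, 0.5, 0.5, 0.96, 0.5, 0.85, 0.8])

# (1) Line with intercept, Y = alpha + beta X.
beta = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha = Y.mean() - beta * X.mean()
print(alpha, beta, np.sum((Y - alpha - beta * X) ** 2))   # about -0.041, 0.454, 1.19

# (2) Line through the origin, Y = b X.
b = np.sum(X * Y) / np.sum(X ** 2)
print(b, np.sum((Y - b * X) ** 2))                        # about 0.424, 1.21
\end{verbatim}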

 

Example 184
I have taken this example from the \href{http://ces.iisc.ernet.in/hpg/nvjoshi/statspunedatabook/databook.html}{wonderful compilation of data sets} by A. P. Gore, S. A. Paranjpe, M. B. Kulkarni. In this example, $Y$ denotes the number of frogs of age $X$ (in some delimited population). \[ \begin{array}{||r|c|c|c|c|c|c|c|c||} \hline X & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline Y & 9093 & 35 & 30 & 28 & 12 & 8 & 5 & 2\\ \hline \hline \end{array} \] A prediction about life-times says that the survival probability $P(t)$ (which is the chance that an individual survives up to age $t$ or more) decays as $P(t)=Ae^{-bt}$ for some constants $A$ and $b$. We would like to check this against the given data. What we need is the number of individuals that survive to age $t$ or beyond. Taking $Z$ to be the tail sums of $Y$ (so that $Z_{t}=\sum_{s\ge t}Y_{s}$ is the number of individuals of age $t$ or more), this gives us \[ \begin{array}{||r|c|c|c|c|c|c|c|c||} \hline X & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline Z & 9213 & 120 & 85 & 55 & 27 & 15 & 7 & 2\\ \hline P=Z/n & 1.0000 & 0.0130 & 0.0092 & 0.0060 & 0.0029 & 0.0016 & 0.0008 & 0.0002 \\ \hline W=\log P & 0 & -4.3409 & -4.6857 & -5.1210 & -5.8325 & -6.4203 & -7.1825 & -8.4352 \\ \hline \hline \end{array} \] where $n=9213$ is the total number of individuals. We compute that $\bar{X}=4.5$, $\bar{W}=-5.25$, $\mbox{std}(X)=2.45$, $\mbox{std}(W)= 2.52$ and $\mbox{corr}(X,W)=-0.92$. Hence, in the linear regression $W=a+bX$, we get $\hat{b}=-0.94$ and $\hat{a}=-1.01$ (note that the slope here estimates $-b$ in the model $P(t)=Ae^{-bt}$, so the estimated decay rate is about $0.94$). The residual sum of squares is $7.0$.
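Here is a short Python/NumPy sketch of the computation in this example (the data is the table above); it forms the tail sums, the survival proportions and the log-survival values, and then fits the least squares line.
\begin{verbatim}
# Frog data: Y_t = number of individuals of age t, for t = 1, ..., 8.
import numpy as np

X = np.arange(1, 9)
Y = np.array([9093, 35, 30, 28, 12, 8, 5, 2])

Z = np.cumsum(Y[::-1])[::-1]      # number of individuals of age t or more (tail sums)
P = Z / Z[0]                      # survival proportions (Z[0] = n = 9213)
W = np.log(P)                     # log survival

b = np.sum((X - X.mean()) * (W - W.mean())) / np.sum((X - X.mean()) ** 2)
a = W.mean() - b * X.mean()
print(a, b, np.sum((W - a - b * X) ** 2))   # roughly -1.01, -0.94, 7.0
\end{verbatim}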
How good is the fit? For the same data $(X_{1},Y_{1}),\ldots ,(X_{n},Y_{n})$, suppose we have two candidates (a) $Y=f(X)$ and (b) $Y=g(X)$. How do we decide which is better? Or how do we say whether a fit is good at all? By the least-squares criterion, the answer is the one with the smaller residual sum of squares $SS:=\sum_{k=1}^{n}(Y_{k}-f(X_{k}))^{2}$. Usually one presents a closely related quantity $R^{2}=1-\frac{SS}{SS_{0} }$ (where $\mbox{SS}_{0}=\sum_{k=1}^{n}(Y_{k}-\bar{Y})^{2}=(n-1)s_{Y}^{2}$). Since $SS_{0}$ is (a multiple of) the total variance in $Y$, $R^{2}$ measures how much of it is ''explained'' by a particular fit. Note that $0\le R^{2}\le 1$ (at least when the class of functions over which we minimize contains the constant functions). The higher (i.e., the closer to $1$) the $R^{2}$, the better the fit. Thus, the first naive answer to the above question is to compute $R^{2}$ in the two situations (fitting by $f$ and fitting by $g$) and see which is higher. But a more nuanced approach is preferable. Consider the same data and three situations.
  1. Fit a constant function. This means, choose $\alpha$ to minimize $\sum_{k=1}^{n}(Y_{k}-\alpha)^{2}$. The solution is $\hat{\alpha}=\bar{Y}$ and the residual sum of squares is $\mbox{SS}_{0}$ itself. Then, $R_{0}^{2}=0$.
  2. Fit a linear function. Then $\alpha,\beta$ are chosen as discussed earlier and the residual sum of squares is $\mbox{SS}_{1}=\sum_{k=1}^{n}(Y_{k}-\hat{\alpha}-\hat{\beta}X_{k})^{2}$. Then, $R_{1}^{2}=1-\frac{SS_{1} }{SS_{0} }$.
  3. Fit a quadratic function. Then the residual sum of squares is $\mbox{SS}_{2}=\sum_{k=1}^{n}(Y_{k}-\hat{\alpha}-\hat{\beta}X_{k}-\hat{\gamma}X_{k}^{2})^{2}$ where $\hat{\alpha},\hat{\beta},\hat{\gamma}$ are chosen so as to minimize $\sum_{k=1}^{n}(Y_{k}-\alpha-\beta X_{k}-\gamma X_{k}^{2})^{2}$. Then $R_{2}^{2}=1-\frac{SS_{2} }{SS_{0} }$.
Obviously we will have $R_{2}^{2}\ge R_{1}^{2}\ge R_{0}^{2}$ (since linear functions include constants and quadratic functions include linear ones). Does that mean that the third is better? If that were the conclusion, then we could continue to introduce more parameters, as that will always reduce the residual sum of squares! But that comes at the cost of making the model more complicated (and having too many parameters means that it will fit the current data well, but not future data!). When should we stop adding more parameters? Qualitatively, a new parameter is desirable if it leads to a significant increase of the $R^{2}$. The question is: how big an increase is significant? For this, one introduces the notion of adjusted $R^{2}$, which is defined as follows. If the model has $p$ parameters in addition to the constant term, then define $\bar{SS}=SS/(n-1-p)$. In particular, $\bar{SS}_{0}=\frac{SS_{0} }{n-1}=s_{Y}^{2}$. Then define the adjusted $R^{2}$ as $\bar{R}^{2}=1-\frac{\bar{SS} }{\bar{SS}_{0} }$. In particular, $\bar{R}_{0}^{2}=R_{0}^{2}$ as before, but $\bar{R}_{1}^{2}=1-\frac{SS_{1}/(n-2)}{SS_{0}/(n-1)}$. Note that $\bar{R}^{2}$ does not necessarily increase upon adding an extra parameter. If we want a polynomial fit, then a rule of thumb is to keep adding more powers as long as $\bar{R}^{2}$ continues to increase, and stop the moment it decreases.
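Here is a minimal Python/NumPy sketch of this rule of thumb on simulated data (the data-generating model and seed are arbitrary choices; $p$ counts the powers of $X$ beyond the constant, as in the formula above): the adjusted $R^{2}$ typically peaks at the true degree.
\begin{verbatim}
# R^2 and adjusted R^2 for polynomial fits of increasing degree.
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.uniform(-1.0, 1.0, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=n)   # truly quadratic

ss0 = np.sum((y - y.mean()) ** 2)
for p in range(0, 5):                                  # degree p = number of powers of X
    coeffs = np.polyfit(x, y, deg=p)
    ss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    r2 = 1 - ss / ss0
    adj_r2 = 1 - (ss / (n - 1 - p)) / (ss0 / (n - 1))
    print(p, round(r2, 4), round(adj_r2, 4))           # adjusted R^2 typically peaks at p = 2
\end{verbatim}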

 

Example 185
To illustrate the point let us look at a simulated data set. I generated $25$ i.i.d. $N(0,1)$ variables $X_{i}$, then generated $25$ i.i.d. $N(0,1/4)$ variables $\epsilon_{i}$, and set $Y_{i}=2X_{i}+\epsilon_{i}$. The data set obtained was as follows. \[ \begin{array}{||r|c|c|c|c|c|c|c|c|c|c|c|c|c||} \hline X & -0.87&0.07&-1.22&-1.12&-0.01&1.53&-0.77&0.37&-0.23&1.11&-1.09&0.03&0.55 \\ \hline Y & -2.43&-0.56&-2.19&-2.32&-0.12&3.77&-1.4&0.84&0.34&1.83&-1.83&0.48&0.98 \\ \hline \hline X & 1.1&1.54&0.08&-1.5&-0.75&-1.07&2.35&-0.62&0.74&-0.2&0.88&-0.77 &\\ \hline Y & 2.3&2.5&-0.41&-2.94&-1.13&-0.84&4.36&-1.14&1.45&-1.36&1.55&-2.43 & \\ \hline \hline \end{array} \] To this data set we fit two models, (A) $Y=\beta X$ and (B) $Y=a+bX$. The results are as follows (the subscript $1$ refers to model (A) and the subscript $2$ to model (B)). \begin{align*} \mbox{SS}_{0}=96.20, & R_{0}^{2}=0 \\ \mbox{SS}_{1}=6.8651, & R_{1}^{2}=0.9286, \hspace{2mm} \bar{R}_{1}^{2}=0.9255 \\ \mbox{SS}_{2}= 6.8212, & R_{2}^{2}=0.9291, \hspace{2mm} \bar{R}_{2}^{2}=0.9227. \end{align*} Note that the adjusted $R^{2}$ decreases (slightly) for the second model. Thus, if we go by that, then the model with one parameter is chosen (correctly, as we generated the data from that model!). You can try various simulations yourself. Also note the high value of $R_{1}^{2}$ (and $R_{2}^{2}$), which indicates that it is not a bad fit at all.
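The simulation is easy to repeat. Here is a Python/NumPy sketch (with a fresh random sample, so the numbers will differ somewhat from those reported above, though the comparison typically comes out the same way).
\begin{verbatim}
# Repeat of the simulation: Y = 2X + eps with eps ~ N(0, 1/4), n = 25.
import numpy as np

rng = np.random.default_rng(4)
n = 25
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)      # N(0, 1/4) noise has standard deviation 0.5

ss0 = np.sum((y - y.mean()) ** 2)

# Model (A): Y = beta X, one parameter.
beta = np.sum(x * y) / np.sum(x ** 2)
ss_a = np.sum((y - beta * x) ** 2)

# Model (B): Y = a + b X, two parameters.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
ss_b = np.sum((y - a - b * x) ** 2)

# R^2 and adjusted R^2, with the same parameter counts as in the example.
for name, ss, p in [("A", ss_a, 1), ("B", ss_b, 2)]:
    r2 = 1 - ss / ss0
    adj_r2 = 1 - (ss / (n - 1 - p)) / (ss0 / (n - 1))
    print(name, round(r2, 4), round(adj_r2, 4))
\end{verbatim}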