A regression line is the line that best fits a set of numerical data points in the plane. In this post, we discuss the concept of a regression (least squares) line and a method for finding it.

In statistics, regression is an analytical method for measuring the association of one or more independent variables with a dependent variable. For example, a regression can be an equation, fitted to observed data for two or more variables, that allows one variable to be estimated from the remaining variable(s). Linear regression uses a linear equation to predict a single value from one or more known explanatory values.

#### Definition of the regression line

From some experiments, suppose we have the following data points: $$(x_1,y_1), \dots, (x_n,y_n).$$ Also, assume that these points, when graphed on the plane, resemble a line. Our main task is to find an equation of a line $$y = \beta_0 + \beta_1 x$$ that fits these points “best”.

For each data point \((x_i, y_i)\), there is a point $$(x_i, \beta_0 + \beta_1 x_i)$$ on the line with the same \(x\)-coordinate. The difference \(y_i - (\beta_0 + \beta_1 x_i)\) between the observed value \(y_i\) and the predicted value \(\beta_0 + \beta_1 x_i\) is called the residual.

One way to measure how “close” the line is to the data is to add the squares of the residuals. The **regression line** is, by definition, the line $$y = \beta_0 + \beta_1 x$$ that **minimizes** the sum of the squares of the residuals. For this reason, some resources call the regression line the least squares line.
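The quantity being minimized is easy to compute directly. The following sketch (using NumPy and hypothetical sample data, not data from the post) evaluates the sum of squared residuals for a candidate line:

```python
import numpy as np

# Hypothetical data points, for illustration only
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 2.0, 3.0])

def sum_squared_residuals(beta0, beta1, x, y):
    """Sum of squared residuals for the candidate line y = beta0 + beta1 * x."""
    residuals = y - (beta0 + beta1 * x)
    return float(np.sum(residuals ** 2))

# The regression line is the choice of (beta0, beta1) minimizing this value.
print(sum_squared_residuals(0.0, 1.5, x, y))
```

Trying different values of `beta0` and `beta1` by hand quickly motivates the closed-form solution derived below.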

#### Why is the regression line a least squares problem?

If the observed values were the same as their corresponding predicted values, we would have $$y_i = \beta_0 + \beta_1 x_i$$ for each \(i \in \mathbb{N}_n\). We can collectively express these equations as the following matrix equation $$U \beta = y,$$ where \(U\) is the \(n \times 2\) matrix $$U = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$$ and \(\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}\) and \(y\) is the following column vector $$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$

Observe that the square of the distance between the vectors \(U\beta\) and \(y\) equals the sum of the squared residuals: $$\Vert U\beta - y \Vert^2 = \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)^2.$$ Therefore, the vector \(\beta\) that minimizes the above sum is the \(\beta\) that one may obtain by computing the least squares solution to the equation \(U \beta = y\) discussed in the post on the least squares problem.
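In practice, the least squares solution of \(U\beta = y\) can be computed with standard linear algebra routines. A minimal sketch with NumPy (the data points are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical data points, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 6.0])

# Design matrix U: a column of ones and a column of x-values
U = np.column_stack([np.ones_like(x), x])

# Least squares solution of U beta = y
beta, residuals, rank, sv = np.linalg.lstsq(U, y, rcond=None)
print(beta)  # array containing [beta0, beta1]
```

`np.linalg.lstsq` minimizes \(\Vert U\beta - y \Vert^2\) exactly as described above, so `beta` holds the intercept and slope of the regression line.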

#### A mathematical formulation of the regression line

Now, we proceed to find the equation of the regression line. In order to find a least squares solution of the equation \(U\beta = y\), we multiply both sides by \(U^T\). Then, we obtain the normal equation of \(U\beta = y\): $$(U^T U) \beta = U^T y.$$ It is easy to see that $$U^T U = \begin{pmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x^2_i\end{pmatrix}.$$ Consequently, the inverse of the \(2 \times 2\) matrix \(U^T U\) is (see matrix operations) $$\frac{1}{n\sum_{i=1}^{n} x^2_i - \left( \sum_{i=1}^{n} x_i \right)^2} \begin{pmatrix} \sum_{i=1}^{n} x^2_i & -\sum_{i=1}^{n} x_i \\ -\sum_{i=1}^{n} x_i & n\end{pmatrix}.$$

Also, note that $$U^T y = \begin{pmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix}\sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix}.$$ With these pieces in hand, the solution \(\beta = (U^TU)^{-1} (U^T y)\) can be computed directly.
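Multiplying out \((U^TU)^{-1}(U^T y)\) gives explicit formulas for the intercept and slope in terms of the sums above. The following sketch translates that arithmetic into code (using the same hypothetical data as before):

```python
import numpy as np

# Hypothetical data points, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 6.0])
n = len(x)

# The sums appearing in U^T U and U^T y
Sx, Sy = x.sum(), y.sum()
Sxx, Sxy = (x * x).sum(), (x * y).sum()

# beta = (U^T U)^{-1} (U^T y), with the 2x2 inverse written out explicitly
det = n * Sxx - Sx ** 2          # determinant of U^T U
beta0 = (Sxx * Sy - Sx * Sxy) / det
beta1 = (n * Sxy - Sx * Sy) / det
print(beta0, beta1)
```

These are the familiar textbook formulas for simple linear regression; they agree with what a general-purpose least squares solver returns for the same data.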

#### An example of the regression line

**Exercise**. Find the regression line of the following set of data points: \((0,0)\), \((1,2)\), and \((2,3)\).

**Solution**. Observe that $$U^T U = \begin{pmatrix} 3 & 3 \\ 3 & 5\end{pmatrix}$$ and so, $$(U^TU)^{-1} = \frac{1}{6} \begin{pmatrix} 5 & -3 \\ -3 & 3\end{pmatrix}.$$ On the other hand, $$U^T y = \begin{pmatrix} 5 \\ 8 \end{pmatrix}.$$ Therefore, $$\begin{pmatrix} \beta_0 \\ \beta_1\end{pmatrix} = \begin{pmatrix} 1/6 \\ 3/2 \end{pmatrix}$$ and the regression line is: $$ y = \frac{1}{6} + \frac{3}{2}x.$$
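As a quick numerical sanity check of this solution (not part of the hand computation above), NumPy's `polyfit` with degree 1 fits the same least squares line:

```python
import numpy as np

# Data points from the exercise
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 2.0, 3.0])

# np.polyfit returns coefficients with the highest power first: [beta1, beta0]
beta1, beta0 = np.polyfit(x, y, 1)
print(beta0, beta1)
```

The output matches \(\beta_0 = 1/6\) and \(\beta_1 = 3/2\) up to floating point precision. The same check can be applied to the exercise below.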

**Exercise**. Find the equation of the regression line that best fits the following data points: \((2,1)\), \((5,2)\), \((7,3)\), and \((8,3)\).

**Remark**. A concept closely related to the regression (least squares) line is the trend line discussed in finance.