
Linear Regression Fundamentals: Simple Linear Model, Part 2

Linear regression is a fundamental statistical technique for understanding the relationship between two variables. It works on the principle of fitting a straight line through data points to model the relationship between the independent variable and the dependent variable.

This topic isn't always taught in depth to those who haven't had rigorous mathematical training. So, in a series of posts, I will go over some of the fundamental concepts of linear regression in depth. Please refer to the previous post on the fundamental concept of model assumptions, if you haven't already.

The Simple Linear Regression Model and its Components

This post considers the simple linear regression model, a model with a single regressor x whose relationship with the response variable y is a straight line:

y = \beta_0 + \beta_1 x + \epsilon

Here \beta_0 is the intercept and \beta_1 is the slope; we call these the regression coefficients. The component \epsilon represents random error.

Component definitions:

  • \beta_0 (intercept) – If the range of data on x includes x = 0, then the intercept is the mean of the distribution of y when x = 0. If the range of x does not include 0, then the intercept has no practical interpretation.
  • \beta_1 (slope) – The change in the mean of the distribution of y produced by a unit change in x.
  • \epsilon – Random error.
  • y – The dependent or response variable.
  • x – The independent or predictor variable.

Linear Regression Model Assumptions

For this model to be valid, there are several assumptions that the model has about the underlying data:

  1. Linearity: The relationship between X and Y is linear.
  2. Independence: The error terms are independent of each other.
  3. Homoscedasticity: The errors have constant variance.
  4. Normality: The errors are normally distributed with mean 0.

Thus, for the model to yield unbiased estimates of \beta_0 and \beta_1, these assumptions must not be violated.

Estimation of Linear Regression Parameters

The parameters \beta_0 and \beta_1 are unknown and must be estimated using a sample dataset. This data may be the result from a controlled experiment, from an observational study, or from historical records.

The method of least squares is a technique used to estimate the parameters of a model by minimizing the sum of squared differences between the observed data points and the predicted values. We use this method to find the best-fitting line that describes the relationship between the X variable and the Y variable. In other words, the method of least squares is used to estimate \beta_0 and \beta_1 so that the sum of squares of the differences between the observations y_i and the straight line is a minimum.

Least Squares Estimators

The first equation we referred to as the simple linear model, or a population regression model, is y = \beta_0 + \beta_1 x + \epsilon. Suppose we have n pairs of data, (y_1, x_1), (y_2, x_2), …, (y_n, x_n). From the first equation, we can write y_i = \beta_0 + \beta_1 x_i + \epsilon_i, i = 1, 2, …, n, which is the sample regression model, written in terms of the n pairs of data.

Mathematically, we want to minimize:

S(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \hat{y_i})^2 = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2, where \hat{y_i} = \beta_0 + \beta_1 x_i is the predicted value and the difference y_i - \hat{y_i} is a residual.

Residuals play a key role when investigating model adequacy and in detecting departures from model assumptions; however, these concepts will be discussed in more depth in later posts.

Additionally, this sum represents the total squared vertical distance between each observed point and the corresponding point on the fitted regression line.
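To make the criterion concrete, here is a minimal sketch in Python (with small made-up data arrays, not from this post) that evaluates the sum-of-squares objective S(\beta_0, \beta_1) for candidate parameter values; the least squares estimates are whichever values make this quantity smallest:

import numpy as np

# Hypothetical toy data; any paired x and y values work here.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sum_of_squares(beta0, beta1, x, y):
    """S(beta0, beta1): the sum of squared vertical distances
    between the observations and the candidate line."""
    residuals = y - (beta0 + beta1 * x)
    return np.sum(residuals ** 2)

# A line that fits poorly has a larger S than one that fits well.
print(sum_of_squares(0.0, 1.0, x, y))   # poor candidate line
print(sum_of_squares(0.1, 2.0, x, y))   # better candidate line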

Deriving the Normal Equations

To estimate \beta_0 and \beta_1, we differentiate the above equation with respect to each parameter and set the derivatives equal to zero. We refer to the resulting equations as the least squares normal equations. These equations provide the estimates for the parameters.

  • Partial derivative with respect to \beta_0:

\frac{\partial }{\partial \beta_0}  \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0

  • Partial derivative with respect to \beta_1:

\frac{\partial }{\partial \beta_1}  \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = -2 \sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i) = 0

We can now simplify these to obtain the normal equations:

\sum y_i = n\beta_0 + \beta_1 \sum x_i

\sum y_i x_i = \beta_0 \sum x_i + \beta_1 \sum x_i ^2, where n is the number of observations.

Essentially, these are two equations in the two unknowns \beta_0 and \beta_1, and solving them gives the least squares estimates.
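For readers who want to see the algebra carried out, here is a small sketch using SymPy on a tiny made-up three-point dataset (chosen purely for illustration): differentiating the sum of squares and setting the partial derivatives to zero reproduces the normal equations, and solving them gives the estimates.

import sympy as sp

# Tiny illustrative dataset (hypothetical values).
xs = [1, 2, 3]
ys = [2.0, 4.1, 5.9]

b0, b1 = sp.symbols('beta0 beta1')

# Sum of squared residuals S(beta0, beta1) for this dataset.
S = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

# Setting the partial derivatives to zero gives the normal equations.
normal_equations = [sp.Eq(sp.diff(S, b0), 0), sp.Eq(sp.diff(S, b1), 0)]

# Solving the two equations in the two unknowns gives the least squares estimates.
estimates = sp.solve(normal_equations, (b0, b1))
print(estimates)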

Solving the Normal Equations

The solutions for the parameters ฮฒ0 and ฮฒ1 are:

  • \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}, where \bar{x} = \frac{1}{n} \sum x_i and \bar{y} = \frac{1}{n} \sum y_i are the means of the x and y values, respectively.
  • \hat{\beta_1} = \frac{ \sum  y_i x_i - \frac{(\sum y_i)(\sum x_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}

Therefore, the above two equations are the least squares estimators of the intercept and slope, respectively. So the fitted model is now:

\hat{y} = \hat{\beta_0} + \hat{\beta_1} x. This equation gives a point estimate of the mean of y for a particular x.

Since the denominator of the equation for \hat{\beta_1} is the corrected sum of squares of xi, and the numerator is the corrected sum of cross products of xi and yi, we can write these quantities in a more compact notation as

S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = \sum (x_i - \bar{x})^2 and S_{xy} = \sum y_i x_i - \frac{(\sum y_i)(\sum x_i)}{n} = \sum y_i (x_i - \bar{x}). Thus a convenient way to write \hat{\beta_1} is \hat{\beta_1} = \frac{S_{xy}}{S_{xx}}
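As a quick sketch of these formulas in code (using NumPy and small made-up data arrays, not tied to any example in this post), the estimates follow directly from S_{xx} and S_{xy}:

import numpy as np

# Hypothetical data; substitute any paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
Sxx = np.sum(x**2) - np.sum(x)**2 / n            # corrected sum of squares of x
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n  # corrected sum of cross products

beta1_hat = Sxy / Sxx                            # slope estimate
beta0_hat = y.mean() - beta1_hat * x.mean()      # intercept estimate
print(beta0_hat, beta1_hat)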

Understanding the Least Squares Normal Equations

The normal equations are essentially a system of linear equations derived from minimizing the sum of squared residuals. Solving these equations provides the estimates \hat{\beta_0} and \hat{\beta_1}, which define the line of best fit.

In summary:

  • The first normal equation ensures that the sum of the residuals is zero. This means that the average of the observed values yi is equal to the average of the predicted values \hat{y_i}.
  • The second normal equation balances the relationship between X and the residuals. This ensures that the weighted sum of the residuals (weighted by the X values) is also zero, ensuring that the line is as close as possible to the data points in terms of minimizing the total error.
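Both properties are easy to check numerically. The following minimal sketch (again with made-up data) fits a line with NumPy and confirms that the residuals sum to zero and that the x-weighted residual sum is zero, up to floating-point error:

import numpy as np

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares fit (np.polyfit returns coefficients from highest
# degree to lowest, so [slope, intercept] for degree 1).
beta1_hat, beta0_hat = np.polyfit(x, y, 1)
residuals = y - (beta0_hat + beta1_hat * x)

print(np.isclose(residuals.sum(), 0.0))        # first normal equation: residuals sum to 0
print(np.isclose((x * residuals).sum(), 0.0))  # second normal equation: x-weighted sum is 0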

Simple Linear Regression Example

Rocket Propellant

A rocket motor is manufactured by bonding an igniter propellant and a sustainer propellant together inside a metal housing. An important characteristic of the bond between the two types of propellant is the shear strength. It is suspected that the shear strength is related to the age in weeks of the batch of the sustainer propellant. We have twenty observations on the shear strength and age of a corresponding batch.

Here is a look at the data:

Observation Shear Strength (psi) Age (weeks)
1 2158.7 15.5
2 1678.15 23.75
3 2316 8
4 2061.3 17
5 2207.5 5.5
6 1708.3 19
7 1784.6 24
8 2575.9 2.5
9 2357.9 7.5
10 2256.7 11
11 2165.2 13
12 2399.55 3.75
13 1779.5 25
14 2336.75 9.75
15 1756.3 22
16 2053.5 18
17 2424.4 6
18 2200.5 12.5
19 2654.2 2
20 1754.7 21.5

Let’s make a scatter plot of the shear strength and age variables to see if the linear assumption is reasonable.

import pandas as pd
import matplotlib.pyplot as plt

# Load the rocket propellant data (path truncated here).
rocket_data = pd.read_excel(r'\...\rocket_data.xlsx')

# Scatter plot of shear strength against propellant age.
plt.scatter(rocket_data['Age (weeks)'], rocket_data['Shear Strength (psi)'])
plt.xlabel('Age (weeks)')
plt.ylabel('Shear Strength (psi)')
plt.savefig('rocket_scatter.png')
plt.show()
[Figure: scatter plot of the rocket propellant data – shear strength (psi) versus age (weeks)]

This plot suggests a strong statistical relationship between shear strength and age, and the assumption of a linear relationship appears reasonable.

So, now let’s estimate the model parameters.

Estimating Model Parameters

To estimate the parameters, let us first calculate:

S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 4,677.6875 - \frac{71,422.5625}{20} = 1,106.559375

and

S_{xy} = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n} = 528,368.4875 - \frac{(267.25)(42,629.65)}{20} =  -41,270.21062

Therefore, from the calculations of S_{xx} and S_{xy} we can find \hat{\beta_1}:

\hat{\beta_1} = \frac{S_{xy}}{S_{xx}} = \frac{-41,270.21062}{1,106.559375} = -37.29597485

and

\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} = 2,131.4825 - (-37.29597485)(13.3625) = 2,629.849964

So the least-squares fit of the model is:

\hat{y} = 2,629.85 - 37.296x

We may interpret the slope -37.296 as the average weekly decrease in shear strength due to the aging of the propellant. Since the lower limit of the x's is near the origin, the intercept 2,629.85 represents the shear strength of a batch of propellant immediately following manufacture.
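These hand calculations can be reproduced in code. The sketch below hard-codes the twenty observations from the data table above (rather than reading the spreadsheet) and recomputes S_{xx}, S_{xy}, the coefficient estimates, and the fitted values and residuals used in the next table:

import numpy as np

# Rocket propellant data from the table above.
age = np.array([15.5, 23.75, 8, 17, 5.5, 19, 24, 2.5, 7.5, 11,
                13, 3.75, 25, 9.75, 22, 18, 6, 12.5, 2, 21.5])
strength = np.array([2158.7, 1678.15, 2316, 2061.3, 2207.5, 1708.3, 1784.6,
                     2575.9, 2357.9, 2256.7, 2165.2, 2399.55, 1779.5, 2336.75,
                     1756.3, 2053.5, 2424.4, 2200.5, 2654.2, 1754.7])

n = len(age)
Sxx = np.sum(age**2) - np.sum(age)**2 / n
Sxy = np.sum(age * strength) - np.sum(age) * np.sum(strength) / n

beta1_hat = Sxy / Sxx                                  # about -37.296
beta0_hat = strength.mean() - beta1_hat * age.mean()   # about 2,629.85

fitted = beta0_hat + beta1_hat * age
residuals = strength - fitted
print(beta0_hat, beta1_hat)
print(residuals.sum())   # essentially zero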

Thus, if we take our original data and plug the x values into the fitted equation, we get the fitted values \hat{y_i}. We can calculate the residual for each observation by subtracting the fitted value from the observed value:

Observed value, y_i    Fitted value, \hat{y_i}    Residual, e_i = y_i - \hat{y_i}
2158.7     2051.762354    106.9376463
1678.15    1744.070561    -65.9205612
2316       2331.482165    -15.48216516
2061.3     1995.818391    65.48160853
2207.5     2424.722102    -217.2221023
1708.3     1921.226442    -212.9264418
1784.6     1734.746567    49.85343251
2575.9     2536.610027    39.28997314
2357.9     2350.130153    7.769847416
2256.7     2219.594241    37.10575941
2165.2     2145.002291    20.19770912
2399.55    2489.990058    -90.44005829
1779.5     1697.450593    82.04940737
2336.75    2266.214209    70.53579084
1756.3     1809.338517    -53.03851719
2053.5     1958.522417    94.97758339
2424.4     2406.074115    18.32588513
2200.5     2163.650278    36.84972169
2654.2     2555.258014    98.94198572
1754.7     1827.986505    -73.28650462
\sum y_i = 42,629.65    \sum \hat{y_i} = 42,629.65    \sum e_i = 0.00

The Residuals

We can see from the table that the sum of the residuals is 0.00. Recall that the residuals represent the part of the observed data that the model doesn’t capture, showing the deviation (also called the error) between the real outcome and what the model predicts (the line). When the residuals sum to 0, it means that the regression model’s predicted values are, on average, equal to the observed values.

It implies that the regression line has been positioned such that it minimizes the overall error and balances the differences between the observed and predicted values across the data points. However, it doesn’t necessarily mean that the model is a perfect fit; the residuals can still vary widely even though their total is zero.

Simple Linear Regression Questions

After getting the least-squares fit, there are some questions we can investigate before the model is deployed:

  1. How well does the model fit the data?
  2. Is the model likely to be useful as a predictor?
  3. Are any of the model assumptions (like constant variance and uncorrelated errors) violated? If so, how serious is this?

As mentioned earlier, residuals play an integral role in evaluating model adequacy. They can be viewed as realizations of the model errors, \epsilon_i.

In order to check the assumptions of constant variance and uncorrelated errors, we must investigate if the residuals look like a random sample from a distribution with these properties. We will dive into these topics and questions in a later post, where we perform model adequacy checking.

Properties of Least Squares and the Fitted Model

We need to determine the statistical properties of least-squares estimators if we wish to use them to make statistical inferences.

Furthermore, the least squares estimates \hat{\beta_0} and \hat{\beta_1} have several important properties:

  • Unbiased: The estimators \hat{\beta_0} and \hat{\beta_1} are unbiased, meaning their expected values equal the true parameters \beta_0 and \beta_1.
  • Minimum Variance: By the Gauss-Markov theorem, under the assumptions of linearity, independence, and constant variance (but not necessarily normality), the OLS estimators are the best linear unbiased estimators (BLUE).
  • Efficiency: If we assume normality of the errors \epsilon, the least squares estimates are also the maximum likelihood estimates, making them the most efficient estimators under normality.
  • The sum of the residuals in any regression model that contains an intercept is always zero (as demonstrated in the previous example).
  • The least-squares regression line always passes through the centroid of the data, that is, the point (\bar{x}, \bar{y}).

Unbiased Estimators

To show that the least squares estimators for \beta_0 and \beta_1 are unbiased, we need to demonstrate that the expected value of the estimators equals the true parameter values. In order to do this, it is essential to understand the theorems of expectation and variance, discussed in more detail in another post.

Estimation by Maximum Likelihood

The method of least squares can be used to estimate the parameters in a linear regression model regardless of the form of the distribution of the errors. This method produces best linear unbiased estimators of \beta_0 and \beta_1. If the form of the distribution of the errors is known, then an alternative method, known as the method of maximum likelihood, may be used to estimate the parameters.

Note that if the errors are normally distributed, then the maximum likelihood estimation and least-squares will produce the same estimates of the parameters.

The Likelihood Function

First, given the normality assumption, the probability density function of a single observation y_i (writing x_i \beta as shorthand for \beta_0 + \beta_1 x_i) is:

p(y_i|x_i, \beta, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}exp(-\frac{(y_i - x_i \beta)^2}{2\sigma^2})

The likelihood function is the product of these probabilities for all observations, L(\beta, \sigma^2):

L(\beta, \sigma^2) = \prod_{i=1}^n p(y_i | x_i, \beta, \sigma^2)

Note: To simplify the math, we typically work with the log-likelihood function:

ln L(\beta, \sigma^2) = -\frac{n}{2} ln(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i\beta)^2

Consider data (y_i, x_i), i = 1, 2, ..., n. If we assume that the errors in the model are normally distributed with 0 mean and constant variance, then the observations yi in this sample are normally and independently distributed random variables with the mean \beta_0 + \beta_1 x_i and variance \sigma^2. The likelihood function is found from the joint distribution of the observations. For the simple linear regression model with normal errors, the likelihood function is:

L(y_i, x_i, \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n (2\pi \sigma^2)^{-\frac{1}{2}} exp[-\frac{1}{2\sigma^2} (y_i - \beta_0 -\beta_1 x_i)^2]

= (2\pi \sigma^2)^{-\frac{n}{2}} exp[-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 -\beta_1 x_i)^2]

The goal is to maximize the log-likelihood with respect to \beta and \sigma^2. By setting the derivative of the log-likelihood with respect to \beta equal to zero, we can derive the maximum likelihood estimates (MLE) for \beta, which turn out to be the same as the ordinary least squares (OLS) estimates. We will go more in-depth on these calculations in a later post.
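As an illustration of that equivalence (a sketch, not a derivation), the following code maximizes the normal log-likelihood numerically with SciPy on a small made-up dataset and compares the result with the closed-form least squares estimates; the two should agree closely:

import numpy as np
from scipy.optimize import minimize

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

def negative_log_likelihood(params):
    """Negative normal log-likelihood; sigma^2 is parameterized on the
    log scale so the optimizer cannot make it negative."""
    beta0, beta1, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)
    resid = y - (beta0 + beta1 * x)
    log_lik = -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(resid**2) / (2 * sigma2)
    return -log_lik

# Numerical maximum likelihood estimates.
result = minimize(negative_log_likelihood, x0=[0.0, 0.0, 0.0])
beta0_mle, beta1_mle, _ = result.x

# Closed-form least squares estimates for comparison.
beta1_ols = np.sum(y * (x - x.mean())) / np.sum((x - x.mean())**2)
beta0_ols = y.mean() - beta1_ols * x.mean()

print(beta0_mle, beta1_mle)   # maximum likelihood
print(beta0_ols, beta1_ols)   # least squares (should agree closely)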

Properties of MLE

In general, maximum-likelihood estimators have better statistical properties than least-squares estimators. The maximum-likelihood estimators are unbiased and have minimum variance when compared to all other unbiased estimators. They are also consistent estimators (consistency is a large-sample property indicating that the estimators differ from the true parameter value by a very small amount as n becomes large), and they are a set of sufficient statistics (this implies that the estimators contain all of the "information" in the original sample of size n).

On the other hand, maximum-likelihood estimation requires more stringent statistical assumptions than the least-squares estimators. The least-squares estimators require only second-moment assumptions (assumptions about the expected value, the variances, and the covariances among the random errors). The maximum-likelihood estimators require a full distributional assumption, in this case that the random errors follow a normal distribution with the same second moments as required for the least-squares estimates.

Estimation of σ²

In addition to estimating the intercept \beta_0 and slope \beta_1, an estimate of \sigma^2 is important. This estimate is critical for conducting hypothesis tests and constructing confidence intervals pertinent to the model.

Ideally, we would like this estimate not to depend on the adequacy of the fitted model. However, this is only possible when there are several observations on y for at least one value of x, or when prior information concerning \sigma^2 is available.

When this is not true, the estimate is obtained from the residual or error sum of squares, which measures the unexplained variation in the observed values after fitting the regression line.

The variance of the error terms ฯƒ2 is estimated by averaging the squared residuals. To obtain an unbiased estimate, we divide by the number of observations minus the number of parameters estimated (degrees of freedom):

\hat{\sigma^2} = \frac{1}{n-p} \sum_{i=1}^n e_i^2 where:

  • n is the number of observations
  • p is the number of parameters, including the intercept.

Example with Rocket Propellant Data

Formulas:

  • First compute the total (corrected) sum of squares, SS_T = \sum_{i=1}^n y_i^2 - n\bar{y}^2 = \sum_{i=1}^n y_i^2 - \frac{(\sum_{i=1}^n y_i)^2}{n}
    • 92,570,846.55 - (42,629.65)^2/20 = 1,706,493.591
  • From this, we can find the residual sum of squares using SS_{Res} = SS_T - \hat{\beta_1} S_{xy}
    • 1,706,493.591 - (-37.296)(-41,270.21) = 167,280.85
  • Therefore, the estimate of \sigma^2 can be computed as:
    • \hat{\sigma^2} = \frac{167,280.85}{18} = 9,293.38
      • where 18 comes from n - 2, accounting for the two estimated parameters (degrees of freedom).

*Note that some calculations may be slightly different due to rounding.
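The same estimate can be obtained directly from the residuals in code. This sketch reuses the hard-coded rocket propellant arrays (repeated here so the block stands alone) and computes the residual sum of squares and \hat{\sigma^2}:

import numpy as np

# Rocket propellant data (same values as the data table above).
age = np.array([15.5, 23.75, 8, 17, 5.5, 19, 24, 2.5, 7.5, 11,
                13, 3.75, 25, 9.75, 22, 18, 6, 12.5, 2, 21.5])
strength = np.array([2158.7, 1678.15, 2316, 2061.3, 2207.5, 1708.3, 1784.6,
                     2575.9, 2357.9, 2256.7, 2165.2, 2399.55, 1779.5, 2336.75,
                     1756.3, 2053.5, 2424.4, 2200.5, 2654.2, 1754.7])

n = len(age)
beta1_hat, beta0_hat = np.polyfit(age, strength, 1)   # least squares slope and intercept

residuals = strength - (beta0_hat + beta1_hat * age)
ss_res = np.sum(residuals**2)    # residual (error) sum of squares
sigma2_hat = ss_res / (n - 2)    # divide by n - p, with p = 2 parameters

print(ss_res)      # approximately 167,280.85
print(sigma2_hat)  # approximately 9,293.38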

Wrap Up – Simple Linear Regression

In this overview of simple linear regression, we covered the key concepts and techniques for modeling the relationship between two variables. We began by outlining the important assumptions of the model, including linearity, independence, constant variance, and normality. From there, we delved into estimating the parameters using the method of least squares, deriving the normal equations to find the best-fitting line. We also touched on maximum likelihood estimation (MLE) and how it aligns with least squares under certain conditions. To bring these ideas to life, we demonstrated parameter estimation using rocket propellant data, showing how these calculations can provide insight into real-world data.

What’s Next?

In our next post, we'll explore how to use these parameter estimates for hypothesis testing and building confidence intervals, further enhancing the model's utility in statistical inference.
