Linear regression is a fundamental statistical technique for understanding the relationship between two continuous variables. It works by fitting a straight line through a set of data points to model the relationship between the independent variable (the predictor) and the dependent variable (the outcome).
This topic isn’t always taught in depth to working professionals or anyone who hasn’t had rigorous mathematical training. So, in this series of posts, I will go over some of the fundamental concepts of linear regression in depth.
Fundamental Assumptions
For linear regression to provide reliable insights, however, it relies on several key assumptions:
- Linearity: The relationship between the variables should be approximately linear. This means that changes in the outcome variable are proportional to changes in the predictor variable.
- Independence: Each data point should be independent of other data points. In other words, there should be no correlation between the residuals (the differences between observed and predicted values).
- Homoscedasticity: The variance of the residuals should be constant across all levels of the predictor variable. This assumption ensures that the model is equally precise in predicting outcomes regardless of the values of the predictor.
- Normality of Residuals: The residuals should be normally distributed around zero. This means that the errors, the differences between the observed and predicted values, should follow a bell-shaped curve.
- No Multicollinearity: In multiple regression (with more than one predictor variable), there should be no high correlation among the predictor variables. High multicollinearity can lead to unreliable estimates of the coefficients.
Meeting these assumptions is crucial for the validity and accuracy of the results obtained from linear regression analysis. When we violate these assumptions, the reliability of the model’s predictions may be compromised, and alternative approaches or transformations might be necessary.
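Before walking through the assumptions one by one, here is a minimal sketch (in Python with statsmodels, using hypothetical synthetic data) of the kind of fitted model the diagnostic checks below start from:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical synthetic data: a roughly linear relationship with noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 50)

X = sm.add_constant(x)      # add an intercept column
model = sm.OLS(y, X).fit()  # ordinary least squares fit
print(model.summary())      # coefficients, R-squared, basic diagnostics
```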
Let’s touch on each assumption of linear regression, noting how to check for violations and what adjustments are possible if an assumption is not met:
Linearity
- Check: A scatterplot of the dependent variable against each independent variable helps assess linearity visually. If the relationship appears nonlinear, a curved trend might indicate a violation.
- Adjustment: Transforming variables by taking logarithms, squares, or higher-order terms (e.g., quadratic or cubic terms) might help linearize the relationship.
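As a sketch of this check and adjustment, assuming hypothetical data with a deliberately curved trend:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical data with a deliberately curved trend
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 2, 100)

# Visual check: curvature in the scatterplot suggests a linearity violation
plt.scatter(x, y, alpha=0.6)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# One possible adjustment: add a quadratic term
X_quad = sm.add_constant(np.column_stack([x, x**2]))
model_quad = sm.OLS(y, X_quad).fit()
print(model_quad.params)  # intercept, linear, and quadratic coefficients
```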
Independence
- Check: The Durbin-Watson test or examining residual plots over time (in time series data) can reveal autocorrelation or dependency among residuals.
- Adjustment: Consider using time series analysis techniques for temporal dependencies or reevaluating the data collection method to ensure independence.
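A minimal sketch of the Durbin-Watson check, using statsmodels on hypothetical time-ordered data with correlated errors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Hypothetical time-ordered data with correlated (non-independent) errors
rng = np.random.default_rng(2)
t = np.arange(100)
noise = np.convolve(rng.normal(size=101), [0.7, 0.7], mode="valid")
y = 1.0 + 0.5 * t + noise

model = sm.OLS(y, sm.add_constant(t)).fit()
# Values near 2 suggest independence; well below 2, positive autocorrelation
print(f"Durbin-Watson statistic: {durbin_watson(model.resid):.2f}")
```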
Homoscedasticity
- Check: Plotting residuals against fitted values helps visualize whether the spread of residuals is consistent across all values of the predictor variable.
- Adjustment: Performing transformations on the dependent variable (e.g., logarithmic transformation) or using weighted least squares regression can address heteroscedasticity.
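Beyond the visual check, the Breusch-Pagan test in statsmodels gives a formal test of constant variance. A sketch on hypothetical data whose noise grows with the predictor, including one possible weighted least squares adjustment:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical data whose noise grows with x (heteroscedastic by design)
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5 * x)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # small p suggests heteroscedasticity

# One possible adjustment: weighted least squares with variance-based weights
wls_model = sm.WLS(y, X, weights=1.0 / x**2).fit()
```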
Normality of Residuals
- Check: Histograms or QQ-plots of residuals can indicate departure from normality.
- Adjustment: Consider applying transformations on the dependent variable or using robust regression techniques that are less sensitive to non-normality.
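A sketch of both checks, a QQ-plot and the Shapiro-Wilk test, on residuals from a hypothetical fit:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

# Residuals from a hypothetical well-behaved fit
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)
model = sm.OLS(y, sm.add_constant(x)).fit()

# QQ-plot: points hugging the line suggest approximately normal residuals
sm.qqplot(model.resid, line="s")
plt.show()

# Shapiro-Wilk: a large p-value gives no evidence against normality
stat, p = stats.shapiro(model.resid)
print(f"Shapiro-Wilk p-value: {p:.3f}")
```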
No Multicollinearity
- Check: Calculating correlation coefficients among predictor variables helps identify multicollinearity. High correlation values (close to 1 or -1) indicate multicollinearity.
- Adjustment: Remove one of the highly correlated variables, use principal component analysis, or employ regularization techniques like ridge regression or LASSO regression to handle multicollinearity.
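Variance inflation factors (VIFs) are a standard complement to pairwise correlations, since they also catch multicollinearity involving combinations of predictors. A sketch with two deliberately collinear hypothetical predictors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Two hypothetical predictors, the second nearly collinear with the first
rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
X = sm.add_constant(np.column_stack([x1, x2]))

# A common rule of thumb: VIF above roughly 5-10 signals trouble
for i, name in zip([1, 2], ["x1", "x2"]):
    print(name, round(variance_inflation_factor(X, i), 1))
```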
If assumptions are significantly violated and adjustments don’t suffice, it’s essential to reconsider the model’s suitability for the data or explore alternative regression approaches like nonlinear regression, generalized linear models, or machine learning algorithms that are less sensitive to these assumptions. Additionally, collecting more diverse or additional data might also help address violations of these assumptions.
Uses of Regression
Regression models are used for several purposes, such as the following:
- Data description
- Parameter estimation
- Prediction and estimation
- Control
Data description: Professionals who frequently work with data, such as engineers and scientists, use equations to summarize or describe a set of data, and regression is a helpful tool for developing those equations. For example, we may collect a considerable amount of data on delivery time and delivery volume, and a regression model would probably be a much more convenient and useful summary of those data than a table or graph. It helps visualize the relationships between variables, and it is a useful part of the exploratory data analysis tool set alongside summary statistics and model building.
Parameter estimation: Consider the Michaelis-Menten equation, y = β₁x / (β₂ + x) + ε, which chemical engineers use to describe the relationship between the velocity of a reaction y and the concentration x. In this model, β₁ is the asymptotic velocity of the reaction, that is, the maximum velocity as the concentration gets large.
If a sample of observed values of the velocity at different concentrations is available, then the engineer can use regression analysis to fit this model to the data, and produce an estimate of the maximum velocity.
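As a sketch, nonlinear least squares via scipy's curve_fit can produce that estimate; the concentration and velocity values below are hypothetical illustrations, not real measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

# Michaelis-Menten model: beta1 is the maximum (asymptotic) velocity,
# beta2 the concentration at which velocity reaches half of beta1
def michaelis_menten(x, beta1, beta2):
    return beta1 * x / (beta2 + x)

# Hypothetical observed velocities at several concentrations
x_obs = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.10])
y_obs = np.array([76.0, 97.0, 123.0, 159.0, 191.0, 207.0])

params, cov = curve_fit(michaelis_menten, x_obs, y_obs, p0=[200.0, 0.1])
print(f"Estimated maximum velocity (beta1): {params[0]:.1f}")
```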
Prediction and estimation: Suppose we wish to predict the delivery time for a specified number of cases of soft drinks. Such predictions can help plan delivery activities such as routing and scheduling. Using regression for prediction carries risks, chief among them extrapolation: predicting outside the range of the data used to fit the model.
Personally, I feel that many professionals in industry fit a model, assume it is valid, and put a lot of weight on extrapolated predictions. But even when the model form is correct, poor estimates of the model parameters can still cause poor prediction performance. It needs to be emphasized that regression predictions should generally be made within the range of the data the model was trained on (interpolation), avoiding extrapolation when possible. Sometimes, however, leadership asks us for extrapolated values. These results should be given to the stakeholder with caution and with some measure of confidence.
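One way to attach such a measure of confidence is to report prediction intervals. A sketch with hypothetical delivery data, where the last requested value deliberately falls outside the fitted range:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical delivery data: delivery time (minutes) vs. cases delivered
rng = np.random.default_rng(6)
cases = rng.uniform(2, 30, 40)
time = 5.0 + 2.0 * cases + rng.normal(0, 3, 40)

model = sm.OLS(time, sm.add_constant(cases)).fit()

# 50 cases lies outside the fitted range (2-30), so that row is extrapolation
new_cases = np.array([10.0, 25.0, 50.0])
pred = model.get_prediction(sm.add_constant(new_cases))
print(pred.summary_frame(alpha=0.05))  # point estimates plus confidence/prediction intervals
```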
Control: Consider the example that an engineer could use regression to develop a model relating the tensile strength of paper to the hardwood concentration in the pulp. This equation could then be used to control the strength to suitable values by varying the level of hardwood concentration. When a regression equation is used for control purposes, it is important that the variables be related in a causal manner. Note that a cause-and-effect relationship may not be necessary if the equation is used only for prediction.
Some Considerations
Regression analysis is widely used and, unfortunately, frequently misused. There are several common abuses of regression that should be mentioned:
- Regression models are intended as interpolation equations over the range of the regressor variable(s) used to fit the model. As observed previously, we must be careful if we extrapolate outside of this range.
- The disposition of the x values plays an important role in the least-squares fit. While all points have equal weight in determining the height of the line, the slope is more strongly influenced by the remote values of x. Situations like this often require corrective action, such as further analysis and possible deletion of the unusual points, estimating the model parameters with a technique less seriously influenced by these points than least squares, or restructuring the model, possibly by introducing further regressors.
- Outliers are observations that differ considerably from the rest of the data, and they can seriously disturb the least-squares fit. For example, suppose one observation, A, falls far from the line implied by the rest of the data. If this point really is an outlier, then the estimate of the intercept may be incorrect and the residual mean square may be an inflated estimate of σ². The outlier may be a "bad value" resulting from a data recording or other error. On the other hand, the data point may not be a bad value at all and may be a highly useful piece of evidence concerning the process under investigation. The sketch after this list shows one way to flag such points.
- Just because a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in any causal sense. Causality implies necessary correlation, but regression analysis can only address issues of correlation; it cannot address the issue of necessity. Thus, our expectations of discovering cause-and-effect relationships from regression should be modest.
- In some applications of regression, the value of the regressor variable x required to predict y is unknown. For example, consider predicting maximum daily load on an electric power generation system from a regression model relating the load to the maximum daily temperature. To predict tomorrow's maximum load, we must first predict tomorrow's maximum temperature. Consequently, the prediction of maximum load is conditional on the temperature forecast, and the accuracy of the maximum load forecast depends on the accuracy of the temperature forecast. This must be considered when evaluating model performance.
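Returning to the outlier point above: as a sketch of how such points can be flagged, statsmodels exposes standard influence diagnostics such as Cook's distance. The data below are hypothetical, with one outlier injected on purpose:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with one artificially injected outlier (like point A)
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 30)
y[0] += 10.0  # shift one observation far from the line

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]  # Cook's distance per observation

# A common rule of thumb flags points with Cook's distance above 4/n
flagged = np.where(cooks_d > 4 / len(x))[0]
print("Potentially influential observations:", flagged)
```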
What’s Next?
Understanding these fundamental assumptions is essential to properly using regression analysis, and it is the first step of many. This series is meant to introduce linear regression in the format we wish we had in school. We will discuss things we learned in class, as well as hard lessons learned from real-world experience.
Next: Simple Linear Regression
References
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2021). Introduction to linear regression analysis. Wiley.