Table of contents
I recently read an article about two men who were caught cheating during a fishing tournament. Being an avid fisherman myself, I couldn’t help but wonder if there could have been a way to detect the cheating with an algorithm. In this post, I am going to use a multiple linear regression model to predict the weight of a fish based on certain measurements.
When catching a fish, one may want to record the weight along with several other measurements. I can imagine visually inspecting a fish, and provide a best guess as to how much that fish would weigh. If I saw a rather small perch or walleye, and grabbed it and thought “wow this is heavier than I thought,” then I may want to investigate further. And this is exactly how the two fisherman were caught! The fishing tournament organizer looked at the rather normal looking fish and thought they couldn’t possibly weigh that much. If a human can come to that conclusion with their own intuition, what could a machine learning model do? Perhaps it could prevent things like this from happening…
Before you start…
In order to apply linear regression, there are some assumptions about the underlying data that need to be met for it to be a valid model:
- Assume the relationship between the independent and dependent variables is linear.
- Assume the variables are normally distributed.
- Linear regression assumes there is little or no multicollinearity in the data.
- Assume the residuals are independent from each other. This is also sometimes called autocorrelation.
- Assume homoscedasticity, or also called constant variance.
Throughout this analysis, we will check for these assumptions.
The data set
This got me thinking and I scoured the internet for a relatable data set. Unfortunately, there aren’t that many freely available data sets on fishing tournaments and fish weights, so I settled for getting creative with the fish market data set from Kaggle.
This analysis is just a first pass at a simple model, and of course one could always take it a step further. Feel free to follow along with the completed notebook on my GitHub.
Setting up the notebook
You may need the following libraries for what we are going to do:
# BASE -----------------------------------
import numpy as np
import pandas as pd
import math
from scipy import stats
# VISUALIZATIONS -------------------------
import matplotlib.pyplot as plt
import seaborn as sns
# MODELING & ANALYSIS --------------------
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from statsmodels.tools.eval_measures import rmse
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score
Explore the data set
# load in the data
df = pd.read_csv(r'C:\...\Fish.csv')
print(df.columns)
print(df.shape)
Index(['Species', 'Weight', 'Length1', 'Length2', 'Length3', 'Height',
'Width'],
dtype='object')
(159, 7)
# Look at first three rows of data frame
df.head3()
I don’t like the ambiguous column names, so let’s rename them according to the data set documentation:
Column descriptions
Weight – weight of fish in grams
Length1 – vertical length in cm
Length2 – Diagonal length in cm
Length3 – Cross length in cm
Height – height in cm
Width – Diagonal width in cm
df = df.rename({'Length1':'Vertical_Length', 'Length2':'Diagonal_Length', 'Length3':'Cross_Length'}, axis='columns')
Check for nulls
null_sum = df.isna().sum()
print(null_sum)
Species 0
Weight 0
Vertical_Length 0
Diagonal_Length 0
Cross_Length 0
Height 0
Width 0
dtype: int64
Next, I want to see how each of the variables relate to each other. I will use a scatter plot matrix to view the relationship:
# Create the scatter plot matrix
sns.pairplot(df, kind="scatter", hue='Species')
plt.show()
This correlogram (Scatter plot matrix) allows us to see that a linear relationship may be appropriate for this dataset (although it does appear that a nonlinear model might be better), but for the sake of this example, we will continue with the linear assumption.
Possible model
In the textbook “Introduction to Linear Regression Analysis” by Douglas Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining, it states “Any regression model that is linear in the parameters is a linear regression model, regardless of the shape that it may generate.”
This means that the weight of a fish (y) can be computed using a linear combination of its different length measurements given the function:
With B0 representing the constant term.
Now, we will learn the best values for our betas from the data, which is where the process for training the model begins. A trained model is a model where the beta values have been decided. A model with these parameters pre built into the formula is what constitutes a parametric model. In other words, we fix the formula of the training function before the training.
Exploratory Data Analysis (EDA)
Explore relationship with Weight variable
Now, I want to look at the relationship of the variables with the weight variable a little more:
# separate into category and numerical columns
df_cat = df[['Species']]
df_num = df.select_dtypes(include='number')
# create histograms with density curve
fig, axes = plt.subplots(2, 3, figsize=(20,10))
fig.subplots_adjust(wspace=0.1, hspace=0.3)
axes = axes.ravel()
for i, col in enumerate(df_num):
sns.histplot(data = df[col], ax=axes[i], kde=True)
plt.show()
# plot the scatter plot of the weight variable against the other variables
sns.pairplot(df,
x_vars=['Vertical_Length', 'Diagonal_Length', 'Cross_Length', 'Height', 'Width'],
y_vars=['Weight'], hue='Species')
plt.show()
Summary statistics
Next, let us take a look at some summary statistics to get a general summary:
df.describe()
- Notice there are some values where the weight is equal to zero, which is impossible, so it may be an input error and we could possibly remove those values
- I should explore each category of species to see if any might be out of the ordinary
species = df['Species'].value_counts()
Perch 56
Bream 35
Roach 20
Pike 17
Smelt 14
Parkki 11
Whitefish 6
Name: Species, dtype: int64
# Weight cannot be 0g, so should look at those values
zeros = df[df['Weight'] == 0]
zeros
This fish cannot possibly weigh 0g, so we will assume for our example that maybe the value was input in error, and will delete this record.
I also see in the summary statistics table, there are a few really high values. Let us take a look at those:
# high values
df[142:145]
These values seem pretty high compared to the rest of the Pike weights, but I have no way to investigate if the data were entered incorrectly, or if some other error occurred, so we will assume they are valid data and keep them in the model until we do further outlier analysis.
# new df with bad value removed and view summary statistics df1 = df[df['Weight'] != 0] df1.describe()
Multicollinearity
# correlation heatmap
sns.heatmap(df1.corr(), annot=True, vmin=0, vmax=1, cmap="rocket_r")
plt.show()
Note that all the variables are highly correlated, especially the length variables, height has the least issues with this.
Outlier analysis
# separate into category and numerical columns
df1_cat = df1[['Species']]
df1_num = df1.select_dtypes(include='number')
# create box plots for outlier analysis
fig, axes = plt.subplots(2, 3, figsize=(16, 8))
axes = axes.ravel()
fig.subplots_adjust(wspace=0.4, hspace=0.4)
fig.suptitle('Box Plot - Outlier Analysis', fontsize=20)
for i, col in enumerate(df1_num):
box=sns.boxplot(data=df1_num[col], ax=axes[i], orient='h')
axes[i].set_title(col)
plt.show()
Notice there are three values that might be a cause for concern, so we will fit the model with and without them, and compare the results.
Prepare the data for modeling
# categorical encoding
df2 = pd.get_dummies(df1)
# drop one species column
df3 = df2.drop(columns=['Species_Bream'])
Model Building
At the time of this writing, sklearn did not have a function for residual analysis and some other values that I prefer to use when performing linear regression. Because of this, I will use both sklearn and statsmodels and see how they compare.
Split into training and test sets
# we will split using the data frame with the dummy variables
# Note: this data frame is the full data frame - no variables or outliers yet removed
X = df3.iloc[:, 1:13]
y = df3[['Weight']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print('X_train: ', np.shape(X_train))
print('y_train: ', np.shape(y_train))
X_train: (118, 11)
y_train: (118, 1)
Define and fit the model
Using sklearn
model = LinearRegression()
model.fit(X_train, y_train)
# Parameters of the model and other values
print('Model intercept: ', model.intercept_)
print('Model coefficients: ', model.coef_)
print('Model score: ', model.score(X, y))
Model intercept: [-785.67562956]
Model coefficients: [[ -63.17797578 71.70850333 27.81978955 -11.52939024 6.42359181
133.76631624 34.38721174 -362.58082738 9.46492511 306.4493073
63.20534515]]
Model score: 0.9329499697782535
# predicting weights from the test set
y_hat = model.predict(X_test)
# Analyzing r2 value of model
r2_score(y_test, y_hat)
0.9482451798572478
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % mean_absolute_error(y_test, y_hat))
MSE: 6790.893
RMSE: 82.407
MAE: 62.558
Cross validation
cross_val_score_train = cross_val_score(model, X_train, y_train, cv=10, scoring='r2')
print(cross_val_score_train)
print(cross_val_score_train.mean())
[0.91333352 0.96324695 0.93654314 0.9341261 0.86007924 0.9247442
0.82266236 0.88844178 0.84039499 0.60139132]
0.8684963615355723
Using statsmodels package
X_sm = sm.add_constant(X_train)
sm_model = sm.OLS(y_train, X_sm).fit()
sm_model.summary()
Residual Analysis
# using full model
residuals = sm_model.resid
plt.scatter(sm_model.predict(), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residual')
plt.show()
This indicates some potential nonlinearity in the data. We can experiment by adding interaction terms or performing a transformation.
# normal probability plot
pplot = sm.ProbPlot(residuals)
fig = pplot.qqplot(line='s')
plt.show()
This follows pretty close to the line except for a few at the top. Let’s see if these are overly influential values:
# influence plot
fig = sm.graphics.influence_plot(sm_model, criterion='cooks')
plt.show()
# Cook's distance - rule of thumb, drop observations with distance over 4/n
lm_cooks = sm_model.get_influence().cooks_distance[0]
n = len(X_train)
# calculate d
d = 4/n
print('Criticial distance: ', d)
out_d = lm_cooks > d
print(X_train.index[out_d], "\n", lm_cooks[out_d])
Criticial distance: 0.03389830508474576
Int64Index([137, 142, 111, 143, 144, 60, 55, 72], dtype='int64')
[0.04262362 0.24729002 0.09532826 0.16539705 0.11011024 0.04889406
0.05648581 0.15169978]
Clearly, there are some points that seem to have a high amount of leverage or influence, such as 142, 143, 144, and 72.
Explore Different Variations of the Model
# Fit model with only one length column
X = df3.drop(columns=['Weight', 'Vertical_Length', 'Diagonal_Length'])
y = df3[['Weight']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_hat = lr.predict(X_test)
r2_score(y_test, y_hat)
0.943429129558069
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % math.sqrt(mean_absolute_error(y_test, y_hat)))
MSE: 7594.113
RMSE: 87.144
MAE: 64.253
# Use statsmodels
X_sm = sm.add_constant(X)
sm_model = sm.OLS(y, X_sm).fit()
sm_model.summary()
This has slightly improved by looking at some of the values in the summary table. Let’s go ahead and see if an interaction term improves the model. Thinking about weight, maybe volume is a good indicator:
df3['Volume'] = df3['Height']*df3['Width']*df3['Vertical_Length']
X = df3.drop(columns=['Weight','Diagonal_Length', 'Cross_Length'])
y = df3[['Weight']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_hat = lr.predict(X_test)
print('R2: ', r2_score(y_test, y_hat))
print("Adj r2: ", 1 - (1-lr.score(X_test, y_test))*(len(y_test)-1)/(len(y_test) - X_test.shape[1]-1))
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % math.sqrt(mean_absolute_error(y_test, y_hat)))
R2: 0.9701085869761354
Adj r2: 0.9558745807742952
MSE: 4012.644
RMSE: 63.345
MAE: 46.916
# statsmodels - interaction term
X_sm = sm.add_constant(X)
sm_model1 = sm.OLS(y, X_sm).fit()
sm_model1.summary()
residuals = sm_model1.resid
plt.scatter(sm_model1.predict(), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residual')
plt.show()
# normal probability plot
pplot = sm.ProbPlot(residuals)
fig = pplot.qqplot(line='s')
plt.show()
Adding the interacting term seems to have helped it look more random. But I don’t like the funnel shape that is produced. We may need to perform a transformation to make it more linear. In the normal probability plot, there are still some questionable points at the top, and now at the bottom. But the rest fall pretty close to the line, but not entirely normally distributed.
Let’s check out some outliers:
# Cook's distance - rule of thumb, drop observations with distance over 4/n
lm_cooks = sm_model1.get_influence().cooks_distance[0]
n = len(X)
# calculate d
d = 4/n
print('Criticial distance: ', d)
out_d = lm_cooks > d
print(X.index[out_d], "\n", lm_cooks[out_d])
Criticial distance: 0.02531645569620253
Int64Index([13, 29, 32, 33, 34, 60, 72, 111, 123, 126, 127, 137, 142, 143], dtype='int64')
[0.04353014 0.05292537 0.07018666 0.08754108 0.05220421 0.02642313
0.0283841 0.12750454 0.03541461 0.06505045 0.03392408 0.03301345
0.54359171 0.3697363 ]
Here, it is clear that points 142, 143, and 144 have a lot of leverage and influence. Let’s try removing them.
# outlier removal
df3 = df3.drop([142, 143, 144])
X = df3.drop(columns=['Weight','Diagonal_Length', 'Cross_Length'])
y = df3[['Weight']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_hat = lr.predict(X_test)
print('R2: ', r2_score(y_test, y_hat))
print("Adj r2: ", 1 - (1-lr.score(X_test, y_test))*(len(y_test)-1)/(len(y_test) - X_test.shape[1]-1))
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % mean_absolute_error(y_test, y_hat))
R2: 0.9858885592047293
Adj r2: 0.9788328388070939
MSE: 1577.113
RMSE: 39.713
MAE: 28.596
X_sm = sm.add_constant(X)
sm_model2 = sm.OLS(y, X_sm).fit()
sm_model2.summary()
residuals = sm_model2.resid
plt.scatter(sm_model2.predict(), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residual')
plt.show()
# normal probability plot
pplot = sm.ProbPlot(residuals)
fig = pplot.qqplot(line='s')
plt.show()
It still looks pretty much the same. With the funnel shape of the residuals, and the nonlinearity of the response variable, we performed the interaction term, but we may need to do a transformation to make it more linear. I will apply a log transformation of the response variable.
X = df3.drop(columns=['Weight','Diagonal_Length', 'Cross_Length'])
ln_y = np.log(df3.Weight)
X_train, X_test, y_train, y_test = train_test_split(X, ln_y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_hat = lr.predict(X_test)
print('R2: ', r2_score(y_test, y_hat))
print("Adj r2: ", 1 - (1-lr.score(X_test, y_test))*(len(y_test)-1)/(len(y_test) - X_test.shape[1]-1))
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % mean_absolute_error(y_test, y_hat))
R2: 0.9926637216147842
Adj r2: 0.9889955824221763
MSE: 0.013
RMSE: 0.113
MAE: 0.091
X_sm = sm.add_constant(X)
sm_model3 = sm.OLS(ln_y, X_sm).fit()
sm_model3.summary()
residuals = sm_model3.resid
plt.scatter(sm_model3.predict(), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residual')
plt.show()
# normal probability plot
pplot = sm.ProbPlot(residuals)
fig = pplot.qqplot(line='s')
plt.show()
These look a lot better! Although clearly there are still a couple of outliers.
# Cook's distance - rule of thumb, drop observations with distance over 4/n
lm_cooks = sm_model3.get_influence().cooks_distance[0]
n = len(X)
# calculate d
d = 4/n
print('Criticial distance: ', d)
out_d = lm_cooks > d
print(X.index[out_d], "\n", lm_cooks[out_d])
Criticial distance: 0.025806451612903226
Int64Index([13, 58, 72, 117, 129], dtype='int64')
[0.0517299 0.04511071 1.42329489 0.02651304 0.03879586]
# remove point 13 and 72
df4 = df3.drop([13, 72])
ln_y = np.log(df4.Weight)
X = df4.drop(columns=['Weight','Diagonal_Length', 'Cross_Length'])
X_train, X_test, y_train, y_test = train_test_split(X, ln_y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_hat = lr.predict(X_test)
print('R2: ', r2_score(y_test, y_hat))
print("Adj r2: ", 1 - (1-lr.score(X_test, y_test))*(len(y_test)-1)/(len(y_test) - X_test.shape[1]-1))
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % mean_absolute_error(y_test, y_hat))
R2: 0.9889376169433531
Adj r2: 0.9834064254150297
MSE: 0.008
RMSE: 0.088
MAE: 0.074
# cross validation
scores = cross_val_score(lr, X_train, y_train, scoring='r2', cv=5)
print(scores)
print('R2: ', np.mean(scores))
scores2 = cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
print(scores2)
print('RMSE: ', math.sqrt(np.mean(abs(scores2))))
[0.99328735 0.99616257 0.99699811 0.99280587 0.99479305]
R2: 0.9948093908684947
[-0.00890102 -0.00966734 -0.00552462 -0.01043415 -0.0089092 ]
RMSE: 0.0932054951192439
X_sm = sm.add_constant(X) sm_model4 = sm.OLS(ln_y, X_sm).fit() sm_model4.summary()
residuals = sm_model4.resid
plt.scatter(sm_model4.predict(), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residual')
plt.show()
# normal probability plot
pplot = sm.ProbPlot(residuals)
fig = pplot.qqplot(line='s')
plt.show()
These plots look a lot better to me, and I am confident in the model. Here is a table summarizing some of the values for each model I tested:
As you can see, there are high R2 and adjusted R2 values, as well as a low p-value (using a significance level of 0.05) for the f statistic. This tells us that the sample data provides sufficient evidence to conclude that the regression model fits the data better than the model with no independent variables.
Another statistic that assesses how well a regression model fits a data set is the root mean square error (RMSE). This metric shows the average distance between the predicted values and the actual values in the data set. The lower the RMSE, the better a given model is able to fit a data set
For the Durbin-Watson, this is a test for autocorrelation. A value of 2 indicates zero autocorrelation, our value is below 2, indicating a positive autocorrelation. A rule of thumb is that values between 1.5 and 2.5 are relatively normal. Even though our value is 1.345, I would say given the nature of the data, this is pretty close.
Conclusion
Thinking back to our scenario of the fisherman caught cheating, if we caught a bream with measurements length=24, height=12 and width=5, then we could make a reasonable guess as to what the weight of that fish would be, as in the following example.
# length, height, width, parkki, perch, pike, roach, smelt, whitefish, volume (Multiply first three numbers)
pred = lr.predict(np.array([[24.0, 12.0, 5.0, 0, 0, 0, 0, 0, 0, 24*12*5]]))
print(math.exp(pred))
323.4571016610546
Another example, if we caught a Pike with average measurements like in the data below, then we get a pretty close guess to the average weight.
df_pike = df4[df4['Species_Pike']==1]
df_pike.describe()
pred2 = lr.predict(np.array([[39.3, 7.2, 4.7, 0, 0, 1, 0, 0, 0, 39.3*7.2*4.7]]))
print(math.exp(pred2))
503.59805282570085
These predictions look fairly close to me. So I think this model is definitely good enough. Of course, in a real life scenario, one would also compare other models such as random forest. For this data set, I would also look into lasso or ridge regression and see how they compare.
If you liked this analysis walkthrough, leave a comment below. If I missed something or gotten something incorrect, please leave a comment! I am always trying to learn and improve. Don’t forget to check out the data set on Kaggle and look at the full notebook on my GitHub.
References
- Montgomery, D. C., Peck, E. A., & Vining, G. G. (2013). Introduction to linear regression analysis. Wiley-Blackwell.