
Predicting the Weight of Fish Using Multiple Linear Regression 🐟

I recently read an article about two men who were caught cheating during a fishing tournament. Being an avid fisherman myself, I couldn’t help but wonder if there could have been a way to detect the cheating with an algorithm. In this post, I am going to use a multiple linear regression model to predict the weight of a fish based on certain measurements.

When catching a fish, one may want to record the weight along with several other measurements. I can imagine visually inspecting a fish and providing a best guess as to how much it weighs. If I saw a rather small perch or walleye, grabbed it, and thought “wow, this is heavier than I expected,” then I may want to investigate further. And this is exactly how the two fishermen were caught! The tournament organizer looked at the rather normal-looking fish and thought they couldn’t possibly weigh that much. If a human can come to that conclusion with their own intuition, what could a machine learning model do? Perhaps it could prevent things like this from happening…

Before you start…

In order to apply linear regression, there are some assumptions about the underlying data that need to be met for it to be a valid model:

  1. Assume the relationship between the independent and dependent variables is linear.
  2. Assume the residuals are normally distributed.
  3. Assume there is little or no multicollinearity among the independent variables.
  4. Assume the residuals are independent of each other (a violation of this is called autocorrelation).
  5. Assume homoscedasticity, also called constant variance, of the residuals.

Throughout this analysis, we will check for these assumptions.

The data set

This got me thinking and I scoured the internet for a relatable data set. Unfortunately, there aren’t that many freely available data sets on fishing tournaments and fish weights, so I settled for getting creative with the fish market data set from Kaggle.

This analysis is just a first pass at a simple model, and of course one could always take it a step further. Feel free to follow along with the completed notebook on my GitHub.

Setting up the notebook

You may need the following libraries for what we are going to do:

# BASE -----------------------------------
import numpy as np
import pandas as pd
import math
from scipy import stats

# VISUALIZATIONS -------------------------
import matplotlib.pyplot as plt
import seaborn as sns

# MODELING & ANALYSIS --------------------
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from statsmodels.tools.eval_measures import rmse
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

Explore the data set

# load in the data
df = pd.read_csv(r'C:\...\Fish.csv')

print(df.columns)
print(df.shape)
Index(['Species', 'Weight', 'Length1', 'Length2', 'Length3', 'Height',
       'Width'],
      dtype='object')
(159, 7)
# Look at the first three rows of the data frame
df.head(3)

I don’t like the ambiguous column names, so let’s rename them according to the data set documentation:

Column descriptions

Weight – weight of the fish in grams
Length1 – vertical length in cm
Length2 – diagonal length in cm
Length3 – cross length in cm
Height – height in cm
Width – diagonal width in cm

df = df.rename({'Length1':'Vertical_Length', 'Length2':'Diagonal_Length', 'Length3':'Cross_Length'}, axis='columns')

Check for nulls

null_sum = df.isna().sum()
print(null_sum)
Species            0
Weight             0
Vertical_Length    0
Diagonal_Length    0
Cross_Length       0
Height             0
Width              0
dtype: int64

Next, I want to see how each of the variables relate to each other. I will use a scatter plot matrix to view the relationship:

# Create the scatter plot matrix
sns.pairplot(df, kind="scatter", hue='Species')
plt.show()

This scatter plot matrix (correlogram) suggests that a linear relationship may be appropriate for this data set (although it does appear that a nonlinear model might be better), but for the sake of this example, we will continue with the linear assumption.

Possible model

In the textbook “Introduction to Linear Regression Analysis” by Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining, the authors state: “Any regression model that is linear in the parameters is a linear regression model, regardless of the shape that it may generate.”

This means that the weight of a fish (y) can be computed using a linear combination of its different length measurements given the function:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

with $\beta_0$ representing the constant (intercept) term.

Now we will learn the best values for our betas from the data, which is where the process of training the model begins. A trained model is one where the beta values have been decided. Because we fix the form of the function before training, with the parameters pre-built into the formula, this is what constitutes a parametric model.
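To make the idea of “learning the betas” concrete, here is a minimal sketch with made-up measurements (the numbers are purely illustrative, not from the fish data set); np.linalg.lstsq solves for the coefficients that minimize the squared error:

# Minimal OLS sketch with made-up numbers: find the betas that
# minimize ||X_demo @ beta - y_demo||^2.
X_demo = np.array([[1.0, 23.2, 11.5],   # column of ones -> beta_0 (intercept)
                   [1.0, 24.0, 12.7],   # then two measurement columns
                   [1.0, 26.3, 13.1],
                   [1.0, 29.0, 14.0]])
y_demo = np.array([242.0, 290.0, 340.0, 363.0])  # weights in grams

beta, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(beta)  # [beta_0, beta_1, beta_2]

This is exactly what LinearRegression and statsmodels’ OLS will do for us below, just wrapped in a friendlier interface.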

Exploratory Data Analysis (EDA)

Explore relationship with Weight variable

Now, I want to look at the relationship of the variables with the weight variable a little more:

# separate into category and numerical columns
df_cat = df[['Species']]
df_num = df.select_dtypes(include='number')

# create histograms with density curve
fig, axes = plt.subplots(2, 3, figsize=(20,10))
fig.subplots_adjust(wspace=0.1, hspace=0.3)

axes = axes.ravel()

for i, col in enumerate(df_num):
    sns.histplot(data = df[col], ax=axes[i], kde=True)
plt.show()

# plot the scatter plot of the weight variable against the other variables
sns.pairplot(df, 
             x_vars=['Vertical_Length', 'Diagonal_Length', 'Cross_Length', 'Height', 'Width'],
             y_vars=['Weight'], hue='Species')
plt.show()
Histograms with density curves for each numeric variable
Scatter plots of Weight against the other measurement variables

Summary statistics

Next, let us take a look at some summary statistics to get a general picture of the data:

df.describe()
  • Notice there are some values where the weight is equal to zero, which is impossible; these may be input errors, so we could remove those values
  • I should explore each category of species to see if any might be out of the ordinary
species = df['Species'].value_counts()
print(species)
Perch        56
Bream        35
Roach        20
Pike         17
Smelt        14
Parkki       11
Whitefish     6
Name: Species, dtype: int64
# Weight cannot be 0g, so should look at those values
zeros = df[df['Weight'] == 0]
zeros

A fish cannot possibly weigh 0 g, so we will assume for our example that the value was entered in error, and delete this record.

I also see in the summary statistics table, there are a few really high values. Let us take a look at those:

# high values
df[142:145]

These values seem pretty high compared to the rest of the Pike weights, but I have no way to investigate if the data were entered incorrectly, or if some other error occurred, so we will assume they are valid data and keep them in the model until we do further outlier analysis.

# new df with bad value removed and view summary statistics
df1 = df[df['Weight'] != 0]
df1.describe()

Multicollinearity

# correlation heatmap (numeric columns only; newer pandas requires this explicitly)
sns.heatmap(df1.select_dtypes(include='number').corr(), annot=True, vmin=0, vmax=1, cmap="rocket_r")
plt.show()

Note that all of the variables are highly correlated with one another, especially the three length variables; Height has the fewest issues with this.
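To quantify this beyond the heatmap, we can compute variance inflation factors with the variance_inflation_factor import from earlier (a quick sketch; a common rule of thumb is that a VIF above 10 signals serious multicollinearity):

# Variance inflation factors for the numeric predictors.
# Rule of thumb: VIF > 10 indicates serious multicollinearity.
X_vif = sm.add_constant(df1.select_dtypes(include='number').drop(columns=['Weight']))
vif_table = pd.DataFrame({
    'feature': X_vif.columns,
    'VIF': [vif(X_vif.values, i) for i in range(X_vif.shape[1])]
})
print(vif_table)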

Outlier analysis

# separate into category and numerical columns
df1_cat = df1[['Species']]
df1_num = df1.select_dtypes(include='number')

# create box plots for outlier analysis
fig, axes = plt.subplots(2, 3, figsize=(16, 8))
axes = axes.ravel()
fig.subplots_adjust(wspace=0.4, hspace=0.4)
fig.suptitle('Box Plot - Outlier Analysis', fontsize=20)

for i, col in enumerate(df1_num):
    sns.boxplot(data=df1_num[col], ax=axes[i], orient='h')
    axes[i].set_title(col)
    
plt.show()

Notice there are three values that might be a cause for concern, so we will fit the model with and without them, and compare the results.
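As a programmatic companion to the box plots, a quick check using the standard 1.5 × IQR whisker rule (a sketch) lists the rows the plots flag:

# Flag rows with any value beyond the 1.5*IQR whiskers used by the box plots.
q1, q3 = df1_num.quantile(0.25), df1_num.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df1_num < q1 - 1.5 * iqr) | (df1_num > q3 + 1.5 * iqr)
print(df1_num[outlier_mask.any(axis=1)])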

Prepare the data for modeling

# categorical encoding
df2 = pd.get_dummies(df1)

# drop one species column to avoid the dummy variable trap
# (its effect is absorbed into the intercept)
df3 = df2.drop(columns=['Species_Bream'])

Model Building

At the time of this writing, sklearn does not have functions for residual analysis and some of the other diagnostics I prefer to use when performing linear regression. Because of this, I will use both sklearn and statsmodels and see how they compare.

Split into training and test sets

# we will split using the data frame with the dummy variables
# Note: this is the full data frame - no variables or outliers removed yet
X = df3.iloc[:, 1:]   # every column except Weight
y = df3[['Weight']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print('X_train: ', np.shape(X_train))
print('y_train: ', np.shape(y_train))
X_train:  (118, 11)
y_train:  (118, 1)

Define and fit the model

Using sklearn

model = LinearRegression()
model.fit(X_train, y_train)

# Parameters of the model and other values
print('Model intercept: ', model.intercept_)
print('Model coefficients: ', model.coef_)
# note that this scores the model on the full data set (train + test)
print('Model score: ', model.score(X, y))
Model intercept:  [-785.67562956]
Model coefficients:  [[ -63.17797578   71.70850333   27.81978955  -11.52939024    6.42359181
   133.76631624   34.38721174 -362.58082738    9.46492511  306.4493073
    63.20534515]]
Model score:  0.9329499697782535
# predicting weights from the test set
y_hat = model.predict(X_test)
# Analyzing r2 value of model
r2_score(y_test, y_hat)
0.9482451798572478
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % mean_absolute_error(y_test, y_hat))
MSE: 6790.893
RMSE: 82.407
MAE: 62.558

Cross validation

cross_val_score_train = cross_val_score(model, X_train, y_train, cv=10, scoring='r2')
print(cross_val_score_train)
print(cross_val_score_train.mean())
[0.91333352 0.96324695 0.93654314 0.9341261  0.86007924 0.9247442
 0.82266236 0.88844178 0.84039499 0.60139132]
0.8684963615355723

Using the statsmodels package

X_sm = sm.add_constant(X_train)
sm_model = sm.OLS(y_train, X_sm).fit()

sm_model.summary()

Residual Analysis

# using full model
residuals = sm_model.resid

plt.scatter(sm_model.predict(), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residual')
plt.show()


This indicates some potential nonlinearity in the data. We can experiment by adding interaction terms or performing a transformation.

# normal probability plot
pplot = sm.ProbPlot(residuals)
fig = pplot.qqplot(line='s')
plt.show()

This follows the line pretty closely except for a few points at the top. Let’s see if these are overly influential values:

# influence plot
fig = sm.graphics.influence_plot(sm_model, criterion='cooks')
plt.show()
# Cook's distance - rule of thumb, drop observations with distance over 4/n
lm_cooks = sm_model.get_influence().cooks_distance[0]

n = len(X_train)

# calculate d
d = 4/n
print('Critical distance: ', d)

out_d = lm_cooks > d

print(X_train.index[out_d], "\n", lm_cooks[out_d])
Critical distance:  0.03389830508474576
Int64Index([137, 142, 111, 143, 144, 60, 55, 72], dtype='int64') 
 [0.04262362 0.24729002 0.09532826 0.16539705 0.11011024 0.04889406
 0.05648581 0.15169978]

Clearly, there are some points that seem to have a high amount of leverage or influence, such as 142, 143, 144, and 72.
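As a refresher, Cook’s distance for observation $i$ combines the size of its residual with its leverage, so points that are both poorly fit and far from the bulk of the predictor space score highest:

$$D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where $e_i$ is the residual, $h_{ii}$ the leverage, $p$ the number of parameters, and $\hat{\sigma}^2$ the residual mean square.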

Explore Different Variations of the Model

# Fit model with only one length column
X = df3.drop(columns=['Weight', 'Vertical_Length', 'Diagonal_Length'])
y = df3[['Weight']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_hat = lr.predict(X_test)

r2_score(y_test, y_hat)
0.943429129558069
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % math.sqrt(mean_absolute_error(y_test, y_hat)))
MSE: 7594.113
RMSE: 87.144
MAE: 64.253
# Use statsmodels
X_sm = sm.add_constant(X)
sm_model = sm.OLS(y, X_sm).fit()

sm_model.summary()

Looking at some of the values in the summary table, this is a slight improvement. Let’s go ahead and see if an interaction term improves the model further. Thinking about weight, maybe volume is a good indicator:

df3['Volume'] = df3['Height']*df3['Width']*df3['Vertical_Length']

X = df3.drop(columns=['Weight','Diagonal_Length', 'Cross_Length'])
y = df3[['Weight']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_hat = lr.predict(X_test)

print('R2: ', r2_score(y_test, y_hat))
print("Adj r2: ", 1 - (1-lr.score(X_test, y_test))*(len(y_test)-1)/(len(y_test) - X_test.shape[1]-1))
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % math.sqrt(mean_absolute_error(y_test, y_hat)))
R2:  0.9701085869761354
Adj r2:  0.9558745807742952
MSE: 4012.644
RMSE: 63.345
MAE: 46.916
# statsmodels - interaction term
X_sm = sm.add_constant(X)
sm_model1 = sm.OLS(y, X_sm).fit()

sm_model1.summary()
residuals = sm_model1.resid

plt.scatter(sm_model1.predict(), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residual')
plt.show()

# normal probability plot
pplot = sm.ProbPlot(residuals)
fig = pplot.qqplot(line='s')
plt.show()

Adding the interaction term seems to have made the residuals look more random, but I don’t like the funnel shape that is produced; we may need to perform a transformation to stabilize the variance. In the normal probability plot, there are still some questionable points at the top, and now at the bottom. The rest fall pretty close to the line, but the residuals are not entirely normally distributed.

Let’s check out some outliers:

# Cook's distance - rule of thumb, drop observations with distance over 4/n
lm_cooks = sm_model1.get_influence().cooks_distance[0]

n = len(X)

# calculate d
d = 4/n
print('Critical distance: ', d)

out_d = lm_cooks > d

print(X.index[out_d], "\n", lm_cooks[out_d])
Critical distance:  0.02531645569620253
Int64Index([13, 29, 32, 33, 34, 60, 72, 111, 123, 126, 127, 137, 142, 143], dtype='int64') 
 [0.04353014 0.05292537 0.07018666 0.08754108 0.05220421 0.02642313
 0.0283841  0.12750454 0.03541461 0.06505045 0.03392408 0.03301345
 0.54359171 0.3697363 ]

Here, it is clear that points 142 and 143 have a lot of leverage and influence (and 144 was flagged in the first model). Let’s try removing all three.

# outlier removal
df3 = df3.drop([142, 143, 144])
X = df3.drop(columns=['Weight','Diagonal_Length', 'Cross_Length'])
y = df3[['Weight']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_hat = lr.predict(X_test)

print('R2: ', r2_score(y_test, y_hat))
print("Adj r2: ", 1 - (1-lr.score(X_test, y_test))*(len(y_test)-1)/(len(y_test) - X_test.shape[1]-1))
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % mean_absolute_error(y_test, y_hat))
R2:  0.9858885592047293
Adj r2:  0.9788328388070939
MSE: 1577.113
RMSE: 39.713
MAE: 28.596
X_sm = sm.add_constant(X)
sm_model2 = sm.OLS(y, X_sm).fit()

sm_model2.summary()
residuals = sm_model2.resid

plt.scatter(sm_model2.predict(), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residual')
plt.show()

# normal probability plot
pplot = sm.ProbPlot(residuals)
fig = pplot.qqplot(line='s')
plt.show()

The plots still look pretty much the same. The interaction term helped, but given the funnel shape of the residuals and the nonlinearity in the response, we may need a transformation to make the relationship more linear and stabilize the variance. I will apply a log transformation to the response variable. (Note that after the transformation, the error metrics below are on the log scale, so they are not directly comparable to the earlier models.)
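The model we are now fitting is

$$\ln(y) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$$

so predictions must be exponentiated to get back to grams, which is why the conclusion below wraps each prediction in math.exp.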

X = df3.drop(columns=['Weight','Diagonal_Length', 'Cross_Length'])
ln_y = np.log(df3.Weight)

X_train, X_test, y_train, y_test = train_test_split(X, ln_y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_hat = lr.predict(X_test)

print('R2: ', r2_score(y_test, y_hat))
print("Adj r2: ", 1 - (1-lr.score(X_test, y_test))*(len(y_test)-1)/(len(y_test) - X_test.shape[1]-1))
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % mean_absolute_error(y_test, y_hat))
R2:  0.9926637216147842
Adj r2:  0.9889955824221763
MSE: 0.013
RMSE: 0.113
MAE: 0.091
X_sm = sm.add_constant(X)
sm_model3 = sm.OLS(ln_y, X_sm).fit()

sm_model3.summary()
residuals = sm_model3.resid

plt.scatter(sm_model3.predict(), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residual')
plt.show()

# normal probability plot
pplot = sm.ProbPlot(residuals)
fig = pplot.qqplot(line='s')
plt.show()

These look a lot better! Although clearly there are still a couple of outliers.

# Cook's distance - rule of thumb, drop observations with distance over 4/n
lm_cooks = sm_model3.get_influence().cooks_distance[0]

n = len(X)

# calculate d
d = 4/n
print('Critical distance: ', d)

out_d = lm_cooks > d

print(X.index[out_d], "\n", lm_cooks[out_d])
Critical distance:  0.025806451612903226
Int64Index([13, 58, 72, 117, 129], dtype='int64') 
 [0.0517299  0.04511071 1.42329489 0.02651304 0.03879586]
# remove points 13 and 72

df4 = df3.drop([13, 72])

ln_y = np.log(df4.Weight)
X = df4.drop(columns=['Weight','Diagonal_Length', 'Cross_Length'])


X_train, X_test, y_train, y_test = train_test_split(X, ln_y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_hat = lr.predict(X_test)

print('R2: ', r2_score(y_test, y_hat))
print("Adj r2: ", 1 - (1-lr.score(X_test, y_test))*(len(y_test)-1)/(len(y_test) - X_test.shape[1]-1))
print("MSE: %.3f" % mean_squared_error(y_test, y_hat))
print("RMSE: %.3f" % math.sqrt(mean_squared_error(y_test, y_hat)))
print("MAE: %.3f" % mean_absolute_error(y_test, y_hat))
R2:  0.9889376169433531
Adj r2:  0.9834064254150297
MSE: 0.008
RMSE: 0.088
MAE: 0.074
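Since these errors are on the log scale, one way to compare against the earlier models is to back-transform the predictions to grams first (a quick sketch):

# Back-transform log-scale predictions to grams so the error metrics
# are comparable with the earlier (untransformed) models.
y_hat_grams = np.exp(y_hat)
y_test_grams = np.exp(y_test)
print("RMSE (grams): %.3f" % math.sqrt(mean_squared_error(y_test_grams, y_hat_grams)))
print("MAE (grams): %.3f" % mean_absolute_error(y_test_grams, y_hat_grams))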
# cross validation
scores = cross_val_score(lr, X_train, y_train, scoring='r2', cv=5)
print(scores)
print('R2: ', np.mean(scores))

scores2 = cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
print(scores2)
print('RMSE: ', math.sqrt(np.mean(abs(scores2))))
[0.99328735 0.99616257 0.99699811 0.99280587 0.99479305]
R2:  0.9948093908684947
[-0.00890102 -0.00966734 -0.00552462 -0.01043415 -0.0089092 ]
RMSE:  0.0932054951192439
X_sm = sm.add_constant(X)
sm_model4 = sm.OLS(ln_y, X_sm).fit()

sm_model4.summary()
residuals = sm_model4.resid

plt.scatter(sm_model4.predict(), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residual')
plt.show()

# normal probability plot
pplot = sm.ProbPlot(residuals)
fig = pplot.qqplot(line='s')
plt.show()

These plots look a lot better to me, and I am confident in the model. Here is a table summarizing the hold-out test metrics for each model I tested (compiled from the results above; the last two models are on the log scale):

Model                                    R2       RMSE     MAE
Full model (all variables)               0.948    82.41    62.56
One length variable                      0.943    87.14    64.25
+ Volume interaction term                0.970    63.35    46.92
+ Outliers 142-144 removed               0.986    39.71    28.60
+ log(Weight)                            0.993    0.113*   0.091*
+ Points 13 and 72 removed (final)       0.989    0.088*   0.074*

*log scale

As you can see, there are high R2 and adjusted R2 values, as well as a low p-value (using a significance level of 0.05) for the F-statistic. This tells us that the sample data provide sufficient evidence to conclude that the regression model fits the data better than a model with no independent variables.

Another statistic that assesses how well a regression model fits a data set is the root mean square error (RMSE). This metric measures the average distance between the predicted values and the actual values in the data set; the lower the RMSE, the better a given model fits the data.
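Concretely, for n observations with actual values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$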

The Durbin–Watson statistic is a test for autocorrelation: a value of 2 indicates zero autocorrelation, and values below 2 indicate positive autocorrelation. A rule of thumb is that values between 1.5 and 2.5 are relatively normal. Even though our value of 1.345 falls just below that range, given the nature of the data I would say this is pretty close.

Conclusion

Thinking back to our scenario of the fishermen caught cheating, if we caught a bream with measurements length = 24, height = 12, and width = 5, then we could make a reasonable guess as to what the weight of that fish would be, as in the following example.

# length, height, width, parkki, perch, pike, roach, smelt, whitefish, volume (Multiply first three numbers)

pred = lr.predict(np.array([[24.0, 12.0, 5.0, 0, 0, 0, 0, 0, 0, 24*12*5]]))
print(math.exp(pred))
323.4571016610546
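Since lr was fit on a DataFrame, passing a plain NumPy array works but triggers sklearn’s feature-name warning. A sketch of the same prediction with named columns avoids that (new_fish is a hypothetical helper name):

# Same prediction, but with a DataFrame that reuses the training column
# names and order, avoiding sklearn's feature-name mismatch warning.
new_fish = pd.DataFrame([[24.0, 12.0, 5.0, 0, 0, 0, 0, 0, 0, 24*12*5]],
                        columns=X_train.columns)
print(math.exp(lr.predict(new_fish)[0]))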

As another example, if we caught a Pike with average measurements like in the data below, then we get a pretty close guess to the average weight.

df_pike = df4[df4['Species_Pike']==1]
df_pike.describe()
pred2 = lr.predict(np.array([[39.3, 7.2, 4.7, 0, 0, 1, 0, 0, 0, 39.3*7.2*4.7]]))
print(math.exp(pred2))
503.59805282570085

These predictions look fairly close to me, so I think this model is definitely good enough. Of course, in a real-life scenario, one would also compare other models, such as a random forest (sketched below). For this data set, I would also look into lasso or ridge regression and see how they compare.
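As a quick illustration of that kind of comparison, here is a minimal sketch using the RandomForestRegressor import from earlier on the same final train/test split (the hyperparameters are arbitrary, not tuned):

# Baseline random forest on the same final (log-transformed) split.
# Hyperparameters are arbitrary; GridSearchCV could tune them properly.
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print('Random forest R2: ', r2_score(y_test, rf.predict(X_test)))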

If you liked this analysis walkthrough, leave a comment below. If I missed something or got something incorrect, please leave a comment! I am always trying to learn and improve. Don’t forget to check out the data set on Kaggle and look at the full notebook on my GitHub.

References

  • Montgomery, D. C., Peck, E. A., & Vining, G. G. (2013). Introduction to linear regression analysis. Wiley-Blackwell.