Linear Regression

We now move on to a major application of data science: linear regression. Linear regression is an extremely powerful tool used in both academia and industry to identify trends. In essence, linear regression is a way to find a “line of best fit” given data. For example, here is a linear regression model plotting the relationship between a diamond’s carat (its weight) and its price (data is from the diamonds dataset on Kaggle).

In this case it is fairly obvious to see with our eyes that a bigger carat translates to a more expensive diamond, but linear regression allows us to quantify that relationship, which is crucial to the success of businesses and research projects alike. We can predict how much a diamond is worth based on its carat, the revenue of a coffee shop based on the number of customers, the impact of advertising based on monthly sales, and so much more.
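If you want to reproduce a plot like this yourself, here is a minimal sketch using the copy of the diamonds dataset that ships with seaborn (the Kaggle CSV works the same way once loaded into a dataframe):

import seaborn as sns
import matplotlib.pyplot as plt

# Load seaborn's bundled copy of the diamonds dataset
diamonds = sns.load_dataset("diamonds")

# Scatter plot of carat vs. price with a fitted regression line overlaid
sns.lmplot(x="carat", y="price", data=diamonds, scatter_kws={"alpha": 0.1, "s": 5}, line_kws={"color": "red"})
plt.show()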

What is the Math Behind Simple Linear Regression?

Simple linear regression is the term used to describe a situation like we just saw with diamonds. There is one target variable (such as the price of the diamond) and one predictor variable (the carat of the diamond). Mathematically, the equation looks like this:

y = β₀ + β₁x + ε

  • y is the target variable
  • β₀ is the intercept term
  • β₁ is the coefficient corresponding to the independent variable
  • x is the independent variable
  • ε is the error term

The equation is very similar to y = mx + b, but now it includes an error term. Using a method called ordinary least squares (OLS), we can estimate the value of each parameter. For those interested, a full proof is available here; it is fairly straightforward but requires a basic understanding of calculus. Once that is complete, we get the following values for each of the parameters:

β₀ = ȳ − β₁x̄
β₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²

  • ȳ is the average of the y values
  • x̄ is the average of the x values
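
As a quick illustration, the two formulas above can be computed directly with NumPy. This is a minimal sketch on made-up carat and price values, purely to show the arithmetic:

import numpy as np

# Made-up carat (x) and price (y) values, just to illustrate the arithmetic
x = np.array([0.3, 0.5, 0.7, 1.0, 1.2, 1.5])
y = np.array([450, 900, 1600, 3200, 4500, 7000])

x_bar = x.mean()
y_bar = y.mean()

# Slope: sum of (xi - x̄)(yi - ȳ) divided by sum of (xi - x̄)²
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: ȳ - β₁x̄
beta_0 = y_bar - beta_1 * x_bar

print(beta_0, beta_1)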

Multiple Regression

Oftentimes there are multiple parameters of interest that are influencing our data. Recalling the diamond example, the carat of a diamond is important, but so are its cut, color, brilliance, and a host of other factors. When trying to estimate a car’s fuel efficiency, we would want to consider the car’s weight, speed, engine model, and manufacturing year. Setting up an equation would look like this:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + . . . + βₚxₚ + ε

where each βᵢxᵢ pairs a coefficient with an independent variable of interest. Even with multiple independent variables the model is still linear, though the “line” becomes a flat surface (a hyperplane) in a higher-dimensional space. For example, the image below shows a regression surface constructed in a three-dimensional space.

But regardless of the number of input variables, we can still estimate the coefficients from the data. This is done by transforming the linear equation into a matrix equation such that

y = Xβ + ε

When there are n observations and p predictors,

  • y is an n × 1 vector of the response variable
  • X is an n × (p+1) matrix containing the value of each predictor for each observation, plus a column of 1s for the intercept term
  • β is a (p+1) × 1 vector of the coefficients
  • ε is an n × 1 vector of error terms

Using the OLS method, we find that

β̂ = (XᵀX)⁻¹Xᵀy

So the estimate of each βᵢ is found in row i of the vector β̂ (remember the rows are indexed starting at 0, so β₀ sits in the first row). The proof is found here. It uses the same principles of calculus but also requires some knowledge of linear algebra.
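
To make the matrix formula concrete, here is a minimal NumPy sketch with a small made-up design matrix; the column of 1s provides the intercept:

import numpy as np

# Made-up data: 5 observations, 2 predictors, plus a column of 1s for the intercept
X = np.array([
    [1, 0.3, 60.0],
    [1, 0.5, 61.5],
    [1, 0.7, 59.8],
    [1, 1.0, 62.1],
    [1, 1.2, 60.7],
])
y = np.array([450, 900, 1600, 3200, 4500])

# β̂ = (XᵀX)⁻¹Xᵀy
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # [β₀, β₁, β₂]

In practice, np.linalg.lstsq (or a library such as statsmodels) is preferred over forming the inverse explicitly, but the formula is the same.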

The most important thing to remember is that, for both simple and multiple regression, the OLS estimates are unbiased and consistent. This means

E[β̂] = β (unbiased), and
β̂ → β as n → ∞ (consistent)
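
A small illustrative simulation (with a made-up true β) shows what this means in practice: if we repeatedly generate data from a known model and average the OLS estimates, the average lands very close to the true coefficients.

import numpy as np

rng = np.random.default_rng(0)
true_beta = np.array([2.0, 3.0])  # made-up true intercept and slope
estimates = []

for _ in range(2000):
    x = rng.uniform(0, 10, size=100)
    eps = rng.normal(0, 1, size=100)           # random error term
    y = true_beta[0] + true_beta[1] * x + eps
    X = np.column_stack([np.ones_like(x), x])  # column of 1s for the intercept
    estimates.append(np.linalg.inv(X.T @ X) @ X.T @ y)

print(np.mean(estimates, axis=0))  # average estimate is approximately [2.0, 3.0]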

Before Building The Model: Checking Assumptions

Before we can build models in Python, there are a few things we must keep in mind. Linear regression requires the following assumptions to be met in order for the model to be accurate:

  • Independence: Observations must be independent, and the error terms therefore must be truly random.
  • No multicollinearity: Our independent variables must be independent of each other, which means they must not be correlated. For numerical variables we can check this with a correlation matrix, and for categorical variables we can check using VIF.
  • Homoscedasticity: Once we have a model, we need to check that the errors contain no trend and have a constant variance.
  • Normal Distribution of Errors: The error terms need to follow a normal distribution.

Before we select the parameters for our model, we need to check that they are not correlated with one another. For numerical variables, we can build a correlation matrix, an example of which is given below:

import pandas as pd

# Select the numerical variables to check (change sample_data to the name of your dataframe)
num_var = sample_data[['carat', 'z', 'table']]

# Compute the correlation matrix
cor_matrix = num_var.corr()

print(cor_matrix)

If two variables are highly correlated with each other, we should exclude one of them from the model. Not only is this mathematically necessary, it also makes practical sense. The two variables move together, so including both adds no new information to the model; it only makes the model more complicated, which is something we want to avoid.

For categorical variables, we can compare Variance Inflation Factor (VIF) scores. We first have to convert these variables into “dummy variables”, after which we can compute the VIF for each column:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Convert categorical variables to dummy variables
X = pd.get_dummies(sample_data[['carat', 'depth', 'table', 'color', 'cut']], drop_first=True) # Change sample_data to the name of your dataframe
# Change the column labels to the categorical columns in your dataframe

# Add constant for intercept (required for statsmodels)
X = sm.add_constant(X)

# Compute VIF for each feature
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

Lower VIF scores mean less multicollinearity. If you receive a score higher than 5, you should strongly consider removing the variable. If the score is above 10, you should remove it without question.
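
As a rough illustration, assuming the vif_data and X frames from the snippet above, the offending columns can be flagged and dropped like this:

# Variables worth a second look (VIF above 5)
print(vif_data[vif_data["VIF"] > 5])

# Drop anything above 10 outright, leaving the intercept column alone
to_drop = vif_data.loc[(vif_data["VIF"] > 10) & (vif_data["Variable"] != "const"), "Variable"]
X = X.drop(columns=to_drop)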

After checking for multicollinearity between independent variables, we know which to eliminate and which to include in our model.
Hint: For numerical variables, checking to see if they have a strong correlation with the target variable can help you find the most important ones to include in your model.
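
For example, assuming price is the target column in sample_data, the numerical variables can be ranked by the strength of their correlation with it:

# Correlation of each numerical column with the target, strongest first
target_corr = sample_data.corr(numeric_only=True)['price'].drop('price')
print(target_corr.abs().sort_values(ascending=False))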

Using Linear Regression in Python

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv('drive/My Drive/Automobile.csv')
df.head()  # Look at the first few rows to understand the data

Seaborn can automatically overlay a line of best fit on a scatter plot using the lmplot function:

sns.lmplot(x="engine_size", y="price", data=df, ci=None)  # order=1 (a straight line) is the default; higher orders fit polynomials
plt.show()

The next step, as always, is to clean the data before building the model. We need to remove NA values.

df.dropna(inplace=True)

Then we separate the data into the independent and dependent variables. This is also the point at which we choose which parameters we want to regress on. We have already identified which parameters show high multicollinearity, and those should be excluded automatically. For this model we will keep all of the variables, since there is no serious multicollinearity, but in a moment we will see a more refined model.

# Separate target variable 
X = df.drop('price', axis=1)
y = df['price']

Now we need to modify any categorical variables into “dummy variables” so that we can perform a regression.

categorical_features = X.select_dtypes(include=['object']).columns

# Encode categorical features
X_encoded = pd.get_dummies(X, columns=categorical_features, drop_first=True)

Then we create the training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.25)  # test_size=0.25 means 75% of the data is used for training and 25% for testing

Once this is done, we can train the model. If you have already identified the key features to regress on, drop the unwanted columns from both the training and testing sets. We also add a constant column so that statsmodels fits an intercept. From here, we can run the model and view its summary statistics.

# X_train = X_train.drop("col A", axis=1)  # drop any unwanted columns from both sets
# X_test = X_test.drop("col A", axis=1)
X_train = sm.add_constant(X_train)  # statsmodels does not add the intercept automatically
X_test = sm.add_constant(X_test)
model = sm.OLS(y_train, X_train).fit()
print(model.summary())

The output will look something like this: a regression summary table showing the fitted coefficients, their standard errors and p-values, R², and other diagnostics.

If you are unable to identify key features, backward elimination is an automated way to perform feature selection. It fits a model with every available feature and repeatedly removes the least significant one until all p-values fall below a designated threshold.

Important: This method does not account for multicollinearity, so you will need to remove those parameters before running the backward elimination function below.

def backward_elimination(X, y, significance_level=0.05):
    """Perform backward elimination and return the final model and selected columns."""
    X_model = X.copy()  # X should already include the 'const' column added earlier
    while True:
        model = sm.OLS(y, X_model).fit()
        p_values = model.pvalues
        p_values = p_values.drop('const')  # never remove the intercept itself
        max_p_value = p_values.max()
        if max_p_value > significance_level:
            worst_feature = p_values.idxmax()
            print(f"Removing '{worst_feature}' with p-value {max_p_value:.4f}")
            X_model = X_model.drop(columns=[worst_feature])
        else:
            break
    return model, X_model.columns

final_model, final_columns = backward_elimination(X_train, y_train)
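
Once it finishes, we can line the test set up with the selected columns and inspect the reduced model (this assumes X_test already has its constant column from earlier):

# Keep only the surviving columns in the test set and review the refined model
X_test_final = X_test[final_columns]
print(final_model.summary())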

Diagnostic Checking

Now that we have models, we need to check that they satisfy the assumptions behind linear regression. We first check that the residuals from each model are truly random and contain no clear trends:

y_pred_model = model.predict(X_test)  # X_test already contains the constant column
residuals_model = y_test - y_pred_model

# Plot residuals vs. predicted values for the first (full) model
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)  # the second subplot can hold the refined model's residuals
plt.scatter(y_pred_model, residuals_model, alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel("Predicted Values (Model)")
plt.ylabel("Residuals (Model)")
plt.title("Residuals vs. Predicted (Model)")
plt.show()

Choosing Between Models

AIC Score (Akaike Information Criterion)

AIC scores models on how well they fit the data while penalizing complexity. If we have two quality models, we can calculate their AIC and pick the one with the lower score. AIC is calculated with this formula:

AIC = 2k - 2ln(L)

where k is the number of estimated parameters and L is the maximized value of the likelihood function (so ln(L) is the model's log-likelihood, a measure of how well it fits). AIC punishes models with more parameters. AIC can be tedious to calculate by hand, but there are ways to do it in Python:

Using the fitted model from above (AIC is an in-sample measure, so we compute it on the training data):

y_hat = model.predict(X_train)
resid = y_train - y_hat
sse = np.sum(resid**2)              # residual sum of squares
n = len(y_train)
k = X_train.shape[1]                # number of estimated parameters, including the intercept
aic = n * np.log(sse / n) + 2 * k   # Gaussian form of AIC, equal to the definition up to an additive constant
print(aic)

The Statsmodels package also has a built-in attribute for AIC on any fitted model:

import statsmodels.api as sm

X_train = sm.add_constant(X_train)  # the intercept term must be added manually (skipped if already present)
model = sm.OLS(y_train, X_train).fit()
print(model.aic)

BIC (Bayesian Information Criterion)

BIC is another common method for judging model complexity. The formula for BIC is

BIC = -2 * L + k * log(n)

where L is the log-likelihood, k is the number of parameters, and n is the number of observations. Just like for AIC, a smaller BIC is better (further left on the number line). In Python, it can be calculated using this:

# BIC under normally distributed errors (up to an additive constant): n*ln(SSE/n) + k*ln(n)
n = len(y_train)
k = X_train.shape[1]                # number of estimated parameters, including the intercept
sse = np.sum(resid**2)              # training residuals from the AIC calculation above
bic = n * np.log(sse / n) + k * np.log(n)
print(bic)
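
As with AIC, statsmodels computes this directly for a fitted OLS model:

print(model.bic)  # BIC of the fitted statsmodels model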