Nonparametric Techniques

Data can look like almost anything. Sometimes there is a clear linear trend and it is very easy to spot a relationship. But that will not always be the case. What do we do when the data looks like this?

In these scenarios, we need more advanced techniques to properly analyze the data.

Many of these techniques are built on high-level mathematics taught in graduate-level courses. Unless you are working in an academic research setting, you will not be expected to know the mathematical reasoning behind these functions, but you should be familiar with how to use them.

Polynomial Regression

This technique is very similar to linear regression, except we are now looking for a polynomial that best fits the data instead of a line. The numpy library has a function that can do exactly this (the data for this example is taken from Mendeley Data):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("non_convex_data_1000.csv")
data.info() # columns are x and f(x)

x = data['x']
y = data['f(x)']
plt.scatter(x, y) # building the plot
poly_model = np.poly1d(np.polyfit(x, y, 4)) # constructing the polynomial of degree 4
line = np.linspace(min(x), max(x), 100) # 100 evenly spaced values across the domain of our data
plt.plot(line, poly_model(line)) # evaluate the model at those points and plot the fitted curve
plt.show()

print(poly_model) # this writes out the polynomial with all its coefficients

As you can see, this was extremely easy to do. The tricky part is figuring out the degree of polynomial required, which may not be as obvious as it is in this case. If that happens, the key is to fit multiple models of different degrees and then compare them.

Let’s see how this would work.

model2 = np.poly1d(np.polyfit(x, y, 2))  # degree-2 polynomial for comparison
plt.plot(line, model2(line), color="green")  # plotted in green for comparison

Visually we can see one polynomial does a better job capturing the trend than the other. But even if we could not visually notice a difference, we can compare their R-squared and AIC values to find the better model.

from sklearn.metrics import r2_score  # needed for the R-squared calculation below

y_pred4 = poly_model(x)
y_pred2 = model2(x)
# R-squared
r2_4 = r2_score(y, y_pred4)
r2_2 = r2_score(y, y_pred2)
# AIC: n * ln(RSS / n) + 2k, up to an additive constant (assumes normally distributed errors)
n = len(y)
rss_4 = np.sum((y - y_pred4) ** 2)
rss_2 = np.sum((y - y_pred2) ** 2)
k_4 = len(poly_model.coefficients)
k_2 = len(model2.coefficients)
aic_4 = n * np.log(rss_4 / n) + 2 * k_4
aic_2 = n * np.log(rss_2 / n) + 2 * k_2

print("Degree 4 model:")
print("  R² =", r2_4)
print("  AIC =", aic_4)

print("Degree 2 model:")
print("  R² =", r2_2)
print("  AIC =", aic_2)

Output:
Degree 4 model:
R² = 1.0
AIC = -56994.356405464634
Degree 2 model:
R² = 0.9036661393115496
AIC = 10056.405763093686

When comparing these two models, it is not even close. The degree-4 model has a perfect R-squared value (the data was generated from a degree-4 polynomial, so this makes sense), while the degree-2 model is good but not great.

For more complex datasets, choosing the best model is often a process of trial and error, fitting and comparing models of several different degrees before finally settling on the best one. It can be tedious and time-consuming, but it is necessary.
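For instance, here is a sketch of what that trial-and-error loop might look like, reusing x, y, and the AIC formula from above (the range of candidate degrees is just an illustrative choice):

from sklearn.metrics import r2_score

n = len(y)
for degree in range(1, 7):  # candidate degrees 1 through 6
    candidate = np.poly1d(np.polyfit(x, y, degree))
    y_pred = candidate(x)
    rss = np.sum((y - y_pred) ** 2)
    k = len(candidate.coefficients)
    aic = n * np.log(rss / n) + 2 * k  # same AIC formula as above; lower is better
    print(f"Degree {degree}: R² = {r2_score(y, y_pred):.4f}, AIC = {aic:.1f}")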

Bias-Variance Tradeoff

As we move into more complex techniques, we find ourselves running into a common problem in data science: the bias-variance tradeoff. To show this, look at this graph of waiting times for the Old Faithful geyser at Yellowstone. Each column has a bandwidth of 4, which in this case means the graph is broken up into five-minute intervals.

But what happens if we break it into 10 minute intervals?

Or even 20 minute intervals?

As we increase the size of the bins, we decrease the variance of the graph, but in doing so we increase the bias. Our first graph told us that a sizable number of eruptions happened around the 50-minute mark and that the frequency then decreased until the 70-minute mark. This last graph does not show that at all. As a result, we know that this graph has high bias.

But what would happen if we divided the graph by the minute?

This is just a graph of the exact minute each eruption occurred. It has virtually no bias, but the bars jump from high to low and are so thin that the graph is hard to read. We would say this graph has extremely high variance.

This is what the bias-variance tradeoff looks like when visualized. We can get extremely blocky, uniform graphs with incredibly high bias or incredibly thin but sporadic graphs with high variance. It all depends on how we choose to group the data. The best graph is probably one that divides the data into 4-5 minute intervals, as this captures the underlying trends without letting the variance get out of hand.
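As a rough sketch of how graphs like these could be produced, assuming the Old Faithful waiting times are stored in a CSV with a column named waiting (both the file name and the column name here are assumptions; substitute whatever your copy of the data uses), we can plot the same data with several bin widths and watch the tradeoff appear:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

faithful = pd.read_csv("faithful.csv")  # hypothetical file containing the waiting times
waiting = faithful['waiting']

# One histogram per bin width: narrow bins show high variance, wide bins show high bias
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, width in zip(axes, [1, 5, 10, 20]):
    bins = np.arange(waiting.min(), waiting.max() + width, width)
    ax.hist(waiting, bins=bins)
    ax.set_title(f"{width}-minute bins")
    ax.set_xlabel("Waiting time (minutes)")
plt.tight_layout()
plt.show()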

The bias-variance tradeoff is not just a problem when it comes to making graphs. As we look for other ways of identifying trends in data and move toward machine learning, we will find that high bias leads to underfitting (the model is not sensitive enough to changes in the data), while high variance leads to overfitting (the model is overly sensitive to them). To be successful, we will need to find a healthy balance that minimizes both as much as possible.

Local Linear Regression

One technique that does a very good job of balancing bias and variance is local linear regression.

Local linear regression is similar to linear regression, except that it fits many small regressions on local subsets of the data and stitches them together.

This line was made using local linear regression with a bandwidth of 5. This means that, for each point, the model looks at the five closest neighbors, assigns them weights based on proximity, and builds the local trend line from that weighted fit. If we increase the bandwidth, we increase the bias present in the model. For example, with a bandwidth of 100 we would get:

This line has far fewer curves (lower variance) but no longer describes the trend in the data very well. This is a classic example of underfitting.
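In practice we will use a library for this, but to make the mechanics concrete, here is a minimal hand-rolled sketch of the idea: at each evaluation point, take the closest neighbors, weight them by proximity (a tricube weight is used here purely as an illustrative choice), and fit a small weighted least-squares line. The fitted value at that point is the local intercept.

import numpy as np

def local_linear(x, y, x_eval, k=5):
    """Sketch of local linear regression: weighted least squares on the k nearest neighbors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    fitted = []
    for x0 in np.atleast_1d(x_eval):
        dist = np.abs(x - x0)
        idx = np.argsort(dist)[:k]              # the k closest observations
        h = dist[idx].max() + 1e-12             # local bandwidth: distance to the farthest neighbor
        w = (1 - (dist[idx] / h) ** 3) ** 3     # tricube weights: closer points count more
        sw = np.sqrt(w)
        X = np.column_stack([np.ones(len(idx)), x[idx] - x0])
        beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y[idx], rcond=None)
        fitted.append(beta[0])                  # intercept = fitted value at x0
    return np.array(fitted)

Increasing k widens each neighborhood and smooths the curve (more bias, less variance); decreasing k does the opposite.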

Now let’s code a full example of local linear regression in Python, this time using a library and a real dataset.

This analysis seeks to model George W. Bush’s approval rating during his time in office. Before running the actual regression, we first need to organize the data. It is currently stored in the following format:

Start        Stop        Approve  Disapprove  None  Date        Percent
2/1/2001     2/4/2001    57       25          18    2001-02-01  69.51220
2/9/2001     2/11/2001   57       24          17    2001-02-09  70.37037
...         ...         ...      ...         ...   ...         ...

We want our x-axis to be the number of days since the first entry in the dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Load data
df = pd.read_csv("WBushApproval.csv")

# Convert Start to datetime
df['Date'] = pd.to_datetime(df['Start'], format="%m/%d/%Y")

# Compute approval percentage
df['Percent'] = 100 * df['Approve'] / (df['Approve'] + df['Disapprove'])

# Convert dates to numeric (days since first date)
df['DateNum'] = (df['Date'] - df['Date'].min()).dt.days

Now that we have properly modified the data, we can perform the local linear regression.

# Local linear regression using LOWESS (locally weighted scatterplot smoothing)
bw_value = 4  # try several different values to balance bias and variance
frac = bw_value / df.shape[0]  # proportion of the data used for each local fit

# is_sorted=True tells lowess the data are already ordered by DateNum, so sort first to be safe
df = df.sort_values('DateNum').reset_index(drop=True)
smoothed = lowess(endog=df['Percent'], exog=df['DateNum'], frac=frac, it=0, delta=0.0, is_sorted=True)

# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(df['DateNum'], df['Percent'], color='blue', alpha=0.3, label='Raw Data')
plt.plot(smoothed[:, 0], smoothed[:, 1], color='red', linewidth=2, label='Local Linear Regression')
plt.title("Local Linear Regression of Bush Approval Ratings")
plt.xlabel("Days Since Start of First Term")
plt.ylabel("Approval Percentage")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Using a bandwidth of 4,

This is okay, but the variance seems too high: the trend line jumps around where it should probably be smooth.

Let’s try again with a bandwidth of 50.

That’s too smooth. The model does not do a good job adjusting to the big change caused by 9/11. But we know that our answer lies somewhere between 4 and 50, so let’s try 25.

That is much better. This model does a good job adjusting to the swings in popularity but is not overly sensitive to small fluctuations.
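If you would rather compare the candidates side by side instead of one plot at a time, a quick option is to overlay the fits for several bandwidths on a single figure, reusing df and lowess from the code above:

# Overlay LOWESS fits for several bandwidths to compare bias and variance directly
plt.figure(figsize=(10, 6))
plt.scatter(df['DateNum'], df['Percent'], color='blue', alpha=0.2, label='Raw Data')
for bw_value, color in [(4, 'orange'), (50, 'green'), (25, 'red')]:
    frac = bw_value / df.shape[0]
    fit = lowess(endog=df['Percent'], exog=df['DateNum'], frac=frac, it=0)
    plt.plot(fit[:, 0], fit[:, 1], color=color, linewidth=2, label=f'Bandwidth {bw_value}')
plt.xlabel("Days Since Start of First Term")
plt.ylabel("Approval Percentage")
plt.legend()
plt.show()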

Local linear regression models are an excellent tool for modeling trends regardless of the shape of the data. Perhaps the biggest advantage of the technique is that it can identify any type of trend, parametric or nonparametric, from the data itself, which lets it organically pick up whatever pattern exists, even in noisy environments. Local linear regression also performs relatively well near the boundaries of the data. And, very importantly, it allows us to approximate the derivative of the model at any given point, which is extremely useful. However, local linear regression can be very computationally expensive, which makes it harder to use on extremely large datasets.
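As a small sketch of that last point, assuming we keep the smoothed array from the LOWESS fit above, one simple way to approximate the derivative is to difference the fitted values numerically:

# Approximate the derivative of the smoothed curve with finite differences
x_smooth, y_smooth = smoothed[:, 0], smoothed[:, 1]
# several polls can share the same day, so drop duplicate x values before differencing
x_unique, keep = np.unique(x_smooth, return_index=True)
slope = np.gradient(y_smooth[keep], x_unique)  # change in approval per day

plt.figure(figsize=(10, 4))
plt.plot(x_unique, slope, color='purple')
plt.axhline(0, color='gray', linewidth=1)
plt.xlabel("Days Since Start of First Term")
plt.ylabel("Approximate Change in Approval per Day")
plt.show()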

Kernel Regression

The final technique we will cover is kernel regression. Like local linear regression, kernel regression assigns weights to nearby points, but it does so in a different way. There are several kernel regression estimators, such as the Nadaraya-Watson and Priestley-Chao estimators, and several kernel functions (Gaussian, triangular, and so on), each of which weights points in its own manner. In practice, however, the choice of kernel matters far less than the bandwidth of the kernel.
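To see how that weighting works, here is a minimal hand-rolled sketch of the Nadaraya-Watson estimator with a Gaussian kernel (purely illustrative; the statsmodels version below is what we will actually use):

import numpy as np

def nadaraya_watson(x, y, x_eval, bandwidth):
    """Nadaraya-Watson kernel regression: a locally weighted average of y with Gaussian weights."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_eval = np.atleast_1d(x_eval).astype(float)
    # Gaussian weight of every observation for every evaluation point
    weights = np.exp(-0.5 * ((x_eval[:, None] - x[None, :]) / bandwidth) ** 2)
    return (weights * y).sum(axis=1) / weights.sum(axis=1)

Each prediction is just a weighted average of the observed values, and the bandwidth controls how quickly the weights fall off with distance, which is why it matters so much more than the particular kernel shape.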

Let’s run a kernel regression on the Bush approval rating data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.kernel_regression import KernelReg

# Load data
df = pd.read_csv("WBushApproval.csv")
df['Date'] = pd.to_datetime(df['Start'], format="%m/%d/%Y")
df['Percent'] = 100 * df['Approve'] / (df['Approve'] + df['Disapprove'])
df['DateNum'] = (df['Date'] - df['Date'].min()).dt.days

x = df['DateNum'].values
y = df['Percent'].values

# Kernel regression with a Gaussian kernel
# 'c' = continuous variable; reg_type='lc' is the local-constant (Nadaraya-Watson) estimator
kr = KernelReg(endog=y, exog=x, var_type='c', reg_type='lc', bw=[22], ckertype='gaussian')
# bw can be set manually (as here) or omitted, in which case statsmodels chooses it by cross-validation

# Evaluate the fit on an evenly spaced grid, just like the polynomial example
x_grid = np.linspace(min(x), max(x), 500)
y_pred, _ = kr.fit(x_grid)  # kr.fit() returns (fitted values, marginal effects); we only need the first

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(x, y, alpha=0.3, color='blue', label="Raw Data")
plt.plot(x_grid, y_pred, color='red', linewidth=2, label="Kernel Regression (Gaussian)")
plt.title("Bush Approval Ratings - Nadaraya-Watson (Statsmodels)")
plt.xlabel("Days Since Start of First Term")
plt.ylabel("Approval Percentage")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Here are some other kernel types you may come across (check the statsmodels documentation for exactly which ckertype strings your version of KernelReg accepts):

ckertype value     Type of Kernel
'gaussian'         Gaussian (default)
'uniform'          Uniform
'triangular'       Triangular
'epanechnikov'     Epanechnikov
'biweight'         Biweight
'cosine'           Cosine

Kernels are an extremely powerful tool because they can find almost any kind of trend in the data. You do not need to assume a particular model, because the kernel adapts to the data automatically, and it does a great job of smoothing out noise. An advantage of kernels over local linear models is that they are significantly cheaper computationally, which is worth considering when dealing with massive datasets. Kernels do tend to have high bias at the boundaries of the data and they are not very good at predicting future values, but they remain an incredibly powerful tool in data analysis.