Chapter 10: Unsupervised Learning, K-Means and Hierarchical Clustering
In supervised learning we trained models on labeled data. But what if we don't have labels, and simply want to explore, group, and summarize the patterns hidden in raw data? This is where unsupervised learning comes into play. In this chapter we explore the logic behind clustering, two important clustering techniques (K-Means clustering and hierarchical clustering), the real-world uses of these techniques, and their Python implementations.
10.1 K-Means Clustering
K-Means is one of the most popular clustering algorithms. It partitions your data into k distinct, non-overlapping groups based on similarity.
How it works:
- You choose the number of clusters, k
- The algorithm places k centroids (center points) randomly
- It assigns each data point to the nearest centroid
- It then recalculates each centroid as the average of all points in its cluster
- The last two steps repeat until the assignments stop changing (or a maximum number of iterations is reached)
Why it works: K-Means rests on a simple, intuitive principle: group data points so that each one is as close as possible to its group's center. Each pass of the assign-and-update loop can only lower (or leave unchanged) the total squared distance from points to their centroids, which is why the iterative process settles into a stable set of groupings.
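To make that loop concrete, here is a minimal from-scratch sketch of one way to implement it with NumPy. The function name kmeans_sketch and the choice of random data points as initial centroids are illustrative only, and the sketch skips practical details such as empty-cluster handling and multiple restarts.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)

    # Steps 1-2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 4: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Step 5: stop once the centroids (and hence the assignments) settle
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids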
Where it’s used:
- Customer segmentation
- Image compression
- Grouping behaviors (users, events, purchases)
- Finding structure in unlabeled datasets
Where it struggles:
- You must choose the number of clusters k up front, and the right value is rarely obvious
- It is sensitive to outliers and to the random initialization of centroids (see the note after this list)
- It doesn't handle non-spherical or unevenly sized clusters well
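One common way to soften the initialization problem is to run the algorithm several times from different random starts and keep the run with the lowest within-cluster sum of squares; scikit-learn's KMeans does this through its n_init parameter. The specific values below are just illustrative:

from sklearn.cluster import KMeans

# 10 independent restarts; the fit with the lowest WCSS (inertia) is kept
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0)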
K-Means in Python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate a toy dataset with four well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit K-Means with k = 4 (random_state fixed for a reproducible figure)
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(X)

# Plot the points colored by cluster label, with the learned centroids marked in red
plt.scatter(X[:, 0], X[:, 1], s=50, c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X')
plt.title("K-Means Clustering")
plt.show()
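Once fitted, the same model can label brand-new points by assigning each to its nearest learned centroid. A quick illustration, reusing the kmeans object from the block above (the two query points are made up for the example):

import numpy as np

# Two hypothetical new observations in the same 2-D feature space
new_points = np.array([[0.0, 2.0], [-1.5, 7.5]])

# predict() returns the index of the closest learned centroid for each point
print(kmeans.predict(new_points))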

10.2 Finding the Right Number of Clusters: The Elbow Method
Try different values of k and plot the within-cluster sum of squares (WCSS) for each one. WCSS is the sum of squared distances between each point and the centroid of its assigned cluster; scikit-learn exposes it as the fitted model's inertia_ attribute. Look for the "elbow": the point where adding more clusters stops producing a substantial drop in WCSS.
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)   # inertia_ is the WCSS for this fit

plt.plot(range(1, 11), wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.title("Elbow Method for Optimal K")
plt.show()
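If it helps to see exactly what inertia_ measures, the short sketch below recomputes the WCSS by hand, reusing X and the last kmeans fit from the loop above (which used k = 10); the two numbers should agree up to floating-point error.

import numpy as np

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_wcss = np.sum((X - assigned_centers) ** 2)

print(manual_wcss, kmeans.inertia_)  # these should be (nearly) identical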

10.3 Hierarchical Clustering
Hierarchical clustering builds a tree of clusters, where each level represents a different granularity.
There are two main types:
- Agglomerative: Start with each point as its own cluster, and merge them step-by-step.
- Divisive: Start with all points in one cluster and split them recursively.
We'll focus on agglomerative clustering, the more common of the two.
Why it works: at each step, agglomerative clustering merges the two closest clusters, where "closest" is defined by a linkage criterion (such as Ward, complete, or average linkage). The result is a visual summary of the relationships between all points, and you don't have to choose k in advance; you simply cut the tree (the dendrogram) at whatever level gives the granularity you want.
Where it’s used:
- Gene expression clustering
- Social network analysis
- Document or topic grouping
- Any case where interpretability matters
Where it struggles:
- Not scalable to large datasets (the standard algorithm compares every pair of points)
- Results are sensitive to the choice of distance metric and linkage method
Hierarchical Clustering in Python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

# Same toy dataset as before: four well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Build the merge tree using Ward linkage (merges that minimize within-cluster variance)
linked = linkage(X, 'ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.show()
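The dendrogram is only a picture of the tree; to get flat cluster labels you cut it at a chosen level. A minimal sketch using SciPy's fcluster on the linked matrix from the snippet above (choosing four clusters here is an assumption that matches the toy data):

from scipy.cluster.hierarchy import fcluster

# Cut the tree so that at most 4 flat clusters remain
labels = fcluster(linked, t=4, criterion='maxclust')

print(labels[:10])  # cluster ids (1 through 4) for the first ten points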

10.4 Key Takeaways
- Unsupervised learning finds structure in data without labeled outcomes.
- K-Means partitions data into a fixed number of groups by minimizing the distance from each point to its cluster center.
- Hierarchical clustering builds a tree of clusters, offering flexibility and interpretability.
- Both methods have strengths depending on your data and your goal: exploration, segmentation, or simplification.