Chapter 8: Core Classification Algorithms

Now that we have covered the core background of machine learning and know how to evaluate classifiers, we can turn to the exciting part: the models themselves. In this chapter we will walk through four of the most commonly used classification algorithms: Logistic Regression, K-Nearest Neighbors (KNN), Decision Trees, and Random Forests. Each algorithm brings its own strengths, limitations, and ideal use cases. We’ll focus on their logic, mechanics, and when to apply them effectively.

8.1 Logistic Regression: What It Is and Why It Works:

Logistic regression is one of the most fundamental classification algorithms. Despite its name, it’s used for classification, not regression. At its core, logistic regression answers a simple question: how likely is it that this input belongs to the positive class (class 1)?

It works by:

  • Taking a weighted sum of the input features
  • Passing that sum through a sigmoid function, which converts the result into a value between 0 and 1 (i.e., a probability)
  • Applying a decision threshold (typically 0.5) to determine the predicted class; a minimal sketch of these three steps follows below
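
To make these steps concrete, here is a minimal NumPy sketch with made-up weights, bias, and input values (purely illustrative numbers, not learned from any data):

import numpy as np

# Purely illustrative weights, bias, and a single input vector (not learned from data)
w = np.array([0.8, -1.2, 0.5])
b = 0.1
x = np.array([1.0, 0.3, 2.0])

z = np.dot(w, x) + b                 # step 1: weighted sum of the input features
p = 1.0 / (1.0 + np.exp(-z))         # step 2: sigmoid squashes z into (0, 1)
prediction = int(p >= 0.5)           # step 3: apply the 0.5 decision threshold

print(f"z = {z:.2f}, P(class 1) = {p:.2f}, predicted class = {prediction}")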

Why it works:

  • It models the log-odds of the positive class as a linear function of the features, so it captures (roughly) linear boundaries between classes well
  • Its output is a genuine probability, which gives you more than a bare “yes/no” answer

Where it shines:

  • Interpretable models for binary classification
  • Baseline models in healthcare, finance, and social science
  • Fast and robust when the number of features is moderate

Python Implementation Example:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))  # Accuracy: 0.9737

Mathematical Bonus:

Weighted sum of the input features:

\\[ z = \mathbf{w}^T \mathbf{x} + b \\]

Sigmoid maps the sum to probability:

\\[ \sigma(z) = \frac{1}{1 + e^{-z}} \\]

Loss function (cross-entropy):

\\[ L = -[y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})] \\]
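
As a sanity check, we can plug the fitted weights into these equations ourselves. The sketch below assumes the Section 8.1 example has just been run, so model, X_test, and y_test are still in scope; it recomputes the predicted probabilities by hand and evaluates the average cross-entropy loss:

import numpy as np
from sklearn.metrics import log_loss

# Recompute P(class 1) by hand from the fitted weights and bias
z = X_test @ model.coef_.T + model.intercept_        # weighted sums, shape (n_samples, 1)
p_manual = 1.0 / (1.0 + np.exp(-z.ravel()))          # sigmoid turns each z into a probability

# scikit-learn's predict_proba should agree (column 1 holds P(class 1))
p_sklearn = model.predict_proba(X_test)[:, 1]
print("Probabilities match:", np.allclose(p_manual, p_sklearn))

# Mean cross-entropy loss on the test set
print("Cross-entropy:", log_loss(y_test, p_sklearn))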

8.2 K-Nearest Neighbors (KNN): What It Is and Why It Works:

KNN is a non-parametric, instance-based classifier: instead of learning weights, it stores the training data and predicts by comparing new points to it.
How it works:

  • Given a new data point, it searches the training set for the k most similar (closest) points
  • It assigns the majority class among those k neighbors as the prediction; a by-hand sketch of this procedure follows below
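
Here is a minimal sketch of that procedure, reusing the scaled X_train, y_train, and X_test arrays from Section 8.1 (the helper function knn_predict is just for illustration):

import numpy as np

def knn_predict(x_new, X_train, y_train, k=5):
    # Euclidean distance from the new point to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]        # indices of the k closest points
    votes = y_train[nearest]               # their class labels
    return np.bincount(votes).argmax()     # majority vote among the neighbors

print("Predicted class for first test point:", knn_predict(X_test[0], X_train, y_train))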

Why it works:

  • KNN relies on a simple and often powerful assumption: Similar things tend to have similar labels.
  • KNN can model non-linear decision boundaries and adapt to the local shape of data.

Where it shines:

  • When patterns are complex and local, not easily captured by global rules
  • Problems with very few features and clear class groupings
  • Prototype-based reasoning (e.g., medical diagnoses based on past patients)

Where it struggles:

  • High dimensions (curse of dimensionality)
  • Large datasets (slow at prediction time)
  • Features with different scales or irrelevant features

Python Example:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))  # Accuracy: 0.9561

Mathematical Bonus:
KNN typically measures similarity with Euclidean distance: it computes the distance from the new point to every training point, keeps the k nearest ones, and predicts the class that wins the majority vote among those neighbors.
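
For reference, the Euclidean distance between a new point x and a training point x′ with n features is:

\\[ d(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{j=1}^{n} (x_j - x'_j)^2} \\]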

8.3 Decision Trees: What They Are and Why They Work:

A decision tree builds a series of if-then rules based on the input features to predict the class of a data point.
How it works:

  • At each node, it chooses the feature and threshold that best splits the data into purer (more homogeneous) subgroups
  • The process repeats recursively, building branches until:
    • Each leaf is “pure” (all one class), or
    • A stopping condition is met (max depth, min samples, etc.)

Why it works:

  • Simple threshold rules handle numeric, ordinal, and categorical features without heavy preprocessing
  • Every prediction can be traced along an explicit if-then path, so interpretability comes built in
  • Splits depend only on how feature values are ordered, so trees are fairly robust to outliers and can accommodate missing data

Where it shines:

  • Clear, human-interpretable logic
  • Real-world decision making (credit approval, eligibility)
  • Datasets with mixed feature types

Where it struggles:

  • High variance / overfitting
  • Small training sets (unstable splits)

Python Example:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=4)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))  # Accuracy: 0.9561

Mathematical Intuition Bonus:
The two most common split criteria are Gini impurity and entropy.
Gini impurity:

\\[ G = 1 - \sum p_i^2 \\]

Entropy (used to compute information gain):

\\[ H = - \sum p_i \log_2 p_i \\]

You want to choose the split that minimizes impurity or maximizes information gain.
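
As a quick illustration, here is a small sketch that computes both criteria for a toy node holding three samples of class 0 and two of class 1 (the helper functions gini and entropy are just for demonstration):

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)          # Gini impurity: 1 - sum(p_i^2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))       # entropy: -sum(p_i * log2(p_i))

node = np.array([0, 0, 0, 1, 1])         # toy node: 3 samples of class 0, 2 of class 1
print("Gini:", gini(node))               # 1 - (0.6**2 + 0.4**2) = 0.48
print("Entropy:", entropy(node))         # about 0.971 bits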

8.4 Random Forest: What It Is and Why It Works:

A random forest is an ensemble of decision trees. Instead of relying on one tree (which might overfit), it grows many trees using different random samples and combines their predictions.
How it works:

  • Bootstrapping: Each tree is trained on a different random subset of data (with replacement)
  • Random feature selection: Each split considers a random subset of features
  • Final prediction: majority vote (classification) or average (regression); a by-hand sketch of these three steps follows below
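
A minimal hand-rolled version of this recipe, assuming the scaled X_train, X_test, y_train, and y_test arrays from Section 8.1 are still in scope (the tree count of 25 is arbitrary, chosen just to keep the sketch fast):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
trees = []
for i in range(25):
    # Bootstrapping: sample the training set with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Random feature selection: each split considers only sqrt(n_features) candidates
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Final prediction: majority vote across the ensemble
votes = np.array([t.predict(X_test) for t in trees])   # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("Hand-rolled bagging accuracy:", (majority == y_test).mean())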

Why it works:

  • Averaging many diverse trees reduces variance and overfitting
  • The ensemble tends to be accurate and robust
  • It handles complex, non-linear data
  • It is far more stable than any single decision tree

Where it shines:

  • Large datasets with many features
  • Data with noise or complex interactions
  • Feature importance analysis

Where it struggles:

  • Interpretability
  • Training time for very large datasets

Python Example:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))  # Accuracy: 0.9737

Mathematical Background:

Bagging combines the individual trees h_1(x), ..., h_T(x) by majority vote:

\\[ \hat{y} = \operatorname{mode}\{ h_1(\mathbf{x}), h_2(\mathbf{x}), \ldots, h_T(\mathbf{x}) \} \\]

Feature importance is computed by averaging, over all trees, the impurity reduction attributable to each feature.
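
Continuing from the forest fitted above, the per-feature importances are exposed directly on the model; for example, the five most important features:

import numpy as np

importances = model.feature_importances_      # mean impurity reduction per feature
top5 = np.argsort(importances)[::-1][:5]      # indices of the five largest values
for i in top5:
    print(f"{data.feature_names[i]}: {importances[i]:.3f}")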

8.5 Final Summary Table:

| Model               | Test Accuracy | Strengths                                      | Weaknesses                               |
|---------------------|---------------|------------------------------------------------|------------------------------------------|
| Logistic Regression | 0.9737        | Fast, interpretable, probabilistic output      | Linear boundaries only                   |
| K-Nearest Neighbors | 0.9561        | Simple, flexible, no training time             | Slow predictions, sensitive to scale     |
| Decision Tree       | 0.9561        | Interpretable, handles missing values          | Prone to overfitting                     |
| Random Forest       | 0.9737        | Robust, handles non-linear data, high accuracy | Less interpretable, longer training time |