Part II · Chapter 5

Logistic Regression

Logistic regression is the canonical probabilistic binary classifier. Despite its name, it is a classification model. In this chapter we derive the sigmoid function from the log-odds, the cross-entropy loss from Bernoulli MLE, and an elegantly simple closed-form gradient.

1. The Sigmoid Function: Derived from Log-Odds

We model the log-odds (logit) of the probability \(P(y=1|\mathbf{x})\) as a linear function of the features:

\[ \log\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \mathbf{w}^\top\mathbf{x} = z \]

Solving for \(p = P(y=1|\mathbf{x})\):

\[ \frac{p}{1-p} = e^z \quad \Rightarrow \quad p = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} \]
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

Useful properties of the sigmoid:

  • \(\sigma(z) \in (0,1)\): always a valid probability
  • \(\sigma(-z) = 1 - \sigma(z)\): symmetric about 0.5
  • \(\sigma'(z) = \sigma(z)(1-\sigma(z))\): a clean derivative, crucial for backpropagation
[Figure: the sigmoid curve \(\sigma(z)\); the decision boundary at \(z = 0\) separates Class 0 (\(p < 0.5\)) from Class 1 (\(p > 0.5\)).]

The sigmoid maps any real-valued score z to a probability. The decision boundary is at z = 0 (p = 0.5).
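The three properties above can be checked numerically. This is a minimal sketch (NumPy assumed); the piecewise form of `sigmoid` is one common way to avoid overflow for large negative \(z\), not something the text prescribes.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))          # always in (0, 1], never overflows
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.linspace(-5.0, 5.0, 11)

# sigma(z) in (0, 1): always a valid probability
assert np.all((sigmoid(z) > 0) & (sigmoid(z) < 1))

# symmetry: sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))

# derivative sigma'(z) = sigma(z)(1 - sigma(z)), checked by central differences
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(numeric, sigmoid(z) * (1 - sigmoid(z)), atol=1e-8)
```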

2. Cross-Entropy Loss: Derived from Bernoulli MLE

For binary labels \(y_i \in \{0,1\}\), the predicted probability is \(\hat{p}_i = \sigma(\mathbf{w}^\top\mathbf{x}_i)\). The Bernoulli likelihood for one example is:

\[ P(y_i | \mathbf{x}_i; \mathbf{w}) = \hat{p}_i^{y_i}(1-\hat{p}_i)^{1-y_i} \]

Assuming i.i.d. data, the log-likelihood is:

\[ \ell(\mathbf{w}) = \sum_{i=1}^n \Big[y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big] \]

Maximising the log-likelihood is equivalent to minimising the negative log-likelihood; averaged over the \(n\) examples, this is the binary cross-entropy loss:

\[ \mathcal{L}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^n \Big[y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big] \]
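The equivalence between the averaged Bernoulli NLL and binary cross-entropy can be verified directly. A small sketch (NumPy assumed; the `eps` clipping is a standard guard against \(\log 0\), not part of the derivation):

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Binary cross-entropy, averaged over examples."""
    p = np.clip(p, eps, 1 - eps)    # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1, 0], dtype=float)
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# the Bernoulli likelihood of each label is p_hat if y=1, else 1 - p_hat;
# BCE is exactly the averaged negative log of that likelihood
nll = -np.mean(np.log(np.where(y == 1, p_hat, 1 - p_hat)))
assert np.isclose(bce(y, p_hat), nll)
```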

3. Gradient Derivation

We derive the gradient \(\nabla_{\mathbf{w}}\mathcal{L}\) step by step. Let \(z_i = \mathbf{w}^\top\mathbf{x}_i\) and \(\hat{p}_i = \sigma(z_i)\).

Step 1: Derivative of the per-example log-likelihood \(\ell_i\) with respect to \(z_i\). From \(\sigma'(z) = \sigma(z)(1-\sigma(z))\) it follows that \(\partial \log\hat p_i / \partial z_i = 1 - \hat p_i\) and \(\partial \log(1-\hat p_i)/\partial z_i = -\hat p_i\), so

\[ \frac{\partial \ell_i}{\partial z_i} = y_i(1-\hat{p}_i) - (1-y_i)\hat{p}_i = y_i - \hat{p}_i \]

Step 2: Chain rule to \(\mathbf{w}\), using \(\partial z_i / \partial \mathbf{w} = \mathbf{x}_i\):

\[ \frac{\partial \ell_i}{\partial \mathbf{w}} = (y_i - \hat{p}_i)\mathbf{x}_i \]

Step 3: Average over all examples, flipping the sign to turn the log-likelihood into a loss:

\[ \nabla_{\mathbf{w}}\mathcal{L} = \frac{1}{n}\mathbf{X}^\top\big(\boldsymbol{\sigma}(\mathbf{X}\mathbf{w}) - \mathbf{y}\big) \]

The gradient has the same elegant form as the OLS gradient \(\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y})/n\): features times residuals. The gradient descent update is \(\mathbf{w} \leftarrow \mathbf{w} - \eta\,\mathbf{X}^\top(\hat{\mathbf{p}} - \mathbf{y})/n\).
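The closed-form gradient can be validated against a finite-difference approximation of the loss, a standard gradient check. A sketch (NumPy assumed, random data for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    """Binary cross-entropy at weights w."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# analytic gradient: X^T (p_hat - y) / n
analytic = X.T @ (sigmoid(X @ w) - y) / len(y)

# central finite differences, one coordinate at a time
h = 1e-6
numeric = np.array([(loss(w + h * e) - loss(w - h * e)) / (2 * h)
                    for e in np.eye(3)])
assert np.allclose(analytic, numeric, atol=1e-6)
```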

4. Newton's Method / IRLS

Newton's method uses the Hessian for second-order updates. The Hessian of the logistic loss is:

\[ \mathbf{H} = \nabla^2\mathcal{L} = \frac{1}{n}\mathbf{X}^\top\mathbf{W}\mathbf{X}, \quad \mathbf{W} = \mathrm{diag}\!\left(\hat{p}_i(1-\hat{p}_i)\right) \]

The Newton update is:

\[ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1}\nabla\mathcal{L} = (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{z} \]

where \(\mathbf{z} = \mathbf{X}\mathbf{w}^{(t)} + \mathbf{W}^{-1}(\mathbf{y} - \hat{\mathbf{p}})\) is the adjusted response. This is exactly Iteratively Reweighted Least Squares (IRLS): each Newton step solves a weighted OLS problem. Near the optimum, Newton's method converges quadratically, versus the linear rate of gradient descent.
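The two equations above translate almost line for line into code. This is a minimal sketch (NumPy assumed; the `1e-8` floor on the weights and the fixed iteration count are pragmatic choices, not part of the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_irls(X, y, n_iter=10):
    """Newton's method as IRLS: one weighted least-squares solve per step."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        W = p * (1 - p)                              # diagonal of the weight matrix
        z = X @ w + (y - p) / np.maximum(W, 1e-8)    # adjusted response
        # weighted normal equations: (X^T W X) w = X^T W z
        w = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return w

# illustrative data: Bernoulli labels from a known linear model (bias column first)
rng = np.random.default_rng(1)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)

w_hat = fit_irls(X, y)
acc = np.mean((sigmoid(X @ w_hat) > 0.5) == y)
```

Note that IRLS diverges on perfectly separable data (the MLE weights go to infinity), which is why the labels here are sampled noisily.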

5. Multi-Class: Softmax Derivation

For \(K\) classes, generalise logistic regression to model log-odds relative to a reference class. The class probabilities are given by the softmax function:

\[ P(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^\top\mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^\top\mathbf{x}}} \]

Softmax is derived by requiring that all \(K\) probabilities sum to 1 and are proportional to exponentiated scores. For \(K=2\), softmax reduces to sigmoid. The loss is the multi-class cross-entropy (categorical NLL):

\[ \mathcal{L} = -\frac{1}{n}\sum_{i=1}^n\sum_{k=1}^K \mathbf{1}[y_i=k]\log P(y=k|\mathbf{x}_i) \]

The gradient for class \(k\) is \(\nabla_{\mathbf{w}_k}\mathcal{L} = \frac{1}{n}\mathbf{X}^\top(\hat{\mathbf{p}}_k - \mathbf{y}_k)\) where \(\hat{\mathbf{p}}_k\) is the vector of predicted probabilities for class \(k\).
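The reduction of softmax to the sigmoid for \(K=2\) is easy to verify: fixing the reference class's score at 0 gives \(P(y=1|\mathbf{x}) = e^z/(1+e^z) = \sigma(z)\). A sketch (NumPy assumed; the max-subtraction is the usual numerical-stability trick):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax over score matrix Z (n examples x K classes)."""
    Z = Z - Z.max(axis=1, keepdims=True)   # shift scores for numerical stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
# K = 2 with the reference class's score fixed at 0
two_class = softmax(np.c_[np.zeros_like(z), z])

# P(y=1) = softmax([0, z])[1] = sigma(z)
assert np.allclose(two_class[:, 1], sigmoid(z))
# each row is a valid probability distribution
assert np.allclose(two_class.sum(axis=1), 1.0)
```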

Python: Logistic Regression from Scratch

We implement gradient descent for logistic regression from scratch, plot the decision boundary on 2D data, and show convergence of the cross-entropy loss.

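A minimal sketch of the training loop described above (NumPy only, with the plotting omitted; the synthetic two-blob data and the `1e-12` log guard are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the binary cross-entropy loss."""
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        losses.append(-np.mean(y * np.log(p + 1e-12)
                               + (1 - y) * np.log(1 - p + 1e-12)))
        w -= lr * X.T @ (p - y) / len(y)    # gradient: X^T(p_hat - y)/n
    return w, losses

# synthetic 2D data: two Gaussian blobs, bias handled via a column of ones
rng = np.random.default_rng(42)
n = 100
X0 = rng.normal([-1.0, -1.0], 1.0, size=(n, 2))
X1 = rng.normal([+1.0, +1.0], 1.0, size=(n, 2))
X = np.c_[np.ones(2 * n), np.vstack([X0, X1])]
y = np.r_[np.zeros(n), np.ones(n)]

w, losses = train_logreg(X, y)
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
print(f"final loss {losses[-1]:.3f}, train accuracy {acc:.2f}")

# with a small enough learning rate the cross-entropy loss decreases
assert losses[-1] < losses[0]
```

The learned decision boundary is the line \(w_0 + w_1 x_1 + w_2 x_2 = 0\), which can be drawn over the scatter of the two blobs.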

Key Takeaways

  • ✓ The sigmoid arises naturally from modelling the log-odds as a linear function of the features.
  • ✓ Cross-entropy loss is the negative Bernoulli log-likelihood; minimising it is equivalent to MLE.
  • ✓ The gradient \(X^\top(\hat{p} - y)/n\) has the same elegant residual form as the OLS gradient, derived via \(\sigma' = \sigma(1-\sigma)\).
  • ✓ IRLS (Newton's method) converges quadratically by solving a sequence of weighted least squares problems.
  • ✓ Softmax generalises the sigmoid to \(K\) classes and is the output layer of most neural network classifiers.