Machine Learning

From linear regression to diffusion models: complete derivations, mathematical foundations, and Python implementations of every major algorithm.

The Machine Learning Landscape

The field divides into five major areas:

  • Supervised Learning: regression, classification, SVMs
  • Unsupervised Learning: clustering, PCA, autoencoders
  • Reinforcement Learning: Q-learning, policy gradients
  • Deep Learning: CNNs, RNNs, Transformers, diffusion models
  • Probabilistic ML: Bayesian methods, Gaussian processes, variational inference

The Equations That Define ML

Linear Regression (OLS)

\( \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \)
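A minimal NumPy sketch of the normal equations. The data here is a synthetic example (true coefficients 1 and 2), assumed for illustration; `np.linalg.solve` is used instead of an explicit inverse, which is numerically preferable.

```python
import numpy as np

# Synthetic data: y = 1 + 2x + noise (assumed toy example)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # design matrix with intercept
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 50)

# Normal equations: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1.0, 2.0]
```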

Gradient Descent

\( \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t) \)
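The update rule in one dimension, on an assumed toy objective \( \mathcal{L}(\theta) = (\theta - 3)^2 \):

```python
# Minimize L(theta) = (theta - 3)^2 by repeatedly stepping
# against the gradient: theta <- theta - eta * L'(theta).
def grad(theta):
    return 2 * (theta - 3.0)

theta = 0.0
eta = 0.1  # learning rate
for _ in range(100):
    theta = theta - eta * grad(theta)

print(theta)  # converges toward the minimizer, 3.0
```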

Bayes' Theorem

\( P(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \boldsymbol{\theta}) P(\boldsymbol{\theta})}{P(\mathcal{D})} \)
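Bayes' theorem applied literally, on a grid: a coin-flip example (7 heads, 3 tails, assumed for illustration) with a uniform prior over the bias \(\theta\). The evidence \(P(\mathcal{D})\) is just the normalizing sum.

```python
# Discretize theta on a grid and apply Bayes' theorem term by term.
thetas = [i / 100 for i in range(1, 100)]
prior = [1 / len(thetas)] * len(thetas)      # uniform prior P(theta)
heads, tails = 7, 3                          # observed data D (assumed example)

likelihood = [t**heads * (1 - t)**tails for t in thetas]  # P(D | theta)
unnorm = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(unnorm)                       # P(D), the normalizer
posterior = [u / evidence for u in unnorm]   # P(theta | D)

map_theta = thetas[posterior.index(max(posterior))]
print(map_theta)  # posterior mode at 0.7 = 7/10
```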

Cross-Entropy Loss

\( \mathcal{L} = -\sum_{i} \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right] \)
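A direct NumPy translation, with predictions clipped away from 0 and 1 so the logarithms stay finite (the example labels and predictions are assumed):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip to avoid log(0), then sum the per-sample terms.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.8])
print(binary_cross_entropy(y_true, y_pred))
```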

Backpropagation

\( \frac{\partial \mathcal{L}}{\partial w_{ij}^{(l)}} = \frac{\partial \mathcal{L}}{\partial z_j^{(l)}} \cdot a_i^{(l-1)} \)
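The rule can be sanity-checked numerically on the smallest possible case: one linear unit \( z = w\,a \) with squared loss (values here are assumed), comparing the backprop gradient against a finite-difference estimate.

```python
# Single linear unit z = w * a_prev, squared loss L = 0.5 * (z - t)^2.
a_prev, w, t = 1.5, 0.8, 2.0
z = w * a_prev
dL_dz = z - t                 # upstream gradient dL/dz
analytic = dL_dz * a_prev     # backprop rule: dL/dw = dL/dz * a_prev

# Finite-difference check of the same derivative
def L(w_):
    return 0.5 * (w_ * a_prev - t) ** 2

eps = 1e-6
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)
print(analytic, numeric)  # the two estimates agree closely
```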

Attention (Transformer)

\( \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \)
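Scaled dot-product attention is a few lines of NumPy; the shapes below (4 tokens, dimension 8) are an assumed toy example, and the softmax subtracts the row max for numerical stability.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled dot-product similarities
    return softmax(scores) @ V        # convex combination of value rows

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output row per query
```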

ELBO (VAE)

\( \mathcal{L} = \mathbb{E}_{q}[\log p(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})) \)
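The KL term has a closed form for Gaussian \( q \), and sampling \( \mathbf{z} \) via the reparameterization trick (\( z = \mu + \sigma\epsilon \)) keeps it differentiable in \(\mu\) and \(\sigma\). A sketch with assumed encoder outputs, not a full VAE:

```python
import numpy as np

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1)
rng = np.random.default_rng(0)
mu, log_var = 1.0, np.log(0.25)     # assumed encoder outputs for one latent dim
sigma = np.exp(0.5 * log_var)
eps = rng.standard_normal(10_000)
z = mu + sigma * eps                # samples from q(z|x) = N(mu, sigma^2)

# Closed-form KL term of the ELBO: D_KL(N(mu, sigma^2) || N(0, 1))
kl = 0.5 * (mu**2 + sigma**2 - 1.0 - log_var)
print(z.mean(), z.std(), kl)
```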

Bellman Equation (RL)

\( V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right] \)
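Iterating the Bellman optimality backup to its fixed point is value iteration. Below, a two-state, two-action MDP whose transition and reward numbers are assumed for illustration:

```python
import numpy as np

# P[a, s, s'] = transition probability; R[s, a] = reward (assumed toy MDP).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.1, 0.9]]])  # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Bellman backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V = Q.max(axis=1)

print(V)  # fixed point of the Bellman optimality operator
```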

About This Course

This course teaches machine learning from the mathematics up. Every algorithm is derived from first principles: we start with the objective function, take derivatives, and arrive at the update rules. No black boxes. Every chapter includes complete MathJax derivations, SVG architecture diagrams, and Python simulations that you can run in the browser.

The course spans from classical methods (linear regression, SVMs) through the deep learning revolution (CNNs, RNNs, Transformers) to the research frontier (diffusion models, LLMs, graph neural networks). Part V on Probabilistic ML connects to the Bayesian brain framework in our Music & Mathematics course and to our Information Theory course.

Prerequisites: multivariable calculus, linear algebra, basic probability. Chapters 1–3 provide a thorough review of all necessary mathematics.

Course Structure

Part I

Mathematical Foundations

The Language of ML

Linear algebra (vectors, matrices, eigendecomposition, SVD), probability theory (Bayes' theorem, distributions, MLE, MAP), and optimization (gradient descent, convexity, Lagrange multipliers, KKT conditions).
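As a taste of the linear-algebra toolkit: the SVD factors any matrix as \( U \Sigma V^\top \), and multiplying the factors back together recovers it exactly (the matrix here is an assumed example).

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted descending
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_rec = U @ np.diag(s) @ Vt  # reconstruction from the factors
print(np.allclose(A, A_rec))  # True
```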

Part II

Supervised Learning

Learning from Labels

Linear regression (OLS derivation, regularization, bias-variance tradeoff), logistic regression (cross-entropy, softmax, Newton's method), and SVMs (maximum margin, kernel trick, dual formulation).

Part III

Neural Networks

Deep Learning

The perceptron, backpropagation derivation (chain rule through computational graphs), deep architectures (BatchNorm, dropout, residual connections), and CNNs (convolution theorem, pooling, modern architectures).

Part IV

Unsupervised Learning

Finding Structure

K-means and Gaussian mixture models (EM algorithm derivation), PCA (eigenvalue formulation, kernel PCA, t-SNE), autoencoders and VAEs (ELBO derivation, reparameterization trick).
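K-means in particular fits in a dozen lines: alternate an assignment step and an update step. A sketch on two assumed well-separated Gaussian blobs, with a guard for the empty-cluster edge case:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest center
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster
        # (keep the old center if a cluster emptied out)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),   # blob near (0, 0)
               rng.normal(5.0, 0.3, (50, 2))])  # blob near (5, 5)
centers, labels = kmeans(X, 2)
print(np.sort(centers[:, 0]))  # one center near 0, the other near 5
```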

Part V

Probabilistic ML

Uncertainty Quantification

Bayesian inference (prior → posterior, conjugacy, MCMC), Gaussian processes (kernel functions, predictive distribution), and variational inference (ELBO, mean-field, amortized inference). Cross-links to the Bayesian brain in music perception.
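For a preview of GP regression, the predictive mean at test inputs is \( k_*^\top K^{-1} \mathbf{y} \). A one-dimensional sketch with an RBF kernel, fitting assumed sine-function observations (predictive variance omitted for brevity):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # RBF kernel on 1-D inputs: k(x, x') = exp(-(x - x')^2 / (2 ell^2))
    return np.exp(-0.5 * (a[:, None] - b[None]) ** 2 / ell**2)

X = np.array([-2.0, 0.0, 2.0])   # training inputs (assumed example)
y = np.sin(X)                     # noiseless observations of sin(x)
noise = 1e-6                      # tiny jitter for numerical stability

K = rbf(X, X) + noise * np.eye(len(X))
Xs = np.array([0.0, 1.0])         # test inputs
Ks = rbf(Xs, X)

mean = Ks @ np.linalg.solve(K, y)  # predictive mean k_*^T K^{-1} y
print(mean)  # interpolates the data at the training point x = 0
```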

Part VI

Sequence Models

Language & Time

RNNs (BPTT derivation, LSTM/GRU gating), the attention mechanism (scaled dot-product, multi-head), Transformers (positional encoding, layer norm), and LLMs (GPT, BERT, scaling laws, RLHF).

Part VII

Advanced Topics

Research Frontier

Reinforcement learning (Bellman equation, policy gradient, PPO), graph neural networks (message passing, spectral convolution), and diffusion models (forward/reverse process, score matching, classifier-free guidance).

Recommended Textbooks

  • Pattern Recognition and Machine Learning, Christopher Bishop (2006)
  • The Elements of Statistical Learning, Hastie, Tibshirani & Friedman (2009)
  • Deep Learning, Goodfellow, Bengio & Courville (2016)
  • Mathematics for Machine Learning, Deisenroth, Faisal & Ong (2020)
  • Probabilistic Machine Learning, Kevin Murphy (2022, 2023)
  • Reinforcement Learning: An Introduction, Sutton & Barto (2018)