Topic 04: Random Matrix Theory and NTK

10-Hour Intensive Master Syllabus

Welcome to the theoretical frontier of Large-Scale Deep Learning. In this intensive 10-hour module, we dismantle low-dimensional intuition and construct the rigorous mathematical framework used to design and scale the world's largest AI models.

This course transitions from the geometry of high-dimensional spaces to the spectral laws of random matrices, culminating in the unified theory of Neural Tangent Kernels and Tensor Programs.


Phase 1: High-Dimensional Foundations (2 Hours)

Focus: Why intuition fails in \(\mathbb{R}^d\)

  • 04-1 High-Dimensional Geometry and Probability
  • The Porcupine Cube & The Empty Ball: Proofs that volume concentrates near the surface and around the equator.
  • The Isoperimetric Inequality: Rigorous proof of measure concentration on \(S^{d-1}\) and its derivation from spherical caps.
  • Logarithmic Sobolev Inequalities: Analytical approach via Gross's Theorem (Gross, 1975) and Herbst's argument for Gaussian concentration.
  • Johnson-Lindenstrauss Lemma: Full proof of near-isometric distance preservation under random projections.
  • Dvoretzky's Theorem: Proof sketch of the emergence of Euclidean geometry in arbitrary convex bodies.
  • Practical Guidelines: Impact on initialization, normalization, and high-dimensional nearest neighbor search.
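The Johnson-Lindenstrauss phenomenon above is easy to observe numerically. The following is a minimal sketch (all dimensions and constants are illustrative choices, not prescribed by the course): project a point cloud from \(d = 10{,}000\) down to \(k = 600\) dimensions with a scaled Gaussian matrix and measure the worst-case distortion of pairwise distances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 30, 10_000, 600          # points, ambient dim, projected dim

X = rng.normal(size=(n, d))
P = rng.normal(size=(d, k)) / np.sqrt(k)   # scaled Gaussian random projection
Y = X @ P

def pairwise_dists(Z):
    """All pairwise Euclidean distances between rows of Z."""
    diffs = Z[:, None, :] - Z[None, :, :]
    D = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(Z), k=1)
    return D[iu]

# Ratio of projected to original distances concentrates tightly around 1
distortion = np.abs(pairwise_dists(Y) / pairwise_dists(X) - 1.0)
print(f"max relative distortion: {distortion.max():.3f}")
```

The typical distortion scales like \(1/\sqrt{k}\), independent of the ambient dimension \(d\), which is the content of the lemma.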

Phase 2: Asymptotic Spectral Theory (2 Hours)

Focus: The Eigenvalues of Noise and Signal

  • 04-2 Spectral Laws of Random Matrix Theory
  • The Stieltjes Transform Method: The "Master Tool" for limiting spectral distributions.
  • Wigner’s Semicircle Law: Rigorous proof for symmetric matrices via the method of moments and the Stieltjes transform.
  • Marchenko-Pastur Law: Derivation for sample covariance matrices and the impact of the aspect ratio \(\gamma = p/n\).
  • The BBP Transition: Full proof of the sharp outlier transition in spiked covariance models (Baik, Ben Arous, Péché, 2005).
  • Circular Law: The limiting eigenvalue distribution of non-Hermitian i.i.d. matrices (Girko).
  • Practical Guidelines: Hessian analysis, spectral norm regularization, and signal detection in noise.
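The Marchenko-Pastur bulk and the BBP outlier can both be seen in a few lines. Below is a minimal sketch (the sizes \(n, p\) and spike strength \(h\) are illustrative): a pure-noise sample covariance fills the bulk \([(1-\sqrt{\gamma})^2, (1+\sqrt{\gamma})^2]\), while a rank-one spike with \(h > \sqrt{\gamma}\) detaches a single eigenvalue that lands near \((1+h)(1+\gamma/h)\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4000, 1000
gamma = p / n                                   # aspect ratio gamma = p/n

# Pure noise: eigenvalues of the sample covariance fill the MP bulk
Z = rng.normal(size=(n, p))
bulk = np.linalg.eigvalsh(Z.T @ Z / n)
lam_minus, lam_plus = (1 - np.sqrt(gamma))**2, (1 + np.sqrt(gamma))**2

# Spiked model: population covariance I + h * v v^T, with h above the
# BBP threshold sqrt(gamma), so one eigenvalue escapes the bulk
h = 2.0
v = np.zeros(p); v[0] = 1.0
X = Z + np.sqrt(h) * rng.normal(size=(n, 1)) * v
spiked = np.linalg.eigvalsh(X.T @ X / n)
predicted_outlier = (1 + h) * (1 + gamma / h)   # BBP prediction

print(bulk.min(), bulk.max())                   # ~ lam_minus, lam_plus
print(spiked.max(), predicted_outlier)
```

Dropping \(h\) below \(\sqrt{\gamma}\) in this sketch makes the outlier disappear back into the bulk, which is exactly the sharp transition of Baik, Ben Arous, and Péché.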

Phase 3: Non-Commutative Probability (2 Hours)

Focus: Algebra for Deep Layers

  • 04-3 Free Probability Theory
  • Voiculescu's Freeness: Replacing independence with non-commutative freeness.
  • Speicher’s Lattice: Combinatorial approach to freeness using non-crossing partitions and free cumulants.
  • R and S Transforms: Linearizing non-commutative addition and multiplication.
  • Rectangular Free Probability: Extensions for non-square weight matrices (Benaych-Georges).
  • Dynamical Isometry: Rigorous proof of signal preservation in deep orthogonal networks.
  • Practical Guidelines: Orthogonal initialization vs. non-linear activation stability.
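The dynamical-isometry claim for deep linear networks can be checked directly. This is a minimal sketch (width, depth, and scalings are illustrative): compose many random layers and compare the condition number of the end-to-end Jacobian under orthogonal versus Gaussian initialization.

```python
import numpy as np

rng = np.random.default_rng(2)
d, L = 128, 20                        # width and depth of a deep linear network

def end_to_end_condition(sample_layer):
    """Condition number of the product of L random layers."""
    M = np.eye(d)
    for _ in range(L):
        M = sample_layer() @ M
    s = np.linalg.svd(M, compute_uv=False)
    return s.max() / s.min()

# Orthogonal init: every layer is an exact isometry, so the product is too
ortho = lambda: np.linalg.qr(rng.normal(size=(d, d)))[0]
# Gaussian init scaled by 1/sqrt(d): norm-preserving only on average
gauss = lambda: rng.normal(size=(d, d)) / np.sqrt(d)

print(end_to_end_condition(ortho))    # ~ 1: all singular values equal
print(end_to_end_condition(gauss))    # blows up exponentially with depth
```

The Gaussian product's singular values spread at exponentially different rates (its Lyapunov spectrum), while the orthogonal product keeps every singular value at 1, which is the signal-preservation property proved in this unit.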

Phase 4: The Kernel Regime (2 Hours)

Focus: Infinite-Width Gradient Descent

  • 04-4 Neural Tangent Kernel (NTK)
  • Infinite-Width Convergence: Proof that wide networks converge to deterministic kernel machines.
  • The Lazy vs. Rich Transition: Mathematical derivation of the phase transition between kernel behavior and feature learning.
  • Eigen-decay & Generalization: Polynomial vs. exponential spectral decay on the sphere and the "Spectral Bias" phenomenon.
  • Attention NTK: Derivation of the kernel for a single attention head.
  • Practical Guidelines: When to use NTK scaling vs. Feature Learning scaling.
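The infinite-width convergence above can be verified empirically for a two-layer ReLU network. The sketch below (all sizes are illustrative) computes the empirical NTK between two unit inputs by summing gradient inner products over both weight layers, and compares it with the closed-form infinite-width limit built from the first-order arc-cosine kernel; the gap shrinks like \(1/\sqrt{m}\) as the width \(m\) grows.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
x = rng.normal(size=d); x /= np.linalg.norm(x)   # two unit-norm inputs
y = rng.normal(size=d); y /= np.linalg.norm(y)

relu = lambda z: np.maximum(z, 0.0)

def empirical_ntk(x, y, m):
    """NTK of f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x) at init."""
    W = rng.normal(size=(m, d))                    # hidden weights
    a = rng.normal(size=m)                         # output weights
    u, v = W @ x, W @ y
    term_a = relu(u) * relu(v)                     # grads w.r.t. a
    term_w = a**2 * (u > 0) * (v > 0) * (x @ y)    # grads w.r.t. W
    return (term_a + term_w).mean()

# Closed-form infinite-width NTK for ReLU on the unit sphere
cos_t = np.clip(x @ y, -1.0, 1.0)
theta = np.arccos(cos_t)
k_a = (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)
k_w = cos_t * (np.pi - theta) / (2 * np.pi)
k_inf = k_a + k_w

print(abs(empirical_ntk(x, y, 100) - k_inf))       # noisy at small width
print(abs(empirical_ntk(x, y, 200_000) - k_inf))   # concentrates at large width
```

At initialization the empirical kernel is random; the point of the convergence theorem is that this randomness vanishes in the infinite-width limit and the kernel stays frozen during training.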

Phase 5: Scaling the Frontier (2 Hours)

Focus: Tensor Programs and Compute Optimality

  • 04-5 Tensor Programs and Scaling Laws
  • The TP Master Theorem: Detailed step-by-step proof of the Gaussian limit for network activations.
  • \(\mu\)P for Adam: Derivation of the Maximal Update Parametrization for adaptive optimizers.
  • \(\mu\)Transfer: The secret sauce for transferring hyper-parameters from tiny models to 100B+ parameter LLMs.
  • The Chinchilla Frontier: Lagrange multiplier derivation of the \(N \propto \sqrt{C}\) optimality.
  • Practical Guidelines: Using the mup library and the "Beyond Chinchilla" regime.
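The Lagrange-multiplier derivation above can be sanity-checked numerically. This sketch uses a loss of the form \(L(N, D) = E + A/N^\alpha + B/D^\beta\) under the compute constraint \(C = 6ND\); the constants here are illustrative placeholders, not the fitted Chinchilla values. Setting \(dL/dN = 0\) gives the closed form \(N^\star \propto C^{\beta/(\alpha+\beta)}\), which reduces to \(N \propto \sqrt{C}\) when \(\alpha = \beta\); the grid search confirms the closed form.

```python
import numpy as np

# Illustrative scaling-law constants (placeholders, not fitted values)
A, B, alpha, beta, E = 400.0, 400.0, 0.34, 0.28, 1.7

def loss(N, C):
    D = C / (6 * N)                    # tokens implied by the budget C = 6*N*D
    return E + A / N**alpha + B / D**beta

def n_opt(C):
    # Closed form from dL/dN = 0 under the constraint C = 6*N*D:
    # N* = (alpha*A / (beta*B))^(1/(alpha+beta)) * (C/6)^(beta/(alpha+beta))
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    return G * (C / 6) ** (beta / (alpha + beta))

C = 1e21                               # a 1-zettaFLOP illustrative budget
Ns = np.logspace(6, 12, 20_000)        # brute-force grid over model sizes
grid_best = Ns[np.argmin(loss(Ns, C))]
print(n_opt(C), grid_best)             # closed form vs. grid search agree
```

Note that with \(\alpha = \beta\) the exponent \(\beta/(\alpha+\beta)\) is exactly \(1/2\), recovering the \(N \propto \sqrt{C}\) frontier stated above.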

Core Learning Outcomes

  1. Rigorous Mastery: Prove the fundamental theorems of high-dimensional probability and RMT.
  2. Architectural Design: Use NTK and TP theory to initialize and scale stable, high-performance architectures.
  3. Compute Efficiency: Calculate the optimal compute-parameters-data balance for any given budget.
  4. SOTA Research: Transition directly into reading and contributing to the latest papers on LLM geometry and scaling.