Topic 04: Random Matrix Theory and NTK
10-Hour Intensive Master Syllabus
Welcome to the theoretical frontier of Large-Scale Deep Learning. In this intensive 10-hour module, we dismantle low-dimensional intuition and construct the rigorous mathematical framework used to design and scale the world's largest AI models.
This course transitions from the geometry of high-dimensional spaces to the spectral laws of random matrices, culminating in the unified theory of Neural Tangent Kernels and Tensor Programs.
Phase 1: High-Dimensional Foundations (2 Hours)
Focus: Why intuition fails in \(\mathbb{R}^d\)
- 04-1 High-Dimensional Geometry and Probability
- The Porcupine Cube & The Empty Ball: Proofs for volume concentration on the surface and the equator.
- The Isoperimetric Inequality: Rigorous proof of measure concentration on \(S^{d-1}\) and its derivation from spherical caps.
- Logarithmic Sobolev Inequalities: Analytical approach via Gross's Theorem (Gross, 1975) and Herbst's argument for Gaussian concentration.
- Johnson-Lindenstrauss Lemma: Full proof of near-isometric preservation of pairwise distances under random projections (see the sketch after this list).
- Dvoretzky's Theorem: Proof sketch of the emergence of Euclidean geometry in arbitrary convex bodies.
- Practical Guidelines: Impact on initialization, normalization, and high-dimensional nearest neighbor search.
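As a concrete illustration of the Johnson-Lindenstrauss item above, here is a minimal NumPy sketch (the point count, dimensions, and projection size are illustrative choices, not values from the course material): it projects 100 points from \(\mathbb{R}^{10000}\) down to \(\mathbb{R}^{1000}\) with a scaled Gaussian matrix and checks that all pairwise squared distances survive with only a small multiplicative distortion.

```python
# Johnson-Lindenstrauss sketch: a random Gaussian projection approximately
# preserves pairwise distances. All sizes below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 10_000, 1_000            # points, ambient dim, projected dim
X = rng.standard_normal((n, d))

# Scale by 1/sqrt(k) so squared norms are preserved in expectation.
P = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ P

def pairwise_sq_dists(Z):
    # Squared Euclidean distances between all pairs of rows.
    sq = np.sum(Z**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * Z @ Z.T

D_orig, D_proj = pairwise_sq_dists(X), pairwise_sq_dists(Y)
mask = ~np.eye(n, dtype=bool)           # ignore the zero diagonal
ratios = D_proj[mask] / D_orig[mask]
print(f"distortion ratios: min={ratios.min():.3f}, max={ratios.max():.3f}")
# For k on the order of log(n) / eps^2 the ratios concentrate in [1-eps, 1+eps].
```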
Phase 2: Asymptotic Spectral Theory (2 Hours)
Focus: The Eigenvalues of Noise and Signal
- 04-2 Spectral Laws of Random Matrix Theory
- The Stieltjes Transform Method: The "Master Tool" for limiting spectral distributions.
- Wigner’s Semicircle Law: Rigorous proof for symmetric matrices via the method of moments and the Stieltjes transform.
- Marchenko-Pastur Law: Derivation for sample covariance matrices and the impact of the aspect ratio \(\gamma = p/n\) (see the sketch after this list).
- The BBP Transition: Full proof of the sharp outlier transition in spiked covariance models (Baik, Ben Arous, Péché, 2005).
- Circular Law: Eigenvalues of non-symmetric matrices (Girko).
- Practical Guidelines: Hessian analysis, spectral norm regularization, and signal detection in noise.
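A minimal simulation of the Marchenko-Pastur item above, with illustrative matrix sizes: the eigenvalues of a sample covariance matrix built from iid standard Gaussian data should fill the interval \([(1-\sqrt{\gamma})^2, (1+\sqrt{\gamma})^2]\) predicted by the law.

```python
# Marchenko-Pastur sketch: eigenvalues of the sample covariance (1/n) X^T X,
# with X of shape n x p, follow the MP law with aspect ratio gamma = p/n.
import numpy as np

rng = np.random.default_rng(0)
n, p = 4000, 1000                       # samples, dimensions; gamma = 0.25
gamma = p / n
X = rng.standard_normal((n, p))
eigs = np.linalg.eigvalsh(X.T @ X / n)

lam_minus = (1 - np.sqrt(gamma)) ** 2   # MP support edges (unit variance)
lam_plus = (1 + np.sqrt(gamma)) ** 2
print(f"empirical range: [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"MP support:      [{lam_minus:.3f}, {lam_plus:.3f}]")

def mp_density(x, g):
    # Marchenko-Pastur density on its support (unit-variance entries, g <= 1).
    lm, lp = (1 - np.sqrt(g)) ** 2, (1 + np.sqrt(g)) ** 2
    return np.sqrt(np.maximum((lp - x) * (x - lm), 0.0)) / (2 * np.pi * g * x)

# Compare a histogram of `eigs` against mp_density(x, gamma) to see the fit.
```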
Phase 3: Non-Commutative Probability (2 Hours)
Focus: Algebra for Deep Layers
- 04-3 Free Probability Theory
- Voiculescu's Freeness: Replacing independence with non-commutative freeness.
- Speicher’s Lattice: Combinatorial approach to freeness using non-crossing partitions and free cumulants.
- R and S Transforms: Linearizing free additive and multiplicative convolution.
- Rectangular Free Probability: Extensions for non-square weight matrices (Benaych-Georges).
- Dynamical Isometry: Rigorous proof of signal preservation in deep orthogonal networks (see the sketch after this list).
- Practical Guidelines: Orthogonal initialization vs. non-linear activation stability.
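A quick numerical contrast for the Dynamical Isometry item above (width, depth, and the two initialization schemes are illustrative choices): a deep product of orthogonal layers is an exact isometry, while an iid Gaussian product with matched per-layer variance develops a badly conditioned singular spectrum.

```python
# Dynamical-isometry sketch: singular values of a deep linear product.
# Orthogonal layers keep all singular values at 1; Gaussian layers with
# variance 1/width per entry spread the spectrum over many orders of magnitude.
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 50

def product_singular_values(init):
    J = np.eye(width)
    for _ in range(depth):
        if init == "orthogonal":
            W, _ = np.linalg.qr(rng.standard_normal((width, width)))
        else:  # iid Gaussian with variance 1/width per entry
            W = rng.standard_normal((width, width)) / np.sqrt(width)
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

for init in ("orthogonal", "gaussian"):
    s = product_singular_values(init)
    print(f"{init:>10}: max sv = {s.max():.2e}, min sv = {s.min():.2e}")
# Orthogonal products preserve signal in every direction; Gaussian products
# amplify a few directions and crush the rest as depth grows.
```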
Phase 4: The Kernel Regime (2 Hours)
Focus: Infinite-Width Gradient Descent
- 04-4 Neural Tangent Kernel (NTK)
- Infinite-Width Convergence: Proof that wide networks converge to deterministic kernel machines (see the sketch after this list).
- The Lazy vs. Rich Transition: Mathematical derivation of the phase transition between kernel behavior and feature learning.
- Eigen-decay & Generalization: Polynomial vs. Exponential spectral decay on the sphere and the "Spectral Bias" phenomenon.
- Attention NTK: Derivation of the kernel for a single attention head.
- Practical Guidelines: When to use NTK scaling vs. Feature Learning scaling.
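To make the infinite-width convergence claim tangible, here is a sketch (the two-layer ReLU architecture, NTK parametrization, and unit-norm inputs are illustrative assumptions) comparing the empirical tangent kernel of a finite network against its closed-form infinite-width limit as the width grows.

```python
# Empirical NTK sketch for a two-layer ReLU network in NTK parametrization:
#   f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x),  a_j ~ N(0,1), w_j ~ N(0, I).
# The empirical kernel K_m(x, x') = <grad_theta f(x), grad_theta f(x')>
# converges to a deterministic arccos-type kernel as the width m grows.
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d);  x /= np.linalg.norm(x)
xp = rng.standard_normal(d); xp /= np.linalg.norm(xp)

def empirical_ntk(x, xp, m):
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    pre_x, pre_xp = W @ x, W @ xp
    # Contribution of the hidden weights: (1/m) sum_j a_j^2 1[.>0] 1[.>0] (x.x')
    k_w = np.mean(a**2 * (pre_x > 0) * (pre_xp > 0)) * (x @ xp)
    # Contribution of the output weights: (1/m) sum_j relu(.) relu(.)
    k_a = np.mean(np.maximum(pre_x, 0) * np.maximum(pre_xp, 0))
    return k_w + k_a

def analytic_ntk(x, xp):
    c = np.clip(x @ xp, -1.0, 1.0)
    theta = np.arccos(c)
    k_a = (np.sin(theta) + (np.pi - theta) * c) / (2 * np.pi)
    k_w = c * (np.pi - theta) / (2 * np.pi)
    return k_w + k_a

for m in (100, 10_000, 1_000_000):
    print(f"width {m:>9}: empirical {empirical_ntk(x, xp, m):.4f}  "
          f"analytic {analytic_ntk(x, xp):.4f}")
```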
Phase 5: Scaling the Frontier (2 Hours)
Focus: Tensor Programs and Compute Optimality
- 04-5 Tensor Programs and Scaling Laws
- The TP Master Theorem: Detailed step-by-step proof of the Gaussian limit for network activations.
- \(\mu\)P for Adam: Derivation of the Maximal Update Parametrization for adaptive optimizers.
- \(\mu\)Transfer: Zero-shot transfer of hyperparameters from small proxy models to 100B+ parameter LLMs.
- The Chinchilla Frontier: Lagrange multiplier derivation of the \(N \propto \sqrt{C}\) optimality (see the sketch after this list).
- Practical Guidelines: Using the mup library and the "Beyond Chinchilla" regime.
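A numerical companion to the Chinchilla item above: assuming a loss surface of the form \(L(N, D) = E + A N^{-\alpha} + B D^{-\beta}\) and the compute constraint \(C \approx 6ND\) (the coefficients below are illustrative placeholders, not the fitted values from the Chinchilla paper), a simple grid search over \(N\) recovers the optimal exponent \(\beta/(\alpha+\beta)\), which sits close to the \(\sqrt{C}\) rule of thumb.

```python
# Compute-optimal sketch: minimize L(N, D) = E + A/N^alpha + B/D^beta subject
# to C = 6*N*D by substituting D = C/(6N). The closed-form optimum scales as
# N* ~ C^{beta/(alpha+beta)}, i.e. roughly sqrt(C) when alpha and beta are close.
import numpy as np

# Illustrative placeholder coefficients (not the fitted Chinchilla values).
E, A, B = 1.7, 400.0, 400.0
alpha, beta = 0.34, 0.28

def optimal_n(C, grid_size=2000):
    # Grid search over N with D = C / (6N); sufficient precision for a sketch.
    N = np.logspace(6, 13, grid_size)
    D = C / (6 * N)
    loss = E + A / N**alpha + B / D**beta
    return N[np.argmin(loss)]

for C in (1e20, 1e22, 1e24):
    print(f"C = {C:.0e}:  N* ~ {optimal_n(C):.3e}")

# Estimate the empirical exponent p in N* ~ C^p from two budgets.
p = np.log(optimal_n(1e24) / optimal_n(1e20)) / np.log(1e24 / 1e20)
print(f"fitted exponent p ~ {p:.3f}  (closed form: beta/(alpha+beta) = {beta/(alpha+beta):.3f})")
```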
Core Learning Outcomes
- Rigorous Mastery: Prove the fundamental theorems of high-dimensional probability and RMT.
- Architectural Design: Use NTK and TP theory to initialize and scale stable, high-performance architectures.
- Compute Efficiency: Calculate the optimal compute-parameters-data balance for any given budget.
- SOTA Research: Transition directly into reading and contributing to the latest papers on LLM geometry and scaling.