Topic 04: Random Matrix Theory and NTK
10-Hour Intensive Master Syllabus
Welcome to the theoretical frontier of Large-Scale Deep Learning. In this intensive 10-hour module, we dismantle low-dimensional intuition and construct the rigorous mathematical framework used to design and scale the world's largest AI models.
This course transitions from the geometry of high-dimensional spaces to the spectral laws of random matrices, culminating in the unified theory of Neural Tangent Kernels and Tensor Programs.
Phase 1: High-Dimensional Foundations (2 Hours)
Focus: Why intuition fails in \(\mathbb{R}^d\)
- 04-1 High-Dimensional Geometry and Probability
- The Porcupine Cube & The Empty Ball: Proofs for volume concentration on the surface and the equator.
- The Isoperimetric Inequality: Rigorous proof of measure concentration on \(S^{d-1}\) and its derivation from spherical caps.
- Logarithmic Sobolev Inequalities: Analytical approach via Gross's Theorem (Gross, 1975) and Herbst's argument for Gaussian concentration.
- Johnson-Lindenstrauss Lemma: Full proof of near-isometric preservation of pairwise distances under random projections (see the sketch after this list).
- Dvoretzky's Theorem: Proof sketch of the emergence of Euclidean geometry in arbitrary convex bodies.
- Practical Guidelines: Impact on initialization, normalization, and high-dimensional nearest neighbor search.
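As a concrete illustration of the Johnson-Lindenstrauss item above, here is a minimal NumPy sketch (the point count, dimensions, and projection size are illustrative choices, not values from the course material): it projects 100 points from \(\mathbb{R}^{10000}\) down to \(\mathbb{R}^{1000}\) with a scaled Gaussian matrix and checks that all pairwise squared distances survive with only a small multiplicative distortion.

```python
# Johnson-Lindenstrauss sketch: a random Gaussian projection approximately
# preserves pairwise distances. All sizes below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 10_000, 1_000            # points, ambient dim, projected dim
X = rng.standard_normal((n, d))

# Scale by 1/sqrt(k) so squared norms are preserved in expectation.
P = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ P

def pairwise_sq_dists(Z):
    # Squared Euclidean distances between all pairs of rows.
    sq = np.sum(Z**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * Z @ Z.T

D_orig, D_proj = pairwise_sq_dists(X), pairwise_sq_dists(Y)
mask = ~np.eye(n, dtype=bool)           # ignore the zero diagonal
ratios = D_proj[mask] / D_orig[mask]
print(f"distortion ratios: min={ratios.min():.3f}, max={ratios.max():.3f}")
# For k on the order of log(n) / eps^2 the ratios concentrate in [1-eps, 1+eps].
```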
Phase 2: Asymptotic Spectral Theory (2 Hours)
Focus: The Eigenvalues of Noise and Signal
- 04-2 Spectral Laws of Random Matrix Theory
- The Stieltjes Transform Method: The "Master Tool" for limiting spectral distributions.
- Wigner’s Semicircle Law: Rigorous proof for symmetric matrices via the method of moments and the Stieltjes transform.
- Marchenko-Pastur Law: Derivation for sample covariance matrices and the impact of the aspect ratio \(\gamma = p/n\) (see the sketch after this list).
- The BBP Transition: Full proof of the sharp outlier transition in spiked covariance models (Baik, Ben Arous, Péché, 2005).
- Circular Law: Eigenvalues of non-symmetric matrices (Girko).
- Practical Guidelines: Hessian analysis, spectral norm regularization, and signal detection in noise.
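A minimal simulation of the Marchenko-Pastur item above, with illustrative matrix sizes: the eigenvalues of a sample covariance matrix built from iid standard Gaussian data should fill the interval \([(1-\sqrt{\gamma})^2, (1+\sqrt{\gamma})^2]\) predicted by the law.

```python
# Marchenko-Pastur sketch: eigenvalues of the sample covariance (1/n) X^T X,
# with X of shape n x p, follow the MP law with aspect ratio gamma = p/n.
import numpy as np

rng = np.random.default_rng(0)
n, p = 4000, 1000                       # samples, dimensions; gamma = 0.25
gamma = p / n
X = rng.standard_normal((n, p))
eigs = np.linalg.eigvalsh(X.T @ X / n)

lam_minus = (1 - np.sqrt(gamma)) ** 2   # MP support edges (unit variance)
lam_plus = (1 + np.sqrt(gamma)) ** 2
print(f"empirical range: [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"MP support:      [{lam_minus:.3f}, {lam_plus:.3f}]")

def mp_density(x, g):
    # Marchenko-Pastur density on its support (unit-variance entries, g <= 1).
    lm, lp = (1 - np.sqrt(g)) ** 2, (1 + np.sqrt(g)) ** 2
    return np.sqrt(np.maximum((lp - x) * (x - lm), 0.0)) / (2 * np.pi * g * x)

# Compare a histogram of `eigs` against mp_density(x, gamma) to see the fit.
```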
Phase 3: Non-Commutative Probability (2 Hours)
Focus: Algebra for Deep Layers
- 04-3 Free Probability Theory
- Voiculescu's Freeness: Replacing independence with non-commutative freeness.
- Speicher’s Lattice: Combinatorial approach to freeness using non-crossing partitions and free cumulants.
- R and S Transforms: Linearizing free additive and multiplicative convolution.
- Rectangular Free Probability: Extensions for non-square weight matrices (Benaych-Georges).
- Dynamical Isometry: Rigorous proof of signal preservation in deep orthogonal networks (see the sketch after this list).
- Practical Guidelines: Orthogonal initialization vs. non-linear activation stability.
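A quick numerical contrast for the Dynamical Isometry item above (width, depth, and the two initialization schemes are illustrative choices): a deep product of orthogonal layers is an exact isometry, while an iid Gaussian product with matched per-layer variance develops a badly conditioned singular spectrum.

```python
# Dynamical-isometry sketch: singular values of a deep linear product.
# Orthogonal layers keep all singular values at 1; Gaussian layers with
# variance 1/width per entry spread the spectrum over many orders of magnitude.
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 50

def product_singular_values(init):
    J = np.eye(width)
    for _ in range(depth):
        if init == "orthogonal":
            W, _ = np.linalg.qr(rng.standard_normal((width, width)))
        else:  # iid Gaussian with variance 1/width per entry
            W = rng.standard_normal((width, width)) / np.sqrt(width)
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

for init in ("orthogonal", "gaussian"):
    s = product_singular_values(init)
    print(f"{init:>10}: max sv = {s.max():.2e}, min sv = {s.min():.2e}")
# Orthogonal products preserve signal in every direction; Gaussian products
# amplify a few directions and crush the rest as depth grows.
```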
Phase 4: The Kernel Regime (2 Hours)
Focus: Infinite-Width Gradient Descent
- 04-4 Neural Tangent Kernel (NTK)
- Infinite-Width Convergence: Proof that wide networks converge to deterministic kernel machines (see the sketch after this list).
- The Lazy vs. Rich Transition: Mathematical derivation of the phase transition between kernel behavior and feature learning.
- Eigen-decay & Generalization: Polynomial vs. Exponential spectral decay on the sphere and the "Spectral Bias" phenomenon.
- Attention NTK: Derivation of the kernel for a single attention head.
- Practical Guidelines: When to use NTK scaling vs. Feature Learning scaling.
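To make the infinite-width convergence claim tangible, here is a sketch (the two-layer ReLU architecture, NTK parametrization, and unit-norm inputs are illustrative assumptions) comparing the empirical tangent kernel of a finite network against its closed-form infinite-width limit as the width grows.

```python
# Empirical NTK sketch for a two-layer ReLU network in NTK parametrization:
#   f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x),  a_j ~ N(0,1), w_j ~ N(0, I).
# The empirical kernel K_m(x, x') = <grad_theta f(x), grad_theta f(x')>
# converges to a deterministic arccos-type kernel as the width m grows.
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d);  x /= np.linalg.norm(x)
xp = rng.standard_normal(d); xp /= np.linalg.norm(xp)

def empirical_ntk(x, xp, m):
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    pre_x, pre_xp = W @ x, W @ xp
    # Contribution of the hidden weights: (1/m) sum_j a_j^2 1[.>0] 1[.>0] (x.x')
    k_w = np.mean(a**2 * (pre_x > 0) * (pre_xp > 0)) * (x @ xp)
    # Contribution of the output weights: (1/m) sum_j relu(.) relu(.)
    k_a = np.mean(np.maximum(pre_x, 0) * np.maximum(pre_xp, 0))
    return k_w + k_a

def analytic_ntk(x, xp):
    c = np.clip(x @ xp, -1.0, 1.0)
    theta = np.arccos(c)
    k_a = (np.sin(theta) + (np.pi - theta) * c) / (2 * np.pi)
    k_w = c * (np.pi - theta) / (2 * np.pi)
    return k_w + k_a

for m in (100, 10_000, 1_000_000):
    print(f"width {m:>9}: empirical {empirical_ntk(x, xp, m):.4f}  "
          f"analytic {analytic_ntk(x, xp):.4f}")
```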
Phase 5: Scaling the Frontier (2 Hours)
Focus: Tensor Programs and Compute Optimality
- 04-5 Tensor Programs and Scaling Laws
- The TP Master Theorem: Detailed step-by-step proof of the Gaussian limit for network activations.
- \(\mu\)P for Adam: Derivation of the Maximal Update Parametrization for adaptive optimizers.
- \(\mu\)Transfer: Zero-shot transfer of hyperparameters from small proxy models to 100B+ parameter LLMs.
- The Chinchilla Frontier: Lagrange multiplier derivation of the \(N \propto \sqrt{C}\) optimality (see the sketch after this list).
- Practical Guidelines: Using the mup library and the "Beyond Chinchilla" regime.
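A numerical companion to the Chinchilla item above: assuming a loss surface of the form \(L(N, D) = E + A N^{-\alpha} + B D^{-\beta}\) and the compute constraint \(C \approx 6ND\) (the coefficients below are illustrative placeholders, not the fitted values from the Chinchilla paper), a simple grid search over \(N\) recovers the optimal exponent \(\beta/(\alpha+\beta)\), which sits close to the \(\sqrt{C}\) rule of thumb.

```python
# Compute-optimal sketch: minimize L(N, D) = E + A/N^alpha + B/D^beta subject
# to C = 6*N*D by substituting D = C/(6N). The closed-form optimum scales as
# N* ~ C^{beta/(alpha+beta)}, i.e. roughly sqrt(C) when alpha and beta are close.
import numpy as np

# Illustrative placeholder coefficients (not the fitted Chinchilla values).
E, A, B = 1.7, 400.0, 400.0
alpha, beta = 0.34, 0.28

def optimal_n(C, grid_size=2000):
    # Grid search over N with D = C / (6N); sufficient precision for a sketch.
    N = np.logspace(6, 13, grid_size)
    D = C / (6 * N)
    loss = E + A / N**alpha + B / D**beta
    return N[np.argmin(loss)]

for C in (1e20, 1e22, 1e24):
    print(f"C = {C:.0e}:  N* ~ {optimal_n(C):.3e}")

# Estimate the empirical exponent p in N* ~ C^p from two budgets.
p = np.log(optimal_n(1e24) / optimal_n(1e20)) / np.log(1e24 / 1e20)
print(f"fitted exponent p ~ {p:.3f}  (closed form: beta/(alpha+beta) = {beta/(alpha+beta):.3f})")
```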
Core Learning Outcomes
- Rigorous Mastery: Prove the fundamental theorems of high-dimensional probability and RMT.
- Architectural Design: Use NTK and TP theory to initialize and scale stable, high-performance architectures.
- Compute Efficiency: Calculate the optimal compute-parameters-data balance for any given budget.
- SOTA Research: Transition directly into reading and contributing to the latest papers on LLM geometry and scaling.