Topic 01: Optimization Theory — The Mathematical Foundations of Neural Optimization
Master Curriculum Guide (10-Hour Intensive)
Optimization is the engine of modern machine learning. While classical convex optimization provides a rigorous starting point, the success of deep learning relies on the surprising efficacy of gradient-based methods in high-dimensional, non-convex landscapes.
This curriculum provides a mathematically rigorous, deep dive into the state-of-the-art theory governing neural network optimization. Expect dense proofs, topological analyses, and direct translation of theory into code.
Module 1: Geometry and Saddle Points in High Dimensions
Estimated Time: 2 Hours. Why does optimization succeed in spaces with \(10^9\) dimensions? We explore the topological phase transition of non-convex landscapes.
- Advanced Topology: Morse Theory, the Morse Lemma, and the relationship between critical points and sub-level set topology.
- Random Field Theory: Counting critical points via the Kac-Rice Formula.
- The Spin Glass Analogy: Detailed mapping of neural loss landscapes to p-spin spherical spin glass models.
- Hessian Spectral Dynamics: The BBP transition, bulk-outlier separation, and the impact of the signal-to-noise ratio (simulated in the sketch after this list).
- Verification: From-scratch simulation of the Kac-Rice density and empirical Hessian spectrum analysis.
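To make the bulk-outlier picture concrete before the full Kac-Rice treatment, here is a minimal NumPy sketch of the BBP transition: a rank-one signal of strength \(\theta\) buried in a Wigner noise matrix produces an outlier eigenvalue near \(\theta + 1/\theta\) only once \(\theta > 1\); below that threshold the top eigenvalue sticks to the bulk edge at 2. The matrix size, the \(\theta\) values, and the function name `spiked_wigner_spectrum` are illustrative choices for this sketch, not part of the curriculum materials.

```python
import numpy as np

def spiked_wigner_spectrum(n=1000, theta=1.5, seed=0):
    """Eigenvalues of theta * v v^T (signal) plus a Wigner noise matrix.

    BBP prediction: the noise bulk fills [-2, 2]; for theta > 1 a single
    outlier detaches near theta + 1/theta, for theta <= 1 it is absorbed.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, n)) / np.sqrt(2 * n)
    W = G + G.T                              # symmetric noise, entry variance 1/n
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    H = W + theta * np.outer(v, v)           # "Hessian" = noise + rank-one signal
    return np.linalg.eigvalsh(H)

for theta in (0.5, 1.0, 2.0):
    top = spiked_wigner_spectrum(theta=theta)[-1]
    pred = theta + 1 / theta if theta > 1 else 2.0   # outlier location, or bulk edge
    print(f"theta={theta:.1f}  top eigenvalue={top:.3f}  prediction={pred:.3f}")
```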
Module 2: Convergence Theory of Gradient Descent
Estimated Time: 2 Hours. Guaranteed bounds on the speed of learning.
- The Acceleration Frontier: Exhaustive algebraic proof of Nesterov's Accelerated Gradient (NAG) and the optimal lower bounds for first-order methods (compared against vanilla gradient descent in the sketch after this list).
- Non-Convex Frameworks: The Kurdyka-Łojasiewicz (KL) Inequality and its role in proving convergence for non-convex functions.
- Structural Constraints: \(L\)-smoothness, \(\mu\)-strong convexity, and the Polyak-Łojasiewicz (PL) condition.
- Failure Modes: Why constant step sizes oscillate on non-smooth functions and what breaks when the \(L\)-smoothness assumption is dropped.
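To see the acceleration gap numerically, the sketch below runs vanilla gradient descent (step size \(1/L\)) against Nesterov's accelerated gradient with the strongly convex momentum coefficient \((\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})\) on an ill-conditioned quadratic. The dimension, the condition number \(\kappa = L/\mu = 100\), and the iteration budget are illustrative assumptions.

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x with spectrum in [mu, L]. GD contracts the
# slowest mode at roughly (1 - mu/L) per step; NAG at roughly (1 - sqrt(mu/L)).
rng = np.random.default_rng(0)
d, mu, L = 50, 1.0, 100.0
A = np.diag(np.linspace(mu, L, d))           # a diagonal quadratic is WLOG
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x0 = rng.standard_normal(d)

# Vanilla gradient descent with step size 1/L.
x = x0.copy()
for _ in range(300):
    x = x - grad(x) / L

# Nesterov's accelerated gradient with constant momentum for strong convexity.
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
x_nag, y = x0.copy(), x0.copy()
for _ in range(300):
    x_prev, x_nag = x_nag, y - grad(y) / L   # gradient step from the lookahead point
    y = x_nag + beta * (x_nag - x_prev)      # momentum extrapolation

print(f"GD  final loss: {f(x):.3e}")
print(f"NAG final loss: {f(x_nag):.3e}")
```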
Module 3: Stochastic Algorithms and Variance Reduction
Estimated Time: 2 Hours. Noise is a feature, not a bug.
- Langevin Dynamics: Modeling SGD as a discretised SDE (SGLD) and the Fokker-Planck proof of convergence to the Gibbs distribution.
- Noise Analysis: Empirical and theoretical evidence for heavy-tailed (\(\alpha\)-stable) noise in deep learning (Simsekli et al.).
- Variance Elimination: Comprehensive step-by-step proofs for SVRG and SAG achieving linear convergence on finite sums (an SVRG sketch follows this list).
- Averaging Techniques: The statistical optimality of Polyak-Ruppert iterate averaging.
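As a concrete companion to the SVRG proof, here is a minimal sketch of the algorithm on a ridge-regularized least-squares finite sum; the problem sizes, step-size rule, and inner-loop length are illustrative assumptions rather than tuned settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 20, 0.1
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Finite sum: f(w) = (1/n) * sum_i 0.5*(a_i^T w - b_i)^2 + 0.5*lam*||w||^2
def full_grad(w):
    return A.T @ (A @ w - b) / n + lam * w

def grad_i(w, i):
    return A[i] * (A[i] @ w - b[i]) + lam * w

w_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)

L_max = np.max(np.sum(A**2, axis=1)) + lam     # smoothness of the worst f_i
eta, m = 0.2 / L_max, 2 * n                    # step size and inner-loop length

w = np.zeros(d)
for epoch in range(31):
    w_snap = w.copy()
    g_snap = full_grad(w_snap)                 # full gradient at the snapshot
    for _ in range(m):
        i = rng.integers(n)
        # Control variate: unbiased gradient whose variance vanishes at w*.
        v = grad_i(w, i) - grad_i(w_snap, i) + g_snap
        w = w - eta * v
    if epoch % 5 == 0:
        print(f"epoch {epoch:2d}  ||w - w*|| = {np.linalg.norm(w - w_star):.2e}")
```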
Module 4: Adaptive and Second-Order Methods
Estimated Time: 2 Hours. Correcting for anisotropy via curvature estimation.
- SOTA Optimizers: Analysis of Lion (sign-momentum) and Sophia (Hessian-clipped adaptive steps).
- Efficient Curvature Estimation: Rigorous proof and implementation of Hutchinson’s Trace Estimator (sketched after this list).
- Kronecker Algebra: The full derivation of K-FAC and its reduction of per-layer FIM inversion cost from \(O(d^6)\) to \(O(d^3)\) for a layer of width \(d\).
- Engineering Guidelines: The mathematical reason why weight decay must be decoupled from the adaptive update (AdamW) rather than folded into the gradient as an \(L_2\) penalty (contrasted in the second sketch below).
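Hutchinson's estimator is short enough to verify directly. The sketch below probes a stand-in positive semi-definite matrix with Rademacher vectors and compares against the exact trace; in a real network the matrix-vector product `hvp` would be an autodiff Hessian-vector product, and the function name, matrix size, and probe count here are illustrative choices.

```python
import numpy as np

def hutchinson_trace(hvp, d, num_probes=200, seed=0):
    """Estimate tr(H) using only matrix-vector products with H.

    For z with i.i.d. Rademacher (+/-1) entries, E[z^T H z] = tr(H), so
    averaging z^T (H z) over independent probes converges to the trace.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=d)
        total += z @ hvp(z)
    return total / num_probes

# Stand-in "Hessian": an explicit PSD matrix so the answer can be checked.
rng = np.random.default_rng(1)
d = 100
M = rng.standard_normal((d, d))
H = M @ M.T / d                        # trace ~ d, comfortably away from zero
hvp = lambda z: H @ z                  # in practice: an autodiff HVP, not H @ z

print(f"Hutchinson estimate: {hutchinson_trace(hvp, d):.2f}")
print(f"Exact trace:         {np.trace(H):.2f}")
```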
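The decoupling argument is easiest to see in update-rule form. The sketch below implements one Adam-style step with weight decay handled either as an \(L_2\) term folded into the gradient or as a decoupled decay applied alongside the adaptive step; `adam_like_step` and all hyperparameter values are illustrative for this sketch, not a library API.

```python
import numpy as np

def adam_like_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, wd=0.01, decoupled=True):
    """One Adam step with weight decay, either coupled (L2) or decoupled (AdamW).

    Coupled:   wd*w is added to the gradient, so the decay is rescaled by
               the adaptive preconditioner 1/sqrt(v_hat).
    Decoupled: the decay multiplies the weights directly, independent of the
               gradient statistics (the AdamW formulation).
    """
    if not decoupled:
        g = g + wd * w                        # L2 folded into the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * wd * w                   # decay untouched by 1/sqrt(v_hat)
    return w, m, v

# With zero task gradient, the coupled L2 penalty is normalized away by the
# preconditioner: every weight marches toward zero by roughly lr per step in
# absolute terms, regardless of its size, so the small weight is wiped out
# while the large one barely shrinks in relative terms. Decoupled decay
# instead shrinks each weight by the same relative factor (1 - lr*wd) per step.
w_c, m_c, v_c = np.array([5.0, 0.1]), np.zeros(2), np.zeros(2)
w_d, m_d, v_d = np.array([5.0, 0.1]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    w_c, m_c, v_c = adam_like_step(w_c, np.zeros(2), m_c, v_c, t, decoupled=False)
    w_d, m_d, v_d = adam_like_step(w_d, np.zeros(2), m_d, v_d, t, decoupled=True)
print("coupled L2 after 500 steps:", w_c)
print("decoupled  after 500 steps:", w_d)
```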
Module 5: Modern Frontiers of Optimization Theory
Estimated Time: 2 Hours. Why deep learning works in the overparameterized regime.
- Manifold Connectivity: The geometry of Mode Connectivity (Garipov et al.) and the topology of low-loss paths.
- Implicit Regularization: Rigorous proof of the bias toward Maximum Margin solutions in logistic regression and Low-Rank solutions in matrix factorization (the logistic case is sketched after this list).
- The Scaling Law of Optimization: How compute budget \(C\) interacts with parameters \(N\) and the critical batch size \(B_{crit}\).
- Sharpness-Aware Minimization (SAM): Minimax optimization for flatness and its first-order Taylor approximation (also sketched below).
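For the implicit-regularization claim, the sketch below runs plain gradient descent on logistic loss over synthetic linearly separable data: the parameter norm keeps growing while the normalized margin \(\min_i y_i x_i^\top w / \lVert w \rVert\) slowly increases, the signature of convergence in direction toward the maximum-margin separator (Soudry et al.). The data, step size, and iteration schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters => linearly separable through the origin.
X = np.vstack([rng.normal([3.0, 1.0], 0.5, (50, 2)),
               rng.normal([-3.0, -1.0], 0.5, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

def grad(w):
    # Gradient of (1/n) * sum_i log(1 + exp(-y_i x_i^T w))
    margins = y * (X @ w)
    return -(X.T @ (y / (1 + np.exp(margins)))) / len(y)

w = np.zeros(2)
for t in range(1, 100001):
    w -= 0.1 * grad(w)
    if t in (100, 1000, 10000, 100000):
        norm_w = np.linalg.norm(w)
        norm_margin = np.min(y * (X @ w)) / norm_w
        print(f"t={t:6d}  ||w||={norm_w:6.2f}  normalized margin={norm_margin:.4f}")
```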
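And for SAM itself, a minimal sketch of the first-order update: the inner maximization over \(\lVert \epsilon \rVert \le \rho\) is replaced by the closed-form ascent step \(\epsilon = \rho\, g / \lVert g \rVert\), and the outer descent step uses the gradient at the perturbed point. The toy quadratic, \(\rho\), and learning rate are illustrative assumptions, and `sam_step` is a name invented for this sketch.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.005, rho=0.05):
    """One Sharpness-Aware Minimization step (first-order approximation).

    SAM solves min_w max_{||eps|| <= rho} L(w + eps). Linearizing the inner
    problem gives eps = rho * g / ||g||; the outer update then descends on
    the gradient evaluated at the adversarially perturbed point w + eps.
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # worst-case perturbation
    return w - lr * grad_fn(w + eps)              # descend on the "sharp" loss

# Toy anisotropic quadratic L(w) = 0.5 * w^T diag(100, 1) w: the perturbation
# inflates the gradient most along the high-curvature (sharp) direction.
H = np.diag([100.0, 1.0])
grad_fn = lambda w: H @ w

w = np.array([1.0, 1.0])
for _ in range(50):
    w = sam_step(w, grad_fn)
print("iterate after 50 SAM steps:", w)
```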
Prerequisites: Graduate-level Multivariable Calculus, Spectral Theory (Linear Algebra), and Stochastic Processes.
Target Outcome: Ability to derive convergence rates for new optimizers and diagnose landscape-level training failures using second-order statistics.