Topic 03: Statistical Learning Theory — Practice and Problem Set¶
1. Detailed Examples¶
1.1 Rademacher Complexity of Linear Classes¶
Consider the class of linear functions \(\mathcal{H} = \{x \mapsto w \cdot x : \|w\|_2 \le B\}\) where inputs are bounded as \(\|x_i\|_2 \le L\).
Derivation:
The empirical Rademacher complexity is
\[\hat{\mathcal{R}}_S(\mathcal{H}) = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{\|w\|_2 \le B} \sum_{i=1}^n \sigma_i\, w \cdot x_i\Big] = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{\|w\|_2 \le B} w \cdot \sum_{i=1}^n \sigma_i x_i\Big].\]
By the Cauchy-Schwarz inequality, \(w \cdot v \le \|w\| \|v\|\), and the supremum is attained when \(w\) is in the direction of \(v = \sum_i \sigma_i x_i\):
\[\hat{\mathcal{R}}_S(\mathcal{H}) = \frac{B}{n}\,\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^n \sigma_i x_i\Big\|_2\Big].\]
Using Jensen's Inequality (\(\mathbb{E}[X] \le \sqrt{\mathbb{E}[X^2]}\)):
\[\hat{\mathcal{R}}_S(\mathcal{H}) \le \frac{B}{n}\sqrt{\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^n \sigma_i x_i\Big\|_2^2\Big]}.\]
Since \(\mathbb{E}[\sigma_i \sigma_j] = 0\) for \(i \neq j\) and \(1\) for \(i=j\), the cross terms vanish:
\[\hat{\mathcal{R}}_S(\mathcal{H}) \le \frac{B}{n}\sqrt{\sum_{i=1}^n \|x_i\|_2^2} \le \frac{B L}{\sqrt{n}}.\]
Conclusion: \(\hat{\mathcal{R}}_S(\mathcal{H}) \le BL/\sqrt{n}\), so the complexity of linear models depends only on the norm bounds \(B\) and \(L\) and is independent of the input dimension \(d\).
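The closed form also makes the bound easy to check numerically: the supremum over \(w\) equals \(B\|\sum_i \sigma_i x_i\|_2\), so a Monte Carlo average over sign vectors estimates \(\hat{\mathcal{R}}_S(\mathcal{H})\) directly. Below is a minimal NumPy sketch; the values of \(n\), \(d\), \(B\), and \(L\) are arbitrary choices.

```python
import numpy as np

# Monte Carlo check of R_S(H) <= B*L/sqrt(n) for the linear class
# {x -> w.x : ||w||_2 <= B}. The sup over w has a closed form by Cauchy-Schwarz:
# sup_{||w|| <= B} w . (sum_i sigma_i x_i) = B * ||sum_i sigma_i x_i||_2.
rng = np.random.default_rng(0)
n, d, B, L = 50, 200, 1.0, 1.0

# Sample n points on the sphere of radius L, so ||x_i||_2 = L exactly.
X = rng.standard_normal((n, d))
X = L * X / np.linalg.norm(X, axis=1, keepdims=True)

num_mc = 2000
vals = np.empty(num_mc)
for t in range(num_mc):
    sigma = rng.choice([-1.0, 1.0], size=n)      # Rademacher signs
    vals[t] = B * np.linalg.norm(sigma @ X) / n  # exact sup for this sigma

print(f"Monte Carlo estimate : {vals.mean():.4f}")
print(f"Theoretical bound    : {B * L / np.sqrt(n):.4f}")
```

The estimate should sit at or below \(BL/\sqrt{n}\) for any choice of the sample.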
2. Problem Set (Theoretical)¶
Exercise 1: Proving Hoeffding's Inequality¶
- Show that for a random variable \(X\) with \(\mathbb{E}[X]=0\) and \(X \in [a, b]\), the moment generating function satisfies \(\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 (b-a)^2 / 8}\) (Hoeffding's Lemma).
- Use the Chernoff bound \(P(S_n \ge t) \le e^{-\lambda t} \prod \mathbb{E}[e^{\lambda X_i}]\) and optimize for \(\lambda\) to complete the proof of Hoeffding's inequality.
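For intuition (not a substitute for the proof), a small simulation can compare the empirical tail of a centered sum to the Hoeffding bound \(e^{-2t^2/(n(b-a)^2)}\). The sketch below uses \(X_i \sim \mathrm{Uniform}[0,1]\), an arbitrary choice.

```python
import numpy as np

# Empirical check of Hoeffding's inequality for X_i ~ Uniform[0, 1]
# (a = 0, b = 1, mean 1/2): P(sum_i (X_i - 1/2) >= t) <= exp(-2 t^2 / n).
rng = np.random.default_rng(0)
n, trials = 100, 100_000
S = (rng.uniform(0, 1, size=(trials, n)) - 0.5).sum(axis=1)  # centered sums

for t in [2.0, 4.0, 6.0]:
    empirical = (S >= t).mean()
    bound = np.exp(-2 * t**2 / n)
    print(f"t={t}: empirical tail {empirical:.5f} <= Hoeffding bound {bound:.5f}")
```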
Exercise 2: The Sauer-Shelah Lemma¶
Let \(\mathcal{H}\) be a hypothesis class with VC-dimension \(d\). The growth function \(m_{\mathcal{H}}(n)\) is the maximum number of distinct labelings (dichotomies) of \(n\) points that can be realized by \(\mathcal{H}\).
- Prove that for \(n \le d\), \(m_{\mathcal{H}}(n) = 2^n\).
- Prove the Sauer-Shelah Lemma by induction: \(m_{\mathcal{H}}(n) \le \sum_{i=0}^d \binom{n}{i}\).
- Show that for \(n > d\), \(\sum_{i=0}^d \binom{n}{i} \le (en/d)^d\).
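As a numerical companion to the lemma, the sketch below counts the dichotomies realized by interval indicators on the line (VC-dimension 2, an illustrative choice) and compares the count to \(\sum_{i=0}^d \binom{n}{i}\) and \((en/d)^d\).

```python
from itertools import combinations
from math import comb, e

# Count the dichotomies realized on n points on a line by interval indicators
# h_{a,b}(x) = 1{a <= x <= b} (VC-dimension d = 2): an interval labels a
# contiguous run of points with 1, so only the order of the points matters.
# Intervals happen to meet the Sauer bound with equality.
d = 2
for n in [3, 5, 10, 20]:
    dichotomies = {(0,) * n}                       # the all-zero labeling
    for i, j in combinations(range(n + 1), 2):     # points i..j-1 get label 1
        dichotomies.add(tuple(1 if i <= k < j else 0 for k in range(n)))
    sauer = sum(comb(n, i) for i in range(d + 1))
    print(f"n={n:2d}: realized {len(dichotomies):4d} <= Sauer bound {sauer:4d}"
          f" <= (en/d)^d = {(e * n / d) ** d:9.1f}")
```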
Exercise 3: Stability of Ridge Regression¶
Consider Ridge Regression \(w = (X^T X + n\lambda I)^{-1} X^T y\).
- Prove that the algorithm is \(\beta\)-uniformly stable with \(\beta = O(1/\lambda n)\).
- How does the stability change as \(\lambda \to 0\) in the over-parameterized regime (\(d > n\))?
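The first item can be probed empirically: refit the estimator after swapping a single training point and compare the resulting change in the loss at a fixed test point against the \(1/(\lambda n)\) bound. A minimal NumPy sketch follows; the synthetic data and dimensions are arbitrary choices, and a single random swap only probes the bound rather than the worst case.

```python
import numpy as np

# Empirical look at uniform stability of ridge regression: fit w on a sample S
# and on a copy of S with one point replaced, then measure how much the squared
# loss at a fresh point changes. The 1/(lambda*n) rate is an upper bound, so the
# observed gap should stay below it (it need not grow all the way up to it).
rng = np.random.default_rng(0)
n, d = 100, 50
w_star = rng.standard_normal(d) / np.sqrt(d)

def ridge(X, y, lam):
    # w = (X^T X + n*lambda*I)^{-1} X^T y, matching the parameterization above
    return np.linalg.solve(X.T @ X + X.shape[0] * lam * np.eye(X.shape[1]), X.T @ y)

X = rng.standard_normal((n, d))
y = X @ w_star + 0.1 * rng.standard_normal(n)
x_test = rng.standard_normal(d)
y_test = x_test @ w_star

for lam in [1.0, 0.1, 0.01, 0.001]:
    w_full = ridge(X, y, lam)
    X2, y2 = X.copy(), y.copy()
    X2[0], y2[0] = rng.standard_normal(d), rng.standard_normal()   # swap one point
    w_swap = ridge(X2, y2, lam)
    gap = abs((x_test @ w_full - y_test) ** 2 - (x_test @ w_swap - y_test) ** 2)
    print(f"lambda={lam:6.3f}: |loss change| = {gap:.4f}   (1/(lambda*n) = {1/(lam*n):.3f})")
```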
Exercise 4: Massart's Finite Class Lemma¶
Let \(\mathcal{A} \subset \mathbb{R}^n\) be a finite set of vectors with \(|\mathcal{A}| = N\). Let \(R = \max_{v \in \mathcal{A}} \|v\|_2\). Show that \(\mathbb{E}_\sigma [\sup_{v \in \mathcal{A}} \sum_{i=1}^n \sigma_i v_i] \le R\sqrt{2 \ln N}\). Hint: Use the Chernoff method and the fact that \(\mathbb{E}[e^{\lambda \sum \sigma_i v_i}] \le e^{\lambda^2 \|v\|^2 / 2}\).
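A direct Monte Carlo check of the statement can help confirm the constant; in the NumPy sketch below, the set \(\mathcal{A}\) is simply a collection of random Gaussian vectors, an arbitrary choice.

```python
import numpy as np

# Monte Carlo check of Massart's lemma: for a finite set A of N vectors in R^n
# with max norm R, E_sigma[ max_{v in A} sigma . v ] <= R * sqrt(2 ln N).
rng = np.random.default_rng(0)
n, N = 30, 500
A = rng.standard_normal((N, n))
R = np.linalg.norm(A, axis=1).max()

num_mc = 5000
sup_vals = np.empty(num_mc)
for t in range(num_mc):
    sigma = rng.choice([-1.0, 1.0], size=n)
    sup_vals[t] = (A @ sigma).max()

print(f"Monte Carlo E[sup]: {sup_vals.mean():.3f}")
print(f"Massart bound     : {R * np.sqrt(2 * np.log(N)):.3f}")
```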
Exercise 5: Rademacher Complexity of MLPs¶
Consider a two-layer ReLU network \(f(x) = \sum_{j=1}^H a_j \text{ReLU}(w_j^\top x)\) with constraints \(\|a\|_1 \le B_1\) and \(\|w_j\|_2 \le B_2\), and inputs bounded as \(\|x_i\|_2 \le L\) (as in Section 1.1). Prove that \(\hat{\mathcal{R}}_S(\mathcal{H}) \le \frac{2 B_1 B_2 L}{\sqrt{n}}\). Hint: Use the contraction property of Rademacher complexity: \(\hat{\mathcal{R}}_S(\phi \circ \mathcal{H}) \le 2\hat{\mathcal{R}}_S(\mathcal{H})\) for 1-Lipschitz \(\phi\).
Exercise 6: VC-Dimension of Linear Classifiers¶
Prove that the VC-dimension of linear classifiers in \(\mathbb{R}^d\) is \(d+1\).
- Provide a set of \(d+1\) points that can be shattered.
- Use Radon's Theorem to show that no set of \(d+2\) points can be shattered.
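The first item can be verified mechanically: for the points \(\{0, e_1, \dots, e_d\}\), one can solve a linear system so that \(w \cdot x_i + b\) equals each target label exactly. The sketch below (NumPy, with \(d=4\) as an arbitrary choice) checks all \(2^{d+1}\) labelings; the Radon argument in the second item cannot be replaced by this kind of enumeration.

```python
import numpy as np
from itertools import product

# Shattering half of Exercise 6: the d+1 points {0, e_1, ..., e_d} in R^d can be
# shattered by affine classifiers sign(w.x + b). For each labeling y we solve
# the (d+1)x(d+1) system [X | 1][w; b] = y exactly, so every labeling is realized.
d = 4
X = np.vstack([np.zeros(d), np.eye(d)])          # d+1 points in general position
A = np.hstack([X, np.ones((d + 1, 1))])          # augmented matrix [X | 1]

shattered = True
for y in product([-1.0, 1.0], repeat=d + 1):     # all 2^(d+1) labelings
    wb = np.linalg.solve(A, np.array(y))         # exact interpolation of the labels
    preds = np.sign(A @ wb)
    shattered &= np.array_equal(preds, np.array(y))

print(f"All 2^{d + 1} labelings realized by a linear classifier: {shattered}")
```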
Exercise 7: PAC-Bayes Change of Measure¶
Prove the fundamental identity (Donsker-Varadhan representation): for distributions \(Q \ll P\),
\[\mathrm{KL}(Q \,\|\, P) = \sup_{f}\Big\{\mathbb{E}_{h \sim Q}[f(h)] - \ln \mathbb{E}_{h \sim P}\big[e^{f(h)}\big]\Big\},\]
where the supremum is over bounded measurable \(f\). In particular, for any fixed \(f\), \(\mathbb{E}_{h \sim Q}[f(h)] \le \mathrm{KL}(Q\,\|\,P) + \ln \mathbb{E}_{h \sim P}[e^{f(h)}]\).
Hint: Use the definition of KL divergence and Jensen's inequality (or Fenchel duality).
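A numerical illustration of the inequality direction (the easy half of the identity) uses one-dimensional Gaussians, where \(\mathrm{KL}(\mathcal{N}(\mu,1)\,\|\,\mathcal{N}(0,1)) = \mu^2/2\) and the optimal \(f\) is the log-likelihood ratio. The NumPy sketch below uses an arbitrary \(\mu\).

```python
import numpy as np

# Monte Carlo check of the change-of-measure inequality behind Exercise 7:
# for any f,  E_Q[f] - ln E_P[e^f] <= KL(Q || P),  with equality at f = ln dQ/dP.
# Here P = N(0, 1), Q = N(mu, 1), so KL(Q||P) = mu^2/2 and ln dQ/dP(x) = mu*x - mu^2/2.
rng = np.random.default_rng(0)
mu, m = 1.5, 1_000_000
kl = mu ** 2 / 2

x_p = rng.standard_normal(m)            # samples from P
x_q = mu + rng.standard_normal(m)       # samples from Q

def dv_objective(f):
    return np.mean(f(x_q)) - np.log(np.mean(np.exp(f(x_p))))

f_opt = lambda x: mu * x - mu ** 2 / 2  # optimal f: the log-likelihood ratio
f_sub = lambda x: np.sin(x)             # an arbitrary suboptimal f

print(f"KL(Q||P)              = {kl:.4f}")
print(f"DV objective, f = lr  = {dv_objective(f_opt):.4f}  (should be close to KL)")
print(f"DV objective, f = sin = {dv_objective(f_sub):.4f}  (should be <= KL)")
```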
Exercise 8: Benign Overfitting in Linear Regression¶
Consider \(y = w^\top x + \epsilon\) with noise variance \(\sigma_\epsilon^2\). In the regime \(d \gg n\), the minimum-norm least-squares estimator \(\hat{w}\) interpolates the training data.
- Under what conditions on the eigenvalues of the covariance matrix \(\Sigma = \mathbb{E}[xx^\top]\) is the generalization error \(R(\hat{w}) \to \sigma_\epsilon^2\) as \(n, d \to \infty\)?
- Discuss the "Effective Rank" of \(\Sigma\).
3. Coding Practice¶
Task 1: Visualizing Double Descent¶
Train a series of linear models with increasing polynomial degrees (or a random feature model with increasing width) on a noisy sine wave.
- Dataset: 20 points sampled from \(y = \sin(x) + \epsilon\).
- Model: Linear regression with \(d\) random features.
- Plot: Training and Test error vs. \(d\). Identify the "interpolation threshold" where \(d=n\).
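A minimal sketch of this task using random cosine features and the minimum-norm least-squares solution (NumPy and matplotlib; the feature distribution, widths, and noise level are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# Random cosine features of increasing width fit to a noisy sine wave with
# minimum-norm least squares; plot train/test MSE against the width d.
rng = np.random.default_rng(0)
n, n_test, noise = 20, 500, 0.1
x_tr = rng.uniform(-np.pi, np.pi, n)
y_tr = np.sin(x_tr) + noise * rng.standard_normal(n)
x_te = np.linspace(-np.pi, np.pi, n_test)
y_te = np.sin(x_te)

def features(x, freqs, phases):
    # Random features phi_j(x) = cos(freq_j * x + phase_j)
    return np.cos(np.outer(x, freqs) + phases)

widths = np.arange(1, 101)
freqs_all = rng.normal(0, 3, widths.max())
phases_all = rng.uniform(0, 2 * np.pi, widths.max())
train_err, test_err = [], []

for d in widths:
    Phi_tr = features(x_tr, freqs_all[:d], phases_all[:d])
    Phi_te = features(x_te, freqs_all[:d], phases_all[:d])
    w = np.linalg.pinv(Phi_tr) @ y_tr              # minimum-norm least squares
    train_err.append(np.mean((Phi_tr @ w - y_tr) ** 2))
    test_err.append(np.mean((Phi_te @ w - y_te) ** 2))

plt.semilogy(widths, train_err, label="train MSE")
plt.semilogy(widths, test_err, label="test MSE")
plt.axvline(n, ls="--", c="gray", label="interpolation threshold d = n")
plt.xlabel("number of random features d"); plt.legend(); plt.show()
```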
Task 2: Estimating Rademacher Complexity¶
Implement a function to estimate \(\hat{\mathcal{R}}_S(\mathcal{H})\) for a small CNN on CIFAR-10.
- Logic: Draw random \(\pm 1\) labels \(\sigma\), train the model to fit them, and record the achieved correlation \(\frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i) = 2 \cdot \text{accuracy} - 1\) on the training set; averaging over several draws of \(\sigma\) gives an empirical estimate of the complexity (see the sketch after this list).
- Experiment: How does the complexity change if you add Dropout or Weight Decay?
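A possible starting point, assuming PyTorch and torchvision are available; the architecture, subset size, and hyperparameters are arbitrary choices, and the script downloads CIFAR-10 on first run.

```python
import torch, torch.nn as nn
import torchvision
from torchvision import transforms

# Empirical Rademacher complexity: draw random +-1 labels sigma, train the model
# to fit them, and record (1/n) sum_i sigma_i h(x_i) = 2*accuracy - 1.
device = "cuda" if torch.cuda.is_available() else "cpu"
n = 2000                                         # small subset to keep it fast

data = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                    transform=transforms.ToTensor())
X = torch.stack([data[i][0] for i in range(n)]).to(device)

def small_cnn():
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(32 * 8 * 8, 1))

def fit_random_labels(weight_decay=0.0, epochs=30):
    sigma = torch.randint(0, 2, (n,), device=device).float() * 2 - 1  # random +-1 labels
    model = small_cnn().to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    for _ in range(epochs):
        for i in range(0, n, 128):
            xb, yb = X[i:i + 128], sigma[i:i + 128]
            loss = nn.functional.soft_margin_loss(model(xb).squeeze(1), yb)
            opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        acc = (model(X).squeeze(1).sign() == sigma).float().mean().item()
    return 2 * acc - 1                            # correlation with the random labels

print("Rademacher estimate (no regularization):", fit_random_labels())
print("Rademacher estimate (weight decay 1e-2):", fit_random_labels(weight_decay=1e-2))
```

For a tighter estimate, average the returned correlation over several independent draws of \(\sigma\); a single draw is used here only to keep the sketch short.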
Task 3: PAC-Bayes Bound Calculation¶
For a simple Logistic Regression model, calculate the McAllester bound.
- Setup: Use \(\mathcal{N}(0, I)\) as the prior. After training, use the weights as the mean for a posterior \(\mathcal{N}(w, \sigma^2 I)\).
- Analysis: Plot the value of the bound \(\epsilon\) as a function of the training sample size \(n\). Is the bound non-vacuous for \(n=1000\)?
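One way to set this up on synthetic data with scikit-learn is sketched below. The bound used, \(R(Q) \le \hat{R}(Q) + \sqrt{(\mathrm{KL}(Q\|P) + \ln(2\sqrt{n}/\delta))/(2n)}\), is one common form of the McAllester bound, and \(\sigma\), \(\delta\), and the dataset are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Prior P = N(0, I); posterior Q = N(w_hat, sigma^2 I) centered at the trained
# weights. The Gibbs (stochastic) classifier's empirical risk is estimated by
# sampling weight vectors from Q.
rng = np.random.default_rng(0)
sigma, delta = 0.1, 0.05

def mcallester_bound(n):
    X, y = make_classification(n_samples=n, n_features=20, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    w = np.r_[clf.coef_.ravel(), clf.intercept_]          # d+1 parameters
    d = w.size
    # KL( N(w, sigma^2 I) || N(0, I) ) for a spherical Gaussian posterior
    kl = 0.5 * (d * sigma**2 + w @ w - d - d * np.log(sigma**2))
    errs = []
    for _ in range(100):
        w_s = w + sigma * rng.standard_normal(d)          # sample from Q
        preds = (X @ w_s[:-1] + w_s[-1] > 0).astype(int)
        errs.append(np.mean(preds != y))
    emp_risk = np.mean(errs)
    return emp_risk + np.sqrt((kl + np.log(2 * np.sqrt(n) / delta)) / (2 * n))

for n in [100, 1000, 10_000]:
    print(f"n = {n:6d}: McAllester bound on the risk = {mcallester_bound(n):.3f}")
```

A bound below 0.5 is non-vacuous for a balanced binary problem; sweeping \(\sigma\) trades off the KL term against the empirical risk of the noisy classifier.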
4. Solutions and Hints¶
Solution 4 (Exercise 4)¶
Let \(Z = \sup_{v \in \mathcal{A}} \sum_i \sigma_i v_i\). For any \(\lambda > 0\), Jensen's inequality and bounding the supremum by a sum give \(e^{\lambda \mathbb{E}[Z]} \le \mathbb{E}[e^{\lambda Z}] \le \sum_{v \in \mathcal{A}} \mathbb{E}[e^{\lambda \sigma \cdot v}] \le N e^{\lambda^2 R^2 / 2}\). Taking logarithms and dividing by \(\lambda\): \(\mathbb{E}[Z] \le \frac{\ln N}{\lambda} + \frac{\lambda R^2}{2}\). Minimizing over \(\lambda\) gives \(\lambda = \frac{\sqrt{2 \ln N}}{R}\), yielding the result \(R\sqrt{2 \ln N}\).
Hint 5 (Exercise 5)¶
\(\hat{\mathcal{R}}_S(\mathcal{H}) = \mathbb{E}_\sigma [ \sup \sum_{i=1}^n \sigma_i \sum_{j=1}^H a_j \text{ReLU}(w_j^\top x_i) ] = \mathbb{E}_\sigma [ \sup \sum_{j=1}^H a_j (\sum_{i=1}^n \sigma_i \text{ReLU}(w_j^\top x_i)) ]\). By \(\|a\|_1 \le B_1\), this is \(\le B_1 \mathbb{E}_\sigma [ \sup_{w: \|w\| \le B_2} \sum_{i=1}^n \sigma_i \text{ReLU}(w^\top x_i) ]\). Apply the contraction lemma and the linear class complexity result.
5. Open Research Questions¶
- Implicit Regularization in Transformers: Can we find a complexity measure that explains why Transformers generalize despite having billions of parameters and being trained to zero loss?
- Generalization of SAM: Does Sharpness-Aware Minimization lead to a lower Rademacher complexity, or is its benefit purely explained by PAC-Bayes bounds?
- Out-of-Distribution (OOD): Can SLT be extended to the setting where the test distribution is unknown but related to the training distribution via a "latent symmetry"?