Practice 09: Kernel Methods and RKHS
1. Theoretical Exercises
1.1 Positive Definiteness and Kernel Construction
Problem: Prove that if \(k_1(x, y)\) and \(k_2(x, y)\) are positive definite (PD) kernels, then:
- \(k(x, y) = k_1(x, y) + k_2(x, y)\) is PD.
- \(k(x, y) = k_1(x, y) \cdot k_2(x, y)\) is PD (Schur Product Theorem).
- \(k(x, y) = \exp(k_1(x, y))\) is PD.
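These closure rules can be sanity-checked numerically before attempting the proofs: a PD kernel must produce a positive semidefinite Gram matrix on any finite sample, so the smallest eigenvalue should be nonnegative up to floating-point error. A minimal sketch (the kernel choices, sample size, and bandwidth are arbitrary, not prescribed by the exercise):

```python
import numpy as np

# Numerical sanity check (not a proof): closure properties of PD kernels,
# verified via the smallest eigenvalue of each Gram matrix on a random sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K1 = np.exp(-0.5 * sq_dists)   # RBF kernel (PD)
K2 = X @ X.T                   # linear kernel (PD)

for name, K in [("sum", K1 + K2),
                ("product", K1 * K2),   # elementwise (Schur) product
                ("exp", np.exp(K1))]:   # elementwise exp of kernel values
    lam_min = np.linalg.eigvalsh(K).min()
    print(f"{name}: smallest eigenvalue = {lam_min:.2e}")  # >= 0 up to rounding
```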
1.2 The Reproducing Property
Problem: Let \(\mathcal{H}\) be an RKHS with kernel \(k\). Prove that:
- \(k(x, y) = \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}}\).
- \(\|k(x, \cdot)\|_{\mathcal{H}} = \sqrt{k(x, x)}\).
- \(|f(x)| \le \|f\|_{\mathcal{H}} \sqrt{k(x, x)}\) (Pointwise stability).
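As a pointer for the first two items: both drop out of a single application of the reproducing property \(f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}\), choosing \(f = k(y, \cdot)\) and using the symmetry of \(k\):

\[
k(x, y) = \langle k(y, \cdot), k(x, \cdot) \rangle_{\mathcal{H}},
\qquad
\|k(x, \cdot)\|_{\mathcal{H}}^2 = \langle k(x, \cdot), k(x, \cdot) \rangle_{\mathcal{H}} = k(x, x).
\]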
1.3 Mercer’s Theorem and the Gaussian Kernel
Problem: For the RBF kernel \(k(x, y) = \exp(-\gamma \|x-y\|^2)\), the feature space is infinite-dimensional.
- Why can't we explicitly write down the feature map \(\phi(x)\) as a finite vector?
- How does the choice of \(\gamma\) affect the "smoothness" of functions in the RKHS?
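For a concrete one-dimensional view of the infinite dimensionality, note that \(\exp(-\gamma (x - y)^2) = e^{-\gamma x^2} e^{-\gamma y^2} \sum_{k \ge 0} \frac{(2\gamma)^k}{k!} x^k y^k\), which assigns one feature coordinate to every monomial degree. A minimal sketch comparing truncations of this Taylor feature map to the exact kernel (the truncation degrees and test points are arbitrary choices):

```python
import numpy as np
from math import factorial

# Truncated Taylor feature map for the 1D RBF kernel: one feature per
# monomial degree k, so the exact feature map is infinite-dimensional.
def phi(x, gamma, degree):
    ks = np.arange(degree + 1)
    coeffs = np.array([np.sqrt((2 * gamma) ** k / factorial(k)) for k in ks])
    return np.exp(-gamma * x ** 2) * coeffs * x ** ks

x, y, gamma = 0.7, -0.3, 1.0
exact = np.exp(-gamma * (x - y) ** 2)
for d in (1, 3, 5, 10):
    approx = phi(x, gamma, d) @ phi(y, gamma, d)
    print(f"degree {d}: |exact - truncated| = {abs(exact - approx):.2e}")
```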
1.4 The Representer Theorem with L2 Regularization
Problem: Consider Kernel Ridge Regression: \(\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2\).
- Substitute \(f(x) = \sum_{j=1}^{n} \alpha_j k(x_j, x)\), the form guaranteed by the Representer Theorem, and derive the closed-form solution for the vector \(\alpha\) (a numerical check follows below).
- Compare this to the solution of standard Ridge Regression.
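The closed form is easy to verify numerically once derived. A minimal kernel ridge regression sketch, using the formula stated in the hint in Section 3.1 (the data, bandwidth, and \(\lambda\) are arbitrary choices):

```python
import numpy as np

# Kernel ridge regression via the closed form alpha = (K + lam*I)^{-1} y.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

gamma, lam = 0.5, 1e-2
K = np.exp(-gamma * (X[:, None, 0] - X[None, :, 0]) ** 2)  # Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict at new points: f(x) = sum_j alpha_j k(x_j, x).
X_test = np.linspace(-3, 3, 5)[:, None]
K_test = np.exp(-gamma * (X_test[:, None, 0] - X[None, :, 0]) ** 2)
print(K_test @ alpha)
```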
1.5 MMD and the Characteristic Property
Problem: A kernel is characteristic if the map \(P \mapsto \mu_P = \int k(x, \cdot)\, dP(x)\) is injective.
- Prove that the linear kernel \(k(x, y) = x^\top y\) is NOT characteristic.
- Provide two different distributions \(P\) and \(Q\) that have the same mean embedding under the linear kernel.
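A minimal empirical illustration (one possible pair, not the only answer): \(\mathcal{N}(0, 1)\) and \(\mathrm{Uniform}(-\sqrt{3}, \sqrt{3})\) share mean \(0\) (and even variance \(1\)), so their linear-kernel embeddings coincide, while an RBF-kernel embedding separates them:

```python
import numpy as np

# Two different distributions with equal means have (approximately) equal
# linear-kernel mean embeddings, since mu_P = E_P[x] under the linear kernel.
rng = np.random.default_rng(2)
P = rng.normal(0.0, 1.0, size=100_000)                   # N(0, 1)
Q = rng.uniform(-np.sqrt(3), np.sqrt(3), size=100_000)   # mean 0, variance 1

print("linear embeddings:", P.mean(), Q.mean())          # both ~ 0

# An RBF embedding tells them apart: compare E[k(x, t)] on a grid of t values.
t = np.linspace(-3, 3, 7)

def emb(S):
    # Empirical RBF-kernel mean embedding evaluated on the grid t.
    return np.exp(-0.5 * (S[:, None] - t[None, :]) ** 2).mean(axis=0)

print("max gap (RBF):", np.abs(emb(P) - emb(Q)).max())   # clearly > 0
```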
1.6 Bochner’s Theorem and RFF
Problem: For the Laplacian kernel \(k(x, y) = \exp(-\lambda |x-y|)\) in 1D, find the spectral density \(p(\omega)\) such that \(k(\delta) = \int e^{i\omega \delta} p(\omega) d\omega\). Hint: This is the Fourier transform of a double exponential.
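One worked step toward the hint (matching the normalization to the stated convention is left as part of the exercise):

\[
\int_{-\infty}^{\infty} e^{-\lambda |\delta|} e^{-i\omega\delta}\, d\delta
= 2 \int_{0}^{\infty} e^{-\lambda\delta} \cos(\omega\delta)\, d\delta
= \frac{2\lambda}{\lambda^2 + \omega^2},
\]

and under the convention \(k(\delta) = \int e^{i\omega\delta} p(\omega)\, d\omega\), the density \(p(\omega)\) is this transform divided by \(2\pi\).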
1.7 Nyström vs. RFF
Problem:
- When is the Nyström method expected to perform better than RFF? (Hint: Consider the rank of the Gram matrix).
- What is the computational complexity of the Nyström approximation for \(m\) landmarks?
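For intuition on the complexity question, a minimal Nyström sketch; the exact Gram matrix is computed here only to measure the approximation error, and the sample size and landmark count are arbitrary choices:

```python
import numpy as np

# Nystrom approximation with m landmarks: K_hat = C @ pinv(W) @ C.T, where
# C = K[:, S] is n x m and W = K[S, S] is m x m. Cost: O(n*m) kernel
# evaluations, O(m^3) for the pseudo-inverse, O(n*m^2) for the products.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
gamma = 1.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                          # exact Gram (for reference)

m = 50
S = rng.choice(len(X), size=m, replace=False)    # landmark indices
C, W = K[:, S], K[np.ix_(S, S)]
K_hat = C @ np.linalg.pinv(W) @ C.T

print("relative error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```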
1.8 Convergence of RFF
Problem: Rahimi & Recht (2007) showed that the approximation error \(|k(x, y) - \hat{k}(x, y)|\) decays as \(O(1/\sqrt{m})\), where \(m\) is the number of random features.
- Why is this rate independent of the dimensionality of \(x\)?
- What is the downside of this slow convergence for high-precision applications?
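A quick Monte-Carlo check of the rate at a single fixed pair \((x, y)\) (a sketch; the dimension, bandwidth, and repeat counts are arbitrary choices). The mean absolute error should roughly halve each time \(m\) quadruples:

```python
import numpy as np

# Empirical O(1/sqrt(m)) check: average |k - k_hat| over independent draws
# of the random features, for increasing feature counts m.
rng = np.random.default_rng(4)
gamma, d = 0.5, 5
x, y = rng.normal(size=d), rng.normal(size=d)
k_exact = np.exp(-gamma * np.sum((x - y) ** 2))

for m in (100, 400, 1600, 6400):
    errs = []
    for _ in range(200):
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))
        b = rng.uniform(0, 2 * np.pi, size=m)
        zx = np.sqrt(2 / m) * np.cos(W @ x + b)
        zy = np.sqrt(2 / m) * np.cos(W @ y + b)
        errs.append(abs(zx @ zy - k_exact))
    print(f"m={m:5d}: mean abs error = {np.mean(errs):.4f}")
```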
2. Coding Practice
2.1 Approximating a Kernel with RFF
Task: Implement Random Fourier Features from scratch for the RBF kernel.
- Generate 2D synthetic data.
- Sample \(\omega \sim \mathcal{N}(0, 2\gamma I)\).
- Construct the feature map \(\phi(x) = \sqrt{\frac{2}{m}} [\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_m^\top x + b_m)]\).
- Compare the approximated Gram matrix \(\hat{K} = \Phi \Phi^\top\), where \(\Phi \in \mathbb{R}^{n \times m}\) stacks the feature vectors row-wise, with the exact \(K\); one possible reference implementation follows below.
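A reference sketch for this task (variable names, constants, and the error metrics are my own choices, not prescribed by the exercise):

```python
import numpy as np

# Random Fourier features for the RBF kernel k(x, y) = exp(-gamma ||x-y||^2).
rng = np.random.default_rng(5)
n, d, m, gamma = 200, 2, 500, 1.0
X = rng.normal(size=(n, d))                       # 2D synthetic data

# Exact RBF Gram matrix.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)

# RFF: omega ~ N(0, 2*gamma*I), random phase b ~ Uniform(0, 2*pi).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))
b = rng.uniform(0, 2 * np.pi, size=m)
Phi = np.sqrt(2 / m) * np.cos(X @ W.T + b)        # (n, m) feature matrix

K_hat = Phi @ Phi.T
print("max abs error: ", np.abs(K - K_hat).max())
print("mean abs error:", np.abs(K - K_hat).mean())
```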
2.2 MMD Two-Sample Test
Task: Use the MMD formula to test if two samples come from the same distribution.
- Sample \(X \sim \mathcal{N}(0, 1)\) and \(Y \sim \mathcal{N}(0.5, 1)\).
- Compute the empirical \(\mathrm{MMD}^2(X, Y)\) using the RBF kernel.
- Perform a permutation test to determine the p-value; a sketch of the full pipeline follows below.
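A minimal end-to-end sketch; it uses the biased (V-statistic) MMD estimator for brevity, and the bandwidth and permutation count are ad hoc choices:

```python
import numpy as np

# MMD two-sample test with a permutation p-value (1D samples for simplicity).
def rbf(a, b, gamma):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def mmd2(x, y, gamma):
    # Biased estimator: mean(Kxx) + mean(Kyy) - 2 * mean(Kxy).
    return (rbf(x, x, gamma).mean() + rbf(y, y, gamma).mean()
            - 2 * rbf(x, y, gamma).mean())

rng = np.random.default_rng(6)
n, gamma = 200, 0.5
X = rng.normal(0.0, 1.0, size=n)
Y = rng.normal(0.5, 1.0, size=n)

observed = mmd2(X, Y, gamma)
pooled = np.concatenate([X, Y])
perms = []
for _ in range(500):
    rng.shuffle(pooled)                 # re-split the pooled sample at random
    perms.append(mmd2(pooled[:n], pooled[n:], gamma))

p_value = (np.sum(np.array(perms) >= observed) + 1) / (len(perms) + 1)
print(f"MMD^2 = {observed:.4f}, p-value = {p_value:.4f}")
```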
3. Hints & Solutions
3.1 Hints
- 1.1: Use the definition \(\sum c_i c_j k(x_i, x_j) \ge 0\).
- 1.4: The solution is \(\alpha = (K + \lambda I)^{-1} y\), where \(K\) is the Gram matrix.
- 2.1: Remember to include the random phase \(b \sim \text{Uniform}(0, 2\pi)\) if using the cosine-only form.
3.2 Solutions (Brief)
- 1.2.3: By Cauchy-Schwarz: \(|f(x)| = |\langle f, k_x \rangle| \le \|f\| \|k_x\| = \|f\| \sqrt{k(x, x)}\).
- 1.5.1: For the linear kernel, \(\mu_P = \int x\, dP(x) = \mathbb{E}_P[x]\). Any two distributions with the same mean (but different higher moments) have the same embedding, so the map is not injective. For 1.5.2, take e.g. \(P = \mathcal{N}(0, 1)\) and \(Q = \mathrm{Uniform}(-\sqrt{3}, \sqrt{3})\): both have mean \(0\) (and even variance \(1\)), yet differ as distributions.
- 1.6: \(p(\omega) = \frac{1}{\pi} \frac{\lambda}{\lambda^2 + \omega^2}\) (Cauchy/Lorentzian distribution).