Practice 09: Kernel Methods and RKHS
1. Theoretical Exercises
1.1 Positive Definiteness and Kernel Construction
Problem: Prove that if \(k_1(x, y)\) and \(k_2(x, y)\) are positive definite (PD) kernels, then:
- \(k(x, y) = k_1(x, y) + k_2(x, y)\) is PD.
- \(k(x, y) = k_1(x, y) \cdot k_2(x, y)\) is PD (Schur Product Theorem).
- \(k(x, y) = \exp(k_1(x, y))\) is PD.
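These closure rules can be sanity-checked numerically before attempting the proofs: a PD kernel must produce a positive semidefinite Gram matrix on any finite sample, so the smallest eigenvalue should be nonnegative up to floating-point error. A minimal sketch (the kernel choices, sample size, and bandwidth are arbitrary, not prescribed by the exercise):

```python
import numpy as np

# Numerical sanity check (not a proof): closure properties of PD kernels,
# verified via the smallest eigenvalue of each Gram matrix on a random sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K1 = np.exp(-0.5 * sq_dists)   # RBF kernel (PD)
K2 = X @ X.T                   # linear kernel (PD)

for name, K in [("sum", K1 + K2),
                ("product", K1 * K2),   # elementwise (Schur) product
                ("exp", np.exp(K1))]:   # elementwise exp of kernel values
    lam_min = np.linalg.eigvalsh(K).min()
    print(f"{name}: smallest eigenvalue = {lam_min:.2e}")  # >= 0 up to rounding
```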
1.2 The Reproducing Property
Problem: Let \(\mathcal{H}\) be an RKHS with kernel \(k\). Prove that:
- \(k(x, y) = \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}}\).
- \(\|k(x, \cdot)\|_{\mathcal{H}} = \sqrt{k(x, x)}\).
- \(|f(x)| \le \|f\|_{\mathcal{H}} \sqrt{k(x, x)}\) (Pointwise stability).
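As a pointer for the first two items: both drop out of a single application of the reproducing property \(f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}\), choosing \(f = k(y, \cdot)\) and using the symmetry of \(k\):

\[
k(x, y) = \langle k(y, \cdot), k(x, \cdot) \rangle_{\mathcal{H}},
\qquad
\|k(x, \cdot)\|_{\mathcal{H}}^2 = \langle k(x, \cdot), k(x, \cdot) \rangle_{\mathcal{H}} = k(x, x).
\]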
1.3 Mercer’s Theorem and the Gaussian Kernel
Problem: For the RBF kernel \(k(x, y) = \exp(-\gamma \|x-y\|^2)\), the feature space is infinite-dimensional.
- Why can't we explicitly write down the feature map \(\phi(x)\) as a finite vector?
- How does the choice of \(\gamma\) affect the "smoothness" of functions in the RKHS?
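For a concrete one-dimensional view of the infinite dimensionality, note that \(\exp(-\gamma (x - y)^2) = e^{-\gamma x^2} e^{-\gamma y^2} \sum_{k \ge 0} \frac{(2\gamma)^k}{k!} x^k y^k\), which assigns one feature coordinate to every monomial degree. A minimal sketch comparing truncations of this Taylor feature map to the exact kernel (the truncation degrees and test points are arbitrary choices):

```python
import numpy as np
from math import factorial

# Truncated Taylor feature map for the 1D RBF kernel: one feature per
# monomial degree k, so the exact feature map is infinite-dimensional.
def phi(x, gamma, degree):
    ks = np.arange(degree + 1)
    coeffs = np.array([np.sqrt((2 * gamma) ** k / factorial(k)) for k in ks])
    return np.exp(-gamma * x ** 2) * coeffs * x ** ks

x, y, gamma = 0.7, -0.3, 1.0
exact = np.exp(-gamma * (x - y) ** 2)
for d in (1, 3, 5, 10):
    approx = phi(x, gamma, d) @ phi(y, gamma, d)
    print(f"degree {d}: |exact - truncated| = {abs(exact - approx):.2e}")
```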
1.4 The Representer Theorem with L2 Regularization
Problem: Consider Kernel Ridge Regression: \(\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2\).
- Substitute \(f(x) = \sum_{j=1}^{n} \alpha_j k(x_j, x)\), the form guaranteed by the Representer Theorem, and derive the closed-form solution for the vector \(\alpha\) (a numerical check follows below).
- Compare this to the solution of standard Ridge Regression.
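The closed form is easy to verify numerically once derived. A minimal kernel ridge regression sketch, using the formula stated in the hint in Section 3.1 (the data, bandwidth, and \(\lambda\) are arbitrary choices):

```python
import numpy as np

# Kernel ridge regression via the closed form alpha = (K + lam*I)^{-1} y.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

gamma, lam = 0.5, 1e-2
K = np.exp(-gamma * (X[:, None, 0] - X[None, :, 0]) ** 2)  # Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict at new points: f(x) = sum_j alpha_j k(x_j, x).
X_test = np.linspace(-3, 3, 5)[:, None]
K_test = np.exp(-gamma * (X_test[:, None, 0] - X[None, :, 0]) ** 2)
print(K_test @ alpha)
```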
1.5 MMD and the Characteristic Property
Problem: A kernel is characteristic if the map \(P \mapsto \mu_P = \int k(x, \cdot)\, dP(x)\) is injective.
- Prove that the linear kernel \(k(x, y) = x^\top y\) is NOT characteristic.
- Provide two different distributions \(P\) and \(Q\) that have the same mean embedding under the linear kernel.
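A minimal empirical illustration (one possible pair, not the only answer): \(\mathcal{N}(0, 1)\) and \(\mathrm{Uniform}(-\sqrt{3}, \sqrt{3})\) share mean \(0\) (and even variance \(1\)), so their linear-kernel embeddings coincide, while an RBF-kernel embedding separates them:

```python
import numpy as np

# Two different distributions with equal means have (approximately) equal
# linear-kernel mean embeddings, since mu_P = E_P[x] under the linear kernel.
rng = np.random.default_rng(2)
P = rng.normal(0.0, 1.0, size=100_000)                   # N(0, 1)
Q = rng.uniform(-np.sqrt(3), np.sqrt(3), size=100_000)   # mean 0, variance 1

print("linear embeddings:", P.mean(), Q.mean())          # both ~ 0

# An RBF embedding tells them apart: compare E[k(x, t)] on a grid of t values.
t = np.linspace(-3, 3, 7)

def emb(S):
    # Empirical RBF-kernel mean embedding evaluated on the grid t.
    return np.exp(-0.5 * (S[:, None] - t[None, :]) ** 2).mean(axis=0)

print("max gap (RBF):", np.abs(emb(P) - emb(Q)).max())   # clearly > 0
```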
1.6 Bochner’s Theorem and RFF
Problem: For the Laplacian kernel \(k(x, y) = \exp(-\lambda |x-y|)\) in 1D, find the spectral density \(p(\omega)\) such that \(k(\delta) = \int e^{i\omega \delta} p(\omega) d\omega\). Hint: This is the Fourier transform of a double exponential.
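One worked step toward the hint (matching the normalization to the stated convention is left as part of the exercise):

\[
\int_{-\infty}^{\infty} e^{-\lambda |\delta|} e^{-i\omega\delta}\, d\delta
= 2 \int_{0}^{\infty} e^{-\lambda\delta} \cos(\omega\delta)\, d\delta
= \frac{2\lambda}{\lambda^2 + \omega^2},
\]

and under the convention \(k(\delta) = \int e^{i\omega\delta} p(\omega)\, d\omega\), the density \(p(\omega)\) is this transform divided by \(2\pi\).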
1.7 Nyström vs. RFF
Problem:
- When is the Nyström method expected to perform better than RFF? (Hint: Consider the rank of the Gram matrix).
- What is the computational complexity of the Nyström approximation for \(m\) landmarks?
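For intuition on the complexity question, a minimal Nyström sketch; the exact Gram matrix is computed here only to measure the approximation error, and the sample size and landmark count are arbitrary choices:

```python
import numpy as np

# Nystrom approximation with m landmarks: K_hat = C @ pinv(W) @ C.T, where
# C = K[:, S] is n x m and W = K[S, S] is m x m. Cost: O(n*m) kernel
# evaluations, O(m^3) for the pseudo-inverse, O(n*m^2) for the products.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
gamma = 1.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                          # exact Gram (for reference)

m = 50
S = rng.choice(len(X), size=m, replace=False)    # landmark indices
C, W = K[:, S], K[np.ix_(S, S)]
K_hat = C @ np.linalg.pinv(W) @ C.T

print("relative error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```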
1.8 Convergence of RFF
Problem: Rahimi & Recht (2007) showed that the approximation error \(|k(x, y) - \hat{k}(x, y)|\) decays as \(O(1/\sqrt{m})\), where \(m\) is the number of random features.
- Why is this rate independent of the dimensionality of \(x\)?
- What is the downside of this slow convergence for high-precision applications?
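A quick Monte-Carlo check of the rate at a single fixed pair \((x, y)\) (a sketch; the dimension, bandwidth, and repeat counts are arbitrary choices). The mean absolute error should roughly halve each time \(m\) quadruples:

```python
import numpy as np

# Empirical O(1/sqrt(m)) check: average |k - k_hat| over independent draws
# of the random features, for increasing feature counts m.
rng = np.random.default_rng(4)
gamma, d = 0.5, 5
x, y = rng.normal(size=d), rng.normal(size=d)
k_exact = np.exp(-gamma * np.sum((x - y) ** 2))

for m in (100, 400, 1600, 6400):
    errs = []
    for _ in range(200):
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))
        b = rng.uniform(0, 2 * np.pi, size=m)
        zx = np.sqrt(2 / m) * np.cos(W @ x + b)
        zy = np.sqrt(2 / m) * np.cos(W @ y + b)
        errs.append(abs(zx @ zy - k_exact))
    print(f"m={m:5d}: mean abs error = {np.mean(errs):.4f}")
```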
2. Coding Practice
2.1 Approximating a Kernel with RFF
Task: Implement Random Fourier Features from scratch for the RBF kernel.
- Generate 2D synthetic data.
- Sample \(\omega \sim \mathcal{N}(0, 2\gamma I)\).
- Construct the feature map \(\phi(x) = \sqrt{\frac{2}{m}} [\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_m^\top x + b_m)]\).
- Compare the approximated Gram matrix \(\hat{K} = \Phi \Phi^\top\), where \(\Phi \in \mathbb{R}^{n \times m}\) stacks the feature vectors row-wise, with the exact \(K\); one possible reference implementation follows below.
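A reference sketch for this task (variable names, constants, and the error metrics are my own choices, not prescribed by the exercise):

```python
import numpy as np

# Random Fourier features for the RBF kernel k(x, y) = exp(-gamma ||x-y||^2).
rng = np.random.default_rng(5)
n, d, m, gamma = 200, 2, 500, 1.0
X = rng.normal(size=(n, d))                       # 2D synthetic data

# Exact RBF Gram matrix.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)

# RFF: omega ~ N(0, 2*gamma*I), random phase b ~ Uniform(0, 2*pi).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))
b = rng.uniform(0, 2 * np.pi, size=m)
Phi = np.sqrt(2 / m) * np.cos(X @ W.T + b)        # (n, m) feature matrix

K_hat = Phi @ Phi.T
print("max abs error: ", np.abs(K - K_hat).max())
print("mean abs error:", np.abs(K - K_hat).mean())
```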
2.2 MMD Two-Sample Test
Task: Use the MMD formula to test if two samples come from the same distribution.
- Sample \(X \sim \mathcal{N}(0, 1)\) and \(Y \sim \mathcal{N}(0.5, 1)\).
- Compute the empirical \(\mathrm{MMD}^2(X, Y)\) using the RBF kernel.
- Perform a permutation test to determine the p-value; a sketch of the full pipeline follows below.
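A minimal end-to-end sketch; it uses the biased (V-statistic) MMD estimator for brevity, and the bandwidth and permutation count are ad hoc choices:

```python
import numpy as np

# MMD two-sample test with a permutation p-value (1D samples for simplicity).
def rbf(a, b, gamma):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def mmd2(x, y, gamma):
    # Biased estimator: mean(Kxx) + mean(Kyy) - 2 * mean(Kxy).
    return (rbf(x, x, gamma).mean() + rbf(y, y, gamma).mean()
            - 2 * rbf(x, y, gamma).mean())

rng = np.random.default_rng(6)
n, gamma = 200, 0.5
X = rng.normal(0.0, 1.0, size=n)
Y = rng.normal(0.5, 1.0, size=n)

observed = mmd2(X, Y, gamma)
pooled = np.concatenate([X, Y])
perms = []
for _ in range(500):
    rng.shuffle(pooled)                 # re-split the pooled sample at random
    perms.append(mmd2(pooled[:n], pooled[n:], gamma))

p_value = (np.sum(np.array(perms) >= observed) + 1) / (len(perms) + 1)
print(f"MMD^2 = {observed:.4f}, p-value = {p_value:.4f}")
```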
3. Hints & Solutions
3.1 Hints
- 1.1: Use the definition \(\sum c_i c_j k(x_i, x_j) \ge 0\).
- 1.4: The solution is \(\alpha = (K + \lambda I)^{-1} y\), where \(K\) is the Gram matrix.
- 2.1: Remember to include the random phase \(b \sim \text{Uniform}(0, 2\pi)\) if using the cosine-only form.
3.2 Solutions (Brief)
- 1.2.3: By Cauchy-Schwarz: \(|f(x)| = |\langle f, k_x \rangle| \le \|f\| \|k_x\| = \|f\| \sqrt{k(x, x)}\).
- 1.5.1: For the linear kernel, \(\mu_P = \int x\, dP(x) = \mathbb{E}_P[x]\). Any two distributions with the same mean (but different higher moments) have the same embedding, so the map is not injective. For 1.5.2, take e.g. \(P = \mathcal{N}(0, 1)\) and \(Q = \mathrm{Uniform}(-\sqrt{3}, \sqrt{3})\): both have mean \(0\) (and even variance \(1\)), yet differ as distributions.
- 1.6: \(p(\omega) = \frac{1}{\pi} \frac{\lambda}{\lambda^2 + \omega^2}\) (Cauchy/Lorentzian distribution).