Note
This post reviews the landmark paper “Improved Bounds for the Nyström Method with Application to Kernel Classification” by Jin, Yang et al. (2013). It marks a transition from pessimistic worst-case analysis to sharper, data-dependent spectral bounds.
🏷️ Introduction: The Nyström Method
The Nyström method is a popular technique for approximating large positive semi-definite (PSD) kernel matrices by sampling a subset of columns. If we denote the $m$ sampled columns as $C \in \mathbb{R}^{N \times m}$ and the intersection matrix as $W \in \mathbb{R}^{m \times m}$, the Nyström approximation is:

$$\tilde{K} = C\, W^{\dagger} C^{\top}$$
While empirically successful, early theoretical bounds were often too loose to explain its performance in practice.
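The construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; the RBF kernel, data, and sample sizes are arbitrary choices:

```python
import numpy as np

def nystrom_approx(K, m, seed=0):
    """Nystrom approximation of a PSD matrix K from m uniformly sampled
    columns: K_hat = C @ pinv(W) @ C.T, with C the sampled columns and
    W the m x m intersection matrix."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(K.shape[0], size=m, replace=False)
    C = K[:, idx]                 # N x m sampled columns
    W = K[np.ix_(idx, idx)]       # m x m intersection matrix
    return C @ np.linalg.pinv(W) @ C.T

# Toy check on an RBF kernel over 200 random 1-D points
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
K = np.exp(-0.5 * (X - X.T) ** 2)
K_hat = nystrom_approx(K, m=40)
rel_err = np.linalg.norm(K - K_hat, 2) / np.linalg.norm(K, 2)
```

Because the Gaussian kernel's spectrum decays very fast, even 40 of 200 columns give a tiny relative spectral error here.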
In particular, in the limit $N \to \infty$, the PSD kernel matrix (after rescaling by $1/N$) represents the continuous kernel function. Therefore, the projection of the kernel onto a rank-$r$ kernel will be the product

$$\kappa_r(x, y) = \sum_{i=1}^{r} \lambda_i \, \varphi_i(x)\, \varphi_i(y),$$

where the eigenfunctions $\varphi_i$ are orthonormal. The intersection matrix will become the diagonal of eigenvalues $\lambda_1, \dots, \lambda_r$. Clearly, the error will be $O\big(\sum_{i > r} \lambda_i\big)$, depending on the remaining eigenvalues.
🌊 Beyond the Bottleneck
Previous “worst-case” theories often suggested that the spectral norm error decayed at a rate of $O(N/\sqrt{m})$, where $m$ is the number of sampled columns. In high-dimensional machine learning, this would imply that a prohibitively large number of samples is required to reach high accuracy.
The contribution of Jin, Yang et al. was to move beyond this “pessimistic” view by incorporating two key properties of real-world kernel matrices: Coherence and Spectral Decay.
🔍 The Role of Coherence
Coherence measures how “spread out” the information of the top eigenvectors is across the rows.
- If a matrix has low coherence, its information is distributed evenly, and uniform sampling is highly likely to capture the important features of the kernel.
- The authors proved that for low-coherence matrices, the number of samples required for a stable approximation scales linearly with the “effective rank” $r$ rather than the total number of points $N$.
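Coherence is easy to compute directly from the eigenvectors. The sketch below (an illustration, not the paper's code) uses a circulant, shift-invariant kernel, whose Fourier-type eigenvectors spread their mass almost perfectly evenly across rows, giving coherence near its minimum value of $1$:

```python
import numpy as np

def coherence(K, r):
    """Coherence of the top-r eigenspace U_r of a symmetric matrix:
    mu_r = (n / r) * max_i ||row i of U_r||^2.
    Ranges from 1 (evenly spread) to n / r (concentrated on few rows)."""
    n = K.shape[0]
    _, U = np.linalg.eigh(K)      # eigenvalues in ascending order
    U_r = U[:, -r:]               # top-r eigenvectors
    return (n / r) * np.max(np.sum(U_r ** 2, axis=1))

# Circulant kernel on a ring of 64 points: low coherence by construction
n = 64
d = np.minimum(np.arange(n), n - np.arange(n)).astype(float)
row = np.exp(-0.1 * d ** 2)
K = np.array([np.roll(row, i) for i in range(n)])
mu = coherence(K, r=8)            # close to 1
```

Uniform column sampling is exactly the regime where such a low-coherence matrix is well approximated; a matrix whose top eigenvectors concentrate on a few rows would instead require those rows to be sampled.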
📉 Power-Law Decay: The Sharpness of $O(N/m^{p-1})$
The most impactful part of the paper is the analysis under the power-law decay assumption: $\lambda_i \propto i^{-p}$ for $p > 1$.
🧠 Mathematical Intuition: Subspace Projection
The spectral error is fundamentally related to the “leakage” of the bottom eigenspace into the sampled subspace. Let $K = U \Lambda U^{\top}$ be the spectral decomposition and $S$ the column-sampling matrix. The Nyström error can be bounded by:

$$\|K - \tilde{K}\|_2 \;\le\; \lambda_{r+1}\left(1 + \big\|\Omega_2 \Omega_1^{\dagger}\big\|_2^2\right), \qquad \Omega_1 = U_r^{\top} S, \quad \Omega_2 = U_{r\perp}^{\top} S,$$

where $U_r$ are the top $r$ eigenvectors. If the coherence is small, $\|\Omega_2 \Omega_1^{\dagger}\|_2$ is bounded with high probability, and the error is controlled by the first excluded eigenvalue $\lambda_{r+1}$.
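A quick empirical illustration (not the paper's proof, and all parameters below are arbitrary choices): a Nyström approximation built from $m$ columns has rank at most $m$, so its spectral error can never beat $\lambda_{m+1}$; for a low-coherence kernel with decaying spectrum, uniform sampling also keeps the error far below the leading eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3.0, 3.0, size=(300, 1)), axis=0)
K = np.exp(-np.abs(X - X.T))          # Laplacian kernel: polynomial eigendecay

m = 60
idx = rng.choice(300, size=m, replace=False)
C, W = K[:, idx], K[np.ix_(idx, idx)]
K_hat = C @ np.linalg.pinv(W) @ C.T   # rank <= m Nystrom approximation

lam = np.linalg.eigvalsh(K)[::-1]     # eigenvalues in descending order
spectral_err = np.linalg.norm(K - K_hat, 2)
# spectral_err is at least lam[m] (best possible rank-m error), but far
# below lam[0] thanks to the decaying spectrum
```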
📐 The Integration of the Tail
Unlike the best rank-$m$ approximation (which has spectral error exactly $\lambda_{m+1}$), random sampling is sensitive to the entire tail of the spectrum. The authors show that the total spectral norm error is proportional to the sum of the remaining eigenvalues. By assuming $\lambda_i \propto i^{-p}$ and scaling by $N$ (since $\operatorname{tr}(K) = \sum_i \lambda_i = O(N)$):

$$\sum_{i > m} \lambda_i \;\propto\; N \int_{m}^{\infty} x^{-p}\, dx \;=\; \frac{N}{p-1}\, m^{1-p}.$$

As $N \to \infty$, this yields the sharp rate:

$$\|K - \tilde{K}\|_2 = O\!\left(\frac{N}{m^{p-1}}\right).$$

This demonstrates that for smooth kernels ($p \gg 1$), the convergence is significantly faster than the “worst-case” $O(N/\sqrt{m})$ rate, which implicitly corresponds to $p = 3/2$ (a very flat spectrum).
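The tail-integration step above can be sanity-checked numerically: for $\lambda_i = i^{-p}$, the sum over $i > m$ is well approximated by the integral $m^{1-p}/(p-1)$ (values of $p$ and $m$ below are arbitrary):

```python
import numpy as np

p, m = 2.5, 100
i = np.arange(m + 1, 10**6, dtype=float)   # truncated tail; remainder is negligible
tail_sum = np.sum(i ** -p)                 # sum_{i > m} i^{-p}
integral = m ** (1 - p) / (p - 1)          # integral approximation of the tail
# tail_sum and integral agree to within a couple of percent
```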
🎯 Impact on Kernel Classification
The paper doesn’t stop at matrix approximation; it links the Nyström error directly to the generalization error in kernel classification (e.g., Kernel SVM).
- It proves that as long as $m$ is large enough to resolve the “dominant” subspace of the kernel, the classifier trained on the Nyström-approximated kernel will have essentially the same risk as one trained on the full matrix.
- This provides a theoretical justification for using Nyström in big data regimes where $N$ is in the millions.
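In practice this is done by training a linear model on the Nyström feature map $\Phi = C\, W^{-1/2}$, which satisfies $\Phi \Phi^{\top} = C W^{\dagger} C^{\top}$. The sketch below uses ridge-regularized least squares as a simple stand-in for a kernel SVM; the data, kernel, and regularization are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 2.0, 1.0, -1.0)  # ring-shaped labels

sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq)                            # RBF kernel, gamma = 1

idx = rng.choice(400, size=80, replace=False)
C, W = K[:, idx], K[np.ix_(idx, idx)]
s, U = np.linalg.eigh(W)
keep = s > 1e-10                           # drop numerically null directions
Phi = C @ U[:, keep] / np.sqrt(s[keep])    # N x k Nystrom features

lam = 1e-3                                 # ridge regularization
alpha = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
train_acc = np.mean(np.sign(Phi @ alpha) == y)
```

Training on $\Phi$ costs $O(N m^2)$ instead of the $O(N^3)$ needed for the full kernel, which is where the big-data payoff comes from.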
📊 Summary of Contributions
| Feature | Worst-Case Theory | Jin, Yang et al. (2013) |
|---|---|---|
| Error Rate | $O(N/\sqrt{m})$ | $O(N/m^{p-1})$ |
| Spectral Decay | Ignored (flat) | Exploited (power-law) |
| Coherence | Not considered | Critical requirement |
| Target | Absolute error | Tail-sum error |
📝 Notes
- Ridge Leverage Scores: While this paper focuses on uniform sampling, it laid the groundwork for using Ridge Leverage Score (RLS) sampling, which provides similar relative-error bounds without requiring the incoherence assumption.
- Connection to SDEs: In recent years (2024-2025), these spectral bounds have been used to analyze the stability of “Score-based” models, where the kernel represents the interaction of the score field across data points.
🔗 See Also
- on Sylvester’s determinantal identity and Schweinsian expansion --- Sylvester’s identity provides the algebraic foundation for the determinantal ratios and recursive updates used in spectral approximation.
- on eigenvalue estimate of kernel --- Provides the decay rates for the eigenvalues of integral operators which are approximated by the Nyström method.