Note

This post reviews the landmark paper “Improved Bounds for the Nyström Method with Application to Kernel Classification” by Jin et al. (2013). It marks a transition from pessimistic worst-case analysis to sharper, data-dependent spectral bounds.

🏷️ Introduction: The Nyström Method

The Nyström method is a popular technique for approximating a large $N \times N$ positive semi-definite (PSD) kernel matrix $K$ by sampling a subset of $m$ columns. If we denote the $N \times m$ matrix of sampled columns as $C$ and the $m \times m$ intersection matrix (the kernel restricted to the sampled points) as $W$, the Nyström approximation is:

$$\tilde{K} \;=\; C\, W^{\dagger} C^{\top},$$

where $W^{\dagger}$ is the Moore–Penrose pseudoinverse of $W$.

While empirically successful, early theoretical bounds were often too loose to explain its performance in practice.
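To make the approximation concrete, here is a minimal numpy sketch of the plain Nyström method with uniform column sampling. The RBF kernel, the sample size, and all function names are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)

def nystrom_approximation(K, m, rng):
    """Approximate a PSD matrix K by C W^+ C^T using m uniformly sampled columns."""
    N = K.shape[0]
    idx = rng.choice(N, size=m, replace=False)   # uniform sampling without replacement
    C = K[:, idx]                                # N x m block of sampled columns
    W = K[np.ix_(idx, idx)]                      # m x m intersection matrix
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
K = rbf_kernel(X, X)
K_tilde = nystrom_approximation(K, m=50, rng=rng)
print("spectral-norm error:", np.linalg.norm(K - K_tilde, ord=2))
```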

In particular, in the limit $N \to \infty$, the rescaled PSD kernel matrix $\frac{1}{N}K$ converges to the continuous kernel operator, with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots$ and eigenfunctions $\phi_1, \phi_2, \dots$. Therefore, the projection of the kernel onto a rank-$r$ kernel will be the product

$$\kappa_r(x, x') \;=\; \sum_{i=1}^{r} \lambda_i\, \phi_i(x)\, \phi_i(x'),$$

where the eigenfunctions $\phi_i$ are orthonormal. In this limit the intersection matrix becomes the diagonal matrix of eigenvalues, and the approximation error clearly depends on the remaining eigenvalues $\lambda_{r+1}, \lambda_{r+2}, \dots$.

🌊 Beyond the Bottleneck

Previous “worst-case” theories bounded the spectral-norm error at a rate of $O(N/\sqrt{m})$, where $m$ is the number of sampled columns. In high-dimensional machine learning, this would imply that a prohibitively large number of samples is required to reach high accuracy.

The contribution of Jin et al. was to move beyond this “pessimistic” view by incorporating two key properties of real-world kernel matrices: Coherence and Spectral Decay.

🔍 The Role of Coherence

Coherence measures how “spread out” the information of the top eigenvectors is across the rows. For the $N \times r$ matrix $U_r$ of top eigenvectors, a common formalization is $\mu(U_r) = \frac{N}{r}\max_{1 \le i \le N}\|(U_r)_{i,:}\|_2^2$, which ranges from $1$ (information spread evenly across all rows) to $N/r$ (information concentrated on a few rows).

  • If a matrix has low coherence, its information is distributed evenly, and uniform sampling is highly likely to capture the important features of the kernel.
  • The authors proved that for low-coherence matrices, the number of samples required for a stable approximation scales linearly with the “effective rank” $r$ (up to logarithmic factors) rather than with the total number of points $N$ (a minimal coherence-estimation sketch follows this list).
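To make the assumption tangible, here is a small numpy sketch that estimates the coherence of the top-$r$ eigenspace using the measure $\mu(U_r) = \frac{N}{r}\max_i \|(U_r)_{i,:}\|_2^2$ given above (the paper's exact coherence measure may differ in constants); the kernel and the sizes are illustrative.

```python
import numpy as np

def coherence(K, r):
    """Coherence of the top-r eigenspace of a symmetric PSD matrix K."""
    N = K.shape[0]
    _, U = np.linalg.eigh(K)               # eigenvalues ascending; columns of U are eigenvectors
    U_r = U[:, -r:]                        # N x r matrix of the top-r eigenvectors
    row_norms_sq = np.sum(U_r**2, axis=1)  # per-row "leverage" in the top eigenspace
    return (N / r) * np.max(row_norms_sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)                      # RBF kernel on well-spread data
print("coherence of the top-20 eigenspace:", coherence(K, r=20))
```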

📉 Power-Law Decay: The Sharpness of $O(N/m^{p-1})$

The most impactful part of the paper is the analysis under the power-law decay assumption: $\lambda_i \propto i^{-p}$ for some $p > 1$.
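One way to eyeball this assumption on a concrete kernel is to fit a line to the eigenvalue spectrum on a log-log scale; the negative slope gives a rough estimate of $p$. A minimal numpy sketch, with an illustrative RBF kernel and an arbitrary index window for the fit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)

eigvals = np.linalg.eigvalsh(K)[::-1]      # eigenvalues in descending order
i = np.arange(1, len(eigvals) + 1)

# Fit log(lambda_i) ~ a - p * log(i) over a mid-range of indices,
# excluding the very top and the numerically negligible tail.
mask = (i >= 5) & (i <= 100) & (eigvals > 1e-12)
slope, _ = np.polyfit(np.log(i[mask]), np.log(eigvals[mask]), deg=1)
print("estimated decay exponent p:", -slope)
```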

🧠 Mathematical Intuition: Subspace Projection

The spectral error is fundamentally related to the “leakage” of the bottom eigenspace into the sampled subspace. Let $K = U \Lambda U^{\top}$ be the spectral decomposition. With $m \gtrsim \mu\, r \log r$ uniformly sampled columns, the Nyström error can be bounded, with high probability, by a term of the form

$$\|K - \tilde{K}\|_2 \;\lesssim\; \lambda_{r+1}\left(1 + \frac{N}{m}\right),$$

where $U_r$ (the first $r$ columns of $U$) are the top eigenvectors and $\mu = \mu(U_r)$ is their coherence. If the coherence is small, the error is controlled by the first excluded eigenvalue $\lambda_{r+1}$.
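A quick, self-contained numerical check of this intuition (same illustrative RBF setup as before): build the Nyström approximation from $m$ uniform columns and compare the observed spectral error with the first excluded eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)                       # 500 x 500 RBF kernel matrix

m = 50
idx = rng.choice(K.shape[0], size=m, replace=False)
K_tilde = K[:, idx] @ np.linalg.pinv(K[np.ix_(idx, idx)]) @ K[idx, :]

eigvals = np.linalg.eigvalsh(K)[::-1]       # eigenvalues in descending order
print("Nystrom spectral error:   ", np.linalg.norm(K - K_tilde, ord=2))
print("first excluded eigenvalue:", eigvals[m])
```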

📐 The Integration of the Tail

Unlike the best rank-$r$ approximation (whose spectral error is exactly $\lambda_{r+1}$), random sampling is sensitive to the entire tail of the spectrum. The authors show that the total spectral-norm error is proportional to the sum of the remaining eigenvalues. By assuming $\lambda_i \propto i^{-p}$ and scaling by $N$ (since the eigenvalues of the $N \times N$ matrix are roughly $N$ times those of the limiting operator):

$$\sum_{i=m+1}^{N} N \lambda_i \;\approx\; N \int_{m}^{\infty} x^{-p}\, dx \;=\; \frac{N}{(p-1)\, m^{\,p-1}}.$$

As $m$ grows, this yields the sharp rate:

$$\|K - \tilde{K}\|_2 \;=\; O\!\left(\frac{N}{m^{\,p-1}}\right).$$

This demonstrates that for smooth kernels (large $p$), the convergence is significantly faster than the “worst-case” $O(N/\sqrt{m})$ rate, which effectively corresponds to $p = 3/2$ (a very flat spectrum).
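A back-of-the-envelope comparison, ignoring constants and using made-up numbers, shows how large the gap between the two rates can be:

```python
# Ignoring constants: worst case  N / sqrt(m)     <= E  =>  m >= (N / E) ** 2
#                     power law   N / m**(p - 1)  <= E  =>  m >= (N / E) ** (1 / (p - 1))
N, E, p = 1e6, 1e3, 2.5
m_worst = (N / E) ** 2
m_power = (N / E) ** (1 / (p - 1))
print(f"worst-case rate needs m >= {m_worst:,.0f} columns")  # ~1,000,000 (the whole dataset)
print(f"power-law rate needs  m >= {m_power:,.0f} columns")  # ~100
```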

🎯 Impact on Kernel Classification

The paper doesn’t stop at matrix approximation; it links the Nyström error directly to the generalization error in kernel classification (e.g., Kernel SVM).

  • It proves that as long as $m$ is large enough to resolve the “dominant” subspace of the kernel, the classifier trained on the Nyström-approximated kernel will have essentially the same risk as one trained on the full matrix.
  • This provides a theoretical justification for using Nyström in big data regimes where $N$ is in the millions (a minimal scikit-learn sketch follows this list).
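As a practical illustration (not the paper's experimental protocol), here is a minimal scikit-learn sketch that trains a linear classifier on Nyström features and compares it with an exact kernel SVM; the synthetic dataset, kernel bandwidth, and number of components are placeholder choices.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Nystrom feature map built from m = 200 sampled points, followed by a linear SVM.
feature_map = Nystroem(kernel="rbf", gamma=0.1, n_components=200, random_state=0)
Z_tr = feature_map.fit_transform(X_tr)
Z_te = feature_map.transform(X_te)
approx_clf = LinearSVC(C=1.0, max_iter=10_000).fit(Z_tr, y_tr)

# Exact RBF kernel SVM for comparison (only feasible at this small scale).
exact_clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X_tr, y_tr)

print("Nystrom + linear SVM accuracy:", approx_clf.score(Z_te, y_te))
print("Exact kernel SVM accuracy:    ", exact_clf.score(X_te, y_te))
```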

📊 Summary of Contributions

| Feature | Worst-Case Theory | Jin et al. (2013) |
| --- | --- | --- |
| Error Rate | $O(N/\sqrt{m})$ | $O(N/m^{p-1})$ |
| Spectral Decay | Ignored (flat) | Exploited (power-law) |
| Coherence | Not considered | Critical requirement |
| Target | Absolute error | Tail-sum error |

📝 Notes

  • Ridge Leverage Scores: While this paper focuses on uniform sampling, it laid the groundwork for using Ridge Leverage Score (RLS) sampling, which provides similar relative-error bounds without requiring the incoherence assumption (a minimal sketch of these scores follows this list).
  • Connection to SDEs: In recent years (2024-2025), these spectral bounds have been used to analyze the stability of “Score-based” models, where the kernel represents the interaction of the score field across data points.
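For reference, a minimal sketch of the ridge leverage scores $\ell_i(\lambda) = \big(K (K + \lambda I)^{-1}\big)_{ii}$ mentioned in the first note; the regularization value and the kernel are illustrative.

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    """Ridge leverage scores l_i(lam) = (K @ inv(K + lam * I))_{ii} of a PSD matrix K."""
    N = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + lam * np.eye(N)))

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)

scores = ridge_leverage_scores(K, lam=1e-2)
probs = scores / scores.sum()
# Sample columns proportionally to their ridge leverage scores instead of uniformly.
idx = rng.choice(len(probs), size=40, replace=False, p=probs)
print("sampled column indices:", idx[:10])
```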
