Note

This post reviews the landmark paper “Improved Bounds for the Nyström Method with Application to Kernel Classification” by Jin et al. (2013). It marks a transition from pessimistic worst-case analysis to sharper, data-dependent spectral bounds.

🏷️ Introduction: The Nyström Method

The Nyström method is a popular technique for approximating a large $N \times N$ positive semi-definite (PSD) kernel matrix $K$ by sampling a subset of $m$ columns. If we denote the $N \times m$ matrix of sampled columns as $C$ and the $m \times m$ intersection matrix (the kernel restricted to the sampled points) as $W$, the Nyström approximation is:

$$\tilde{K} \;=\; C\, W^{\dagger} C^{\top},$$

where $W^{\dagger}$ is the Moore–Penrose pseudoinverse of $W$.

While empirically successful, early theoretical bounds were often too loose to explain its performance in practice.
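To make the approximation concrete, here is a minimal numpy sketch of the plain Nyström method with uniform column sampling. The RBF kernel, the sample size, and all function names are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)

def nystrom_approximation(K, m, rng):
    """Approximate a PSD matrix K by C W^+ C^T using m uniformly sampled columns."""
    N = K.shape[0]
    idx = rng.choice(N, size=m, replace=False)   # uniform sampling without replacement
    C = K[:, idx]                                # N x m block of sampled columns
    W = K[np.ix_(idx, idx)]                      # m x m intersection matrix
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
K = rbf_kernel(X, X)
K_tilde = nystrom_approximation(K, m=50, rng=rng)
print("spectral-norm error:", np.linalg.norm(K - K_tilde, ord=2))
```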

In particular, in the limit $N \to \infty$, the rescaled PSD kernel matrix $\frac{1}{N}K$ converges to the continuous kernel operator, with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots$ and eigenfunctions $\phi_1, \phi_2, \dots$. Therefore, the projection of the kernel onto a rank-$r$ kernel will be the product

$$\kappa_r(x, x') \;=\; \sum_{i=1}^{r} \lambda_i\, \phi_i(x)\, \phi_i(x'),$$

where the eigenfunctions $\phi_i$ are orthonormal. In this limit the intersection matrix becomes the diagonal matrix of eigenvalues, and the approximation error clearly depends on the remaining eigenvalues $\lambda_{r+1}, \lambda_{r+2}, \dots$.

🌊 Beyond the Bottleneck

Previous “worst-case” theories bounded the spectral-norm error at a rate of $O(N/\sqrt{m})$, where $m$ is the number of sampled columns. In high-dimensional machine learning, this would imply that a prohibitively large number of samples is required to reach high accuracy.

The contribution of Jin et al. was to move beyond this “pessimistic” view by incorporating two key properties of real-world kernel matrices: Coherence and Spectral Decay.

🔍 The Role of Coherence

Coherence measures how “spread out” the information of the top eigenvectors is across the rows. For the $N \times r$ matrix $U_r$ of top eigenvectors, a common formalization is $\mu(U_r) = \frac{N}{r}\max_{1 \le i \le N}\|(U_r)_{i,:}\|_2^2$, which ranges from $1$ (information spread evenly across all rows) to $N/r$ (information concentrated on a few rows).

  • If a matrix has low coherence, its information is distributed evenly, and uniform sampling is highly likely to capture the important features of the kernel.
  • The authors proved that for low-coherence matrices, the number of samples required for a stable approximation scales linearly with the “effective rank” $r$ (up to logarithmic factors) rather than with the total number of points $N$ (a minimal coherence-estimation sketch follows this list).
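To make the assumption tangible, here is a small numpy sketch that estimates the coherence of the top-$r$ eigenspace using the measure $\mu(U_r) = \frac{N}{r}\max_i \|(U_r)_{i,:}\|_2^2$ given above (the paper's exact coherence measure may differ in constants); the kernel and the sizes are illustrative.

```python
import numpy as np

def coherence(K, r):
    """Coherence of the top-r eigenspace of a symmetric PSD matrix K."""
    N = K.shape[0]
    _, U = np.linalg.eigh(K)               # eigenvalues ascending; columns of U are eigenvectors
    U_r = U[:, -r:]                        # N x r matrix of the top-r eigenvectors
    row_norms_sq = np.sum(U_r**2, axis=1)  # per-row "leverage" in the top eigenspace
    return (N / r) * np.max(row_norms_sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)                      # RBF kernel on well-spread data
print("coherence of the top-20 eigenspace:", coherence(K, r=20))
```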

📉 Power-Law Decay: The Sharpness of $O(N/m^{p-1})$

The most impactful part of the paper is the analysis under the power-law decay assumption: $\lambda_i \propto i^{-p}$ for some $p > 1$.
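One way to eyeball this assumption on a concrete kernel is to fit a line to the eigenvalue spectrum on a log-log scale; the negative slope gives a rough estimate of $p$. A minimal numpy sketch, with an illustrative RBF kernel and an arbitrary index window for the fit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)

eigvals = np.linalg.eigvalsh(K)[::-1]      # eigenvalues in descending order
i = np.arange(1, len(eigvals) + 1)

# Fit log(lambda_i) ~ a - p * log(i) over a mid-range of indices,
# excluding the very top and the numerically negligible tail.
mask = (i >= 5) & (i <= 100) & (eigvals > 1e-12)
slope, _ = np.polyfit(np.log(i[mask]), np.log(eigvals[mask]), deg=1)
print("estimated decay exponent p:", -slope)
```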

🧠 Mathematical Intuition: Subspace Projection

The spectral error is fundamentally related to the “leakage” of the bottom eigenspace into the sampled subspace. Let $K = U \Lambda U^{\top}$ be the spectral decomposition. With $m \gtrsim \mu\, r \log r$ uniformly sampled columns, the Nyström error can be bounded, with high probability, by a term of the form

$$\|K - \tilde{K}\|_2 \;\lesssim\; \lambda_{r+1}\left(1 + \frac{N}{m}\right),$$

where $U_r$ (the first $r$ columns of $U$) are the top eigenvectors and $\mu = \mu(U_r)$ is their coherence. If the coherence is small, the error is controlled by the first excluded eigenvalue $\lambda_{r+1}$.
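A quick, self-contained numerical check of this intuition (same illustrative RBF setup as before): build the Nyström approximation from $m$ uniform columns and compare the observed spectral error with the first excluded eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)                       # 500 x 500 RBF kernel matrix

m = 50
idx = rng.choice(K.shape[0], size=m, replace=False)
K_tilde = K[:, idx] @ np.linalg.pinv(K[np.ix_(idx, idx)]) @ K[idx, :]

eigvals = np.linalg.eigvalsh(K)[::-1]       # eigenvalues in descending order
print("Nystrom spectral error:   ", np.linalg.norm(K - K_tilde, ord=2))
print("first excluded eigenvalue:", eigvals[m])
```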

📐 The Integration of the Tail

Unlike the best rank-$r$ approximation (whose spectral error is exactly $\lambda_{r+1}$), random sampling is sensitive to the entire tail of the spectrum. The authors show that the total spectral-norm error is proportional to the sum of the remaining eigenvalues. By assuming $\lambda_i \propto i^{-p}$ and scaling by $N$ (since the eigenvalues of the $N \times N$ matrix are roughly $N$ times those of the limiting operator):

$$\sum_{i=m+1}^{N} N \lambda_i \;\approx\; N \int_{m}^{\infty} x^{-p}\, dx \;=\; \frac{N}{(p-1)\, m^{\,p-1}}.$$

As $m$ grows, this yields the sharp rate:

$$\|K - \tilde{K}\|_2 \;=\; O\!\left(\frac{N}{m^{\,p-1}}\right).$$

This demonstrates that for smooth kernels (large $p$), the convergence is significantly faster than the “worst-case” $O(N/\sqrt{m})$ rate, which effectively corresponds to $p = 3/2$ (a very flat spectrum).
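A back-of-the-envelope comparison, ignoring constants and using made-up numbers, shows how large the gap between the two rates can be:

```python
# Ignoring constants: worst case  N / sqrt(m)     <= E  =>  m >= (N / E) ** 2
#                     power law   N / m**(p - 1)  <= E  =>  m >= (N / E) ** (1 / (p - 1))
N, E, p = 1e6, 1e3, 2.5
m_worst = (N / E) ** 2
m_power = (N / E) ** (1 / (p - 1))
print(f"worst-case rate needs m >= {m_worst:,.0f} columns")  # ~1,000,000 (the whole dataset)
print(f"power-law rate needs  m >= {m_power:,.0f} columns")  # ~100
```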

🎯 Impact on Kernel Classification

The paper doesn’t stop at matrix approximation; it links the Nyström error directly to the generalization error in kernel classification (e.g., Kernel SVM).

  • It proves that as long as $m$ is large enough to resolve the “dominant” subspace of the kernel, the classifier trained on the Nyström-approximated kernel will have essentially the same risk as one trained on the full matrix.
  • This provides a theoretical justification for using Nyström in big data regimes where $N$ is in the millions (a minimal scikit-learn sketch follows this list).
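As a practical illustration (not the paper's experimental protocol), here is a minimal scikit-learn sketch that trains a linear classifier on Nyström features and compares it with an exact kernel SVM; the synthetic dataset, kernel bandwidth, and number of components are placeholder choices.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Nystrom feature map built from m = 200 sampled points, followed by a linear SVM.
feature_map = Nystroem(kernel="rbf", gamma=0.1, n_components=200, random_state=0)
Z_tr = feature_map.fit_transform(X_tr)
Z_te = feature_map.transform(X_te)
approx_clf = LinearSVC(C=1.0, max_iter=10_000).fit(Z_tr, y_tr)

# Exact RBF kernel SVM for comparison (only feasible at this small scale).
exact_clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X_tr, y_tr)

print("Nystrom + linear SVM accuracy:", approx_clf.score(Z_te, y_te))
print("Exact kernel SVM accuracy:    ", exact_clf.score(X_te, y_te))
```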

📊 Summary of Contributions

| Feature | Worst-Case Theory | Jin et al. (2013) |
| --- | --- | --- |
| Error Rate | $O(N/\sqrt{m})$ | $O(N/m^{p-1})$ |
| Spectral Decay | Ignored (flat) | Exploited (power-law) |
| Coherence | Not considered | Critical requirement |
| Target | Absolute error | Tail-sum error |

📝 Notes

  • Ridge Leverage Scores: While this paper focuses on uniform sampling, it laid the groundwork for using Ridge Leverage Score (RLS) sampling, which provides similar relative-error bounds without requiring the incoherence assumption (a minimal sketch of these scores follows this list).
  • Connection to SDEs: In recent years (2024-2025), these spectral bounds have been used to analyze the stability of “Score-based” models, where the kernel represents the interaction of the score field across data points.
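For reference, a minimal sketch of the ridge leverage scores $\ell_i(\lambda) = \big(K (K + \lambda I)^{-1}\big)_{ii}$ mentioned in the first note; the regularization value and the kernel are illustrative.

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    """Ridge leverage scores l_i(lam) = (K @ inv(K + lam * I))_{ii} of a PSD matrix K."""
    N = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + lam * np.eye(N)))

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)

scores = ridge_leverage_scores(K, lam=1e-2)
probs = scores / scores.sum()
# Sample columns proportionally to their ridge leverage scores instead of uniformly.
idx = rng.choice(len(probs), size=40, replace=False, p=probs)
print("sampled column indices:", idx[:10])
```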
