🏷️Introduction

Given a positive definite kernel function $K(x,y)$, it is natural to sample some points $\{x_i\}_{i=1}^{N}$ (so that they create a matrix $A$ with entries $K(x_i,x_j)$) as an approximation to the kernel. It is a classical topic to estimate the eigenvalues of $A$ and compare them with the eigenvalues of the integral operator associated with $K$.

🌊Weyl’s inequality

The simplest idea to quantify the difference is probably Weyl’s inequality for self-adjoint compact operators.

Estimate by Weyl's inequality

Let $\Omega = [0,1]^d$ be the unit cube and $K$ a positive definite kernel on $\Omega \times \Omega$. The matrix $A$ has entries $A_{jk} = K(x_j, x_k)$ for $\{x_j\}_{j=1}^{N}$ the regular “lattice points” with $n$ points along each axis, so $N = n^d$. Identifying $\tfrac{1}{N}A$ with the corresponding Nyström-type operator, Weyl's inequality for self-adjoint compact operators,

$$\bigl|\lambda_k(S) - \lambda_k(T)\bigr| \;\le\; \|S - T\|,$$

controls the distance between the eigenvalues of $\tfrac{1}{N}A$ and those of the integral operator of $K$, where $\lambda_k$ denotes the $k$-th eigenvalue in descending order.
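A minimal numerical sketch of this comparison, under the assumptions that $d = 1$, the kernel is $K(x,y) = \min(x,y)$ (whose integral operator on $[0,1]$ has the explicit eigenvalues $((k-\tfrac12)\pi)^{-2}$), and the lattice is the midpoint grid; both choices are for illustration only:

```python
import numpy as np

# Minimal sketch: d = 1, K(x, y) = min(x, y) on [0, 1] (an assumption),
# whose integral operator has eigenvalues 1 / ((k - 1/2) * pi)^2.
n = 200                                     # lattice points along the single axis
x = (np.arange(n) + 0.5) / n                # midpoint lattice in [0, 1] (assumed grid)
A = np.minimum.outer(x, x)                  # kernel matrix A_{jk} = K(x_j, x_k)

lam_discrete = np.linalg.eigvalsh(A / n)[::-1]        # eigenvalues of A / n, descending (N = n since d = 1)
ks = np.arange(1, 11)
lam_continuous = 1.0 / (((ks - 0.5) * np.pi) ** 2)    # exact operator eigenvalues

# Weyl-type comparison: the gap shrinks as n grows.
print(np.abs(lam_discrete[:10] - lam_continuous))
```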

🌵Ostrowski Theorem

Intuitively, an insufficient number of samples should mainly affect the smaller eigenvalues (whose eigenfunctions are more oscillatory); thus it is natural to isolate the “safer” leading eigenfunctions.

Let the kernel be expanded into its eigenfunctions (Mercer expansion):

$$K(x,y) \;=\; \sum_{\ell \ge 1} \mu_\ell\, \phi_\ell(x)\, \phi_\ell(y), \qquad \mu_1 \ge \mu_2 \ge \dots > 0.$$

And we pick a truncation level $m$. Then the matrix $A$ can be split into two parts:

$$A \;=\; A^{(\le m)} + A^{(> m)}, \qquad A^{(\le m)}_{jk} \;=\; \sum_{\ell \le m} \mu_\ell\, \phi_\ell(x_j)\, \phi_\ell(x_k), \quad A^{(> m)}_{jk} \;=\; \sum_{\ell > m} \mu_\ell\, \phi_\ell(x_j)\, \phi_\ell(x_k).$$

The first part should be less affected by the sampling; thus we should expect the correlation matrix of the sampled leading eigenfunctions to stay close to the identity,

$$\frac{1}{N}\sum_{i=1}^{N} \phi_k(x_i)\,\phi_\ell(x_i) \;\approx\; \delta_{k\ell}, \qquad k,\ell \le m.$$
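Continuing the $\min(x,y)$ sketch (its Mercer pair $\mu_\ell = ((\ell-\tfrac12)\pi)^{-2}$, $\phi_\ell(x) = \sqrt{2}\sin((\ell-\tfrac12)\pi x)$ is an assumption of the illustration, here with random samples), the split and the resulting eigenvalue proximity can be checked directly:

```python
import numpy as np

# Illustrative split of the sample matrix for K(x, y) = min(x, y) on [0, 1]
# (assumed Mercer pair: mu_l = 1 / ((l - 1/2) pi)^2, phi_l(x) = sqrt(2) sin((l - 1/2) pi x)).
rng = np.random.default_rng(0)
N, m = 500, 10
x = rng.uniform(size=N)

A = np.minimum.outer(x, x)                                  # full sample matrix
ls = np.arange(1, m + 1)
mu = 1.0 / (((ls - 0.5) * np.pi) ** 2)
Phi = np.sqrt(2) * np.sin(np.outer(x, (ls - 0.5) * np.pi))  # Phi[j, l] = phi_l(x_j)
A_lead = Phi @ np.diag(mu) @ Phi.T                          # A^{(<= m)}
A_tail = A - A_lead                                         # A^{(> m)}

lam_full = np.linalg.eigvalsh(A / N)[::-1][:m]
lam_lead = np.linalg.eigvalsh(A_lead / N)[::-1][:m]
# Weyl: each of the top-m eigenvalues moves by at most ||A^{(> m)}|| / N.
print(np.abs(lam_full - lam_lead).max(), np.linalg.norm(A_tail, 2) / N)
```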
The common tool we use is Ostrowski’s theorem (Bhatia, 2013).

Ostrowski theorem

If $A$ is Hermitian and $S$ is nonsingular, then

$$\lambda_k\bigl(S A S^{*}\bigr) \;=\; \theta_k\, \lambda_k(A),$$

where $\theta_k \in \bigl[\lambda_{\min}(S S^{*}),\, \lambda_{\max}(S S^{*})\bigr]$ and the eigenvalues on both sides are ordered consistently.
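A quick numerical sanity check of the statement (random positive definite $A$ and random nonsingular $S$; purely illustrative):

```python
import numpy as np

# Numerical check of Ostrowski's theorem with a positive definite A.
rng = np.random.default_rng(0)
m = 6
B = rng.standard_normal((m, m))
A = B @ B.T + np.eye(m)                     # Hermitian, here positive definite
S = rng.standard_normal((m, m))             # nonsingular with probability 1

lam_A   = np.linalg.eigvalsh(A)[::-1]       # descending order
lam_SAS = np.linalg.eigvalsh(S @ A @ S.T)[::-1]
lam_SS  = np.linalg.eigvalsh(S @ S.T)

theta = lam_SAS / lam_A                     # multipliers theta_k
print(lam_SS.min() <= theta.min(), theta.max() <= lam_SS.max())
```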

Write the truncated part as $A^{(\le m)} = \Phi\,\Lambda_m\,\Phi^{\top}$, where $\Phi \in \mathbb{R}^{N\times m}$ collects the sampled eigenfunctions, $\Phi_{j\ell} = \phi_\ell(x_j)$, and $\Lambda_m = \operatorname{diag}(\mu_1,\dots,\mu_m)$. Let the Hermitian matrix in the theorem be $\Lambda_m$ and $S = \bigl(\tfrac{1}{N}\Phi^{\top}\Phi\bigr)^{1/2}$ (assuming $\Phi$ has full column rank, so that $S\Lambda_m S^{*}$ shares its nonzero eigenvalues with $\tfrac{1}{N}A^{(\le m)}$); then the theorem implies, for $k \le m$,

$$\lambda_k\!\left(\tfrac{1}{N}A^{(\le m)}\right) \;=\; \theta_k\,\mu_k, \qquad \lambda_{\min}\!\left(\tfrac{1}{N}\Phi^{\top}\Phi\right) \;\le\; \theta_k \;\le\; \lambda_{\max}\!\left(\tfrac{1}{N}\Phi^{\top}\Phi\right).$$

Then, we obtain the bound (Ostrowski theorem and Weyl’s theorem, respectively), for $k \le m$,

$$\Bigl|\lambda_k\bigl(\tfrac{1}{N}A\bigr) - \mu_k\Bigr| \;\le\; |\theta_k - 1|\,\mu_k \;+\; \tfrac{1}{N}\bigl\|A^{(>m)}\bigr\|.$$

🌀Conservative estimates

For the remainder term $\tfrac{1}{N}\bigl\|A^{(>m)}\bigr\|$, a naive bound of this second term is $O\bigl(\sum_{\ell>m}\mu_\ell\bigr)$ if the eigenfunctions are uniformly bounded; otherwise the growth of $\|\phi_\ell\|_\infty$ needs to be considered. Of course, this bound is quite conservative. A few possible directions are viable.

  • The sampled eigenfunction vector $\bigl(\phi_\ell(x_1),\dots,\phi_\ell(x_N)\bigr)$ can be viewed as a random vector for large $N$ (with i.i.d. samples); then the central limit theorem can give a more aggressive estimate, see related work (Chun, Chung & Lee, 2025; Kong & Valiant, 2017).
  • If $\phi_\ell$ is close to a standard Fourier mode (up to a fixed phase shift), then on an equispaced lattice of $\Omega$ we can still expect (approximate) orthogonality between the discrete vectors $\bigl(\phi_\ell(x_1),\dots,\phi_\ell(x_N)\bigr)$; see the sketch after this list.
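A small check of this discrete orthogonality, with $\phi_\ell(x) = \sqrt{2}\cos(2\pi\ell x + \delta)$ standing in (as an assumption) for an eigenfunction close to a Fourier mode with a fixed phase shift $\delta$:

```python
import numpy as np

# Discrete (near-)orthogonality of Fourier-like modes on an equispaced grid of [0, 1].
# phi_l(x) = sqrt(2) cos(2 pi l x + delta) is a stand-in for an eigenfunction
# close to a Fourier mode with a fixed phase shift delta (illustrative assumption).
N, delta = 64, 0.3
x = np.arange(N) / N                                     # equispaced lattice
ls = np.arange(1, 9)
Phi = np.sqrt(2) * np.cos(2 * np.pi * np.outer(x, ls) + delta)

G = Phi.T @ Phi / N                                      # (1/N) Phi^T Phi
print(np.round(G, 6))                                    # essentially the identity for l < N/2
```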

Random vector perspective

Consider $\bigl(\phi_\ell(X_1),\dots,\phi_\ell(X_N)\bigr)$ as a random vector with respect to i.i.d. random variables $X_1,\dots,X_N$ (uniform on $\Omega$). We set the truncated kernel (still positive definite) $K_m(x,y) = \sum_{\ell\le m}\mu_\ell\,\phi_\ell(x)\,\phi_\ell(y)$ with sample matrix $A_m$ (which is exactly $A^{(\le m)}$ above); then we can use the spectrum-estimator idea and take $\tfrac{1}{N^{p}}\operatorname{tr}\bigl(A_m^{\,p}\bigr)$ as a moment estimator (biased) of the spectral moment $\sum_{\ell\le m}\mu_\ell^{\,p}$.

For instance, for $p = 2$,

$$\mathbb{E}\!\left[\frac{1}{N^{2}}\operatorname{tr}\bigl(A_m^{2}\bigr)\right] \;=\; \frac{N-1}{N}\sum_{\ell\le m}\mu_\ell^{2} \;+\; \frac{1}{N}\int_{\Omega} K_m(x,x)^{2}\,dx.$$

The last term is bounded by $\tfrac{1}{N}\bigl(\sum_{\ell\le m}\mu_\ell\,\|\phi_\ell\|_\infty^{2}\bigr)^{2}$. Once the decay rate of $\mu_\ell$ is prescribed, we can expect an explicit $O(1/N)$ bound (with an $m$-dependent constant) for this bias.

The variance can be estimated similarly. We only need to control the fourth moment of the truncated kernel, e.g. $\iint_{\Omega\times\Omega} K_m(x,y)^{4}\,dx\,dy$ and its diagonal analogue.

Therefore, the variance is bounded at the same $O(1/N)$ rate, and the estimator $\tfrac{1}{N^{2}}\operatorname{tr}\bigl(A_m^{2}\bigr)$ concentrates around the truncated spectral moment $\sum_{\ell\le m}\mu_\ell^{2}$.
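A Monte Carlo sketch of this estimator, again with the assumed $\min(x,y)$ eigen-pairs serving as the truncated kernel:

```python
import numpy as np

# Biased second-moment estimator tr(A_m^2) / N^2, using the assumed Mercer pairs
# of K(x, y) = min(x, y) as the truncated kernel (illustration only).
rng = np.random.default_rng(1)
N, m = 2000, 20
ls = np.arange(1, m + 1)
mu = 1.0 / (((ls - 0.5) * np.pi) ** 2)

x = rng.uniform(size=N)                                  # i.i.d. uniform samples
Phi = np.sqrt(2) * np.sin(np.outer(x, (ls - 0.5) * np.pi))
A_m = Phi @ np.diag(mu) @ Phi.T                          # truncated sample matrix

estimate = np.sum(A_m ** 2) / N**2                       # tr(A_m^2) = squared Frobenius norm
target = np.sum(mu ** 2)                                 # truncated spectral moment
print(estimate, target)                                  # close for large N: bias O(1/N), fluctuation O(1/sqrt(N))
```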

Extension to higher moments

When a higher moment $p$ is involved, we should still expect the same structure as the upper bound: the truncated spectral moment $\sum_{\ell\le m}\mu_\ell^{\,p}$ plus an $O(1/N)$ bias. As we can see later, this is probably the best we can do for the condition number estimate (since the first term dominates for moderate $m$ and large $N$).
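As a sketch of the bookkeeping behind this, under the same assumptions as above (uniform sampling on $\Omega$ and eigenfunctions uniformly bounded by a constant $C$), only the index tuples with a repeated index contribute to the bias:

$$\mathbb{E}\!\left[\frac{1}{N^{p}}\operatorname{tr}\bigl(A_m^{\,p}\bigr)\right] \;=\; \frac{N(N-1)\cdots(N-p+1)}{N^{p}}\sum_{\ell\le m}\mu_\ell^{\,p} \;+\; O\!\left(\frac{1}{N}\Bigl(\sum_{\ell\le m}\mu_\ell\,C^{2}\Bigr)^{p}\right),$$

since each tuple with all indices distinct contributes exactly $\sum_{\ell\le m}\mu_\ell^{\,p}$ in expectation, while the remaining $O(N^{p-1})$ tuples are each bounded by $\sup_{x,y}|K_m(x,y)|^{p} \le \bigl(\sum_{\ell\le m}\mu_\ell C^{2}\bigr)^{p}$ (the implied constants depend on $p$).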

〰️Concentration argument

The first term (the multiplier deviation $|\theta_k - 1|$, controlled by $\bigl\|\tfrac{1}{N}\Phi^{\top}\Phi - I_m\bigr\|$) can be quantified by viewing each entry $\tfrac{1}{N}\sum_{i}\phi_k(x_i)\,\phi_\ell(x_i)$ as a certain quadrature rule, or as ready for Hoeffding’s inequality. We should expect a stochastic bound of order $\sqrt{\log(1/\delta)/N}$ for each entry with probability $1-\delta$; therefore, with probability $1 - m^{2}\delta$,

$$\Bigl\|\tfrac{1}{N}\Phi^{\top}\Phi - I_m\Bigr\| \;\lesssim\; m\,\sqrt{\frac{\log(1/\delta)}{N}}.$$
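An empirical check of this entrywise concentration, with the same assumed eigenfunctions $\phi_\ell(x) = \sqrt{2}\sin((\ell-\tfrac12)\pi x)$ and i.i.d. uniform samples; the maximal entry deviation scales like $1/\sqrt{N}$:

```python
import numpy as np

# Entrywise concentration of (1/N) Phi^T Phi around the identity, with the
# assumed eigenfunctions phi_l(x) = sqrt(2) sin((l - 1/2) pi x) and i.i.d. samples.
rng = np.random.default_rng(2)
m = 10
ls = np.arange(1, m + 1)
for N in (250, 1000, 4000):
    x = rng.uniform(size=N)
    Phi = np.sqrt(2) * np.sin(np.outer(x, (ls - 0.5) * np.pi))
    dev = np.abs(Phi.T @ Phi / N - np.eye(m)).max()
    print(N, dev, dev * np.sqrt(N))          # last column stays O(1)
```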

The final bound will be a compromise between these two terms (in high probability). Under the assumption that the eigenvalues decay polynomially, say $\mu_\ell \simeq \ell^{-\alpha}$, the two error terms can be balanced in $m$.

Then, we should truncate at the largest $m$ for which the combined error is still dominated by $\mu_m$ itself, that is, for which the concentration term $m\sqrt{\log(1/\delta)/N}$ does not exceed a constant and the remainder term does not exceed (a constant multiple of) $\mu_m$. The condition number of the sample matrix can then be estimated from below.

Since the top eigenvalue of $\tfrac{1}{N}A$ concentrates near $\mu_1$, and the $m$-th eigenvalue is certified to be of order $\mu_m$, we obtain $\kappa\bigl(\tfrac{1}{N}A\bigr) \gtrsim \mu_1/\mu_m$ in the high-probability sense.
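Under the polynomial-decay assumption above, this lower bound becomes explicit in the truncation level:

$$\kappa\!\left(\tfrac{1}{N}A\right) \;\gtrsim\; \frac{\mu_1}{\mu_m} \;\simeq\; m^{\alpha},$$

so whichever power of $N$ the admissible truncation $m$ reaches translates directly into a power-of-$N$ lower bound on the condition number.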

🔦 Examples

Two-layer ReLU neural network with random sampling

Consider the NTK regime of a two-layer ReLU network (as considered in the literature on two-layer ReLU networks). The kernel admits a polynomial eigenvalue decay rate, so we can expect the condition number to be bounded below by a corresponding power of $N$ for a total number of $N$ samples.

Two-layer ReLU neural network with equispaced sampling

In this case, the estimate can be more aggressive, since the eigenfunctions are almost “Fourier” modes: the equispaced lattice points still sustain the orthogonality relation for the sampled eigenfunctions, so the Ostrowski multipliers $\theta_k$ stay essentially equal to $1$ without the concentration penalty. In this scenario, we can expect a sharper lower bound on the condition number; a stand-in comparison is sketched below.
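Since the exact NTK eigen-decomposition is not reproduced here, the following stand-in sketch (Fourier modes with an assumed polynomial decay $\mu_\ell = \ell^{-2}$ as a proxy for the NTK spectrum) only illustrates the random-vs-equispaced comparison made in the two examples above:

```python
import numpy as np

# Stand-in experiment for the two examples above: Fourier modes with an assumed
# decay mu_l = l^{-2} (a proxy for the NTK spectrum, not the exact NTK), comparing
# the condition number of the truncated sample matrix for random vs equispaced points.
rng = np.random.default_rng(4)
N, m = 512, 40
ls = np.arange(1, m + 1)
mu = ls ** -2.0

def sample_matrix(t):
    # Phi[j, l] stacks the sampled modes sqrt(2) cos(2 pi l t_j) for l = 1..m.
    Phi = np.sqrt(2) * np.cos(2 * np.pi * np.outer(t, ls))
    return (Phi * mu) @ Phi.T / N            # (1/N) Phi diag(mu) Phi^T

for name, t in [("random", rng.uniform(size=N)),
                ("equispaced", np.arange(N) / N)]:
    lam = np.linalg.eigvalsh(sample_matrix(t))[::-1][:m]   # top-m eigenvalues
    print(name, lam[0] / lam[m - 1], "vs mu_1 / mu_m =", mu[0] / mu[-1])
```

On the equispaced grid the nonzero eigenvalues reproduce $\mu_\ell$ exactly (the sampled modes are exactly orthogonal), while random sampling introduces the $O(1/\sqrt{N})$ fluctuations discussed above.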

References

🐻  Bhatia, R. 2013. Matrix Analysis. Springer Science & Business Media.
🐻  Chun, C., Chung, S. & Lee, D.D. 2025. Estimating the Spectral Moments of the Kernel Integral Operator from Finite Sample Matrices.
🐻  Kong, W. & Valiant, G. 2017. Spectrum Estimation from Samples.