🏷️Introduction

Given a positive definite kernel function $K(x,y)$, it is natural to sample some points $\{x_i\}_{i=1}^{N}$ (so that they create a matrix $A$ with entries $K(x_i,x_j)$) as an approximation to the kernel. It is a classical topic to estimate the eigenvalues of $A$ and compare them with the eigenvalues of the integral operator associated with $K$.

🌊Weyl’s inequality

The simplest idea to quantify the difference is probably Weyl’s inequality for self-adjoint compact operators.

Estimate by Weyl's inequality

Let $\Omega = [0,1]^d$ be the unit cube and $K$ a positive definite kernel on $\Omega \times \Omega$. The matrix $A$ has entries $A_{jk} = K(x_j, x_k)$ for $\{x_j\}_{j=1}^{N}$ the regular “lattice points” with $n$ points along each axis, so $N = n^d$. Identifying $\tfrac{1}{N}A$ with the corresponding Nyström-type operator, Weyl's inequality for self-adjoint compact operators,

$$\bigl|\lambda_k(S) - \lambda_k(T)\bigr| \;\le\; \|S - T\|,$$

controls the distance between the eigenvalues of $\tfrac{1}{N}A$ and those of the integral operator of $K$, where $\lambda_k$ denotes the $k$-th eigenvalue in descending order.
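A minimal numerical sketch of this comparison, under the assumptions that $d = 1$, the kernel is $K(x,y) = \min(x,y)$ (whose integral operator on $[0,1]$ has the explicit eigenvalues $((k-\tfrac12)\pi)^{-2}$), and the lattice is the midpoint grid; both choices are for illustration only:

```python
import numpy as np

# Minimal sketch: d = 1, K(x, y) = min(x, y) on [0, 1] (an assumption),
# whose integral operator has eigenvalues 1 / ((k - 1/2) * pi)^2.
n = 200                                     # lattice points along the single axis
x = (np.arange(n) + 0.5) / n                # midpoint lattice in [0, 1] (assumed grid)
A = np.minimum.outer(x, x)                  # kernel matrix A_{jk} = K(x_j, x_k)

lam_discrete = np.linalg.eigvalsh(A / n)[::-1]        # eigenvalues of A / n, descending (N = n since d = 1)
ks = np.arange(1, 11)
lam_continuous = 1.0 / (((ks - 0.5) * np.pi) ** 2)    # exact operator eigenvalues

# Weyl-type comparison: the gap shrinks as n grows.
print(np.abs(lam_discrete[:10] - lam_continuous))
```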

🌵Ostrowski Theorem

Intuitively, an insufficient number of samples should mainly affect the smaller eigenvalues (whose eigenfunctions are more oscillatory); thus it is natural to isolate the “safer” leading eigenfunctions.

Let the kernel be expanded into its eigenfunctions (Mercer expansion):

$$K(x,y) \;=\; \sum_{\ell \ge 1} \mu_\ell\, \phi_\ell(x)\, \phi_\ell(y), \qquad \mu_1 \ge \mu_2 \ge \dots > 0.$$

And we pick a truncation level $m$. Then the matrix $A$ can be split into two parts:

$$A \;=\; A^{(\le m)} + A^{(> m)}, \qquad A^{(\le m)}_{jk} \;=\; \sum_{\ell \le m} \mu_\ell\, \phi_\ell(x_j)\, \phi_\ell(x_k), \quad A^{(> m)}_{jk} \;=\; \sum_{\ell > m} \mu_\ell\, \phi_\ell(x_j)\, \phi_\ell(x_k).$$

The first part should be less affected by the sampling; thus we should expect the correlation matrix of the sampled leading eigenfunctions to stay close to the identity,

$$\frac{1}{N}\sum_{i=1}^{N} \phi_k(x_i)\,\phi_\ell(x_i) \;\approx\; \delta_{k\ell}, \qquad k,\ell \le m.$$
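Continuing the $\min(x,y)$ sketch (its Mercer pair $\mu_\ell = ((\ell-\tfrac12)\pi)^{-2}$, $\phi_\ell(x) = \sqrt{2}\sin((\ell-\tfrac12)\pi x)$ is an assumption of the illustration, here with random samples), the split and the resulting eigenvalue proximity can be checked directly:

```python
import numpy as np

# Illustrative split of the sample matrix for K(x, y) = min(x, y) on [0, 1]
# (assumed Mercer pair: mu_l = 1 / ((l - 1/2) pi)^2, phi_l(x) = sqrt(2) sin((l - 1/2) pi x)).
rng = np.random.default_rng(0)
N, m = 500, 10
x = rng.uniform(size=N)

A = np.minimum.outer(x, x)                                  # full sample matrix
ls = np.arange(1, m + 1)
mu = 1.0 / (((ls - 0.5) * np.pi) ** 2)
Phi = np.sqrt(2) * np.sin(np.outer(x, (ls - 0.5) * np.pi))  # Phi[j, l] = phi_l(x_j)
A_lead = Phi @ np.diag(mu) @ Phi.T                          # A^{(<= m)}
A_tail = A - A_lead                                         # A^{(> m)}

lam_full = np.linalg.eigvalsh(A / N)[::-1][:m]
lam_lead = np.linalg.eigvalsh(A_lead / N)[::-1][:m]
# Weyl: each of the top-m eigenvalues moves by at most ||A^{(> m)}|| / N.
print(np.abs(lam_full - lam_lead).max(), np.linalg.norm(A_tail, 2) / N)
```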
The common tool we use is Ostrowski’s theorem (Bhatia, 2013).

Ostrowski theorem

If $A$ is Hermitian and $S$ is nonsingular, then

$$\lambda_k\bigl(S A S^{*}\bigr) \;=\; \theta_k\, \lambda_k(A),$$

where $\theta_k \in \bigl[\lambda_{\min}(S S^{*}),\, \lambda_{\max}(S S^{*})\bigr]$ and the eigenvalues on both sides are ordered consistently.
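A quick numerical sanity check of the statement (random positive definite $A$ and random nonsingular $S$; purely illustrative):

```python
import numpy as np

# Numerical check of Ostrowski's theorem with a positive definite A.
rng = np.random.default_rng(0)
m = 6
B = rng.standard_normal((m, m))
A = B @ B.T + np.eye(m)                     # Hermitian, here positive definite
S = rng.standard_normal((m, m))             # nonsingular with probability 1

lam_A   = np.linalg.eigvalsh(A)[::-1]       # descending order
lam_SAS = np.linalg.eigvalsh(S @ A @ S.T)[::-1]
lam_SS  = np.linalg.eigvalsh(S @ S.T)

theta = lam_SAS / lam_A                     # multipliers theta_k
print(lam_SS.min() <= theta.min(), theta.max() <= lam_SS.max())
```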

Write the truncated part as $A^{(\le m)} = \Phi\,\Lambda_m\,\Phi^{\top}$, where $\Phi \in \mathbb{R}^{N\times m}$ collects the sampled eigenfunctions, $\Phi_{j\ell} = \phi_\ell(x_j)$, and $\Lambda_m = \operatorname{diag}(\mu_1,\dots,\mu_m)$. Let the Hermitian matrix in the theorem be $\Lambda_m$ and $S = \bigl(\tfrac{1}{N}\Phi^{\top}\Phi\bigr)^{1/2}$ (assuming $\Phi$ has full column rank, so that $S\Lambda_m S^{*}$ shares its nonzero eigenvalues with $\tfrac{1}{N}A^{(\le m)}$); then the theorem implies, for $k \le m$,

$$\lambda_k\!\left(\tfrac{1}{N}A^{(\le m)}\right) \;=\; \theta_k\,\mu_k, \qquad \lambda_{\min}\!\left(\tfrac{1}{N}\Phi^{\top}\Phi\right) \;\le\; \theta_k \;\le\; \lambda_{\max}\!\left(\tfrac{1}{N}\Phi^{\top}\Phi\right).$$

Then, we obtain the bound (Ostrowski theorem and Weyl’s theorem, respectively), for $k \le m$,

$$\Bigl|\lambda_k\bigl(\tfrac{1}{N}A\bigr) - \mu_k\Bigr| \;\le\; |\theta_k - 1|\,\mu_k \;+\; \tfrac{1}{N}\bigl\|A^{(>m)}\bigr\|.$$

🌀Conservative estimates

For the remainder term $\tfrac{1}{N}\bigl\|A^{(>m)}\bigr\|$, a naive bound of this second term is $O\bigl(\sum_{\ell>m}\mu_\ell\bigr)$ if the eigenfunctions are uniformly bounded; otherwise the growth of $\|\phi_\ell\|_\infty$ needs to be considered. Of course, this bound is quite conservative. A few possible directions are viable.

  • The sampled eigenfunction vector $\bigl(\phi_\ell(x_1),\dots,\phi_\ell(x_N)\bigr)$ can be viewed as a random vector for large $N$ (with i.i.d. samples); then the central limit theorem can give a more aggressive estimate, see related work (Chun, Chung & Lee, 2025; Kong & Valiant, 2017).
  • If $\phi_\ell$ is close to a standard Fourier mode (up to a fixed phase shift), then on an equispaced lattice of $\Omega$ we can still expect (approximate) orthogonality between the discrete vectors $\bigl(\phi_\ell(x_1),\dots,\phi_\ell(x_N)\bigr)$; see the sketch after this list.
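A small check of this discrete orthogonality, with $\phi_\ell(x) = \sqrt{2}\cos(2\pi\ell x + \delta)$ standing in (as an assumption) for an eigenfunction close to a Fourier mode with a fixed phase shift $\delta$:

```python
import numpy as np

# Discrete (near-)orthogonality of Fourier-like modes on an equispaced grid of [0, 1].
# phi_l(x) = sqrt(2) cos(2 pi l x + delta) is a stand-in for an eigenfunction
# close to a Fourier mode with a fixed phase shift delta (illustrative assumption).
N, delta = 64, 0.3
x = np.arange(N) / N                                     # equispaced lattice
ls = np.arange(1, 9)
Phi = np.sqrt(2) * np.cos(2 * np.pi * np.outer(x, ls) + delta)

G = Phi.T @ Phi / N                                      # (1/N) Phi^T Phi
print(np.round(G, 6))                                    # essentially the identity for l < N/2
```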

Random vector perspective

Consider $\bigl(\phi_\ell(X_1),\dots,\phi_\ell(X_N)\bigr)$ as a random vector with respect to i.i.d. random variables $X_1,\dots,X_N$ (uniform on $\Omega$). We set the truncated kernel (still positive definite) $K_m(x,y) = \sum_{\ell\le m}\mu_\ell\,\phi_\ell(x)\,\phi_\ell(y)$ with sample matrix $A_m$ (which is exactly $A^{(\le m)}$ above); then we can use the spectrum-estimator idea and take $\tfrac{1}{N^{p}}\operatorname{tr}\bigl(A_m^{\,p}\bigr)$ as a moment estimator (biased) of the spectral moment $\sum_{\ell\le m}\mu_\ell^{\,p}$.

For instance, for $p = 2$,

$$\mathbb{E}\!\left[\frac{1}{N^{2}}\operatorname{tr}\bigl(A_m^{2}\bigr)\right] \;=\; \frac{N-1}{N}\sum_{\ell\le m}\mu_\ell^{2} \;+\; \frac{1}{N}\int_{\Omega} K_m(x,x)^{2}\,dx.$$

The last term is bounded by $\tfrac{1}{N}\bigl(\sum_{\ell\le m}\mu_\ell\,\|\phi_\ell\|_\infty^{2}\bigr)^{2}$. Once the decay rate of $\mu_\ell$ is prescribed, we can expect an explicit $O(1/N)$ bound (with an $m$-dependent constant) for this bias.

The variance can be estimated similarly. We only need to control the fourth moment of the truncated kernel, e.g. $\iint_{\Omega\times\Omega} K_m(x,y)^{4}\,dx\,dy$ and its diagonal analogue.

Therefore, the variance is bounded at the same $O(1/N)$ rate, and the estimator $\tfrac{1}{N^{2}}\operatorname{tr}\bigl(A_m^{2}\bigr)$ concentrates around the truncated spectral moment $\sum_{\ell\le m}\mu_\ell^{2}$.
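A Monte Carlo sketch of this estimator, again with the assumed $\min(x,y)$ eigen-pairs serving as the truncated kernel:

```python
import numpy as np

# Biased second-moment estimator tr(A_m^2) / N^2, using the assumed Mercer pairs
# of K(x, y) = min(x, y) as the truncated kernel (illustration only).
rng = np.random.default_rng(1)
N, m = 2000, 20
ls = np.arange(1, m + 1)
mu = 1.0 / (((ls - 0.5) * np.pi) ** 2)

x = rng.uniform(size=N)                                  # i.i.d. uniform samples
Phi = np.sqrt(2) * np.sin(np.outer(x, (ls - 0.5) * np.pi))
A_m = Phi @ np.diag(mu) @ Phi.T                          # truncated sample matrix

estimate = np.sum(A_m ** 2) / N**2                       # tr(A_m^2) = squared Frobenius norm
target = np.sum(mu ** 2)                                 # truncated spectral moment
print(estimate, target)                                  # close for large N: bias O(1/N), fluctuation O(1/sqrt(N))
```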

Extension to higher moments

When a higher moment $p$ is involved, we should still expect the same structure as the upper bound: the truncated spectral moment $\sum_{\ell\le m}\mu_\ell^{\,p}$ plus an $O(1/N)$ bias. As we can see later, this is probably the best we can do for the condition number estimate (since the first term dominates for moderate $m$ and large $N$).
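As a sketch of the bookkeeping behind this, under the same assumptions as above (uniform sampling on $\Omega$ and eigenfunctions uniformly bounded by a constant $C$), only the index tuples with a repeated index contribute to the bias:

$$\mathbb{E}\!\left[\frac{1}{N^{p}}\operatorname{tr}\bigl(A_m^{\,p}\bigr)\right] \;=\; \frac{N(N-1)\cdots(N-p+1)}{N^{p}}\sum_{\ell\le m}\mu_\ell^{\,p} \;+\; O\!\left(\frac{1}{N}\Bigl(\sum_{\ell\le m}\mu_\ell\,C^{2}\Bigr)^{p}\right),$$

since each tuple with all indices distinct contributes exactly $\sum_{\ell\le m}\mu_\ell^{\,p}$ in expectation, while the remaining $O(N^{p-1})$ tuples are each bounded by $\sup_{x,y}|K_m(x,y)|^{p} \le \bigl(\sum_{\ell\le m}\mu_\ell C^{2}\bigr)^{p}$ (the implied constants depend on $p$).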

〰️Concentration argument

The first term (the multiplier deviation $|\theta_k - 1|$, controlled by $\bigl\|\tfrac{1}{N}\Phi^{\top}\Phi - I_m\bigr\|$) can be quantified by viewing each entry $\tfrac{1}{N}\sum_{i}\phi_k(x_i)\,\phi_\ell(x_i)$ as a certain quadrature rule, or as ready for Hoeffding’s inequality. We should expect a stochastic bound of order $\sqrt{\log(1/\delta)/N}$ for each entry with probability $1-\delta$; therefore, with probability $1 - m^{2}\delta$,

$$\Bigl\|\tfrac{1}{N}\Phi^{\top}\Phi - I_m\Bigr\| \;\lesssim\; m\,\sqrt{\frac{\log(1/\delta)}{N}}.$$
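An empirical check of this entrywise concentration, with the same assumed eigenfunctions $\phi_\ell(x) = \sqrt{2}\sin((\ell-\tfrac12)\pi x)$ and i.i.d. uniform samples; the maximal entry deviation scales like $1/\sqrt{N}$:

```python
import numpy as np

# Entrywise concentration of (1/N) Phi^T Phi around the identity, with the
# assumed eigenfunctions phi_l(x) = sqrt(2) sin((l - 1/2) pi x) and i.i.d. samples.
rng = np.random.default_rng(2)
m = 10
ls = np.arange(1, m + 1)
for N in (250, 1000, 4000):
    x = rng.uniform(size=N)
    Phi = np.sqrt(2) * np.sin(np.outer(x, (ls - 0.5) * np.pi))
    dev = np.abs(Phi.T @ Phi / N - np.eye(m)).max()
    print(N, dev, dev * np.sqrt(N))          # last column stays O(1)
```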

The final bound will be a compromise between these two terms (in high probability). Under the assumption that the eigenvalues decay polynomially, say $\mu_\ell \simeq \ell^{-\alpha}$, the two error terms can be balanced in $m$.

Then, we should truncate at the largest $m$ for which the combined error is still dominated by $\mu_m$ itself, that is, for which the concentration term $m\sqrt{\log(1/\delta)/N}$ does not exceed a constant and the remainder term does not exceed (a constant multiple of) $\mu_m$. The condition number of the sample matrix can then be estimated from below.

Since the top eigenvalue of $\tfrac{1}{N}A$ concentrates near $\mu_1$, and the $m$-th eigenvalue is certified to be of order $\mu_m$, we obtain $\kappa\bigl(\tfrac{1}{N}A\bigr) \gtrsim \mu_1/\mu_m$ in the high-probability sense.
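Under the polynomial-decay assumption above, this lower bound becomes explicit in the truncation level:

$$\kappa\!\left(\tfrac{1}{N}A\right) \;\gtrsim\; \frac{\mu_1}{\mu_m} \;\simeq\; m^{\alpha},$$

so whichever power of $N$ the admissible truncation $m$ reaches translates directly into a power-of-$N$ lower bound on the condition number.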

🔦 Examples

Two-layer ReLU neural network with random sampling

Consider the NTK regime of a two-layer ReLU network (as considered in the literature on two-layer ReLU networks). The kernel admits a polynomial eigenvalue decay rate, so we can expect the condition number to be bounded below by a corresponding power of $N$ for a total number of $N$ samples.

Two-layer ReLU neural network with equispaced sampling

In this case, the estimate can be more aggressive, since the eigenfunctions are almost “Fourier” modes: the equispaced lattice points still sustain the orthogonality relation for the sampled eigenfunctions, so the Ostrowski multipliers $\theta_k$ stay essentially equal to $1$ without the concentration penalty. In this scenario, we can expect a sharper lower bound on the condition number; a stand-in comparison is sketched below.
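Since the exact NTK eigen-decomposition is not reproduced here, the following stand-in sketch (Fourier modes with an assumed polynomial decay $\mu_\ell = \ell^{-2}$ as a proxy for the NTK spectrum) only illustrates the random-vs-equispaced comparison made in the two examples above:

```python
import numpy as np

# Stand-in experiment for the two examples above: Fourier modes with an assumed
# decay mu_l = l^{-2} (a proxy for the NTK spectrum, not the exact NTK), comparing
# the condition number of the truncated sample matrix for random vs equispaced points.
rng = np.random.default_rng(4)
N, m = 512, 40
ls = np.arange(1, m + 1)
mu = ls ** -2.0

def sample_matrix(t):
    # Phi[j, l] stacks the sampled modes sqrt(2) cos(2 pi l t_j) for l = 1..m.
    Phi = np.sqrt(2) * np.cos(2 * np.pi * np.outer(t, ls))
    return (Phi * mu) @ Phi.T / N            # (1/N) Phi diag(mu) Phi^T

for name, t in [("random", rng.uniform(size=N)),
                ("equispaced", np.arange(N) / N)]:
    lam = np.linalg.eigvalsh(sample_matrix(t))[::-1][:m]   # top-m eigenvalues
    print(name, lam[0] / lam[m - 1], "vs mu_1 / mu_m =", mu[0] / mu[-1])
```

On the equispaced grid the nonzero eigenvalues reproduce $\mu_\ell$ exactly (the sampled modes are exactly orthogonal), while random sampling introduces the $O(1/\sqrt{N})$ fluctuations discussed above.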

References

🐻  Bhatia, R. 2013. Matrix Analysis. Springer Science & Business Media.
🐻  Chun, C., Chung, S. & Lee, D.D. 2025. Estimating the Spectral Moments of the Kernel Integral Operator from Finite Sample Matrices.
🐻  Kong, W. & Valiant, G. 2017. Spectrum Estimation from Samples.