Overview

This post explores the training dynamics of a two-layer convolutional neural network (CNN) under gradient flow. We analyze the invariance properties of the kernel-coefficient pairs and discuss the challenges of spectral bias, particularly the exponential time required to recover high-frequency information from band-limited data.

🏷️ Problem Formulation

We consider a two-layer convolutional network governed by time-dependent coefficients $a_i(t)$ and kernels $w_i(t)$:

$$f(x; t) = \sum_{i=1}^{m} a_i(t)\, \sigma\big(w_i(t) * x\big),$$

where $*$ denotes convolution and $\sigma$ is a non-linear activation function. Given a target affine mapping $f^*$, we minimize the loss functional over a data manifold $\mathcal{M}$:

$$\mathcal{L}(a, w) = \frac{1}{2} \int_{\mathcal{M}} \big| f(x; t) - f^*(x) \big|^2 \, d\mu(x).$$
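To fix ideas, here is a minimal JAX sketch of this architecture and loss. The kernel length, width, discretization of $\mathcal{M}$, and the simple affine target used below are illustrative assumptions, not the setting of the analysis.

```python
import jax
import jax.numpy as jnp

def two_layer_cnn(params, x):
    """f(x) = sum_i a_i * relu(w_i * x), with 1-D convolutions."""
    a, w = params                       # a: (m,), w: (m, k)
    conv = jax.vmap(lambda wi: jnp.convolve(x, wi, mode="same"))(w)  # (m, n)
    return jnp.sum(a[:, None] * jax.nn.relu(conv), axis=0)           # (n,)

def loss(params, xs, targets):
    """Mean-squared loss over a sampled data set standing in for the manifold M."""
    preds = jax.vmap(lambda x: two_layer_cnn(params, x))(xs)
    return 0.5 * jnp.mean(jnp.sum((preds - targets) ** 2, axis=-1))

# Illustrative instantiation (sizes and target are assumptions for this sketch).
key = jax.random.PRNGKey(0)
ka, kw, kx = jax.random.split(key, 3)
m, k, n, batch = 32, 5, 64, 16
params = (0.1 * jax.random.normal(ka, (m,)), 0.1 * jax.random.normal(kw, (m, k)))
xs = jax.random.normal(kx, (batch, n))
targets = 2.0 * xs + 1.0                # a simple affine stand-in for the target f*
print(loss(params, xs, targets))
```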

🏷️ Gradient Flow and Invariance

The parameters evolve according to the gradient flow dynamics:

$$\dot{a}_i = -\nabla_{a_i} \mathcal{L}, \qquad \dot{w}_i = -\nabla_{w_i} \mathcal{L}.$$

For homogeneous activations such as the ReLU ($\sigma(z) = \max(0, z)$), we observe a fundamental conservation law. Because $\sigma$ is 1-homogeneous, each neuron satisfies $a_i\, \dot{a}_i = w_i \cdot \dot{w}_i$ along the flow, so the difference between the coefficient energy and the kernel energy remains invariant:

$$\frac{d}{dt} \Big( |a_i(t)|^2 - \|w_i(t)\|^2 \Big) = 0, \qquad i = 1, \dots, m.$$

This indicates that the training trajectory is confined to a tensor product of Lorentzian manifolds, each one a level set of the conserved quantity $|a_i|^2 - \|w_i\|^2$.
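A small-step Euler discretization of the flow should preserve this quantity up to the step size. The sketch below checks that the drift stays small; the network sizes, data, and affine target are assumptions made only for the check.

```python
import jax
import jax.numpy as jnp

def model(params, x):
    """Two-layer conv net: sum_i a_i * relu(w_i * x)."""
    a, w = params
    conv = jax.vmap(lambda wi: jnp.convolve(x, wi, mode="same"))(w)
    return jnp.sum(a[:, None] * jax.nn.relu(conv), axis=0)

def loss(params, xs, ys):
    preds = jax.vmap(lambda x: model(params, x))(xs)
    return 0.5 * jnp.mean(jnp.sum((preds - ys) ** 2, axis=-1))

def invariant(params):
    """Per-neuron conserved quantity |a_i|^2 - ||w_i||^2."""
    a, w = params
    return a ** 2 - jnp.sum(w ** 2, axis=-1)

key = jax.random.PRNGKey(1)
ka, kw, kx = jax.random.split(key, 3)
params = (0.1 * jax.random.normal(ka, (8,)), 0.1 * jax.random.normal(kw, (8, 5)))
xs = jax.random.normal(kx, (16, 32))
ys = 2.0 * xs + 1.0                      # assumed affine target for the check

grad_fn = jax.jit(jax.grad(loss))
q0 = invariant(params)
for _ in range(2000):                    # small-step Euler discretization of the flow
    ga, gw = grad_fn(params, xs, ys)
    params = (params[0] - 1e-3 * ga, params[1] - 1e-3 * gw)

# The drift should be small; it vanishes exactly only in the continuous-time limit.
print(jnp.max(jnp.abs(invariant(params) - q0)))
```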

🏷️ Stationary Points and Convergence

The loss function is monotonically non-increasing under gradient flow:

$$\frac{d}{dt}\, \mathcal{L}\big(a(t), w(t)\big) = -\sum_{i} \Big( \big|\nabla_{a_i} \mathcal{L}\big|^2 + \big\|\nabla_{w_i} \mathcal{L}\big\|^2 \Big) \leq 0.$$

The coefficient vector $a = (a_1, \dots, a_m)$ satisfies a non-homogeneous linear evolution:

$$\dot{a} = -\big( G(t)\, a - b(t) \big), \qquad G_{ij} = \big\langle \sigma(w_i * \cdot),\, \sigma(w_j * \cdot) \big\rangle_{L^2(\mathcal{M})}, \qquad b_i = \big\langle \sigma(w_i * \cdot),\, f^* \big\rangle_{L^2(\mathcal{M})},$$

where $G$ is the Gram matrix (Neural Tangent Kernel) of the features $\sigma(w_i * x)$. Convergence depends on the spectral properties of $G$ and on the ability of the basis $\{\sigma(w_i * x)\}_i$ to resolve the target $f^*$.
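The eigenvalues of $G$ set the decay rates of the corresponding modes of $a(t)$. Here is one way to inspect them numerically; the sizes and the sampled data standing in for $\mathcal{M}$ are assumptions for the sketch.

```python
import jax
import jax.numpy as jnp

def features(w, x):
    """phi_i(x) = relu(w_i * x) for a single input x; returns an (m, n) array."""
    return jax.nn.relu(jax.vmap(lambda wi: jnp.convolve(x, wi, mode="same"))(w))

key = jax.random.PRNGKey(2)
kw, kx = jax.random.split(key)
w = 0.1 * jax.random.normal(kw, (32, 5))      # m = 32 kernels of length 5 (assumed sizes)
xs = jax.random.normal(kx, (128, 64))         # samples standing in for the manifold M

phi = jax.vmap(lambda x: features(w, x))(xs)            # (batch, m, n)
phi = phi.reshape(phi.shape[0], phi.shape[1], -1)       # flatten the spatial dimension
G = jnp.einsum("bip,bjp->ij", phi, phi) / phi.shape[0]  # empirical Gram matrix G_ij

eigs = jnp.linalg.eigvalsh(G)                           # ascending eigenvalues of G
print("lambda_min, lambda_max:", eigs[0], eigs[-1])     # slowest / fastest modes of a(t)
```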

🏷️ Spectral Bias in Frequency Retrieval

In the context of image regression, we model the available data as a frequency-truncated (band-limited) signal. The task of the network is then effectively to perform analytic continuation in the Fourier domain to retrieve the missing high-frequency components.

Our analysis suggests a sharp transition in the training timescales:

  • Low Frequencies: Captured rapidly, within a training time bounded by a fixed threshold.
  • High Frequencies: The transition from one frequency band to the next requires a training time that scales exponentially with the size of the frequency gap (see the numerical illustration after this list).
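The sketch below illustrates this behavior in the simplest setting: a small two-layer ReLU network (fully connected here for simplicity, rather than the convolutional model above) is fit to a 1-D signal containing one low and one high frequency, and the residual is tracked per Fourier mode. The frequencies, width, learning rate, and step counts are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

# 1-D regression target mixing a low (k = 2) and a high (k = 20) frequency.
n = 256
t = jnp.linspace(0.0, 1.0, n, endpoint=False)
y = jnp.sin(2 * jnp.pi * 2 * t) + jnp.sin(2 * jnp.pi * 20 * t)

def mlp(params, x):
    """Small two-layer ReLU network evaluated pointwise on the coordinate x."""
    w, b, a = params
    return jax.nn.relu(x[:, None] * w + b) @ a           # (n, m) @ (m,) -> (n,)

def loss(params):
    return 0.5 * jnp.mean((mlp(params, t) - y) ** 2)

def mode_error(params, k):
    """Relative error of the k-th Fourier mode of the residual."""
    res_hat = jnp.fft.fft(mlp(params, t) - y)
    return jnp.abs(res_hat[k]) / (jnp.abs(jnp.fft.fft(y)[k]) + 1e-12)

key = jax.random.PRNGKey(3)
kw, kb, ka = jax.random.split(key, 3)
m = 256
params = (jax.random.normal(kw, (m,)), jax.random.normal(kb, (m,)),
          0.1 * jax.random.normal(ka, (m,)))

grad_fn = jax.jit(jax.grad(loss))
for step in range(10001):
    g = grad_fn(params)
    params = tuple(p - 1e-2 * gi for p, gi in zip(params, g))
    if step % 2000 == 0:
        # The low-frequency residual typically shrinks much sooner than the high one.
        print(step, float(mode_error(params, 2)), float(mode_error(params, 20)))
```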

🏷️ Fourier Representation

Utilizing the Plancherel identity, we analyze the kernel initialization in the dual space. The initial kernels $w_i(0)$ are viewed as random Gaussian fields; their Fourier coefficients $\hat{w}_i(\xi)$ are then themselves Gaussian, with total spectral energy equal to the spatial energy $\|w_i(0)\|^2$.

For high bandwidth, the empirical distribution of these coefficients is governed by the Law of Large Numbers, ensuring that the initial spectrum covers the frequency band occupied by the data.
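A quick numerical check of both points, with an assumed kernel length and count: white Gaussian kernels have, on average, an approximately flat power spectrum, and their spectral energy matches their spatial energy up to the FFT normalization.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(4)
m, k = 4096, 33                              # number of kernels and kernel length (assumed)
w = jax.random.normal(key, (m, k))           # white Gaussian initialization

w_hat = jnp.fft.rfft(w, axis=-1)             # dual-space view of each kernel

# Plancherel check: spectral energy matches spatial energy (up to FFT normalization).
spatial = jnp.sum(w[0] ** 2)
spectral = (jnp.abs(w_hat[0, 0]) ** 2 + 2 * jnp.sum(jnp.abs(w_hat[0, 1:]) ** 2)) / k
print(spatial, spectral)

# Averaged power spectrum: approximately flat, so every frequency in the data band
# receives comparable energy at initialization (Law of Large Numbers over the m kernels).
mean_power = jnp.mean(jnp.abs(w_hat) ** 2, axis=0)
print(mean_power)
```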

🏷️ Notes

  • The Lorentzian structure implies that small initial weights can lead to “exploding” or “vanishing” gradients if not properly balanced.
  • The spectral bias observed here is a specific instance of the “F-Principle”, where neural networks fit low frequencies before high frequencies.

πŸ”— See Also

πŸ“š References