Overview
This post explores the training dynamics of a two-layer convolutional neural network (CNN) under gradient flow. We analyze the invariance properties of the kernel-coefficient pairs and discuss the challenges of spectral bias, particularly the exponential time required to recover high-frequency information from band-limited data.
Problem Formulation
We consider a two-layer convolutional network governed by time-dependent coefficients $a_i(t)$ and kernels $w_i(t)$:

$$f(x; t) = \sum_{i=1}^{m} a_i(t)\, \sigma\big(w_i(t) * x\big),$$

where $*$ denotes convolution and $\sigma$ is a non-linear activation function. Given a target linear affine mapping $f^*$, we minimize the loss functional over a data manifold $\mathcal{M}$:

$$\mathcal{L}(t) = \frac{1}{2} \int_{\mathcal{M}} \big| f(x; t) - f^*(x) \big|^2 \, dx.$$
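As a concrete (hypothetical) instantiation, the following NumPy sketch builds the network and an empirical version of the loss for 1-D signals. The sizes `m`, `k`, `n` and the toy affine target are illustrative choices, not fixed by the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: m kernels of width k acting on 1-D signals of length n.
m, k, n = 8, 5, 64

a = 0.1 * rng.normal(size=m)          # coefficients a_i
w = 0.1 * rng.normal(size=(m, k))     # kernels w_i

def relu(s):
    return np.maximum(s, 0.0)

def forward(x, a, w):
    """Two-layer CNN: f(x) = sum_i a_i * relu(w_i * x)."""
    # np.convolve realizes the convolution w_i * x
    feats = np.stack([relu(np.convolve(x, w[i], mode="same")) for i in range(m)])
    return (a[:, None] * feats).sum(axis=0)

def loss(xs, a, w, target):
    """Empirical version of the loss functional over sampled data points."""
    return 0.5 * np.mean([np.sum((forward(x, a, w) - target(x)) ** 2) for x in xs])

# Toy affine target f*(x) = 2x + 1 and a small sampled 'data manifold'.
target = lambda x: 2.0 * x + 1.0
xs = [rng.normal(size=n) for _ in range(16)]
print(loss(xs, a, w, target))
```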
Gradient Flow and Invariance
The parameters evolve according to the gradient descent dynamics:

$$\dot{a}_i(t) = -\frac{\partial \mathcal{L}}{\partial a_i}, \qquad \dot{w}_i(t) = -\nabla_{w_i} \mathcal{L}.$$

For homogeneous activations such as the ReLU ($\sigma(s) = \max(s, 0)$, which satisfies $\sigma(\lambda s) = \lambda\,\sigma(s)$ for $\lambda > 0$), we observe a fundamental conservation law. The difference between the coefficient energy and the kernel energy remains invariant:

$$\frac{d}{dt}\Big( a_i(t)^2 - \| w_i(t) \|^2 \Big) = 0 \qquad \text{for each } i.$$

This indicates that the training trajectory is confined to a tensor product of Lorentzian manifolds: each pair $(a_i, w_i)$ stays on the level set $\{\, a_i^2 - \|w_i\|^2 = c_i \,\}$ of a quadratic form with Lorentzian signature.
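The conservation law can be checked numerically. The sketch below runs small-step gradient descent (a discretization of gradient flow) for a single ReLU unit $f(x) = a\,\mathrm{relu}(w \cdot x)$; the data, target, and step size are illustrative. The invariant drifts only at order $\mathrm{lr}^2$ per step:

```python
import numpy as np

rng = np.random.default_rng(1)

# Single ReLU unit f(x) = a * relu(w . x); homogeneity of relu is what
# makes a^2 - ||w||^2 invariant under gradient flow.
d, n_samples = 4, 200
X = rng.normal(size=(n_samples, d))
y = X @ np.array([1.0, -0.5, 0.3, 0.0])    # toy linear target (illustrative)

a, w = 0.3, 0.2 * rng.normal(size=d)
relu = lambda s: np.maximum(s, 0.0)

lr, steps = 1e-3, 2000
inv0 = a**2 - w @ w                         # invariant at t = 0
for _ in range(steps):
    pre = X @ w
    r = a * relu(pre) - y                   # residual
    grad_a = np.mean(r * relu(pre))
    grad_w = X.T @ (r * a * (pre > 0)) / n_samples
    a, w = a - lr * grad_a, w - lr * grad_w

print(inv0, a**2 - w @ w)                   # the two values stay close
```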
Stationary Points and Convergence
The loss is monotonically non-increasing under gradient flow:

$$\frac{d\mathcal{L}}{dt} = -\sum_{i=1}^{m} \Big( |\dot{a}_i|^2 + \|\dot{w}_i\|^2 \Big) \le 0.$$

The coefficient vector $a = (a_1, \dots, a_m)$ satisfies a non-homogeneous linear evolution:

$$\dot{a}(t) = -G(t)\, a(t) + b(t),$$

where $G_{ij}(t) = \int_{\mathcal{M}} \sigma(w_i * x)\, \sigma(w_j * x)\, dx$ is the Gram matrix (Neural Tangent Kernel) of the features and $b_i(t) = \int_{\mathcal{M}} f^*(x)\, \sigma(w_i * x)\, dx$. Convergence depends on the spectral properties of $G$ and the ability of the basis $\{\sigma(w_i * \cdot)\}$ to resolve the target $f^*$.
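In a lazy regime where the kernels are held fixed, this linear evolution can be integrated directly. The sketch below (an assumed empirical setting with sampled data, not the exact setup above) runs explicit Euler on $\dot a = -Ga + b$ and checks the stationarity condition $Ga = b$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Lazy regime: with the kernels w_i frozen, a(t) obeys da/dt = -G a + b.
m, n_samples, d = 6, 300, 8
X = rng.normal(size=(n_samples, d))
W = rng.normal(size=(m, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm kernels
relu = lambda s: np.maximum(s, 0.0)

Phi = relu(X @ W.T)                  # features sigma(w_i . x), one column per i
fstar = X @ rng.normal(size=d)       # toy target values f*(x)

G = Phi.T @ Phi / n_samples          # empirical Gram / NTK matrix
b = Phi.T @ fstar / n_samples

a = np.zeros(m)
dt = 0.05
for _ in range(10_000):
    a += dt * (-G @ a + b)           # explicit Euler on the linear ODE

print(np.linalg.norm(G @ a - b))     # ~0 at the stationary point
```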
Spectral Bias in Frequency Retrieval
In the context of image regression, we model the observed target as a frequency-truncated signal, retaining only the band $|\xi| \le B$ of $\widehat{f^*}$. The task of the network $f$ is then effectively to perform analytic continuation in the Fourier domain to retrieve the high-frequency components $|\xi| > B$.
Our analysis suggests a sharp transition in the training timescales:
- Low Frequencies: Components below a threshold frequency $\xi_0$ are captured rapidly.
- High Frequencies: The transition from a frequency band $[0, \xi_1]$ to a larger band $[0, \xi_2]$ requires a training time that scales exponentially with the gap, $T \sim e^{c(\xi_2 - \xi_1)}$.
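The timescale separation is easiest to see in the NTK eigenbasis, where each error mode decays as $e^{-\lambda_k t}$. Assuming, purely for illustration, an eigenvalue decay $\lambda_k = e^{-ck}$ in the frequency index $k$, the time to fit frequency $k$ grows like $e^{ck}$:

```python
import numpy as np

# Diagonalized toy picture: in the NTK eigenbasis the error at frequency k
# decays as exp(-lambda_k * t). The eigenvalue decay lambda_k = exp(-c k)
# below is an assumption made for illustration only.
c = 1.0
freqs = np.arange(1, 9)
lam = np.exp(-c * freqs)

def time_to_fit(k, tol=1e-2):
    """Time for the mode-k error exp(-lambda_k * t) to drop below tol."""
    return -np.log(tol) / lam[k - 1]

for k in (1, 4, 8):
    print(k, time_to_fit(k))
# Each extra unit of frequency multiplies the required time by e^c.
```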
Fourier Representation
Utilizing the Plancherel identity, we analyze the kernel initialization in the dual space. The initial kernels are viewed as random Gaussian fields:

$$\widehat{w}_i(0, \xi) \sim \mathcal{N}(0, \sigma_0^2), \qquad \text{i.i.d. in } i.$$

For high bandwidth, the distribution of these Fourier coefficients is governed by the Law of Large Numbers, ensuring that the initial spectrum covers the necessary data bandwidth on $\mathcal{M}$.
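A quick numerical check of this covering property, in a discrete 1-D setting with illustrative sizes: i.i.d. Gaussian kernels are white noise, so their power spectrum, averaged over many kernels, is flat across all frequency bins.

```python
import numpy as np

rng = np.random.default_rng(3)

# i.i.d. Gaussian kernels are white noise: the expected power in every
# frequency bin equals the kernel length k, and averaging over m kernels
# (the LLN step) makes the empirical spectrum flat as well.
m, k = 2000, 32
W = rng.normal(size=(m, k))
power = np.abs(np.fft.rfft(W, axis=1)) ** 2
mean_power = power.mean(axis=0)            # average over the m kernels

print(mean_power)                          # every bin is close to k = 32
```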
Notes
- The Lorentzian structure implies that small initial weights can lead to “exploding” or “vanishing” gradients if not properly balanced.
- The spectral bias observed here is a specific instance of the “F-Principle”: neural networks fit low frequencies before high frequencies.