Note

This is a short note on the training dynamics of a two-layer convolutional network, intended just to build some understanding.

Let the network ($t$ denotes the time variable of the dynamics) be

$$
f_t(x) \;=\; \sum_{j=1}^{m} a_j(t)\,\big(w_j(t) * x\big),
$$

where the $w_j(t) \in \mathbb{R}^d$ are convolution kernels with a finite bandwidth and the $a_j(t) \in \mathbb{R}$ are output weights. Suppose the target function $T$ is a linear affine mapping; then the usual loss function is

$$
L(t) \;=\; \frac{1}{2}\,\mathbb{E}_x\,\big\| f_t(x) - T x \big\|^2 .
$$
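To make the setup concrete, here is a minimal numerical sketch of this model and loss in one dimension; the sizes, the FFT-based circular convolution, and the Fourier-multiplier target are illustrative assumptions rather than choices made in this note.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, m = 16, 64, 8                           # kernel size, signal size, number of kernels (assumed)
a = rng.normal(size=m) / np.sqrt(m)           # output weights a_j
w = rng.normal(size=(m, d)) / np.sqrt(d)      # convolution kernels w_j (finite bandwidth d)

def conv(w_j, x):
    """Circular convolution of a length-d kernel (zero-padded) with a length-n signal."""
    k = np.zeros_like(x)
    k[:len(w_j)] = w_j
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(x)))

def f(a, w, x):
    """Two-layer linear convolutional network f(x) = sum_j a_j (w_j * x)."""
    return sum(a_j * conv(w_j, x) for a_j, w_j in zip(a, w))

def loss(a, w, x, y):
    """Squared L2 loss on a single example pair (x, y)."""
    return 0.5 * np.sum((f(a, w, x) - y) ** 2)

# Illustrative target: a linear map acting as a Fourier multiplier tau.
tau = rng.normal(size=n)
x = rng.normal(size=n)
y = np.real(np.fft.ifft(tau * np.fft.fft(x)))
print(loss(a, w, x, y))
```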

Gradient Flow

The gradient flow is

$$
\dot a_j(t) \;=\; -\,\partial_{a_j} L, \qquad \dot w_j(t) \;=\; -\,\nabla_{w_j} L .
$$

One thing we can observe is that $\langle w_j, \nabla_{w_j} L\rangle = a_j\,\partial_{a_j} L$, since $f_t$ is linear in $a_j$ and in $w_j$ separately. Therefore,

$$
a_j\,\dot a_j \;=\; \langle w_j, \dot w_j\rangle .
$$

We can write the convolution kernel in coordinates, $w_j = (w_{j,1},\dots,w_{j,d}) \in \mathbb{R}^d$, thus

$$
\frac{d}{dt}\Big( a_j(t)^2 \;-\; \sum_{k=1}^{d} w_{j,k}(t)^2 \Big) \;=\; 0 ,
$$

showing that the quantity $a_j^2 - \|w_j\|^2$ is invariant for each $j$. Therefore, the training is confined to a tensor product of Lorentzian manifolds of type $\mathbb{R}^{1,d}$, where $d$ is the dimension of the convolution kernel. However, this does not reveal any training dynamics. For images, asymptotically we can assume the support of the kernel is finite.
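For reference, here is a short check of the identity used above, assuming the linear parameterization $f_t(x) = \sum_j a_j\,(w_j * x)$ and writing $r_t(x) = f_t(x) - T x$ for the residual:

$$
\partial_{a_j} L \;=\; \mathbb{E}_x\,\big\langle r_t(x),\; w_j * x \big\rangle,
\qquad
\big\langle \nabla_{w_j} L,\; w_j \big\rangle \;=\; \mathbb{E}_x\,\big\langle r_t(x),\; a_j\,(w_j * x) \big\rangle \;=\; a_j\,\partial_{a_j} L,
$$

so $a_j \dot a_j = \langle w_j, \dot w_j\rangle$ along the flow and $a_j^2 - \|w_j\|^2$ stays constant.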

First, we know the dynamics will converge to a stationary point.
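One standard way to see this, sketched under the assumption that the trajectory stays bounded: along the gradient flow,

$$
\frac{d}{dt}\, L\big(\theta(t)\big) \;=\; \big\langle \nabla L(\theta(t)),\, \dot\theta(t) \big\rangle \;=\; -\,\big\|\nabla L(\theta(t))\big\|^2 \;\le\; 0,
$$

so the loss is non-increasing, and because the loss above is polynomial (hence analytic) in the parameters, a Łojasiewicz-type argument upgrades this to convergence of the trajectory to a stationary point.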

Regarding the coefficients $c(t)$ of the residual $f_t - T$ in a fixed basis, the dynamics will satisfy

$$
\dot c(t) \;=\; -\,G(t)\, c(t),
$$

where $G(t)$ is the Gram matrix of the current features. Then, in the regime where $G(t) \approx G$ stays close to its value at initialization, it can be solved by

$$
c(t) \;=\; e^{-G t}\, c(0).
$$

It means that, once the Gram matrix satisfies some positive definiteness, the target can always be captured well (at rates set by the eigenvalues of $G$) in the chosen basis.
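To spell out the role of the spectrum, still under the frozen-Gram assumption above and with $G$ symmetric positive semi-definite, write the eigendecomposition $G = \sum_k \lambda_k\, v_k v_k^{\top}$; then

$$
c(t) \;=\; \sum_k e^{-\lambda_k t}\,\langle v_k,\, c(0)\rangle\, v_k,
\qquad
\|c(t)\| \;\le\; e^{-\lambda_{\min}(G)\, t}\,\|c(0)\|,
$$

so positive definiteness ($\lambda_{\min}(G) > 0$) gives exponential decay of the residual, with each eigen-direction decaying at its own rate $\lambda_k$.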

In our case, the input $x$ is an image with truncated frequencies and the target $y$ is a continuation of $x$ (assuming we have some analyticity in frequency, which is true), and the network needs to fit the extended image; essentially, the task is to retrieve the high frequencies. Our idea is that the lower frequencies are all easy to capture, but the higher frequencies are difficult to retrieve in reasonable time; it should take significantly longer (not just a polynomial relation). The plan is the following:

  • For lower frequencies, show the error can be under a threshold.
  • For higher frequencies, show it will take an unreasonably long time to arrive. The transition from the lower frequency band to the higher one should take exponentially long in terms of the gap (see the timescale sketch after this list).
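A hedged way to quantify this plan, assuming the dynamics are approximately diagonal across frequencies with effective rates $\lambda_\xi$ (the per-mode picture from the previous section; an exponential decay of $\lambda_\xi$ beyond the observed band is a target of the analysis, not an established fact): the time for mode $\xi$ to bring its error below $\epsilon$ is roughly

$$
t_\xi(\epsilon) \;\approx\; \frac{1}{\lambda_\xi}\,\log\frac{|c_\xi(0)|}{\epsilon},
$$

so if, say, $\lambda_\xi \lesssim e^{-c\,(|\xi| - \xi_0)}$ beyond the observed band $|\xi| \le \xi_0$, then reaching the higher frequencies takes a time exponentially long in the gap $|\xi| - \xi_0$, while every mode with $\lambda_\xi$ bounded away from zero reaches the threshold in a time only logarithmic in $1/\epsilon$.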

Fourier Space

Since it is about convolution, it might be suitable to consider the dual (Fourier) space; then the Plancherel identity shows that $\|w_j\|_{L^2} = \|\hat w_j\|_{\ell^2}$. The initialization of each $w_j$ can be viewed as a random Gaussian field and decomposed into

$$
w_j \;=\; \sum_{\xi} \hat w_j(\xi)\, e^{2\pi i \langle \xi,\, \cdot\,\rangle}, \qquad \hat w_j(\xi) \ \text{i.i.d. Gaussian}.
$$

The norm is $\|w_j\|_{L^2}^2 = \sum_{\xi} |\hat w_j(\xi)|^2$ (assume the support is the unit square), and its distribution is governed by the law of large numbers when $d$ is large. We only assume $d$ is large enough to cover the frequency band of the target on the unit square. For simplicity, we assume that the affine mapping takes the following diagonal form under the Fourier representation:

$$
\widehat{T x}(\xi) \;=\; \tau(\xi)\, \hat x(\xi) \,+\, \hat b(\xi).
$$
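A quick numerical sanity check of the Plancherel identity and of the law-of-large-numbers behaviour of the norm; the $d \times d$ discretization of the unit square and the i.i.d. complex Gaussian coefficients are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 64                                              # frequency grid resolution per axis (assumed)
# Kernel initialized as a Gaussian random field: i.i.d. complex Gaussian Fourier coefficients.
w_hat = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
w = np.fft.ifft2(w_hat)

# Plancherel: the spatial L2 norm equals the l2 norm of the Fourier coefficients
# (the factor d compensates numpy's unnormalized ifft2 convention).
print(np.allclose(np.linalg.norm(w) * d, np.linalg.norm(w_hat)))

# Law of large numbers: ||w_hat||^2 / d^2 concentrates around E|w_hat(xi)|^2 = 2.
samples = [
    np.sum(np.abs(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))) ** 2) / d**2
    for _ in range(100)
]
print(np.mean(samples), np.std(samples))            # mean close to 2, spread shrinking with d
```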

ReLU Activation

The ReLU activation function $\sigma(u) = \max(u, 0)$ imposes some challenges due to its nonlinearity. But formally, we can still write

$$
\sigma\big(w_j * x\big) \;=\; D_j(x)\,\big(w_j * x\big),
$$

where $D_j(x) = \operatorname{diag}\!\big(\mathbf{1}\{\, w_j * x > 0 \,\}\big)$ is the data-dependent activation pattern.
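A small numerical check of this gating rewrite; the pre-activation values are arbitrary test data.

```python
import numpy as np

z = np.random.default_rng(2).normal(size=10)   # pre-activations w_j * x (test values)
relu = np.maximum(z, 0.0)
gated = (z > 0).astype(float) * z              # D_j(x) (w_j * x) with D_j = diag(1{z > 0})
print(np.allclose(relu, gated))                # True: ReLU acts as data-dependent gating
```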

Notes