Overview

This post explores the training dynamics of a two-layer convolutional neural network (CNN) under gradient flow. We analyze the invariance properties of the kernel-coefficient pairs and discuss the challenges of spectral bias, particularly the exponential time required to recover high-frequency information from band-limited data.

🏷️ Problem Formulation

We consider a two-layer convolutional network governed by time-dependent coefficients $a_i(t)$ and kernels $w_i(t)$:

$$f(x; t) = \sum_{i=1}^{m} a_i(t)\, \sigma\big(w_i(t) * x\big),$$

where $*$ denotes convolution and $\sigma$ is a non-linear activation function. Given a target affine mapping $f^*$, we minimize the loss functional over a data manifold $\mathcal{M}$:

$$\mathcal{L}(a, w) = \frac{1}{2} \int_{\mathcal{M}} \big| f(x; t) - f^*(x) \big|^2 \, d\mu(x).$$
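To fix ideas, here is a minimal JAX sketch of this architecture and loss. The kernel length, width, discretization of $\mathcal{M}$, and the simple affine target used below are illustrative assumptions, not the setting of the analysis.

```python
import jax
import jax.numpy as jnp

def two_layer_cnn(params, x):
    """f(x) = sum_i a_i * relu(w_i * x), with 1-D convolutions."""
    a, w = params                       # a: (m,), w: (m, k)
    conv = jax.vmap(lambda wi: jnp.convolve(x, wi, mode="same"))(w)  # (m, n)
    return jnp.sum(a[:, None] * jax.nn.relu(conv), axis=0)           # (n,)

def loss(params, xs, targets):
    """Mean-squared loss over a sampled data set standing in for the manifold M."""
    preds = jax.vmap(lambda x: two_layer_cnn(params, x))(xs)
    return 0.5 * jnp.mean(jnp.sum((preds - targets) ** 2, axis=-1))

# Illustrative instantiation (sizes and target are assumptions for this sketch).
key = jax.random.PRNGKey(0)
ka, kw, kx = jax.random.split(key, 3)
m, k, n, batch = 32, 5, 64, 16
params = (0.1 * jax.random.normal(ka, (m,)), 0.1 * jax.random.normal(kw, (m, k)))
xs = jax.random.normal(kx, (batch, n))
targets = 2.0 * xs + 1.0                # a simple affine stand-in for the target f*
print(loss(params, xs, targets))
```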

🏷️ Gradient Flow and Invariance

The parameters evolve according to the gradient flow dynamics:

$$\dot{a}_i = -\nabla_{a_i} \mathcal{L}, \qquad \dot{w}_i = -\nabla_{w_i} \mathcal{L}.$$

For homogeneous activations such as the ReLU ($\sigma(z) = \max(0, z)$), we observe a fundamental conservation law. Because $\sigma$ is 1-homogeneous, each neuron satisfies $a_i\, \dot{a}_i = w_i \cdot \dot{w}_i$ along the flow, so the difference between the coefficient energy and the kernel energy remains invariant:

$$\frac{d}{dt} \Big( |a_i(t)|^2 - \|w_i(t)\|^2 \Big) = 0, \qquad i = 1, \dots, m.$$

This indicates that the training trajectory is confined to a tensor product of Lorentzian manifolds, each one a level set of the conserved quantity $|a_i|^2 - \|w_i\|^2$.
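A small-step Euler discretization of the flow should preserve this quantity up to the step size. The sketch below checks that the drift stays small; the network sizes, data, and affine target are assumptions made only for the check.

```python
import jax
import jax.numpy as jnp

def model(params, x):
    """Two-layer conv net: sum_i a_i * relu(w_i * x)."""
    a, w = params
    conv = jax.vmap(lambda wi: jnp.convolve(x, wi, mode="same"))(w)
    return jnp.sum(a[:, None] * jax.nn.relu(conv), axis=0)

def loss(params, xs, ys):
    preds = jax.vmap(lambda x: model(params, x))(xs)
    return 0.5 * jnp.mean(jnp.sum((preds - ys) ** 2, axis=-1))

def invariant(params):
    """Per-neuron conserved quantity |a_i|^2 - ||w_i||^2."""
    a, w = params
    return a ** 2 - jnp.sum(w ** 2, axis=-1)

key = jax.random.PRNGKey(1)
ka, kw, kx = jax.random.split(key, 3)
params = (0.1 * jax.random.normal(ka, (8,)), 0.1 * jax.random.normal(kw, (8, 5)))
xs = jax.random.normal(kx, (16, 32))
ys = 2.0 * xs + 1.0                      # assumed affine target for the check

grad_fn = jax.jit(jax.grad(loss))
q0 = invariant(params)
for _ in range(2000):                    # small-step Euler discretization of the flow
    ga, gw = grad_fn(params, xs, ys)
    params = (params[0] - 1e-3 * ga, params[1] - 1e-3 * gw)

# The drift should be small; it vanishes exactly only in the continuous-time limit.
print(jnp.max(jnp.abs(invariant(params) - q0)))
```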

🏷️ Stationary Points and Convergence

The loss function is monotonically non-increasing under gradient flow:

$$\frac{d}{dt}\, \mathcal{L}\big(a(t), w(t)\big) = -\sum_{i} \Big( \big|\nabla_{a_i} \mathcal{L}\big|^2 + \big\|\nabla_{w_i} \mathcal{L}\big\|^2 \Big) \leq 0.$$

The coefficient vector $a = (a_1, \dots, a_m)$ satisfies a non-homogeneous linear evolution:

$$\dot{a} = -\big( G(t)\, a - b(t) \big), \qquad G_{ij} = \big\langle \sigma(w_i * \cdot),\, \sigma(w_j * \cdot) \big\rangle_{L^2(\mathcal{M})}, \qquad b_i = \big\langle \sigma(w_i * \cdot),\, f^* \big\rangle_{L^2(\mathcal{M})},$$

where $G$ is the Gram matrix (Neural Tangent Kernel) of the features $\sigma(w_i * x)$. Convergence depends on the spectral properties of $G$ and on the ability of the basis $\{\sigma(w_i * x)\}_i$ to resolve the target $f^*$.
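The eigenvalues of $G$ set the decay rates of the corresponding modes of $a(t)$. Here is one way to inspect them numerically; the sizes and the sampled data standing in for $\mathcal{M}$ are assumptions for the sketch.

```python
import jax
import jax.numpy as jnp

def features(w, x):
    """phi_i(x) = relu(w_i * x) for a single input x; returns an (m, n) array."""
    return jax.nn.relu(jax.vmap(lambda wi: jnp.convolve(x, wi, mode="same"))(w))

key = jax.random.PRNGKey(2)
kw, kx = jax.random.split(key)
w = 0.1 * jax.random.normal(kw, (32, 5))      # m = 32 kernels of length 5 (assumed sizes)
xs = jax.random.normal(kx, (128, 64))         # samples standing in for the manifold M

phi = jax.vmap(lambda x: features(w, x))(xs)            # (batch, m, n)
phi = phi.reshape(phi.shape[0], phi.shape[1], -1)       # flatten the spatial dimension
G = jnp.einsum("bip,bjp->ij", phi, phi) / phi.shape[0]  # empirical Gram matrix G_ij

eigs = jnp.linalg.eigvalsh(G)                           # ascending eigenvalues of G
print("lambda_min, lambda_max:", eigs[0], eigs[-1])     # slowest / fastest modes of a(t)
```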

🏷️ Spectral Bias in Frequency Retrieval

In the context of image regression, we model the available data as a frequency-truncated (band-limited) signal. The task of the network is then effectively to perform analytic continuation in the Fourier domain to retrieve the missing high-frequency components.

Our analysis suggests a sharp transition in the training timescales:

  • Low Frequencies: Captured rapidly, within a training time bounded by a fixed threshold.
  • High Frequencies: The transition from one frequency band to the next requires a training time that scales exponentially with the size of the frequency gap (see the numerical illustration after this list).
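The sketch below illustrates this behavior in the simplest setting: a small two-layer ReLU network (fully connected here for simplicity, rather than the convolutional model above) is fit to a 1-D signal containing one low and one high frequency, and the residual is tracked per Fourier mode. The frequencies, width, learning rate, and step counts are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

# 1-D regression target mixing a low (k = 2) and a high (k = 20) frequency.
n = 256
t = jnp.linspace(0.0, 1.0, n, endpoint=False)
y = jnp.sin(2 * jnp.pi * 2 * t) + jnp.sin(2 * jnp.pi * 20 * t)

def mlp(params, x):
    """Small two-layer ReLU network evaluated pointwise on the coordinate x."""
    w, b, a = params
    return jax.nn.relu(x[:, None] * w + b) @ a           # (n, m) @ (m,) -> (n,)

def loss(params):
    return 0.5 * jnp.mean((mlp(params, t) - y) ** 2)

def mode_error(params, k):
    """Relative error of the k-th Fourier mode of the residual."""
    res_hat = jnp.fft.fft(mlp(params, t) - y)
    return jnp.abs(res_hat[k]) / (jnp.abs(jnp.fft.fft(y)[k]) + 1e-12)

key = jax.random.PRNGKey(3)
kw, kb, ka = jax.random.split(key, 3)
m = 256
params = (jax.random.normal(kw, (m,)), jax.random.normal(kb, (m,)),
          0.1 * jax.random.normal(ka, (m,)))

grad_fn = jax.jit(jax.grad(loss))
for step in range(10001):
    g = grad_fn(params)
    params = tuple(p - 1e-2 * gi for p, gi in zip(params, g))
    if step % 2000 == 0:
        # The low-frequency residual typically shrinks much sooner than the high one.
        print(step, float(mode_error(params, 2)), float(mode_error(params, 20)))
```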

🏷️ Fourier Representation

Utilizing the Plancherel identity, we analyze the kernel initialization in the dual space. The initial kernels $w_i(0)$ are viewed as random Gaussian fields; their Fourier coefficients $\hat{w}_i(\xi)$ are then themselves Gaussian, with total spectral energy equal to the spatial energy $\|w_i(0)\|^2$.

For high bandwidth, the empirical distribution of these coefficients is governed by the Law of Large Numbers, ensuring that the initial spectrum covers the frequency band occupied by the data.
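A quick numerical check of both points, with an assumed kernel length and count: white Gaussian kernels have, on average, an approximately flat power spectrum, and their spectral energy matches their spatial energy up to the FFT normalization.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(4)
m, k = 4096, 33                              # number of kernels and kernel length (assumed)
w = jax.random.normal(key, (m, k))           # white Gaussian initialization

w_hat = jnp.fft.rfft(w, axis=-1)             # dual-space view of each kernel

# Plancherel check: spectral energy matches spatial energy (up to FFT normalization).
spatial = jnp.sum(w[0] ** 2)
spectral = (jnp.abs(w_hat[0, 0]) ** 2 + 2 * jnp.sum(jnp.abs(w_hat[0, 1:]) ** 2)) / k
print(spatial, spectral)

# Averaged power spectrum: approximately flat, so every frequency in the data band
# receives comparable energy at initialization (Law of Large Numbers over the m kernels).
mean_power = jnp.mean(jnp.abs(w_hat) ** 2, axis=0)
print(mean_power)
```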

🏷️ Notes

  • The Lorentzian structure implies that small initial weights can lead to “exploding” or “vanishing” gradients if not properly balanced.
  • The spectral bias observed here is a specific instance of the “F-Principle”, where neural networks fit low frequencies before high frequencies.

πŸ”— See Also

πŸ“š References