Info

This note is not intended to be a literature review of any kind, nor an original research piece; it is just for fun.

🏷️ Introduction

The ReLU (rectified linear unit) refers to the following operation:

$$\mathrm{ReLU}(x) = \max(x, 0).$$

Let us denote this function by $\sigma$ for short. Trivially, we find that

  • $\sigma''(x) = \delta(x)$ in the distributional sense, which makes $\sigma$ a certain Green’s function (w.r.t. some boundary conditions) for the one-dimensional Laplacian;
  • $\sigma(\lambda x) = \lambda\,\sigma(x)$ for any positive real number $\lambda$.
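As a quick numerical aside (a minimal numpy sketch; the grid, the Gaussian source term, and the interior slice are illustrative choices, not anything from the text above), one can check both properties: positive homogeneity, and the fact that integrating a source against $\sigma$ produces a function whose second derivative recovers the source.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-2.0, 2.0, 401)
h = x[1] - x[0]

# Positive homogeneity: relu(lam * x) == lam * relu(x) for lam > 0.
lam = 3.7
assert np.allclose(relu(lam * x), lam * relu(x))

# Green's-function flavour: u(x) = ∫ relu(x - y) f(y) dy satisfies u'' = f,
# since d^2/dx^2 relu(x - y) = delta(x - y) in the distributional sense.
f = np.exp(-x**2)                                   # illustrative source term
u = np.array([np.sum(relu(xi - x) * f) * h for xi in x])
u_xx = np.gradient(np.gradient(u, h), h)            # discrete second derivative
print(np.max(np.abs(u_xx[5:-5] - f[5:-5])))         # small residual away from the ends
```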

In higher dimensions, the ReLU function is often used with a linear layer:

$$x \mapsto \sigma(w^\top x + b),$$

where the vector/matrix $w$ is the weights, and the scalar/vector $b$ is the bias. The commutativity with (positive) scalar multiplication makes it possible to consider only normalized weights $\|w\| = 1$ or normalized input $\|x\| = 1$.
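As a one-line check of the normalization remark (a minimal numpy sketch with randomly drawn $w$, $b$, $x$), positive homogeneity lets us pull the norm of the weights out of the activation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w, b, x = rng.normal(size=d), rng.normal(), rng.normal(size=d)

relu = lambda t: np.maximum(t, 0.0)
nrm = np.linalg.norm(w)
# sigma(w.x + b) == ||w|| * sigma(w_hat.x + b / ||w||), so WLOG ||w|| = 1
# (the scale can be absorbed into the outer coefficients).
print(np.isclose(relu(w @ x + b), nrm * relu((w / nrm) @ x + b / nrm)))  # True
```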

Definition

A two-layer ReLU network is defined by

$$f(x) = \int_{\mathbb{S}^{d-1}\times\mathbb{R}} a(w, b)\,\sigma(w^\top x + b)\, d\mu(w, b),$$

where $\mu$ is the usual measure on the joint space $\mathbb{S}^{d-1}\times\mathbb{R}$.
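In practice the integral is replaced by a finite sum. Below is a minimal Monte Carlo sketch of this discretization; the choice of $\mu$ as uniform on $\mathbb{S}^{d-1}\times[-1,1]$ and the density $a(w,b)$ are illustrative assumptions, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 10_000                        # input dimension, number of samples

# Sample (w, b) ~ mu: w uniform on the sphere S^{d-1}, b uniform on [-1, 1]
# (the interval for b is an illustrative assumption).
w = rng.normal(size=(m, d))
w /= np.linalg.norm(w, axis=1, keepdims=True)
b = rng.uniform(-1.0, 1.0, size=m)

# An illustrative density a(w, b) on the parameter space.
a = np.cos(np.pi * b) * w[:, 0]

def f(x):
    """Monte Carlo estimate of  ∫ a(w,b) σ(w·x + b) dμ(w,b)  (μ a probability measure)."""
    return np.mean(a * np.maximum(w @ x + b, 0.0))

print(f(np.array([0.3, -0.1, 0.5])))
```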

☘️ Radon transform

For some input $x$, if we let $g(x) = \sigma(w^\top x + b)$ with $\|w\| = 1$, then

$$\nabla_x\, g(x) = w\, H(w^\top x + b), \qquad \Delta_x\, g(x) = \delta(w^\top x + b),$$

where $H$ is the Heaviside function and $\delta$ is the Dirac delta. Therefore, (at least formally or in a weak sense)

$$\Delta f(x) = \int_{\mathbb{S}^{d-1}\times\mathbb{R}} a(w, b)\, \delta(w^\top x + b)\, d\mu(w, b) = \big(\mathcal{R}^{*} a\big)(x),$$

where $\mathcal{R}^{*}$ is the adjoint Radon transform and $a$ is a distribution in the dual space. The intertwining property and the inversion formula are well-known.
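As a crude one-dimensional illustration of the identity above (a numerical sketch only; the width, coefficients, and spike threshold are arbitrary), the distributional Laplacian of a finite ReLU network is a superposition of Dirac masses supported on the “hyperplanes” $\{x: w_k x + b_k = 0\}$, i.e. at the knots $x = -b_k / w_k$:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
a = rng.normal(size=m)
w = rng.choice([-1.0, 1.0], size=m)          # unit "directions" in 1D
b = rng.uniform(-0.8, 0.8, size=m)

x = np.linspace(-1.0, 1.0, 2001)
h = x[1] - x[0]
f = np.sum(a[:, None] * np.maximum(w[:, None] * x[None, :] + b[:, None], 0.0), axis=0)

# f is piecewise linear, so its discrete second derivative is (numerically)
# zero except for spikes approximating the Dirac masses at the knots.
f_xx = np.gradient(np.gradient(f, h), h)
print(np.sort(-b / w))                        # knot locations
print(np.sort(x[np.abs(f_xx) > 1.0]))         # grid points where spikes appear
```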

Intertwining Property

$$\mathcal{R}\,\Delta = \partial_b^{2}\,\mathcal{R}, \qquad \Delta\,\mathcal{R}^{*} = \mathcal{R}^{*}\,\partial_b^{2}.$$

Helgason's Inversion Formula

$$f = c_d\,(-\Delta)^{\frac{d-1}{2}}\,\mathcal{R}^{*}\mathcal{R} f, \qquad c_d \text{ a dimensional constant}.$$

In the Fourier sense, the inversion formula holds for $f$ with decay $f(x) = O(|x|^{-N})$ for some sufficiently large $N$:

$$\hat{f}(\xi) = c_d\, |\xi|^{d-1}\, \widehat{\mathcal{R}^{*}\mathcal{R} f}(\xi).$$

When appropriate, we can write $\mathcal{R}^{*}\mathcal{R} = c_d^{-1}\,(-\Delta)^{-\frac{d-1}{2}}$ in the Fourier sense, where $(-\Delta)^{-\frac{d-1}{2}}$ is a Fourier multiplier with symbol $|\xi|^{-(d-1)}$, and the singular values of this operator are essential in terms of optimization.

Observation

In the Fourier sense,

$$\sigma_k\big(\mathcal{R}^{*}\mathcal{R}\big) \;\asymp\; \lambda_k(-\Delta)^{-\frac{d-1}{2}}.$$

By Weyl’s law ($\lambda_k(-\Delta) \asymp k^{2/d}$ on a bounded domain), this implies that the singular values of the operator decay like $k^{-(d-1)/d}$. This explains the low-pass filtration property of the two-layer ReLU network.
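As a rough numerical counterpart (a minimal sketch, not a computation with the actual operator $\mathcal{R}^{*}\mathcal{R}$: it only looks at the Gram matrix of a one-dimensional ReLU random-feature kernel on a grid), the spectrum indeed drops off quickly, which is the low-pass behaviour referred to above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5000                       # grid points, random features
x = np.linspace(-1.0, 1.0, n)

w = rng.choice([-1.0, 1.0], size=m)    # unit "directions" in 1D
b = rng.uniform(-1.0, 1.0, size=m)
phi = np.maximum(w[None, :] * x[:, None] + b[None, :], 0.0) / np.sqrt(m)

K = phi @ phi.T                        # ≈ E[ σ(wx + b) σ(wx' + b) ]
eig = np.sort(np.linalg.eigvalsh(K))[::-1]
print(eig[:10] / eig[0])               # the leading eigenvalues decay rapidly
```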

🔭 Spectral perspective

There are two common (easy yet boring) configurations within the scope of the spectral perspective.

🧩 Neural-Tangent-Kernel (NTK)

The NTK configuration assumes a sufficiently wide network, which makes the training behavior degenerate to a linear ODE (under gradient flow):

$$\partial_t u_t = -\mathcal{K}\, u_t, \qquad u_t := f_t - f^{*},$$

where $\mathcal{K}$ is the (fixed) NTK integral operator, so that $u_t = e^{-t\mathcal{K}} u_0$.

At first sight, we find that this evolution equation sustains the initial “noises” for a long time: the components of $u_0$ along the eigendirections of $\mathcal{K}$ with small eigenvalues decay very slowly. It also implies that a highly sophisticated initialization (one that contains high frequencies) could be toxic to the training process if those frequencies are not desired. Interestingly, a similar formulation also appears in the note on the subspace spanned by the trajectory, but from a different perspective.
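A minimal sketch of the first point, assuming the diagonalized form of the linear ODE above: along each eigenvector of $\mathcal{K}$ the residual decays like $e^{-\lambda_i t}$, so the directions with tiny eigenvalues (the values below are made up for illustration) are essentially frozen.

```python
import numpy as np

# Diagonalized NTK gradient flow: u_i(t) = exp(-lambda_i * t) * u_i(0).
lam = np.array([1.0, 1e-1, 1e-2, 1e-3])    # illustrative NTK eigenvalues
u0 = np.ones_like(lam)                     # initial residual ("noise") per mode

for t in [0.0, 1.0, 10.0, 100.0]:
    print(t, np.exp(-lam * t) * u0)        # the slowest mode is still ~0.9 at t = 100
```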

🧩 Random Feature (RF)

The RF configuration can be viewed as a discrete version of NTK, but the network does not have to be wide. The discrepancy between the eigenvalues in the continuous and discrete cases can be studied through elementary tools like Weyl’s inequality, or in slightly more sophisticated ways. Since this is not the focus of this post, we leave the topic to a later post; see the note on eigenvalue estimates of kernels.
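A minimal sketch of the kind of comparison Weyl’s inequality gives (the kernels here are built from ReLU features on a 1D grid purely for illustration): $|\lambda_i(K_m) - \lambda_i(K)| \le \|K_m - K\|_{\mathrm{op}}$ for every $i$, so the eigenvalue discrepancy is controlled by a single operator-norm deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m_big, m_rf = 100, 50_000, 2000
x = np.linspace(-1.0, 1.0, n)

def gram(num_features):
    """Gram matrix of a 1D ReLU random-feature map on the grid x."""
    w = rng.choice([-1.0, 1.0], size=num_features)
    b = rng.uniform(-1.0, 1.0, size=num_features)
    phi = np.maximum(w[None, :] * x[:, None] + b[None, :], 0.0) / np.sqrt(num_features)
    return phi @ phi.T

K = gram(m_big)       # proxy for the "continuous" kernel
Km = gram(m_rf)       # small random-feature kernel

# Weyl's inequality: |lambda_i(Km) - lambda_i(K)| <= ||Km - K||_op for all i.
gap = np.max(np.abs(np.linalg.eigvalsh(Km) - np.linalg.eigvalsh(K)))
print(gap, "<=", np.linalg.norm(Km - K, 2))
```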

💡 Training dynamics

A central topic around neural networks is the quantification of the generalization error (its statistical, approximation, and training components). Mathematically, the only challenging term to bound is the training error, which demands a comprehensive exploration of the training dynamics.

🌡 Preliminaries

We consider a generic formulation in one dimension:

$$f(x; \theta) = \sum_{k=1}^{m} a_k\, \sigma(w_k x + b_k), \qquad \theta = (a_k, w_k, b_k)_{k=1}^{m}.$$

The biases are initialized uniformly (or equispaced) on the domain; the weights’ initialization will be discussed later. Suppose the target function is denoted by $f^{*}$, and we naively consider the gradient flow

$$\dot{\theta}(t) = -\nabla_{\theta}\, \mathcal{L}(\theta(t)), \qquad \mathcal{L}(\theta) = \tfrac{1}{2}\int \big(f(x;\theta) - f^{*}(x)\big)^{2}\, dx.$$
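Here is a minimal numpy sketch of this setup (the target $\sin(\pi x)$, the interval $[-1,1]$, the width, the step size, and the $\pm 1$ weight initialization are all illustrative assumptions); the gradient flow is approximated by explicit Euler steps on all parameters, with an empirical mean in place of the integral.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 64, 256                                  # width, number of grid points
x = np.linspace(-1.0, 1.0, n)
y = np.sin(np.pi * x)                           # illustrative target f*

a = np.zeros(m)                                 # outer coefficients
w = rng.choice([-1.0, 1.0], size=m)             # inner weights (initialization discussed later)
b = np.linspace(-1.0, 1.0, m)                   # biases equispaced on the domain

lr, T = 1e-2, 20_000                            # Euler step size and number of steps
for _ in range(T):
    pre = w[None, :] * x[:, None] + b[None, :]  # (n, m) pre-activations
    act = np.maximum(pre, 0.0)
    r = act @ a - y                             # residual f(x; theta) - f*(x)
    grad_a = (2 / n) * (act.T @ r)
    grad_w = (2 / n) * a * (((pre > 0) * x[:, None]).T @ r)
    grad_b = (2 / n) * a * ((pre > 0).T @ r)
    a -= lr * grad_a
    w -= lr * grad_w
    b -= lr * grad_b

print(np.sqrt(np.mean(r ** 2)))                 # training RMSE after T Euler steps
```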

We introduce the auxiliary function such that

Then, taking account of the ReLU function, it is quite easy to derive:

Observation

where are the eigenfunctions of the following problem:

💬 Further discussion

🌊 Initialization

Notes