Info

This note is not intended to be a literature review of any kind, nor an original research piece; it is just for fun.

🏷️ Introduction

The ReLU (rectified linear unit) refers to the following operation:

$$\mathrm{ReLU}(x) = \max(x, 0).$$

Let us denote this function by $\sigma$ for short. Trivially, we find that

  • $\sigma''(x) = \delta(x)$ in the distributional sense, which makes $\sigma$ a certain Green’s function (w.r.t. some boundary conditions) for the one-dimensional Laplacian;
  • $\sigma(\lambda x) = \lambda\,\sigma(x)$ for any positive real number $\lambda$.
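As a quick numerical aside (a minimal numpy sketch; the grid, the Gaussian source term, and the interior slice are illustrative choices, not anything from the text above), one can check both properties: positive homogeneity, and the fact that integrating a source against $\sigma$ produces a function whose second derivative recovers the source.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-2.0, 2.0, 401)
h = x[1] - x[0]

# Positive homogeneity: relu(lam * x) == lam * relu(x) for lam > 0.
lam = 3.7
assert np.allclose(relu(lam * x), lam * relu(x))

# Green's-function flavour: u(x) = ∫ relu(x - y) f(y) dy satisfies u'' = f,
# since d^2/dx^2 relu(x - y) = delta(x - y) in the distributional sense.
f = np.exp(-x**2)                                   # illustrative source term
u = np.array([np.sum(relu(xi - x) * f) * h for xi in x])
u_xx = np.gradient(np.gradient(u, h), h)            # discrete second derivative
print(np.max(np.abs(u_xx[5:-5] - f[5:-5])))         # small residual away from the ends
```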

In higher dimensions, the ReLU function is often used with a linear layer:

$$x \mapsto \sigma(w^\top x + b),$$

where the vector/matrix $w$ is the weights, and the scalar/vector $b$ is the bias. The commutativity with (positive) scalar multiplication makes it possible to consider only normalized weights $\|w\| = 1$ or normalized input $\|x\| = 1$.
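As a one-line check of the normalization remark (a minimal numpy sketch with randomly drawn $w$, $b$, $x$), positive homogeneity lets us pull the norm of the weights out of the activation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w, b, x = rng.normal(size=d), rng.normal(), rng.normal(size=d)

relu = lambda t: np.maximum(t, 0.0)
nrm = np.linalg.norm(w)
# sigma(w.x + b) == ||w|| * sigma(w_hat.x + b / ||w||), so WLOG ||w|| = 1
# (the scale can be absorbed into the outer coefficients).
print(np.isclose(relu(w @ x + b), nrm * relu((w / nrm) @ x + b / nrm)))  # True
```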

Definition

A two-layer ReLU network is defined by

$$f(x) = \int_{\mathbb{S}^{d-1}\times\mathbb{R}} a(w, b)\,\sigma(w^\top x + b)\, d\mu(w, b),$$

where $\mu$ is the usual measure on the joint space $\mathbb{S}^{d-1}\times\mathbb{R}$.
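In practice the integral is replaced by a finite sum. Below is a minimal Monte Carlo sketch of this discretization; the choice of $\mu$ as uniform on $\mathbb{S}^{d-1}\times[-1,1]$ and the density $a(w,b)$ are illustrative assumptions, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 10_000                        # input dimension, number of samples

# Sample (w, b) ~ mu: w uniform on the sphere S^{d-1}, b uniform on [-1, 1]
# (the interval for b is an illustrative assumption).
w = rng.normal(size=(m, d))
w /= np.linalg.norm(w, axis=1, keepdims=True)
b = rng.uniform(-1.0, 1.0, size=m)

# An illustrative density a(w, b) on the parameter space.
a = np.cos(np.pi * b) * w[:, 0]

def f(x):
    """Monte Carlo estimate of  ∫ a(w,b) σ(w·x + b) dμ(w,b)  (μ a probability measure)."""
    return np.mean(a * np.maximum(w @ x + b, 0.0))

print(f(np.array([0.3, -0.1, 0.5])))
```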

☘️ Radon transform

For some input $x$, if we let $g(x) = \sigma(w^\top x + b)$ with $\|w\| = 1$, then

$$\nabla_x\, g(x) = w\, H(w^\top x + b), \qquad \Delta_x\, g(x) = \delta(w^\top x + b),$$

where $H$ is the Heaviside function and $\delta$ is the Dirac delta. Therefore, (at least formally or in a weak sense)

$$\Delta f(x) = \int_{\mathbb{S}^{d-1}\times\mathbb{R}} a(w, b)\, \delta(w^\top x + b)\, d\mu(w, b) = \big(\mathcal{R}^{*} a\big)(x),$$

where $\mathcal{R}^{*}$ is the adjoint Radon transform and $a$ is a distribution in the dual space. The intertwining property and the inversion formula are well-known.
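As a crude one-dimensional illustration of the identity above (a numerical sketch only; the width, coefficients, and spike threshold are arbitrary), the distributional Laplacian of a finite ReLU network is a superposition of Dirac masses supported on the “hyperplanes” $\{x: w_k x + b_k = 0\}$, i.e. at the knots $x = -b_k / w_k$:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
a = rng.normal(size=m)
w = rng.choice([-1.0, 1.0], size=m)          # unit "directions" in 1D
b = rng.uniform(-0.8, 0.8, size=m)

x = np.linspace(-1.0, 1.0, 2001)
h = x[1] - x[0]
f = np.sum(a[:, None] * np.maximum(w[:, None] * x[None, :] + b[:, None], 0.0), axis=0)

# f is piecewise linear, so its discrete second derivative is (numerically)
# zero except for spikes approximating the Dirac masses at the knots.
f_xx = np.gradient(np.gradient(f, h), h)
print(np.sort(-b / w))                        # knot locations
print(np.sort(x[np.abs(f_xx) > 1.0]))         # grid points where spikes appear
```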

Intertwining Property

$$\mathcal{R}\,\Delta = \partial_b^{2}\,\mathcal{R}, \qquad \Delta\,\mathcal{R}^{*} = \mathcal{R}^{*}\,\partial_b^{2}.$$

Helgason's Inversion Formula

$$f = c_d\,(-\Delta)^{\frac{d-1}{2}}\,\mathcal{R}^{*}\mathcal{R} f, \qquad c_d \text{ a dimensional constant}.$$

In the Fourier sense, the inversion formula holds for $f$ with decay $f(x) = O(|x|^{-N})$ for some sufficiently large $N$:

$$\hat{f}(\xi) = c_d\, |\xi|^{d-1}\, \widehat{\mathcal{R}^{*}\mathcal{R} f}(\xi).$$

When appropriate, we can write $\mathcal{R}^{*}\mathcal{R} = c_d^{-1}\,(-\Delta)^{-\frac{d-1}{2}}$ in the Fourier sense, where $(-\Delta)^{-\frac{d-1}{2}}$ is a Fourier multiplier with symbol $|\xi|^{-(d-1)}$, and the singular values of this operator are essential in terms of optimization.

Observation

In the Fourier sense,

$$\sigma_k\big(\mathcal{R}^{*}\mathcal{R}\big) \;\asymp\; \lambda_k(-\Delta)^{-\frac{d-1}{2}}.$$

By Weyl’s law ($\lambda_k(-\Delta) \asymp k^{2/d}$ on a bounded domain), this implies that the singular values of the operator decay like $k^{-(d-1)/d}$. This explains the low-pass filtration property of the two-layer ReLU network.
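As a rough numerical counterpart (a minimal sketch, not a computation with the actual operator $\mathcal{R}^{*}\mathcal{R}$: it only looks at the Gram matrix of a one-dimensional ReLU random-feature kernel on a grid), the spectrum indeed drops off quickly, which is the low-pass behaviour referred to above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5000                       # grid points, random features
x = np.linspace(-1.0, 1.0, n)

w = rng.choice([-1.0, 1.0], size=m)    # unit "directions" in 1D
b = rng.uniform(-1.0, 1.0, size=m)
phi = np.maximum(w[None, :] * x[:, None] + b[None, :], 0.0) / np.sqrt(m)

K = phi @ phi.T                        # ≈ E[ σ(wx + b) σ(wx' + b) ]
eig = np.sort(np.linalg.eigvalsh(K))[::-1]
print(eig[:10] / eig[0])               # the leading eigenvalues decay rapidly
```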

🔭 Spectral perspective

There are two common (easy yet boring) configurations within the scope of the spectral perspective.

🧩 Neural-Tangent-Kernel (NTK)

The NTK configuration assumes a sufficiently wide network, which makes the training behavior degenerate to a linear ODE (under gradient flow):

$$\partial_t u_t = -\mathcal{K}\, u_t, \qquad u_t := f_t - f^{*},$$

where $\mathcal{K}$ is the (fixed) NTK integral operator, so that $u_t = e^{-t\mathcal{K}} u_0$.

At first sight, we find that this evolution equation sustains the initial “noises” for a long time: the components of $u_0$ along the eigendirections of $\mathcal{K}$ with small eigenvalues decay very slowly. It also implies that a highly sophisticated initialization (one that contains high frequencies) could be toxic to the training process if those frequencies are not desired. Interestingly, a similar formulation also appears in the note on the subspace spanned by the trajectory, but from a different perspective.
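A minimal sketch of the first point, assuming the diagonalized form of the linear ODE above: along each eigenvector of $\mathcal{K}$ the residual decays like $e^{-\lambda_i t}$, so the directions with tiny eigenvalues (the values below are made up for illustration) are essentially frozen.

```python
import numpy as np

# Diagonalized NTK gradient flow: u_i(t) = exp(-lambda_i * t) * u_i(0).
lam = np.array([1.0, 1e-1, 1e-2, 1e-3])    # illustrative NTK eigenvalues
u0 = np.ones_like(lam)                     # initial residual ("noise") per mode

for t in [0.0, 1.0, 10.0, 100.0]:
    print(t, np.exp(-lam * t) * u0)        # the slowest mode is still ~0.9 at t = 100
```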

🧩 Random Feature (RF)

The RF configuration can be viewed as a discrete version of NTK, but the network does not have to be wide. The discrepancy between the eigenvalues in the continuous and discrete cases can be studied through elementary tools like Weyl’s inequality, or in slightly more sophisticated ways. Since this is not the focus of this post, we leave the topic to a later post; see the note on eigenvalue estimates of kernels.
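A minimal sketch of the kind of comparison Weyl’s inequality gives (the kernels here are built from ReLU features on a 1D grid purely for illustration): $|\lambda_i(K_m) - \lambda_i(K)| \le \|K_m - K\|_{\mathrm{op}}$ for every $i$, so the eigenvalue discrepancy is controlled by a single operator-norm deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m_big, m_rf = 100, 50_000, 2000
x = np.linspace(-1.0, 1.0, n)

def gram(num_features):
    """Gram matrix of a 1D ReLU random-feature map on the grid x."""
    w = rng.choice([-1.0, 1.0], size=num_features)
    b = rng.uniform(-1.0, 1.0, size=num_features)
    phi = np.maximum(w[None, :] * x[:, None] + b[None, :], 0.0) / np.sqrt(num_features)
    return phi @ phi.T

K = gram(m_big)       # proxy for the "continuous" kernel
Km = gram(m_rf)       # small random-feature kernel

# Weyl's inequality: |lambda_i(Km) - lambda_i(K)| <= ||Km - K||_op for all i.
gap = np.max(np.abs(np.linalg.eigvalsh(Km) - np.linalg.eigvalsh(K)))
print(gap, "<=", np.linalg.norm(Km - K, 2))
```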

💡 Training dynamics

A central topic around neural networks is the quantification of the generalization error (its statistical, approximation, and training components). Mathematically, the only challenging term to bound is the training error, which demands a comprehensive exploration of the training dynamics.

🌡 Preliminaries

We consider a generic formulation in one dimension:

$$f(x; \theta) = \sum_{k=1}^{m} a_k\, \sigma(w_k x + b_k), \qquad \theta = (a_k, w_k, b_k)_{k=1}^{m}.$$

The biases are initialized uniformly (or equispaced) on the domain; the weights’ initialization will be discussed later. Suppose the target function is denoted by $f^{*}$, and we naively consider the gradient flow

$$\dot{\theta}(t) = -\nabla_{\theta}\, \mathcal{L}(\theta(t)), \qquad \mathcal{L}(\theta) = \tfrac{1}{2}\int \big(f(x;\theta) - f^{*}(x)\big)^{2}\, dx.$$
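Here is a minimal numpy sketch of this setup (the target $\sin(\pi x)$, the interval $[-1,1]$, the width, the step size, and the $\pm 1$ weight initialization are all illustrative assumptions); the gradient flow is approximated by explicit Euler steps on all parameters, with an empirical mean in place of the integral.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 64, 256                                  # width, number of grid points
x = np.linspace(-1.0, 1.0, n)
y = np.sin(np.pi * x)                           # illustrative target f*

a = np.zeros(m)                                 # outer coefficients
w = rng.choice([-1.0, 1.0], size=m)             # inner weights (initialization discussed later)
b = np.linspace(-1.0, 1.0, m)                   # biases equispaced on the domain

lr, T = 1e-2, 20_000                            # Euler step size and number of steps
for _ in range(T):
    pre = w[None, :] * x[:, None] + b[None, :]  # (n, m) pre-activations
    act = np.maximum(pre, 0.0)
    r = act @ a - y                             # residual f(x; theta) - f*(x)
    grad_a = (2 / n) * (act.T @ r)
    grad_w = (2 / n) * a * (((pre > 0) * x[:, None]).T @ r)
    grad_b = (2 / n) * a * ((pre > 0).T @ r)
    a -= lr * grad_a
    w -= lr * grad_w
    b -= lr * grad_b

print(np.sqrt(np.mean(r ** 2)))                 # training RMSE after T Euler steps
```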

We introduce the auxiliary function such that

Then, taking account of the ReLU function, it is quite easy to derive:

Observation

where are the eigenfunctions of the following problem:

💬 Further discussion

🌊 Initialization

Notes