08-3 Gaussian Processes and the Neural Network Gaussian Process (NNGP) Correspondence¶
1. Introduction: The Function-Space Perspective of Deep Learning¶
Historically, deep learning and kernel methods were viewed as competing paradigms. Deep learning models, particularly neural networks, are parametric models. We define an architecture governed by a finite (though large) set of weights and biases \(\theta \in \mathbb{R}^D\), and we optimize these parameters to minimize a loss function. This parameter-space perspective is intuitive for optimization but often mathematically opaque. It is exceedingly difficult to answer questions like: "What is the prior distribution over functions induced by initializing a ResNet-50 with He initialization?" or "How does the uncertainty of this model behave far from the training data?"
Conversely, Gaussian Processes (GPs) and Support Vector Machines (SVMs) are non-parametric kernel methods. They operate directly in function space. Instead of searching for parameters, we specify a covariance function (a kernel) that dictates how similar two inputs are, and we perform exact Bayesian inference over the functions themselves. GPs offer rigorous uncertainty quantification, elegant mathematical tractability, and a clear interpretation of priors. However, standard GPs with simple kernels (like the Radial Basis Function) struggle with high-dimensional structured data like images and text, where deep learning excels.
The Neural Network Gaussian Process (NNGP) correspondence, first discovered by Radford Neal in 1994 for single-layer networks and generalized to deep networks by Lee et al. and Matthews et al. in 2018, unites these two worlds. It proves that an infinitely wide neural network, initialized with random weights, is exactly equivalent to a Gaussian Process. This profound result implies that deep learning architectures (the arrangement of linear layers, convolutions, and activations) are essentially complex, hierarchical methods for constructing kernels.
This lecture will provide an exhaustive mathematical derivation of Gaussian Processes, Mercer's theorem, and the NNGP limit. We will explore how to compute the NNGP kernel for Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs), and we will dissect the philosophical implications for the future of Bayesian Deep Learning.
2. Foundations of Gaussian Processes¶
Before we can bridge the gap to neural networks, we must build a rigorous foundation of Gaussian Processes from the ground up.
2.1 The Definition of a Gaussian Process¶
A Gaussian Process is a generalization of the multivariate Gaussian distribution to infinite dimensions. While a multivariate Gaussian defines a distribution over vectors, a GP defines a distribution over functions.
Definition 2.1 (Gaussian Process)
A stochastic process \(\{f(x) : x \in \mathcal{X}\}\) indexed by an input space \(\mathcal{X}\) is a Gaussian Process if and only if, for any finite collection of input points \(X = \{x_1, x_2, \dots, x_N\} \subset \mathcal{X}\), the joint distribution of the function values \(\mathbf{f} = [f(x_1), f(x_2), \dots, f(x_N)]^T\) is a multivariate Gaussian distribution.
A GP is completely specified by two functions:

1. The Mean Function \(m(x) = \mathbb{E}[f(x)]\): this represents our expected value of the function before seeing any data. In most applications, we assume \(m(x) = 0\) without loss of generality, because if we have a known mean, we can simply subtract it from the data, perform zero-mean GP regression, and add it back at the end.

2. The Covariance Function (Kernel) \(k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]\): this function dictates the shape, smoothness, and properties of the functions. If \(x\) and \(x'\) are "close" according to the kernel, then \(f(x)\) and \(f(x')\) will be highly correlated.
We denote this as:

$$ f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big) $$
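To make the definition concrete, the following minimal sketch (an illustrative addition, assuming an RBF kernel) draws sample functions from a zero-mean GP prior by sampling the corresponding finite-dimensional Gaussian on a dense grid of inputs.

import numpy as np

# A minimal sketch: sampling functions from a zero-mean GP prior with an RBF kernel.
# Any finite grid of inputs is jointly Gaussian, so we draw from that multivariate normal.
rng = np.random.default_rng(0)
x_grid = np.linspace(-5, 5, 200)
K_prior = np.exp(-0.5 * (x_grid[:, None] - x_grid[None, :])**2)  # k(x, x') = exp(-(x - x')^2 / 2)
K_prior += 1e-8 * np.eye(len(x_grid))                            # jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(len(x_grid)), K_prior, size=3)
print(samples.shape)  # (3, 200): each row is one function drawn from the GP prior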
2.2 Positive Semi-Definiteness and Validity of Kernels¶
Not every function \(k(x, x')\) can serve as a valid kernel. Because \(k\) must construct a covariance matrix \(\mathbf{K}\) for any arbitrary set of points \(X\), the resulting matrix \(\mathbf{K}\) must always be a valid covariance matrix. By the laws of probability, covariance matrices must be symmetric and Positive Semi-Definite (PSD).
Definition 2.2 (Positive Semi-Definite Kernel)
A symmetric function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is PSD if, for any \(N \in \mathbb{N}\), any set of points \(\{x_1, \dots, x_N\} \subset \mathcal{X}\), and any set of real coefficients \(\{c_1, \dots, c_N\} \subset \mathbb{R}\), the following inequality holds:

$$ \sum_{i=1}^N \sum_{j=1}^N c_i\, c_j\, k(x_i, x_j) \geq 0 $$
This is equivalent to stating that the quadratic form \(\mathbf{c}^T \mathbf{K} \mathbf{c} \geq 0\) for all vectors \(\mathbf{c}\), meaning all eigenvalues of \(\mathbf{K}\) are non-negative. If a kernel satisfies this property, it is guaranteed to produce valid Gaussian distributions.
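As a quick sanity check (a minimal sketch, using the RBF kernel as an illustrative example), we can verify numerically that a Gram matrix built from a PSD kernel has no negative eigenvalues:

import numpy as np

# A minimal sketch: confirm that the RBF Gram matrix on random points has
# non-negative eigenvalues (up to floating-point round-off).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 2))  # 50 random points in R^2

def rbf_kernel(A, B, lengthscale=1.0):
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq_dists / lengthscale**2)

K = rbf_kernel(X, X)
eigvals = np.linalg.eigvalsh(K)               # symmetric matrix -> real eigenvalues
print("smallest eigenvalue:", eigvals.min())  # >= 0 up to round-off error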
2.3 Mercer's Theorem: The Feature Space Perspective¶
To truly understand what a kernel is doing, we turn to Mercer's Theorem. This theorem establishes a dual perspective: operating with a kernel is mathematically equivalent to mapping the inputs into an (often infinite-dimensional) feature space and performing a linear inner product.
Theorem 2.1 (Mercer's Theorem)
Let \(\mathcal{X}\) be a compact metric space, and let \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) be a continuous, symmetric, positive semi-definite kernel. There exists an orthonormal basis of eigenfunctions \(\{\phi_i(x)\}_{i=1}^\infty\) in \(L^2(\mathcal{X})\) and corresponding non-negative eigenvalues \(\{\lambda_i\}_{i=1}^\infty\) such that:

$$ k(x, x') = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(x') $$
where the series converges absolutely and uniformly on \(\mathcal{X} \times \mathcal{X}\).
Proof of Theorem 2.1 (Rigorous Derivation):
- Integral Operator Formulation: We define a linear integral operator \(T_k: L^2(\mathcal{X}) \to L^2(\mathcal{X})\) as follows:
$$ (T_k f)(x) = \int_{\mathcal{X}} k(x, y) f(y) \, d\mu(y) $$
where \(\mu\) is a finite Borel measure on \(\mathcal{X}\).
- Properties of the Operator:
- Compactness: Because the kernel \(k(x, y)\) is continuous on the compact domain \(\mathcal{X} \times \mathcal{X}\), it is bounded. This guarantees that \(T_k\) maps bounded sets in \(L^2(\mathcal{X})\) to relatively compact sets, meaning \(T_k\) is a compact operator.
- Self-Adjointness: Since the kernel is symmetric (\(k(x, y) = k(y, x)\)), it follows that for any two functions \(f, g \in L^2(\mathcal{X})\), the inner product satisfies \(\langle T_k f, g \rangle = \langle f, T_k g \rangle\). Thus, \(T_k\) is a self-adjoint operator.
- Positivity: Because the kernel is PSD, the operator is positive:
$$ \langle f, T_k f \rangle = \int_{\mathcal{X}} \int_{\mathcal{X}} k(x, y) f(x) f(y) \, d\mu(x) d\mu(y) \geq 0 $$
- Spectral Theorem for Compact Self-Adjoint Operators: By the spectral theorem, any compact, self-adjoint operator on a Hilbert space possesses an orthonormal basis of eigenfunctions \(\{\phi_i\}\) corresponding to real eigenvalues \(\lambda_i\). Since the operator is positive, all \(\lambda_i \geq 0\). Furthermore, because \(T_k\) is compact, the sequence of eigenvalues must converge to zero: \(\lim_{i \to \infty} \lambda_i = 0\).
- Uniform Convergence (Mercer's Contribution): The spectral theorem alone only gives us convergence in the \(L^2\) norm. Mercer's specific contribution was proving that because \(k\) is continuous and PSD, the spectral decomposition converges absolutely and uniformly. We define the partial sums:
$$ k_N(x, y) = \sum_{i=1}^N \lambda_i \phi_i(x) \phi_i(y) $$
The sequence \(R_N(x, x) = k(x, x) - k_N(x, x)\) is a decreasing sequence of continuous, non-negative functions converging point-wise to 0, so by Dini's Theorem (a monotonic sequence of continuous functions converging point-wise to a continuous limit on a compact space converges uniformly) the convergence on the diagonal is uniform. Absolute and uniform convergence on all of \(\mathcal{X} \times \mathcal{X}\) then follows by bounding the off-diagonal tails with the Cauchy-Schwarz inequality. \(\blacksquare\)
The Infinite-Dimensional Linear Model: Mercer's Theorem provides a stunning reinterpretation of Gaussian Processes. Let's define an infinite-dimensional feature vector for an input \(x\):

$$ \Phi(x) = \left[ \sqrt{\lambda_1}\,\phi_1(x),\ \sqrt{\lambda_2}\,\phi_2(x),\ \sqrt{\lambda_3}\,\phi_3(x),\ \dots \right]^T $$

Then the kernel is exactly the dot product in this space: \(k(x, x') = \Phi(x)^T \Phi(x')\). Now, consider a linear model with infinite parameters \(w_i \sim \mathcal{N}(0, 1)\):

$$ f(x) = \sum_{i=1}^{\infty} w_i \sqrt{\lambda_i}\, \phi_i(x) = \mathbf{w}^T \Phi(x) $$

Because linear combinations of Gaussians are Gaussian, \(f(x)\) is a Gaussian Process. Its mean is zero, and its covariance is:

$$ \mathbb{E}[f(x)\, f(x')] = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(x') = k(x, x') $$
Thus, every Gaussian Process is equivalent to a Bayesian linear regression model with an infinite number of basis functions.
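The sketch below illustrates this equivalence for a finite truncation. The basis functions are arbitrary RBF bumps chosen purely for illustration (not the actual Mercer eigenfunctions); the point is only that a Bayesian linear model with standard Gaussian weights has covariance \(\Phi(x)^T \Phi(x')\).

import numpy as np

# A minimal sketch: a finite Bayesian linear model f(x) = sum_i w_i phi_i(x) with
# w_i ~ N(0, 1) has Cov[f(x), f(x')] = Phi(x) . Phi(x'). The basis (RBF bumps)
# is an illustrative, hypothetical choice.
rng = np.random.default_rng(0)
centers = np.linspace(-3, 3, 30)

def features(x):
    return np.exp(-0.5 * (x - centers)**2)       # Phi(x) in R^30

x_a, x_b = 0.5, 1.2
analytic_cov = features(x_a) @ features(x_b)     # Phi(x_a) . Phi(x_b)

W = rng.standard_normal((200000, len(centers)))  # many independent prior draws of w
f_a = W @ features(x_a)                          # f(x_a) under each draw
f_b = W @ features(x_b)
print("Phi(x_a).Phi(x_b):", analytic_cov)
print("Monte Carlo cov:  ", np.mean(f_a * f_b))  # agrees up to sampling error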
2.4 Exact Bayesian Inference with GPs¶
The primary advantage of GPs is that Bayesian inference can be carried out exactly via linear algebra, without the need for MCMC or Variational Inference.
Suppose we have a dataset \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\). We assume the targets are corrupted by i.i.d. Gaussian noise: \(y_i = f(\mathbf{x}_i) + \epsilon_i\), where \(\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)\). Let \(\mathbf{X}\) be the matrix of training inputs and \(\mathbf{y}\) be the vector of training targets.
We want to predict the function value \(f_*\) at a new test point \(\mathbf{x}_*\). By the definition of a GP, the joint distribution of the training targets \(\mathbf{y}\) and the test prediction \(f_*\) is a multivariate Gaussian:

$$ \begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0},\ \begin{bmatrix} \mathbf{K} + \sigma_n^2 \mathbf{I} & \mathbf{k}_* \\ \mathbf{k}_*^T & k(\mathbf{x}_*, \mathbf{x}_*) \end{bmatrix} \right) $$

where \(\mathbf{K}\) is the \(N \times N\) matrix with entries \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\) and \(\mathbf{k}_*\) is the \(N\)-vector with entries \(k(\mathbf{x}_i, \mathbf{x}_*)\).
To find the posterior predictive distribution \(P(f_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_*)\), we apply the standard conditioning rules for multivariate Gaussians:

$$ f_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}(\mu_*, \sigma_*^2) $$

Where the predictive mean is:

$$ \mu_* = \mathbf{k}_*^T \left[ \mathbf{K} + \sigma_n^2 \mathbf{I} \right]^{-1} \mathbf{y} $$

And the predictive variance (representing epistemic uncertainty) is:

$$ \sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T \left[ \mathbf{K} + \sigma_n^2 \mathbf{I} \right]^{-1} \mathbf{k}_* $$
The inversion of the \(N \times N\) matrix \([\mathbf{K} + \sigma_n^2 \mathbf{I}]\) requires \(O(N^3)\) operations. This cubic scaling is the primary computational bottleneck of Gaussian Processes, preventing their direct application to datasets with millions of examples.
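The predictive equations above translate directly into a few lines of linear algebra. The sketch below is a minimal implementation (using a Cholesky factorisation rather than an explicit inverse, a standard choice for numerical stability) applied to the two-point dataset of Worked Example 1 below.

import numpy as np

# A minimal sketch of exact GP regression with an RBF kernel, following the
# predictive mean/variance equations above. The Cholesky route avoids forming
# the inverse explicitly but has the same O(N^3) cost.
def rbf(A, B):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2)

def gp_predict(X_train, y_train, X_test, noise_var=0.01):
    K = rbf(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = rbf(X_test, X_train)                              # (N_test, N_train)
    L = np.linalg.cholesky(K)                                  # O(N^3): the dominant cost
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # [K + sigma^2 I]^{-1} y
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = rbf(X_test, X_test).diagonal() - np.sum(v**2, axis=0)
    return mean, var

X_tr, y_tr = np.array([1.0, 3.0]), np.array([0.5, 2.0])
mu, var = gp_predict(X_tr, y_tr, np.array([2.0]))
print(mu, var)   # matches Worked Example 1 below (approx. 1.3239 and 0.3576)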
3. Neal's Theorem: The Single-Layer NNGP Limit¶
In 1994, Radford Neal demonstrated that a neural network with a single hidden layer, initialized with random weights, converges to a Gaussian Process as the width of the hidden layer goes to infinity.
3.1 Network Architecture and Assumptions¶
Consider a Multi-Layer Perceptron (MLP) with an input dimension \(d\), a single hidden layer of width \(H\), and a scalar output \(f(x)\). The function is defined as:

$$ f(x) = b + \frac{1}{\sqrt{H}} \sum_{j=1}^{H} v_j\, \phi\!\left( \mathbf{w}_j^T x + b_j \right) $$
Here, \(\phi(\cdot)\) is a non-linear activation function (e.g., tanh, sigmoid, ReLU). The scaling factor \(1/\sqrt{H}\) is crucial; it is known as "NTK parameterization" and ensures the variance of the output remains finite as \(H \to \infty\).
We assign independent Gaussian priors to all parameters:
- Input weights: \(\mathbf{w}_j \sim \mathcal{N}(\mathbf{0}, \frac{\sigma_w^2}{d} \mathbf{I})\)
- Hidden biases: \(b_j \sim \mathcal{N}(0, \sigma_b^2)\)
- Output weights: \(v_j \sim \mathcal{N}(0, \sigma_v^2)\)
- Output bias: \(b \sim \mathcal{N}(0, \sigma_{b,out}^2)\)
3.2 Proof of the Infinite-Width Limit¶
Theorem 3.1 (Neal's Limit)
As the hidden layer width \(H \to \infty\), the stochastic process defined by the neural network \(f(x)\) converges in distribution to a Gaussian Process with mean zero and covariance function \(K(x, x')\).
Proof of Theorem 3.1:
Step 1: Independence of Hidden Units Let the pre-activation of the \(j\)-th hidden unit be \(g_j(x) = \mathbf{w}_j^T x + b_j\). Because the input weights \(\mathbf{w}_j\) and biases \(b_j\) are drawn independently from identically distributed Gaussian priors, the pre-activations \(g_1(x), g_2(x), \dots, g_H(x)\) are independent and identically distributed (i.i.d.) random variables for any fixed input \(x\). Consequently, the post-activations \(h_j(x) = \phi(g_j(x))\) are also i.i.d. random variables.
Step 2: Defining the Summands We can rewrite the network output as:

$$ f(x) = b + \sum_{j=1}^{H} \zeta_j(x) $$
where we define \(\zeta_j(x) = \frac{1}{\sqrt{H}} v_j h_j(x)\). Because the output weights \(v_j\) are i.i.d. and independent of the hidden layer parameters, the terms \(\zeta_1(x), \dots, \zeta_H(x)\) are mutually independent and identically distributed.
Step 3: Calculating the Mean The expected value of each summand is:

$$ \mathbb{E}[\zeta_j(x)] = \frac{1}{\sqrt{H}}\, \mathbb{E}[v_j\, h_j(x)] = \frac{1}{\sqrt{H}}\, \mathbb{E}[v_j]\, \mathbb{E}[h_j(x)] $$

Since \(v_j\) is independent of \(h_j(x)\) and the prior on \(v_j\) has zero mean (\(\mathbb{E}[v_j] = 0\)), the expected value of \(\zeta_j(x)\) is zero. Therefore, the expected value of the network output is:

$$ \mathbb{E}[f(x)] = \mathbb{E}[b] + \sum_{j=1}^{H} \mathbb{E}[\zeta_j(x)] = 0 $$
Step 4: Calculating the Covariance (The NNGP Kernel) We evaluate the covariance between the network outputs at two different inputs, \(x\) and \(x'\):

$$ K(x, x') = \mathbb{E}[f(x)\, f(x')] = \mathbb{E}\!\left[ \left( b + \sum_{j=1}^{H} \zeta_j(x) \right) \left( b + \sum_{k=1}^{H} \zeta_k(x') \right) \right] $$

Expanding the product and using the linearity of expectation (the cross terms vanish because \(b\) is independent of every \(\zeta_j\) and both have zero mean):

$$ K(x, x') = \mathbb{E}[b^2] + \sum_{j=1}^{H} \sum_{k=1}^{H} \mathbb{E}[\zeta_j(x)\, \zeta_k(x')] $$

Because \(b \sim \mathcal{N}(0, \sigma_{b,out}^2)\), \(\mathbb{E}[b^2] = \sigma_{b,out}^2\). For the double sum, we use the fact that for \(j \neq k\), \(\zeta_j\) and \(\zeta_k\) are independent with mean zero, so \(\mathbb{E}[\zeta_j(x) \zeta_k(x')] = 0\). Only the diagonal terms \(j=k\) survive:

$$ K(x, x') = \sigma_{b,out}^2 + \sum_{j=1}^{H} \mathbb{E}[\zeta_j(x)\, \zeta_j(x')] $$

Substitute the definition of \(\zeta_j(x)\):

$$ \mathbb{E}[\zeta_j(x)\, \zeta_j(x')] = \frac{1}{H}\, \mathbb{E}[v_j^2]\, \mathbb{E}[h_j(x)\, h_j(x')] $$

Since \(v_j \sim \mathcal{N}(0, \sigma_v^2)\), \(\mathbb{E}[v_j^2] = \sigma_v^2\). Furthermore, because all \(h_j\) are identically distributed, the expectation \(\mathbb{E}[h_j(x) h_j(x')]\) is the same for all \(j\). Let's denote the generic weights and bias as \(\mathbf{w}\) and \(b_{hid}\). The sum simplifies:

$$ \sum_{j=1}^{H} \frac{\sigma_v^2}{H}\, \mathbb{E}[h_j(x)\, h_j(x')] = \sigma_v^2\, \mathbb{E}_{\mathbf{w},\, b_{hid}}\!\left[ \phi(\mathbf{w}^T x + b_{hid})\, \phi(\mathbf{w}^T x' + b_{hid}) \right] $$

Thus, the analytic expression for the covariance is:

$$ K(x, x') = \sigma_{b,out}^2 + \sigma_v^2\, \mathbb{E}_{\mathbf{w},\, b_{hid}}\!\left[ \phi(\mathbf{w}^T x + b_{hid})\, \phi(\mathbf{w}^T x' + b_{hid}) \right] $$
Step 5: Applying the Central Limit Theorem We have established that for any finite set of inputs \(X = \{x_1, \dots, x_N\}\), the output vector \(\mathbf{f} = [f(x_1), \dots, f(x_N)]^T\) is composed of a constant bias vector \(\mathbf{b}\) plus a sum of \(H\) independent and identically distributed random vectors \(\boldsymbol{\zeta}_j = [\zeta_j(x_1), \dots, \zeta_j(x_N)]^T\). By the Multivariate Central Limit Theorem (CLT), as the number of terms \(H \to \infty\), the sum of i.i.d. random vectors with finite variance converges in distribution to a multivariate Gaussian distribution. The mean of this distribution is \(\mathbf{0}\), and the covariance matrix is exactly \(\mathbf{K}\) with elements computed in Step 4. Since the joint distribution of any finite set of outputs is Gaussian, the stochastic process \(f(x)\) is, by definition, a Gaussian Process. \(\blacksquare\)
4. Deep NNGPs and Recursive Kernel Composition¶
While Neal's theorem applied to shallow networks, Lee et al. (2018) and Matthews et al. (2018) proved that this equivalence holds for deep networks of arbitrary depth \(L\), provided that the width of every hidden layer goes to infinity sequentially (or simultaneously under certain mathematical conditions).
4.1 The Recursive Formulation¶
Consider a deep MLP. Let \(g^{(l)}(x)\) denote the pre-activations at layer \(l\), and \(h^{(l)}(x) = \phi(g^{(l)}(x))\) denote the post-activations. We model the network recursively:

$$ g_i^{(l+1)}(x) = b_i^{(l+1)} + \frac{1}{\sqrt{H_l}} \sum_{k=1}^{H_l} W_{ik}^{(l+1)}\, h_k^{(l)}(x), \qquad W_{ik}^{(l+1)} \sim \mathcal{N}(0, \sigma_w^2), \quad b_i^{(l+1)} \sim \mathcal{N}(0, \sigma_b^2) $$
If the previous layer \(g^{(l)}\) is a Gaussian Process, then as \(H_l \to \infty\) the per-unit activations \(h_k^{(l)}\) become i.i.d. across units \(k\), and the sum forming \(g^{(l+1)}\) again invokes the Central Limit Theorem. Thus, \(g^{(l+1)}\) is also a Gaussian Process. This inductive reasoning shows that every layer of an infinite-width network defines a GP.
The covariance of the \(l\)-th layer, \(K^{(l)}(x, x')\), is computed via a recursive integral. For the input layer (\(l=0\)):

$$ K^{(0)}(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{d}\, x^T x' $$
For subsequent layers (\(l \geq 1\)):

$$ K^{(l)}(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{(u, v) \sim \mathcal{N}(\mathbf{0},\, \mathbf{\Sigma}^{(l-1)})}\!\left[ \phi(u)\, \phi(v) \right] $$

Where the 2D Gaussian distribution for \((u, v)\) has the covariance matrix:

$$ \mathbf{\Sigma}^{(l-1)} = \begin{bmatrix} K^{(l-1)}(x, x) & K^{(l-1)}(x, x') \\ K^{(l-1)}(x', x) & K^{(l-1)}(x', x') \end{bmatrix} $$
This remarkable formula allows us to compute the exact, analytic prior covariance between two inputs as they pass through an arbitrarily deep, infinitely wide network, without ever instantiating a single random weight!
4.2 Closed-Form Solutions: The Arc-Cosine Kernel¶
Computing the 2D Gaussian expectation \(\mathbb{E} [\phi(u) \phi(v)]\) generally requires numerical integration (e.g., Gauss-Hermite quadrature). However, for certain activation functions, closed-form analytical solutions exist.
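For activations without a closed form, the expectation can be approximated numerically. Below is a minimal sketch of 2D Gauss-Hermite quadrature; the function name and the choice of tanh as the example activation are illustrative assumptions, not part of the formal derivation.

import numpy as np

# A minimal sketch: approximate E[phi(u) phi(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]])
# with a tensor-product Gauss-Hermite rule. We write v = sqrt(k22)(rho*z1 + sqrt(1-rho^2)*z2)
# with independent standard normals z1, z2, so that Cov(u, v) = k12.
def gauss_hermite_expectation(phi, k11, k22, k12, n_points=40):
    t, w = np.polynomial.hermite.hermgauss(n_points)  # nodes/weights for weight exp(-t^2)
    z = np.sqrt(2.0) * t                              # substitution to standard normal
    w = w / np.sqrt(np.pi)
    rho = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
    Z1, Z2 = np.meshgrid(z, z, indexing='ij')
    W = np.outer(w, w)
    U = np.sqrt(k11) * Z1
    V = np.sqrt(k22) * (rho * Z1 + np.sqrt(1.0 - rho**2) * Z2)
    return np.sum(W * phi(U) * phi(V))

# Example: the tanh activation, for which no simple closed form exists.
print(gauss_hermite_expectation(np.tanh, k11=1.0, k22=1.5, k12=0.7))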
The most famous closed-form case is the ReLU activation (\(\phi(z) = \max(0, z)\)), which leads to the Arc-Cosine Kernel (Cho & Saul, 2009). Let \(\mathbf{\Sigma}^{(l-1)}\) be the covariance matrix. We define the variances and the correlation coefficient:

$$ K_{11} = K^{(l-1)}(x, x), \qquad K_{22} = K^{(l-1)}(x', x'), \qquad \cos\theta^{(l-1)} = \frac{K^{(l-1)}(x, x')}{\sqrt{K_{11} K_{22}}} $$

The integral evaluates analytically to:

$$ K^{(l)}(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{2\pi} \sqrt{K_{11} K_{22}}\, \Big( \sin\theta^{(l-1)} + (\pi - \theta^{(l-1)}) \cos\theta^{(l-1)} \Big) $$
This formula is computationally trivial to evaluate, allowing us to compute exact deep NNGP kernels for MLPs of arbitrary depth on standard hardware.
5. Architectural Extensions: CNNs and RNNs¶
The NNGP correspondence is not limited to MLPs. Any architecture constructed from linear operations over random Gaussian weights (dense layers, convolutions) combined with element-wise non-linearities possesses a corresponding infinite-width limit, and analogous limits have since been derived for attention layers and other components.
5.1 Convolutional NNGP (ConvNNGP)¶
In a Convolutional Neural Network (CNN), the hidden layers are spatial feature maps with \(H\) channels. The infinite-width limit corresponds to letting the number of channels \(H \to \infty\).
Instead of a single scalar covariance \(K(x, x')\) for a pair of images, the ConvNNGP tracks a spatial covariance tensor \(K(x, x', p, p')\), which represents the covariance between the pixel at position \(p\) in image \(x\) and the pixel at position \(p'\) in image \(x'\).
For a standard convolution with filter size \(F \times F\), the pre-activation covariance at layer \(l+1\) is computed by aggregating the post-activation covariances from the local receptive fields in layer \(l\):

$$ K^{(l+1)}(x, x', p, p') = \sigma_b^2 + \frac{\sigma_w^2}{F^2} \sum_{\delta \in \mathcal{F}} \mathbb{E}\!\left[ \phi\!\big( g^{(l)}(x, p + \delta) \big)\, \phi\!\big( g^{(l)}(x', p' + \delta) \big) \right] $$

where \(\mathcal{F}\) is the set of \(F^2\) spatial offsets of the filter and each expectation is taken under the 2D Gaussian defined by \(K^{(l)}\), exactly as in the MLP recursion.
Applying pooling operations (like Global Average Pooling) at the end of the network simply averages these spatial covariances into a single \(1 \times 1\) scalar covariance matrix, which can then be used for standard GP classification or regression.
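The sketch below illustrates the bookkeeping for a toy 1-D ConvNNGP. It assumes single-channel inputs, ReLU activations, a 'same'-style convolution that averages only the valid offsets at the edges, and global average pooling at the end; the function names and parameter defaults are illustrative, not a reference implementation.

import numpy as np

# A minimal 1-D ConvNNGP sketch (single input channel, ReLU, edge-truncated 'same'
# padding, global average pooling). All names and defaults here are illustrative.
def relu_expectation(K12, diag1, diag2):
    # E[ReLU(u) ReLU(v)] for (u, v) ~ N(0, [[diag1, K12], [K12, diag2]]) via Cho & Saul.
    norm = np.sqrt(np.outer(diag1, diag2))
    cos_t = np.clip(K12 / np.maximum(norm, 1e-12), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)

def conv_nngp_1d(x1, x2, depth=2, filter_size=3, var_w=2.0, var_b=0.1):
    """Scalar NNGP covariance between two 1-D single-channel signals of equal length P."""
    P, half = len(x1), filter_size // 2
    # Layer 0: pixelwise covariance tensors K[p, p'] for the three pairs we must track.
    K12 = var_b + var_w * np.outer(x1, x2)
    K11 = var_b + var_w * np.outer(x1, x1)
    K22 = var_b + var_w * np.outer(x2, x2)

    def conv_aggregate(V):
        # New pre-activation covariance at (p, p') averages V over aligned filter offsets.
        out = np.zeros_like(V)
        for p in range(P):
            for q in range(P):
                acc, cnt = 0.0, 0
                for off in range(-half, half + 1):
                    if 0 <= p + off < P and 0 <= q + off < P:
                        acc += V[p + off, q + off]
                        cnt += 1
                out[p, q] = acc / cnt
        return out

    for _ in range(depth):
        V12 = relu_expectation(K12, np.diag(K11), np.diag(K22))
        V11 = relu_expectation(K11, np.diag(K11), np.diag(K11))
        V22 = relu_expectation(K22, np.diag(K22), np.diag(K22))
        K12 = var_b + var_w * conv_aggregate(V12)
        K11 = var_b + var_w * conv_aggregate(V11)
        K22 = var_b + var_w * conv_aggregate(V22)

    return K12.mean()   # global average pooling -> single scalar covariance

# Usage on two short random signals
rng = np.random.default_rng(0)
sig_a, sig_b = rng.standard_normal(16), rng.standard_normal(16)
print(conv_nngp_1d(sig_a, sig_b))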
5.2 Recurrent NNGP¶
Recurrent Neural Networks (RNNs) apply the same weight matrix iteratively over a sequence. In the infinite-width limit, the hidden state vector \(\mathbf{h}_t\) becomes infinitely wide. The NNGP recursion unfolds over the sequence length. If we have two sequences \(\mathbf{x} = (x_1, \dots, x_T)\) and \(\mathbf{x}' = (x'_1, \dots, x'_{T'})\), we compute the covariance matrix between all pairs of time steps \((t, t')\) recursively, using the covariance from \((t-1, t'-1)\).
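As a concrete illustration, the sketch below implements this recursion for a vanilla (Elman-style) RNN with ReLU activation, a zero initial hidden state, and a linear read-out of the final state. The parameterization (var_w, var_u, var_b, var_v) and the function names are assumptions made for this sketch rather than a canonical implementation.

import numpy as np

# A minimal RNN-NNGP sketch for a vanilla RNN h_t = ReLU(W h_{t-1} + U x_t + b),
# infinite hidden width, zero initial state, and a linear read-out of h_T.
def relu_moment(k11, k22, k12):
    # E[ReLU(u) ReLU(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]]) (arc-cosine formula).
    norm = np.sqrt(k11 * k22)
    if norm < 1e-12:
        return 0.0
    cos_t = np.clip(k12 / norm, -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)

def rnn_nngp_kernel(X1, X2, var_w=1.5, var_u=1.0, var_b=0.1, var_v=1.0):
    """Output covariance between two sequences X1 (T1, d) and X2 (T2, d)."""
    (T1, d), (T2, _) = X1.shape, X2.shape

    def self_variance(X):
        # Pre-activation variance of one sequence along its own diagonal (t, t).
        var = np.zeros(len(X) + 1)          # var[0] = 0 because h_0 = 0
        for t in range(1, len(X) + 1):
            var[t] = var_b + var_u * X[t - 1] @ X[t - 1] / d + var_w * var[t - 1] / 2.0
        return var

    v1, v2 = self_variance(X1), self_variance(X2)
    # Cross covariance of pre-activations over all pairs of time steps (t, t').
    Lam = np.zeros((T1 + 1, T2 + 1))        # row/column 0 stay 0 (zero initial state)
    for t in range(1, T1 + 1):
        for s in range(1, T2 + 1):
            rec = relu_moment(v1[t - 1], v2[s - 1], Lam[t - 1, s - 1])
            Lam[t, s] = var_b + var_u * X1[t - 1] @ X2[s - 1] / d + var_w * rec
    # Linear read-out of the final hidden states.
    return var_v * relu_moment(v1[T1], v2[T2], Lam[T1, T2])

# Usage on two sequences of different lengths
rng = np.random.default_rng(0)
seq_a, seq_b = rng.standard_normal((5, 3)), rng.standard_normal((7, 3))
print(rnn_nngp_kernel(seq_a, seq_b))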
6. Worked Examples¶
6.1 Worked Example 1: GP Regression from Scratch¶
Problem: We have two training points \(\mathbf{X}_{train} = [1.0, 3.0]^T\) with targets \(\mathbf{y}_{train} = [0.5, 2.0]^T\). The observation noise variance is \(\sigma_n^2 = 0.01\). We use an RBF kernel \(k(x, x') = \exp(-\frac{1}{2}(x - x')^2)\). Compute the exact predictive mean and variance at the test point \(x_* = 2.0\).
Step 1: Compute the Training Covariance Matrix \(\mathbf{K}\)

\(k(1, 1) = k(3, 3) = \exp(0) = 1.0\) and \(k(1, 3) = k(3, 1) = \exp(-\frac{1}{2}(1 - 3)^2) = \exp(-2) \approx 0.1353\), so

$$ \mathbf{K} = \begin{bmatrix} 1.0 & 0.1353 \\ 0.1353 & 1.0 \end{bmatrix} $$
Step 2: Add Noise and Invert

$$ \mathbf{K} + \sigma_n^2 \mathbf{I} = \begin{bmatrix} 1.01 & 0.1353 \\ 0.1353 & 1.01 \end{bmatrix} $$

Determinant \(D = 1.01^2 - 0.1353^2 = 1.0201 - 0.0183 = 1.0018\), so

$$ \left[ \mathbf{K} + \sigma_n^2 \mathbf{I} \right]^{-1} = \frac{1}{D} \begin{bmatrix} 1.01 & -0.1353 \\ -0.1353 & 1.01 \end{bmatrix} \approx \begin{bmatrix} 1.0082 & -0.1351 \\ -0.1351 & 1.0082 \end{bmatrix} $$
Step 3: Compute Test-Train Covariance Vector \(\mathbf{k}_*\) \(k(2, 1) = \exp(-\frac{1}{2}(2-1)^2) = \exp(-0.5) \approx 0.6065\) \(k(2, 3) = \exp(-\frac{1}{2}(2-3)^2) = \exp(-0.5) \approx 0.6065\)
Step 4: Compute Predictive Mean \(\mu_*\)

$$ \mu_* = \mathbf{k}_*^T \left[ \mathbf{K} + \sigma_n^2 \mathbf{I} \right]^{-1} \mathbf{y} = \begin{bmatrix} 0.6065 & 0.6065 \end{bmatrix} \begin{bmatrix} 1.0082 & -0.1351 \\ -0.1351 & 1.0082 \end{bmatrix} \begin{bmatrix} 0.5 \\ 2.0 \end{bmatrix} \approx 0.6065 \times (0.2339 + 1.9489) \approx 1.3239 $$
Step 5: Compute Predictive Variance \(\sigma_*^2\)

$$ \sigma_*^2 = k(2, 2) - \mathbf{k}_*^T \left[ \mathbf{K} + \sigma_n^2 \mathbf{I} \right]^{-1} \mathbf{k}_* = 1.0 - 0.6065^2 \times (1.0082 - 0.1351 - 0.1351 + 1.0082) \approx 1.0 - 0.6424 \approx 0.3576 $$
The predictive distribution at \(x_* = 2.0\) is therefore approximately \(\mathcal{N}(1.3239, 0.3576)\).
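These numbers can be reproduced in a few lines of NumPy (a minimal check, not part of the example itself):

import numpy as np

# A minimal numerical check of Worked Example 1.
k = lambda a, b: np.exp(-0.5 * (a - b)**2)
X, y, noise = np.array([1.0, 3.0]), np.array([0.5, 2.0]), 0.01

K = k(X[:, None], X[None, :]) + noise * np.eye(2)
k_star = k(2.0, X)                          # covariances between x_* = 2 and the training points

mu = k_star @ np.linalg.solve(K, y)                        # predictive mean
var = k(2.0, 2.0) - k_star @ np.linalg.solve(K, k_star)    # predictive variance
print(mu, var)                              # approx. 1.3239 and 0.3576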
6.2 Worked Example 2: NNGP Kernel for Identity Activation¶
Problem: Prove analytically that an infinite-width MLP of any depth \(L\) with identity activations \(\phi(z) = z\) and no biases (\(\sigma_b=0\)) reduces to Bayesian Linear Regression, regardless of the depth.
Solution: For the first hidden layer (\(l=1\)), the base recursion gives:

$$ K^{(1)}(x, x') = \sigma_w^2\, \frac{x^T x'}{d} $$
This is exactly the standard linear kernel.
For the second hidden layer (\(l=2\)), we use the recursive formula:

$$ K^{(2)}(x, x') = \sigma_w^2\, \mathbb{E}_{(u, v) \sim \mathcal{N}(\mathbf{0},\, \mathbf{\Sigma}^{(1)})}\!\left[ \phi(u)\, \phi(v) \right] $$

Since \(\phi\) is the identity function, this is just \(\mathbb{E}[u v]\). By definition, the expectation of the product of two zero-mean correlated Gaussian variables \(u, v\) is their covariance, which is exactly the off-diagonal element of \(\mathbf{\Sigma}^{(1)}\). Therefore:

$$ \mathbb{E}[u v] = K^{(1)}(x, x') $$

Substituting this back:

$$ K^{(2)}(x, x') = \sigma_w^2\, K^{(1)}(x, x') = (\sigma_w^2)^2\, \frac{x^T x'}{d} $$
By strict mathematical induction, for any depth \(L\):

$$ K^{(L)}(x, x') = (\sigma_w^2)^L\, \frac{x^T x'}{d} $$
This proves that stacking linear layers, even infinitely wide ones, does not increase the expressive capacity of the model beyond linear regression. It merely scales the variance of the linear kernel by \((\sigma_w^2)^L\).
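A short numerical check of this induction (a minimal sketch; the input vectors are arbitrary):

import numpy as np

# A minimal check of Worked Example 2: with identity activations and no biases,
# the depth-L NNGP kernel is (sigma_w^2)^L * (x . x') / d for every L.
def linear_nngp(x1, x2, depth, var_w):
    K = var_w * (x1 @ x2) / len(x1)     # first-layer kernel K^(1)
    for _ in range(depth - 1):
        K = var_w * K                   # E[uv] = Cov(u, v) for the identity activation
    return K

x, xp = np.array([1.0, -2.0, 0.5]), np.array([0.3, 1.0, 2.0])
for L in [1, 2, 5]:
    print(L, linear_nngp(x, xp, L, var_w=1.5), 1.5**L * (x @ xp) / 3)  # the two columns match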
7. Exhaustive Coding Demonstrations¶
7.1 Coding Demo 1: Full Implementation of the Deep Arc-Cosine Kernel¶
This demo shows how to construct the exact NNGP kernel for a deep ReLU network and use it for Gaussian Process regression.
import matplotlib
matplotlib.use('Agg')
import numpy as np
import matplotlib.pyplot as plt
def compute_relu_nngp_kernel(X1, X2, depth=3, var_w=1.0, var_b=0.0):
"""
Computes the analytic NNGP kernel for a deep ReLU MLP.
X1, X2: Matrices of shape (N1, d) and (N2, d).
Returns a covariance matrix of shape (N1, N2).
"""
N1 = X1.shape[0]
N2 = X2.shape[0]
# Layer 0 (Input Kernel)
# The initial covariance is the dot product of the inputs divided by dimension
# plus the bias variance.
d = X1.shape[1]
K_11 = np.sum(X1 * X1, axis=1) / d + var_b
K_22 = np.sum(X2 * X2, axis=1) / d + var_b
K_12 = np.dot(X1, X2.T) / d + var_b
for l in range(depth):
# Prevent numerical instability (division by zero or negative sqrt)
K_11 = np.clip(K_11, 1e-10, None)
K_22 = np.clip(K_22, 1e-10, None)
# Compute the normalization matrix (sqrt(K_11 * K_22))
norm_matrix = np.sqrt(np.outer(K_11, K_22))
# Compute the cosine of the angle theta
cos_theta = np.clip(K_12 / norm_matrix, -1.0, 1.0)
theta = np.arccos(cos_theta)
# The Arc-Cosine formula for the expectation E[ReLU(u)*ReLU(v)]
J_12 = (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)
# Compute the new covariance
K_12 = var_b + var_w * norm_matrix * J_12
# Update the self-covariances (where cos_theta = 1, theta = 0, J = 1/2)
K_11 = var_b + var_w * K_11 / 2.0
K_22 = var_b + var_w * K_22 / 2.0
return K_12
# 1. Generate Non-Linear Synthetic Data
np.random.seed(42)
X_train = np.random.uniform(-4, 4, 15).reshape(-1, 1)
y_train = np.sin(X_train) + 0.1 * np.random.randn(15, 1)
X_test = np.linspace(-6, 6, 100).reshape(-1, 1)
# 2. Compute the Deep NNGP Kernel Matrices
depth = 5 # 5 hidden layers
var_w = 2.0 # He initialization equivalent scaling
var_b = 0.05
noise_var = 0.01
K_train_train = compute_relu_nngp_kernel(X_train, X_train, depth, var_w, var_b)
K_test_train = compute_relu_nngp_kernel(X_test, X_train, depth, var_w, var_b)
K_test_test = compute_relu_nngp_kernel(X_test, X_test, depth, var_w, var_b)
# 3. Perform Exact Bayesian GP Inference
K_inv = np.linalg.inv(K_train_train + noise_var * np.eye(15))
mu_test = K_test_train @ K_inv @ y_train
# Extract diagonal for variance
var_test = np.diag(K_test_test) - np.diag(K_test_train @ K_inv @ K_test_train.T)
std_test = np.sqrt(np.clip(var_test, 0, None)).reshape(-1, 1)
# 4. Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='black', zorder=10, label='Training Data')
plt.plot(X_test, mu_test, c='blue', label='NNGP Predictive Mean')
plt.fill_between(X_test.flatten(),
(mu_test - 2 * std_test).flatten(),
(mu_test + 2 * std_test).flatten(),
color='blue', alpha=0.2, label='95% Epistemic Confidence')
plt.plot(X_test, np.sin(X_test), 'r--', label='True Function')
plt.title(f"Exact Bayesian Inference with Infinite Width MLP (Depth={depth})")
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('figures/08-3-demo1.png', dpi=150, bbox_inches='tight')
plt.close()

7.2 Coding Demo 2: Empirical Verification of Neal's Limit¶
This demo verifies empirically that, as a finite neural network becomes wider, its sample output covariance across random initializations approaches the analytic NNGP kernel, computed with the same kernel function as in Demo 1.
import numpy as np
import torch
import torch.nn as nn
torch.manual_seed(42)
np.random.seed(42)
def compute_relu_nngp_kernel(X1, X2, depth=3, var_w=1.0, var_b=0.0):
N1 = X1.shape[0]
N2 = X2.shape[0]
d = X1.shape[1]
K_11 = np.sum(X1 * X1, axis=1) / d + var_b
K_22 = np.sum(X2 * X2, axis=1) / d + var_b
K_12 = np.dot(X1, X2.T) / d + var_b
for l in range(depth):
K_11 = np.clip(K_11, 1e-10, None)
K_22 = np.clip(K_22, 1e-10, None)
norm_matrix = np.sqrt(np.outer(K_11, K_22))
cos_theta = np.clip(K_12 / norm_matrix, -1.0, 1.0)
theta = np.arccos(cos_theta)
J_12 = (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)
K_12 = var_b + var_w * norm_matrix * J_12
K_11 = var_b + var_w * K_11 / 2.0
K_22 = var_b + var_w * K_22 / 2.0
return K_12
def empirical_covariance(X, width, num_samples=1000):
N, d = X.shape
outputs = np.zeros((num_samples, N))
# Standard NNGP/NTK scaling parameterization
for i in range(num_samples):
# Define a massive 1-hidden-layer network
net = nn.Sequential(
nn.Linear(d, width),
nn.ReLU(),
nn.Linear(width, 1, bias=False)
)
# Apply strict NNGP initialization
nn.init.normal_(net[0].weight, mean=0.0, std=np.sqrt(1.0/d))
nn.init.zeros_(net[0].bias)
nn.init.normal_(net[2].weight, mean=0.0, std=np.sqrt(1.0/width))
with torch.no_grad():
outputs[i, :] = net(torch.tensor(X, dtype=torch.float32)).numpy().flatten()
# Compute the empirical covariance matrix across the 1000 network instantiations
return np.cov(outputs, rowvar=False)
# Define two simple input points
X_sample = np.array([[1.0], [2.0]])
print("--- Empirical vs Analytical NNGP ---")
print("Analytical Matrix (Depth=1):")
# var_w = 1.0 (to match the std=1.0 scaling), var_b = 0.0
K_exact = compute_relu_nngp_kernel(X_sample, X_sample, depth=1, var_w=1.0, var_b=0.0)
print(K_exact)
for w in [10, 100, 10000]:
print(f"\nEmpirical Matrix (Width={w}, 1000 samples):")
K_emp = empirical_covariance(X_sample, w)
print(K_emp)
# The output shows K_emp agreeing with K_exact at every width, up to the Monte Carlo
# error that comes from averaging over only 1000 sampled networks.
--- Empirical vs Analytical NNGP ---
Analytical Matrix (Depth=1):
[[0.5 1. ]
[1. 2. ]]
Empirical Matrix (Width=10, 1000 samples):
[[0.51374438 1.02748875]
[1.02748875 2.0549775 ]]
Empirical Matrix (Width=100, 1000 samples):
[[0.50297487 1.00594973]
[1.00594973 2.01189945]]
Empirical Matrix (Width=10000, 1000 samples):
[[0.51332177 1.02664354]
[1.02664354 2.05328708]]
8. Philosophical Implications and the NTK¶
The NNGP equivalence resolves a long-standing mystery: Why don't highly over-parameterized neural networks perfectly overfit random noise in a way that destroys generalization? Classical statistical learning theory suggests that billions of parameters should lead to extreme overfitting. The NNGP provides the answer: As networks grow wider, they do not become more chaotic; they become smoother. They converge to well-behaved Gaussian Processes with structured kernels that possess strong inductive biases for generalization.
However, the NNGP describes the distribution of functions at initialization (before any training). It implies that if we performed exact Bayesian inference on the weights, the result would be GP regression. But in practice, we use Stochastic Gradient Descent (SGD).
Does SGD on an infinite-width network behave like GP regression? The answer is yes, but governed by a different kernel: the Neural Tangent Kernel (NTK). Discovered by Jacot et al. in 2018, the NTK proves that in the infinite-width limit, the weights of the network move infinitesimally during training, yet the function output changes significantly. The training dynamics follow a linear differential equation governed entirely by the fixed NTK.
Together, the NNGP (describing the prior) and the NTK (describing the SGD dynamics) provide a complete, rigorous mathematical theory of deep learning in the infinite-width regime, closing the historical divide between deep neural networks and classical kernel machines.