
6.2 Graph Neural Networks and Spectral Theory

Graph Neural Networks (GNNs) have emerged as the standard architecture for modeling relational data, where information is structured as a set of entities (nodes) and their relationships (edges). Unlike standard Euclidean data like images or sequences, graphs lack a canonical ordering of nodes. This leads to the fundamental requirement of permutation invariance (for graph-level tasks) and permutation equivariance (for node-level tasks). This chapter explores the theoretical limits of GNNs through the lens of the Weisfeiler-Lehman hierarchy and provides a rigorous treatment of Spectral Graph Theory.

1. Graph Representation and Symmetries

A graph is represented as \(G = (V, E)\), where \(V\) is a set of \(n\) nodes and \(E \subseteq V \times V\) is a set of edges. We often associate each node \(i \in V\) with a feature vector \(\mathbf{x}_i \in \mathbb{R}^d\).

1.1 Permutation Invariance and Equivariance

Let \(\mathbf{X} \in \mathbb{R}^{n \times d}\) be the matrix of node features and \(\mathbf{A} \in \{0, 1\}^{n \times n}\) be the adjacency matrix. A permutation \(\pi \in S_n\) acts on the graph by reordering the nodes. This is represented by a permutation matrix \(\mathbf{P} \in \{0, 1\}^{n \times n}\).

The transformed features and adjacency matrix are:

  • \(\mathbf{X}' = \mathbf{P} \mathbf{X}\)
  • \(\mathbf{A}' = \mathbf{P} \mathbf{A} \mathbf{P}^T\)

Definition (Invariance)

A function \(f(\mathbf{A}, \mathbf{X})\) is permutation invariant if \(f(\mathbf{P}\mathbf{A}\mathbf{P}^T, \mathbf{P}\mathbf{X}) = f(\mathbf{A}, \mathbf{X})\) for every permutation matrix \(\mathbf{P}\).

Definition (Equivariance)

A function \(\mathbf{F}(\mathbf{A}, \mathbf{X})\) is permutation equivariant if \(\mathbf{F}(\mathbf{P}\mathbf{A}\mathbf{P}^T, \mathbf{P}\mathbf{X}) = \mathbf{P} \mathbf{F}(\mathbf{A}, \mathbf{X})\) for every permutation matrix \(\mathbf{P}\).
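
As a quick sanity check of the two definitions, the minimal sketch below verifies numerically that a sum readout over aggregated features is invariant while one round of neighbor aggregation \(\mathbf{A}\mathbf{X}\) is equivariant. The small graph and the function names are purely illustrative.

Python
import torch

torch.manual_seed(0)
n, d = 4, 3
A = torch.tensor([[0., 1., 0., 1.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 0., 1., 0.]])
X = torch.randn(n, d)

# Random permutation matrix P acting on the node ordering
perm = torch.randperm(n)
P = torch.eye(n)[perm]
A_p, X_p = P @ A @ P.T, P @ X

# Invariant readout: summing over nodes ignores their ordering
f = lambda A, X: (A @ X).sum(dim=0)
print(torch.allclose(f(A_p, X_p), f(A, X)))       # True

# Equivariant map: one round of neighbor aggregation permutes with the nodes
F = lambda A, X: A @ X
print(torch.allclose(F(A_p, X_p), P @ F(A, X)))   # True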

2. The Weisfeiler-Lehman Hierarchy and GNN Expressivity

The fundamental question of GNN theory is: "Which graphs can a GNN distinguish?" If two graphs are non-isomorphic but a GNN maps them to the same embedding, the GNN has "failed."

2.1 The 1-WL Algorithm (Color Refinement)

The 1-WL test is a heuristic for graph isomorphism.

  1. Initialize color \(c_i^{(0)} = 1\) for all nodes (or use node labels).
  2. At step \(t\), refine colors:
\[ c_i^{(t)} = \text{HASH}\left( c_i^{(t-1)}, \{ c_j^{(t-1)} : j \in \mathcal{N}(i) \} \right) \]

where \(\{ \dots \}\) denotes the multiset of neighbor colors.

  3. Terminate when the partition of nodes into color classes stabilizes.

If the final multiset of node colors of \(G_1\) differs from that of \(G_2\), the graphs are definitely non-isomorphic. If the multisets coincide, the graphs may or may not be isomorphic.
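
The following is a minimal sketch of color refinement on an adjacency-list graph, intended only to make the refinement loop concrete; the function name wl_histogram and the nested-tuple color encoding are implementation choices, not part of the algorithm's specification.

Python
from collections import Counter

def wl_histogram(adj, rounds=10):
    """Sketch of 1-WL color refinement on an adjacency-list graph.
    Colors are nested tuples, so they are directly comparable across graphs."""
    n = len(adj)
    colors = [()] * n  # every node starts with the same color
    for _ in range(rounds):
        # New color = (own color, sorted multiset of neighbor colors)
        new_colors = [(colors[i], tuple(sorted(colors[j] for j in adj[i]))) for i in range(n)]
        if len(set(new_colors)) == len(set(colors)):  # partition stopped refining
            break
        colors = new_colors
    return Counter(colors)

# A 4-node path vs. a 4-node star: different degree multisets, so 1-WL separates them.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(wl_histogram(path) == wl_histogram(star))  # False -> provably non-isomorphic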

2.2 MPNNs are Bounded by 1-WL

Standard Message Passing Neural Networks (MPNNs) use an aggregation function to update node states.

Theorem (MPNN Expressivity Limit)

Any MPNN with continuous aggregation and update functions is at most as expressive as the 1-WL test in terms of distinguishing non-isomorphic graphs.

Proof: We proceed by induction on the iteration number \(t\). Let \(h_v^{(t)}\) be the embedding of node \(v\) in an MPNN and \(c_v^{(t)}\) be its color in 1-WL.

Inductive Hypothesis: If \(c_u^{(t)} = c_v^{(t)}\), then \(h_u^{(t)} = h_v^{(t)}\) for any nodes \(u, v\) in any graphs \(G, G'\).

Base Case (\(t=0\)): Standard initialization is \(c_v^{(0)} = \text{label}(v)\) and \(h_v^{(0)} = \mathbf{x}_v\). If we assume node features are the labels, then \(c_u^{(0)} = c_v^{(0)} \implies h_u^{(0)} = h_v^{(0)}\).

Inductive Step: Assume the hypothesis holds for \(t-1\). Suppose \(c_u^{(t)} = c_v^{(t)}\). In the 1-WL algorithm, the hash function is injective. Thus, \(c_u^{(t)} = c_v^{(t)}\) implies:

  1. \(c_u^{(t-1)} = c_v^{(t-1)}\)
  2. \(\{ c_i^{(t-1)} : i \in \mathcal{N}(u) \} = \{ c_j^{(t-1)} : j \in \mathcal{N}(v) \}\)

By the inductive hypothesis:

  1. \(h_u^{(t-1)} = h_v^{(t-1)}\)
  2. The multiset of embeddings of neighbors must also be identical:
\[ \{ h_i^{(t-1)} : i \in \mathcal{N}(u) \} = \{ h_j^{(t-1)} : j \in \mathcal{N}(v) \} \]

The MPNN update is \(h_u^{(t)} = \text{UP}^{(t)}(h_u^{(t-1)}, \text{AGG}^{(t)}(\{ h_i^{(t-1)} : i \in \mathcal{N}(u) \}))\). Since both inputs to the deterministic functions \(\text{UP}\) and \(\text{AGG}\) are identical for \(u\) and \(v\), it must be that \(h_u^{(t)} = h_v^{(t)}\).

Thus, if 1-WL cannot distinguish two graphs (mapping them to the same multiset of colors), no MPNN can distinguish them either. \(\blacksquare\)

2.3 The Graph Isomorphism Network (GIN)

While MPNNs are bounded by 1-WL, can we reach that bound? GIN achieves this by using a sum-aggregator and an MLP.

Theorem (Injectivity of Sum-Aggregation)

Let \(\mathcal{X}\) be a countable set. There exists a function \(f: \mathcal{X} \to \mathbb{R}^n\) such that \(h(M) = \sum_{x \in M} f(x)\) is injective over all multisets \(M\) of bounded size.

Proof Sketch: Because \(\mathcal{X}\) is countable, its elements can be enumerated and mapped to distinct one-hot vectors (or, for multisets of size at most \(N\), to the scalars \(N^{-i}\)). The sum then encodes the multiplicity of every element, so distinct multisets receive distinct encodings. GIN uses an MLP to approximate such an injective mapping.
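
A minimal numeric illustration of the one-hot construction, assuming elements drawn from \(\{0, \dots, 9\}\); the helper sum_encode is purely illustrative.

Python
import torch

# Choose f(x) = one-hot(x): the sum over a multiset is then its count vector,
# which identifies the multiset uniquely.
f = torch.eye(10)
sum_encode = lambda multiset: torch.stack([f[x] for x in multiset]).sum(dim=0)

print(sum_encode([1, 2, 2]))   # tensor([0., 1., 2., 0., ...]) -- multiplicities recovered
print(torch.equal(sum_encode([1, 2]), sum_encode([1, 2, 2])))  # False: sum separates them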

2.4 Higher-Order \(k\)-WL Hierarchy

Since 1-WL fails on simple cases (for example, it cannot distinguish certain non-isomorphic regular graphs of the same degree and size, such as the hexagon vs. two triangles in Example 1), higher-order variants of the WL algorithm examine tuples of nodes:

  • 2-WL: Operates on pairs of nodes \((i, j)\).
  • k-WL: Operates on \(k\)-tuples of nodes \((v_1, \dots, v_k)\).

Hierarchy Property: For \(k \geq 2\), \((k+1)\)-WL is strictly more powerful than \(k\)-WL. That is, there exist graphs distinguished by \((k+1)\)-WL that are not distinguished by \(k\)-WL.

3. Spectral Graph Theory

Spectral GNNs define convolutions in the "frequency domain" of the graph.

3.1 The Graph Laplacian

Let \(\mathbf{D}\) be the degree matrix (\(D_{ii} = \sum_j A_{ij}\)).

  • Combinatorial Laplacian: \(\mathbf{L} = \mathbf{D} - \mathbf{A}\)
  • Normalized Laplacian: \(\tilde{\mathbf{L}} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}\)

Properties of \(\tilde{\mathbf{L}}\):

  1. It is symmetric and positive semi-definite.
  2. Its eigenvalues lie in \([0, 2]\).
  3. The smallest eigenvalue is 0, with eigenvector \(\mathbf{D}^{1/2} \mathbf{1}\) (for the combinatorial Laplacian \(\mathbf{L}\), the corresponding eigenvector is the constant vector \(\mathbf{1}\)).
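
These three properties can be checked numerically on a small graph; the sketch below uses a 4-node path and is purely illustrative.

Python
import torch

A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])
deg = A.sum(dim=1)
D_inv_sqrt = torch.diag(deg.pow(-0.5))
L_norm = torch.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt

evals = torch.linalg.eigvalsh(L_norm)                  # symmetric -> real spectrum
print(torch.allclose(L_norm, L_norm.T))                # 1. symmetric
print(evals.min() >= -1e-6, evals.max() <= 2 + 1e-6)   # 2. eigenvalues in [0, 2]

# 3. eigenvalue 0 with eigenvector D^{1/2} 1
v = deg.pow(0.5)
print(torch.allclose(L_norm @ v, torch.zeros(4), atol=1e-6))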

3.2 Graph Fourier Transform

Let \(\tilde{\mathbf{L}} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T\) be the eigendecomposition. The eigenvectors \(\mathbf{u}_l\) (columns of \(\mathbf{U}\)) are the "Graph Fourier Bases," and eigenvalues \(\lambda_l\) are the "frequencies." The Graph Fourier Transform of a signal \(\mathbf{x}\) is \(\hat{\mathbf{x}} = \mathbf{U}^T \mathbf{x}\).

A convolution in the spectral domain is defined as pointwise multiplication in the Fourier domain:

\[ \mathbf{y} = \mathbf{g}_\theta * \mathbf{x} = \mathbf{U} (\mathbf{U}^T \mathbf{g}_\theta \odot \mathbf{U}^T \mathbf{x}) = \mathbf{U} \mathbf{g}_\theta(\mathbf{\Lambda}) \mathbf{U}^T \mathbf{x} \]

where \(\mathbf{g}_\theta(\mathbf{\Lambda})\) is a diagonal matrix of filter coefficients.
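
For small graphs, this spectral convolution can be computed directly from the eigendecomposition. The sketch below does exactly that, using an arbitrary low-pass filter \(g(\lambda) = e^{-2\lambda}\) chosen only for illustration; practical architectures avoid the \(O(n^3)\) decomposition, as discussed next.

Python
import torch

def spectral_filter(L_norm, x, g):
    """Filter a node signal x by applying g to the Laplacian spectrum (O(n^3))."""
    lam, U = torch.linalg.eigh(L_norm)       # L = U diag(lam) U^T
    x_hat = U.T @ x                          # Graph Fourier Transform
    y_hat = g(lam).unsqueeze(-1) * x_hat     # pointwise multiplication in frequency domain
    return U @ y_hat                         # inverse transform

A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
deg = A.sum(1)
L_norm = torch.eye(3) - torch.diag(deg.pow(-0.5)) @ A @ torch.diag(deg.pow(-0.5))
x = torch.randn(3, 2)
y = spectral_filter(L_norm, x, lambda lam: torch.exp(-2.0 * lam))
print(y.shape)   # torch.Size([3, 2])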

3.3 Spectral Localization and Chebyshev Polynomials

A filter is localized if its action only depends on a local neighborhood.

Theorem (Polynomial Filters are Localized)

A spectral filter of the form \(g(\lambda) = \sum_{k=0}^K \theta_k \lambda^k\) is \(K\)-localized. That is, \((\mathbf{U} g(\mathbf{\Lambda}) \mathbf{U}^T)_{ij} = 0\) if the distance \(d(i, j) > K\).

Proof:

\[ \mathbf{U} \left( \sum_{k=0}^K \theta_k \mathbf{\Lambda}^k \right) \mathbf{U}^T = \sum_{k=0}^K \theta_k (\mathbf{U} \mathbf{\Lambda} \mathbf{U}^T)^k = \sum_{k=0}^K \theta_k \tilde{\mathbf{L}}^k \]

We know \((\mathbf{A}^k)_{ij}\) counts the walks of length \(k\) between \(i\) and \(j\). Since \(\tilde{\mathbf{L}}\) has the same sparsity pattern as \(\mathbf{A} + \mathbf{I}\), the \(ij\)-th entry of \(\tilde{\mathbf{L}}^k\) is non-zero only if there is a walk of length at most \(k\) between \(i\) and \(j\), i.e., only if \(d(i, j) \leq k\). Thus, if \(d(i, j) > K\), every term in the sum vanishes. \(\blacksquare\)
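
The localization claim is easy to verify numerically: on a 5-node path, the entries of \(\mathbf{L}^2\) between nodes more than 2 hops apart are exactly zero. The sketch below uses the combinatorial Laplacian for simplicity.

Python
import torch

n = 5
A = torch.zeros(n, n)
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0     # path graph 0-1-2-3-4
L = torch.diag(A.sum(1)) - A            # combinatorial Laplacian

K = 2
LK = torch.matrix_power(L, K)
print(LK[0, 3].item(), LK[0, 4].item())  # d(0,3)=3 > 2 and d(0,4)=4 > 2 -> both 0.0
print(LK[0, 2].item())                   # d(0,2)=2 <= 2 -> generally non-zero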

To avoid the \(O(n^3)\) cost of eigendecomposition, ChebNet parameterizes the filter with Chebyshev polynomials \(T_k\) of the rescaled Laplacian \(\hat{\mathbf{L}} = \frac{2}{\lambda_{max}} \tilde{\mathbf{L}} - \mathbf{I}\), which can be computed recursively:

\[ T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x) \]

Each term \(T_k(\hat{\mathbf{L}})\mathbf{x}\) follows from the previous two via a single sparse matrix-vector product, so applying a \(K\)-term filter costs \(O(K|E|)\).

4. Worked Examples

Example 1: Failure of 1-WL on the Hexagon vs. Two Triangles

Consider:

  1. \(G_1\): a cycle of 6 nodes.
  2. \(G_2\): two disjoint cycles of 3 nodes.

Both graphs are 2-regular, so in 1-WL every node initializes with the same color \(1\). At each refinement step:

  • In \(G_1\), each node has 2 neighbors of color 1.
  • In \(G_2\), each node has 2 neighbors of color 1.

Thus 1-WL assigns the same new color to all nodes in both graphs, and the partition never refines further. 1-WL fails to distinguish them despite \(G_2\) being disconnected and \(G_1\) being connected, as the check below confirms.
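
Below is a minimal numeric check of this failure, repeating the color-refinement sketch from Section 2.1 so it runs standalone; the adjacency-list encoding is an illustrative choice.

Python
from collections import Counter

def wl_histogram(adj, rounds=10):
    # Same color-refinement sketch as in Section 2.1.
    colors = [()] * len(adj)
    for _ in range(rounds):
        new = [(colors[i], tuple(sorted(colors[j] for j in adj[i]))) for i in range(len(adj))]
        if len(set(new)) == len(set(colors)):
            break
        colors = new
    return Counter(colors)

hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}                      # C6
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}   # 2 x C3
print(wl_histogram(hexagon) == wl_histogram(triangles))  # True: 1-WL cannot separate them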

Example 2: The Graph Laplacian of a Path Graph \(P_3\)

\(P_3\) has 3 nodes and edges \((1, 2)\) and \((2, 3)\).

\[ \mathbf{A} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad \mathbf{D} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix} \]
\[ \mathbf{L} = \mathbf{D} - \mathbf{A} = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix} \]

Eigenvalues: \(0, 1, 3\). Eigenvectors:

  • \(\lambda=0: \mathbf{u}_0 = \frac{1}{\sqrt{3}}(1, 1, 1)^T\) (DC component)
  • \(\lambda=1: \mathbf{u}_1 = \frac{1}{\sqrt{2}}(1, 0, -1)^T\)
  • \(\lambda=3: \mathbf{u}_2 = \frac{1}{\sqrt{6}}(1, -2, 1)^T\) (High frequency)
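These eigenpairs can be confirmed numerically; the short check below is illustrative.

Python
import torch

L = torch.tensor([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])
print(torch.linalg.eigvalsh(L))                    # tensor([0., 1., 3.]) up to rounding
u2 = torch.tensor([1., -2., 1.]) / 6 ** 0.5
print(torch.allclose(L @ u2, 3 * u2, atol=1e-6))   # highest-frequency eigenvector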

Example 3: GCN as a First-Order Spectral Filter

Hammond et al. (2011) showed that spectral filters are well-approximated by low-order polynomials of the Laplacian; to first order, \(g_\theta(\tilde{\mathbf{L}}) \mathbf{x} \approx \theta_0 \mathbf{x} + \theta_1 \tilde{\mathbf{L}} \mathbf{x}\). Kipf & Welling (2017) simplified this further:

  1. Approximate \(\lambda_{max} \approx 2\), so the rescaled Laplacian becomes \(\tilde{\mathbf{L}} - \mathbf{I} = -\mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}\) and the filter reads \(\theta_0 \mathbf{x} - \theta_1 \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2} \mathbf{x}\).
  2. Tie the parameters: \(\theta = \theta_0 = -\theta_1\).
  3. Result: \(\mathbf{y} = \theta (\mathbf{I} + \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}) \mathbf{x}\). Applying the "renormalization trick" \(\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}\), \(\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}\) gives the standard GCN layer \(\mathbf{H}' = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{H} \mathbf{\Theta}\), sketched in code below.
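
The resulting propagation rule can be written as a small layer in the style of the GIN demo in Section 5. This is a minimal dense-matrix sketch, not library code; the class name GCNLayer and the dense adjacency input are illustrative choices.

Python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Minimal GCN layer: H' = D~^{-1/2} A~ D~^{-1/2} X W with A~ = A + I."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        A_tilde = adj + torch.eye(adj.shape[0])              # renormalization trick
        d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
        A_hat = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
        return A_hat @ self.linear(x)

x = torch.eye(3)
adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
print(GCNLayer(3, 8)(x, adj).shape)   # torch.Size([3, 8])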

5. Coding Demonstrations

Demo 1: Implementing a GIN Layer (PyTorch)

Python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.epsilon = nn.Parameter(torch.zeros(1))
        # MLP for injective mapping
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim)
        )

    def forward(self, x, adj):
        # x: [N, in_dim], adj: [N, N] (sparse or dense)
        # 1. Neighbor aggregation (Sum)
        neighbor_sum = torch.matmul(adj, x)

        # 2. Combine with self
        out = (1 + self.epsilon) * x + neighbor_sum

        # 3. Apply MLP
        return self.mlp(out)

# Test on a small graph
x = torch.eye(3) # Identity features
adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
layer = GINLayer(3, 8)
print("Output shape:", layer(x, adj).shape)
Text Only
Output shape: torch.Size([3, 8])

Demo 2: Chebyshev Spectral Convolution

Python
import torch

def chebyshev_conv(x, L, theta, K):
    # L: normalized Laplacian [N, N]
    # theta: filter weights [K] (K terms -> polynomial of order K-1)
    # K: number of Chebyshev terms; the recursion below assumes K >= 2
    assert K >= 2, "need at least two Chebyshev terms"

    N = x.shape[0]
    # Shift L so its eigenvalues lie in [-1, 1] (assumes lambda_max ~ 2)
    L_tilde = L - torch.eye(N)

    # Chebyshev recursion: T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)
    T_prev = x
    T_curr = L_tilde @ x

    out = theta[0] * T_prev + theta[1] * T_curr

    for k in range(2, K):
        T_next = 2 * L_tilde @ T_curr - T_prev
        out += theta[k] * T_next
        T_prev, T_curr = T_curr, T_next

    return out

# Example usage
L = torch.tensor([[1., -0.5, -0.5], [-0.5, 1., 0.], [-0.5, 0., 1.]])
x = torch.randn(3, 4)
theta = torch.randn(5) # 4th order polynomial
y = chebyshev_conv(x, L, theta, 5)
print("Cheby-filtered signal:\n", y)
Text Only
Cheby-filtered signal:
 tensor([[ 1.3605, -1.0962, -3.7441,  0.3547],
        [ 1.4876, -2.5949, -1.1164,  1.1059],
        [ 0.7037,  1.3417, -4.1509, -0.3740]])

6. Advanced Theoretical Topic: The \(k\)-WL vs. MPNN Gap

While MPNNs are upper-bounded by 1-WL, Xu et al. (2019) showed that they fall strictly below that bound when using non-injective aggregators such as MEAN or MAX.

  • MEAN Aggregator: Fails to distinguish multisets of different sizes if their distributions are the same (e.g., \(\{1, 1\}\) and \(\{1, 1, 1\}\)).
  • MAX Aggregator: Fails to distinguish multisets with the same unique elements (e.g., \(\{1, 2\}\) and \(\{1, 2, 2\}\)).
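
The two failure cases above are easy to reproduce numerically; the sketch below treats the colors as scalar features and compares MEAN, MAX, and SUM pooling.

Python
import torch

a = torch.tensor([[1.], [1.]])           # multiset {1, 1}
b = torch.tensor([[1.], [1.], [1.]])     # multiset {1, 1, 1}
print(a.mean(0), b.mean(0))              # both tensor([1.]) -> MEAN cannot separate them

c = torch.tensor([[1.], [2.]])           # multiset {1, 2}
d = torch.tensor([[1.], [2.], [2.]])     # multiset {1, 2, 2}
print(c.max(0).values, d.max(0).values)  # both tensor([2.]) -> MAX cannot separate them
print(c.sum(0), d.sum(0))                # tensor([3.]) vs tensor([5.]) -> SUM separates them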

To surpass 1-WL, we must move beyond local message passing. Techniques include:

  1. Subgraph GNNs: Aggregating over subgraphs of each node.
  2. Graph Rewiring: Adding virtual edges (e.g., via heat kernels or diffusion).
  3. Positional Encodings: Adding Laplacian eigenvectors as features to break node symmetries.
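
As an illustration of the third technique, the sketch below computes Laplacian-eigenvector positional encodings for a 4-cycle and concatenates them to the node features. The function name laplacian_pe is illustrative, and note that eigenvectors are only defined up to sign (and up to rotation within repeated eigenvalues).

Python
import torch

def laplacian_pe(A, k):
    """First k non-trivial eigenvectors of the normalized Laplacian as positional features."""
    deg = A.sum(dim=1)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    L_norm = torch.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt
    _, evecs = torch.linalg.eigh(L_norm)   # eigenvectors sorted by eigenvalue
    return evecs[:, 1:k + 1]               # skip the trivial lambda = 0 eigenvector

A = torch.tensor([[0., 1., 0., 1.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 0., 1., 0.]])       # 4-cycle
pe = laplacian_pe(A, 2)
x_with_pe = torch.cat([torch.ones(4, 3), pe], dim=1)   # append PE to node features
print(x_with_pe.shape)   # torch.Size([4, 5])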

7. Summary

Graph Neural Networks combine structural topology with node features through a process that can be understood both as message passing (spatial) and filtering (spectral).

  1. Spatial View: GNNs refine node embeddings by aggregating neighbor information, a process limited by the 1-WL isomorphism test.
  2. Spectral View: GNNs apply localized polynomial filters to the graph signal, approximating the Graph Fourier Transform without explicit eigendecomposition.
  3. Universal Design: To achieve maximum expressivity, one must use injective aggregators (GIN) or higher-order structural motifs (\(k\)-WL).

References

  1. Xu, K., et al. (2019). How Powerful are Graph Neural Networks? ICLR.
  2. Kipf, T. N., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR.
  3. Defferrard, M., et al. (2016). Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NeurIPS.
  4. Bronstein, M. M., et al. (2021). Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges.
  5. Shuman, D. I., et al. (2013). The Emerging Field of Signal Processing on Graphs. IEEE Signal Processing Magazine.