Topic 05: Practice, Exercises, and Solutions
This module provides theoretical and practical exercises to reinforce the concepts of Information Theory in Deep Learning.
1. Theoretical Exercises
1.1 Entropy of a Mixture
Let \(p(x) = \sum_i \pi_i p_i(x)\) be a mixture of \(K\) distributions.
- Prove that \(h(X) \ge \sum_i \pi_i h(X_i)\), where \(h(X_i)\) denotes the entropy of the \(i\)-th component \(p_i\).
- When does equality hold?
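Before attempting the proof, it can help to see the inequality numerically. The sketch below (a Monte Carlo check, not a proof; the two-component 1-D Gaussian mixture and all its parameters are illustrative choices) estimates the mixture entropy \(h(X) = -\mathbb{E}[\log p(X)]\) by sampling and compares it to the convex combination of component entropies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture of two 1-D Gaussians: p(x) = pi_0 N(mu_0, s_0^2) + pi_1 N(mu_1, s_1^2)
pis = np.array([0.3, 0.7])
mus = np.array([-2.0, 2.0])
sigmas = np.array([1.0, 0.5])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Monte Carlo estimate of the mixture entropy h(X) = -E[log p(X)]
comp = rng.choice(2, size=200_000, p=pis)
x = rng.normal(mus[comp], sigmas[comp])
p_x = sum(pi * gauss_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas))
h_mix = -np.mean(np.log(p_x))

# Closed-form component entropies: h(X_i) = 0.5 * log(2 * pi * e * sigma_i^2)
h_comps = 0.5 * np.log(2 * np.pi * np.e * sigmas**2)
lower = np.sum(pis * h_comps)

print(h_mix, lower)  # the mixture entropy should exceed the convex combination
```

With well-separated means, the gap is large; try \(\mu_0 = \mu_1\) and \(\sigma_0 = \sigma_1\) to see the equality case.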
1.2 KL-Divergence Between Gaussians
Derive the closed-form expression for \(D_{KL}(P \| Q)\) where \(P = \mathcal{N}(\mu_0, \Sigma_0)\) and \(Q = \mathcal{N}(\mu_1, \Sigma_1)\) in \(d\) dimensions. Hint: Use the property that \(\mathbb{E}_P[x^\top A x] = \text{Tr}(A\Sigma_0) + \mu_0^\top A \mu_0\).
1.3 MI and Correlation
For a bivariate Gaussian with correlation \(\rho\), we found \(I(X; Y) = -\frac{1}{2} \log(1 - \rho^2)\).
- Show that \(I(X; Y) = 0\) if \(\rho = 0\).
- Compare this to the case of two non-Gaussian variables where \(\rho=0\) but \(I(X; Y) > 0\). Provide an example.
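A quick numerical companion to part one (the correlation value \(\rho = 0.8\) and the sample size are arbitrary choices): the Gaussian MI formula evaluated at the true \(\rho\) and at a sample estimate of \(\rho\) should nearly agree.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_mi(rho):
    # I(X; Y) = -0.5 * log(1 - rho^2) for a bivariate Gaussian (in nats)
    return -0.5 * np.log(1.0 - rho**2)

print(gaussian_mi(0.0))  # rho = 0 gives I(X; Y) = 0 exactly

# Plug-in estimate: sample data, measure rho, apply the formula
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)
rho_hat = np.corrcoef(xy.T)[0, 1]
print(gaussian_mi(rho), gaussian_mi(rho_hat))
```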
1.4 The Variational Bound for MI
Prove the Donsker-Varadhan representation \(D_{KL}(P \| Q) = \sup_{T} (\mathbb{E}_P[T] - \log \mathbb{E}_Q[e^T])\); note that any fixed \(T\) therefore gives a lower bound on \(D_{KL}(P \| Q)\). Hint: Use Gibbs' inequality and the fact that the optimal \(T\) is \(\log(P/Q) + c\).
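As a sanity check (not a proof), here is a tiny discrete example showing that the critic \(T = \log(p/q)\) attains the supremum exactly, while a perturbed critic only gives a lower bound. The distributions are arbitrary illustrative choices:

```python
import numpy as np

# Two discrete distributions on {0, 1, 2}
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

kl = np.sum(p * np.log(p / q))

# Optimal critic T* = log(p/q) (plus any constant) attains the supremum:
# E_Q[e^T*] = sum(q * p/q) = 1, so the bound collapses to the KL itself
T = np.log(p / q)
dv = np.sum(p * T) - np.log(np.sum(q * np.exp(T)))

# Any other critic only lower-bounds the KL
T_bad = T + 0.5 * np.array([1.0, -1.0, 0.0])
dv_bad = np.sum(p * T_bad) - np.log(np.sum(q * np.exp(T_bad)))

print(kl, dv, dv_bad)
```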
1.5 Data Processing Inequality (DPI)
Let \(X \to Y \to Z\) be a Markov chain. Prove that \(I(X; Z) \le I(X; Y)\). Relate this to the "Information Bottleneck" in a 3-layer neural network.
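The inequality can also be checked numerically for small discrete chains. The sketch below (random 3-state distributions; the alphabet sizes and Dirichlet draws are arbitrary) builds a Markov chain \(X \to Y \to Z\) from conditional tables and compares \(I(X; Y)\) with \(I(X; Z)\):

```python
import numpy as np

def mi(joint):
    # Mutual information (nats) from a joint pmf table
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask]))

rng = np.random.default_rng(2)
px = rng.dirichlet(np.ones(3))             # p(x)
P_yx = rng.dirichlet(np.ones(3), size=3)   # rows: p(y | x)
P_zy = rng.dirichlet(np.ones(3), size=3)   # rows: p(z | y)

pxy = px[:, None] * P_yx                   # joint p(x, y)
pxz = pxy @ P_zy                           # p(x, z) = sum_y p(x, y) p(z | y)

print(mi(pxy), mi(pxz))  # DPI: I(X; Z) <= I(X; Y)
```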
1.6 Entropy of a Sparse Vector
Consider a random vector \(X \in \{0, 1\}^n\) where each bit is 1 with probability \(p\).
- Calculate \(H(X)\) (the bits are independent, so the joint entropy factorizes).
- If \(p \to 0\) as \(n \to \infty\) such that \(np = \lambda\), what is the limiting entropy?
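For part one, the entropy of \(n\) i.i.d. bits is \(n\) times the binary entropy \(H_b(p)\). A minimal sketch (entropy measured in bits; the values of \(n\) and \(p\) are illustrative):

```python
import numpy as np

def binary_entropy(p):
    # H_b(p) in bits; the endpoints contribute zero entropy
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# For i.i.d. bits, H(X) = n * H_b(p)
n, p = 100, 0.05
H_X = n * binary_entropy(p)
print(binary_entropy(p), H_X)
```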
2. Coding Practice
Task 2.1: MI Estimation for a Gaussian Mixture
- Generate a 2D dataset from a Mixture of Gaussians.
- Implement a simple "Neural MI Estimator" (MINE-style) using PyTorch.
- Compare the estimated MI to the ground truth (calculated numerically or analytically if possible).
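One possible starting point is the sketch below. It is deliberately simplified: instead of the full mixture, it trains a MINE-style critic on a correlated Gaussian pair, where the ground-truth MI \(-\frac{1}{2}\log(1-\rho^2)\) is known in closed form, so the estimate can be checked directly. The network size, learning rate, and step count are arbitrary choices; adapt the data-generation part for the mixture task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: correlated 1-D Gaussian pair with known MI = -0.5 * log(1 - rho^2)
rho = 0.9
n = 4000
x = torch.randn(n, 1)
y = rho * x + (1 - rho**2) ** 0.5 * torch.randn(n, 1)
true_mi = -0.5 * torch.log(torch.tensor(1 - rho**2))

# MINE-style critic T(x, y)
T = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(T.parameters(), lr=1e-3)

for step in range(1500):
    perm = torch.randperm(n)
    joint = T(torch.cat([x, y], dim=1)).squeeze(1)        # samples from p(x, y)
    marg = T(torch.cat([x, y[perm]], dim=1)).squeeze(1)   # samples from p(x)p(y)
    # Donsker-Varadhan lower bound: E_joint[T] - log E_marg[e^T]
    log_mean_exp = torch.logsumexp(marg, dim=0) - torch.log(torch.tensor(float(n)))
    mi_lb = joint.mean() - log_mean_exp
    (-mi_lb).backward()
    opt.step()
    opt.zero_grad()

print(float(mi_lb), float(true_mi))  # the estimate should approach the truth from below
```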
Task 2.2: Entropy and Perplexity of a Language Model
- Load a small pre-trained model (e.g., GPT-2).
- Compute the cross-entropy loss on a paragraph of text.
- Calculate the Perplexity (\(e^{loss}\), valid when the loss is measured in nats, as in PyTorch's cross-entropy).
- How does Perplexity change if you "shuffle" the words in the paragraph? Why?
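To see the shuffling effect without downloading GPT-2, the sketch below uses a tiny add-one-smoothed bigram model as a stand-in language model (the training paragraph and smoothing scheme are toy choices, not part of the task): perplexity is computed as \(e^{loss}\) on the original and shuffled word order.

```python
import math
import random
from collections import Counter

# Toy stand-in for GPT-2: an add-one-smoothed word-bigram model
text = ("the cat sat on the mat and the dog sat on the rug "
        "the cat saw the dog and the dog saw the cat")
words = text.split()
V = len(set(words))
bigrams = Counter(zip(words, words[1:]))
contexts = Counter(words[:-1])

def nll(seq):
    # Average negative log-likelihood (nats per word) under the bigram model
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + V)  # add-one smoothing
        total -= math.log(p)
    return total / (len(seq) - 1)

shuffled = words[:]
random.seed(0)
random.shuffle(shuffled)

ppl_ordered = math.exp(nll(words))      # perplexity = e^{loss}
ppl_shuffled = math.exp(nll(shuffled))
print(ppl_ordered, ppl_shuffled)        # shuffling breaks local predictability
```

The same two-line perplexity computation applies unchanged once `nll` is replaced by the cross-entropy loss of a real model such as GPT-2.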
Task 2.3: Visualizing the Information Plane
Implement a 3-layer MLP on MNIST.
- Bin the hidden activations of the middle layer to estimate \(I(X; T)\) and \(I(T; Y)\).
- Plot these two values as training progresses.
- Do you observe the "compression phase"?
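The core ingredient of this task is the binning estimator. The sketch below demonstrates it on 1-D synthetic "activations" rather than a trained MLP (the tanh transform, noise level, and bin count are arbitrary illustrative choices); plug in real hidden-layer activations for the actual experiment.

```python
import numpy as np

def binned_mi(a, b, bins=30):
    # Discretize two 1-D variables and compute discrete MI (nats) from the histogram
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask]))

rng = np.random.default_rng(3)
x = rng.normal(size=50_000)
t = np.tanh(2 * x) + 0.1 * rng.normal(size=50_000)  # a noisy "hidden activation" of x
noise = rng.normal(size=50_000)                     # independent of x

print(binned_mi(x, t), binned_mi(x, noise))
```

Note that the binning estimator has a small positive bias for independent variables (roughly bins²/2N nats), which matters when interpreting information-plane plots.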
3. Hints and Solutions
Hint 1.2
\(D_{KL} = \frac{1}{2} [\text{Tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - d + \ln \frac{|\Sigma_1|}{|\Sigma_0|}]\).
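The closed form can be verified against a Monte Carlo estimate \(\mathbb{E}_P[\log p(x) - \log q(x)]\). In this sketch, the 2-D means and covariances are arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 2
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, -1.0])
S0 = np.array([[1.0, 0.3], [0.3, 1.0]])
S1 = np.array([[2.0, 0.0], [0.0, 0.5]])

# Closed form from the hint
S1_inv = np.linalg.inv(S1)
diff = mu1 - mu0
kl_closed = 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                   + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def log_gauss(x, mu, S):
    # Log-density of N(mu, S) evaluated at each row of x
    Sinv = np.linalg.inv(S)
    z = x - mu
    quad = np.einsum('ni,ij,nj->n', z, Sinv, z)
    return -0.5 * (quad + np.log(np.linalg.det(S)) + x.shape[1] * np.log(2 * np.pi))

# Monte Carlo: E_P[log p(x) - log q(x)]
x = rng.multivariate_normal(mu0, S0, size=200_000)
kl_mc = np.mean(log_gauss(x, mu0, S0) - log_gauss(x, mu1, S1))
print(kl_closed, kl_mc)
```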
Solution 1.3.2
Example: \(X \sim \mathcal{N}(0, 1)\), \(Y = X^2\). They are uncorrelated (\(\mathbb{E}[XY] = \mathbb{E}[X^3] = 0\)) but highly dependent; since \(Y\) is a deterministic function of the continuous variable \(X\), \(I(X; Y)\) is in fact infinite.
Hint 2.2
Shuffling words increases the entropy (specifically the conditional entropy \(H(x_t \mid x_{t-1}, \dots)\)) because the sequence becomes less predictable. Perplexity will spike.
Solution 2.3
In ReLU networks, explicit compression (a decrease in \(I(X; T)\)) is often not observed because the mapping from input to hidden layer remains effectively injective. Compression typically appears only with saturating activations (Tanh/Sigmoid) or explicit noise.