Topic 05: Practice, Exercises, and Solutions
This module provides theoretical and practical exercises to reinforce the concepts of Information Theory in Deep Learning.
1. Theoretical Exercises
1.1 Entropy of a Mixture
Let \(p(x) = \sum_i \pi_i p_i(x)\) be a mixture of \(K\) distributions.
- Prove that \(h(X) \ge \sum_i \pi_i h(X_i)\), where \(h(X_i)\) denotes the entropy of the \(i\)-th component \(p_i\).
- When does equality hold?
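Before attempting the proof, it can help to see the inequality numerically. The sketch below (a Monte Carlo check, not a proof; the two-component 1-D Gaussian mixture and all its parameters are illustrative choices) estimates the mixture entropy \(h(X) = -\mathbb{E}[\log p(X)]\) by sampling and compares it to the convex combination of component entropies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture of two 1-D Gaussians: p(x) = pi_0 N(mu_0, s_0^2) + pi_1 N(mu_1, s_1^2)
pis = np.array([0.3, 0.7])
mus = np.array([-2.0, 2.0])
sigmas = np.array([1.0, 0.5])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Monte Carlo estimate of the mixture entropy h(X) = -E[log p(X)]
comp = rng.choice(2, size=200_000, p=pis)
x = rng.normal(mus[comp], sigmas[comp])
p_x = sum(pi * gauss_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas))
h_mix = -np.mean(np.log(p_x))

# Closed-form component entropies: h(X_i) = 0.5 * log(2 * pi * e * sigma_i^2)
h_comps = 0.5 * np.log(2 * np.pi * np.e * sigmas**2)
lower = np.sum(pis * h_comps)

print(h_mix, lower)  # the mixture entropy should exceed the convex combination
```

With well-separated means, the gap is large; try \(\mu_0 = \mu_1\) and \(\sigma_0 = \sigma_1\) to see the equality case.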
1.2 KL-Divergence Between Gaussians
Derive the closed-form expression for \(D_{KL}(P \| Q)\) where \(P = \mathcal{N}(\mu_0, \Sigma_0)\) and \(Q = \mathcal{N}(\mu_1, \Sigma_1)\) in \(d\) dimensions. Hint: Use the property that \(\mathbb{E}_P[x^\top A x] = \text{Tr}(A\Sigma_0) + \mu_0^\top A \mu_0\).
1.3 MI and Correlation
For a bivariate Gaussian with correlation \(\rho\), we found \(I(X; Y) = -\frac{1}{2} \log(1 - \rho^2)\).
- Show that \(I(X; Y) = 0\) if \(\rho = 0\).
- Compare this to the case of two non-Gaussian variables where \(\rho=0\) but \(I(X; Y) > 0\). Provide an example.
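A quick numerical companion to part one (the correlation value \(\rho = 0.8\) and the sample size are arbitrary choices): the Gaussian MI formula evaluated at the true \(\rho\) and at a sample estimate of \(\rho\) should nearly agree.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_mi(rho):
    # I(X; Y) = -0.5 * log(1 - rho^2) for a bivariate Gaussian (in nats)
    return -0.5 * np.log(1.0 - rho**2)

print(gaussian_mi(0.0))  # rho = 0 gives I(X; Y) = 0 exactly

# Plug-in estimate: sample data, measure rho, apply the formula
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)
rho_hat = np.corrcoef(xy.T)[0, 1]
print(gaussian_mi(rho), gaussian_mi(rho_hat))
```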
1.4 The Variational Bound for MI
Prove the Donsker-Varadhan representation \(D_{KL}(P \| Q) = \sup_{T} (\mathbb{E}_P[T] - \log \mathbb{E}_Q[e^T])\); note that any fixed \(T\) therefore gives a lower bound on \(D_{KL}(P \| Q)\). Hint: Use Gibbs' inequality and the fact that the optimal \(T\) is \(\log(P/Q) + c\).
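As a sanity check (not a proof), here is a tiny discrete example showing that the critic \(T = \log(p/q)\) attains the supremum exactly, while a perturbed critic only gives a lower bound. The distributions are arbitrary illustrative choices:

```python
import numpy as np

# Two discrete distributions on {0, 1, 2}
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

kl = np.sum(p * np.log(p / q))

# Optimal critic T* = log(p/q) (plus any constant) attains the supremum:
# E_Q[e^T*] = sum(q * p/q) = 1, so the bound collapses to the KL itself
T = np.log(p / q)
dv = np.sum(p * T) - np.log(np.sum(q * np.exp(T)))

# Any other critic only lower-bounds the KL
T_bad = T + 0.5 * np.array([1.0, -1.0, 0.0])
dv_bad = np.sum(p * T_bad) - np.log(np.sum(q * np.exp(T_bad)))

print(kl, dv, dv_bad)
```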
1.5 Data Processing Inequality (DPI)
Let \(X \to Y \to Z\) be a Markov chain. Prove that \(I(X; Z) \le I(X; Y)\). Relate this to the "Information Bottleneck" in a 3-layer neural network.
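The inequality can also be checked numerically for small discrete chains. The sketch below (random 3-state distributions; the alphabet sizes and Dirichlet draws are arbitrary) builds a Markov chain \(X \to Y \to Z\) from conditional tables and compares \(I(X; Y)\) with \(I(X; Z)\):

```python
import numpy as np

def mi(joint):
    # Mutual information (nats) from a joint pmf table
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask]))

rng = np.random.default_rng(2)
px = rng.dirichlet(np.ones(3))             # p(x)
P_yx = rng.dirichlet(np.ones(3), size=3)   # rows: p(y | x)
P_zy = rng.dirichlet(np.ones(3), size=3)   # rows: p(z | y)

pxy = px[:, None] * P_yx                   # joint p(x, y)
pxz = pxy @ P_zy                           # p(x, z) = sum_y p(x, y) p(z | y)

print(mi(pxy), mi(pxz))  # DPI: I(X; Z) <= I(X; Y)
```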
1.6 Entropy of a Sparse Vector
Consider a random vector \(X \in \{0, 1\}^n\) where each bit is 1 with probability \(p\).
- Calculate \(H(X)\) (the bits are independent, so the joint entropy factorizes).
- If \(p \to 0\) as \(n \to \infty\) such that \(np = \lambda\), what is the limiting entropy?
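For part one, the entropy of \(n\) i.i.d. bits is \(n\) times the binary entropy \(H_b(p)\). A minimal sketch (entropy measured in bits; the values of \(n\) and \(p\) are illustrative):

```python
import numpy as np

def binary_entropy(p):
    # H_b(p) in bits; the endpoints contribute zero entropy
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# For i.i.d. bits, H(X) = n * H_b(p)
n, p = 100, 0.05
H_X = n * binary_entropy(p)
print(binary_entropy(p), H_X)
```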
2. Coding Practice
Task 2.1: MI Estimation for a Gaussian Mixture
- Generate a 2D dataset from a Mixture of Gaussians.
- Implement a simple "Neural MI Estimator" (MINE-style) using PyTorch.
- Compare the estimated MI to the ground truth (calculated numerically or analytically if possible).
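One possible starting point is the sketch below. It is deliberately simplified: instead of the full mixture, it trains a MINE-style critic on a correlated Gaussian pair, where the ground-truth MI \(-\frac{1}{2}\log(1-\rho^2)\) is known in closed form, so the estimate can be checked directly. The network size, learning rate, and step count are arbitrary choices; adapt the data-generation part for the mixture task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: correlated 1-D Gaussian pair with known MI = -0.5 * log(1 - rho^2)
rho = 0.9
n = 4000
x = torch.randn(n, 1)
y = rho * x + (1 - rho**2) ** 0.5 * torch.randn(n, 1)
true_mi = -0.5 * torch.log(torch.tensor(1 - rho**2))

# MINE-style critic T(x, y)
T = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(T.parameters(), lr=1e-3)

for step in range(1500):
    perm = torch.randperm(n)
    joint = T(torch.cat([x, y], dim=1)).squeeze(1)        # samples from p(x, y)
    marg = T(torch.cat([x, y[perm]], dim=1)).squeeze(1)   # samples from p(x)p(y)
    # Donsker-Varadhan lower bound: E_joint[T] - log E_marg[e^T]
    log_mean_exp = torch.logsumexp(marg, dim=0) - torch.log(torch.tensor(float(n)))
    mi_lb = joint.mean() - log_mean_exp
    (-mi_lb).backward()
    opt.step()
    opt.zero_grad()

print(float(mi_lb), float(true_mi))  # the estimate should approach the truth from below
```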
Task 2.2: Entropy and Perplexity of a Language Model
- Load a small pre-trained model (e.g., GPT-2).
- Compute the cross-entropy loss on a paragraph of text.
- Calculate the Perplexity (\(e^{loss}\), valid when the loss is measured in nats, as in PyTorch's cross-entropy).
- How does Perplexity change if you "shuffle" the words in the paragraph? Why?
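To see the shuffling effect without downloading GPT-2, the sketch below uses a tiny add-one-smoothed bigram model as a stand-in language model (the training paragraph and smoothing scheme are toy choices, not part of the task): perplexity is computed as \(e^{loss}\) on the original and shuffled word order.

```python
import math
import random
from collections import Counter

# Toy stand-in for GPT-2: an add-one-smoothed word-bigram model
text = ("the cat sat on the mat and the dog sat on the rug "
        "the cat saw the dog and the dog saw the cat")
words = text.split()
V = len(set(words))
bigrams = Counter(zip(words, words[1:]))
contexts = Counter(words[:-1])

def nll(seq):
    # Average negative log-likelihood (nats per word) under the bigram model
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + V)  # add-one smoothing
        total -= math.log(p)
    return total / (len(seq) - 1)

shuffled = words[:]
random.seed(0)
random.shuffle(shuffled)

ppl_ordered = math.exp(nll(words))      # perplexity = e^{loss}
ppl_shuffled = math.exp(nll(shuffled))
print(ppl_ordered, ppl_shuffled)        # shuffling breaks local predictability
```

The same two-line perplexity computation applies unchanged once `nll` is replaced by the cross-entropy loss of a real model such as GPT-2.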
Task 2.3: Visualizing the Information Plane
Implement a 3-layer MLP on MNIST.
- Bin the hidden activations of the middle layer to estimate \(I(X; T)\) and \(I(T; Y)\).
- Plot these two values as training progresses.
- Do you observe the "compression phase"?
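The core ingredient of this task is the binning estimator. The sketch below demonstrates it on 1-D synthetic "activations" rather than a trained MLP (the tanh transform, noise level, and bin count are arbitrary illustrative choices); plug in real hidden-layer activations for the actual experiment.

```python
import numpy as np

def binned_mi(a, b, bins=30):
    # Discretize two 1-D variables and compute discrete MI (nats) from the histogram
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask]))

rng = np.random.default_rng(3)
x = rng.normal(size=50_000)
t = np.tanh(2 * x) + 0.1 * rng.normal(size=50_000)  # a noisy "hidden activation" of x
noise = rng.normal(size=50_000)                     # independent of x

print(binned_mi(x, t), binned_mi(x, noise))
```

Note that the binning estimator has a small positive bias for independent variables (roughly bins²/2N nats), which matters when interpreting information-plane plots.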
3. Hints and Solutions
Hint 1.2
\(D_{KL} = \frac{1}{2} [\text{Tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - d + \ln \frac{|\Sigma_1|}{|\Sigma_0|}]\).
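The closed form can be verified against a Monte Carlo estimate \(\mathbb{E}_P[\log p(x) - \log q(x)]\). In this sketch, the 2-D means and covariances are arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 2
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, -1.0])
S0 = np.array([[1.0, 0.3], [0.3, 1.0]])
S1 = np.array([[2.0, 0.0], [0.0, 0.5]])

# Closed form from the hint
S1_inv = np.linalg.inv(S1)
diff = mu1 - mu0
kl_closed = 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                   + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def log_gauss(x, mu, S):
    # Log-density of N(mu, S) evaluated at each row of x
    Sinv = np.linalg.inv(S)
    z = x - mu
    quad = np.einsum('ni,ij,nj->n', z, Sinv, z)
    return -0.5 * (quad + np.log(np.linalg.det(S)) + x.shape[1] * np.log(2 * np.pi))

# Monte Carlo: E_P[log p(x) - log q(x)]
x = rng.multivariate_normal(mu0, S0, size=200_000)
kl_mc = np.mean(log_gauss(x, mu0, S0) - log_gauss(x, mu1, S1))
print(kl_closed, kl_mc)
```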
Solution 1.3.2
Example: \(X \sim \mathcal{N}(0, 1)\), \(Y = X^2\). They are uncorrelated (\(\mathbb{E}[XY] = \mathbb{E}[X^3] = 0\)) but highly dependent; since \(Y\) is a deterministic function of the continuous variable \(X\), \(I(X; Y)\) is in fact infinite.
Hint 2.2
Shuffling words increases the entropy (specifically the conditional entropy \(H(x_t \mid x_{t-1}, \dots)\)) because the sequence becomes less predictable. Perplexity will spike.
Solution 2.3
In ReLU networks, explicit compression (a decrease in \(I(X; T)\)) is often not observed because the mapping from input to hidden layer remains effectively injective. Compression typically appears only with saturating activations (Tanh/Sigmoid) or explicit noise.