Practice 08: Bayesian and Probabilistic Machine Learning¶
1. Theoretical Exercises¶
1.1 The MAP Estimate and Weight Decay¶
Problem: Prove that for a linear model \(y = w^\top x + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, \sigma^2)\) and a Gaussian prior \(w \sim \mathcal{N}(0, \lambda^{-1} I)\), the Maximum A Posteriori (MAP) estimate is equivalent to Ridge Regression (L2 regularization). Task: Write the log-posterior \(\log p(w|D)\) and show it is proportional to \(-\|y - Xw\|^2 - \alpha \|w\|^2\).
1.2 HMC and Volume Preservation¶
Problem: The Leapfrog integrator is a symplectic integrator (a small numerical check of this property is sketched after the questions below).
- What does it mean for a mapping to be volume-preserving in phase space?
- Why is volume preservation (unit determinant of the Jacobian) crucial for the Metropolis-Hastings acceptance step in HMC?
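For intuition, here is a minimal sketch of one leapfrog step together with a finite-difference check that the Jacobian of the map \((q, p) \mapsto (q', p')\) has unit determinant. The 1D quadratic potential \(U(q) = q^2/2\), the step size, and the function names are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def grad_U(q):
    # Gradient of the (illustrative) potential U(q) = q^2 / 2.
    return q

def leapfrog(q, p, eps=0.1):
    # One leapfrog step: half momentum, full position, half momentum.
    p = p - 0.5 * eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

# Finite-difference Jacobian of (q, p) -> (q', p'); its determinant should be ~1.
q0, p0, h = 0.7, -0.3, 1e-5
J = np.empty((2, 2))
for i, (dq, dp) in enumerate([(h, 0.0), (0.0, h)]):
    q_hi, p_hi = leapfrog(q0 + dq, p0 + dp)
    q_lo, p_lo = leapfrog(q0 - dq, p0 - dp)
    J[:, i] = [(q_hi - q_lo) / (2 * h), (p_hi - p_lo) / (2 * h)]
print("det(Jacobian):", np.linalg.det(J))  # ~1.0 (volume preserved)
```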
1.3 NNGP and the Arc-Cosine Kernel¶
Problem: For a single-layer ReLU network \(h(x) = \text{ReLU}(\sum w_i x_i)\), the NNGP kernel is \(K(x, x') = \frac{1}{2\pi} \|x\| \|x'\| (\sin \theta + (\pi - \theta) \cos \theta)\), where \(\theta = \arccos(\frac{x^\top x'}{\|x\| \|x'\|})\). A small numerical sketch of this kernel follows the questions below.
- What happens to the correlation between \(x\) and \(x'\) as \(\theta \to \pi\) (opposite directions)?
- How does this compare to the standard RBF (Gaussian) kernel?
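As a quick sanity check of the formula, here is a small sketch that evaluates the kernel for a few angles; the vectors and the helper name `arccos_kernel` are illustrative choices.

```python
import numpy as np

def arccos_kernel(x, xp):
    # Order-1 arc-cosine (NNGP) kernel for a single ReLU layer, as defined above.
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    theta = np.arccos(np.clip(x @ xp / (nx * nxp), -1.0, 1.0))
    return nx * nxp / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

x = np.array([1.0, 0.0])
for xp in [x, np.array([0.0, 1.0]), -x]:   # theta = 0, pi/2, pi
    print(arccos_kernel(x, xp))
```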
1.4 Variational Inference and Information Theory¶
Problem: The ELBO can be written as \(\log p(x) - D_{KL}(q(z) \| p(z|x))\).
- Prove that the ELBO is always a lower bound on the log-evidence (i.e., \(D_{KL} \ge 0\)).
- When is the bound tight?
1.5 Conformal Coverage Guarantee¶
Problem: Suppose you have \(n=99\) calibration points and you want \(90\%\) coverage (\(\alpha=0.1\)).
- Which sorted score \(S_{(k)}\) should you pick as your threshold \(\hat{q}\)?
- If the data is not exchangeable (e.g., there is a temporal trend), why does the coverage guarantee fail?
1.6 Reparameterization vs. Score Function¶
Problem: Contrast the variance of the REINFORCE gradient estimator \(\nabla_\theta \mathbb{E}_{q_\theta}[f(z)]\) and the reparameterization estimator. Task: Explain why the reparameterization trick is generally preferred for Variational Autoencoders (VAEs).
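The contrast is easy to see numerically. Below is a toy comparison (the objective \(f(z) = z^2\) with \(z \sim \mathcal{N}(\mu, \sigma^2)\) is an illustrative assumption, not part of the exercise) estimating \(\partial_\mu \mathbb{E}[f(z)]\), whose true value is \(2\mu\), with both estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 100_000
eps = rng.standard_normal(n)
z = mu + sigma * eps

# REINFORCE / score-function estimator: f(z) * d/dmu log q(z).
score_fn = z**2 * (z - mu) / sigma**2
# Reparameterization (pathwise) estimator: d/dmu f(mu + sigma * eps).
reparam = 2 * (mu + sigma * eps)

for name, g in [("score function", score_fn), ("reparameterization", reparam)]:
    print(f"{name}: mean={g.mean():.3f}, variance={g.var():.3f}")
```

Both sample means are close to \(2\mu = 2\), but the score-function estimator's variance is substantially larger in this toy case, which is the practical reason VAEs rely on the reparameterization trick.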
1.7 Normalizing Flows and Dimensionality¶
Problem: Standard Normalizing Flows require the mapping \(f: \mathbb{R}^d \to \mathbb{R}^d\) to be a bijection (homeomorphism).
- Can a Normalizing Flow change the dimensionality of the representation?
- How do "Injective Flows" or "Manifold Flows" attempt to bypass this?
1.8 The Cold Posterior Mystery¶
Problem: Research the "Cold Posterior" effect.
- Why do practitioners often use \(p(w|D)^{1/T}\) with \(T < 1\)?
- Give one hypothesis for why \(T=1\) (the true Bayes posterior) often performs worse on test data than \(T=0.1\).
2. Coding Practice¶
2.1 Metropolis-Hastings from Scratch¶
Task: Implement a simple Metropolis-Hastings sampler to sample from a 2D Mixture of Gaussians. A starting sketch follows the sub-tasks below.
- Visualize the chain's path.
- Calculate the "Effective Sample Size" (ESS).
2.2 Conformal Regression on Synthetic Data¶
Task (a minimal sketch of the conformal step follows these steps):
- Generate data from a heteroscedastic process: \(y = x \sin(x) + \epsilon(x)\), where the noise scale \(\sigma(x)\) of \(\epsilon(x)\) increases with \(x\).
- Train a simple MLP to predict the mean.
- Use Split Conformal Prediction to generate \(95\%\) prediction intervals.
- Verify that the coverage is indeed \(95\%\) across the test set.
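A minimal sketch of the split-conformal step, assuming the MLP is already trained and exposes a `model.predict` method; the function and variable names here are illustrative.

```python
import numpy as np

def split_conformal_intervals(model, X_cal, y_cal, X_test, alpha=0.05):
    # Non-conformity scores on a held-out calibration set: absolute residuals.
    scores = np.sort(np.abs(y_cal - model.predict(X_cal)))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # finite-sample corrected rank
    q_hat = scores[min(k, n) - 1]             # k-th smallest score
    preds = model.predict(X_test)
    return preds - q_hat, preds + q_hat

# Coverage check once the model and the data splits exist:
# lower, upper = split_conformal_intervals(model, X_cal, y_cal, X_test)
# print(np.mean((y_test >= lower) & (y_test <= upper)))   # should be close to 0.95
```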
3. Hints & Solutions¶
3.1 Hints¶
- 1.1: Use Bayes' Rule: \(p(w|D) \propto p(D|w)p(w)\). Take the negative log. (A worked sketch follows this list.)
- 1.4: Use Jensen's Inequality on the concave function \(\log(\cdot)\).
- 2.2: The non-conformity score for regression is typically \(|y - \hat{y}|\).
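For 1.1, the hint above expands into a short worked sketch (all equalities hold up to additive constants that do not depend on \(w\)):

\[
\log p(w \mid D) = \log p(D \mid w) + \log p(w) + \text{const}
= -\frac{1}{2\sigma^2}\|y - Xw\|^2 - \frac{\lambda}{2}\|w\|^2 + \text{const},
\]

so maximizing the posterior is the same as minimizing \(\|y - Xw\|^2 + \alpha \|w\|^2\) with \(\alpha = \lambda \sigma^2\), i.e. the Ridge Regression objective.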
3.2 Solutions (Brief)¶
- 1.2: If the mapping is not volume-preserving, the Metropolis-Hastings ratio would require a complex Jacobian determinant term, which is computationally expensive to calculate for every step. Symplecticity ensures the determinant is 1.
- 1.5: \(\hat{q} = S_{(\lceil (n+1)(1-\alpha) \rceil)} = S_{(\lceil 100 \times 0.9 \rceil)} = S_{(90)}\). Pick the \(90^{th}\) smallest score, i.e. the \(90^{th}\) order statistic of the scores sorted in ascending order.
- 1.7: No, a homeomorphism must preserve dimensionality. Manifold flows typically use zero-padding or an encoder-decoder structure to change the dimensionality.