Foundations of Differential Entropy and High-Dimensional Information¶
1. Introduction to Differential Entropy¶
Differential entropy extends the concept of Shannon entropy from discrete probability distributions to continuous probability density functions (PDFs). While discrete entropy measures the absolute uncertainty or information content of a discrete random variable, differential entropy is relative and can even be negative. Despite these differences, it serves as the foundational mathematical tool for analyzing information in continuous systems, which is critical for machine learning, deep learning, and signal processing.
Let \(X\) be a continuous random variable with probability density function \(p(x)\) supported on \(\mathcal{S}\). The differential entropy \(h(X)\) is defined as:
$$ h(X) = -\int_{\mathcal{S}} p(x) \log p(x) \, dx $$
where the logarithm is typically taken base \(2\) (yielding units of bits) or base \(e\) (yielding units of nats). Throughout this text, unless specified otherwise, we assume natural logarithms for mathematical convenience in proofs.
1.1 Contrast with Discrete Entropy¶
For a discrete random variable \(X\), the entropy \(H(X) \ge 0\). However, for a continuous random variable, \(p(x)\) can be strictly greater than \(1\) (as long as it integrates to \(1\)). Consequently, \(\log p(x)\) can be positive, leading to negative values for \(h(X)\).
Consider the uniform distribution over the interval \([0, a]\). The density is \(p(x) = \frac{1}{a}\) for \(x \in [0, a]\). The differential entropy is:
$$ h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a} \, dx = \log a $$
If \(a < 1\), then \(h(X) < 0\). This highlights that differential entropy is not an absolute measure of information but rather a relative one.
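As a quick numerical illustration of this sign behavior (a minimal sketch; the interval widths below are arbitrary choices), we can evaluate \(h(X) = \log a\) for a few values of \(a\), or equivalently estimate it by Monte Carlo as the sample mean of \(-\log p(X)\):
import numpy as np

rng = np.random.default_rng(0)
for a in [0.5, 1.0, 2.0]:
    x = rng.uniform(0, a, size=100_000)
    # For the uniform density p(x) = 1/a, -log p(x) = log a for every sample,
    # so the Monte Carlo average trivially matches the closed form log a.
    h_mc = np.mean(-np.log(np.full_like(x, 1.0 / a)))
    print(f"a = {a}: h(X) = {np.log(a):.4f} nats (Monte Carlo: {h_mc:.4f})")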
2. Properties of Differential Entropy¶
Theorem 2.1 (Translation and Scaling)
Let \(X\) be a continuous random variable with differential entropy \(h(X)\). Let \(Y = cX + a\), where \(c, a \in \mathbb{R}\) and \(c \neq 0\). Then:
$$ h(Y) = h(X) + \log |c| $$
Proof: By the change of variables formula for probability densities, if \(Y = cX + a\), the density of \(Y\) is given by:
$$ p_Y(y) = \frac{1}{|c|} \, p_X\!\left(\frac{y - a}{c}\right) $$
Substitute this into the definition of differential entropy:
$$ h(Y) = -\int p_Y(y) \log p_Y(y) \, dy = -\int \frac{1}{|c|} \, p_X\!\left(\frac{y - a}{c}\right) \log\left[\frac{1}{|c|} \, p_X\!\left(\frac{y - a}{c}\right)\right] dy $$
Let \(x = \frac{y - a}{c}\). Then \(dy = |c| \, dx\), and:
$$ h(Y) = -\int p_X(x) \left[\log p_X(x) - \log |c|\right] dx $$
Since \(\int p_X(x) \, dx = 1\), we have:
$$ h(Y) = -\int p_X(x) \log p_X(x) \, dx + \log |c| = h(X) + \log |c| $$
This completes the proof. \(\blacksquare\)
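As a sanity check of Theorem 2.1 (a minimal sketch; the values of \(\sigma\), \(c\), and \(a\) below are arbitrary), we can use the closed-form Gaussian entropy \(h = \frac{1}{2}\log(2\pi e \sigma^2)\) derived in Section 4, since \(Y = cX + a\) is again Gaussian with variance \(c^2 \sigma^2\):
import numpy as np

sigma, c, a = 1.5, -3.0, 2.0                             # arbitrary scale, multiplier, shift
h_X = 0.5 * np.log(2 * np.pi * np.e * sigma**2)          # h(X) for X ~ N(0, sigma^2)
h_Y = 0.5 * np.log(2 * np.pi * np.e * (c * sigma)**2)    # Y = cX + a is N(a, c^2 sigma^2)
print(h_Y, h_X + np.log(abs(c)))                         # the two values coincide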
Theorem 2.2 (Asymptotic Equipartition Property (AEP) for Continuous Variables)
Let \(X_1, X_2, \dots, X_n\) be a sequence of independent and identically distributed (i.i.d.) continuous random variables with PDF \(p(x)\). Let \(p(X_1, \dots, X_n) = \prod_{i=1}^n p(X_i)\). Then, as \(n \to \infty\):
$$ -\frac{1}{n} \log p(X_1, \dots, X_n) \xrightarrow{\;P\;} h(X) $$
Proof: By the definition of the joint density of i.i.d. variables:
$$ -\frac{1}{n} \log p(X_1, \dots, X_n) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i) $$
Let \(Y_i = -\log p(X_i)\). The expected value of \(Y_i\) is:
$$ \mathbb{E}[Y_i] = -\int p(x) \log p(x) \, dx = h(X) $$
Assuming the variance of \(Y_i\) is finite, we can apply the Weak Law of Large Numbers (WLLN): the sample average of i.i.d. variables converges in probability to the expected value:
$$ \frac{1}{n} \sum_{i=1}^n Y_i = -\frac{1}{n} \sum_{i=1}^n \log p(X_i) \xrightarrow{\;P\;} h(X) $$
Thus, the result is proven. \(\blacksquare\)
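A small simulation illustrating the continuous AEP (a sketch under the assumption that the source is a zero-mean Gaussian, so \(\log p(x)\) is available in closed form): the normalized log-density of a long i.i.d. sequence concentrates around \(h(X)\).
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
h_true = 0.5 * np.log(2 * np.pi * np.e * sigma**2)    # h(X) for N(0, sigma^2)

for n in [10, 100, 10_000]:
    x = rng.normal(0.0, sigma, size=n)
    log_p = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
    print(f"n = {n}: -1/n log p = {-np.mean(log_p):.4f}, h(X) = {h_true:.4f}")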
3. The Entropy Power Inequality (EPI)¶
The Entropy Power Inequality is a fundamental result in information theory, originally stated by Shannon and rigorously proven by Stam. It relates the entropy of a sum of independent random variables to the entropies of the individual variables.
Definition (Entropy Power)
The entropy power \(N(X)\) of a continuous random variable \(X \in \mathbb{R}^d\) is defined as the per-component variance of a white Gaussian random vector that has the same differential entropy as \(X\). Mathematically:
$$ N(X) = \frac{1}{2\pi e} \, e^{\frac{2}{d} h(X)} $$
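A one-line check of the definition (a sketch; the variance value is arbitrary): for a one-dimensional Gaussian, plugging \(h(X) = \frac{1}{2}\log(2\pi e \sigma^2)\) into the formula recovers \(N(X) = \sigma^2\), consistent with the interpretation of entropy power as the variance of an entropy-matched Gaussian.
import numpy as np

def entropy_power(h, d=1):
    # N(X) = exp(2 h / d) / (2 pi e), with h in nats
    return np.exp(2.0 * h / d) / (2 * np.pi * np.e)

sigma2 = 3.0                                          # arbitrary variance
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma2)
print(entropy_power(h_gauss))                         # prints sigma2 = 3.0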
Theorem 3.1 (Entropy Power Inequality)
Let \(X\) and \(Y\) be independent continuous random variables in \(\mathbb{R}^d\). Then:
$$ N(X + Y) \ge N(X) + N(Y) $$
Or equivalently:
$$ e^{\frac{2}{d} h(X+Y)} \ge e^{\frac{2}{d} h(X)} + e^{\frac{2}{d} h(Y)} $$
Proof of the EPI using de Bruijn's Identity: We sketch the argument in the 1-dimensional case (\(d=1\)). The proof relies heavily on de Bruijn's identity, which relates the derivative of differential entropy under Gaussian noise perturbation to Fisher information.
- Fisher Information: For a random variable \(X\) with density \(p(x)\), the Fisher information \(J(X)\) with respect to a translation parameter is:
$$ J(X) = \int \frac{\left(p'(x)\right)^2}{p(x)} \, dx = \mathbb{E}\left[\left(\frac{d}{dx} \log p(X)\right)^2\right] $$
- de Bruijn's Identity: Let \(Z_t \sim \mathcal{N}(0, t)\) be independent of \(X\). Let \(X_t = X + Z_t\). Then:
$$ \frac{d}{dt} h(X_t) = \frac{1}{2} J(X_t) $$
- Fisher Information Inequality (FII): For independent \(X, Y\):
$$ \frac{1}{J(X + Y)} \ge \frac{1}{J(X)} + \frac{1}{J(Y)} $$
This follows from the Cauchy-Schwarz inequality.
- Integration: We analyze the function \(f(t) = N(X_t + Y_t)\). Note that \(X_t + Y_t = (X + Y) + \sqrt{2t}\,Z\), where \(Z \sim \mathcal{N}(0, 1)\), so de Bruijn's identity (applied to added noise of variance \(2t\)) gives \(\frac{d}{dt} h(X_t + Y_t) = J(X_t + Y_t)\) and hence \(f'(t) = 2\,N(X_t + Y_t)\,J(X_t + Y_t)\). Combining this with the FII and the behavior of Fisher information under scaling, one shows that the relevant comparison of entropy powers has non-positive second derivative along the Gaussian perturbation path, i.e. the entropy power is concave under the addition of Gaussian noise. Stam's original proof integrates this differential relationship from \(t = 0\) to \(t = \infty\), where both variables become asymptotically Gaussian, culminating in the EPI. \(\blacksquare\)
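A minimal numerical check of de Bruijn's identity (restricted to a Gaussian \(X\), where both sides are available in closed form; the variance, noise level, and step size below are arbitrary): for \(X \sim \mathcal{N}(0, \sigma^2)\), \(X_t \sim \mathcal{N}(0, \sigma^2 + t)\), so \(h(X_t) = \frac{1}{2}\log(2\pi e(\sigma^2 + t))\) and \(J(X_t) = 1/(\sigma^2 + t)\).
import numpy as np

sigma2, t, dt = 1.0, 0.5, 1e-5        # arbitrary variance, noise level, finite-difference step
h = lambda s: 0.5 * np.log(2 * np.pi * np.e * (sigma2 + s))   # h(X_t) for Gaussian X
J = 1.0 / (sigma2 + t)                                        # Fisher information of N(0, sigma^2 + t)

dh_dt = (h(t + dt) - h(t - dt)) / (2 * dt)   # central finite difference of h(X_t)
print(dh_dt, 0.5 * J)                        # the two values agree, as de Bruijn's identity predicts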
4. High-Dimensional Gaussian Entropy¶
Gaussian distributions are ubiquitous in information theory because they maximize differential entropy for a given covariance matrix.
Theorem 4.1 (Entropy of a Multivariate Gaussian)
Let \(X \sim \mathcal{N}(\mu, \Sigma)\) be a \(d\)-dimensional multivariate Gaussian random variable with mean vector \(\mu\) and covariance matrix \(\Sigma\). Its differential entropy is:
$$ h(X) = \frac{1}{2} \log\left((2\pi e)^d |\Sigma|\right) $$
Proof: The PDF of a multivariate Gaussian is:
$$ p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) $$
Taking the natural logarithm:
$$ \log p(x) = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) $$
Now, take the expectation of \(-\log p(x)\) with respect to \(X\):
$$ h(X) = \mathbb{E}[-\log p(X)] = \frac{d}{2} \log(2\pi) + \frac{1}{2} \log|\Sigma| + \frac{1}{2} \mathbb{E}\left[(X - \mu)^T \Sigma^{-1} (X - \mu)\right] $$
Using the trace trick for the expectation of a quadratic form, \(\mathbb{E}[z^T A z] = \text{Tr}(A \, \mathbb{E}[z z^T])\), with \(z = X - \mu\) (so \(\mathbb{E}[z] = 0\)):
$$ \mathbb{E}\left[(X - \mu)^T \Sigma^{-1} (X - \mu)\right] = \text{Tr}\left(\Sigma^{-1} \, \mathbb{E}\left[(X - \mu)(X - \mu)^T\right]\right) $$
Since \(\mathbb{E}[(X - \mu)(X - \mu)^T] = \Sigma\):
$$ \text{Tr}\left(\Sigma^{-1} \Sigma\right) = \text{Tr}(I_d) = d $$
Substitute this back into the entropy equation:
$$ h(X) = \frac{d}{2} \log(2\pi) + \frac{1}{2} \log|\Sigma| + \frac{d}{2} = \frac{1}{2} \log\left((2\pi e)^d |\Sigma|\right) $$
This completes the proof. \(\blacksquare\)
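A small helper evaluating Theorem 4.1 numerically (a sketch, not tied to any particular library); `np.linalg.slogdet` is used so the log-determinant stays stable in higher dimensions:
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy (in nats) of N(mu, Sigma); the mean does not affect h."""
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    d = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0, "covariance matrix must be positive definite"
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

print(gaussian_entropy([[1.0, 0.5], [0.5, 2.0]]))   # matches the closed form used in Demo 1 below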
Theorem 4.2 (Maximum Entropy Principle for Gaussians)
Let \(X\) be a random vector in \(\mathbb{R}^d\) with zero mean and covariance matrix \(K=\mathbb{E}[XX^T]\). Let \(\phi_K\) be the density of a zero-mean Gaussian \(\mathcal{N}(0, K)\). Then \(h(X) \le h(\phi_K)\).
Proof: Using the non-negativity of the relative entropy (Kullback-Leibler divergence), \(D(p \,\|\, \phi_K) \ge 0\):
$$ 0 \le D(p \,\|\, \phi_K) = \int p(x) \log \frac{p(x)}{\phi_K(x)} \, dx = -h(X) - \int p(x) \log \phi_K(x) \, dx $$
Evaluate the cross term:
$$ -\int p(x) \log \phi_K(x) \, dx = -\int p(x) \left[-\frac{1}{2} \log\left((2\pi)^d |K|\right) - \frac{1}{2} x^T K^{-1} x\right] dx $$
Since \(\int x x^T p(x) \, dx = K\), the quadratic term contributes \(\frac{1}{2} \text{Tr}(K^{-1} K) = \frac{d}{2}\), so:
$$ -\int p(x) \log \phi_K(x) \, dx = \frac{1}{2} \log\left((2\pi)^d |K|\right) + \frac{d}{2} = \frac{1}{2} \log\left((2\pi e)^d |K|\right) = h(\phi_K) $$
Substituting back into the divergence inequality gives \(0 \le -h(X) + h(\phi_K)\).
Thus, \(h(\phi_K) \ge h(X)\). \(\blacksquare\)
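A concrete illustration of Theorem 4.2 using closed forms only (no estimation; the uniform distribution is an arbitrary non-Gaussian choice): \(\mathcal{U}(0, 1)\) has variance \(1/12\) and \(h = 0\) nats, while the Gaussian with the same variance has strictly larger entropy.
import numpy as np

var_uniform = 1.0 / 12.0                          # variance of U(0, 1)
h_uniform = 0.0                                   # h(U(0, 1)) = log(1) = 0 nats
h_gauss_matched = 0.5 * np.log(2 * np.pi * np.e * var_uniform)
print(h_uniform, h_gauss_matched)                 # 0.0 < ~0.176 nats, as Theorem 4.2 predicts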
5. Worked Examples¶
Example 1: Entropy of the Exponential Distribution¶
Let \(X\) be an exponentially distributed random variable with rate parameter \(\lambda > 0\). The PDF is \(p(x) = \lambda e^{-\lambda x}\) for \(x \ge 0\). Find \(h(X)\).
Solution: By definition:
$$ h(X) = -\int_0^\infty \lambda e^{-\lambda x} \log\left(\lambda e^{-\lambda x}\right) dx = -\int_0^\infty \lambda e^{-\lambda x} \left(\log \lambda - \lambda x\right) dx $$
Since the integral of \(p(x)\) is \(1\), and the mean of the exponential distribution is \(\mathbb{E}[X] = 1/\lambda\):
$$ h(X) = -\log \lambda + \lambda \, \mathbb{E}[X] = 1 - \log \lambda \text{ nats} $$
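A quick numerical cross-check of Example 1 (a sketch using `scipy.integrate.quad`; the choice \(\lambda = 2\) is arbitrary):
import numpy as np
from scipy.integrate import quad

lam = 2.0                                            # arbitrary rate parameter
p = lambda x: lam * np.exp(-lam * x)
h_numeric, _ = quad(lambda x: -p(x) * np.log(p(x)), 0, np.inf)
print(h_numeric, 1 - np.log(lam))                    # both equal 1 - log(lambda)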
Example 2: Sum of Independent Uniform Random Variables¶
Let \(X_1 \sim \mathcal{U}(0, 1)\) and \(X_2 \sim \mathcal{U}(0, 1)\) be independent. We want to find the differential entropy of their sum \(Y = X_1 + X_2\).
Solution: The PDF of the sum of two independent variables is the convolution of their PDFs. Thus, the PDF of \(Y\) is a triangular distribution:
$$ p_Y(y) = \begin{cases} y & 0 \le y \le 1 \\ 2 - y & 1 < y \le 2 \\ 0 & \text{otherwise} \end{cases} $$
The entropy is:
$$ h(Y) = -\int_0^1 y \log y \, dy - \int_1^2 (2 - y) \log(2 - y) \, dy $$
By symmetry (substituting \(z = 2 - y\)), \(\int_1^2 (2-y) \log(2-y) \, dy = \int_0^1 z \log z \, dz\). Thus:
$$ h(Y) = -2 \int_0^1 y \log y \, dy $$
Using integration by parts: let \(u = \log y\), \(dv = y \, dy\). Then \(du = \frac{1}{y} dy\) and \(v = \frac{y^2}{2}\):
$$ \int_0^1 y \log y \, dy = \left[\frac{y^2}{2} \log y\right]_0^1 - \int_0^1 \frac{y}{2} \, dy $$
Evaluating from \(0\) to \(1\) (using L'Hopital's rule for the limit as \(y \to 0\)):
$$ \int_0^1 y \log y \, dy = 0 - \frac{1}{4} = -\frac{1}{4} $$
So, \(h(Y) = -2(-\frac{1}{4}) = \frac{1}{2}\) nats.
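The same numerical cross-check for Example 2 (a sketch; the integrand returns \(0\) where the triangular density vanishes, and a break point at \(y = 1\) is passed to the quadrature routine):
import numpy as np
from scipy.integrate import quad

def neg_p_log_p(y):
    p = y if y <= 1 else 2 - y                 # triangular density on [0, 2]
    return -p * np.log(p) if p > 0 else 0.0

h_numeric, _ = quad(neg_p_log_p, 0, 2, points=[1])
print(h_numeric)                               # approximately 0.5 nats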
Example 3: Verifying the EPI for Two Gaussians¶
Let \(X \sim \mathcal{N}(0, \sigma_1^2)\) and \(Y \sim \mathcal{N}(0, \sigma_2^2)\) be independent. \(h(X) = \frac{1}{2} \log(2\pi e \sigma_1^2)\), meaning \(N(X) = \sigma_1^2\). \(h(Y) = \frac{1}{2} \log(2\pi e \sigma_2^2)\), meaning \(N(Y) = \sigma_2^2\).
The sum \(X+Y \sim \mathcal{N}(0, \sigma_1^2 + \sigma_2^2)\). Thus, \(h(X+Y) = \frac{1}{2} \log(2\pi e (\sigma_1^2 + \sigma_2^2))\). The entropy power is \(N(X+Y) = \sigma_1^2 + \sigma_2^2\). We have \(N(X+Y) = N(X) + N(Y)\). The EPI holds with equality for independent Gaussians!
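Example 3 in code, using closed forms only (the variances are arbitrary): computing the entropy powers from the Gaussian entropies recovers the variances, and the EPI holds with equality.
import numpy as np

def N_from_h(h, d=1):
    # entropy power N = exp(2 h / d) / (2 pi e)
    return np.exp(2.0 * h / d) / (2 * np.pi * np.e)

s1, s2 = 1.3, 0.7                                            # arbitrary variances
h1 = 0.5 * np.log(2 * np.pi * np.e * s1)
h2 = 0.5 * np.log(2 * np.pi * np.e * s2)
h_sum = 0.5 * np.log(2 * np.pi * np.e * (s1 + s2))
print(N_from_h(h1) + N_from_h(h2), N_from_h(h_sum))          # both equal s1 + s2 = 2.0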
6. Coding Demos¶
Demo 1: Estimating Differential Entropy via k-NN (Kozachenko-Leonenko)¶
Estimating differential entropy from samples is non-trivial. Histogram methods scale poorly in high dimensions. The Kozachenko-Leonenko estimator uses \(k\)-nearest neighbors for robust estimation.
import numpy as np
from scipy.special import digamma, gamma
from sklearn.neighbors import NearestNeighbors
def kl_entropy(X, k=3):
"""
Estimates differential entropy of a set of samples X using k-NN.
X: (n_samples, n_features)
"""
n, d = X.shape
nn = NearestNeighbors(n_neighbors=k+1)
nn.fit(X)
# distances to the k-th nearest neighbor
distances, _ = nn.kneighbors(X)
r_k = distances[:, k]
# Volume of unit d-ball
c_d = (np.pi ** (d/2)) / gamma(d/2 + 1)
# Estimate
log_dists = np.log(r_k[r_k > 0])
h = d * np.mean(log_dists) + np.log(c_d) + np.log(n - 1) - digamma(k)
return h
# Generate samples from 2D Gaussian
np.random.seed(42)
cov = [[1.0, 0.5], [0.5, 2.0]]
X = np.random.multivariate_normal([0, 0], cov, 10000)
# True entropy
det_cov = np.linalg.det(cov)
true_h = 0.5 * np.log((2 * np.pi * np.e) ** 2 * det_cov)
est_h = kl_entropy(X, k=5)
print(f"True Entropy: {true_h:.4f} nats")
print(f"Estimated Entropy: {est_h:.4f} nats")
Demo 2: Empirical Verification of the Entropy Power Inequality¶
# Let's verify EPI for non-Gaussian distributions (e.g., Uniform + Exponential)
n_samples = 50000
# X ~ U(0, 1) -> True entropy = 0, Entropy power = 1 / (2*pi*e)
X = np.random.uniform(0, 1, n_samples).reshape(-1, 1)
NX = np.exp(2 * 0) / (2 * np.pi * np.e)
# Y ~ Exp(1) -> True entropy = 1, Entropy power = e^2 / (2*pi*e)
Y = np.random.exponential(1, n_samples).reshape(-1, 1)
NY = np.exp(2 * 1) / (2 * np.pi * np.e)
# Z = X + Y
Z = X + Y
# Estimate h(Z) using our k-NN estimator from Demo 1
h_Z_est = kl_entropy(Z, k=5)
NZ_est = np.exp(2 * h_Z_est) / (2 * np.pi * np.e)
print(f"N(X) + N(Y) = {NX + NY:.5f}")
print(f"Estimated N(X+Y) = {NZ_est:.5f}")
print(f"EPI Holds: {NZ_est >= (NX + NY)}")