🧩 Background
Sinkhorn’s Theorem bridges the positive matrices (matrices with strictly positive entries) and the doubly stochastic matrices via diagonal multipliers, i.e., for any positive matrix $A$ there are diagonal matrices $D_1$ and $D_2$ with positive diagonals such that $D_1 A D_2$ is doubly stochastic:

$$D_1 A D_2 \mathbf{1} = \mathbf{1}, \qquad (D_1 A D_2)^\top \mathbf{1} = \mathbf{1},$$
where $\mathbf{1}$ is the vector with all entries equal to one. The proof can be given in various ways, but most rest on some kind of fixed-point property; for instance, the geometric proof uses Brouwer’s fixed-point theorem. Sinkhorn’s algorithm seeks the vectors $u = \operatorname{diag}(D_1)$ and $v = \operatorname{diag}(D_2)$ by iteratively computing

$$u^{(k+1)} = \mathbf{1} \,./\, \big(A v^{(k)}\big), \qquad v^{(k+1)} = \mathbf{1} \,./\, \big(A^\top u^{(k+1)}\big),$$
where $A v^{(k)}$ is the row-sum vector, $A^\top u^{(k+1)}$ is the column-sum vector, and $./$ denotes entry-wise division. The two steps can be merged into one iteration (Knight, 2008; Brualdi, Parter & Schneider, 1966; Idel, 2016):

$$v^{(k+1)} = T\big(v^{(k)}\big), \qquad T(v) := \mathbf{1} \,./\, \Big(A^\top \big(\mathbf{1} \,./\, (A v)\big)\Big).$$
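A minimal NumPy sketch of the alternating updates (the test matrix, tolerance, and iteration cap below are my own illustrative choices):

```python
import numpy as np

def sinkhorn(A, tol=1e-10, max_iter=10_000):
    """Alternately rescale rows and columns of the positive matrix A
    until diag(u) @ A @ diag(v) is numerically doubly stochastic."""
    u = np.ones(A.shape[0])
    v = np.ones(A.shape[1])
    for _ in range(max_iter):
        u = 1.0 / (A @ v)          # row update:    u = 1 ./ (A v)
        v_new = 1.0 / (A.T @ u)    # column update: v = 1 ./ (A^T u)
        if np.max(np.abs(v_new - v)) < tol:
            return u, v_new
        v = v_new
    return u, v

rng = np.random.default_rng(0)
A = rng.random((5, 5)) + 0.1       # strictly positive entries
u, v = sinkhorn(A)
S = np.diag(u) @ A @ np.diag(v)
print(S.sum(axis=1), S.sum(axis=0))  # both close to the all-ones vector
```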
The theory uses the nonlinear Perron–Frobenius theorem to find the fixed point. Clearly, the map $T$ sends the positive cone into itself, but we need a little more compactness to exploit a fixed-point theorem: either establish boundedness of the iterates, or define $\tilde T(v) := T(v)/\|T(v)\|$; a fixed point of $\tilde T$ also works, since $T$ is homogeneous of degree one (invariance under scaling). When the matrix is relaxed to have only nonnegative entries, there are proofs showing that if $A$ is fully indecomposable, then the same conclusion holds.
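Both the merged map and its normalization are a few lines of code; here is a sketch (random positive matrix and sizes are illustrative) checking the fixed point and the degree-one homogeneity:

```python
import numpy as np

def T(A, v):
    """Merged map T(v) = 1 ./ (A^T (1 ./ (A v)))."""
    return 1.0 / (A.T @ (1.0 / (A @ v)))

def T_tilde(A, v):
    """Normalization of T; its fixed point lives on the unit sphere,
    which restores the compactness the fixed-point argument needs."""
    w = T(A, v)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
A = rng.random((4, 4)) + 0.1
v = np.ones(4)
for _ in range(300):
    v = T_tilde(A, v)
print(np.linalg.norm(T_tilde(A, v) - v))          # ~ 0: fixed point reached
print(np.allclose(T(A, 3.7 * v), 3.7 * T(A, v)))  # homogeneity of degree one
```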
🌀 Random thoughts
Replacing the matrix by a “compact” kernel
We consider a simple generalization: for a positive, non-singular integral kernel $K(x,y)$ on, say, $[0,1]^2$, it is natural to ask for positive functions $u$ and $v$ such that

$$u(x) \int K(x,y)\, v(y)\, dy = 1, \qquad v(y) \int K(x,y)\, u(x)\, dx = 1.$$

Then a natural copy of the previous iteration will be

$$(T v)(y) = 1 \Big/ \int \frac{K(x,y)}{\int K(x,y')\, v(y')\, dy'}\, dx,$$

and we seek a fixed point of $\tilde T$, where $\tilde T$ is a normalization of $T$. The boundedness of the iterates comes at no cost. For a certain kind of compactness, we can add an equicontinuity requirement, which becomes a smoothness condition on $K$ such that the family $\{\tilde T v\}$ is equicontinuous in the $y$ variable; by Arzelà–Ascoli, a convergent subsequence of the iterates is then what we need. By the way, $K$ can be a weakly singular kernel as well.
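Discretizing on a grid reduces this back to the matrix case and makes it easy to experiment; a sketch with the midpoint rule and an arbitrary smooth positive kernel (my own illustrative choice):

```python
import numpy as np

# midpoint-rule grid on [0, 1]
n = 200
h = 1.0 / n
x = (np.arange(n) + 0.5) * h

# an arbitrary smooth, strictly positive kernel K(x, y)
K = np.exp(-((x[:, None] - x[None, :]) ** 2)) + 0.5

u = np.ones(n)
v = np.ones(n)
for _ in range(500):
    u = 1.0 / (K @ v * h)     # enforce u(x) * \int K(x,y) v(y) dy = 1
    v = 1.0 / (K.T @ u * h)   # enforce v(y) * \int K(x,y) u(x) dx = 1

row = u * (K @ v * h)         # both products should be ~ 1 everywhere
col = v * (K.T @ u * h)
print(np.abs(row - 1).max(), np.abs(col - 1).max())
```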
If $(K_t)$ is an evolving family of integral kernels that converges as $t \to \infty$ while gradually losing equicontinuity or boundedness, it would be interesting to see whether the corresponding solutions $(u_t, v_t)$ converge or not.
The alternating search
The algorithm adopts a common (greedy) strategy that shows up in other places, like gradient descent. For instance, the matrix factorization $M \approx X Y^\top$ can be done through the gradient flow

$$\dot X = -\big(X Y^\top - M\big)\, Y, \qquad \dot Y = -\big(X Y^\top - M\big)^\top X.$$

Therefore, we obtain a conserved quantity $X^\top X - Y^\top Y$: this matrix is fixed along the trajectory.
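The conservation is a one-line computation, and easy to sanity-check numerically with a small Euler step (the sizes, step size, and rank-$r$ target below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 6, 5, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r target
X = rng.standard_normal((m, r))
Y = rng.standard_normal((n, r))

Q0 = X.T @ X - Y.T @ Y           # candidate conserved matrix
dt = 1e-3
for _ in range(20_000):
    R = X @ Y.T - M
    X, Y = X - dt * R @ Y, Y - dt * R.T @ X  # one Euler step of the flow

print(np.linalg.norm(X.T @ X - Y.T @ Y - Q0))  # ~ 0 up to O(dt) drift
print(np.linalg.norm(X @ Y.T - M))             # residual after training
```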
For vectors $x, y$ (the rank-one case $M = a b^\top$), we actually obtain the manifold that the dynamics lives on: $\{(x, y) : |x|^2 - |y|^2 = c_0\}$, with $c_0$ fixed by the initialization. A good initial guess is then a pair $(x_0, y_0)$ with $\langle x_0, a \rangle \neq 0$ and $\langle y_0, b \rangle \neq 0$, which a random initialization satisfies with high probability.
Then we look into the training: write $R = x y^\top - M$ and $E = \frac{1}{2}\|R\|_F^2$; then we will find

$$\dot E = -\|R y\|^2 - \|R^\top x\|^2,$$

where $\dot x = -R y$ and $\dot y = -R^\top x$. The energy thus decays at a rate controlled by how far $x$ and $y$ are from degenerating. Therefore, as long as the trajectory is not close to a degenerate critical point (such as the saddle at the origin, where $R y$ and $R^\top x$ vanish while $R \neq 0$), $E$ decays exponentially fast.
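A quick simulation of the rank-one flow shows both the conserved balance $|x|^2 - |y|^2$ and the roughly exponential decay of $E$; everything below (dimension, seed, step size) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
a = rng.standard_normal(n)
b = rng.standard_normal(n)
M = np.outer(a, b)                 # rank-one target

x = rng.standard_normal(n)         # generic random initialization
y = rng.standard_normal(n)
c0 = x @ x - y @ y                 # balance, conserved by the flow

dt = 1e-3
E = []
for _ in range(5_000):
    R = np.outer(x, y) - M
    E.append(0.5 * np.sum(R ** 2))
    x, y = x - dt * R @ y, y - dt * R.T @ x

print(abs((x @ x - y @ y) - c0))   # conservation check, ~ 0
print(E[0], E[len(E) // 2], E[-1]) # energy drops by orders of magnitude
print(E[-1] / E[-2])               # per-step ratio < 1: exponential decay
```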
Of course, there are other ways to deal directly with the KKT conditions (only one iteration is needed in that case).
This shares the same spirit as Sinkhorn’s algorithm.