🧩 Background
Sinkhorn’s Theorem bridges the positive matrices (matrices with strictly positive entries) and the doubly stochastic matrices via diagonal multipliers, i.e., for any positive matrix $A$ there are diagonal matrices $D_1$ and $D_2$ with positive diagonals such that $D_1 A D_2$ is doubly stochastic:

$$D_1 A D_2 \mathbf{1} = \mathbf{1}, \qquad (D_1 A D_2)^\top \mathbf{1} = \mathbf{1},$$
where $\mathbf{1}$ is the vector with all entries equal to one. The proof can be given in various ways, but most rest on some kind of fixed-point property; for instance, the geometric proof uses Brouwer’s fixed-point theorem. Sinkhorn’s algorithm seeks the vectors $u = \operatorname{diag}(D_1)$ and $v = \operatorname{diag}(D_2)$ by iteratively computing

$$u^{(k+1)} = \mathbf{1} \,./\, \big(A v^{(k)}\big), \qquad v^{(k+1)} = \mathbf{1} \,./\, \big(A^\top u^{(k+1)}\big),$$
where $A v^{(k)}$ is the row-sum vector, $A^\top u^{(k+1)}$ is the column-sum vector, and $./$ denotes entry-wise division. The two steps can be merged into one iteration (Knight, 2008; Brualdi, Parter & Schneider, 1966; Idel, 2016):

$$v^{(k+1)} = T\big(v^{(k)}\big), \qquad T(v) := \mathbf{1} \,./\, \Big(A^\top \big(\mathbf{1} \,./\, (A v)\big)\Big).$$
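A minimal NumPy sketch of the alternating updates (the test matrix, tolerance, and iteration cap below are my own illustrative choices):

```python
import numpy as np

def sinkhorn(A, tol=1e-10, max_iter=10_000):
    """Alternately rescale rows and columns of the positive matrix A
    until diag(u) @ A @ diag(v) is numerically doubly stochastic."""
    u = np.ones(A.shape[0])
    v = np.ones(A.shape[1])
    for _ in range(max_iter):
        u = 1.0 / (A @ v)          # row update:    u = 1 ./ (A v)
        v_new = 1.0 / (A.T @ u)    # column update: v = 1 ./ (A^T u)
        if np.max(np.abs(v_new - v)) < tol:
            return u, v_new
        v = v_new
    return u, v

rng = np.random.default_rng(0)
A = rng.random((5, 5)) + 0.1       # strictly positive entries
u, v = sinkhorn(A)
S = np.diag(u) @ A @ np.diag(v)
print(S.sum(axis=1), S.sum(axis=0))  # both close to the all-ones vector
```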
The theory uses the nonlinear Perron–Frobenius theorem to find the fixed point. Clearly, the map $T$ sends the positive cone into itself, but we need a little more compactness to exploit a fixed-point theorem: either establish boundedness of the iterates, or define $\tilde T(v) := T(v)/\|T(v)\|$; a fixed point of $\tilde T$ also works, since $T$ is homogeneous of degree one (invariance under scaling). When the matrix is relaxed to have only nonnegative entries, there are proofs showing that if $A$ is fully indecomposable, then the same conclusion holds.
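Both the merged map and its normalization are a few lines of code; here is a sketch (random positive matrix and sizes are illustrative) checking the fixed point and the degree-one homogeneity:

```python
import numpy as np

def T(A, v):
    """Merged map T(v) = 1 ./ (A^T (1 ./ (A v)))."""
    return 1.0 / (A.T @ (1.0 / (A @ v)))

def T_tilde(A, v):
    """Normalization of T; its fixed point lives on the unit sphere,
    which restores the compactness the fixed-point argument needs."""
    w = T(A, v)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
A = rng.random((4, 4)) + 0.1
v = np.ones(4)
for _ in range(300):
    v = T_tilde(A, v)
print(np.linalg.norm(T_tilde(A, v) - v))          # ~ 0: fixed point reached
print(np.allclose(T(A, 3.7 * v), 3.7 * T(A, v)))  # homogeneity of degree one
```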
🌀 Random thoughts
Replacing the matrix by a “compact” kernel
We consider a simple generalization: for a positive, non-singular integral kernel $K(x,y)$ on, say, $[0,1]^2$, it is natural to ask for positive functions $u$ and $v$ such that

$$u(x) \int K(x,y)\, v(y)\, dy = 1, \qquad v(y) \int K(x,y)\, u(x)\, dx = 1.$$

Then a natural copy of the previous iteration will be

$$(T v)(y) = 1 \Big/ \int \frac{K(x,y)}{\int K(x,y')\, v(y')\, dy'}\, dx,$$

and we seek a fixed point of $\tilde T$, where $\tilde T$ is a normalization of $T$. The boundedness of the iterates comes at no cost. For a certain kind of compactness, we can add an equicontinuity requirement, which becomes a smoothness condition on $K$ such that the family $\{\tilde T v\}$ is equicontinuous in the $y$ variable; by Arzelà–Ascoli, a convergent subsequence of the iterates is then what we need. By the way, $K$ can be a weakly singular kernel as well.
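Discretizing on a grid reduces this back to the matrix case and makes it easy to experiment; a sketch with the midpoint rule and an arbitrary smooth positive kernel (my own illustrative choice):

```python
import numpy as np

# midpoint-rule grid on [0, 1]
n = 200
h = 1.0 / n
x = (np.arange(n) + 0.5) * h

# an arbitrary smooth, strictly positive kernel K(x, y)
K = np.exp(-((x[:, None] - x[None, :]) ** 2)) + 0.5

u = np.ones(n)
v = np.ones(n)
for _ in range(500):
    u = 1.0 / (K @ v * h)     # enforce u(x) * \int K(x,y) v(y) dy = 1
    v = 1.0 / (K.T @ u * h)   # enforce v(y) * \int K(x,y) u(x) dx = 1

row = u * (K @ v * h)         # both products should be ~ 1 everywhere
col = v * (K.T @ u * h)
print(np.abs(row - 1).max(), np.abs(col - 1).max())
```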
If $(K_t)$ is an evolving family of integral kernels that converges as $t \to \infty$ while gradually losing equicontinuity or boundedness, it would be interesting to see whether the corresponding solutions $(u_t, v_t)$ converge or not.
The alternating search
The algorithm adopts a common (greedy) strategy that shows up in other places, like gradient descent. For instance, the matrix factorization $M \approx X Y^\top$ can be done through the gradient flow

$$\dot X = -\big(X Y^\top - M\big)\, Y, \qquad \dot Y = -\big(X Y^\top - M\big)^\top X.$$

Therefore, we obtain a conserved quantity $X^\top X - Y^\top Y$: this matrix is fixed along the trajectory.
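The conservation is a one-line computation, and easy to sanity-check numerically with a small Euler step (the sizes, step size, and rank-$r$ target below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 6, 5, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r target
X = rng.standard_normal((m, r))
Y = rng.standard_normal((n, r))

Q0 = X.T @ X - Y.T @ Y           # candidate conserved matrix
dt = 1e-3
for _ in range(20_000):
    R = X @ Y.T - M
    X, Y = X - dt * R @ Y, Y - dt * R.T @ X  # one Euler step of the flow

print(np.linalg.norm(X.T @ X - Y.T @ Y - Q0))  # ~ 0 up to O(dt) drift
print(np.linalg.norm(X @ Y.T - M))             # residual after training
```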
For vectors $x, y$ (the rank-one case $M = a b^\top$), we actually obtain the manifold that the dynamics lives on: $\{(x, y) : |x|^2 - |y|^2 = c_0\}$, with $c_0$ fixed by the initialization. A good initial guess is then a pair $(x_0, y_0)$ with $\langle x_0, a \rangle \neq 0$ and $\langle y_0, b \rangle \neq 0$, which a random initialization satisfies with high probability.
Then we look into the training: write $R = x y^\top - M$ and $E = \frac{1}{2}\|R\|_F^2$; then we will find

$$\dot E = -\|R y\|^2 - \|R^\top x\|^2,$$

where $\dot x = -R y$ and $\dot y = -R^\top x$. The energy thus decays at a rate controlled by how far $x$ and $y$ are from degenerating. Therefore, as long as the trajectory is not close to a degenerate critical point (such as the saddle at the origin, where $R y$ and $R^\top x$ vanish while $R \neq 0$), $E$ decays exponentially fast.
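A quick simulation of the rank-one flow shows both the conserved balance $|x|^2 - |y|^2$ and the roughly exponential decay of $E$; everything below (dimension, seed, step size) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
a = rng.standard_normal(n)
b = rng.standard_normal(n)
M = np.outer(a, b)                 # rank-one target

x = rng.standard_normal(n)         # generic random initialization
y = rng.standard_normal(n)
c0 = x @ x - y @ y                 # balance, conserved by the flow

dt = 1e-3
E = []
for _ in range(5_000):
    R = np.outer(x, y) - M
    E.append(0.5 * np.sum(R ** 2))
    x, y = x - dt * R @ y, y - dt * R.T @ x

print(abs((x @ x - y @ y) - c0))   # conservation check, ~ 0
print(E[0], E[len(E) // 2], E[-1]) # energy drops by orders of magnitude
print(E[-1] / E[-2])               # per-step ratio < 1: exponential decay
```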
Of course, there are other ways to deal directly with the KKT conditions (only one iteration is needed in that case).
This shares the same spirit as Sinkhorn’s algorithm.