Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Optimal transportation distances are a type of parameterized distance for histograms.
- Computing these distances involves solving a linear program, which is expensive when the histograms’ dimension is large.
- This paper proposes a new family of optimal transportation distances that look at transportation problems from a maximum-entropy perspective.
- This new distance can be computed quickly and has improved performance on the MNIST benchmark problem.
Paper Content
Introduction
- Optimal transportation distances are used in computer vision
- They are the only distances that are parameterized
- They are used to handle high-dimensional histograms
- They have been studied from a theoretical and practical perspective
- They have a drawback of taking a long time to compute
- This paper introduces a regularization to the optimal transportation problem to make it faster and more applicable to machine learning
Reminders on optimal transportation
- U (r, c) contains all nonnegative d × d matrices with row and column sums r and c.
- U (r, c) can be identified with a joint probability for two multinomial random variables.
- Optimal Transportation is the entropy and Kullback-Leibler divergences of these tables and their marginals.
Sinkhorn distances
- Consider a family of optimal transportation distances
- Entropic constraints on joint probabilities
- Cover and Thomas inequality
- Entropy of independence table
- Convex set of joint probability matrices
- Equivalent to mutual information of two random variables
- Kullback-Leibler divergence to the table
- Mitigate transportation cost with entropic constraint
- Transportation polytope and Kullback-Leibler ball
- Sinkhorn distance is dot product of M with optimal transportation table
- Sinkhorn distance coincides with classic optimal transportation distance
- Closed form of Sinkhorn distance when α = 0
- Negative definite kernel when M is Euclidean distance matrix
- Sinkhorn distances are symmetric and satisfy triangle inequalities
- Gluing lemma with entropic constraint
- Triangle inequality for d M,α
Computing sinkhorn distances with the sinkhorn-knopp algorithm
- Sinkhorn distance is defined by a constraint on the entropy of h(P) relative to h(r) and h(c).
- Lagrange multiplier is used to calculate dual-Sinkhorn divergence.
- Dual-Sinkhorn divergence can be computed at a cheaper cost than classical optimal transportation problem.
- Solution Pλ is of the form ui e-λmij vj.
Experimental results
- MNIST Digits dataset used to test performance of Sinkhorn distances
- Each digit is a vector of intensities on a 20x20 pixel grid
- Subsets of N points in the training set, where N ranges from 3-25x10^3 datapoints
- Mean and standard deviation of classification error using 4 fold cross validation scheme repeated 6 times
- Different distances studied with parameter selection scheme
- Ground metric M is Euclidean distance between 20x20 points in the grid
- Hellinger, χ2, Total Variation and squared Euclidean (Gaussian kernel) distances used
- Independence kernel uses Euclidean distance matrix with a parameter a
- Sinkhorn distances regularized by adding a sufficiently large diagonal term
- SVM’s run with libsvm (one-vs-one) for multiclass classification
- Entropic penalty λ of Sinkhorn distances chosen to make matrix e-λM relatively diagonally dominant
- Number of fixed-point iterations set to 20
- Sinkhorn distance beats all other distances, including EMD
- Sinkhorn distances converge to EMD as λ gets bigger
- Sinkhorn distances hover above EMD distances by about 10%
- Sinkhorn distances several orders of magnitude faster than classic optimal transportation distances
- Number of iterations required for convergence increases as e-λM becomes diagonally dominant
Conclusion
- Regularizing the optimal transportation problem with an entropic penalty opens the door for new research and applications.
- Sinkhorn distances do not perform worse than the EMD and may perform better.
- Small values of λ seem to perform better than large ones.
- There is a faster way to compute the Independence kernel.
- Sinkhorn distances are parameterized by a regularization weight λ.
- The Independence kernel is positive definite on histograms with the same 1-norm.
- The data processing inequality can be used to prove h(X, Z)−h(X)+h(Z) ≥ h(X, Y )−h(X)+h(Y ) ≥ −α.