Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Optimal transportation distances are a type of parameterized distance for histograms.
  • Computing these distances involves solving a linear program, which is expensive when the histograms’ dimension is large.
  • This paper proposes a new family of optimal transportation distances that look at transportation problems from a maximum-entropy perspective.
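  • These distances can be computed with the Sinkhorn-Knopp matrix scaling algorithm, several orders of magnitude faster than classic optimal transportation distances, and they show improved performance on the MNIST benchmark problem.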

Paper Content

Introduction

  • Optimal transportation distances are used in computer vision
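  • Compared with classic histogram distances such as Hellinger, χ2 or Total Variation, they are the only ones that are parameterized
  • Their parameter, the ground metric, is used to handle high-dimensional histograms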
  • They have been studied from a theoretical and practical perspective
  • They have a drawback of taking a long time to compute
  • This paper introduces a regularization to the optimal transportation problem to make it faster and more applicable to machine learning

Reminders on optimal transportation

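  • U(r, c), the transportation polytope, contains all nonnegative d × d matrices with row sums r and column sums c.
  • U(r, c) can be identified with the set of joint probabilities for two multinomial random variables with marginals r and c.
  • The entropy h and the Kullback-Leibler divergence are defined for these tables and their marginals.
  • Given a d × d cost matrix M, optimal transportation finds the table P in U(r, c) that minimizes the cost ⟨P, M⟩; the minimum is the distance d_M(r, c).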
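The definitions above can be checked on a small example. The following is a minimal sketch in plain Python; the histograms `r` and `c`, the cost matrix `M`, and the helper names `in_polytope` and `transport_cost` are illustrative assumptions, not taken from the paper.

```python
def in_polytope(P, r, c, tol=1e-12):
    """Return True if P is nonnegative with row sums r and column sums c."""
    d = len(r)
    nonneg = all(P[i][j] >= 0 for i in range(d) for j in range(d))
    rows_ok = all(abs(sum(P[i]) - r[i]) <= tol for i in range(d))
    cols_ok = all(
        abs(sum(P[i][j] for i in range(d)) - c[j]) <= tol for j in range(d)
    )
    return nonneg and rows_ok and cols_ok


def transport_cost(P, M):
    """Cost <P, M>: the sum of p_ij * m_ij over all entries."""
    return sum(p * m for row_p, row_m in zip(P, M) for p, m in zip(row_p, row_m))


# Illustrative data (an assumption): three bins on a line, cost |i - j|.
r = [0.5, 0.3, 0.2]
c = [0.2, 0.3, 0.5]
M = [[abs(i - j) for j in range(3)] for i in range(3)]

# Two members of U(r, c): the independence table r c^T and a cheaper table.
independence = [[ri * cj for cj in c] for ri in r]
monotone = [[0.2, 0.3, 0.0], [0.0, 0.0, 0.3], [0.0, 0.0, 0.2]]

cost_independence = transport_cost(independence, M)
cost_monotone = transport_cost(monotone, M)
```

Both tables have the required marginals, but they pay different costs, which is why the optimal transportation problem minimizes ⟨P, M⟩ over the whole polytope.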

Sinkhorn distances

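  • Consider a family of optimal transportation distances obtained by adding an entropic constraint on the joint probabilities P.
  • Cover and Thomas inequality: every P in U(r, c) satisfies h(P) ≤ h(r) + h(c).
  • The bound is attained by the independence table rc^T, whose entropy is h(r) + h(c).
  • U_α(r, c) = {P ∈ U(r, c) : KL(P‖rc^T) ≤ α} is a convex set of joint probability matrices.
  • For P in U(r, c), the Kullback-Leibler divergence to the table rc^T is KL(P‖rc^T) = h(r) + h(c) − h(P).
  • This quantity is equivalent to the mutual information of two random variables with joint probability P.
  • The entropic constraint mitigates the transportation cost objective by keeping only tables with small enough mutual information.
  • U_α(r, c) is the intersection of the transportation polytope U(r, c) and the Kullback-Leibler ball of radius α centered on rc^T.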
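  • The Sinkhorn distance d_{M,α}(r, c) is the dot product ⟨P, M⟩ of M with the optimal transportation table P, the minimizer of the cost over U_α(r, c).
  • For α large enough, the Sinkhorn distance coincides with the classic optimal transportation distance.
  • When α = 0, the Sinkhorn distance has the closed form d_{M,0}(r, c) = r^T M c.
  • d_{M,0} is a negative definite kernel when M is a Euclidean distance matrix.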
  • When M is a metric matrix, Sinkhorn distances are symmetric and satisfy all triangle inequalities.
  • The proof relies on a gluing lemma with an entropic constraint.
  • The lemma yields the triangle inequality for d_{M,α}.
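The entropic constraint and the α = 0 closed form can be illustrated numerically. The sketch below is plain Python under stated assumptions: the example histograms, the cost matrix, and the helper names `entropy`, `kl_to_independence` and `satisfies_entropic_constraint` are hypothetical, and the last helper assumes its argument already lies in U(r, c).

```python
import math


def entropy(weights):
    """Shannon entropy of nonnegative weights, with the convention 0 log 0 = 0."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def kl_to_independence(P, r, c):
    """KL(P || r c^T), i.e. the mutual information of a pair with joint table P."""
    return sum(
        p * math.log(p / (r[i] * c[j]))
        for i, row in enumerate(P)
        for j, p in enumerate(row)
        if p > 0
    )


def satisfies_entropic_constraint(P, r, c, alpha, tol=1e-12):
    """For P already in U(r, c): is P inside the Kullback-Leibler ball of radius alpha?"""
    return kl_to_independence(P, r, c) <= alpha + tol


# Illustrative data (an assumption): three bins on a line, cost |i - j|.
r = [0.5, 0.3, 0.2]
c = [0.2, 0.3, 0.5]
M = [[abs(i - j) for j in range(3)] for i in range(3)]

independence = [[ri * cj for cj in c] for ri in r]
monotone = [[0.2, 0.3, 0.0], [0.0, 0.0, 0.3], [0.0, 0.0, 0.2]]

# Identity KL(P || r c^T) = h(r) + h(c) - h(P) for a table in U(r, c).
kl_monotone = kl_to_independence(monotone, r, c)
entropy_gap = entropy(r) + entropy(c) - entropy([p for row in monotone for p in row])

# With alpha = 0 only r c^T is feasible, so the distance is r^T M c.
d_alpha_zero = sum(r[i] * M[i][j] * c[j] for i in range(3) for j in range(3))
```

Raising α enlarges the feasible set: in this example the low-cost table `monotone` is excluded for small α and admitted once α exceeds its mutual information.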

Computing Sinkhorn distances with the Sinkhorn-Knopp algorithm

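  • The Sinkhorn distance is defined by a constraint on the entropy h(P) relative to h(r) and h(c).
  • Replacing the constraint with a Lagrange multiplier gives the dual-Sinkhorn divergence d_M^λ(r, c) = ⟨P^λ, M⟩, where P^λ minimizes ⟨P, M⟩ − h(P)/λ over U(r, c).
  • The dual-Sinkhorn divergence can be computed at a much cheaper cost than the classical optimal transportation problem.
  • The solution P^λ has entries of the form u_i e^{−λ m_ij} v_j, so it can be found with the Sinkhorn-Knopp matrix scaling algorithm.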
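The scaling form of the solution suggests a simple fixed-point iteration: alternately rescale rows and columns of K = e^{−λM} until the marginals match r and c. The following is a minimal sketch in plain Python, not the paper's implementation; the function name `sinkhorn_divergence`, the default iteration count, and the assumption that r and c are strictly positive are illustrative choices.

```python
import math


def sinkhorn_divergence(r, c, M, lam, n_iter=20):
    """Approximate the dual-Sinkhorn divergence <P, M> by Sinkhorn-Knopp scaling.

    P has the form diag(u) K diag(v) with K = exp(-lam * M) elementwise.
    Assumes r and c are strictly positive histograms of the same length.
    """
    d = len(r)
    K = [[math.exp(-lam * M[i][j]) for j in range(d)] for i in range(d)]
    u = [1.0] * d
    v = [1.0] * d
    for _ in range(n_iter):
        # Row scaling: make the row sums of diag(u) K diag(v) equal to r.
        u = [r[i] / sum(K[i][j] * v[j] for j in range(d)) for i in range(d)]
        # Column scaling: make the column sums equal to c.
        v = [c[j] / sum(K[i][j] * u[i] for i in range(d)) for j in range(d)]
    P = [[u[i] * K[i][j] * v[j] for j in range(d)] for i in range(d)]
    divergence = sum(P[i][j] * M[i][j] for i in range(d) for j in range(d))
    return divergence, P


# Two-bin example (an assumption) where the answer is known: 1 / (1 + e^lam).
div_2x2, P_2x2 = sinkhorn_divergence([0.5, 0.5], [0.5, 0.5], [[0, 1], [1, 0]], lam=1.0)

# Three-bin example (an assumption) with cost |i - j|.
r = [0.5, 0.3, 0.2]
c = [0.2, 0.3, 0.5]
M = [[abs(i - j) for j in range(3)] for i in range(3)]
div_small_lam, _ = sinkhorn_divergence(r, c, M, lam=1.0, n_iter=500)
div_large_lam, _ = sinkhorn_divergence(r, c, M, lam=20.0, n_iter=500)
```

Each iteration costs only two matrix-vector products, which is where the speed advantage over a linear programming solver comes from. Because the column scaling runs last, the column sums of the returned table are exact and the row sums are approximate until the iteration has converged.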

Experimental results

  • MNIST Digits dataset used to test performance of Sinkhorn distances
  • Each digit is a vector of intensities on a 20×20 pixel grid
  • Subsets of N points in the training set, where N ranges from 3×10^3 to 25×10^3
  • Mean and standard deviation of classification error using a 4-fold cross-validation scheme repeated 6 times
  • Different distances studied with parameter selection scheme
  • Ground metric M is the matrix of Euclidean distances between the 20×20 = 400 points in the grid
  • Hellinger, χ2, Total Variation and squared Euclidean (Gaussian kernel) distances used
  • Independence kernel uses Euclidean distance matrix with a parameter a
  • Kernel matrices that are not positive definite are regularized by adding a sufficiently large diagonal term
  • SVMs run with libsvm (one-vs-one) for multiclass classification
  • Entropic penalty λ of Sinkhorn distances chosen to make the matrix e^{−λM} relatively diagonally dominant
  • Number of fixed-point iterations set to 20
  • Sinkhorn distances achieve lower classification error than all other distances, including EMD
  • Sinkhorn distances converge to EMD as λ gets bigger
  • Sinkhorn distances hover above EMD distances by about 10%
  • Sinkhorn distances several orders of magnitude faster than classic optimal transportation distances
  • Number of iterations required for convergence increases as e^{−λM} becomes diagonally dominant
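The experimental setup needs two inputs per image pair: a histogram for each digit and a ground metric on the pixel grid. The sketch below shows one way to build both in plain Python. The helper names `grid_ground_metric` and `to_histogram`, the row-major pixel ordering, and the normalization of each image to sum to 1 are assumptions made for illustration.

```python
import math


def grid_ground_metric(height, width):
    """Euclidean distances between all pairs of pixel positions (row-major order)."""
    points = [(i, j) for i in range(height) for j in range(width)]
    return [
        [math.hypot(p[0] - q[0], p[1] - q[1]) for q in points]
        for p in points
    ]


def to_histogram(pixels):
    """Normalize nonnegative pixel intensities so that they sum to 1."""
    total = sum(pixels)
    if total <= 0:
        raise ValueError("image has no mass to normalize")
    return [x / total for x in pixels]


# A 20x20 grid gives a 400 x 400 ground metric.
M_grid = grid_ground_metric(20, 20)

# Toy 2x2 "image" (an assumption) flattened in row-major order.
hist = to_histogram([0, 2, 2, 4])
```

The ground metric depends only on the grid, so it can be computed once and reused for every pair of digits.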

Conclusion

  • Regularizing the optimal transportation problem with an entropic penalty opens the door for new research and applications.
  • Sinkhorn distances do not perform worse than the EMD and may perform better.
  • Small values of λ seem to perform better than large ones.
  • There is a faster way to compute the Independence kernel.
  • Sinkhorn distances are parameterized by a regularization weight λ.
  • The Independence kernel is positive definite on histograms with the same 1-norm.
  • The data processing inequality can be used to prove h(X, Z) − (h(X) + h(Z)) ≥ h(X, Y) − (h(X) + h(Y)) ≥ −α.
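The last step can be spelled out as a short derivation. It is a sketch under the usual gluing assumptions, which are stated here rather than taken from the summary: (X, Y) follows a table in U_α, and the glued table makes X, Y, Z a Markov chain, so that Z depends on X only through Y.

```latex
% Assumptions: I(X;Y) <= alpha because the table of (X, Y) lies in U_alpha,
% and X -> Y -> Z is a Markov chain by construction of the glued table.
\begin{align*}
I(X;Z) &\le I(X;Y) \le \alpha
  && \text{(data processing inequality)} \\
h(X) + h(Z) - h(X,Z) &\le h(X) + h(Y) - h(X,Y) \le \alpha
  && \text{(definition of mutual information)} \\
h(X,Z) - \bigl(h(X) + h(Z)\bigr) &\ge h(X,Y) - \bigl(h(X) + h(Y)\bigr) \ge -\alpha
  && \text{(multiply by } -1\text{)}
\end{align*}
```

The final line says that the table of (X, Z) also satisfies the entropic constraint, which is what the triangle inequality for Sinkhorn distances requires.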