Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Optimal transport-based distributionally robust optimization problems studied
Fictitious adversary (nature) can choose distribution of uncertain problem parameters
Robustification related to various forms of variation and Lipschitz regularization
Conditions for existence and computability of Nash equilibrium between decision-maker and nature

Paper Content

Introduction

Stochastic optimization methods are becoming popular in machine learning.
A stochastic optimization problem aims to minimize the expected value of an uncertainty-affected loss function.
The probability distribution governing the problem parameters is rarely accessible to the decision-maker.
A distributionally robust approach minimizes the worst-case expected loss with respect to all distributions in some neighborhood of the reference distribution.
This approach leads to tractable optimization models and provides generalization bounds.
It can enable generalization in the face of domain shifts and make training of deep neural networks more resilient against adversarial attacks.
The ambiguity set is an -neighborhood of the reference distribution with respect to an optimal transport discrepancy.
The transportation cost function is lower semi-continuous and the loss function is upper semi-continuous.
Optimal transport-based DRO problems are studied in many areas.
The ambiguity set can be interpreted as the family of all probability distributions that can be obtained by reshaping the reference distribution at a finite cost.
Assumptions 1.1 (i) and 1.1 (ii) ensure that the inner maximization problem admits a strong dual minimization problem.
Checking whether a given distribution belongs to the ambiguity set is #P-hard.
The optimal value of the problem can often be computed efficiently by solving the dual problem.
The dual objective function involves the expected value of the -transform of the loss function.
The -transform can be interpreted as an epigraphical regularization of the loss function.

Nash equilibria in dro

DRO problem is a zero-sum game between an agent and an adversary
Optimal decision and distribution form a Nash equilibrium
Optimal solution is referred to as a minimax estimator or robust estimator
Expected loss is minimized under the crisp distribution
Nash equilibrium exists under mild regularity conditions

Existence of nash equilibria

Assumption 1.1 (i) states that there is a reference point
Assumption 2.1 (i) and (ii) are satisfied when the support set is compact
Assumption 2.4 (iii) states that the loss function is convex and lower semi-continuous
The Wasserstein space is independent of the reference point
The -th Wasserstein ball of radius ≥ 0 is defined
Assumption 2.4 (iii) ensures that the loss function satisfies a growth condition

Computation of nash equilibria

The dual DRO problem appears intractable as it is a challenging maximin problem.
The primal DRO problem can be reformulated as a finite convex program if certain convexity properties are met.
The dual DRO problem can also be reformulated as a finite convex program if certain regularity conditions are met.
Solutions of the convex programs can be used to construct a robust decision and a least favorable distribution that form a Nash equilibrium.
Dual DRO problems have been investigated in specific applications, such as minimum mean square error estimation and Kalman filtering.
General dual DRO problems can often be addressed with methods from convex optimization.
Assumptions 2.8, 2.9 and 2.11 are needed to reformulate the dual DRO problem as a finite convex program.
Assumption 2.13 is needed to ensure that the dual DRO problem is solvable.
The dual DRO problem can be reformulated as a convex program with a maximization problem and a minimization problem.
The maximization problem has a Slater point and is solvable.
The minimization problem has a Slater point and is solvable.
Any maximizer of the convex program can be used to construct a maximizer of the dual DRO problem.
If the dual DRO problem is not solvable, a sequence of distributions can be constructed that are feasible and asymptotically optimal.

Regularization by robustification

Regularization schemes in statistics and machine learning can be interpreted as a form of robustness.
This paper seeks to create a comprehensive theory of regularization and optimal transport-based robustification.
The paper studies the primal and dual regularizing effects of robustification.

Primal regularizating effects of robustification

The worst-case expected loss across all distributions in a generic optimal transport-based ambiguity set is bounded above
Sum of expected loss under reference distribution and regularization terms that penalize -norms and Lipschitz moduli of higher-order derivatives of loss function
Intimate connections between robustification and gradient, Hessian, and Lipschitz regularization
Norm on space of totally symmetric -th order tensors induced by vector norm
Smoothness conditions depend on -th order partial derivatives of loss function
Upper bound of worst-case expected loss is sum of expected loss under reference distribution, -th variation regularization terms, and Lipschitz regularization term
Variation regularizers used in machine learning applications
Theorem 3.2 gives heuristic regularization schemes a theoretical justification
Estimate can be simplified by upper bounding all variation regularization terms by corresponding Lipschitz regularizers
Results generalize several bounds from extant literature

Dual regularizing effects of robustification

A DRO problem of the form inf is a univariate loss function : R → (−∞, +∞]
Several important problems in operations research and machine learning can be framed as instances of this DRO problem
Assumption 1.1 holds
The DRO problem is equivalent to a stochastic program inf
The -transform ℓ (, , ̂︀ ) is defined in (20)
The -transform can be expressed in terms of a suitable envelope of
The DRO problem can be solved efficiently even if fails to be (piecewise) concave or even if fails to be convex
Evaluating the -transform ℓ is equivalent to solving a univariate maximization problem
The -transform ℓ (, , ̂︀ ) can be viewed as an approximation of ℓ(, ̂︀ ) = (⟨, ̂︀ ⟩)
If ≥ lip(), then the worst-case expected loss over a 1-Wasserstein ball of radius coincides with the expected loss under the reference distribution adjusted by the regularization term lip()‖‖ *
If Θ is a cone, then the scaling factor can be eliminated by using the variable substitution ′ ←

Numerical experiments

Linear and second-order cone programs implemented in Python
Solved with Gurobi 10.0.0 on a 2.4 GHz quad-core machine with 8 GB RAM

Nash equilibria

Computation of Nash equilibria between a statistician and nature in the context of a distributionally robust support vector machine problem
Feature vector ∈ ⊆ R −1 and a label ∈ = {−1, +1}
Weight vector ∈ Θ = R −1 of a linear classifier
Hinge loss function ℓ(, ) = max{0, 1 − ⟨, ⟩}
Reference distribution ̂︀ P is set to the empirical (uniform) distribution on training samples ̂︀ = (̂︀ , ̂︀ ), ∈ []
Transportation cost function defined
Nash strategies for the statistician and nature can be computed by solving finite convex programs
Continuum of least favorable distributions, representing different Nash strategies of nature
Instance of the distributionally robust support vector machine problem with = 3, = 0.1 and = 20
-norm to quantify the transportation cost in the feature space
Distributionally robust support vector machine to distinguish greyscale images of handwritten numbers 3 and 8 from the MNIST 3-vs-8 dataset
Feature vector ∈ = [0, 1] −1 with = 787
∞-norm to quantify the transportation cost in the feature space
Compare least favorable distributions (Nash strategies of nature) against worst-case distributions (best response strategies of nature)

Distributionally robust log-optimal portfolio selection

Assume components of random vector represent total returns of assets over next month
Probability simplex in R represents probability of constantly rebalanced portfolio
Distribution P is unknown in practice
Model distributional ambiguity via optimal transport-based ambiguity set
Maximization problem in (28) can be solved efficiently by sorting
Out-of-sample performance of log-optimal portfolios with 10 assets assessed
Unknown true asset return distribution P assumed to be lognormal
Maximizer of problem (28) is unique and fully determined by first-order optimality condition
Maximizer of problem (28) is bounded
As tends to infinity, converges to

C technical background results

Strong duality states that the adversarial examples implied by nature’s best response can only deceive an algorithm, while the adversarial examples implied by nature’s Nash strategy can even deceive a human.
Relaxed self service condition states that for any integer up to a certain number, the directional derivative of the loss function is Lipschitz continuous in that direction.
Smoothness properties of the loss function states that the loss function is continuously differentiable and there are certain bounds on the derivatives.
Regularization by robustification over Wasserstein balls states that if certain assumptions hold, then the maximum of the Wasserstein ball is equal to the minimum of the dual problem.
Aymptotically steep Lipschitz continuous loss states that if the asymptotic linear growth rate of the loss function is equal to the Lipschitz modulus, then the Pasch-Hausdorff envelope of the loss function is equal to the loss function if the growth rate is greater than the Lipschitz modulus and equal to infinity otherwise.
Non-uniqueness of Nash equilibria states that if a certain maximization problem is solved, then the corresponding objective function values coincide.
Given certain parameters, the out-of-sample performance of a fixed portfolio is evaluated empirically.
Accounting for distributional ambiguity reduces the expected logarithmic disutility and the dispersion of the disutility.
If an upper semi-continuous function satisfies a growth condition, then the -th envelope converges to the function as grows.

Link to paper#

Abstract#

Paper Content#

Introduction#

Nash equilibria in dro#

Existence of nash equilibria#

Computation of nash equilibria#

Regularization by robustification#

Primal regularizating effects of robustification#

Dual regularizing effects of robustification#

Numerical experiments#

Nash equilibria#

Distributionally robust log-optimal portfolio selection#

C technical background results#

Link to paper

Abstract

Paper Content

Introduction

Nash equilibria in dro

Existence of nash equilibria

Computation of nash equilibria

Regularization by robustification

Primal regularizating effects of robustification

Dual regularizing effects of robustification

Numerical experiments

Nash equilibria

Distributionally robust log-optimal portfolio selection

C technical background results