Abstract
- The paper studies optimal transport-based distributionally robust optimization (DRO) problems
- A fictitious adversary (nature) can choose the distribution of the uncertain problem parameters
- Robustification is related to various forms of variation and Lipschitz regularization
- Conditions are given for the existence and computability of a Nash equilibrium between the decision-maker and nature
Paper Content
Introduction
- Stochastic optimization methods are becoming popular in machine learning.
- A stochastic optimization problem aims to minimize the expected value of an uncertainty-affected loss function.
- The probability distribution governing the problem parameters is rarely accessible to the decision-maker.
- A distributionally robust approach minimizes the worst-case expected loss with respect to all distributions in some neighborhood of the reference distribution.
- This approach leads to tractable optimization models and provides generalization bounds.
- It can enable generalization in the face of domain shifts and make training of deep neural networks more resilient against adversarial attacks.
- The ambiguity set is a neighborhood of prescribed radius around the reference distribution, measured with respect to an optimal transport discrepancy.
- The transportation cost function is lower semi-continuous and the loss function is upper semi-continuous.
- Optimal transport-based DRO problems are studied in many areas.
- The ambiguity set can be interpreted as the family of all probability distributions that can be obtained by reshaping the reference distribution at a finite cost.
- Assumptions 1.1 (i) and 1.1 (ii) ensure that the inner maximization problem admits a strong dual minimization problem.
- Checking whether a given distribution belongs to the ambiguity set is #P-hard.
- The optimal value of the problem can often be computed efficiently by solving the dual problem.
- The dual objective function involves the expected value of the c-transform of the loss function.
- The c-transform can be interpreted as an epigraphical regularization of the loss function.
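The dual formulation described above can be sketched numerically on a toy instance. The sketch below assumes a one-dimensional uncertain parameter, the transportation cost |z − ẑ|, and a grid search over both the location z and the dual multiplier; the function names `c_transform` and `dual_value` are hypothetical and chosen only for illustration.

```python
def c_transform(loss, lam, z_hat, grid):
    """Grid approximation of sup_z [loss(z) - lam * |z - z_hat|]."""
    return max(loss(z) - lam * abs(z - z_hat) for z in grid)


def dual_value(loss, samples, radius, grid, lambdas):
    """Smallest value of lam * radius + average c-transform over the candidate multipliers."""
    best = float("inf")
    for lam in lambdas:
        avg = sum(c_transform(loss, lam, z_hat, grid) for z_hat in samples) / len(samples)
        best = min(best, lam * radius + avg)
    return best


# Toy instance (all numbers are illustrative assumptions).
samples = [-1.0, 0.5, 2.0]                     # support of the empirical reference distribution
grid = [i / 10 for i in range(-100, 101)]      # candidate locations z in [-10, 10]
lambdas = [i / 100 for i in range(0, 301)]     # candidate multipliers in [0, 3]

worst_case = dual_value(abs, samples, 0.5, grid, lambdas)
nominal = sum(abs(z) for z in samples) / len(samples)
print(nominal, worst_case)
```

For the absolute-value loss, whose Lipschitz modulus is 1, the grid search settles at the multiplier 1 and the dual value exceeds the nominal expected loss by the radius. Truncating the grid to [-10, 10] makes the inner supremum finite for small multipliers, which an unbounded support would not.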
Nash equilibria in DRO
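- The DRO problem can be viewed as a zero-sum game between an agent (the decision-maker) and an adversary (nature)
- An optimal decision and an optimal distribution together form a Nash equilibrium
- The optimal decision is referred to as a minimax estimator or robust estimator
- At equilibrium, the robust decision minimizes the expected loss under a single crisp distribution, namely nature's equilibrium distribution
- A Nash equilibrium exists under mild regularity conditions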
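The game-theoretic reading above can be written as a saddle-point condition. The notation is assumed for illustration: θ is the decision in the feasible set Θ, Z the uncertain parameter, ℓ the loss, and 𝓑 the ambiguity set. A pair (θ*, Q*) is a Nash equilibrium if neither player gains by deviating unilaterally, and in that case the primal and dual DRO values coincide.

```latex
\mathbb{E}_{\mathbb{Q}}\big[\ell(\theta^\star, Z)\big]
\;\le\;
\mathbb{E}_{\mathbb{Q}^\star}\big[\ell(\theta^\star, Z)\big]
\;\le\;
\mathbb{E}_{\mathbb{Q}^\star}\big[\ell(\theta, Z)\big]
\qquad \forall\, \theta \in \Theta,\ \mathbb{Q} \in \mathcal{B},
\\[4pt]
\inf_{\theta \in \Theta} \sup_{\mathbb{Q} \in \mathcal{B}} \mathbb{E}_{\mathbb{Q}}\big[\ell(\theta, Z)\big]
\;=\;
\mathbb{E}_{\mathbb{Q}^\star}\big[\ell(\theta^\star, Z)\big]
\;=\;
\sup_{\mathbb{Q} \in \mathcal{B}} \inf_{\theta \in \Theta} \mathbb{E}_{\mathbb{Q}}\big[\ell(\theta, Z)\big].
```

The right-hand inequality in the first line is the statement that the robust decision minimizes the expected loss under the crisp distribution Q*.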
Existence of Nash equilibria
- Assumption 1.1 (i) states that there is a reference point
- Assumptions 2.1 (i) and (ii) are satisfied when the support set is compact
- Assumption 2.4 (iii) states that the loss function is convex and lower semi-continuous
- The Wasserstein space is independent of the reference point
- The p-th Wasserstein ball of a given nonnegative radius is defined
- Assumption 2.4 (iii) ensures that the loss function satisfies a growth condition
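To make the notion of a Wasserstein ball concrete, the sketch below covers only an easy special case: two empirical distributions on the real line with the same number of equally weighted samples, for which the p-th Wasserstein distance is obtained by matching sorted samples. The helper names are hypothetical. The introduction notes that checking membership in the ambiguity set is hard in general, so this univariate case is for illustration only.

```python
def wasserstein_1d(xs, ys, p=1):
    """p-th Wasserstein distance between two equal-size, equally weighted samples on the real line."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("both samples must be non-empty and of equal size")
    # On the real line, an optimal coupling matches the sorted samples in order.
    total = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return (total / len(xs)) ** (1.0 / p)


def in_wasserstein_ball(candidate, reference, radius, p=1):
    """True if the candidate sample lies in the p-th Wasserstein ball around the reference sample."""
    return wasserstein_1d(candidate, reference, p) <= radius


reference = [0.0, 1.0, 2.0]
candidate = [2.5, 0.0, 1.5]
print(wasserstein_1d(candidate, reference))            # 1-Wasserstein distance
print(in_wasserstein_ball(candidate, reference, 0.5))  # membership for radius 0.5
```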
Computation of Nash equilibria
- The dual DRO problem appears intractable as it is a challenging maximin problem.
- The primal DRO problem can be reformulated as a finite convex program if certain convexity properties are met.
- The dual DRO problem can also be reformulated as a finite convex program if certain regularity conditions are met.
- Solutions of the convex programs can be used to construct a robust decision and a least favorable distribution that form a Nash equilibrium.
- Dual DRO problems have been investigated in specific applications, such as minimum mean square error estimation and Kalman filtering.
- General dual DRO problems can often be addressed with methods from convex optimization.
- Assumptions 2.8, 2.9 and 2.11 are needed to reformulate the dual DRO problem as a finite convex program.
- Assumption 2.13 is needed to ensure that the dual DRO problem is solvable.
- The dual DRO problem can be reformulated as a convex program with a maximization problem and a minimization problem.
- The maximization problem has a Slater point and is solvable.
- The minimization problem has a Slater point and is solvable.
- Any maximizer of the convex program can be used to construct a maximizer of the dual DRO problem.
- If the dual DRO problem is not solvable, a sequence of distributions can be constructed that are feasible and asymptotically optimal.
Regularization by robustification
- Regularization schemes in statistics and machine learning can be interpreted as a form of robustness.
- This paper seeks to create a comprehensive theory of regularization and optimal transport-based robustification.
- The paper studies the primal and dual regularizing effects of robustification.
Primal regularizing effects of robustification
- The worst-case expected loss across all distributions in a generic optimal transport-based ambiguity set is bounded above by the sum of the expected loss under the reference distribution and regularization terms that penalize norms and Lipschitz moduli of the higher-order derivatives of the loss function
- This reveals intimate connections between robustification and gradient, Hessian, and Lipschitz regularization
- The bound uses a norm on the space of totally symmetric higher-order tensors that is induced by a vector norm
- The smoothness conditions are stated in terms of the higher-order partial derivatives of the loss function
- The resulting upper bound on the worst-case expected loss is the sum of the expected loss under the reference distribution, variation regularization terms of successive orders, and a Lipschitz regularization term
- Variation regularizers are used in machine learning applications
- Theorem 3.2 gives heuristic regularization schemes a theoretical justification
- The estimate can be simplified by upper bounding all variation regularization terms by the corresponding Lipschitz regularizers
- The results generalize several bounds from the extant literature
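The coarsest form of such a bound, with every term replaced by a Lipschitz regularizer, can be checked numerically. This is a minimal sketch and not a reproduction of Theorem 3.2. It assumes a one-dimensional parameter, the transportation cost |z − ẑ|, and a softplus-type loss with Lipschitz modulus 1. Shifting each reference sample yields a distribution whose 1-Wasserstein distance to the reference is at most the average shift, so its expected loss should not exceed the nominal expected loss plus radius times Lipschitz modulus.

```python
import math


def softplus_loss(z):
    """Smooth decreasing loss log(1 + exp(-z)); its Lipschitz modulus is 1."""
    return math.log1p(math.exp(-z))


def mean_loss(loss, samples):
    """Expected loss under the uniform distribution on the samples."""
    return sum(loss(z) for z in samples) / len(samples)


reference = [-1.0, 0.2, 1.5, 3.0]       # illustrative reference samples
shifts = [-0.4, -0.1, 0.0, -0.3]        # how an adversary moves each sample
perturbed = [z + s for z, s in zip(reference, shifts)]

# Transport budget consumed by the shifts: the average distance moved.
radius = sum(abs(s) for s in shifts) / len(shifts)

nominal = mean_loss(softplus_loss, reference)
shifted = mean_loss(softplus_loss, perturbed)
bound = nominal + radius * 1.0          # expected loss + radius * Lipschitz modulus
print(nominal, shifted, bound)
```

The higher-order variation terms described above tighten this estimate by using derivative information instead of a single global Lipschitz constant.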
Dual regularizing effects of robustification
- The DRO problem considered has the form inf_{θ∈Θ} sup_Q E_Q[L(⟨θ, Z⟩)], where the supremum ranges over the ambiguity set and L : R → (−∞, +∞] is a univariate loss function
- Several important problems in operations research and machine learning can be framed as instances of this DRO problem
- Assumption 1.1 is assumed to hold
- The DRO problem is equivalent to a stochastic program, that is, a single infimum whose objective involves the expected c-transform of the loss
- The c-transform of ℓ, which takes the decision θ, a dual multiplier and a reference sample ẑ as arguments, is defined in (20)
- The c-transform can be expressed in terms of a suitable envelope of L
- The DRO problem can be solved efficiently even if the usual (piecewise) concavity or convexity conditions fail
- Evaluating the c-transform is equivalent to solving a univariate maximization problem
- The c-transform can be viewed as an approximation of ℓ(θ, ẑ) = L(⟨θ, ẑ⟩)
- Under a lower-bound condition involving the Lipschitz modulus lip(L), the worst-case expected loss over a 1-Wasserstein ball coincides with the expected loss under the reference distribution adjusted by a regularization term proportional to lip(L)‖θ‖_*
- If Θ is a cone, then the scaling factor can be eliminated by a suitable variable substitution
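The regularization identity in the last bullets can be illustrated with a small sketch. It rests on stated assumptions: the transportation cost is a p-norm on an unrestricted support, the loss is the hinge-type function L(t) = max{0, 1 − t} with Lipschitz modulus 1, and the regularization term is the radius times lip(L) times the dual norm of θ, as in the classical result for Lipschitz losses. The function names are hypothetical.

```python
def dual_norm(theta, p):
    """Norm dual to the p-norm, for p in {1, 2, inf}."""
    if p == 1:
        return max(abs(t) for t in theta)            # dual of the 1-norm is the inf-norm
    if p == 2:
        return sum(t * t for t in theta) ** 0.5      # the 2-norm is self-dual
    if p == float("inf"):
        return sum(abs(t) for t in theta)            # dual of the inf-norm is the 1-norm
    raise ValueError("only p in {1, 2, inf} is supported in this sketch")


def hinge(t):
    """Univariate loss L(t) = max{0, 1 - t}; Lipschitz modulus 1."""
    return max(0.0, 1.0 - t)


def regularized_objective(theta, samples, radius, p, loss=hinge, lip=1.0):
    """Empirical loss of the linear predictor plus radius * lip(L) * dual norm of theta."""
    scores = [sum(t * zi for t, zi in zip(theta, z)) for z in samples]
    nominal = sum(loss(s) for s in scores) / len(scores)
    return nominal + radius * lip * dual_norm(theta, p)


theta = [3.0, -4.0]
samples = [[1.0, 0.0], [0.0, 1.0]]
print(regularized_objective(theta, samples, radius=0.1, p=2))
```

Minimizing this objective over θ is an ordinary regularized learning problem. This is the sense in which robustification over a 1-Wasserstein ball acts as norm regularization.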
Numerical experiments
- Linear and second-order cone programs implemented in Python
- Solved with Gurobi 10.0.0 on a 2.4 GHz quad-core machine with 8 GB RAM
Nash equilibria
- Computation of Nash equilibria between a statistician and nature in the context of a distributionally robust support vector machine problem
- Feature vector x ∈ X ⊆ R^(d−1) and label y ∈ Y = {−1, +1}
- Weight vector θ ∈ Θ = R^(d−1) of a linear classifier
- Hinge loss function ℓ(θ, z) = max{0, 1 − y⟨θ, x⟩} with z = (x, y)
- Reference distribution P̂ is set to the empirical (uniform) distribution on the training samples ẑ_i = (x̂_i, ŷ_i)
- Transportation cost function defined
- Nash strategies for the statistician and nature can be computed by solving finite convex programs
- Continuum of least favorable distributions, representing different Nash strategies of nature
- Instance of the distributionally robust support vector machine problem with = 3, = 0.1 and = 20
- -norm to quantify the transportation cost in the feature space
- Distributionally robust support vector machine to distinguish greyscale images of handwritten numbers 3 and 8 from the MNIST 3-vs-8 dataset
- Feature vector x ∈ X = [0, 1]^(d−1) with d = 787
- ∞-norm to quantify the transportation cost in the feature space
- Compare least favorable distributions (Nash strategies of nature) against worst-case distributions (best response strategies of nature)
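The ingredients of the support vector machine experiment can be sketched in a few lines. The sketch assumes the standard hinge loss with labels in {−1, +1} and shows how a single adversarial shift of a feature vector, measured in the ∞-norm, raises the loss. The helper `shift_features` is hypothetical, ignores any box constraint on the features, and is not the construction of least favorable distributions summarized above.

```python
def inner(theta, x):
    """Inner product of the weight vector and a feature vector."""
    return sum(t * xi for t, xi in zip(theta, x))


def hinge_loss(theta, x, y):
    """Hinge loss max{0, 1 - y * <theta, x>} for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * inner(theta, x))


def empirical_risk(theta, data):
    """Expected hinge loss under the uniform distribution on the training samples."""
    return sum(hinge_loss(theta, x, y) for x, y in data) / len(data)


def shift_features(theta, x, y, delta):
    """Move x by delta in the inf-norm in the direction that lowers the margin y * <theta, x>."""
    sign = lambda t: (t > 0) - (t < 0)
    return [xi - y * delta * sign(t) for t, xi in zip(theta, x)]


theta = [1.0, -2.0]
data = [([0.5, 0.5], 1), ([0.2, 0.9], -1)]
x, y = data[0]
x_adv = shift_features(theta, x, y, delta=0.1)
print(empirical_risk(theta, data))
print(hinge_loss(theta, x, y), hinge_loss(theta, x_adv, y))
```

When the hinge loss is active, the shift lowers the margin by delta times the 1-norm of θ, which is the dual of the ∞-norm used to measure the shift.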
Distributionally robust log-optimal portfolio selection
- The components of the random vector of asset returns represent the total returns of the assets over the next month
- A constantly rebalanced portfolio is represented by a weight vector in the probability simplex
- Distribution P is unknown in practice
- Model distributional ambiguity via optimal transport-based ambiguity set
- Maximization problem in (28) can be solved efficiently by sorting
- Out-of-sample performance of log-optimal portfolios with 10 assets assessed
- Unknown true asset return distribution P assumed to be lognormal
- Maximizer of problem (28) is unique and fully determined by first-order optimality condition
- Maximizer of problem (28) is bounded
- As tends to infinity, converges to
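The quantity assessed in this experiment, the expected logarithmic disutility of a constantly rebalanced portfolio, can be evaluated on sampled returns as follows. This is a minimal sketch with two assets and two hand-picked return scenarios. It does not reproduce problem (28) or the lognormal setup, and the function names are hypothetical.

```python
import math


def on_simplex(theta, tol=1e-9):
    """True if theta has nonnegative weights that sum to one."""
    return all(t >= -tol for t in theta) and abs(sum(theta) - 1.0) <= tol


def log_disutility(theta, returns):
    """Empirical mean of -log(<theta, z>) over sampled total-return vectors z."""
    if not on_simplex(theta):
        raise ValueError("portfolio weights must lie in the probability simplex")
    growth = [sum(t * zi for t, zi in zip(theta, z)) for z in returns]
    return -sum(math.log(g) for g in growth) / len(growth)


# Two illustrative scenarios: the second asset either doubles or halves.
returns = [[1.0, 2.0], [1.0, 0.5]]
for theta in ([1.0, 0.0], [0.0, 1.0], [0.5, 0.5]):
    print(theta, log_disutility(theta, returns))
```

In this toy example either single-asset portfolio has zero expected log-growth, while the evenly rebalanced portfolio achieves a strictly lower disutility.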
C technical background results
- Strong duality states that the adversarial examples implied by nature’s best response can only deceive an algorithm, while the adversarial examples implied by nature’s Nash strategy can even deceive a human.
- Relaxed self service condition states that for any integer up to a certain number, the directional derivative of the loss function is Lipschitz continuous in that direction.
- Smoothness properties of the loss function state that the loss function is continuously differentiable and that its derivatives satisfy certain bounds.
- Regularization by robustification over Wasserstein balls states that, if certain assumptions hold, the worst-case expected loss over the Wasserstein ball equals the minimum of the dual problem.
- Asymptotically steep Lipschitz continuous loss states that if the asymptotic linear growth rate of the loss function equals its Lipschitz modulus, then the Pasch-Hausdorff envelope of the loss function coincides with the loss function when the envelope parameter is at least the Lipschitz modulus and equals infinity otherwise.
- Non-uniqueness of Nash equilibria states that if a certain maximization problem is solved, then the corresponding objective function values coincide.
- Given certain parameters, the out-of-sample performance of a fixed portfolio is evaluated empirically.
- Accounting for distributional ambiguity reduces the expected logarithmic disutility and the dispersion of the disutility.
- If an upper semi-continuous function satisfies a growth condition, then the -th envelope converges to the function as grows.