Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed a new approach to portfolio optimization that combines synthetic data generation with a CVaR constraint
  • Formulated the portfolio optimization problem as an asset allocation problem
  • Used a modified CTGAN algorithm to generate synthetic return scenarios
  • Relied on several points along the U.S. Treasury yield curve for contextual information
  • Demonstrated merits of approach with an example based on ten asset classes
  • Synthetic generation process captures key characteristics of original data
  • Optimization scheme results in portfolios with satisfactory out-of-sample performance
  • Outperforms conventional equal-weights (1/N) asset allocation strategy and other optimization formulations based on historical data only

Paper Content

Motivation and previous work

  • The portfolio selection problem is one of the oldest problems in applied finance
  • Before 1952, the issue was tackled with gut feeling, intuition, and common sense
  • Harry Markowitz published a paper in 1952 that showed the portfolio selection problem was an optimization problem
  • The key ideas behind Markowitz’s framework (diversification, risk/return tradeoff, efficient frontier) have survived well
  • Implementing Markowitz’s approach has been problematic because the correlation matrix of returns is difficult to estimate and because standard deviation, which treats gains and losses alike, is used to describe risk
  • Most research efforts have been aimed at devising practical strategies to implement the mean-variance (MV) formulation
  • John Bogle introduced the concept of passive investment in 1975, which shifted the emphasis from asset selection to asset allocation
  • The Conditional-Value-at-Risk (CVaR) has become the risk metric of choice
  • Generative Adversarial Networks (GANs) have been used to generate realistic synthetic data
  • The joint behavior of a group of assets can fluctuate between discrete states, known as market regimes
  • Features (contextual information) have been incorporated into the formulation of optimization problems
  • The goal is to propose a method to tackle the portfolio selection problem based on an asset allocation approach

Problem formulation

  • Investor seeks to maximize return by selecting appropriate exposure to each asset class while keeping overall portfolio risk below predefined tolerance level
  • Risk metric chosen is Conditional-Value-at-Risk (CVaR)
  • CVaR is chosen because it focuses on losses, unlike the standard deviation of returns, which penalizes gains and losses alike
  • CVaR is convex and coherent
  • Problem can be written in discretized and linear fashion
  • Return samples drawn from the relevant probability distribution are combined with a discrete probability π assigned to each sample
  • The weights π can be modified to adapt the formulation to the case with features
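
The discretized, linear form mentioned above can be sketched with the standard Rockafellar-Uryasev linearization of CVaR. This is a minimal sketch under stated assumptions, not the paper's exact formulation: the function name `max_return_under_cvar`, the long-only and fully-invested constraints, and the use of SciPy's `linprog` are all choices made here for illustration. `R` holds one return scenario per row and `pi` holds the scenario probabilities.

```python
import numpy as np
from scipy.optimize import linprog


def max_return_under_cvar(R, pi, alpha, cvar_limit):
    """Maximize expected return subject to CVaR_alpha(loss) <= cvar_limit.

    R          : (S, n) array, S return scenarios for n asset classes
    pi         : (S,) scenario probabilities, summing to 1
    alpha      : CVaR confidence level, e.g. 0.95
    cvar_limit : risk tolerance on the CVaR of the portfolio loss

    Decision vector x = [w (n weights), zeta (1 auxiliary), u (S excess losses)].
    Assumed constraints (not from the source): long-only, weights sum to 1.
    """
    S, n = R.shape

    # linprog minimizes, so negate the expected return pi^T R w.
    c = np.concatenate([-(pi @ R), [0.0], np.zeros(S)])

    # CVaR constraint: zeta + (1 / (1 - alpha)) * sum_j pi_j * u_j <= cvar_limit
    cvar_row = np.concatenate([np.zeros(n), [1.0], pi / (1.0 - alpha)])

    # Excess-loss constraints: -R_j w - zeta - u_j <= 0 for every scenario j
    loss_rows = np.hstack([-R, -np.ones((S, 1)), -np.eye(S)])

    A_ub = np.vstack([cvar_row, loss_rows])
    b_ub = np.concatenate([[cvar_limit], np.zeros(S)])

    # Fully invested: the weights sum to one.
    A_eq = np.concatenate([np.ones(n), [0.0], np.zeros(S)]).reshape(1, -1)

    # w in [0, 1], zeta free, u >= 0
    bounds = [(0, 1)] * n + [(None, None)] + [(0, None)] * S

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    if not res.success:
        raise ValueError(f"LP did not solve: {res.message}")
    return res.x[:n]
```

With uniform `pi` every scenario counts equally. Passing non-uniform probabilities is one way the weights π could be adjusted when features are available.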

Synthetic data generation

  • Generating random samples from a given probability density function is a straightforward task in principle.
  • In practice, there are two major limitations: knowing only one path generated by an unknown stochastic process, and the non-stationary nature of the stochastic processes.
  • Using machine learning techniques to generate synthetic data based on recent historical data.
  • Conditional Tabular Generative Adversarial Networks (CTGAN) used to generate realistic synthetic data.
  • CTGAN models the dataset as a conditional process, with continuous variables dependent on discrete variables.
  • CTGAN introduces the notions of conditional generator and a training-by-sampling process.
  • CTGAN improves the normalization of the continuous columns.

A modified CTGAN-plus-features method

  • Generate state-aware synthetic data using CTGAN architecture
  • Unsupervised method to generate discrete market regimes or states
  • Identify clusters of samples exhibiting similar characteristics
  • Use cluster identifier as state-defining variable
  • Reduce noise from trivially correlated assets using principal component analysis (PCA)
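
The regime-labelling steps above (project with PCA, cluster, use the cluster identifier as a discrete state) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the summary does not say which clustering algorithm is used, so plain k-means stands in here as an assumption, and the function names `pca_project` and `kmeans_labels` are hypothetical.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def kmeans_labels(Z, k, n_iter=50, seed=0):
    """Assign each row of Z to one of k clusters using Lloyd's algorithm.

    k-means is an assumption made for this sketch, not the paper's stated method.
    """
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every sample to every center.
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def label_regimes(returns, n_components=2, k=3):
    """Return one integer regime label per row of the returns matrix."""
    return kmeans_labels(pca_project(returns, n_components), k)
```

The resulting label column can then be appended to the returns table as the discrete, state-defining variable on which a CTGAN-style generator conditions.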

Example of application

  • Investor has access to 10 asset classes
  • Rebalancing portfolio once a year
  • Data from 2003-2022
  • 5-year lookback period to generate synthetic returns data
  • Incorporating features into the optimization problem to improve out-of-sample performance
  • Using Treasury yield curve as indicator of macroeconomic environment

Synthetic data generation process (SDGP) validation

  • The Synthetic Data Generation Process (SDGP) is central to the approach
  • Need to investigate whether the CTGAN model generates suitable scenarios
  • Comparing similarity between input and output distributions
  • Comparing both single-variable (marginal) and joint (multivariate) distributions
  • Kolmogorov-Smirnov test (KS-test) used to compare original and synthetic data
  • Correlation similarity index used to compare correlation matrices of original and synthetic data
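
Both checks can be sketched with NumPy. The two-sample KS statistic below is the standard one (the largest gap between the two empirical CDFs). The correlation similarity function is an assumption: the summary does not give the paper's formula, so the hypothetical `correlation_similarity` simply maps the mean absolute difference between off-diagonal correlations onto a 0 to 1 scale.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])                # the gap can only peak at data points
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())


def correlation_similarity(original, synthetic):
    """Illustrative index in [0, 1]; 1 means identical correlation matrices.

    Assumed definition: 1 - mean(|C_orig - C_synth|) / 2 over off-diagonal entries.
    """
    c_orig = np.corrcoef(original, rowvar=False)
    c_syn = np.corrcoef(synthetic, rowvar=False)
    off_diag = ~np.eye(c_orig.shape[0], dtype=bool)
    return float(1.0 - np.abs(c_orig - c_syn)[off_diag].mean() / 2.0)


def validate(original, synthetic):
    """Per-asset KS statistics plus one joint correlation score."""
    ks = [ks_statistic(original[:, j], synthetic[:, j])
          for j in range(original.shape[1])]
    return ks, correlation_similarity(original, synthetic)
```

A small KS statistic for every asset suggests the marginal distributions match, while a correlation score near 1 suggests the joint structure is preserved.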

Testing strategy

  • Five strategies tested: CTGAN without features, CTGAN with features, Historical data without features, Historical data with features, Equal Weights
  • Optimization model run once a year based on 5-year lookback periods
  • Performance metrics: returns, risk, transaction costs, diversification
  • Optimization problem solved for several CVaR limits
  • Optimization run 5 times for each CVaR tolerance level
  • Backtesting process repeated until January 2022
  • Final test done in July 2022
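
The yearly rebalancing loop with a fixed lookback can be sketched as follows. This is a simplified sketch, not the paper's backtesting code: the function names are hypothetical, weights are held as a constant mix within each test year, and transaction costs and repeated runs per CVaR level are left out. Any allocator (for example a CVaR-constrained optimizer fed with historical or synthetic scenarios) can be passed in place of `equal_weights`.

```python
import numpy as np


def equal_weights(train_returns):
    """1/N benchmark: ignores the training data apart from the asset count."""
    n = train_returns.shape[1]
    return np.full(n, 1.0 / n)


def rolling_backtest(returns, years, lookback, allocate):
    """Rebalance once per year using the previous `lookback` years as input.

    returns  : (T, n) array of periodic asset returns
    years    : (T,) integer array giving the calendar year of each row
    lookback : number of years of history passed to the allocator
    allocate : function mapping a (t, n) training array to n weights
    Returns a dict {test_year: out-of-sample compounded portfolio return}.
    """
    first_test_year = int(years.min()) + lookback
    last_test_year = int(years.max())
    results = {}
    for test_year in range(first_test_year, last_test_year + 1):
        in_window = (years >= test_year - lookback) & (years < test_year)
        weights = allocate(returns[in_window])
        # Constant-mix assumption: the same weights apply to every period of the year.
        portfolio = returns[years == test_year] @ weights
        results[test_year] = float(np.prod(1.0 + portfolio) - 1.0)
    return results
```

For example, `rolling_backtest(returns, years, lookback=5, allocate=equal_weights)` would produce one out-of-sample return per year, starting five years after the first year of data.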

Performance comparison

  • Experiments run on MacBook Pro 14 with M1 Pro chip and 16 GB RAM
  • 5-year window of daily historical scenarios used as input
  • 500 synthetic scenarios used for CTGAN-based strategies
  • 500 historical scenarios used for historical-based strategies
  • The equal-weights (EW) strategy does not depend on any scenarios
  • Historical-based strategies took 0.001 seconds per rebalance cycle
  • CTGAN-based strategies took 203.5 seconds per rebalance cycle
  • Features-based strategies outperform their non-features counterparts
  • CTGAN with features (GwF) outperforms historical data with features (HwF), especially in terms of returns
  • The EW strategy underperforms the other strategies
  • All strategies display fairly similar diversification levels
  • Trading expenses have no significant impact on returns
  • No hyperparameter tuning was performed, in order to avoid overfitting
  • Lookback period of 3-5 years and rebalancing period of 1 year

Conclusions

  • Synthetic data generating approach suggested is promising
  • Captures essential character of historical data
  • Incorporating contextual information is beneficial
  • Benefits of using alternatives to yield curve as features should be explored
  • Synthetic data generating method should be applied to other financial variables