Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed a new approach to portfolio optimization that combines synthetic data generation with a CVaR constraint
- Formulated the portfolio optimization problem as an asset allocation problem
- Used a modified CTGAN algorithm to generate synthetic return scenarios
- Relied on several points along the U.S. Treasury yield curve for contextual information
- Demonstrated merits of approach with an example based on ten asset classes
- Synthetic generation process captures key characteristics of original data
- Optimization scheme results in portfolios with satisfactory out-of-sample performance
- Outperforms conventional equal-weights (1/N) asset allocation strategy and other optimization formulations based on historical data only
Paper Content
Motivation and previous work
- The portfolio selection problem is one of the oldest problems in applied finance
- Before 1952, the issue was tackled with gut feeling, intuition, and common sense
- Harry Markowitz published a paper in 1952 that showed the portfolio selection problem was an optimization problem
- The key ideas behind Markowitz’s framework (diversification, risk/return tradeoff, efficient frontier) have survived well
- Implementing Markowitz’s approach has been problematic because the correlation-of-returns matrix is difficult to estimate and standard deviation is a questionable way to describe risk
- Most research efforts have been aimed at devising practical strategies to implement the MV formulation
- John Bogle introduced the concept of passive investment in 1975, which shifted the emphasis from asset selection to asset allocation
- The Conditional-Value-at-Risk (CVaR) has become the risk metric of choice
- Generative Adversarial Networks (GANs) have been used to generate realistic synthetic data
- The joint behavior of a group of assets can fluctuate between discrete states, known as market regimes
- Features have been incorporated into the formulation of optimization problems
- The goal is to propose a method to tackle the portfolio selection problem based on an asset allocation approach
Problem formulation
- Investor seeks to maximize return by selecting appropriate exposure to each asset class while keeping overall portfolio risk below predefined tolerance level
- Risk metric chosen is Conditional-Value-at-Risk (CVaR)
- CVaR chosen because it focuses on losses in the tail of the distribution, whereas the standard deviation of returns penalizes gains and losses alike
- CVaR is convex and coherent
- Problem can be written in discretized and linear fashion
- Scenarios sampled from the relevant probability distribution of returns are used in combination with a discrete probability density function (the weights π)
- Weights π can be modified to adjust formulation for case with features
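As an illustration of the discretized formulation, the sketch below evaluates the CVaR of a candidate portfolio over a finite set of return scenarios with probabilities π. It uses the standard Rockafellar-Uryasev representation, CVaR = min over α of α + (1/(1-β)) Σ π_s max(loss_s - α, 0). The function names, the confidence level β and the sample numbers are assumptions made for this example and are not taken from the paper.

```python
def portfolio_losses(scenario_returns, weights):
    """Loss of the portfolio in each scenario (loss = minus portfolio return)."""
    return [-sum(w * r for w, r in zip(weights, row)) for row in scenario_returns]


def portfolio_cvar(scenario_returns, weights, pi, beta=0.95):
    """CVaR at level beta over discrete scenarios with probabilities pi."""
    losses = portfolio_losses(scenario_returns, weights)

    def ru_objective(alpha):
        # Rockafellar-Uryasev auxiliary function evaluated at threshold alpha.
        tail = sum(p * max(loss - alpha, 0.0) for p, loss in zip(pi, losses))
        return alpha + tail / (1.0 - beta)

    # The objective is convex and piecewise linear in alpha, with kinks at the
    # scenario losses, so its minimum is attained at one of those losses.
    return min(ru_objective(alpha) for alpha in losses)


def expected_return(scenario_returns, weights, pi):
    """Probability-weighted mean portfolio return."""
    return -sum(p * l for p, l in zip(pi, portfolio_losses(scenario_returns, weights)))


def satisfies_cvar_limit(scenario_returns, weights, pi, cvar_limit, beta=0.95):
    """True when the portfolio's CVaR stays within the investor's tolerance."""
    return portfolio_cvar(scenario_returns, weights, pi, beta) <= cvar_limit
```

In the full optimization, the same expression enters as a linear constraint (with α and one auxiliary variable per scenario), and changing π is the hook through which features can reweight the scenarios.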
Synthetic data generation
- Generating random samples from a given probability density function is a straightforward task in principle.
- In practice, there are two major limitations: knowing only one path generated by an unknown stochastic process, and the non-stationary nature of the stochastic processes.
- Using machine learning techniques to generate synthetic data based on recent historical data.
- Conditional Tabular Generative Adversarial Networks (CTGAN) used to generate realistic synthetic data.
- CTGAN models the dataset as a conditional process, with continuous variables dependent on discrete variables.
- CTGAN introduces the notions of conditional generator and a training-by-sampling process.
- CTGAN improves the normalization of the continuous columns.
A modified CTGAN-plus-features method
- Generate state-aware synthetic data using CTGAN architecture
- Unsupervised method to generate discrete market regimes or states
- Identify clusters of samples exhibiting similar characteristics
- Use cluster identifier as state-defining variable
- Reduce the noise introduced by trivially correlated assets using PCA
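The idea of turning clusters into a state variable can be sketched as follows. Plain k-means is swapped in here purely for illustration; the clustering algorithm, the number of states and the PCA step the paper applies are not reproduced, and `kmeans_labels` is a hypothetical helper. The input is assumed to be rows of already dimension-reduced data, given as tuples of floats.

```python
import random


def kmeans_labels(points, k, iters=50, seed=0):
    """Assign each point (a tuple of floats) to one of k clusters (Lloyd's k-means)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's centre
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def add_state_column(rows, labels):
    """Append the cluster identifier to each row as a discrete state variable."""
    return [tuple(row) + (label,) for row, label in zip(rows, labels)]
```

The extra column would then be declared as a discrete variable when training the generator, so that synthetic samples can be conditioned on the market state.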
Example of application
- Investor has access to 10 asset classes
- Rebalancing portfolio once a year
- Data from 2003-2022
- 5-year lookback period to generate synthetic returns data
- Incorporating features into the optimization problem to improve out-of-sample performance
- Using Treasury yield curve as indicator of macroeconomic environment
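One common way to let features such as yield-curve points influence the optimization is to reweight the return scenarios according to how close each scenario's features are to the current ones. The sketch below uses a k-nearest-neighbour rule for this; whether the paper uses exactly this rule is an assumption, and `knn_scenario_weights`, the distance measure and the value of k are hypothetical choices for the example.

```python
def knn_scenario_weights(scenario_features, current_features, k):
    """Return weights pi: 1/k for the k scenarios nearest to the current features, else 0."""
    # Squared Euclidean distance between each scenario's features and today's.
    dists = [
        sum((a - b) ** 2 for a, b in zip(features, current_features))
        for features in scenario_features
    ]
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    pi = [0.0] * len(dists)
    for i in nearest:
        pi[i] = 1.0 / k
    return pi
```

Scenarios observed under a yield curve similar to today's receive all the probability mass, while the remaining scenarios drop out of the expected-return and CVaR terms.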
Synthetic data generation process (SDGP) validation
- The Synthetic Data Generation Process (SDGP) is a central component of the approach
- Need to investigate if the CTGAN model generates suitable scenarios
- Comparing similarity between input and output distributions
- Comparing single and joint multivariate distributions
- Kolmogorov-Smirnov test (KS-test) used to compare original and synthetic data
- Correlation similarity index used to compare correlation matrices of original and synthetic data
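For the single-variable comparison, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution functions of the original and synthetic samples. A minimal sketch in plain Python follows; it returns only the statistic (no p-value), and the function name is an assumption made for this example.

```python
from bisect import bisect_right


def ks_statistic(original, synthetic):
    """Two-sample KS statistic: max absolute gap between the two empirical CDFs."""
    a, b = sorted(original), sorted(synthetic)
    # Both empirical CDFs are step functions that only change at sample points,
    # so it is enough to evaluate the gap at every pooled observation.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
```

A value near 0 indicates that the synthetic returns of an asset follow a distribution close to the original one; a value near 1 indicates almost no overlap.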
Testing strategy
- Five strategies tested: CTGAN without features, CTGAN with features, Historical data without features, Historical data with features, Equal Weights
- Optimization model run once a year based on 5-year lookback periods
- Performance metrics: returns, risk, transaction costs, diversification
- Optimization problem solved for several CVaR limits
- Optimization run 5 times for each CVaR tolerance level
- Backtesting process repeated until January 2022
- Final test done in July 2022
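The yearly rebalancing schedule with a fixed lookback can be sketched as a generator of training and evaluation windows. The exact calendar dates (first of January), the one-year holding window and the function name are assumptions for illustration; the source only states yearly rebalancing with 5-year lookback periods.

```python
from datetime import date


def rebalance_windows(first_data_year, last_rebalance_year, lookback_years=5):
    """Yield (lookback_start, rebalance_date, holding_end) for each yearly rebalance."""
    first_rebalance_year = first_data_year + lookback_years
    for year in range(first_rebalance_year, last_rebalance_year + 1):
        yield (
            date(year - lookback_years, 1, 1),  # start of the lookback data
            date(year, 1, 1),                   # portfolio is re-optimized here
            date(year + 1, 1, 1),               # end of the out-of-sample holding period
        )
```

At each rebalance date, the scenarios (historical or synthetic) are built from the lookback window only, the optimization is solved for each CVaR limit, and the resulting weights are held over the following window.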
Performance comparison
- Experiments run on MacBook Pro 14 with M1 Pro chip and 16 GB RAM
- 5-year window of daily historical scenarios used as input
- 500 synthetic scenarios used for CTGAN-based strategies
- 500 historical scenarios used for historical-based strategies
- Equal-weights (EW) strategy does not depend on any scenarios
- Historical-based strategies had 0.001 seconds running time per rebalance cycle
- CTGAN-based strategies had 203.5 seconds running time per rebalance cycle
- Features-based strategies outperform non-features options
- GwF (CTGAN with features) outperforms HwF (historical data with features), especially in terms of returns
- EW strategy underperforms compared to other strategies
- All strategies display fairly similar diversification levels
- Trading expenses have no significant impact on returns
- No hyperparameter-tuning process to avoid overfitting
- Lookback period of 3-5 years and rebalancing period of 1 year
Conclusions
- Synthetic data generating approach suggested is promising
- Captures essential character of historical data
- Incorporating contextual information is beneficial
- Benefits of using alternatives to yield curve as features should be explored
- Synthetic data generating method should be applied to other financial variables