Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposed a new approach to portfolio optimization that combines synthetic data generation with a CVaR constraint
Formulated the portfolio optimization problem as an asset allocation problem
Used a modified CTGAN algorithm to generate synthetic return scenarios
Relied on several points along the U.S. Treasury yield curve for contextual information
Demonstrated merits of approach with an example based on ten asset classes
Synthetic generation process captures key characteristics of original data
Optimization scheme results in portfolios with satisfactory out-of-sample performance
Outperforms the conventional equal-weights (1/N) asset allocation strategy and other optimization formulations based on historical data only

The portfolio selection problem is one of the oldest problems in applied finance
Before 1952, the issue was tackled with gut feeling, intuition, and common sense
Harry Markowitz published a paper in 1952 that showed the portfolio selection problem was an optimization problem
The key ideas behind Markowitz’s framework (diversification, risk/return tradeoff, efficient frontier) have survived well
Implementing Markowitz’s approach has been problematic because the correlation-of-returns matrix is difficult to estimate and because risk is described by the standard deviation
Most research efforts have been aimed at devising practical strategies to implement the mean-variance (MV) formulation
John Bogle introduced the concept of passive investment in 1975, which shifted the emphasis from asset selection to asset allocation
The Conditional-Value-at-Risk (CVaR) has become the risk metric of choice
Generative Adversarial Networks (GANs) have been used to generate realistic synthetic data
The joint behavior of a group of assets can fluctuate between discrete states, known as market regimes
Features have been incorporated into the formulation of optimization problems
The goal is to propose a method to tackle the portfolio selection problem based on an asset allocation approach

Investor seeks to maximize return by selecting appropriate exposure to each asset class while keeping overall portfolio risk below predefined tolerance level
Risk metric chosen is Conditional-Value-at-Risk (CVaR)
CVaR chosen because it focuses on tail losses, whereas the standard deviation of returns penalizes gains and losses alike
CVaR is convex and coherent
Problem can be written in discretized and linear fashion
Samples from the relevant probability distribution of returns are used in combination with a discrete probability function
Weights π can be modified to adapt the formulation to the case with features
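
The discretized CVaR can be illustrated with a small sketch. It assumes the standard Rockafellar-Uryasev representation, CVaR = min over ζ of ζ + Σ π_j max(loss_j − ζ, 0) / (1 − α), which may differ in detail from the paper's own formulation. The function names, the confidence level and the sample numbers are illustrative only.

```python
def portfolio_losses(scenarios, weights):
    """Loss of the portfolio in each scenario (the negative of its return)."""
    return [-sum(w * r for w, r in zip(weights, row)) for row in scenarios]


def cvar(losses, probs, alpha):
    """Weighted CVaR at level alpha (0 < alpha < 1) of a discrete loss sample.

    Uses F(zeta) = zeta + sum_j pi_j * max(loss_j - zeta, 0) / (1 - alpha).
    F is convex and piecewise linear with kinks at the sample losses,
    so its minimum is attained at one of them.
    """
    def f(zeta):
        excess = sum(p * max(l - zeta, 0.0) for l, p in zip(losses, probs))
        return zeta + excess / (1.0 - alpha)

    return min(f(z) for z in losses)


# Hypothetical example: one asset, ten scenarios with returns -1, ..., -10.
scenarios = [[-float(k)] for k in range(1, 11)]
pi = [0.1] * 10          # equal weights; non-uniform pi would encode features
losses = portfolio_losses(scenarios, [1.0])
print(round(cvar(losses, pi, alpha=0.8), 6))  # 9.5, the mean of the two worst losses
```

Because the expression inside the minimum is linear in the portfolio weights once each max term is replaced by an auxiliary variable, a constraint of the form CVaR ≤ limit can be added to a linear program.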

Generating random samples from a given probability density function is a straightforward task in principle.
In practice, there are two major limitations: only one path generated by an unknown stochastic process is observed, and that stochastic process is non-stationary.
Machine learning techniques are used to generate synthetic data based on recent historical data.
Conditional Tabular Generative Adversarial Networks (CTGAN) are used to generate realistic synthetic data.
CTGAN models the dataset as a conditional process, with continuous variables dependent on discrete variables.
CTGAN introduces the notions of conditional generator and a training-by-sampling process.
CTGAN improves the normalization of the continuous columns.

Investor has access to 10 asset classes
Rebalancing portfolio once a year
Data from 2003-2022
5-year lookback period to generate synthetic returns data
Incorporating features into the optimization problem to improve out-of-sample performance
Using the Treasury yield curve as an indicator of the macroeconomic environment

Synthetic Data Generation Process (SDGP) is central to the approach
Need to investigate whether the CTGAN model generates suitable scenarios
Comparing similarity between input and output distributions
Comparing single and joint multivariate distributions
Kolmogorov-Smirnov test (KS-test) used to compare original and synthetic data
Correlation similarity index used to compare correlation matrices of original and synthetic data
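
The two comparisons above can be sketched in plain Python. The two-sample KS statistic is the standard one (the largest gap between the two empirical distribution functions). The correlation comparison is only a stand-in: it reports the mean absolute difference between pairwise Pearson correlations, which is an assumption and not necessarily the paper's correlation similarity index. All function names are hypothetical.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(a, b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over pooled points."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)


def mean_abs_corr_gap(real_cols, synth_cols):
    """Assumed similarity measure: mean |corr_real - corr_synth| over asset pairs.

    Each argument is a list of columns (one list of returns per asset).
    A value of 0 means the two correlation matrices coincide.
    """
    n = len(real_cols)
    gaps = [
        abs(pearson(real_cols[i], real_cols[j]) - pearson(synth_cols[i], synth_cols[j]))
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(gaps) / len(gaps)
```

A KS statistic near 0 for each asset suggests the marginal distributions match, while a small correlation gap suggests the joint behavior is preserved; the two checks are complementary.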

Five strategies tested: CTGAN without features, CTGAN with features, Historical data without features, Historical data with features, Equal Weights
Optimization model run once a year based on 5-year lookback periods
Performance metrics: returns, risk, transaction costs, diversification
Optimization problem solved for several CVaR limits
Optimization run 5 times for each CVaR tolerance level
Backtesting process repeated until January 2022
Final test done in July 2022
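
The yearly rebalancing with a 5-year lookback can be pictured as a walk-forward loop. The sketch below only generates the training and test windows; it assumes calendar-year granularity, which is a simplification (the exact rebalancing dates are not reproduced here), and the function name is hypothetical.

```python
def walk_forward_windows(first_year, last_year, lookback=5):
    """Yield (training_years, test_year) pairs for a yearly rebalancing backtest.

    Each test year is preceded by `lookback` years of data that would feed
    the scenario generator and the optimizer.
    """
    for test_year in range(first_year + lookback, last_year + 1):
        yield list(range(test_year - lookback, test_year)), test_year


# Illustrative use with the 2003-2022 data span mentioned above.
for train_years, test_year in walk_forward_windows(2003, 2022):
    print(train_years[0], "-", train_years[-1], "->", test_year)
```

In a full backtest, each iteration would fit the generator on the training window, solve the optimization for every CVaR limit, and evaluate the resulting weights on the test year.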