Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Predicted next-day stock end-of-day implied volatility using random forests
- Examined usefulness of different sources of predictors and value of attention and sentiment features from Twitter
- Studied approach on 165 most liquid US stocks across 11 traditional market sectors
- Discovered stocks in certain sectors are easier to predict than others
- Discrepancies possibly explained by excess social media attention or low option liquidity
- Explored how approach fares throughout time by identifying four underlying market regimes in implied volatility
Paper Content
Introduction
- Social media has caused significant changes in the world, including financial markets
- Efficient Market Hypothesis (Fama, 1970) suggests rapid information diffusion could lead to higher price efficiency
- Behavioral economists argue social media could influence investors and incite herd behavior
- Data providers offer social media indicators to help financial institutions
- Academic studies have tried to quantify the interplay between social media and financial variables
- Existing research has mainly focused on Twitter and its influence on stock price, volatility and trading volume
- Current literature overlooks the interaction between social media and market implied volatility
- This study investigates the ability to predict one-day ahead movement in implied volatility using machine learning
- Stock universe is diversified across 11 traditional US stock market sectors
- Out-of-sample period spans from January 1st, 2013 to March 1st, 2019
- Hidden Markov models used to identify four regimes in implied volatility and measure performance across them
Preliminaries
- Explains market implied volatility and its relation to derivatives
- Describes random forest machine learning model for predictions
- Describes hidden Markov model for quantifying regimes in market implied volatility
Market implied volatility
- Options are a type of financial instrument
- Sellers of options are exposed to risk
- Measuring this risk requires considering expected price fluctuations of the underlying asset
- The CBOE Volatility Index combines implied volatility of different option contracts into an index
- The VIX is a measure of expected price fluctuations in the S&P 500 Index over the next 30 days
- Equation 1 is used to compute the VIX for a given term
- Option contracts typically have fixed expiration dates
- The VIX is calculated by linearly interpolating between two computed measures
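The interpolation step can be illustrated with a small sketch. It follows the general shape of CBOE's published methodology (time-weighted variances of a near-term and a next-term expiration, interpolated to a 30-day horizon), but it works in whole days rather than minutes, and it assumes the two per-term variances have already been computed. The paper's Equation 1 is not reproduced in this summary, so the function name and arguments below are illustrative only.

```python
import math

def interpolate_30d_vol(var_near, days_near, var_next, days_next, target_days=30.0):
    """Interpolate two annualized variances to a constant-maturity volatility.

    var_near / var_next: annualized variances for the two expirations
    (assumed to be computed elsewhere, e.g. from option prices).
    Returns the volatility in percentage points, as the VIX is quoted.
    """
    if days_near == days_next or not (days_near <= target_days <= days_next):
        raise ValueError("target maturity must lie between two distinct expirations")

    # Time to expiration in years.
    t_near = days_near / 365.0
    t_next = days_next / 365.0

    # Linear interpolation weights on the time axis.
    w_near = (days_next - target_days) / (days_next - days_near)
    w_next = 1.0 - w_near

    # Interpolate total (time-weighted) variance, then re-annualize.
    total_var = w_near * t_near * var_near + w_next * t_next * var_next
    return 100.0 * math.sqrt(total_var * 365.0 / target_days)
```

When both expirations carry the same variance, the interpolated value equals that common volatility, which is a quick sanity check on the weights.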
Random forests
- Random forests are a machine learning approach for learning a predictive model
- They consist of multiple decision (or regression) trees whose predictions are combined
- Combination is typically done by taking the mode (or average) of all outputs
- Random forests are fast to build, not affected by feature scaling, robust to irrelevant predictors and noisy data
- Constructing an ensemble model by randomly subsampling both data points and features helps reduce overfitting
Hidden Markov models
- Hidden Markov models (HMM) are a generative approach for modeling systems that follow a Markov process
- HMM models the joint distribution of a sequence of hidden states and observations
- Parameters of HMM are initial state distribution, state transition model, and emission probabilities model
- Three key tasks associated with an HMM: computing the probability of a sequence of observations, finding the most likely sequence of hidden states, and learning the HMM parameters
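The first two tasks can be sketched for a small discrete HMM: the forward algorithm for the probability of an observation sequence, and the Viterbi algorithm for the most likely hidden-state sequence. This is a simplified illustration only. It assumes discrete observation symbols, whereas a model of implied volatility would more likely use continuous emissions, and it omits parameter learning. The function and argument names are hypothetical.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability of the observation sequence under the HMM (forward algorithm).

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)


def viterbi(obs, start, trans, emit):
    """Most likely hidden-state sequence for the observations (Viterbi algorithm)."""
    n = len(start)
    delta = [start[s] * emit[s][obs[0]] for s in range(n)]
    backpointers = []
    for o in obs[1:]:
        pointers, new_delta = [], []
        for s in range(n):
            # Best predecessor state for landing in state s at this step.
            best_prev = max(range(n), key=lambda p: delta[p] * trans[p][s])
            pointers.append(best_prev)
            new_delta.append(delta[best_prev] * trans[best_prev][s] * emit[s][o])
        backpointers.append(pointers)
        delta = new_delta

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

For long sequences, a practical implementation would work in log space or rescale the probabilities at each step to avoid numerical underflow.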
Methodology
- Main goal of study is to explore 3 questions related to stock market performance
- Study uses random forests to predict implied volatility movements using stock price, implied volatility, and Twitter features
- Study covers 165 stocks over a 6-year period
- Performance is grouped by 11 traditional stock sectors
- Hidden Markov models used to identify 4 distinct implied volatility regimes per stock
Stock universe selection
- Looked at popular ETFs to obtain a diversified universe of stocks
- Selected 15 most liquid stocks per sector for a total of 165 stocks
- Excluded some stocks due to stock splits, late introduction, and ambiguous names
- Replaced excluded stocks to maintain 15 stocks per sector
Data acquisition and feature generation
- Data from Jan 1, 2011 to March 1, 2019 was used from 3 sources: stock prices, option contracts, and Twitter
- 4 features were extracted per stock for each trading day: closing price, 30-day implied volatility, total tweet count, and average sentiment polarity
- Sentiment polarity was calculated using VADER
- 2 additional predictors were generated per feature: daily difference and difference between daily value and exponential moving average of last 10 trading days
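A minimal sketch of the two derived predictors, applied to a single feature series, might look as follows. Several details are assumptions not stated in this summary: the smoothing factor is taken as 2 / (span + 1), the moving average is seeded with the first value, and the average includes the current day. The helper names are hypothetical.

```python
def ema(values, span=10):
    """Exponential moving average, seeded with the first value.

    Assumption: smoothing factor alpha = 2 / (span + 1).
    """
    alpha = 2.0 / (span + 1)
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1.0 - alpha) * smoothed[-1])
    return smoothed


def derived_predictors(values, span=10):
    """For each day from the second onward, return the raw value plus two predictors:
    the daily difference and the gap between the value and its moving average."""
    smoothed = ema(values, span)
    rows = []
    for t in range(1, len(values)):
        rows.append({
            "value": values[t],
            "daily_diff": values[t] - values[t - 1],
            "diff_from_ema": values[t] - smoothed[t],
        })
    return rows
```

Applied to each of the four features, and assuming the raw values are kept as predictors too, this would give twelve predictors per stock and trading day.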
Predicting movements in implied volatility
- This study aims to predict one-day ahead movements in a stock’s 30-day implied volatility.
- Random forest classifiers are used to predict the target variable.
- Random forests are chosen due to their performance on noisy data.
- 64 distinct random forest configurations are built using scikit-learn.
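As a rough illustration of this setup, the sketch below fits a scikit-learn `RandomForestClassifier` to a binary up/down target. Everything specific here is an assumption: the data is a synthetic mean-reverting series rather than real implied volatility, the target is defined as whether the next day's value is higher than today's, and the hyperparameters are placeholders, not any of the 64 configurations from the paper.

```python
import random
from sklearn.ensemble import RandomForestClassifier

# Synthetic mean-reverting "implied volatility" series (illustrative only).
rng = random.Random(0)
iv = [0.25]
for _ in range(399):
    step = 0.3 * (0.25 - iv[-1]) + rng.gauss(0, 0.02)
    iv.append(max(0.05, iv[-1] + step))

# Predictors: today's level and daily difference.
# Target: 1 if tomorrow's value is higher than today's, else 0 (assumed definition).
X = [[iv[t], iv[t] - iv[t - 1]] for t in range(1, len(iv) - 1)]
y = [int(iv[t + 1] > iv[t]) for t in range(1, len(iv) - 1)]

# Chronological split: train on the past, evaluate on the future.
split = 300
model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
model.fit(X[:split], y[:split])

predictions = model.predict(X[split:])
accuracy = model.score(X[split:], y[split:])
```

The split is chronological rather than shuffled, so the model never sees observations that come after the ones it is evaluated on.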
Experimental evaluation
- Evaluated different random forest configurations using walk-forward validation
- Walk-forward validation is a cross-validation technique designed for temporally ordered data
- Classical cross-validation methods assume observations to be independent
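The idea can be sketched as a generator of train/test index splits in which every test window lies strictly after its training data. The window sizes and the choice between an expanding and a rolling training window are assumptions here; the summary does not say which variant the paper uses.

```python
def walk_forward_splits(n_samples, initial_train, test_size, expanding=True):
    """Yield (train_indices, test_indices) pairs for temporally ordered data.

    expanding=True  : training window grows to include all past observations
    expanding=False : training window keeps a fixed length (rolling window)
    """
    start = initial_train
    while start + test_size <= n_samples:
        train_begin = 0 if expanding else start - initial_train
        train_idx = list(range(train_begin, start))
        test_idx = list(range(start, start + test_size))
        yield train_idx, test_idx
        start += test_size


# Example: 10 observations, 4 initial training points, test windows of 2.
for train_idx, test_idx in walk_forward_splits(10, 4, 2):
    print(train_idx, test_idx)
```

scikit-learn's `TimeSeriesSplit` provides a similar expanding-window scheme and could be used instead of a hand-written generator.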
Analyzing performance through time with hidden Markov models
- Financial markets are constantly changing, making it difficult to find successful approaches.
- The study evaluates how the proposed approach performs in different market regimes.
- Hidden Markov models were used to quantify market regimes.
- Four different regimes were identified.
Experimental results and discussion
- Conducted an ablation study with 7 different feature configurations
- Looked at performance of best feature configuration per market sector
- Examined predictive performance across different implied volatility regimes
- Denoted the 11 stock market sectors by the ticker symbol of the corresponding SPDR sector ETF
Ablation study
- Investigated to what extent daily movements in end-of-day implied volatility can be predicted
- Built random forest classifiers to predict said target variable
- Compared predictive performance of different scenarios to stratified dummy classifier
- Results show end-of-day movements in implied volatility can be predicted
- Implied volatility features are important source of information
- Including features derived from tweets always yielded better performance
- Using all possible features yielded best result overall
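The stratified dummy baseline used for comparison in the ablation study can be sketched in a few lines. A stratified dummy classifier ignores the predictors and draws each prediction at random according to the class frequencies seen in training. The helpers below are hypothetical; in scikit-learn the equivalent is `DummyClassifier(strategy="stratified")`.

```python
import random
from collections import Counter

def stratified_dummy_predict(train_labels, n_predictions, seed=0):
    """Draw predictions at random, following the training class frequencies."""
    counts = Counter(train_labels)
    classes = sorted(counts)
    weights = [counts[c] for c in classes]
    rng = random.Random(seed)
    return rng.choices(classes, weights=weights, k=n_predictions)


def expected_stratified_accuracy(train_labels, test_labels):
    """Expected accuracy of the stratified dummy baseline.

    For each class, multiply the probability of predicting it (training
    frequency) by the probability that it is correct (test frequency).
    """
    train_freq = Counter(train_labels)
    test_freq = Counter(test_labels)
    n_train, n_test = len(train_labels), len(test_labels)
    return sum(
        (train_freq[c] / n_train) * (test_freq[c] / n_test)
        for c in train_freq
    )
```

With two balanced classes the expected accuracy is 0.5, and with a 60/40 split in both training and test data it is 0.6 * 0.6 + 0.4 * 0.4 = 0.52, so even small improvements over the baseline need to be read against the class balance.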
Predictive performance across sectors
- Best performing feature configuration (S7)
- Performance variability across 11 different stock market sectors
- Generally able to beat stratified dummy classifier across all sectors
- 148 out of 165 stocks beat dummy classifier
- Variability present across different sectors
- XLRE and XLU do significantly better
- XLI and XLB lag in performance
- Weak negative correlation between predictive performance and option liquidity
- Lower liquidity might partially explain why XLRE and XLU are easier to predict
- XLC, XLY, and XLK also do better comparatively
- Might be due to attention they receive on Twitter
- Strong correlation between attention on Twitter and liquidity
- Examined the per-sector improvement in predictive performance attributable to features extracted from Twitter
- Stocks in XLC, XLY, and XLK receive significantly more attention on Twitter
- Possible explanation: social media may incite herd behavior and emotional reactions among investors
- Optimal implied volatility regimes differ significantly for different sectors
- XLK performs best in highest implied volatility regime and worst in lowest
Conclusion
- One-day ahead movements of end-of-day stock implied volatility can be predicted
- Attention and sentiment features from Twitter improve the performance of the approach
- Interplay between stock and options data gives rise to predictive patterns
- Performance of approach varies across 11 traditional stock market sectors
- Real estate, utilities, consumer discretionary, communications, and technology sectors are easier to predict
- Differences in performance explained by market inefficiencies and Twitter attention
- Hidden Markov models used to evaluate predictive performance across 4 implied volatility regimes
- Outperforms dummy classifier in all 4 regimes
- Different stock market sectors have different optimal regimes
- Analysis of performance through usage of regimes provides additional insight
- 165 US stocks considered over period of January 1st, 2013 to March 1st, 2019