Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Predicted next-day stock end-of-day implied volatility using random forests
Examined usefulness of different sources of predictors and value of attention and sentiment features from Twitter
Studied approach on 165 most liquid US stocks across 11 traditional market sectors
Discovered stocks in certain sectors are easier to predict than others
Discrepancies between sectors are possibly caused by excess social media attention or low option liquidity
Explored how approach fares throughout time by identifying four underlying market regimes in implied volatility

Paper Content

Introduction

Social media has caused significant changes in the world, including financial markets
Efficient Market Hypothesis (Fama, 1970) suggests rapid information diffusion could lead to higher price efficiency
Behavioral economists argue social media could influence investors and incite herd behavior
Data providers offer social media indicators to help financial institutions
Academic studies have tried to quantify the interplay between social media and financial variables
Existing research has mainly focused on Twitter and its influence on stock price, volatility and trading volume
Current literature overlooks the interaction between social media and market implied volatility
This study investigates the ability to predict one-day ahead movement in implied volatility using machine learning
Stock universe is diversified across 11 traditional US stock market sectors
Out-of-sample period spans from January 1st, 2013 to March 1st, 2019
Hidden Markov models used to identify four regimes in implied volatility and measure performance across them

Preliminaries

Explains market implied volatility and its relation to derivatives
Describes random forest machine learning model for predictions
Describes hidden Markov model for quantifying regimes in market implied volatility

Market implied volatility

Options are a type of financial instrument
Sellers of options are exposed to risk
Measuring this risk requires considering expected price fluctuations of the underlying asset
The CBOE Volatility Index combines implied volatility of different option contracts into an index
The VIX is a measure of expected price fluctuations in the S&P 500 Index over the next 30 days
Equation 1 is used to compute the VIX for a given term
Option contracts typically have fixed expiration dates
The VIX is calculated by linearly interpolating between two computed measures
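
The interpolation step can be sketched as follows. This is a minimal sketch, assuming the form described in the publicly documented Cboe methodology, where the variances of a near-term and a next-term expiration are weighted in time to a constant 30-day horizon. The function and parameter names are illustrative and not taken from the paper, and the per-term variances are treated as given inputs rather than computed from Equation 1.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_30_DAYS = 30 * 24 * 60


def interpolate_30day_vol(var_near, var_next, minutes_near, minutes_next):
    """Blend two per-term variances into one 30-day volatility figure.

    var_near, var_next: annualised variances for the two expirations
                        (assumed to be the output of the per-term formula).
    minutes_near, minutes_next: minutes until each expiration
                        (assumed: minutes_near <= 30 days <= minutes_next).
    """
    t_near = minutes_near / MINUTES_PER_YEAR
    t_next = minutes_next / MINUTES_PER_YEAR
    span = minutes_next - minutes_near

    # Linear weights in time; they sum to 1 and favour the closer expiration.
    w_near = (minutes_next - MINUTES_30_DAYS) / span
    w_next = (MINUTES_30_DAYS - minutes_near) / span

    # Interpolate total variance (variance times time), then re-annualise.
    total_var_30 = t_near * var_near * w_near + t_next * var_next * w_next
    annualised_var = total_var_30 * MINUTES_PER_YEAR / MINUTES_30_DAYS
    return 100 * math.sqrt(annualised_var)
```

When both expirations carry the same variance, the interpolation returns that same level, which is a quick sanity check on the weights.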

Random forests

Random forests are a machine learning approach for learning a predictive model
They consist of multiple decision (or regression) trees whose predictions are combined
Combination is typically done by taking the mode (or average) of all outputs
Random forests are fast to build, not affected by feature scaling, robust to irrelevant predictors and noisy data
Constructing an ensemble model by randomly subsampling both data points and features helps reduce overfitting
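
The two ingredients above, random subsampling and combining tree outputs, can be illustrated with a small sketch. All function names are hypothetical and chosen for illustration. The sketch draws one feature subset per tree for simplicity, whereas common random forest implementations typically draw a fresh feature subset at every split.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(n_rows, n_features, max_features, rng):
    """Pick the rows and features that one tree would be trained on."""
    # Rows are drawn with replacement (a bootstrap sample).
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Features are drawn without replacement.
    features = rng.sample(range(n_features), max_features)
    return rows, features


def combine_classification(tree_votes):
    """Majority vote (mode) over the class labels predicted by each tree."""
    return Counter(tree_votes).most_common(1)[0][0]


def combine_regression(tree_outputs):
    """Average of the numeric predictions of each tree."""
    return mean(tree_outputs)
```

Because every tree sees a different slice of the data, individual trees make partly uncorrelated errors, and the vote or average smooths them out.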

Hidden Markov models

Hidden Markov models (HMM) are a generative approach for modeling systems that follow a Markov process
An HMM models the joint distribution of a sequence of hidden states and observations
Parameters of HMM are initial state distribution, state transition model, and emission probabilities model
Three key tasks associated with HMM are: probability of sequence of observations, best sequence of hidden states, and learning an HMM
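
The second task, finding the best sequence of hidden states, is commonly solved with the Viterbi algorithm. The sketch below is a minimal version for discrete observations; the state names, observation symbols, and dictionary layout are assumptions made for illustration, and a model fitted on implied volatility would more likely use continuous emissions.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a discrete HMM.

    start_p[s]      : initial state distribution
    trans_p[p][s]   : state transition model (from p to s)
    emit_p[s][obs]  : emission probabilities
    """
    # best[t][s] = probability of the most likely path ending in s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s] = best predecessor of s at time t

    for obs in observations[1:]:
        prev = best[-1]
        scores, pointers = {}, {}
        for s in states:
            # Choose the predecessor that maximises the path probability.
            p_best = max(states, key=lambda p: prev[p] * trans_p[p][s])
            scores[s] = prev[p_best] * trans_p[p_best][s] * emit_p[s][obs]
            pointers[s] = p_best
        best.append(scores)
        back.append(pointers)

    # Trace back from the most likely final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

For long sequences the products underflow, so practical implementations work with log-probabilities instead; plain probabilities are kept here for readability.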

Methodology

Main goal of study is to explore 3 questions about predicting movements in stock implied volatility
Study uses random forests to predict these movements from stock price, implied volatility, and Twitter features
Study covers 165 stocks over a 6-year out-of-sample period
Performance is grouped by 11 traditional stock sectors
Hidden Markov models used to identify 4 distinct implied volatility regimes per stock

Stock universe selection

Looked at popular ETFs to obtain a diversified universe of stocks
Selected 15 most liquid stocks per sector for a total of 165 stocks
Excluded some stocks due to stock splits, late introduction, and ambiguous names
Replaced excluded stocks to maintain 15 stocks per sector

Data acquisition and feature generation

Data from Jan 1, 2011 to March 1, 2019 was used from 3 sources: stock prices, option contracts, and Twitter
4 features were extracted per stock for each trading day: closing price, 30-day implied volatility, total tweet count, and average sentiment polarity
Sentiment polarity was calculated using VADER
2 additional predictors were generated per feature: daily difference and difference between daily value and exponential moving average of last 10 trading days
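
The two derived predictors can be sketched as below. The exact moving-average definition is not given here, so this sketch assumes the common smoothing factor 2 / (span + 1), seeds the average with the first observation, and includes the current day in the average; all function names are illustrative.

```python
def daily_difference(values):
    """x_t - x_{t-1}; the first day has no predecessor, so it is None."""
    return [None] + [b - a for a, b in zip(values, values[1:])]


def ema(values, span=10):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]  # assumption: seed with the first observation
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


def deviation_from_ema(values, span=10):
    """Difference between each daily value and its moving average."""
    return [x - e for x, e in zip(values, ema(values, span))]
```

Applied to each of the four base features (closing price, implied volatility, tweet count, sentiment polarity), this yields the base value plus two derived predictors per feature.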

Predicting movements in implied volatility

This study aims to predict one-day ahead movements in a stock’s 30-day implied volatility.
Random forest classifiers are used to predict the target variable.
Random forests are chosen due to their performance on noisy data.
64 distinct random forest configurations are built using scikit-learn.

Experimental evaluation

Evaluated different random forest configurations using walk-forward validation
Walk-forward validation is a cross-validation technique designed for temporally ordered data
Classical cross-validation methods assume observations to be independent
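
A walk-forward split can be sketched as an index generator in which every test window lies strictly after its training window. The window sizes and the choice between an expanding and a rolling training window are assumptions for illustration, since the settings used in the study are not given here.

```python
def walk_forward_splits(n_samples, initial_train, test_size, expanding=True):
    """Return (train_indices, test_indices) pairs in temporal order.

    expanding=True  : the training window grows from index 0.
    expanding=False : the training window keeps a fixed length and rolls.
    """
    splits = []
    start = initial_train
    while start + test_size <= n_samples:
        train_start = 0 if expanding else start - initial_train
        train_idx = list(range(train_start, start))
        test_idx = list(range(start, start + test_size))
        splits.append((train_idx, test_idx))
        start += test_size  # move forward by one test window
    return splits
```

Unlike shuffled k-fold cross-validation, no observation from the future ever appears in the training set of an earlier test window.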

Analyzing performance through time with hidden Markov models

Financial markets are constantly changing, making it difficult to find successful approaches.
The study evaluates how the proposed approach performs in different market regimes.
Hidden Markov models were used to quantify market regimes.
Four different regimes were identified.

Experimental results and discussion

Conducted an ablation study with 7 different feature configurations
Looked at performance of best feature configuration per market sector
Examined predictive performance across different implied volatility regimes
Denoted 11 different stock market sectors by symbol of equivalent SPDR ETF tracker

Ablation study

Investigated to what extent daily movements in end-of-day implied volatility can be predicted
Built random forest classifiers to predict said target variable
Compared predictive performance of different scenarios to stratified dummy classifier
Results show end-of-day movements in implied volatility can be predicted
Implied volatility features are important source of information
Including features derived from tweets always yielded better performance
Using all possible features yielded best result overall
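
A stratified dummy baseline ignores the features and draws predictions at random according to the class frequencies seen in training. The sketch below is a plain-Python illustration of that idea with hypothetical function names; it uses accuracy as the comparison metric, which is an assumption, since the metric used in the study is not stated here.

```python
import random
from collections import Counter


def fit_stratified_dummy(train_labels):
    """Record the class frequencies of the training labels."""
    counts = Counter(train_labels)
    classes = sorted(counts)
    weights = [counts[c] / len(train_labels) for c in classes]
    return classes, weights


def predict_stratified_dummy(model, n, rng):
    """Draw n labels at random, following the training class frequencies."""
    classes, weights = model
    return rng.choices(classes, weights=weights, k=n)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

A classifier that cannot beat this baseline has learned nothing beyond the class balance, which is why it serves as the reference point for the feature configurations.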

Predictive performance across sectors

Analysis uses the best-performing feature configuration (S7)
Examined performance variability across the 11 stock market sectors
Generally able to beat stratified dummy classifier across all sectors
148 out of 165 stocks beat dummy classifier
Variability present across different sectors
XLRE and XLU do significantly better
XLI and XLB lack in performance
Weak negative correlation between predictive performance and option liquidity
Lower liquidity might partially explain why XLRE and XLU are easier to predict
XLC, XLY, and XLK also do better comparatively
Might be due to attention they receive on Twitter
Strong correlation between attention on Twitter and liquidity
Examined the per-sector improvement in predictive performance attributable to features extracted from Twitter
Stocks in XLC, XLY, and XLK receive significantly more attention on Twitter
Social media may incite herd behavior and emotional reactions among investors
Optimal implied volatility regimes differ significantly for different sectors
XLK performs best in highest implied volatility regime and worst in lowest

Conclusion

One-day ahead movements of end-of-day stock implied volatility can be predicted
Attention and sentiment features from Twitter improve the performance of the approach
Interplay between stock and options data gives rise to predictive patterns
Performance of approach varies across 11 traditional stock market sectors
Real estate, utilities, consumer discretionary, communications, and technology sectors are easier to predict
Differences in performance may be explained by market inefficiencies and Twitter attention
Hidden Markov models used to evaluate predictive performance across 4 implied volatility regimes
Outperforms dummy classifier in all 4 regimes
Different stock market sectors have different optimal regimes
Analysis of performance through usage of regimes provides additional insight
165 US stocks considered over period of January 1st, 2013 to March 1st, 2019
