Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Predicted next-day stock end-of-day implied volatility using random forests
  • Examined usefulness of different sources of predictors and value of attention and sentiment features from Twitter
  • Studied approach on 165 most liquid US stocks across 11 traditional market sectors
  • Discovered stocks in certain sectors are easier to predict than others
  • Possible reasons for discrepancies caused by excess social media attention or low option liquidity
  • Explored how approach fares throughout time by identifying four underlying market regimes in implied volatility

Paper Content

Introduction

  • Social media has caused significant changes in the world, including financial markets
  • Efficient Market Hypothesis (Fama, 1970) suggests rapid information diffusion could lead to higher price efficiency
  • Behavioral economists argue social media could influence investors and incite herd behavior
  • Data providers offer social media indicators to help financial institutions
  • Academic studies have tried to quantify the interplay between social media and financial variables
  • Existing research has mainly focused on Twitter and its influence on stock price, volatility and trading volume
  • Current literature overlooks the interaction between social media and market implied volatility
  • This study investigates the ability to predict one-day ahead movement in implied volatility using machine learning
  • Stock universe is diversified across 11 traditional US stock market sectors
  • Out-of-sample period spans from January 1st, 2013 till March 1st, 2019
  • Hidden Markov models used to identify four regimes in implied volatility and measure performance across them

Preliminaries

  • Explains market implied volatility and its relation to derivatives
  • Describes random forest machine learning model for predictions
  • Describes hidden Markov model for quantifying regimes in market implied volatility

Market implied volatility

  • Options are a type of financial instrument
  • Sellers of options are exposed to risk
  • Measuring this risk requires considering expected price fluctuations of the underlying asset
  • The CBOE Volatility Index combines implied volatility of different option contracts into an index
  • The VIX is a measure of expected price fluctuations in the S&P 500 Index over the next 30 days
  • Equation 1 is used to compute the VIX for a given term
  • Option contracts typically have fixed expiration dates
  • The VIX is calculated by linearly interpolating between two computed measures

Random forests

  • Random forests are a machine learning approach for learning a predictive model
  • They consist of multiple decision (or regression) trees whose predictions are combined
  • Combination is typically done by taking the mode (or average) of all outputs
  • Random forests are fast to build, not affected by feature scaling, robust to irrelevant predictors and noisy data
  • Constructing an ensemble model by randomly subsampling both data points and features helps reduce overfitting

Hidden markov models

  • Hidden Markov models (HMM) are a generative approach for modeling systems that follow a Markov process
  • HMM models the joint distribution of a sequence of hidden states and observations
  • Parameters of HMM are initial state distribution, state transition model, and emission probabilities model
  • Three key tasks associated with HMM are: probability of sequence of observations, best sequence of hidden states, and learning an HMM

Methodology

  • Main goal of study is to explore 3 questions related to stock market performance
  • Study uses random forests to predict stock market performance using stock price, implied volatility, and Twitter features
  • Study covers 165 stocks over a 6 year period
  • Performance is grouped by 11 traditional stock sectors
  • Hidden Markov models used to identify 4 distinct implied volatility regimes per stock

Stock universe selection

  • Looked at popular ETFs to obtain a diversified universe of stocks
  • Selected 15 most liquid stocks per sector for a total of 165 stocks
  • Excluded some stocks due to stock splits, late introduction, and ambiguous names
  • Replaced excluded stocks to maintain 15 stocks per sector

Data acquisition and feature generation

  • Data from Jan 1, 2011 to March 1, 2019 was used from 3 sources: stock prices, option contracts, and Twitter
  • 4 features were extracted per stock for each trading day: closing price, 30-day implied volatility, total tweet count, and average sentiment polarity
  • Sentiment polarity was calculated using VADER
  • 2 additional predictors were generated per feature: daily difference and difference between daily value and exponential moving average of last 10 trading days

Predicting movements in implied volatility

  • This study aims to predict one-day ahead movements in a stock’s 30-day implied volatility.
  • Random forest classifiers are used to predict the target variable.
  • Random forests are chosen due to their performance on noisy data.
  • 64 distinctive random forest configurations are built using Sklearn.

Experimental evaluation

  • Evaluated different random forest configurations using walk-forward validation
  • Walk-forward validation is a cross-validation technique designed for temporally ordered data
  • Classical cross-validation methods assume observations to be independent

Analyzing performance through time with hidden markov models

  • Financial markets are constantly changing, making it difficult to find successful approaches.
  • Evaluating how the proposed approach works in different market regimes was studied.
  • Hidden Markov models were used to quantify market regimes.
  • Four different regimes were identified.

Experimental results and discussion

  • Conducted an ablation study with 7 different feature configurations
  • Looked at performance of best feature configuration per market sector
  • Examined predictive performance across different implied volatility regimes
  • Denoted 11 different stock market sectors by symbol of equivalent SPDR ETF tracker

Ablation study

  • Investigated to what extent daily movements in end-of-day implied volatility can be predicted
  • Built random forest classifiers to predict said target variable
  • Compared predictive performance of different scenarios to stratified dummy classifier
  • Results show end-of-day movements in implied volatility can be predicted
  • Implied volatility features are important source of information
  • Including features derived from tweets always yielded better performance
  • Using all possible features yielded best result overall

Predictive performance across sectors

  • Best performing feature configuration (S7)
  • Performance variability across 11 different stock market sectors
  • Generally able to beat stratified dummy classifier across all sectors
  • 148 out of 165 stocks beat dummy classifier
  • Variability present across different sectors
  • XLRE and XLU do significantly better
  • XLI and XLB lack in performance
  • Weak negative correlation between predictive performance and option liquidity
  • Lower liquidity might partially explain why XLRE and XLU easier to predict
  • XLC, XLY, and XLK also do better comparatively
  • Might be due to attention they receive on Twitter
  • Strong correlation between attention on Twitter and liquidity
  • Improvement in predictive performance caused by features extracted from Twitter per sector
  • Stocks in XLC, XLY, and XLK receive significantly more attention on Twitter
  • Social media inciting herd behavior and emotional reactions among investors
  • Optimal implied volatility regimes differ significantly for different sectors
  • XLK performs best in highest implied volatility regime and worst in lowest

Conclusion

  • One-day ahead movements of end-of-day stock implied volatility can be predicted
  • Attention and sentiment features from Twitter improve the performance of the approach
  • Interplay between stock and options data gives rise to predictive patterns
  • Performance of approach varies across 11 traditional stock market sectors
  • Real estate, utilities, consumer discretionary, communications, and technology sectors easier to predict
  • Differences in performance explained by market inefficiencies and Twitter attention
  • Hidden Markov models used to evaluate predictive performance across 4 implied volatility regimes
  • Outperforms dummy classifier in all 4 regimes
  • Different stock market sectors have different optimal regimes
  • Analysis of performance through usage of regimes provides additional insight
  • 165 US stocks considered over period of January 1st, 2013 to March 1st, 2019