Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language models lack grounding in the real world and struggle with complex reasoning.
AI systems outperform humans in games like chess and Go.
Chess commentary requires reasoning over a complex board state and providing analyses in natural language.
Combining symbolic reasoning engines with controllable language models can generate chess commentaries.
Experiments show that this approach generates commentaries preferred by human judges.

Paper Content

Introduction

Large-scale language models can generate plausible responses that are not grounded in the real world
Symbolic reasoning systems can play games like chess or Go better than humans or neural models
This paper investigates how to combine language models and symbolic reasoning engines to generate accurate and strategically sound chess commentaries
During training, control codes are extracted from the commentaries and used to control the language model
During inference, the chess engine provides control codes to the language model to control its output

Related work

Chess commentary generation
Language models

Chess commentary generation

Chess commentary generation was proposed by Jhamtani et al. (2018).
Zang et al. (2019) extended this work by using AlphaZero (Silver et al., 2018).
Similar tasks include Shogi commentary generation (Kameko et al., 2015).

Controlled text generation

Controlled text generation involves conditioning a language model with special tokens.
This allows researchers to control aspects such as style, content, and task-specific behavior.

Reasoning systems with language models

Researchers have combined symbolic reasoning agents with language generation models for different tasks.
Researchers have used language models to “reflect” or “manifest” the actions being made by their counterpart reasoning agents.
Our task is to provide strategically accurate analyses about the actions that are taking place.

Task description and data

Formal definition of task
Details about data

Task: chess commentary generation

Goal of chess commentary generation is to provide a description or analysis of moves made in a chess game
Requires robust language modeling and strong symbolic reasoning
6 categories of commentary types: move descriptions, move quality, move comparisons, rationales, contextual, and general
This work focuses on the first three categories

Data

Data collected from an online forum
Data is in the form of triplets (G, M, C)
373,919 triplets collected
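
As a minimal sketch of how such triplets might be held and partitioned, the snippet below assumes that G is the game state before the move, M is the move, and C is the commentary text. The class name `CommentaryExample`, its field names, the helper `split_examples`, and the sample values are hypothetical; the 85:10:5 ratio is the split reported for these samples elsewhere in this summary.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CommentaryExample:
    """One (G, M, C) triplet; the field names are illustrative assumptions."""
    game_state: str  # G: assumed to be the moves played before the commented move
    move: str        # M: the move being commented on
    commentary: str  # C: the human-written comment


def split_examples(
    examples: Sequence[CommentaryExample],
    ratios: Tuple[int, int, int] = (85, 10, 5),
    seed: int = 0,
) -> Tuple[List[CommentaryExample], List[CommentaryExample], List[CommentaryExample]]:
    """Shuffle deterministically, then cut into train/validation/test parts."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_valid = len(items) * ratios[1] // total
    # Whatever is left after the first two cuts becomes the test set.
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],
    )


# Placeholder data: 100 made-up examples, not real forum comments.
examples = [
    CommentaryExample("1. e4 e5 2. Nf3 Nc6", "3. Bb5", f"Illustrative comment {i}")
    for i in range(100)
]
train, valid, test = split_examples(examples)
print(len(train), len(valid), len(test))  # 85 10 5
```

Giving the remainder to the test set keeps every example in exactly one partition even when the dataset size is not divisible by the ratio total.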

Additional pre-training data

Used data from 3 chess forums
Built question and response pairs
Filtered out irrelevant tags and non-positive upvote responses
Used pre-existing Reddit dataset
Scrubbed any PII from data

Methodology

Overview of approach
Explanation of each step

Overview

Approach is visualized in Figure 1
Training data is in the form of (G, M, C)
Build token representation of game-state
Extract “tags” from commentary to provide signals
Commentary generation model is conditioned on extracted tags
During inference, new game-state is passed into chess engine to provide new tags

Training

Identified tags to extract from commentary training data
Built tag-extraction models for 5 types of tags: “commentary type”, “move quality”, “suggested moves”, “pronouns/proper nouns”, and “length”
Tags must be consistent with the semantics of the commentary text
Tags provide explicit signals to learn meaningful patterns between game-state and commentary
Used a BART model pre-trained on the data described in §3.3 (Additional pre-training data)
Manually annotated samples to fine-tune each BART model
Commentary type tags indicate which of the six categories of commentaries from Jhamtani et al. (2018) a commentary belongs to
Move quality tags indicate how good or bad a move is
Suggested move tags indicate alternative lines of moves mentioned in the commentary
Tagged commentary data with pronouns and proper nouns
Tagged data with “[short]”, “[medium]”, or “[long]” depending on the commentary’s length
Used chess engines to derive new tags during inference
Included PGN representations and SAN tokens
Enumerated pieces on the board for each player
Represented attacks by listing the attacking piece and attacked piece
Trained BART to generate commentaries
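
The tagging and input-construction steps above can be illustrated with a small sketch. It assumes that tags are plain bracketed tokens placed in front of a textual game-state representation. The word-count thresholds, the "[move_description]" tag spelling, the line-based layout, and all helper names are hypothetical; only the "[short]", "[medium]", and "[long]" tag strings come from the text above.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical word-count cut-offs; the source does not state the real ones.
SHORT_MAX_WORDS = 10
MEDIUM_MAX_WORDS = 30


def length_tag(commentary: str) -> str:
    """Bucket a commentary by its whitespace-separated word count."""
    n_words = len(commentary.split())
    if n_words <= SHORT_MAX_WORDS:
        return "[short]"
    if n_words <= MEDIUM_MAX_WORDS:
        return "[medium]"
    return "[long]"


def enumerate_pieces(pieces: Dict[str, List[str]]) -> str:
    """List each player's pieces, e.g. 'white: Ke1 Bb5 | black: Ke8 Nc6'."""
    return " | ".join(
        f"{player}: {' '.join(placed)}" for player, placed in pieces.items()
    )


def describe_attacks(attacks: Sequence[Tuple[str, str]]) -> str:
    """Render (attacking piece, attacked piece) pairs as text."""
    return " ; ".join(f"{attacker} attacks {target}" for attacker, target in attacks)


def build_model_input(
    tags: Sequence[str],
    moves_san: str,
    pieces: Dict[str, List[str]],
    attacks: Sequence[Tuple[str, str]],
) -> str:
    """Join tags and game-state views into one conditioning string."""
    parts = [
        " ".join(tags),
        moves_san,
        enumerate_pieces(pieces),
        describe_attacks(attacks),
    ]
    return "\n".join(part for part in parts if part)


commentary = "The bishop attacks the knight on c6."
tags = ["[move_description]", length_tag(commentary)]
model_input = build_model_input(
    tags,
    "1. e4 e5 2. Nf3 Nc6 3. Bb5",
    {"white": ["Ke1", "Bb5"], "black": ["Ke8", "Nc6"]},  # abbreviated piece list
    [("Bb5", "Nc6")],
)
print(model_input)
```

During training the tags in this sketch would come from the tag-extraction models applied to the human commentary, so the generation model sees tags that agree with the text it is asked to produce.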

Inference

Use a chess engine to derive tags to control commentary generation
User selects commentary type
Derive move quality using Leela
Format move quality classification
Use Leela’s suggested moves
Omit pronoun and proper noun tags
Use “medium” as length tag to control output length
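
One way to picture the inference-time tag derivation is the sketch below. It assumes the engine evaluation of the best move and of the played move is available in centipawns from the mover's point of view. Leela's actual output format, the loss thresholds, the move-quality tag strings, and the "[suggest:...]" spelling are assumptions made here for illustration only.

```python
from typing import List, Sequence

# Hypothetical centipawn-loss thresholds and tag names (not from the source).
QUALITY_THRESHOLDS = [
    (20, "[excellent]"),
    (60, "[good]"),
    (150, "[inaccuracy]"),
    (300, "[mistake]"),
]


def move_quality_tag(best_eval_cp: int, played_eval_cp: int) -> str:
    """Classify a move by how much evaluation it gives up versus the best move."""
    loss = max(0, best_eval_cp - played_eval_cp)
    for max_loss, tag in QUALITY_THRESHOLDS:
        if loss <= max_loss:
            return tag
    return "[blunder]"


def build_inference_tags(
    commentary_type: str,
    best_eval_cp: int,
    played_eval_cp: int,
    suggested_moves: Sequence[str],
) -> List[str]:
    """Assemble control tags for generation.

    The commentary type is chosen by the user, move quality and suggested
    moves come from the engine, pronoun and proper-noun tags are omitted,
    and the length tag is fixed to "[medium]".
    """
    tags = [f"[{commentary_type}]"]
    tags.append(move_quality_tag(best_eval_cp, played_eval_cp))
    tags.extend(f"[suggest:{move}]" for move in suggested_moves)
    tags.append("[medium]")
    return tags


inference_tags = build_inference_tags("move_quality", 50, -120, ["Nf3"])
print(inference_tags)
```

Keeping the thresholds in one table makes it easy to swap in whatever classification scheme the engine output actually supports.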

Results

Show accuracy of tag extraction models
Assess ability to control commentaries with tags
Use human judges to compare quality of commentaries against baseline models

Tag-extraction models

Used 80:10:10 split of manually annotated training data for classification models and 85:10:5 split for generative model
F1 and token exact-match scores ranged from 67% to 73%, which was sufficient for controlling the commentary generation model
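
The summary does not say exactly how these scores were computed. As a hedged illustration, the sketch below shows one common formulation for scoring generated text against a reference: whitespace-token exact match and token-overlap F1. The function names and the example strings are hypothetical, and the paper's own metric definitions may differ.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings contain the same tokens in the same order."""
    return prediction.split() == reference.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting duplicates."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings agree perfectly; otherwise there is no overlap.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A predicted line of suggested moves that misses the last reference move.
score = token_f1("Nf3 Nc6", "Nf3 Nc6 Bb5")
print(round(score, 3))  # 0.8
```

Here precision is 2/2 and recall is 2/3, which gives an F1 of 0.8.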

Perplexity

Expect language model to learn meaningful relationship between input and commentary
Drop in perplexity compared to different baselines
Baselines include unconditioned, move, game-state, tags and fully-conditioned models
More signals lead to drop in perplexity, demonstrating ability to learn patterns between tags and commentaries
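
For reference, perplexity is the exponential of the average negative log-likelihood per token, so a model that assigns higher probability to the reference commentary scores lower. The sketch below assumes per-token natural-log probabilities are already available from the model; the function name and the toy values are illustrative.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy comparison: a model that gives each token probability 0.25 versus
# one that gives each token probability 0.5 (e.g. with extra conditioning).
weaker = perplexity([math.log(0.25)] * 3)
stronger = perplexity([math.log(0.5)] * 3)
print(weaker, stronger)
```

In this toy case the better-informed model has the lower perplexity, which is the pattern the comparison across baselines is looking for.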

Human evaluations

Evaluated system using human evaluations
Conducted A/B tests using Amazon Mechanical Turk
Compared against Unconditioned and Game-State (All) baseline models
Compared against prior work: Jhamtani et al. (2018)
Validated chess knowledge of crowdworkers using chess puzzles
Evaluated three commentary types: “Move Description”, “Move Quality”, and “Move Comparison”
Results show that system strongly outperforms all baselines across all commentary types

Discussion: limitations and challenges

Model occasionally generates commentaries with logical reasoning errors
Interface does not cover every possible logical deduction
Crowdworkers sometimes disagree with model’s assessment, even when commentaries are accurate
Super-human behavior of chess engines is different from human play and difficult to interpret
Model occasionally makes mistakes, but understands recently moved pieces best
Experimental details, complete results, and further analysis provided in Appendix F

Conclusion

Combine symbolic reasoning system with controllable language model to generate chess commentaries
Use Reddit data to find question-answer pairs likely referring to a specific chess game
Exclude threads with external links
Match comment post against set of patterns: PGN/SAN move notations, referral to specific move numbers, chess event tokens, chess piece tokens
Use ParlAI to train models
373,919 samples of game-state and commentary pairs, split 85:10:5 into training, validation, and test sets
56,032 samples of question and answer pairs, split 85:10:5 for pretraining
1,800 samples for commentary type extraction model, split 80:10:10
610 samples for move quality extraction model, split 80:10:10
830 samples for suggested move extraction model, split 80:15:5
Minimize validation perplexity for generative models, minimize validation loss for classification models
In the human evaluation study, crowdworkers are optionally asked to provide a reason for their selection
Prompted belief-state experiments, compute likelihood of all 64 chess squares given model, prompt, and game-state
Visualize prompted belief-states, weight concentrated around recently moved pieces
Accuracy (36%) and strategy (14%) cited as reasons for preferring our model
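
The Reddit filtering heuristic listed above (excluding threads with external links, then matching comments against chess-specific patterns) could be sketched as follows. The source names only the pattern families, so every regular expression here is an illustrative guess, as is the choice to accept a comment when any one pattern matches.

```python
import re

# Illustrative approximation of SAN moves such as "Nf3", "exd5", "O-O", "Qxh7+".
SAN_MOVE = re.compile(
    r"\b(?:[KQRBN][a-h]?[1-8]?x?[a-h][1-8]"
    r"|[a-h](?:x[a-h])?[1-8](?:=[QRBN])?"
    r"|O-O(?:-O)?)[+#]?(?![A-Za-z0-9])"
)
MOVE_NUMBER = re.compile(r"\bmove\s+\d+\b", re.IGNORECASE)
CHESS_EVENT = re.compile(
    r"\b(?:checkmate|stalemate|castl(?:e|es|ed|ing)|en passant"
    r"|promot(?:e|es|ed|ion))\b",
    re.IGNORECASE,
)
CHESS_PIECE = re.compile(r"\b(?:king|queen|rook|bishop|knight|pawn)s?\b", re.IGNORECASE)
EXTERNAL_LINK = re.compile(r"https?://", re.IGNORECASE)

PATTERNS = (SAN_MOVE, MOVE_NUMBER, CHESS_EVENT, CHESS_PIECE)


def refers_to_specific_game(text: str) -> bool:
    """Keep text with no external link and at least one chess-specific match."""
    if EXTERNAL_LINK.search(text):
        return False
    return any(pattern.search(text) for pattern in PATTERNS)


comments = [
    "After Nf3 I think white is slightly better",
    "See https://example.com for Nf3 ideas",
    "What opening books do you recommend?",
]
kept = [comment for comment in comments if refers_to_specific_game(comment)]
print(kept)
```

Applying the link check to a single comment is a simplification; the rule described above operates on whole threads.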
