Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language models lack grounding in the real world and struggle with complex reasoning.
- AI systems outperform humans in games like chess and Go.
- Chess commentary requires reasoning over a complex board state and providing analyses in natural language.
- Combining symbolic reasoning engines with controllable language models can generate chess commentaries.
- Experiments show that this approach generates commentaries preferred by human judges.
Paper Content
Introduction
- Large-scale language models can generate plausible responses that are not related to the real world
- Symbolic reasoning systems can play games like chess or Go better than humans or neural models
- This paper investigates how to combine language models and symbolic reasoning engines to generate accurate and strategically sound chess commentaries
- During training, control codes are extracted from the commentaries and used to control the language model
- During inference, the chess engine provides control codes to the language model to control its output
Related work
- Chess commentary generation
- Language models
Chess commentary generation
- Chess commentary generation was proposed by Jhamtani et al. (2018).
- Zang et al. (2019) extended this work by using AlphaZero (Silver et al., 2018).
- Similar tasks include Shogi commentary generation (Kameko et al., 2015).
Controlled text generation
- Controlled text generation involves conditioning a language model with special tokens.
- This allows researchers to control aspects such as style, content, and task-specific behavior.
Reasoning systems with language models
- Researchers have combined symbolic reasoning agents with language generation models for different tasks.
- Researchers have used language models to “reflect” or “manifest” the actions being made by their counterpart reasoning agents.
- Our task is to provide strategically accurate analyses about the actions that are taking place.
Task description and data
- Formal definition of task
- Details about data
Task: chess commentary generation
- Goal of chess commentary generation is to provide a description or analysis of moves made in a chess game
- Requires robust language modeling and strong symbolic reasoning
- 6 categories of commentary types: move descriptions, move quality, move comparisons, rationales, contextual, and general
- This work focuses on the first three categories
Data
- Data collected from an online forum
- Data is in the form of triplets (G, M, C)
- 373,919 triplets collected
Additional pre-training data
- Used data from 3 chess forums
- Built question and response pairs
- Filtered out irrelevant tags and non-positive upvote responses
- Used pre-existing Reddit dataset
- Scrubbed any PII from data
Methodology
- Overview of approach
- Explanation of each step
Overview
- Approach is visualized in Figure 1
- Training data is in the form of (G, M, C)
- Build token representation of game-state
- Extract “tags” from commentary to provide signals
- Commentary generation model is conditioned on extracted tags
- During inference, new game-state is passed into chess engine to provide new tags
Training
- Identified tags to extract from commentary training data
- Built tag-extraction models for 5 types of tags: “commentary type”, “move quality”, “suggested moves”, “pronouns/proper nouns”, and “length”
- Tags must be consistent with the semantics of the commentary text
- Tags provide explicit signals to learn meaningful patterns between game-state and commentary
- Used BART model pre-trained using data described in §3.3
- Manually annotated samples to fine-tune each BART model
- Commentary type tags indicate which of the six categories of commentaries from Jhamtani et al. (2018)
- Move quality tags indicate how good or bad a move is
- Suggested move tags indicate alternative lines of moves mentioned in the commentary
- Tagged commentary data with pronouns and proper nouns
- Tagged data with “[short]”, “[medium]”, or “[long]” depending on the commentary’s length
- Used chess engines to derive new tags during inference
- Included PGN representations and SAN tokens
- Enumerated pieces on the board for each player
- Represented attacks by listing the attacking piece and attacked piece
- Trained BART to generate commentaries
Inference
- Use a chess engine to derive tags to control commentary generation
- User selects commentary type
- Derive move quality using Leela
- Format move quality classification
- Use Leela’s suggested moves
- Omit pronoun and proper noun tags
- Use “medium” as length tag to control output length
Results
- Show accuracy of tag extraction models
- Assess ability to control commentaries with tags
- Use human judges to compare quality of commentaries against baseline models
Tag-extraction models
- Used 80:10:10 split of manually annotated training data for classification models and 85:10:5 split for generative model
- F1 and token exact match scores had accuracy of 67-73% which was sufficient for controlling commentary generation model
Perplexity
- Expect language model to learn meaningful relationship between input and commentary
- Drop in perplexity compared to different baselines
- Baselines include unconditioned, move, game-state, tags and fully-conditioned models
- More signals lead to drop in perplexity, demonstrating ability to learn patterns between tags and commentaries
Human evaluations
- Evaluated system using human evaluations
- Conducted A/B tests using Amazon Mechanical Turk
- Compared against Unconditioned and Game-State (All) baseline models
- Compared against prior work: Jhamtani et al. (2018)
- Validated chess knowledge of crowdworkers using chess puzzles
- Evaluated three commentary types: “Move Description”, “Move Quality”, and “Move Comparison”
- Results show that system strongly outperforms all baselines across all commentary types
Discussion: limitations and challenges
- Model occasionally generates commentaries with logical reasoning errors
- Interface does not cover every possible logical deduction
- Crowdworkers sometimes disagree with model’s assessment, even when commentaries are accurate
- Super-human behavior of chess engines are different from humans and difficult to interpret
- Model occasionally makes mistakes, but has best understanding about recently moved pieces
- Experimental details, complete results, and further analysis provided in Appendix F
Conclusion
- Combine symbolic reasoning system with controllable language model to generate chess commentaries
- Use Reddit data to find question-answer pairs likely referring to a specific chess game
- Exclude threads with external links
- Match comment post against set of patterns: PGN/SAN move notations, referral to specific move numbers, chess event tokens, chess piece tokens
- Use ParlAI to train models
- 373,919 samples of game-state and commentary pairs, split into 85:10:5 splits for training, validation, and testing
- 56,032 samples of question and answer pairs, split into 85:10:5 splits for pretraining
- 1800 samples for commentary type extraction model, split into 80:10:10 splits
- 610 samples for move quality extraction model, split into 80:10:10 splits
- 830 samples for suggested move extraction model, split into 80:15:5 splits
- Minimize validation perplexity for generative models, minimize validation loss for classification models
- Human evaluation study, optionally ask crowdworkers to provide a reason for their selection
- Prompted belief-state experiments, compute likelihood of all 64 chess squares given model, prompt, and game-state
- Visualize prompted belief-states, weight concentrated around recently moved pieces
- Accuracy (36%) and strategy (14%) reasons for preferring our model