Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Popular text-to-image models lack character-level input features.
- Conducted a series of experiments to compare character-aware vs. character-blind text encoders.
- Character-aware models provide large gains on a novel spelling task.
- Character-aware variants outperform character-blind counterparts across a range of novel text rendering tasks.
- Models set a much higher state-of-the-art on visual spelling, with 30+ point accuracy gains over competitors.
Paper Content
Introduction
- Image generation models have made quality gains in the last year
- Rendering reliable visual text in images is a challenge
- Character-blind text encoders have limited spelling ability
- Character-aware text encoders can achieve robust spelling abilities at smaller scales
The spelling miracle
- Language models can be categorized as character-aware or character-blind
- Early neural language models operated directly on characters
- Later models moved to vocabulary-based tokenization, some retaining character-awareness
- Most widely used language models are character-blind, relying on data-driven subword segmentation algorithms
- Recent work has pointed to advantages of character-aware input representations
- Character-blind models are able to infer the character-level makeup of their tokens
- With sufficient scale, character-blind models can achieve near-perfect spelling accuracy
- Character-blind text encoders used in image generation struggle to translate input tokens into rendered character sequences
- Supervised image-caption data can help models learn to spell, but is an inefficient paradigm
Measuring text encoder spelling ability
- Text-to-image generation models rely on text encoders.
- Text encoders are explored in isolation using a text-only spelling evaluation task.
The wikispell benchmark
- WikiSpell benchmark created by sampling words from Wiktionary
- Input to model is single word, output is spelling with spaces between each character
- Words grouped into buckets based on frequency in mC4 corpus
- Test and development sets created by sampling 1k words from each bucket
- Five buckets: top 1%, 1-10%, 10-20%, 20-30%, bottom 50%
- Training set of 10,000 words, 5,000 from bottom 50%, 5,000 proportional to frequency in mC4
- Evaluated on English and six other languages
Text generation experiments
- We use the WikiSpell benchmark to evaluate multiple pretrained text-only models.
- T5, mT5, ByT5, and PaLM models are experimented with.
- T5 and mT5 perform worse on the Top-1% most frequent words.
- Scale is a significant factor in spelling ability for character-blind models.
- ByT5 exhibits far greater spelling abilities than T5 and mT5.
- Finetuning the encoder helps for less frequent words, but not for common words.
The drawtext benchmark
- Evaluating text-to-image models is an ongoing topic of research
- Standard benchmarks and metrics have been developed
- There is a lack of work on text rendering and spelling evaluation
- DrawText is a new benchmark designed to measure text rendering quality of text-to-image models
- DrawText consists of two parts: Spelling and Creative
Drawtext spelling
- Constructed 500 prompts by sampling 100 words from each of the English WikiSpell frequency buckets
- Assessed images using human ratings and optical character recognition (OCR)-based metrics
- Used Google Cloud Vision API for OCR evaluation
- Post-processed OCR output by removing newline characters and ignoring case when computing spelling accuracy
Drawtext creative
- Visual text can appear in many forms, such as scribbled, painted, carved, and sculpted.
- 175 diverse prompts were created to test the ability of image generation models to render text in creative styles and settings.
- Current models have difficulty accurately rendering text.
Image generation experiments
- Text-to-image generative models are evaluated for their spelling ability using the DrawText benchmark.
- Text-to-image generative models consist of a text encoder and a cascade of either diffusion models or autoregressive models.
- In section 3, character-aware text encoders were found to outperform character-blind models on spelling in a text-only setting.
- This section investigates whether making the text encoder character-aware improves the text rendering ability of text-to-image models.
Models
- Train two character-blind and three character-aware image generation models
- Train for 500,000 steps, 5.6x fewer than Imagen
- Train only initial 64x64 model
- Train exclusively on Laion-400M dataset
- 71% of Laion images contain text, 60% have caption/visual text correspondence
- Train on uncropped images with arbitrary aspect ratios
- Vary pretrained text encoder size and character-awareness
- Benchmark Stable Diffusion, Imagen, and Parti (all character-blind)
Drawtext spelling results
- Character-aware models (ByT5 and Concat) outperform other models
- Imagen-AR trained for 6.6x longer but still had lower accuracy
- T5 models provide a controlled comparison and had 25+ point gains on most frequent words and 30+ point gains on least frequent
- ByT5-XL model had 43% smaller encoder than T5-XXL
- OCR errors can lead to false positives and false negatives
- Manual human validation found no false positives but false negatives for both models
- Common error types observed in T5 models suggest lack of core spelling knowledge
Drawtext creative results
- Character-aware models have a clear advantage in text rendering ability
- Diffusion models struggle with arranging words in a fixed frame
- Character-aware text encoders reduce misspellings of words
Drawbench results
- Character-aware text encoders excel at spelling in both text and visual domains.
- Character-aware models maintain image quality and text-image alignment of character-blind models.
- DrawBench evaluation shows character-aware models score worse on image-text alignment.
- Concat(T5-XXL, ByT5-Small) model closes alignment gap.
- Character-aware models are preferred in text and count categories, but dispreferred in other categories.
Multilingual results
- ByT5-XXL model exhibits understanding across 11 languages
- Pretrained ByT5 encoder aligns representations across languages
- Model learns to map any language seen in pretraining into the space of images
- Rendering text in different scripts requires more than just a multilingual encoder
- Model unable to render words for dog in non-Latin scripts
- Prompt language can bias model towards culturally-relevant visual interpretations
Conclusion
- Example Python 3 code for transforming a word into its spelling
- Exclude entries with whitespace, punctuation/symbols, longer than 30 characters, or a “part-of-speech” Proverb
- Word frequencies computed on subsets of the full mC4 corpus
- Word-counting by splitting document texts using delimiters
- For Chinese and Thai, count number of documents in which the word appeared as a substring
- ByT5 model able to spell words correctly in most cases
- Filter images with no legible text for better comparison
- Using character-aware text encoders provide large gains on spelling
- Hybrid model combining token-level and character-level signals provides best of both worlds
- Models lacking a direct character-level view of their inputs can infer robust spelling information
- Models over 100B parameters did not generalize well beyond English
- Changes to the image generation module needed to resolve failure modes
- 175 creative prompts targeting rendered text of various lengths