Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Popular text-to-image models lack character-level input features.
The authors conduct a series of experiments comparing character-aware and character-blind text encoders.
Character-aware models provide large gains on a novel spelling task.
Character-aware variants outperform character-blind counterparts across a range of novel text rendering tasks.
The character-aware models set a much higher state of the art on visual spelling, with accuracy gains of 30+ points over competitors.

Paper Content

Introduction

Image generation models have made quality gains in the last year
Rendering reliable visual text in images is a challenge
Character-blind text encoders have limited spelling ability
Character-aware text encoders can achieve robust spelling abilities at smaller scales

The spelling miracle

Language models can be categorized as character-aware or character-blind
Early neural language models operated directly on characters
Later models moved to vocabulary-based tokenization, some retaining character-awareness
Most widely used language models are character-blind, relying on data-driven subword segmentation algorithms
Recent work has pointed to advantages of character-aware input representations
Character-blind models are able to infer the character-level makeup of their tokens
With sufficient scale, character-blind models can achieve near-perfect spelling accuracy
Character-blind text encoders used in image generation struggle to translate input tokens into rendered character sequences
Supervised image-caption data can help models learn to spell, but is an inefficient paradigm
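The distinction between character-aware and character-blind inputs can be illustrated with a minimal sketch. It assumes a byte-level view (ByT5 operates on UTF-8 bytes) and a toy greedy longest-match segmenter; `TOY_VOCAB`, `byte_level_ids`, and `toy_subword_ids` are hypothetical names for illustration, not the tokenizers used in the paper, and real tokenizers may reserve extra ids for special tokens.

```python
# Hypothetical toy vocabulary: each piece maps to one opaque integer id.
TOY_VOCAB = {"spell": 0, "ing": 1, "s": 2, "p": 3, "e": 4, "l": 5}


def byte_level_ids(text: str) -> list[int]:
    """Character-aware view: one id per UTF-8 byte, so spelling is visible."""
    return list(text.encode("utf-8"))


def toy_subword_ids(text: str, vocab: dict[str, int]) -> list[int]:
    """Character-blind view: greedy longest-match segmentation into pieces.

    The model only sees the piece ids, not the letters inside each piece.
    """
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece covers {text[i]!r}")
    return ids


print(byte_level_ids("cat"))                   # one id per byte
print(toy_subword_ids("spelling", TOY_VOCAB))  # two opaque piece ids
```

In the toy segmentation, "spelling" becomes two ids that carry no explicit information about the eight letters involved, which is the sense in which such an encoder is character-blind.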

Measuring text encoder spelling ability

Text-to-image generation models rely on text encoders.
Text encoders are explored in isolation using a text-only spelling evaluation task.

The WikiSpell benchmark

WikiSpell benchmark created by sampling words from Wiktionary
Input to the model is a single word; output is its spelling, with a space between consecutive characters
Words grouped into buckets based on frequency in mC4 corpus
Test and development sets created by sampling 1,000 words from each bucket
Five buckets: top 1%, 1-10%, 10-20%, 20-30%, bottom 50%
Training set of 10,000 words, 5,000 from bottom 50%, 5,000 proportional to frequency in mC4
Evaluated on English and six other languages
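The task format and bucketing described above can be sketched as follows. This is a minimal illustration, assuming buckets are defined by a word's percentile rank when words are sorted from most to least frequent; `spell`, `BUCKETS`, and `assign_buckets` are hypothetical names, and the exact procedure used for the benchmark may differ.

```python
def spell(word: str) -> str:
    """Target output for a word: its characters separated by single spaces."""
    return " ".join(word)


# (name, lower percent, upper percent) by frequency rank, 0 = most frequent.
# Boundaries follow the five buckets listed above; ranks between 30% and 50%
# fall into no bucket.
BUCKETS = [
    ("top 1%", 0, 1),
    ("1-10%", 1, 10),
    ("10-20%", 10, 20),
    ("20-30%", 20, 30),
    ("bottom 50%", 50, 100),
]


def assign_buckets(word_counts: dict[str, int]) -> dict[str, list[str]]:
    """Group words into frequency buckets (assumed rank-percentile scheme)."""
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(word_counts, key=lambda w: (-word_counts[w], w))
    total = len(ranked)
    buckets: dict[str, list[str]] = {name: [] for name, _, _ in BUCKETS}
    for rank, word in enumerate(ranked):
        for name, lo, hi in BUCKETS:
            # Integer comparison avoids floating-point boundary issues.
            if lo * total <= rank * 100 < hi * total:
                buckets[name].append(word)
                break
    return buckets


print(spell("cat"))  # c a t
```

Test and development sets would then be drawn by sampling from each bucket's word list, for example with `random.sample`.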

Text generation experiments

We use the WikiSpell benchmark to evaluate multiple pretrained text-only models.
Experiments cover T5, mT5, ByT5, and PaLM models.
T5 and mT5 perform worse on the top-1% most frequent words than on less frequent words.
Scale is a significant factor in spelling ability for character-blind models.
ByT5 exhibits far greater spelling abilities than T5 and mT5.
Finetuning the encoder helps for less frequent words, but not for common words.

The DrawText benchmark

Evaluating text-to-image models is an ongoing topic of research
Standard benchmarks and metrics have been developed
There is a lack of work on text rendering and spelling evaluation
DrawText is a new benchmark designed to measure text rendering quality of text-to-image models
DrawText consists of two parts: Spelling and Creative

DrawText spelling

Constructed 500 prompts by sampling 100 words from each of the English WikiSpell frequency buckets
Assessed images using human ratings and optical character recognition (OCR)-based metrics
Used Google Cloud Vision API for OCR evaluation
Post-processed OCR output by removing newline characters and ignoring case when computing spelling accuracy
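The OCR post-processing step above can be sketched as follows. This is a minimal illustration, assuming exact string match between the normalized OCR output and the target word counts as correct; `normalize_ocr` and `spelling_accuracy` are hypothetical helper names, and the OCR call itself is omitted.

```python
def normalize_ocr(text: str) -> str:
    """Remove newline characters and ignore case, as described above."""
    return text.replace("\n", "").lower()


def spelling_accuracy(targets: list[str], ocr_outputs: list[str]) -> float:
    """Fraction of images whose normalized OCR text equals the target word.

    Exact match after normalization is an assumption of this sketch.
    """
    if len(targets) != len(ocr_outputs):
        raise ValueError("targets and ocr_outputs must have the same length")
    if not targets:
        raise ValueError("at least one example is required")
    correct = sum(
        normalize_ocr(output) == target.lower()
        for target, output in zip(targets, ocr_outputs)
    )
    return correct / len(targets)


# "HEL\nLO" normalizes to "hello" (match); "wrld" does not match "world".
print(spelling_accuracy(["hello", "world"], ["HEL\nLO", "wrld"]))
```

Removing newlines matters because OCR may split a single rendered word across lines, which would otherwise be scored as a misspelling.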

DrawText creative

Visual text can appear in many forms, such as scribbled, painted, carved, and sculpted.
175 diverse prompts were created to test the ability of image generation models to render text in creative styles and settings.
Current models have difficulty accurately rendering text.

Image generation experiments

Text-to-image generative models are evaluated for their spelling ability using the DrawText benchmark.
Text-to-image generative models consist of a text encoder and a cascade of either diffusion models or autoregressive models.
In the text-only experiments above (Section 3 of the paper), character-aware text encoders were found to outperform character-blind models on spelling.
This section investigates whether making the text encoder character-aware improves the text rendering ability of text-to-image models.

Models

Two character-blind and three character-aware image generation models are trained
Training runs for 500,000 steps, 5.6x fewer than Imagen
Only the initial 64x64 model is trained
Training uses the Laion-400M dataset exclusively
71% of Laion images contain text, 60% have caption/visual text correspondence
Training uses uncropped images with arbitrary aspect ratios
The pretrained text encoder varies in size and character-awareness across models
Stable Diffusion, Imagen, and Parti (all character-blind) are benchmarked for comparison

DrawText spelling results

Character-aware models (ByT5 and Concat) outperform other models
Imagen-AR trained for 6.6x longer but still had lower accuracy
T5 models provide a controlled comparison: character-aware models gain 25+ points over them on the most frequent words and 30+ points on the least frequent
The ByT5-XL encoder is 43% smaller than the T5-XXL encoder
OCR errors can lead to false positives and false negatives
Manual human validation found no false positives, but did find false negatives, for both models
Common error types observed in T5 models suggest lack of core spelling knowledge

DrawText creative results

Character-aware models have a clear advantage in text rendering ability
Diffusion models struggle with arranging words in a fixed frame
Character-aware text encoders reduce misspellings of words

DrawBench results

Character-aware text encoders excel at spelling in both text and visual domains.
A DrawBench evaluation tests whether character-aware models maintain the image quality and text-image alignment of character-blind models.
DrawBench evaluation shows character-aware models score worse on image-text alignment.
Concat(T5-XXL, ByT5-Small) model closes alignment gap.
Character-aware models are preferred in text and count categories, but dispreferred in other categories.

Multilingual results

ByT5-XXL model exhibits understanding across 11 languages
Pretrained ByT5 encoder aligns representations across languages
Model learns to map any language seen in pretraining into the space of images
Rendering text in different scripts requires more than just a multilingual encoder
Model is unable to render the word for "dog" in non-Latin scripts
Prompt language can bias model towards culturally-relevant visual interpretations

Conclusion

Example Python 3 code for transforming a word into its spelling
Exclude entries with whitespace, punctuation/symbols, longer than 30 characters, or a “part-of-speech” Proverb
Word frequencies computed on subsets of the full mC4 corpus
Word-counting by splitting document texts using delimiters
For Chinese and Thai, count number of documents in which the word appeared as a substring
ByT5 model able to spell words correctly in most cases
Filter images with no legible text for better comparison
Using character-aware text encoders provides large gains on spelling
Hybrid model combining token-level and character-level signals provides best of both worlds
With sufficient scale, models lacking a direct character-level view of their inputs can infer robust spelling information
This ability required models over 100B parameters and did not generalize well beyond English
Changes to the image generation module needed to resolve failure modes
175 creative prompts targeting rendered text of various lengths
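The entry filtering and word-counting details listed above can be sketched as follows. This is a minimal illustration under stated assumptions: punctuation and symbols are detected via Unicode general categories, the delimiter set in `DELIMITERS` is a guess, and `keep_entry`, `count_words`, and `count_documents_containing` are hypothetical names rather than the authors' code.

```python
import re
import unicodedata
from collections import Counter

MAX_LEN = 30  # entries longer than 30 characters are excluded

# Assumed delimiter set: whitespace plus common punctuation.
DELIMITERS = re.compile(r'[\s.,;:!?"()\[\]{}]+')


def keep_entry(word: str, part_of_speech: str) -> bool:
    """Apply the exclusion rules: whitespace, punctuation/symbols, length, Proverb."""
    if part_of_speech == "Proverb":
        return False
    if not word or len(word) > MAX_LEN:
        return False
    for ch in word:
        # Unicode categories starting with P (punctuation), S (symbol),
        # or Z (separator) are rejected, as is any other whitespace.
        if ch.isspace() or unicodedata.category(ch)[0] in ("P", "S", "Z"):
            return False
    return True


def count_words(documents: list[str]) -> Counter:
    """Count word occurrences by splitting document texts on delimiters."""
    counts: Counter = Counter()
    for doc in documents:
        counts.update(tok for tok in DELIMITERS.split(doc) if tok)
    return counts


def count_documents_containing(word: str, documents: list[str]) -> int:
    """For Chinese and Thai: number of documents containing the word as a substring."""
    return sum(word in doc for doc in documents)
```

The substring-based document count is the fallback for languages written without spaces between words, where delimiter splitting would not isolate individual words.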
