Abstract

  • Popular text-to-image models lack character-level input features.
  • A series of experiments compares character-aware and character-blind text encoders.
  • Character-aware models provide large gains on a novel spelling task.
  • Character-aware variants outperform character-blind counterparts across a range of novel text rendering tasks.
  • The character-aware models set a much higher state of the art on visual spelling, with 30+ point accuracy gains over competitors on rare words.

Paper Content

Introduction

  • Image generation models have made quality gains in the last year
  • Rendering reliable visual text in images is a challenge
  • Character-blind text encoders have limited spelling ability
  • Character-aware text encoders can achieve robust spelling abilities at smaller scales

The spelling miracle

  • Language models can be categorized as character-aware or character-blind
  • Early neural language models operated directly on characters
  • Later models moved to vocabulary-based tokenization, some retaining character-awareness
  • Most widely used language models are character-blind, relying on data-driven subword segmentation algorithms
  • Recent work has pointed to advantages of character-aware input representations
  • Character-blind models are able to infer the character-level makeup of their tokens
  • With sufficient scale, character-blind models can achieve near-perfect spelling accuracy
  • Character-blind text encoders used in image generation struggle to translate input tokens into rendered character sequences
  • Supervised image-caption data can help models learn to spell, but is an inefficient paradigm

Measuring text encoder spelling ability

  • Text-to-image generation models rely on text encoders.
  • Text encoders are explored in isolation using a text-only spelling evaluation task.

The WikiSpell benchmark

  • WikiSpell benchmark created by sampling words from Wiktionary
  • The input to the model is a single word; the output is its spelling, with spaces between characters (e.g. "elephant" becomes "e l e p h a n t")
  • Words grouped into buckets based on frequency in mC4 corpus
  • Test and development sets created by sampling 1k words from each bucket
  • Five buckets: top 1%, 1-10%, 10-20%, 20-30%, bottom 50%
  • Training set of 10,000 words, 5,000 from bottom 50%, 5,000 proportional to frequency in mC4
  • Evaluated on English and six other languages
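
The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the alphabetical tie-breaking, and sampling with replacement in the frequency-weighted half are all assumptions, and `freqs` stands in for word counts that would come from mC4.

```python
import random

def spell(word):
    """Return the target spelling: characters separated by single spaces."""
    return " ".join(word)

# Bucket boundaries in percent of the frequency-ranked word list.
BUCKETS = {
    "top 1%": (0, 1),
    "1-10%": (1, 10),
    "10-20%": (10, 20),
    "20-30%": (20, 30),
    "bottom 50%": (50, 100),
}

def bucket_words(freqs):
    """Group words into buckets by frequency rank, most frequent first."""
    # Ties are broken alphabetically (an assumption) so the result is deterministic.
    ranked = sorted(freqs, key=lambda w: (-freqs[w], w))
    n = len(ranked)
    return {
        name: ranked[n * lo // 100 : n * hi // 100]
        for name, (lo, hi) in BUCKETS.items()
    }

def sample_train_words(freqs, buckets, n_rare=5000, n_freq=5000, seed=0):
    """Mix a uniform sample of rare words with a frequency-weighted sample."""
    rng = random.Random(seed)
    rare = buckets["bottom 50%"]
    uniform_part = rng.sample(rare, min(n_rare, len(rare)))
    # Frequency-proportional draw; sampling with replacement is an assumption.
    words = [w for w in freqs if freqs[w] > 0]
    weighted_part = rng.choices(words, weights=[freqs[w] for w in words], k=n_freq)
    return uniform_part + weighted_part
```

For example, `spell("elephant")` returns `"e l e p h a n t"`. Integer percent boundaries are used so that bucket edges do not depend on floating-point rounding.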

Text generation experiments

  • We use the WikiSpell benchmark to evaluate multiple pretrained text-only models.
  • The experiments cover T5, mT5, ByT5, and PaLM models.
  • The character-blind T5 and mT5 models perform worse on the top-1% most frequent words than on rarer words.
  • Scale is a significant factor in spelling ability for character-blind models.
  • ByT5 exhibits far greater spelling abilities than T5 and mT5.
  • Finetuning the encoder helps for less frequent words, but not for common words.

The DrawText benchmark

  • Evaluating text-to-image models is an ongoing topic of research
  • Standard benchmarks and metrics have been developed
  • There is a lack of work on text rendering and spelling evaluation
  • DrawText is a new benchmark designed to measure text rendering quality of text-to-image models
  • DrawText consists of two parts: Spelling and Creative

DrawText spelling

  • Constructed 500 prompts by sampling 100 words from each of the English WikiSpell frequency buckets
  • Assessed images using human ratings and optical character recognition (OCR)-based metrics
  • Used Google Cloud Vision API for OCR evaluation
  • Post-processed OCR output by removing newline characters and ignoring case when computing spelling accuracy
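
The OCR post-processing step can be sketched as follows. This is a hedged illustration only: it assumes the OCR strings have already been obtained (no OCR API call is shown), that newlines are deleted rather than replaced by spaces, and that a prediction counts as correct only on an exact match with the target word. The function names are hypothetical.

```python
def normalize_ocr(text):
    """Remove newline characters and lowercase the OCR output."""
    return text.replace("\r", "").replace("\n", "").lower()

def spelling_accuracy(ocr_outputs, targets):
    """Fraction of images whose normalized OCR text equals the target word.

    Exact-match scoring is an assumption made for this sketch.
    """
    if len(ocr_outputs) != len(targets):
        raise ValueError("ocr_outputs and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(
        normalize_ocr(ocr) == target.lower()
        for ocr, target in zip(ocr_outputs, targets)
    )
    return correct / len(targets)
```

With the inputs `["Ele\nphant", "giraf"]` and targets `["elephant", "giraffe"]`, the first output matches after normalization and the second does not, so the accuracy is 0.5.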

DrawText creative

  • Visual text can appear in many forms, such as scribbled, painted, carved, and sculpted.
  • 175 diverse prompts were created to test the ability of image generation models to render text in creative styles and settings.
  • Current models have difficulty accurately rendering text.

Image generation experiments

  • Text-to-image generative models are evaluated for their spelling ability using the DrawText benchmark.
  • Text-to-image generative models consist of a text encoder and a cascade of either diffusion models or autoregressive models.
  • In section 3, character-aware text encoders were found to outperform character-blind models on spelling in a text-only setting.
  • This section investigates whether making the text encoder character-aware improves the text rendering ability of text-to-image models.

Models

  • Train two character-blind and three character-aware image generation models
  • Train for 500,000 steps, 5.6x fewer than Imagen
  • Train only initial 64x64 model
  • Train exclusively on the LAION-400M dataset
  • 71% of LAION images contain text, 60% have caption/visual text correspondence
  • Train on uncropped images with arbitrary aspect ratios
  • Vary pretrained text encoder size and character-awareness
  • Benchmark Stable Diffusion, Imagen, and Parti (all character-blind)

DrawText spelling results

  • Character-aware models (ByT5 and Concat) outperform other models
  • Imagen-AR trained for 6.6x longer but still had lower accuracy
  • The T5 models provide a controlled comparison: character-aware models gain 25+ points on the most frequent words and 30+ points on the least frequent
  • The ByT5-XL model has a 43% smaller encoder than T5-XXL
  • OCR errors can lead to false positives and false negatives
  • Manual human validation found no false positives, but did find false negatives for both models
  • Common error types observed in T5 models suggest lack of core spelling knowledge

DrawText creative results

  • Character-aware models have a clear advantage in text rendering ability
  • Diffusion models struggle with arranging words in a fixed frame
  • Character-aware text encoders reduce misspellings of words

DrawBench results

  • Character-aware text encoders excel at spelling in both text and visual domains.
  • DrawBench is used to check whether character-aware models maintain the image quality and text-image alignment of character-blind models.
  • DrawBench evaluation shows character-aware models score worse on image-text alignment.
  • Concat(T5-XXL, ByT5-Small) model closes alignment gap.
  • Character-aware models are preferred in text and count categories, but dispreferred in other categories.

Multilingual results

  • ByT5-XXL model exhibits understanding across 11 languages
  • Pretrained ByT5 encoder aligns representations across languages
  • Model learns to map any language seen in pretraining into the space of images
  • Rendering text in different scripts requires more than just a multilingual encoder
  • The model is unable to render the word for "dog" in non-Latin scripts
  • Prompt language can bias model towards culturally-relevant visual interpretations

Conclusion

  • Example Python 3 code for transforming a word into its spelling
  • Exclude entries that contain whitespace or punctuation/symbols, are longer than 30 characters, or have the part of speech "Proverb"
  • Word frequencies computed on subsets of the full mC4 corpus
  • Word-counting by splitting document texts using delimiters
  • For Chinese and Thai, count number of documents in which the word appeared as a substring
  • ByT5 model able to spell words correctly in most cases
  • Filter images with no legible text for better comparison
  • Using character-aware text encoders provides large gains on spelling
  • Hybrid model combining token-level and character-level signals provides best of both worlds
  • Models lacking a direct character-level view of their inputs can still infer robust spelling information
  • This ability emerges only at scales over 100B parameters and does not generalize well beyond English
  • Changes to the image generation module needed to resolve failure modes
  • 175 creative prompts targeting rendered text of various lengths
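
The entry filtering and word-counting steps listed above can be sketched as follows. This is a rough sketch under stated assumptions, not the paper's implementation: the summary does not name the delimiters, so the code splits on runs of non-word characters, and it treats any Unicode punctuation, symbol, or separator character as grounds for exclusion. All function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter

def keep_entry(word, part_of_speech):
    """Apply the exclusion rules: length, part of speech, and character classes."""
    if len(word) > 30 or part_of_speech == "Proverb":
        return False
    for ch in word:
        # Unicode categories starting with P, S, Z are punctuation, symbols, separators.
        if ch.isspace() or unicodedata.category(ch)[0] in ("P", "S", "Z"):
            return False
    return True

def count_words(documents):
    """Count word occurrences by splitting each document on delimiters.

    Splitting on runs of non-word characters is an assumption.
    """
    counts = Counter()
    for doc in documents:
        counts.update(tok for tok in re.split(r"\W+", doc) if tok)
    return counts

def count_docs_containing(documents, words):
    """For unsegmented scripts (e.g. Chinese, Thai): documents containing each word."""
    return Counter({w: sum(w in doc for doc in documents) for w in words})
```

The substring-based count is kept separate because delimiter splitting does not isolate words in scripts written without spaces.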