Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introduce A-la-carte Prompt Tuning (APT), a transformer-based scheme to tune prompts
  • Prompts can be trained in isolation, on different devices, at different times, and on different distributions or domains
  • During inference, models can be assembled based on arbitrary selections of data sources
  • A-la-carte learning enables constructing bespoke models specific to each user’s individual access rights and preferences
  • Models can be added or removed without retraining from scratch
  • Models achieve accuracy within 5% of models trained on the union of the respective sources
  • State-of-the-art performance on Split CIFAR-100 and CORe50 benchmarks

Paper Content

Introduction

  • Prompting originated from natural language processing
  • We compare different methods of combining prompts
  • We optimize “soft” prompts in the embedding space
  • Prompt tuning applied to the continual learning problem
  • Forgetting in deep networks is challenging
  • We run a procedure to benchmark our APT approach

Preliminaries

  • Vision Transformer is used as the backbone architecture due to its accuracy and ease of prompting.
  • An image is split into patches and represented as tokens.
  • A special learnable class token is added to the input.
  • The output tokens of the transformer are used to predict the label of the input.
  • Prompting can be used as an alternative adaptation mechanism for vision transformers.
  • A new learnable prompt token is attached to the transformer’s input.

À-la-carte prompt tuning

  • We have a pre-trained backbone and a pool of additional data sources
  • All sources in the pool pertain to the same task and share the same input and label space
  • We want to fine-tune the backbone using all data in the pool
  • The collection of data sources can change over time
  • Different users may want to use different subsets of the data
  • We suggest an alternative strategy based on composition of prompts trained on individual data sources
  • We use a transformer with frozen parameters of the backbone
  • We ensemble the predictions made by each individual prompt
  • The model output depends only on the sources in the subset
  • We compare the performance of APT to ensembling classifier heads
  • APT consistently outperforms Head-only ensembling

Applications of à-la-carte learning

  • Data stored across multiple servers/devices
  • Each server can train a prompt on its data in isolation
  • Prompts can be assembled on a central server for inference
  • Data does not need to be uploaded to the central server
  • Different users have different rights to access datasets
  • Model can be updated with new data
  • Easy to forget a source of data
  • Can partition data into “shards” to satisfy forget requests
  • Can use for continual learning by training a prompt for each episode

Experiments

  • Pre-trained model used: VIT-B/16 pre-trained on ImageNet-21k
  • Datasets used: MIT-67, Cub-200-2011, FGVC-Aircrafts, Oxford Flowers, Caltech-256, Oxford Pets, Stanford Cars, Split CIFAR-100, CORe50
  • Finetuning methods compared to prompt tuning: Head-only, Bias+Head, Deep PT, Deep Shared PT, Shallow PT
  • APT compared to standard finetuning: competitive (within 2% accuracy)
  • APT compared to head-only tuning: outperforms
  • APT compared to naive concatenation: performs worse
  • APT compared to training a simple head-only classifier: uniformly outperforms
  • APT performance when shards are deleted: drop off in accuracy is small
  • APT performance when shards are added: accuracy within 2-5% of accuracy of training on entire dataset
  • APT-W outperforms L2P on Split CIFAR-100
  • APT outperforms all other methods on CORe50

Conclusion

  • Introduced general problem of À-la-carte Learning and an efficient solution using APT
  • Models constructed à la carte are competitive with models trained on union of sources
  • APT achieves state-of-the-art performance for class and domain incremental learning
  • Problem deserves further study to develop competitive machine learning methods
  • Average ensembling outperforms majority vote
  • Pretraining of backbone transformer is pertinent for performance of APT
  • APT outperforms finetuning in terms of classification accuracy
  • APT prevents prompts from interfering with each other and ensembles individual outputs
  • APT performance within a few percent of paragon, even when dataset split in up to 20 parts
  • APT degrades gracefully when data removed
  • APT combines individual prompts to create stronger classifier
  • Self-attention has quadratic complexity in sequence length