Abstract
- Introduce À-la-carte Prompt Tuning (APT), a transformer-based scheme for tuning prompts on distinct data so that they can be composed arbitrarily at inference time
- Prompts can be trained in isolation, on different devices, at different times, and on different distributions or domains
- During inference, models can be assembled based on arbitrary selections of data sources
- A-la-carte learning enables constructing bespoke models specific to each user’s individual access rights and preferences
- Information can be added to or removed from the model by adding or removing the corresponding prompts, without retraining from scratch
- Models achieve accuracy within 5% of models trained on the union of the respective sources
- State-of-the-art performance on Split CIFAR-100 and CORe50 benchmarks
Paper Content
Introduction
Related work
- Prompting originated from natural language processing
- We compare different methods of combining prompts
- We optimize “soft” prompts in the embedding space
- Prompt tuning applied to the continual learning problem
- Forgetting in deep networks is challenging
- We run a procedure to benchmark our APT approach
Preliminaries
- Vision Transformer is used as the backbone architecture due to its accuracy and ease of prompting.
- An image is split into patches and represented as tokens.
- A special learnable class token is added to the input.
- The output tokens of the transformer are used to predict the label of the input.
- Prompting can be used as an alternative adaptation mechanism for vision transformers.
- A new learnable prompt token is attached to the transformer’s input.
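The token sequence described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: raw pixel values stand in for patch embeddings (a real Vision Transformer applies a learned linear projection and position embeddings), the class and prompt tokens are placeholder vectors rather than learned parameters, and the position of the prompt at the end of the sequence is chosen only for illustration. The names `split_into_patches` and `build_input_tokens` are hypothetical.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


def build_input_tokens(patch_tokens, class_token, prompt_tokens=()):
    """Assemble the input sequence: class token, patch tokens, then prompt tokens."""
    return [class_token] + list(patch_tokens) + list(prompt_tokens)


# Toy 4x4 "image" split into 2x2 patches: 4 patch tokens of dimension 4.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, patch_size=2)

# Placeholder vectors (assumption): in a real model both are learned.
class_token = [0.0] * 4
prompt_token = [0.1] * 4

tokens = build_input_tokens(patches, class_token, [prompt_token])
print(len(patches), len(tokens))
```

With these toy sizes the sequence holds one class token, four patch tokens, and one prompt token. Adapting the model then means training only the prompt (and typically a classifier head) rather than the backbone weights.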
À-la-carte prompt tuning
- We have a pre-trained backbone and a pool of additional data sources
- All sources in the pool pertain to the same task and share the same input and label space
- We want to fine-tune the backbone using all data in the pool
- The collection of data sources can change over time
- Different users may want to use different subsets of the data
- We suggest an alternative strategy based on composition of prompts trained on individual data sources
- We use a transformer with frozen parameters of the backbone
- We ensemble the predictions made by each individual prompt
- The model output depends only on the sources in the subset
- We compare the performance of APT to ensembling classifier heads
- APT consistently outperforms Head-only ensembling
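The composition step above, in which predictions from individual prompts are ensembled and the output depends only on the selected sources, can be sketched as follows. This is a minimal sketch assuming each prompt has already produced class logits for the same image through the frozen backbone, and assuming the ensemble averages softmax probabilities. The source names and logit values are made up for illustration, and `ensemble_predict` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logits_by_source, selected_sources):
    """Average class probabilities over the prompts of the selected sources only."""
    if not selected_sources:
        raise ValueError("select at least one source")
    probs = [softmax(logits_by_source[name]) for name in selected_sources]
    num_classes = len(probs[0])
    averaged = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    predicted = max(range(num_classes), key=averaged.__getitem__)
    return predicted, averaged


# Hypothetical per-prompt logits for one image and three classes.
logits_by_source = {
    "source_a": [2.0, 0.5, 0.1],
    "source_b": [0.2, 1.5, 0.3],
    "source_c": [0.1, 2.5, 0.2],
}

# Two users with different access rights get different bespoke models.
label_ab, probs_ab = ensemble_predict(logits_by_source, ["source_a", "source_b"])
label_all, probs_all = ensemble_predict(
    logits_by_source, ["source_a", "source_b", "source_c"]
)
print(label_ab, label_all)
```

Because unselected sources never enter the average, leaving a source out of `selected_sources` removes its influence on the prediction entirely.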
Applications of à-la-carte learning
- Data stored across multiple servers/devices
- Each server can train a prompt on its data in isolation
- Prompts can be assembled on a central server for inference
- Data does not need to be uploaded to the central server
- Different users have different rights to access datasets
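- The model can be updated with new data by training an additional prompt
- A data source can be forgotten by removing its prompt
- Data can be partitioned into “shards”, each with its own prompt, so that a forget request only requires retraining the prompt of the affected shard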
- Can use for continual learning by training a prompt for each episode
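The sharding and forgetting workflow listed above can be sketched as a small registry. This is an illustrative sketch under stated assumptions: `train_prompt` is a hypothetical stand-in that only records which samples it saw, samples are assigned to shards by a simple modulo rule, and the class name `ShardedPromptPool` is invented for this example.

```python
def train_prompt(samples):
    """Hypothetical stand-in for prompt training; records the samples used."""
    return {"trained_on": sorted(samples)}


class ShardedPromptPool:
    """Keeps one prompt per data shard so that shards can be updated in isolation."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.shards = {i: set() for i in range(num_shards)}
        self.prompts = {}

    def _shard_of(self, sample_id):
        # Assumption: deterministic modulo assignment of samples to shards.
        return sample_id % self.num_shards

    def add_samples(self, sample_ids):
        """Add samples and retrain only the prompts of the shards they land in."""
        touched = set()
        for sample_id in sample_ids:
            shard = self._shard_of(sample_id)
            self.shards[shard].add(sample_id)
            touched.add(shard)
        for shard in touched:
            self.prompts[shard] = train_prompt(self.shards[shard])
        return touched

    def forget_sample(self, sample_id):
        """Remove one sample and retrain only the prompt of its shard."""
        shard = self._shard_of(sample_id)
        self.shards[shard].discard(sample_id)
        if self.shards[shard]:
            self.prompts[shard] = train_prompt(self.shards[shard])
        else:
            self.prompts.pop(shard, None)
        return shard

    def assemble(self, allowed_shards):
        """Return the prompts a given user is allowed to use at inference."""
        return {s: p for s, p in self.prompts.items() if s in allowed_shards}


pool = ShardedPromptPool(num_shards=3)
pool.add_samples(range(9))
retrained_shard = pool.forget_sample(4)
user_prompts = pool.assemble({0, 2})
```

In this sketch, forgetting sample 4 retrains one prompt out of three while the other two are left untouched, which is the property that makes forget requests cheap when the backbone stays frozen.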
Experiments
- Pre-trained model used: ViT-B/16 pre-trained on ImageNet-21k
- Datasets used: MIT-67, CUB-200-2011, FGVC-Aircraft, Oxford Flowers, Caltech-256, Oxford Pets, Stanford Cars, Split CIFAR-100, CORe50
- Finetuning methods compared to prompt tuning: Head-only, Bias+Head, Deep PT, Deep Shared PT, Shallow PT
- APT compared to standard finetuning: competitive (within 2% accuracy)
- APT compared to head-only tuning: outperforms
- APT compared to naive concatenation of prompts: naive concatenation performs worse
- APT compared to training a simple head-only classifier: uniformly outperforms
- APT performance when shards are deleted: drop off in accuracy is small
- APT performance when shards are added: accuracy within 2-5% of accuracy of training on entire dataset
- APT-W outperforms L2P on Split CIFAR-100
- APT outperforms all other methods on CORe50
Conclusion
- Introduced general problem of À-la-carte Learning and an efficient solution using APT
- Models constructed à la carte are competitive with models trained on union of sources
- APT achieves state-of-the-art performance for class and domain incremental learning
- Problem deserves further study to develop competitive machine learning methods
- Average ensembling outperforms majority vote
- Pretraining of backbone transformer is pertinent for performance of APT
- APT outperforms finetuning in terms of classification accuracy
- APT prevents prompts from interfering with each other and ensembles individual outputs
- APT performance within a few percent of paragon, even when dataset split in up to 20 parts
- APT degrades gracefully when data removed
- APT combines individual prompts to create stronger classifier
- Self-attention has quadratic complexity in sequence length
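To make the difference between average ensembling and majority vote concrete, the following toy example shows a case where the two rules disagree. The probabilities are hypothetical and chosen only to illustrate the mechanics. The example does not explain why averaging performed better in the paper's experiments.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def average_ensemble(prob_rows):
    """Average the class probabilities of all prompts, then pick the top class."""
    num_classes = len(prob_rows[0])
    mean = [sum(row[c] for row in prob_rows) / len(prob_rows)
            for c in range(num_classes)]
    return argmax(mean)


def majority_vote(prob_rows):
    """Let each prompt vote for its top class, then pick the most common vote."""
    votes = [argmax(row) for row in prob_rows]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical outputs of three prompts for one image and two classes:
# two prompts weakly prefer class 0, one prompt strongly prefers class 1.
prob_rows = [
    [0.51, 0.49],
    [0.51, 0.49],
    [0.05, 0.95],
]

avg_label = average_ensemble(prob_rows)
vote_label = majority_vote(prob_rows)
print(avg_label, vote_label)
```

Averaging keeps each prompt's confidence, so one confident prompt can outweigh two nearly undecided ones, whereas majority vote discards that information.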