Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent few-shot methods have achieved impressive results in label-scarce settings.
- These methods are difficult to employ: their results vary widely with manually crafted prompts, and they typically require billion-parameter language models.
- SetFit is an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST).
- SetFit works by first fine-tuning a pretrained ST on a small number of text pairs in a contrastive, Siamese manner; the resulting model then generates rich text embeddings, which are used to train a classification head.
- SetFit obtains results comparable to parameter-efficient fine-tuning (PEFT) and pattern-exploiting training (PET) techniques while being an order of magnitude faster to train.
- SetFit can be applied in multilingual settings by switching the ST body.
Paper Content
Introduction
- Few-shot learning methods are designed to work with a small number of labeled training examples
- Several approaches to few-shot learning with pretrained language models (PLMs) exist, including in-context learning, parameter-efficient fine-tuning (PEFT), and pattern-exploiting training (PET)
- These approaches can be impractical due to reliance on large-scale language models and specialized infrastructure
- SETFIT is proposed as an alternative approach that does not require large-scale PLMs or manually generated prompts
- SETFIT is based on Sentence Transformers and is fine-tuned in a Siamese manner for a text classification objective
- SETFIT is related to the few-shot and zero-shot training line of literature
- SETFIT is compared to GPT-3, T-FEW, adapters, PET, ADAPET, and PERFECT
SETFIT: Sentence Transformer fine-tuning
- SETFIT is based on Sentence Transformers
- Sentence Transformers use Siamese and triplet network structures to create semantically meaningful sentence embeddings
- SETFIT uses a two-step training approach
- First step is to fine-tune an ST in a contrastive, Siamese manner
- Second step is to train a classifier head using the encoded training data from the first step
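The first step depends on turning a handful of labeled examples into text pairs. The following is a minimal sketch of one way to build such pairs, not the paper's exact procedure: the function name `generate_pairs`, the `num_iterations` parameter, and the choice of sampling one positive and one negative partner per anchor text are all assumptions made for illustration. Pairs from the same class get target 1.0 and pairs from different classes get target 0.0, which is the form a contrastive, Siamese fine-tuning objective expects.

```python
import random


def generate_pairs(texts, labels, num_iterations=2, seed=0):
    """Build (text_a, text_b, target) triplets for contrastive fine-tuning.

    target is 1.0 when both texts share a label and 0.0 otherwise.
    Hypothetical helper: the sampling scheme is an illustrative assumption.
    """
    rng = random.Random(seed)
    indices = range(len(texts))
    pairs = []
    for _ in range(num_iterations):
        for i in indices:
            # Candidate partners from the same class (excluding the anchor itself)
            same = [j for j in indices if j != i and labels[j] == labels[i]]
            # Candidate partners from any other class
            diff = [j for j in indices if labels[j] != labels[i]]
            if same:
                pairs.append((texts[i], texts[rng.choice(same)], 1.0))
            if diff:
                pairs.append((texts[i], texts[rng.choice(diff)], 0.0))
    return pairs


# Toy few-shot training set: two examples per class
texts = ["great movie", "loved it", "terrible plot", "waste of time"]
labels = [1, 1, 0, 0]
pairs = generate_pairs(texts, labels, num_iterations=2)
```

In the second step, the fine-tuned ST would encode the original training texts, and those embeddings together with the labels would be used to fit the classifier head; that part is omitted here because it needs an actual encoder.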
Experiments
Data
- Conducted experiments on text classification datasets
- Split datasets into development and test datasets
- Used development datasets to set hyperparameters
- Test datasets represent different text classification tasks
- Datasets available on Hugging Face Hub
- Evaluated SETFIT on RAFT benchmark
- Evaluated three variations of SETFIT using different underlying ST models
Baselines
- Standard transformer fine-tuning is used as a baseline
- Hyperparameter search is done on the number of epochs
- ADAPET is a PET-based approach
- PERFECT uses task-specific adapters and multi-token label-embeddings
- T-FEW is a PEFT-based few-shot learning method
- Experiments are run with 5 random seeds and the median result is reported
Experimental setup
- Evaluating few-shot performance can be difficult
- To address this, 10 random training splits are used for each dataset and sample size
- Average measure and standard deviation are reported for each method
- SETFIT’s ST model is fine-tuned using cosine-similarity loss, learning rate of 1e-3, batch size of 16 and max sequence length of 256 tokens, for 1 epoch
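The evaluation protocol above can be sketched as follows. This is a hedged illustration rather than the authors' evaluation code: it assumes the sample size N counts examples per class, and the names `sample_few_shot_split`, `evaluate_over_splits`, `train_and_score`, and the keys of `ST_FINETUNE_CONFIG` are hypothetical. Only the values in the config come from the setup described above.

```python
import random
import statistics

# Hyperparameters listed above; the key names are illustrative assumptions
ST_FINETUNE_CONFIG = {
    "loss": "cosine_similarity",
    "learning_rate": 1e-3,
    "batch_size": 16,
    "max_seq_length": 256,
    "num_epochs": 1,
}


def sample_few_shot_split(labels, n_per_class, seed):
    """Return indices of n_per_class randomly chosen examples for each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    split = []
    for label in sorted(by_class):
        split.extend(rng.sample(by_class[label], n_per_class))
    return split


def evaluate_over_splits(labels, n_per_class, train_and_score, num_splits=10):
    """Train and score on num_splits random splits; return (mean, std) of scores."""
    scores = []
    for seed in range(num_splits):
        split = sample_few_shot_split(labels, n_per_class, seed)
        scores.append(train_and_score(split))
    return statistics.mean(scores), statistics.stdev(scores)


# Toy usage with a stand-in scorer (a real one would train a model on the
# split and return its test metric)
toy_labels = [0] * 50 + [1] * 50
dummy_scorer = lambda split: (sum(split) % 100) / 100
mean_score, std_score = evaluate_over_splits(toy_labels, 8, dummy_scorer)
```

Using a distinct seed per split keeps each run reproducible while still varying which examples are drawn.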
Results
- SETFIT MPNET outperforms FINETUNE, PERFECT, and ADAPET for both N = 8 and N = 64.
- SETFIT MPNET is on par with T-FEW 3B for N = 8 and outperforms it for N = 64.
- On the RAFT benchmark, SETFIT ROBERTA outperforms GPT-3 and PET and surpasses the human baseline in 7 out of 11 tasks.
- SETFIT ROBERTA falls short of T-FEW 11B by 4.5 points, but is more than 30 times smaller.
Multilingual experiments
- SETFIT was tested in a multilingual, few-shot text classification scenario.
- The Multilingual Amazon Reviews Corpus (MARC) was used for the experiments.
- SETFIT, standard transformer fine-tuning and ADAPET were compared.
- SETFIT outperformed the other methods in all settings.
Few-shot distillation
- SETFIT achieves state-of-the-art results in few-shot setups
- Model distillation can reduce computational load while preserving performance
- A SETFIT student model is compared to a standard transformer student model in few-shot distillation setups
- SETFIT teacher and student models have 110M and 15M parameters respectively
- SETFIT student outperforms baseline student when only small amounts of unlabeled data are available
- Performance gains decrease as amount of unlabeled data increases
Computational costs
- Comparing the computational costs of SETFIT and PET/PEFT is difficult due to different hardware/memory requirements
- FLOPs-per-token estimates are used to compare SETFIT to T-FEW
- SETFIT is an order of magnitude faster than T-FEW for inference and training
- SETFIT MINILM is two orders of magnitude faster than T-FEW
- SETFIT models have much smaller storage costs than T-FEW
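A FLOPs-per-token comparison of this kind can be sketched as below. The sketch rests on assumptions that are not stated in the summary above: it uses the common approximation of about 2N FLOPs per token for inference and 6N for training with an encoder-only model of N parameters, and N and 3N for an encoder-decoder model. The parameter counts (3B for the T-FEW model, 110M for an MPNet-sized SETFIT body) and the sequence length of 40 tokens are illustrative placeholders, not measured values.

```python
def flops_per_token(num_params, encoder_decoder, training):
    """Approximate FLOPs needed to process one token.

    Assumed rule of thumb: encoder-only models cost about 2N (inference)
    or 6N (training); encoder-decoder models cost about N or 3N, since
    each token passes through only about half of the parameters.
    """
    inference_cost = num_params if encoder_decoder else 2 * num_params
    return 3 * inference_cost if training else inference_cost


def total_flops(num_params, num_tokens, encoder_decoder, training):
    """FLOPs for processing one sequence of num_tokens tokens."""
    return flops_per_token(num_params, encoder_decoder, training) * num_tokens


# Illustrative placeholders, not figures from the paper
T_FEW_PARAMS = 3e9       # encoder-decoder model
SETFIT_PARAMS = 110e6    # encoder-only Sentence Transformer body
NUM_TOKENS = 40          # same hypothetical sequence length for both

t_few_inference = total_flops(T_FEW_PARAMS, NUM_TOKENS, True, False)
setfit_inference = total_flops(SETFIT_PARAMS, NUM_TOKENS, False, False)
inference_speedup = t_few_inference / setfit_inference
```

Under these placeholder values the ratio is about 13.6, which is in line with the order-of-magnitude claim; a prompt template that lengthens the T-FEW input would increase the gap.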
Conclusion
- Introduces SETFIT, a new few-shot text classification approach
- Advantages over comparable approaches such as T-FEW, ADAPET and PERFECT
- Faster at inference and training
- Requires smaller base models
- Not subject to instability and inconvenience of prompting
- Robust few-shot text classifier in languages other than English
- Useful in few-shot distillation setups
- Development and test datasets used for setting SETFIT’s hyperparameters
- Input and target templates used for experiments
- Relative speed-up for inference and training