Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent few-shot methods have achieved impressive results in label-scarce settings.
  • These methods are difficult to employ due to high variability from manually crafted prompts and require billion-parameter language models.
  • SetFit is an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers.
  • SetFit works by fine-tuning a pretrained ST on a small number of text pairs and using the resulting model to generate rich text embeddings.
  • SetFit obtains comparable results with PEFT and PET techniques while being an order of magnitude faster to train.
  • SetFit can be applied in multilingual settings by switching the ST body.

Paper Content

Introduction

  • Few-shot learning methods are designed to work with a small number of labeled training examples
  • Several approaches to few-shot learning with PLMs exist, including incontext learning, parameter-efficient finetuning, and pattern exploiting training
  • These approaches can be impractical due to reliance on large-scale language models and specialized infrastructure
  • SETFIT is proposed as an alternative approach that does not require large-scale PLMs or manually generated prompts
  • SETFIT is based on Sentence Transformers and is fine-tuned in a Siamese manner for a text classification objective
  • SETFIT is related to the few-shot and zero-shot training line of literature
  • SETFIT is compared to GPT-3, T-FEW, adapters, PET, ADAPET, and PER-FECT

Setfit: sentence transformer fine-tuning

  • SETFIT is based on Sentence Transformers
  • Sentence Transformers use Siamese and triplet network structures to create semantically meaningful sentence embeddings
  • SETFIT uses a two-step training approach
  • First step is to fine-tune an ST in a contrastive, Siamese manner
  • Second step is to train a classifier head using the encoded training data from the first step

Experiments

Data

  • Conducted experiments on text classification datasets
  • Split datasets into development and test datasets
  • Used development datasets to set hyperparameters
  • Test datasets represent different text classification tasks
  • Datasets available on Hugging Face Hub
  • Evaluated SETFIT on RAFT benchmark
  • Evaluated three variations of SETFIT using different underlying ST models

Baselines

  • Standard transformer fine-tuning is used as a baseline
  • Hyperparameter search is done on the number of epochs
  • ADAPET is a PET-based approach
  • PERFECT uses task-specific adapters and multi-token label-embeddings
  • T-FEW is a PEFT-based few-shot learning method
  • Experiments are run with 5 random seeds and the median result is reported

Experimental setup

  • Evaluating few-shot performance can be difficult
  • To address this, 10 random training splits are used for each dataset and sample size
  • Average measure and standard deviation are reported for each method
  • SETFIT’s ST model is fine-tuned using cosine-similarity loss, learning rate of 1e-3, batch size of 16 and max sequence length of 256 tokens, for 1 epoch

Results

  • SETFIT MPNET outperforms FINETUNE, PERFECT, and ADAPET for both N = 8 and N = 64.
  • SETFIT MPNET is on par with T-FEW 3B for N = 8 and outperforms it for N = 64.
  • SETFIT ROBERTA outperforms GPT3 and PET and surpasses the human baseline in 7 out of 11 tasks.
  • SETFIT ROBERTA falls short of T-FEW 11B by 4.5 points, but is more than 30 times smaller.

Multilingual experiments

  • SETFIT was tested in a multilingual, few-shot text classification scenario.
  • The Multilingual Amazon Reviews Corpus (MARC) was used for the experiments.
  • SETFIT, standard transformer fine-tuning and ADAPET were compared.
  • SETFIT outperformed the other methods in all settings.

Few-shot distillation

  • SETFIT achieves state-of-the-art results in few-shot setups
  • Model distillation can reduce computational load while preserving performance
  • SETFIT student model compared to a standard transformer student model in few-shot distillation setups
  • SETFIT teacher and student models have 110M and 15M parameters respectively
  • SETFIT student outperforms baseline student when only small amounts of unlabeled data are available
  • Performance gains decrease as amount of unlabeled data increases

Computational costs

  • Comparing the computational costs of SET-FIT and PET/PEFT is difficult due to different hardware/memory requirements
  • FLOPs-per-token estimates are used to compare SET-FIT to T-FEW
  • SET-FIT is an order of magnitude faster than T-FEW for inference and training
  • SET-FIT MINILM is two orders of magnitude faster than T-FEW
  • SET-FIT models have much smaller storage costs than T-FEW

Conclusion

  • Introduces SETFIT, a new few-shot text classification approach
  • Advantages over comparable approaches such as T-FEW, ADAPET and PERFECT
  • Faster at inference and training
  • Requires smaller base models
  • Not subject to instability and inconvenience of prompting
  • Robust few-shot text classifier in languages other than English
  • Useful in few-shot distillation setups
  • Development and test datasets used for setting SETFIT’s hyperparameters
  • Input and target templates used for experiments
  • Relative speed-up for inference and training