Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent few-shot methods have achieved impressive results in label-scarce settings.
These methods are difficult to employ due to high variability from manually crafted prompts and require billion-parameter language models.
SetFit is an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers.
SetFit works by fine-tuning a pretrained ST on a small number of text pairs and using the resulting model to generate rich text embeddings.
SetFit obtains comparable results with PEFT and PET techniques while being an order of magnitude faster to train.
SetFit can be applied in multilingual settings by switching the ST body.

Paper Content

Introduction

Few-shot learning methods are designed to work with a small number of labeled training examples
Several approaches to few-shot learning with PLMs exist, including incontext learning, parameter-efficient finetuning, and pattern exploiting training
These approaches can be impractical due to reliance on large-scale language models and specialized infrastructure
SETFIT is proposed as an alternative approach that does not require large-scale PLMs or manually generated prompts
SETFIT is based on Sentence Transformers and is fine-tuned in a Siamese manner for a text classification objective
SETFIT is related to the few-shot and zero-shot training line of literature
SETFIT is compared to GPT-3, T-FEW, adapters, PET, ADAPET, and PER-FECT

Setfit: sentence transformer fine-tuning

SETFIT is based on Sentence Transformers
Sentence Transformers use Siamese and triplet network structures to create semantically meaningful sentence embeddings
SETFIT uses a two-step training approach
First step is to fine-tune an ST in a contrastive, Siamese manner
Second step is to train a classifier head using the encoded training data from the first step

Experiments

Data

Conducted experiments on text classification datasets
Split datasets into development and test datasets
Used development datasets to set hyperparameters
Test datasets represent different text classification tasks
Datasets available on Hugging Face Hub
Evaluated SETFIT on RAFT benchmark
Evaluated three variations of SETFIT using different underlying ST models

Baselines

Standard transformer fine-tuning is used as a baseline
Hyperparameter search is done on the number of epochs
ADAPET is a PET-based approach
PERFECT uses task-specific adapters and multi-token label-embeddings
T-FEW is a PEFT-based few-shot learning method
Experiments are run with 5 random seeds and the median result is reported

Experimental setup

Evaluating few-shot performance can be difficult
To address this, 10 random training splits are used for each dataset and sample size
Average measure and standard deviation are reported for each method
SETFIT’s ST model is fine-tuned using cosine-similarity loss, learning rate of 1e-3, batch size of 16 and max sequence length of 256 tokens, for 1 epoch

Results

SETFIT MPNET outperforms FINETUNE, PERFECT, and ADAPET for both N = 8 and N = 64.
SETFIT MPNET is on par with T-FEW 3B for N = 8 and outperforms it for N = 64.
SETFIT ROBERTA outperforms GPT3 and PET and surpasses the human baseline in 7 out of 11 tasks.
SETFIT ROBERTA falls short of T-FEW 11B by 4.5 points, but is more than 30 times smaller.

Multilingual experiments

SETFIT was tested in a multilingual, few-shot text classification scenario.
The Multilingual Amazon Reviews Corpus (MARC) was used for the experiments.
SETFIT, standard transformer fine-tuning and ADAPET were compared.
SETFIT outperformed the other methods in all settings.

Few-shot distillation

SETFIT achieves state-of-the-art results in few-shot setups
Model distillation can reduce computational load while preserving performance
SETFIT student model compared to a standard transformer student model in few-shot distillation setups
SETFIT teacher and student models have 110M and 15M parameters respectively
SETFIT student outperforms baseline student when only small amounts of unlabeled data are available
Performance gains decrease as amount of unlabeled data increases

Computational costs

Comparing the computational costs of SET-FIT and PET/PEFT is difficult due to different hardware/memory requirements
FLOPs-per-token estimates are used to compare SET-FIT to T-FEW
SET-FIT is an order of magnitude faster than T-FEW for inference and training
SET-FIT MINILM is two orders of magnitude faster than T-FEW
SET-FIT models have much smaller storage costs than T-FEW

Conclusion

Introduces SETFIT, a new few-shot text classification approach
Advantages over comparable approaches such as T-FEW, ADAPET and PERFECT
Faster at inference and training
Requires smaller base models
Not subject to instability and inconvenience of prompting
Robust few-shot text classifier in languages other than English
Useful in few-shot distillation setups
Development and test datasets used for setting SETFIT’s hyperparameters
Input and target templates used for experiments
Relative speed-up for inference and training

Link to paper#

Abstract#

Paper Content#

Introduction#

Setfit: sentence transformer fine-tuning#

Experiments#

Data#

Baselines#

Experimental setup#

Results#

Multilingual experiments#

Few-shot distillation#

Computational costs#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Setfit: sentence transformer fine-tuning

Experiments

Data

Baselines

Experimental setup

Results

Multilingual experiments

Few-shot distillation

Computational costs

Conclusion