Abstract

  • Virtual assistants use semantic parsing engines to convert user utterances to commands
  • Semantic parsing is a difficult multilingual transfer task with low transfer efficiency
  • Switching between languages is prevalent for bilingual users
  • This work improves the zero-shot performance of a multilingual and codeswitched semantic parsing system
  • Two stages of multilingual alignment are used to improve performance
  • Results show improved English performance and transfer efficiency

Paper Content

Introduction

  • Massively multilingual transformers (MMTs) are effective on multilingual and intra-sentential codeswitched data
  • Prior work has studied explicit alignment of individual embeddings
  • MMTs implicitly perform alignment within their hidden states
  • Gap between performance on training language and zero-shot targets is larger in task-oriented parsing benchmarks
  • Adversarial training can be used for cross-lingual transfer
  • Regularization can be used to maintain multilingual knowledge

Methods

  • Utilize two stages of alignment to improve zero-shot transfer in DAMP
  • Use contrastive learning during pretraining to improve alignment
  • Use domain adversarial training and constrained optimization during finetuning for double alignment
  • Apply improvements to pointer-generator network to produce a parse

Baseline architecture

  • Used pointer-generator network to generate semantic parses
  • Tokenized words into sub-words and retrieved hidden states from encoder
  • Used randomly initialized auto-regressive decoder to produce representations
  • Used perceptron to predict over vocabulary of intents and slot types
  • Used copy logit vector for arguments from original query
  • Applied softmax to concatenation of logits and optimized negative log-likelihood of correct prediction
  • Copy mechanism essential to generating parses with multilingual tokens
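
The output step described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the paper's implementation: the function names `log_softmax` and `pointer_generator_nll` and the toy logit values are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a flat list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def pointer_generator_nll(vocab_logits, copy_logits, target_index):
    """Negative log-likelihood of one decoding step.

    vocab_logits: scores over the ontology vocabulary (intents and slot types).
    copy_logits:  one score per source token in the original query.
    target_index: index into the concatenation vocab_logits + copy_logits.
    """
    joint = list(vocab_logits) + list(copy_logits)
    return -log_softmax(joint)[target_index]

# Toy step: 3 ontology symbols and 4 source tokens (made-up values).
vocab_logits = [0.2, -1.0, 0.5]
copy_logits = [0.1, 2.0, -0.3, 0.0]

# The gold action copies source token 1, which sits after the vocabulary block.
gold = len(vocab_logits) + 1
loss = pointer_generator_nll(vocab_logits, copy_logits, gold)
```

Because copied tokens are scored by position in the query rather than by vocabulary identity, the decoder never needs a multilingual output vocabulary, which is one way to read the last bullet above.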

Alignment pretraining

  • AMBER is a contrastive pretraining process for semantic parsing
  • AMBER combines 3 explicit alignment objectives: translation language modeling, sentence alignment, and word alignment using attention symmetry
  • Translation language modeling takes parallel sentences as input and masks tokens in each language
  • Sentence alignment directly optimizes similarity of representations across languages using a siamese network training process
  • Word level alignment is optimized with an attention symmetry loss
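
As an illustration of the sentence alignment idea, the sketch below scores each source sentence against every target sentence in a batch and rewards the true translation. The use of in-batch negatives, dot-product similarity, and the name `sentence_alignment_loss` are assumptions made for this example; the exact AMBER objective may differ.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sentence_alignment_loss(src_embs, tgt_embs):
    """Mean contrastive loss where src_embs[i] and tgt_embs[i] are a translation pair.

    Every other target sentence in the batch acts as a negative example.
    """
    total = 0.0
    for i, src in enumerate(src_embs):
        scores = [dot(src, tgt) for tgt in tgt_embs]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]  # negative log-probability of the true pair
    return total / len(src_embs)

# Toy sentence embeddings (made-up values).
english = [[1.0, 0.0], [0.0, 1.0]]
aligned = sentence_alignment_loss(english, [[1.0, 0.0], [0.0, 1.0]])
misaligned = sentence_alignment_loss(english, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when each sentence is closest to its own translation, which is the property this stage of pretraining tries to encourage.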

Cross-lingual adversarial alignment

  • Use token-level language discriminator to get aligned representations at the word level
  • Binary scheme treats all languages not found in the training data as a single class
  • Introduce a general constrained optimization approach for adversarial training
  • Train a discriminator to distinguish between in-domain training data and unlabeled out-of-domain data
  • Data is sampled evenly from all languages to create an adversarial dataset
  • Two-layer perceptron predicts probability that a token is English or Non-English
  • Discriminator loss is binary cross-entropy loss
  • Multi-class classification across all languages uses negative log-likelihood of the correct class as the loss function
  • Optimize task loss while enforcing a constraint derived from first principles
  • Treat λ as a learnable parameter and optimize it to maximize the value of λ(ε − L_d), where L_d is the discriminator loss and ε is the loss constraint
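
The constrained objective can be sketched as a Lagrangian with a learned multiplier. The code below is a hedged illustration, not the authors' training loop: it assumes the encoder objective has the form `task_loss + lam * (eps - disc_loss)`, that `eps` is the loss constraint (0.3 is the value reported for the adversarial experiments), and that the multiplier is clamped at zero. The function names and the step size are hypothetical.

```python
import math

def binary_cross_entropy(p_english, is_english):
    """BCE for one token; p_english is the discriminator's predicted probability."""
    p = min(max(p_english, 1e-12), 1.0 - 1e-12)
    return -math.log(p) if is_english else -math.log(1.0 - p)

def lagrangian(task_loss, disc_loss, lam, eps):
    """Encoder objective: task loss plus the multiplier-weighted constraint term."""
    return task_loss + lam * (eps - disc_loss)

def update_multiplier(lam, disc_loss, eps, step_size=0.1):
    """One gradient-ascent step on lam * (eps - disc_loss).

    The gradient with respect to lam is (eps - disc_loss). Clamping at zero is an
    assumption, the usual choice for inequality constraints.
    """
    return max(0.0, lam + step_size * (eps - disc_loss))

# Toy trace with made-up discriminator losses: while the loss is below eps the
# multiplier grows, and once the loss exceeds eps it shrinks back to zero.
eps = 0.3
lam = 0.0
trace = []
for disc_loss in [0.1, 0.1, 0.2, 0.5, 0.9]:
    lam = update_multiplier(lam, disc_loss, eps)
    trace.append(lam)
```

Under these assumptions the multiplier only applies pressure on the encoder while the discriminator is still too accurate, so the task loss dominates once the constraint is met.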

Experiments

  • Evaluated effects of techniques on 3 benchmarks for task-oriented semantic parsing
  • 2 datasets evaluate robustness to intra-sentential codeswitching
  • 1 dataset uses multilingual data to evaluate robustness to inter-sentential codeswitching
  • Examples divided into training, evaluation, and test data at 70/10/20 ratio
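
A 70/10/20 split can be sketched as below. Shuffling with a fixed seed and the name `split_70_10_20` are assumptions for illustration; the summary does not say how examples were assigned to each portion.

```python
import random

def split_70_10_20(examples, seed=0):
    """Shuffle examples and cut them into train, evaluation, and test portions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10   # integer arithmetic avoids float rounding
    n_eval = len(shuffled) // 10
    train = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]  # remainder, roughly 20%
    return train, evaluation, test

train, evaluation, test = split_70_10_20(range(100))
```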

Datasets

  • MTOP benchmark evaluates multilingual transfer for a difficult compositional parse structure
  • Benchmark contains queries in 6 languages
  • CST5 benchmark evaluates Hindi-English intra-sentential codeswitching
  • Spanish-English codeswitching dataset constructed using Google Translate
  • Dataset has 5,803 queries in both English and Spanish-English

Results

  • Uses same hyperparameter configurations for all settings
  • Encoder uses mBERT architecture
  • Decoder is randomly initialized 4-layer, 8-head vanilla transformer
  • AdamW used to optimize for 1.2 million training steps
  • Learning rate of 2e−5, batch size of 16, with the learning rate decayed to 0 over the course of training
  • Train on Cloud TPU v3 Pod for approximately 4 hours
  • Loss constraint of 0.3 for adversarial experiments
  • Report Exact Match (EM) accuracy on all test splits
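
Two of the settings above are easy to make concrete. The sketch below computes Exact Match accuracy over parse strings and a learning rate that decays to 0. The linear shape of the decay, the function names, and the example parse strings are assumptions for illustration; the summary only states that the rate reaches 0 by the end of training.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predicted parses identical to the reference parse."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)

def linear_decay_lr(step, total_steps, base_lr=2e-5):
    """Learning rate that falls linearly from base_lr to 0 over total_steps."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining

# Illustrative parse strings, not taken from any benchmark.
references = ["[IN:GET_WEATHER [SL:LOCATION Paris ] ]", "[IN:CREATE_ALARM [SL:TIME 7 am ] ]"]
predictions = ["[IN:GET_WEATHER [SL:LOCATION Paris ] ]", "[IN:CREATE_ALARM [SL:TIME am ] ]"]
em = exact_match_accuracy(predictions, references)
```

Exact Match gives no partial credit: the second prediction has the right intent and slot type but a wrong slot span, so it counts as a miss.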

Adversarial baseline comparison

Adversary ablation

  • Adversarial training improves transfer results
  • Binary adversary and constrained optimization are beneficial to adversarial alignment
  • DAMP improves over unconstrained multi-class adversarial technique by 9.9, 6.4, and 0.9 EM accuracy points

Regularization comparison

  • Adversarial training compared to regularization techniques in cross-lingual learning
  • Regularization baselines: freezing the first 8 layers of the encoder, and applying L1 and L2 norm penalties
  • Adversarial learning improves across all benchmarks

Pretrained decoder comparison

  • Adversarial training does worse than the plain mT5 model
  • Adversarial alignment causes a drop in performance due to accidental translation
  • Adversarial alignment improves performance in certain variants of MTOP and CSTOP
  • The mT5 decoder struggles to adapt to this task, making overall performance worse than DAMP

Improvement analysis

  • Exact match accuracy is used to measure improvements
  • Qualitative analysis is used to examine examples that DAMP predicts correctly but AMBER and mBERT do not
  • 20 examples from each language are manually evaluated
  • Intent prediction is a large portion of the gain
  • Table 4 reports intent prediction results
  • Improvements noted in DAMP are more accurate placement of articles and prepositions inside slot boundaries

Alignment analysis

  • Visualize alignment using two-dimensional projection of encoder embeddings
  • Quantitatively evaluate alignment using post-hoc linear probe

Embedding space visualization

  • UMAP is used to visualize the embedding spaces of each model variant on each MTOP test set.
  • mBERT produces linearly separable clusters of English and Non-English data.
  • AMBER removes the global clustering behavior and produces small local clusters of English and Non-English data.
  • DAMP produces an embedding space with no clear visual clusters of Non-English data.

Post-hoc probing

  • Quantitatively evaluated improvements to alignment
  • Adversary performance during training has been shown to be an insufficient measure of alignment because of mode collapse
  • Trained a linear probe on a frozen model after training for each variant
  • Probe performance decreases with each stage of alignment
  • Discriminator accuracy decreases with each stage of alignment
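
The probing procedure can be sketched with a small logistic-regression probe trained on fixed embeddings. Everything here is a toy assumption: the two-dimensional synthetic "embeddings", the function names, and the training settings stand in for real frozen encoder states and are not the paper's setup.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit logistic regression with full-batch gradient descent; features stay fixed."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose language label the probe predicts correctly."""
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
               for x, y in zip(features, labels))
    return hits / len(labels)

def toy_embeddings(gap, n, rng):
    """Label 1 (English) is centred at +gap, label 0 (Non-English) at -gap."""
    feats, labels = [], []
    for label, centre in ((1, gap), (0, -gap)):
        for _ in range(n):
            feats.append([rng.gauss(centre, 0.5), rng.gauss(0.0, 0.5)])
            labels.append(label)
    return feats, labels

rng = random.Random(0)
results = {}
for name, gap in (("separated", 2.0), ("aligned", 0.0)):
    w, b = train_linear_probe(*toy_embeddings(gap, 100, rng))
    results[name] = probe_accuracy(w, b, *toy_embeddings(gap, 100, rng))  # held-out draw
```

When the two language groups occupy the same region of the space, the probe has nothing linear to exploit and its held-out accuracy falls toward chance, which is the signal the bullets above describe.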

Conclusions and future work

  • Introduced Doubly Aligned Multilingual Parser (DAMP)
  • Contrastive alignment pretraining and adversarial alignment during finetuning
  • Improved transfer learning in semantic parsing for inter-sentential and intra-sentential codemixed data
  • Compared proposed alignment method to prior techniques and regularization baselines
  • Experiments used English as the base training language for domain adversarial transfer