Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Virtual assistants use semantic parsing engines to convert user utterances to commands
- Semantic parsing is a difficult multilingual transfer task with low transfer efficiency
- Switching between languages is prevalent for bilingual users
- This work improves the zero-shot performance of a multilingual and codeswitched semantic parsing system
- Two stages of multilingual alignment are used to improve performance
- Results show improved English performance and transfer efficiency
Paper Content
Introduction
Related work
- MMTs are effective for multilingual and intra-sentential codeswitching
- Prior work has studied explicit alignment of individual embeddings
- MMTs implicitly perform alignment within their hidden states
- Gap between performance on training language and zero-shot targets is larger in task-oriented parsing benchmarks
- Adversarial training can be used for cross-lingual transfer
- Regularization can be used to maintain multilingual knowledge
Methods
- Utilize two stages of alignment to improve zero-shot transfer in DAMP
- Use contrastive learning during pretraining to improve alignment
- Use domain adversarial training and constrained optimization during finetuning for double alignment
- Apply improvements to pointer-generator network to produce a parse
Baseline architecture
- Used pointer-generator network to generate semantic parses
- Tokenized words into sub-words and retrieved hidden states from encoder
- Used randomly initialized auto-regressive decoder to produce representations
- Used perceptron to predict over vocabulary of intents and slot types
- Used copy logit vector for arguments from original query
- Applied softmax to concatenation of logits and optimized negative log-likelihood of correct prediction
- Copy mechanism essential to generating parses with multilingual tokens
Alignment pretraining
- AMBER is a contrastive pretraining process for semantic parsing
- AMBER combines 3 explicit alignment objectives: translation language modeling, sentence alignment, and word alignment using attention symmetry
- Translation language modeling uses parallel sentences as input and masking tokens in each language
- Sentence alignment directly optimizes similarity of representations across languages using a siamese network training process
- Word level alignment is optimized with an attention symmetry loss
Cross-lingual adversarial alignment
- Use token-level language discriminator to get aligned representations at the word level
- Binary scheme treats all languages not found in the training data as a single class
- Introduce a general constrained optimization approach for adversarial training
- Train a discriminator to distinguish between in-domain training data and unlabeled out-of-domain data
- Data is sampled evenly from all languages to create an adversarial dataset
- Two-layer perceptron predicts probability that a token is English or Non-English
- Discriminator loss is binary cross-entropy loss
- Multi-class classification across all languages uses negative log-likelihood of the correct class as the loss function
- Optimize task loss while enforcing a constraint derived from first-principles
- Treat Ī» as a learnable parameter and optimize it to maximize the value of Ī»( ā L d )
Experiments
- Evaluated effects of techniques on 3 benchmarks for task-oriented semantic parsing
- 2 datasets evaluate robustness to intra-sentential codeswitching
- 1 dataset uses multilingual data to evaluate robustness to inter-sentential codeswitching
- Examples divided into training, evaluation, and test data at 70/10/20 ratio
Datasets
- MTOP benchmark evaluates multilingual transfer for a difficult compositional parse structure
- Benchmark contains queries in 6 languages
- CST5 benchmark evaluates Hindi-English intra-sentential codeswitching
- Spanish-English codeswitching dataset constructed using Google Translate
- Dataset has 5,803 queries in both English and Spanish-English
Results
- Uses same hyperparameter configurations for all settings
- Encoder uses mBERT architecture
- Decoder is randomly initialized 4-layer, 8-head vanilla transformer
- AdamW used to optimize for 1.2 million training steps
- Learning rate of 2eā5, batch size of 16, and decay the learning rate to 0 throughout of training
- Train on Cloud TPU v3 Pod for approximately 4 hours
- Loss constraint of 0.3 for adversarial experiments
- Report Exact Match (EM) accuracy on all test splits
Adversarial baseline comparison
Adversary ablation
- Adversarial training improves transfer results
- Binary adversary and constrained optimization are beneficial to adversarial alignment
- DAMP improves over unconstrained multi-class adversarial technique by 9.9, 6.4, and 0.9 EM accuracy points
Regularization comparison
- Adversarial training compared to regularization techniques in cross-lingual learning
- Freezing first 8 layers of encoder and using L1 and L2 norm penalty
- Adversarial learning improves across all benchmarks
Pretrained decoder comparison
- Adversarial training does worse than the plain mT5 model
- Adversarial alignment causes a drop in performance due to accidental translation
- Adversarial alignment improves performance in certain variants of MTOP and CSTOP
- The mT5 decoder struggles to adapt to this task, making overall performance worse than DAMP
Improvement analysis
- Exact match accuracy is used to measure improvements
- Qualitative analysis is used to examine examples that DAMP predicts correctly but AMBER and mBERT do not
- 20 examples from each language are manually evaluated
- Intent prediction is a large portion of the gain
- Table 4 reports intent prediction results
- Improvements noted in DAMP are more accurate placement of articles and prepositions inside slot boundaries
Alignment analysis
- Visualize alignment using two-dimensional projection of encoder embeddings
- Quantitatively evaluate alignment using post-hoc linear probe
Embedding space visualization
- UMAP is used to visualize the embedding spaces of each model variant on each MTOP test set.
- mBERT produces linearly separate clusters of English and Non-English data.
- AMBER removes the global clustering behavior and produces small local clusters of English and Non-English data.
- DAMP produces an embedding space with no clear visual clusters of Non-English data.
Post-hoc probing
- Quantitatively evaluated improvements to alignment
- Mode collapse during training has been shown to be insufficient
- Trained a linear probe on a frozen model after training for each variant
- Probe performance decreases with each stage of alignment
- Discriminator accuracy decreases with each stage of alignment
Conclusions and future work
- Introduced Doubly Aligned Multilingual Parser (DAMP)
- Contrastive alignment pretraining and adversarial alignment during finetuning
- Improved transfer learning in semantic parsing for inter-sentential and intrasentential codemixed data
- Compared proposed alignment method to prior techniques and regularization baselines
- Experiments using English as the base training language for domain adversarial transfer