Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- ProsAudit is a benchmark to assess structural prosodic knowledge in self-supervised learning speech models.
- It consists of two subtasks and an evaluation dataset.
- The subtasks involve correctly identifying strong versus weak prosodic boundaries and distinguishing between pauses inserted between words and within words.
- Human evaluation scores are provided.
- SSL models were able to perform above chance on both tasks, even when trained on an unseen language.
- Non-native models performed worse than native ones on the lexical task.
- Models trained on more data performed better in the two subtasks.
Paper Content
Introduction
- Self-supervised learning (SSL) speech models have been developed to remove the need for labeled data
- Multiple benchmarks and metrics have been developed to test the linguistic knowledge of such models
- Prosody (rhythm, stress, and intonation) has not been evaluated in SSL models
- A new evaluation benchmark, ProsAudit, has been proposed to assess SSL speech models’ ability to learn prosodic information
- ProsAudit consists of two subtasks: protosyntax and lexical
- Results of human evaluation on the two subtasks are provided
- Results will be integrated into the Zero-Resource Speech Challenge leaderboard
- Analysis of factors like input quantity and nativeness is conducted
Methods
ProsAudit benchmark
- Created two prosodic evaluation tasks in English (protosyntax and lexical)
- Used Boston University Radio News Corpus (BU) dataset
- Segments had to meet specific criteria (duration, prosodic boundaries, pauses)
- Automatically deleted all existing annotated pauses from the stimuli
- Protosyntax and lexical tasks designed to evaluate models’ understanding of prosody
- Protosyntax task: one stimulus has pause at “natural” location, other at “unnatural”
- Lexical task: pause inserted at a word boundary in the natural condition, within a word in the unnatural condition
- Sampled final stimuli with similarity losses
- Inserted a 400 ms pause with crossfading
- Test and dev sets created with random sampling
- Final score corresponds to average score for all dev or test pairs
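The pause-insertion step can be illustrated with a minimal sketch. It assumes a mono waveform held in a NumPy array and uses a linear fade-out and fade-in around the silence as a simple stand-in for crossfading. The function name `insert_pause`, the 10 ms fade length, and the linear fade shape are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def insert_pause(wave, sample_rate, position_s, pause_s=0.4, fade_s=0.01):
    """Insert `pause_s` seconds of silence at `position_s` seconds.

    The audio is faded out before the pause and faded back in after it.
    `fade_s` and the linear ramp are illustrative assumptions.
    """
    wave = np.asarray(wave, dtype=np.float64)
    cut = int(round(position_s * sample_rate))
    if not 0 <= cut <= len(wave):
        raise ValueError("position_s lies outside the waveform")

    n_pause = int(round(pause_s * sample_rate))
    n_fade = int(round(fade_s * sample_rate))

    # Copy both halves so the caller's array is left untouched.
    left = wave[:cut].copy()
    right = wave[cut:].copy()

    # Clip the fade length to what is available on each side.
    n_out = min(n_fade, len(left))
    n_in = min(n_fade, len(right))
    if n_out > 0:
        left[-n_out:] *= np.linspace(1.0, 0.0, n_out)  # fade out to silence
    if n_in > 0:
        right[:n_in] *= np.linspace(0.0, 1.0, n_in)  # fade in from silence

    return np.concatenate([left, np.zeros(n_pause), right])
```

For a natural/unnatural pair, the same utterance would be passed twice with two different `position_s` values, so the two stimuli differ only in where the pause falls.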
Baselines
- Evaluated several self-supervised learning models of speech on protosyntax and lexical tasks
- Evaluated models against prosody-aware models (pGSLM)
- pGSLM models have three components: acoustic model, quantizer, language model
- Two versions of GSLM models: standard and deduplicated
- pGSLM models build upon GSLM models by adding tasks to predict fundamental frequency and duration
- Evaluated English and French models from STELA
- Human evaluation with Mechanical Turk
- Discarded non-native English participants and participants who did not pass at least 5 out of 6 examples
- Final stimuli (human subset) composed of 521 pairs for protosyntax and 510 pairs for lexical task
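The participant exclusion rule described above can be sketched as a simple filter. The `Participant` record and its field names are hypothetical; only the two criteria (native English speaker, at least 5 of 6 examples passed) come from the summary.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    # Hypothetical record layout, used only for this illustration.
    participant_id: str
    native_english: bool
    examples_passed: int  # number of example items passed, out of 6


def keep_participant(p, min_passed=5):
    """Keep native English speakers who passed at least `min_passed` examples."""
    return p.native_english and p.examples_passed >= min_passed


def filter_participants(participants, min_passed=5):
    """Return only the participants that satisfy both criteria."""
    return [p for p in participants if keep_participant(p, min_passed)]
```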
Results
Benchmark
- All models perform above chance in both the protosyntax and lexical tasks.
- STELA models perform better on the protosyntax task, whereas GSLM and pGSLM models perform better on the lexical task.
- Not much difference in performance between the four pGSLM models.
- GSLM model with deduplicated units performs similarly to the pGSLM models.
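Because each item is a two-way choice within a pair, chance level is 50%. A minimal sketch of a pairwise score follows, assuming the model assigns each stimulus a scalar score such as a log-probability and that the natural stimulus should receive the higher one. The function names and the half-credit treatment of ties are assumptions for illustration, not the benchmark's official scoring code.

```python
CHANCE_LEVEL = 0.5  # two stimuli per pair, so random guessing gives 50%


def pair_accuracy(pairs):
    """Average pairwise score over (natural_score, unnatural_score) tuples.

    A pair counts as correct when the natural stimulus scores higher.
    Ties receive half credit, which is an assumption of this sketch.
    """
    correct = 0.0
    total = 0
    for natural, unnatural in pairs:
        if natural > unnatural:
            correct += 1.0
        elif natural == unnatural:
            correct += 0.5
        total += 1
    if total == 0:
        raise ValueError("no pairs to score")
    return correct / total


def is_above_chance(pairs):
    """True when the pairwise accuracy exceeds the 50% chance level."""
    return pair_accuracy(pairs) > CHANCE_LEVEL
```

This raw comparison ignores sampling noise; a significance test would be needed to decide whether a score only slightly above 50% differs reliably from chance.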
Further analyses
- Humans performed better than chance on both lexical and protosyntax tasks.
- Human scores were slightly lower than those obtained on a Japanese version of the task.
- Models scored higher than humans on the lexical task.
- Participants were less confident in their ratings on the lexical task.
Effect of size
- Models trained on a small amount of data (50 hours) can acquire structural prosodic knowledge.
- Performance improves with increasing size of training dataset, particularly for the lexical task.
- Native models have an advantage over non-native models, particularly in the lexical task.
Discussion & conclusion
- Introduced ProsAudit, a zero-shot benchmark for measuring English prosodic knowledge of speech SSL models
- Models performed well above chance in protosyntax task, suggesting knowledge is embedded in speech SSL models
- Models trained on another language perform relatively well in protosyntax task, suggesting some prosodic knowledge is universal
- GSLM models performed better on lexical task than protosyntax task, suggesting strong lexical knowledge of English
- STELA models do not perform as well on lexical task as protosyntax task, suggesting weak lexical knowledge
- Native models’ performance is strongly correlated with data size, whereas non-native models perform only slightly above chance
- pGSLM models score only slightly better on protosyntax and lexical tasks than GSLM counterparts
- Benchmark incorporated into Zero-Resource Speech challenge, inspiring further research to enhance prosodic capabilities of SSL models