Abstract

  • ProsAudit is a benchmark to assess structural prosodic knowledge in self-supervised learning speech models.
  • It consists of two subtasks and an evaluation dataset.
  • The subtasks involve correctly identifying strong versus weak prosodic boundaries and distinguishing between pauses inserted between words and within words.
  • Human evaluation scores are provided.
  • SSL models were able to perform above chance on both tasks, even when trained on an unseen language.
  • Non-native models performed worse than native ones on the lexical task.
  • Models trained on more data performed better in the two subtasks.

Paper Content

Introduction

  • Self-supervised learning (SSL) speech models have been developed to remove the need for labeled data
  • Multiple benchmarks and metrics have been developed to test the linguistic knowledge of such models
  • Prosody (rhythm, stress, and intonation) has not been evaluated in SSL models
  • A new evaluation benchmark, ProsAudit, has been proposed to assess SSL speech models’ ability to learn prosodic information
  • ProsAudit consists of two subtasks: protosyntax and lexical
  • Results of human evaluation on the two subtasks are provided
  • Results will be integrated into the Zero-Resource Speech Challenge leaderboard
  • Analysis of factors like input quantity and nativeness is conducted

Methods

ProsAudit benchmark

  • Created two prosodic benchmarks in English
  • Used Boston University Radio News Corpus (BU) dataset
  • Segments had to meet specific criteria (duration, prosodic boundaries, pauses)
  • Automatically deleted all existing annotated pauses from the stimuli
  • Protosyntax and lexical tasks designed to evaluate models’ understanding of prosody
  • Protosyntax task: one stimulus has a pause at a “natural” location, the other at an “unnatural” one
  • Lexical task: the pause falls at a word boundary in the natural condition and within a word in the unnatural condition
  • Sampled final stimuli with similarity losses
  • Inserted 400ms pause with crossfading
  • Test and dev sets created with random sampling
  • Final score corresponds to average score for all dev or test pairs
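The pause-insertion step above can be illustrated with a minimal sketch. It assumes the audio is a plain list of float samples and that "crossfading" means short linear fades into and out of the inserted silence. The function name `insert_pause` and the 10 ms fade length are hypothetical choices for illustration, not details taken from the paper.

```python
def insert_pause(samples, split_index, sample_rate, pause_ms=400, fade_ms=10):
    """Return a copy of `samples` with silence inserted at `split_index`.

    Assumption: the crossfade is modelled as a linear fade-out before the
    pause and a linear fade-in after it. `fade_ms` is an illustrative value.
    """
    if not 0 <= split_index <= len(samples):
        raise ValueError("split_index out of range")

    pause_len = int(sample_rate * pause_ms / 1000)
    fade_len = int(sample_rate * fade_ms / 1000)

    before = list(samples[:split_index])
    after = list(samples[split_index:])

    # Fade out: gain decreases linearly and reaches 0 on the last sample.
    n_out = min(fade_len, len(before))
    for i in range(n_out):
        gain = 1.0 - (i + 1) / n_out
        before[len(before) - n_out + i] *= gain

    # Fade in: gain starts at 0 on the first sample and increases linearly.
    n_in = min(fade_len, len(after))
    for i in range(n_in):
        gain = i / n_in
        after[i] *= gain

    return before + [0.0] * pause_len + after


if __name__ == "__main__":
    # One second of constant signal at a toy 1 kHz sample rate.
    signal = [1.0] * 1000
    with_pause = insert_pause(signal, split_index=500, sample_rate=1000)
    print(len(with_pause))
```

Fading on both sides avoids an abrupt jump to zero amplitude, which would otherwise produce an audible click and could give a model an acoustic cue unrelated to prosody.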

Baselines

  • Evaluated several self-supervised learning models of speech on protosyntax and lexical tasks
  • Evaluated models against prosody-aware models (pGSLM)
  • pGSLM models have three components: acoustic model, quantizer, language model
  • Two versions of GSLM models: standard and deduplicated
  • pGSLM models build upon GSLM models by adding tasks to predict fundamental frequency and duration
  • Evaluated English and French models from STELA
  • Human evaluation with Mechanical Turk
  • Discarded non-native English participants and participants who did not pass at least 5 out of 6 examples
  • Final stimuli (human subset) composed of 521 pairs for protosyntax and 510 pairs for lexical task

Results

Benchmark

  • All models perform above chance in both the protosyntax and lexical tasks.
  • STELA models perform better on the protosyntax task, while GSLM and pGSLM models perform better on the lexical task.
  • Not much difference in performance between the four pGSLM models.
  • GSLM model with deduplicated units performs similarly to the pGSLM models.
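Because each item is a pair of stimuli (natural versus unnatural), chance level is 50%. The sketch below shows one way the pairwise score could be computed, assuming the model assigns each stimulus a single number (for example a log-probability) and is counted as correct when the natural stimulus scores higher. The function name and the half-credit treatment of ties are assumptions, not details from the paper.

```python
def pairwise_accuracy(pairs):
    """Average accuracy over (natural_score, unnatural_score) pairs.

    Assumption: a higher score means the model finds the stimulus more
    plausible. Ties receive half credit, which is an illustrative choice.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")

    total = 0.0
    for natural, unnatural in pairs:
        if natural > unnatural:
            total += 1.0
        elif natural == unnatural:
            total += 0.5
    return total / len(pairs)


if __name__ == "__main__":
    # Toy scores, not results from the benchmark.
    scores = [(-1.0, -2.0), (-3.0, -2.5), (-1.0, -1.0), (0.0, -5.0)]
    print(pairwise_accuracy(scores))
```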

Further analyses

  • Humans performed better than chance on both lexical and protosyntax tasks.
  • Human scores were slightly lower than those reported for a Japanese version of the task.
  • Models scored higher than humans on the lexical task.
  • Participants were less confident in their ratings on the lexical task.

Effect of size

  • Models trained on a small amount of data (50 hours) can acquire structural prosodic knowledge.
  • Performance improves with increasing size of training dataset, particularly for the lexical task.
  • Native models have an advantage over non-native models, particularly in the lexical task.

Discussion & conclusion

  • Introduced ProsAudit, a zero-shot benchmark for measuring English prosodic knowledge of speech SSL models
  • Models performed well above chance in protosyntax task, suggesting knowledge is embedded in speech SSL models
  • Models trained on another language perform relatively well in protosyntax task, suggesting some prosodic knowledge is universal
  • GSLM models performed better on lexical task than protosyntax task, suggesting strong lexical knowledge of English
  • STELA models do not perform as well on lexical task as protosyntax task, suggesting weak lexical knowledge
  • Native models’ performance is strongly correlated with data size, while non-native models perform only slightly above chance
  • pGSLM models score only slightly better on protosyntax and lexical tasks than GSLM counterparts
  • Benchmark incorporated into Zero-Resource Speech challenge, inspiring further research to enhance prosodic capabilities of SSL models