2022

  1. Maestro

    Maestro is a self-supervised training method that unifies representations learned from the speech and text modalities, which can then transfer to downstream tasks such as Automatic Speech Recognition (ASR) and Speech Translation (ST). Maestro was proposed by Google in 2022 and published in this paper: “MAESTRO: Matched Speech Text Representations through Modality Matching”. Sadly, Google hasn’t open-sourced the code for this paper :( …
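To make the “modality matching” idea concrete, here is a tiny numpy sketch of the general concept: embed time-aligned speech and text with separate encoders, then penalize the distance between the two representation sequences so they land in a shared space. The encoders, shapes, and loss here are illustrative assumptions, not Maestro’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def speech_encoder(features, W):
    """Hypothetical speech encoder: a single linear layer over frame features."""
    return features @ W

def text_encoder(tokens, embedding):
    """Hypothetical text encoder: a plain embedding lookup."""
    return embedding[tokens]

def modality_matching_loss(speech_repr, text_repr):
    """Mean squared distance between time-aligned speech and text representations."""
    return float(np.mean((speech_repr - text_repr) ** 2))

dim = 8
frames = rng.normal(size=(5, 16))      # 5 speech frames, 16-dim features (toy)
tokens = np.array([2, 7, 7, 1, 4])     # 5 aligned text tokens (toy alignment)
W = rng.normal(size=(16, dim)) * 0.1
embedding = rng.normal(size=(10, dim)) * 0.1

loss = modality_matching_loss(speech_encoder(frames, W),
                              text_encoder(tokens, embedding))
print(round(loss, 4))
```

Minimizing such a loss (alongside the usual self-supervised objectives) is what pulls the two modalities toward one representation space.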


  2. mSLAM: Multilingual SLAM

    mSLAM stands for “Multilingual Speech and Language Model”: a model that learns cross-lingual, cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM is the multilingual version of SLAM, pre-trained on speech data from $51$ languages and text data from $101$ languages. mSLAM was proposed by Google in 2022 and published in their paper: “mSLAM: Massively multilingual joint pre-training for speech and text”. …


2021

  1. SLAM: Speech Language Model

    SLAM stands for “Speech and Language Model”: a model pre-trained on both speech and text data that can later be fine-tuned on either language-related tasks such as “Machine Translation” or speech-related tasks such as “Speech Recognition”. SLAM was proposed by Google Research in 2021 and published in their paper of the same name: “SLAM: A Unified Encoder For Speech and Language Modeling via Speech-Text Joint Pre-Training”. This paper takes the universality of unsupervised language pre-training one step further by unifying speech and text pre-training within a single model. …


  2. SpeechT5

    SpeechT5, which stands for “Speech Text-to-Text Transfer Transformer”, is a unified framework for speech and text that leverages large-scale unlabeled speech and text data to improve the modeling capability for both modalities. The name is inspired by the T5 framework from Google, which did the same for the textual modality. SpeechT5 was proposed by Microsoft in 2021 and published in this paper: SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. The official code for this framework can be found in Microsoft’s official GitHub repository: Microsoft/SpeechT5. …
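SpeechT5’s unified-modal layout routes both modalities through one shared backbone, with modality-specific pre-nets mapping speech or text in and modality-specific post-nets mapping decoder states back out. The numpy sketch below illustrates only that routing idea; the single-matrix “nets”, dimensions, and vocabulary size are toy assumptions, not the paper’s architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 16

speech_prenet = rng.normal(size=(80, d_model)) * 0.1   # e.g. 80-dim fbank frames in (toy)
text_embed    = rng.normal(size=(50, d_model)) * 0.1   # toy 50-token text vocabulary
W_shared      = rng.normal(size=(d_model, d_model)) * 0.1
text_postnet  = rng.normal(size=(d_model, 50)) * 0.1

def shared_backbone(hidden):
    # Stand-in for the shared Transformer encoder-decoder.
    return np.tanh(hidden @ W_shared)

def speech_to_text(fbank_frames):
    # Speech pre-net in, shared backbone, text post-net out (ASR-like path).
    hidden = shared_backbone(fbank_frames @ speech_prenet)
    return hidden.mean(axis=0) @ text_postnet

def text_to_text(token_ids):
    # Text pre-net in, same shared backbone, same text post-net out.
    hidden = shared_backbone(text_embed[token_ids])
    return hidden.mean(axis=0) @ text_postnet

asr_logits = speech_to_text(rng.normal(size=(12, 80)))
mt_logits  = text_to_text(np.array([3, 14, 15]))
print(asr_logits.shape, mt_logits.shape)   # both (50,): same shared output space
```

The point of the sketch is that the backbone and output head are shared across modalities; only the thin input/output adapters differ.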


2020

  1. Dual-decoder Transformer

    Dual-decoder Transformer is a Transformer architecture that consists of two decoders: one responsible for Automatic Speech Recognition (ASR) and the other for Speech Translation (ST). This model was proposed by FAIR and Grenoble Alpes University in 2020 and published in this paper: Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation. The official code of this paper can be found in the following GitHub repository: speech-translation. …
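A minimal numpy sketch of the dual-decoder shape: one shared acoustic encoder produces a memory that two task-specific decoders consume, one emitting tokens in the source language (ASR) and one in the target language (ST). The single-layer encoder, mean-pooled “attention”, and vocabulary sizes are assumptions for illustration, not the paper’s actual layers.

```python
import numpy as np

rng = np.random.default_rng(1)

def encoder(x, W):
    # Shared acoustic encoder: one linear projection + tanh in this toy.
    return np.tanh(x @ W)

def decoder(memory, W_out):
    # Each decoder would attend over the encoder memory; here we mean-pool
    # it and project to that decoder's own output vocabulary as a stand-in.
    pooled = memory.mean(axis=0)
    logits = pooled @ W_out
    return int(np.argmax(logits))

frames = rng.normal(size=(20, 32))           # 20 frames of 32-dim features (toy)
W_enc = rng.normal(size=(32, 16)) * 0.1
W_asr = rng.normal(size=(16, 100)) * 0.1     # source-language vocab (toy)
W_st  = rng.normal(size=(16, 120)) * 0.1     # target-language vocab (toy)

memory = encoder(frames, W_enc)              # computed once, shared by both decoders
asr_token = decoder(memory, W_asr)
st_token  = decoder(memory, W_st)
print(asr_token, st_token)
```

Sharing the encoder memory is what lets the two tasks be trained jointly and inform each other, instead of running two independent models.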