2023

  1. LP Conformer for VSR/AVSR

    LP Conformer stands for “Linear Projection Conformer”, a visual speech recognition (VSR) model that reads lip movements and transforms them into text. This model was proposed by Google in 2023 and published in this paper: “Conformers Are All You Need For Visual Speech Recognition”. An illustration of the model can be seen in the following figure: …


  2. Audio-Visual data2vec

    AV-data2vec stands for “Audio-Visual data2vec” which extends the data2vec framework from uni-modal to multi-modal. AV-data2vec encodes masked audio-visual data and performs a masked prediction task of contextualized targets based on the unmasked input data, similar to data2vec and data2vec 2.0. AV-data2vec was proposed by Meta in early 2023 and published in this paper: AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations. …
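
    To make the masked-prediction idea concrete, below is a minimal, simplified sketch of a data2vec-style objective (illustrative, not the official AV-data2vec code): a student encoder sees masked features, an exponential-moving-average (EMA) teacher sees the unmasked features, and the student regresses the teacher’s contextualized outputs at the masked positions. The encoder, dimensions, and loss here are illustrative assumptions.

        # A simplified data2vec-style masked-prediction objective (illustrative, not the official
        # AV-data2vec code): the student sees masked features, the EMA teacher sees unmasked ones.
        import copy
        import torch
        import torch.nn as nn

        class MaskedPredictionSketch(nn.Module):
            def __init__(self, dim=256, ema_decay=0.999):
                super().__init__()
                layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
                self.student = nn.TransformerEncoder(layer, num_layers=4)
                self.teacher = copy.deepcopy(self.student)      # EMA copy, never trained directly
                for p in self.teacher.parameters():
                    p.requires_grad = False
                self.mask_emb = nn.Parameter(torch.zeros(dim))  # learned embedding for masked frames
                self.ema_decay = ema_decay

            @torch.no_grad()
            def update_teacher(self):
                # teacher <- decay * teacher + (1 - decay) * student, called after each optimizer step
                for t, s in zip(self.teacher.parameters(), self.student.parameters()):
                    t.mul_(self.ema_decay).add_(s, alpha=1 - self.ema_decay)

            def forward(self, feats, mask):
                # feats: (batch, time, dim) fused audio-visual features; mask: (batch, time) bool
                with torch.no_grad():
                    targets = self.teacher(feats)       # contextualized targets from the unmasked input
                masked = feats.clone()
                masked[mask] = self.mask_emb            # hide the masked frames from the student
                preds = self.student(masked)
                return nn.functional.mse_loss(preds[mask], targets[mask])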


2022

  1. data2vec 2.0

    data2vec 2.0 is the subsequent version of data2vec which improves compute efficiency by using efficient data encoding, a fast convolutional decoder, and multiple masked versions of each sample. data2vec 2.0 was proposed by Meta in late 2022 and published in this paper: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language. The official code for data2vec 2.0 can be found as part of the Fairseq framework on GitHub: fairseq/data2vec. …


  2. Whisper

    Whisper stands for “Web-scale Supervised Pretraining for Speech Recognition” (I know, it should’ve been “WSPSR”). Whisper is a speech model trained in a supervised setup on 680,000 hours of labeled audio data to handle different speech-related tasks such as “transcription”, “translation”, “VAD”, and “alignment” on approximately 100 languages. Whisper was proposed by OpenAI in 2022 and published in this paper: “Robust Speech Recognition via Large-Scale Weak Supervision”. The official code for Whisper can be found on OpenAI’s official GitHub repository: openai/whisper. The following figure shows the architecture of Whisper: …
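
    For reference, transcribing a file with the released openai/whisper package looks roughly like the snippet below, based on the repository’s documented usage (“audio.mp3” is a placeholder path):

        # pip install -U openai-whisper
        import whisper

        model = whisper.load_model("base")         # downloads the "base" checkpoint on first use
        result = model.transcribe("audio.mp3")     # runs language detection and decoding
        print(result["text"])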


  3. u-HuBERT: A Unified HuBERT

    u-HuBERT stands for “Unified Hidden Unit BERT” which is a unified self-supervised pre-training framework that can leverage unlabeled speech data of many different modalities for pre-training, including both uni-modal and multi-modal speech. u-HuBERT was proposed by Meta AI in 2022 and published in this paper: “A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer” by the same author who introduced HuBERT and AV-HuBERT. …


  4. AV-HuBERT for AVSR

    Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments. One way to help with that is to complement the audio stream with visual information that is invariant to noise, which improves model performance. Mixing the visual stream with the audio stream is known as audio-visual speech recognition (AVSR). …


  5. data2vec

    data2vec is a framework proposed by Meta in 2022 that uses self-supervised learning on “speech, text, and image” modalities to create a single framework that works for all three. So, instead of using word2vec, wav2vec, or image2vec, we can simply use data2vec. This work could be a step closer to models that understand the world better through multiple modalities. data2vec was published in this paper: data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. The official code for data2vec can be found as part of the Fairseq framework on GitHub: fairseq/data2vec. …


  6. BEST-RQ

    BEST-RQ stands for “BERT-based Speech pre-Training with Random-projection Quantizer” which is a BERT-like self-supervised learning technique for speech recognition. BEST-RQ masks parts of the input speech signal and feeds it to an encoder which learns to predict the labels of the masked regions. Both masked and unmasked labels are provided by a random-projection quantizer as shown in the following figure. BEST-RQ was proposed by Google Brain in 2022 and published in this paper: Self-Supervised Learning with Random-Projection Quantizer for Speech Recognition. …
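
    The random-projection quantizer itself is simple enough to sketch. The following is a minimal illustration (not the official implementation; dimensions and codebook size are illustrative): a frozen random matrix projects each speech frame, and the index of the nearest entry in a frozen random codebook becomes that frame’s discrete label.

        # A minimal random-projection quantizer in the spirit of BEST-RQ (illustrative, not the
        # official code): a frozen random projection plus a frozen random codebook label each frame.
        import torch
        import torch.nn.functional as F

        class RandomProjectionQuantizer(torch.nn.Module):
            def __init__(self, input_dim=80, codebook_size=8192, codebook_dim=16):
                super().__init__()
                # Both the projection and the codebook are randomly initialized and never trained.
                self.register_buffer("projection", torch.randn(input_dim, codebook_dim))
                self.register_buffer("codebook", F.normalize(torch.randn(codebook_size, codebook_dim), dim=-1))

            @torch.no_grad()
            def forward(self, frames):                        # frames: (batch, time, input_dim) features
                proj = F.normalize(frames @ self.projection, dim=-1)
                sims = proj @ self.codebook.t()               # cosine similarity to every codeword
                return sims.argmax(dim=-1)                    # (batch, time) discrete target labels

        labels = RandomProjectionQuantizer()(torch.randn(2, 100, 80))   # targets for masked prediction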


2021

  1. XLS-R

    XLS-R stands for “Cross-Lingual Speech Representation” which is a large-scale version of XLSR built not only for the cross-lingual speech recognition task but also for speech translation and speech classification tasks. XLS-R was pre-trained on nearly half a million hours of publicly available speech audio in 128 languages. XLS-R was proposed by Facebook AI in 2021 and published in this paper: “XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning At Scale”. The official code of this paper can be found as a part of FairSeq’s official GitHub repository: fairseq/xlsr. …


  2. w2v-BERT

    w2v-BERT combines the core methodologies of self-supervised pre-training of speech embodied in the wav2vec 2.0 model and self-supervised pre-training of language embodied in BERT. w2v-BERT was proposed by Google Brain in 2021 and published in this paper: “w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training”. The w2v-BERT pre-training framework is illustrated down below: …


  3. HuBERT

    HuBERT stands for “Hidden-unit BERT”, which is a BERT-like model trained for speech-related tasks. HuBERT utilizes a novel self-supervised learning method whose performance either matches or improves upon the state-of-the-art wav2vec 2.0. HuBERT was proposed by FAIR in 2021 and published in this paper under the same name: “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units”. The official code for HuBERT can be found as part of the Fairseq framework on GitHub: fairseq/hubert. …


  4. wav2vec-U

    Wav2Vec Unsupervised is a model created by Facebook AI Research in May 2021 and published in this paper: Unsupervised Speech Recognition. …


  5. Auto-AVSR

    Auto-AVSR stands for “Automatic Audio-Visual Speech Recognition” which is an off-the-shelf hybrid audio-visual model based on a ResNet-18 for visual encoding and a Conformer for audio encoding. Auto-AVSR was originally proposed in 2021 by researchers from Imperial College London and published in this paper: “End-to-end Audio-visual Speech Recognition with Conformers”. …


2020

  1. Improved RNN Transducer

    Improved RNN-T or Improved Recurrent Neural Network Transducer is an improved version of the RNN-Transducer where a normalized jointer network is introduced to improve performance. This improved version was proposed by Bytedance AI Lab in 2020 and published in this paper: Improving RNN Transducer with Normalized Jointer Network. To further improve the performance of the RNN-T system, they used a masked Conformer model as the encoder network and the Transformer-XL as the predictor network. …


  2. Combined Semi-supervised Learning

    Combined SSL or Combined Semi-supervised Learning is a new approach combining semi-supervised learning techniques such as “iterative self-learning” with pre-trained audio encoders to create an ASR system that achieves state-of-the-art results on LibriSpeech. This approach was developed by Google Research and Google Brain in 2020 and published in this paper: Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition. The resulting ASR system from this approach follows the Transducer architecture where the encoder is a Conformer model while the decoder is an LSTM model as shown in the following figure: …


  3. ConvT-T

    ConvT-Transducer stands for “Conv-Transformer Transducer” which is a Transducer-based streamable automatic speech recognition (ASR) system created by Huawei Noah’s Ark Lab in 2020 and published in their paper: Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable. The original Transformer, with its encoder-decoder architecture, is only suitable for offline ASR as it relies on a bidirectional attention mechanism. To make the Transformer suitable for streaming ASR, they applied the following modifications: …


  4. XLSR

    XLSR stands for “Cross-Lingual Speech Representation” which is a large-scale multilingual speech recognition model based on wav2vec 2.0 that was pre-trained on raw waveforms from 53 languages to perform the speech recognition task in an unsupervised manner. XLSR was proposed by Facebook AI Research in 2020 and published in their paper: “Unsupervised Cross-Lingual Representation Learning for Speech Recognition”. The official code of this paper can be found as a part of FairSeq’s official GitHub repository: fairseq/wav2vec. …


  5. wav2vec 2.0

    Wav2Vec 2.0 is a self-supervised end-to-end ASR model pre-trained on raw audio data by masking spans of latent speech representations, similar to the masked language modeling (MLM) used with BERT. Wav2vec 2.0 was created by Facebook AI Research in 2020 and published in this paper: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. The official code for this paper can be found as part of the fairseq framework on GitHub: fairseq/wav2vec2.0. …
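
    As a rough illustration of what “masking spans of latent representations” means, here is a minimal sketch (the masking probability and span length below are placeholders, not the paper’s exact hyper-parameters):

        # An illustrative sketch of span masking over latent frames (not the official implementation).
        import torch

        def mask_spans(num_frames, start_prob=0.065, span_len=10):
            """Return a boolean mask where each sampled start position masks a contiguous span."""
            starts = torch.rand(num_frames) < start_prob
            mask = torch.zeros(num_frames, dtype=torch.bool)
            for start in torch.nonzero(starts).flatten().tolist():
                mask[start:start + span_len] = True
            return mask

        latents = torch.randn(1, 400, 768)          # latent speech representations: (batch, time, dim)
        mask = mask_spans(latents.size(1))
        masked = latents.clone()
        masked[:, mask] = 0.0                       # wav2vec 2.0 uses a learned vector instead of zeros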


  6. Conformer

    Conformer stands for “Convolution-augmented Transformer” which is an encoder architecture that combines convolutional neural networks (CNNs) and Transformers to perform the speech recognition task. Conformer was created by Google in 2020 and published in this paper under the same name: “Conformer: Convolution-augmented Transformer for Speech Recognition”. The unofficial code for this paper can be found in the following GitHub repository: conformer. …
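
    A simplified sketch of a single Conformer block is shown below (relative positional encoding, dropout, and other details are omitted; the sizes are illustrative): two half-step feed-forward modules sandwich a self-attention module and a convolution module.

        # A simplified Conformer block sketch (illustrative; not the paper's exact configuration).
        import torch
        import torch.nn as nn

        class ConformerBlockSketch(nn.Module):
            def __init__(self, dim=256, heads=4, kernel_size=31):
                super().__init__()
                self.ff1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
                self.ff2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
                self.attn_norm = nn.LayerNorm(dim)
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.conv_norm = nn.LayerNorm(dim)
                self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)     # GLU halves this back to dim
                self.depthwise = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
                self.batchnorm = nn.BatchNorm1d(dim)
                self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)
                self.final_norm = nn.LayerNorm(dim)

            def conv_module(self, x):                    # x: (batch, time, dim)
                y = self.conv_norm(x).transpose(1, 2)    # -> (batch, dim, time) for Conv1d
                y = nn.functional.glu(self.pointwise1(y), dim=1)
                y = nn.functional.silu(self.batchnorm(self.depthwise(y)))
                return self.pointwise2(y).transpose(1, 2)

            def forward(self, x):                        # x: (batch, time, dim)
                x = x + 0.5 * self.ff1(x)                # half-step feed-forward
                a = self.attn_norm(x)
                x = x + self.attn(a, a, a, need_weights=False)[0]
                x = x + self.conv_module(x)              # convolution module captures local patterns
                x = x + 0.5 * self.ff2(x)                # second half-step feed-forward
                return self.final_norm(x)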


  7. ContextNet

    ContextNet is a CNN-RNN transducer model that incorporates global context information into its convolution layers by adding squeeze-and-excitation modules. ContextNet was proposed by Google in 2020 and published in this paper under the same name: “ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context”. In this paper, they followed the RNN-Transducer framework where they used ContextNet as the encoder and a single-layer LSTM as the decoder. …
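
    The squeeze-and-excitation mechanism can be sketched in a few lines (channel counts are illustrative, not ContextNet’s actual configuration): global average pooling over time “squeezes” the sequence into a context vector, which then re-weights each channel.

        # A minimal squeeze-and-excitation (SE) module for 1D convolutions (illustrative sizes).
        import torch
        import torch.nn as nn

        class SqueezeExcite1d(nn.Module):
            def __init__(self, channels=256, reduction=8):
                super().__init__()
                self.fc1 = nn.Linear(channels, channels // reduction)
                self.fc2 = nn.Linear(channels // reduction, channels)

            def forward(self, x):                       # x: (batch, channels, time)
                context = x.mean(dim=-1)                # "squeeze": global average pooling over time
                weights = torch.sigmoid(self.fc2(torch.relu(self.fc1(context))))
                return x * weights.unsqueeze(-1)        # "excite": rescale each channel globally

        out = SqueezeExcite1d()(torch.randn(4, 256, 100))   # output keeps the input shape (4, 256, 100)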


  8. mCPC

    CPC stands for “Contrastive Predictive Coding” which is a learning approach created by DeepMind in 2018 and published in this paper: “Representation Learning with Contrastive Predictive Coding”, to extract useful representations from data in an unsupervised way. Researchers from Facebook AI Research modified the CPC method and explored how to use it to pre-train speech models on unlabeled audio data directly from the raw waveform in their paper Unsupervised Pretraining Transfers Well Across Languages, published in 2020. …


2019

  1. Acoustic BERT

    BERT was created as a language model that deals with textual data. However, have you ever wondered how BERT would perform if we used it on acoustic data? Apparently, some researchers at Facebook AI Research tried to answer that question in 2019 and published a paper called “Effectiveness of Self-Supervised Pre-Training for Speech Recognition” where they tried to fine-tune a pre-trained BERT model for the speech recognition task using the CTC loss function. To enable BERT to deal with input audio data, they tried two different approaches as shown in the following figure: …


  2. T-T: Transformer Transducer

    Transformer Transducer is an end-to-end speech recognition model with Transformer encoders that is able to encode both audio and label sequences independently. It is similar to the Recurrent Neural Network Transducer (RNN-T) model; the only difference is that this model uses Transformer encoders instead of RNNs for information encoding. Transformer Transducer was proposed by Google Research in 2019 and published in this paper: “Transformer Transducer: A Streamable Speech Recognition Model”. The unofficial code for this paper can be found in this GitHub repository: Transformer-Transducer. …


  3. vq-wav2vec

    vq-wav2vec stands for “vector-quantized wav2vec” which is a quantized version of the wav2vec model that learns discrete representations of audio segments. vq-wav2vec was created by Facebook AI Research in 2019 and published in this paper: vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. The official code for this paper can be found in Facebook’s fairseq framework. …


  4. Convolutional Transformer

    Recently, the Transformer architecture has been shown to perform very well for neural machine translation and many other NLP tasks. There has been recent research interest in using Transformer networks for end-to-end ASR, both with the CTC loss (e.g. SAN-CTC) and with an encoder-decoder architecture (e.g. Speech-Transformer). In this paper, “Transformers with convolutional context for ASR”, published in 2019, researchers from FAIR adapted the Transformer architecture to the speech recognition task by replacing the positional embedding in the architecture with a convolutional network. …
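
    The core idea can be sketched roughly as follows (a loose illustration using 1D convolutions; the paper’s actual front-end configuration differs): strided convolutions over the input features stand in for the sinusoidal positional embedding and also reduce the frame rate before the Transformer encoder.

        # A loose sketch: a convolutional front-end provides local ordering/context, so no explicit
        # positional embedding is added before the Transformer encoder (sizes are illustrative).
        import torch
        import torch.nn as nn

        class ConvContextFrontEnd(nn.Module):
            def __init__(self, feat_dim=80, dim=256):
                super().__init__()
                self.convs = nn.Sequential(               # strided convs also subsample the frame rate
                    nn.Conv1d(feat_dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                )
                self.encoder = nn.TransformerEncoder(
                    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=6
                )

            def forward(self, feats):                     # feats: (batch, time, feat_dim) filterbanks
                x = self.convs(feats.transpose(1, 2)).transpose(1, 2)   # (batch, time/4, dim)
                return self.encoder(x)                    # no positional embedding is added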


  5. SpecAugment: Spectrogram Augmentation

    SpecAugment stands for “Spectrogram Augmentation” which is an augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances. This scheme has been shown to be highly effective in enhancing the performance of end-to-end ASR systems. SpecAugment was proposed by Google in 2019 and published in this paper under the same name: “SpecAugment: A simple data augmentation method for Automatic Speech Recognition”. The unofficial code for this paper can be found in this GitHub repository: SpecAugment. …
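
    A minimal sketch of the frequency- and time-masking parts of SpecAugment is shown below (time warping is omitted; the mask widths are illustrative, not the paper’s policies):

        # An illustrative sketch of SpecAugment-style frequency and time masking on a log-mel spectrogram.
        import torch

        def spec_augment(spec, num_freq_masks=2, freq_width=15, num_time_masks=2, time_width=40):
            """spec: (num_mel_bins, num_frames); masked regions are zeroed out on a copy."""
            spec = spec.clone()
            n_mels, n_frames = spec.shape
            for _ in range(num_freq_masks):
                f = torch.randint(0, freq_width + 1, (1,)).item()        # mask width
                f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()   # mask start
                spec[f0:f0 + f, :] = 0.0                                 # drop a band of mel channels
            for _ in range(num_time_masks):
                t = torch.randint(0, time_width + 1, (1,)).item()
                t0 = torch.randint(0, max(1, n_frames - t), (1,)).item()
                spec[:, t0:t0 + t] = 0.0                                 # drop a block of time steps
            return spec

        augmented = spec_augment(torch.randn(80, 1000))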


  6. wav2vec

    Wav2Vec is a fully convolutional neural network that takes raw audio (wav) as input and computes a general representation (vector) that can be input to a speech recognition system. In other words, it’s a model that converts wav to vectors; hence the name. Wav2vec was created by Facebook AI Research in 2019 and published in this paper: Wav2Vec: Unsupervised Pre-Training For Speech Recognition. The official code for this paper can be found as part of the Fairseq framework. …


  7. Jasper

    Jasper is an end-to-end convolutional neural ASR system that uses a stack of 1D convolutions, batch normalization, ReLU, dropout, and residual connections, trained with the CTC loss to obtain state-of-the-art results on the LibriSpeech dataset. Jasper was proposed by Nvidia in 2019 and published in this paper under the same name: “Jasper: An End-to-End Convolutional Neural Acoustic Model”. The official code can be found on Nvidia’s official GitHub repository: Jasper. …
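
    A single Jasper block can be sketched roughly as follows (a simplified illustration, not Nvidia’s implementation; channel counts, kernel size, and the exact placement of the residual are simplified):

        # A simplified Jasper-style block: repeated (1D conv -> batch norm -> ReLU -> dropout)
        # sub-blocks with a residual connection over the whole block.
        import torch
        import torch.nn as nn

        class JasperBlockSketch(nn.Module):
            def __init__(self, channels=256, kernel_size=11, num_sub_blocks=3, dropout=0.2):
                super().__init__()
                self.sub_blocks = nn.ModuleList([
                    nn.Sequential(
                        nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
                        nn.BatchNorm1d(channels),
                        nn.ReLU(),
                        nn.Dropout(dropout),
                    )
                    for _ in range(num_sub_blocks)
                ])

            def forward(self, x):                    # x: (batch, channels, time)
                residual = x
                for sub in self.sub_blocks:
                    x = sub(x)
                return x + residual                  # residual connection over the block

        out = JasperBlockSketch()(torch.randn(2, 256, 300))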


  8. TDS Conv

    Time-Depth Separable (TDS) Convolution is a fully convolutional encoder architecture accompanied by a simple and efficient decoder. This encoder was proposed by Facebook AI Research in 2019 and published in this paper: Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions. The unofficial implementation for this encoder can be found in the following GitHub repository: tds.py. …


2018

  1. Speech-Transformer

    Speech Transformer is an end-to-end Automatic Speech Recognition (ASR) system that adapts the Transformer architecture to the speech recognition task and can be trained faster and more efficiently than sequence-to-sequence models such as the RNN-Transducer. Speech Transformer was proposed by the Chinese Academy of Sciences in 2018 and published in this paper under the same name: “Speech-Transformer: A No-Recurrence Sequence-To-Sequence Model”. …


2017

  1. Wav2text

    Wav2Text is a model that was created by the Nara Institute of Science and Technology in Japan and published in this paper: Attention-Based Wav2Text With Feature Transfer Learning. …


2016

  1. Wav2Letter

    Wav2Letter is an end-to-end model for speech recognition that combines convolutional networks with graph decoding. Wav2letter was trained on the speech signal to transcribe letters/characters, hence the name “wav-to-letter”. Wav2letter was created by Facebook in 2016 and published in this paper: Wav2Letter: an End-to-End ConvNet-based Speech Recognition System. The official code for this paper can be found in Flashlight’s official GitHub repository: wav2letter++. …


2015

  1. Deep Speech 2.0

    Deep Speech 2 is a model created by Baidu in December 2015 (exactly one year after Deep Speech) and published in their paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. This paper is considered a follow-up to the Deep Speech paper; the authors extended the original architecture to make it bigger while achieving a 7× speedup and a 43.4% relative improvement in WER. The authors also incorporated convolution layers, as shown in the following figure: …


  2. LAS: Listen, Attend and Spell

    LAS stands for “Listen, Attend and Spell” which is an acoustic model that learns to transcribe an audio signal into a word sequence, one character at a time. LAS was created by Google Brain in 2015 and published in this paper under the same name: Listen, Attend and Spell. …


2014

  1. Deep Speech

    Deep Speech is a well-optimized end-to-end RNN system for speech recognition created by Baidu Research in 2014 and published in their paper: Deep Speech: Scaling up end-to-end speech recognition. Deep Speech is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. …


2012

  1. RNN-T: RNN Transducer

    RNN-T stands for “Recurrent Neural Network Transducer” which is a promising architecture for general-purpose sequence transduction tasks such as audio transcription, built using RNNs. RNN-T was proposed by Alex Graves at the University of Toronto back in 2012 and published under the name: Sequence Transduction with Recurrent Neural Networks. This paper introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence. …
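
    A minimal sketch of the transducer’s “joint” step helps make this concrete (layer sizes are illustrative): the acoustic encoding at time t and the prediction-network state after u output labels are combined into a distribution over the vocabulary plus a blank symbol, for every (t, u) pair.

        # An illustrative RNN-T joint network: combines encoder frames and prediction-network states
        # into per-(t, u) logits over the vocabulary plus blank (sizes are placeholders).
        import torch
        import torch.nn as nn

        class TransducerJointSketch(nn.Module):
            def __init__(self, enc_dim=320, pred_dim=320, joint_dim=512, vocab_size=29):
                super().__init__()
                self.enc_proj = nn.Linear(enc_dim, joint_dim)
                self.pred_proj = nn.Linear(pred_dim, joint_dim)
                self.out = nn.Linear(joint_dim, vocab_size + 1)   # +1 for the blank symbol

            def forward(self, enc, pred):
                # enc: (batch, T, enc_dim) acoustic frames; pred: (batch, U, pred_dim) label-history states
                joint = torch.tanh(self.enc_proj(enc).unsqueeze(2) + self.pred_proj(pred).unsqueeze(1))
                return self.out(joint)                            # (batch, T, U, vocab_size + 1) logits

        logits = TransducerJointSketch()(torch.randn(2, 100, 320), torch.randn(2, 20, 320))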


2006

  1. CTC

    Datasets for speech recognition usually consist of audio clips and their corresponding transcripts. The main issue with these datasets is that we don’t know how the characters in the transcript align with the audio. Without this alignment, it would be very hard to train a speech recognition model since people’s rates of speech vary. CTC provides a solution to this problem. …
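
    The trick is to let the model emit one label per audio frame, including a special blank symbol, and then collapse that frame-level sequence into the final transcript by merging repeats and removing blanks. Here is a tiny sketch of the collapse rule, applied greedily to a best-path prediction (with “-” as the blank):

        # A minimal sketch of the CTC collapse rule: merge repeated labels, then drop blanks.
        BLANK = "-"

        def ctc_collapse(frame_labels):
            """Collapse a per-frame label sequence into the output transcript."""
            out = []
            prev = None
            for label in frame_labels:
                if label != prev and label != BLANK:   # keep a label only when it starts a new run
                    out.append(label)
                prev = label
            return "".join(out)

        # Two different frame-level alignments of the same word map to the same transcript,
        # so the training data never needs frame-exact character boundaries.
        print(ctc_collapse(list("hh-e-lll-lo-")))    # -> "hello"
        print(ctc_collapse(list("h-ee-ll-llo--")))   # -> "hello"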