2021

  1. VITS

    VITS stands for “Variational Inference with adversarial learning for end-to-end Text-to-Speech”. It is a single-stage non-autoregressive Text-to-Speech model that is able to generate more natural-sounding audio than two-stage models such as Tacotron 2, Transformer TTS, or even Glow-TTS. VITS was proposed by Kakao Enterprise in 2021 and published in this paper: “Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech”. The official implementation for this paper can be found in this GitHub repository: vits. The official synthetic audio samples resulting from VITS can be found on this website. …
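
    Since VITS is, at its core, a conditional VAE trained with an adversarial objective, a toy version of its generator-side loss makes the idea concrete. The following is a minimal PyTorch sketch under my own naming, not the official repo’s code: VITS actually reconstructs mel-spectrograms of the generated waveforms and adds feature-matching and duration losses, all omitted here.

    ```python
    import torch
    import torch.nn.functional as F

    def kl_divergence(mu, logvar):
        # KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
        return 0.5 * torch.mean(torch.sum(mu**2 + logvar.exp() - logvar - 1, dim=-1))

    def generator_loss(real_wave, fake_wave, mu, logvar, disc_fake_logits):
        recon = F.l1_loss(fake_wave, real_wave)        # reconstruction term
        kl = kl_divergence(mu, logvar)                 # variational (CVAE) term
        adv = torch.mean((disc_fake_logits - 1) ** 2)  # least-squares GAN term
        return recon + kl + adv

    # Toy shapes: two 1-second 16 kHz waveforms and 64-dim latents.
    real = torch.randn(2, 16000)
    fake = torch.randn(2, 16000, requires_grad=True)
    mu, logvar = torch.randn(2, 64), torch.randn(2, 64)
    logits = torch.randn(2, 1)
    generator_loss(real, fake, mu, logvar, logits).backward()
    ```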


2020

  1. Wave-Tacotron

    Wave-Tacotron is a single-stage end-to-end Text-to-Speech (TTS) system that generates speech waveforms directly from text inputs. Wave-Tacotron was proposed by Google Research in 2020 and published in this paper under the same name: “Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis”. The official audio samples from Wave-Tacotron can be found on this website. Sadly, I couldn’t find any public implementation for this paper. …


  2. HiFi-GAN

    HiFi-GAN stands for “High Fidelity Generative Adversarial Network”. It is a neural vocoder that generates high-fidelity speech from mel-spectrograms more efficiently than autoregressive vocoders (e.g. WaveNet, ClariNet) or other GAN-based vocoders (e.g. GAN-TTS, MelGAN). HiFi-GAN was proposed by Kakao Enterprise in 2020 and published in this paper under the same name: “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis”. The official implementation for this paper can be found in this GitHub repository: hifi-gan. Also, the official audio samples can be found on this website. …
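
    At its heart, the HiFi-GAN generator is a stack of transposed convolutions that upsamples mel frames by the hop length (8 × 8 × 2 × 2 = 256). Below is a heavily simplified PyTorch sketch of that skeleton with illustrative hyperparameters; the real generator inserts multi-receptive-field ResBlocks after every upsampling stage and is trained against multi-period and multi-scale discriminators.

    ```python
    import torch
    import torch.nn as nn

    class TinyMelVocoder(nn.Module):
        """Drastically simplified HiFi-GAN-style generator: transposed 1-D
        convolutions that upsample 80-band mel frames to a waveform."""

        def __init__(self, n_mels=80, base_channels=128, up_factors=(8, 8, 2, 2)):
            super().__init__()
            layers = [nn.Conv1d(n_mels, base_channels, kernel_size=7, padding=3)]
            ch = base_channels
            for f in up_factors:  # total upsampling = 8*8*2*2 = 256 = hop length
                layers += [
                    nn.LeakyReLU(0.1),
                    # kernel=2f, stride=f, padding=f/2 gives exact f-times upsampling
                    nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * f, stride=f, padding=f // 2),
                ]
                ch //= 2
            layers += [nn.LeakyReLU(0.1), nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
            self.net = nn.Sequential(*layers)

        def forward(self, mel):       # mel: (batch, n_mels, frames)
            return self.net(mel)      # -> (batch, 1, frames * 256)

    mel = torch.randn(1, 80, 100)     # 100 mel frames
    wave = TinyMelVocoder()(mel)
    print(wave.shape)                 # torch.Size([1, 1, 25600])
    ```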


  3. FastSpeech 2

    FastSpeech was a novel non-autoregressive TTS model that achieved results on par with its autoregressive counterparts while being 38 times faster. Despite these advantages, FastSpeech had three main issues: …


  4. Glow-TTS

    Glow-TTS is a flow-based generative non-autoregressive model for two-staged Text-to-Speech systems. Given an input text, Glow-TTS is able to generate a mel-spectrogram without requiring any external aligner, unlike other models such as FastSpeech or ParaNet. The Glow-TTS model was proposed by Kakao Enterprise and published in this paper under the same name: “Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search”. The official implementation of this model can be found in this GitHub repository: glow-tts. The official synthetic audio samples resulting from Glow-TTS can be found on this website. …
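
    The “Monotonic Alignment Search” (MAS) in the title is a Viterbi-like dynamic program, and it is what lets Glow-TTS drop the external aligner. The sketch below is my own NumPy rendering of the idea, not the paper’s implementation: given a log-likelihood score for every (text token, mel frame) pair, find the monotonic, non-skipping alignment with the highest total score.

    ```python
    import numpy as np

    def monotonic_alignment_search(log_likelihood):
        """log_likelihood[j, i] scores text token j against mel frame i.
        Dynamic programming over 'stay on token j' vs. 'advance to j',
        then backtracking to get each frame's token."""
        n_text, n_mel = log_likelihood.shape
        Q = np.full((n_text, n_mel), -np.inf)
        Q[0, 0] = log_likelihood[0, 0]
        for i in range(1, n_mel):
            for j in range(min(i + 1, n_text)):
                stay = Q[j, i - 1]
                advance = Q[j - 1, i - 1] if j > 0 else -np.inf
                Q[j, i] = log_likelihood[j, i] + max(stay, advance)
        # Backtrack from the last token at the last frame.
        alignment = np.zeros(n_mel, dtype=int)
        j = n_text - 1
        for i in range(n_mel - 1, -1, -1):
            alignment[i] = j
            if i > 0 and j > 0 and Q[j - 1, i - 1] >= Q[j, i - 1]:
                j -= 1
        return alignment

    scores = np.log(np.random.rand(3, 10))   # 3 tokens, 10 frames
    align = monotonic_alignment_search(scores)
    print(align, np.bincount(align))         # frame-to-token map, per-token durations
    ```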


2019

  1. FastSpeech

    FastSpeech is a novel non-autoregressive Text-to-Speech (TTS) model based on the Transformer architecture. It takes a text (phoneme) sequence as input and generates mel-spectrograms non-autoregressively; a vocoder is then used to convert the spectrograms to audio waveforms. FastSpeech was proposed by Microsoft in 2019 and published in this paper under the same name: “FastSpeech: Fast, Robust and Controllable Text to Speech”. The official synthesized speech samples resulting from FastSpeech can be found on this website. The unofficial PyTorch implementation of FastSpeech can be found in this GitHub repository: FastSpeech. …
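
    The key to FastSpeech’s parallelism is its length regulator: a small duration predictor estimates how many mel frames each phoneme spans, and the encoder states are repeated accordingly so that all frames can be decoded at once. A minimal sketch of the regulator itself (the duration predictor and Transformer blocks are omitted, and the sizes are illustrative):

    ```python
    import torch

    def length_regulator(hidden, durations):
        """Repeat each phoneme's hidden state according to its predicted
        duration so the sequence matches the mel-spectrogram length.

        hidden:    (n_phonemes, d_model) encoder outputs
        durations: (n_phonemes,) integer mel-frame count per phoneme
        """
        return torch.repeat_interleave(hidden, durations, dim=0)

    hidden = torch.randn(4, 256)                # 4 phonemes
    durations = torch.tensor([3, 5, 2, 6])      # frames per phoneme
    expanded = length_regulator(hidden, durations)
    print(expanded.shape)                       # torch.Size([16, 256])
    ```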


2018

  1. WaveGlow

    WaveGlow is a flow-based generative vocoder capable of generating high-quality speech waveforms from mel-spectrograms. WaveGlow got its name because it combines insights from Glow (a flow-based generative model created by OpenAI in 2018) and WaveNet (another vocoder model) in order to provide fast, efficient, and high-quality audio synthesis. WaveGlow was proposed by NVIDIA in 2018 and published in this paper under the same name: “WaveGlow: A Flow-based Generative Network for Speech Synthesis”. The official PyTorch implementation of this paper can be found in NVIDIA’s official GitHub repository: NVIDIA/waveglow. The official synthetic audio samples resulting from WaveGlow can be found on this website. …
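
    The “flow” half of WaveGlow is built from invertible affine coupling layers. The minimal PyTorch sketch below shows why such a layer is trivially invertible; in WaveGlow the scale-and-shift network is a WaveNet-like convolution stack conditioned on the mel-spectrogram, which I have replaced with a single convolution, and the invertible 1x1 convolutions between layers are omitted.

    ```python
    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        """Half the channels pass through untouched and parameterize a
        scale/shift of the other half, so inversion is exact and cheap."""

        def __init__(self, channels):
            super().__init__()
            self.net = nn.Conv1d(channels // 2, channels, kernel_size=3, padding=1)

        def forward(self, x):
            xa, xb = x.chunk(2, dim=1)
            log_s, t = self.net(xa).chunk(2, dim=1)
            return torch.cat([xa, xb * log_s.exp() + t], dim=1), log_s.sum()

        def inverse(self, z):
            za, zb = z.chunk(2, dim=1)
            log_s, t = self.net(za).chunk(2, dim=1)
            return torch.cat([za, (zb - t) * (-log_s).exp()], dim=1)

    layer = AffineCoupling(8)
    x = torch.randn(2, 8, 100)
    z, logdet = layer(x)
    print(torch.allclose(layer.inverse(z), x, atol=1e-5))  # True
    ```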


  2. Transformer TTS

    Transformer TTS is an autoregressive TTS system that combines the advantages of Tacotron 2 and the Transformer in one model: the multi-head attention mechanism replaces the RNN structures in the encoder and decoder, as well as the vanilla attention network between them. Transformer TTS was proposed by Microsoft in 2018 and published in this paper: “Neural Speech Synthesis with Transformer Network”. The official audio samples resulting from this model can be found on this website. The unofficial PyTorch implementation of this paper can be found in this GitHub repository: Transformer-TTS. …
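
    The core substitution is easy to picture: where Tacotron 2’s encoder runs a recurrent network over phoneme embeddings, Transformer TTS runs stacked self-attention. A sketch of the encoder side using PyTorch’s built-in modules; the paper also adds scaled positional encodings (omitted here), and the hyperparameters below are illustrative rather than the paper’s.

    ```python
    import torch
    import torch.nn as nn

    d_model, n_phonemes = 256, 80

    embed = nn.Embedding(n_phonemes, d_model)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
        num_layers=3,
    )

    tokens = torch.randint(0, n_phonemes, (1, 32))   # one 32-phoneme sentence
    memory = encoder(embed(tokens))                  # (1, 32, 256), no RNN involved
    print(memory.shape)
    ```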


  3. SV2TTS

    SV2TTS stands for “Speaker Verification to Text-to-Speech”. It is a neural network-based text-to-speech (TTS) system that is able to generate speech audio in the voices of different speakers, including those unseen during training. SV2TTS was proposed by Google in 2018 and published in this paper: “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis”. The official audio samples output by this model can be found on this website. The unofficial implementation for this paper can be found in this GitHub repository: Real-Time-Voice-Cloning. …
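
    SV2TTS is a three-part pipeline: a speaker encoder trained on a speaker-verification task turns a few seconds of reference audio into a fixed-size embedding, a Tacotron 2-style synthesizer is conditioned on that embedding, and a vocoder renders the waveform. The conditioning step itself is just a broadcast-and-concatenate; here is a sketch with illustrative dimensions:

    ```python
    import torch

    def condition_on_speaker(encoder_outputs, speaker_embedding):
        """Broadcast a speaker embedding across time and concatenate it to
        every encoder timestep of the synthesizer, so the decoder generates
        mel frames in that speaker's voice."""
        t = encoder_outputs.size(1)
        e = speaker_embedding.unsqueeze(1).expand(-1, t, -1)    # (B, T, d_spk)
        return torch.cat([encoder_outputs, e], dim=-1)          # (B, T, d_enc + d_spk)

    enc = torch.randn(2, 40, 512)      # synthesizer encoder states
    spk = torch.randn(2, 256)          # fixed-size speaker embedding
    print(condition_on_speaker(enc, spk).shape)   # torch.Size([2, 40, 768])
    ```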


2017

  1. Tacotron 2

    Tacotron 2 is a two-staged text-to-speech (TTS) model that synthesizes speech directly from characters. Given (text, audio) pairs, Tacotron 2 can be trained completely from scratch with random initialization to output mel-spectrograms without any phoneme-level alignment. After that, a vocoder model (a modified WaveNet in the paper) is used to convert the spectrograms to waveforms. Tacotron 2 was proposed by the same main authors who proposed Tacotron earlier in the same year (2017). Tacotron 2 was published in this paper: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The official audio samples output by Google’s trained Tacotron 2 are provided on this website. An unofficial PyTorch implementation of Tacotron 2 can be found in NVIDIA’s GitHub repository: NVIDIA/tacotron2. …
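
    To make the two-stage structure concrete, here is a data-flow-only sketch: both stages below are stand-in modules with made-up internals, and only the shapes and the hand-off (character IDs → mel-spectrogram → waveform) reflect the actual pipeline.

    ```python
    import torch
    import torch.nn as nn

    class SpectrogramModel(nn.Module):            # stage 1 stand-in
        def __init__(self, vocab=64, n_mels=80):
            super().__init__()
            self.embed = nn.Embedding(vocab, n_mels)
        def forward(self, chars):                 # (B, T_text) -> (B, n_mels, T_mel)
            return self.embed(chars).transpose(1, 2)

    class Vocoder(nn.Module):                     # stage 2 stand-in
        def __init__(self, n_mels=80, hop=256):
            super().__init__()
            self.up = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop, stride=hop)
        def forward(self, mel):                   # (B, n_mels, T_mel) -> (B, 1, T_mel*hop)
            return self.up(mel)

    chars = torch.randint(0, 64, (1, 20))         # "text" as character IDs
    mel = SpectrogramModel()(chars)               # stage 1: text -> mel
    wave = Vocoder()(mel)                         # stage 2: mel -> waveform
    print(mel.shape, wave.shape)
    ```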


  2. Tacotron

    Tacotron is a two-staged generative text-to-speech (TTS) model that synthesizes speech directly from characters. Given (text, audio) pairs, Tacotron can be trained completely from scratch with random initialization to output spectrograms without any phoneme-level alignment. After that, the Griffin-Lim algorithm is used in the original paper to convert the linear-scale spectrograms to waveforms. Tacotron was proposed by Google in 2017 and published in this paper under the same name: Tacotron: Towards End-to-End Speech Synthesis. The official audio samples output by Google’s trained Tacotron are provided on this website. The unofficial TensorFlow implementation of Tacotron can be found in this GitHub repository: tacotron. …
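
    Unlike Tacotron 2’s neural vocoder, the original Tacotron inverts its predicted linear-scale magnitude spectrogram with the classical Griffin-Lim algorithm. A self-contained example using librosa’s implementation, with a synthetic sine wave standing in for the network’s prediction:

    ```python
    import numpy as np
    import librosa

    # 1 second of a 440 Hz sine standing in for real speech.
    sr = 22050
    t = np.linspace(0, 1, sr, endpoint=False)
    y = 0.5 * np.sin(2 * np.pi * 440 * t)

    # Pretend this magnitude spectrogram came out of Tacotron's seq2seq stage.
    # (The paper also raises the magnitudes to a power of ~1.5 before
    # inversion to reduce artifacts.)
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

    # Stage 2: recover a waveform by iteratively estimating the phase.
    y_hat = librosa.griffinlim(S, n_iter=60, n_fft=1024, hop_length=256)
    print(y_hat.shape)
    ```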


2016

  1. WaveNet

    WaveNet is a generative deep neural network for generating raw audio waveforms, based on the PixelCNN architecture. WaveNet was proposed by DeepMind in 2016 and published in this paper: WaveNet: A Generative Model for Raw Audio. The official audio samples output by the trained WaveNet are provided on this website. The unofficial TensorFlow implementation of WaveNet can be found in this GitHub repository: tensorflow-wavenet. …
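
    The architectural signature WaveNet takes from PixelCNN is the stack of dilated causal convolutions, which grows the receptive field exponentially with depth while keeping every output dependent only on current and past samples. A minimal PyTorch sketch of that stack; the gated activations, residual/skip connections, and softmax over quantized samples of the real model are omitted.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalDilatedStack(nn.Module):
        """Dilated causal 1-D convolutions with dilation doubling per layer:
        1, 2, 4, ..., 128 for the default 8 layers."""

        def __init__(self, channels=32, n_layers=8, kernel_size=2):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv1d(channels, channels, kernel_size, dilation=2 ** i)
                for i in range(n_layers)
            )

        def forward(self, x):                     # x: (batch, channels, time)
            for conv in self.convs:
                pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
                x = torch.relu(conv(F.pad(x, (pad, 0))))   # left-pad keeps it causal
            return x

    net = CausalDilatedStack()
    x = torch.randn(1, 32, 1000)
    print(net(x).shape)                           # torch.Size([1, 32, 1000])
    ```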