XLSR stands for “Cross-Lingual Speech Representation”. It is a large-scale multilingual speech recognition model based on wav2vec 2.0 that is pre-trained on raw waveforms from multiple languages to learn speech representations in an unsupervised manner. XLSR was proposed by Facebook AI Research in 2020 and published in their paper: “Unsupervised Cross-Lingual Representation Learning for Speech Recognition”. The official code of this paper can be found as part of FairSeq’s official GitHub repository: fairseq/wav2vec.

As we can see from the following figure, XLSR has the same architecture as wav2vec 2.0; the only difference is that XLSR is trained on multiple languages, unlike wav2vec 2.0 which was trained only on English. The quantization module in XLSR produces multilingual quantized speech units whose embeddings are then used as targets for a Transformer trained by contrastive learning. The model learns to share discrete tokens across languages, creating bridges among them.
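The contrastive objective can be sketched in a few lines of PyTorch. This is a simplified illustration, not the paper's implementation: the paper samples a fixed number of distractors per masked step, whereas here every other time step of the utterance serves as a distractor, and the tensor shapes are made up for the example:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, temperature=0.1):
    """InfoNCE-style loss: for each masked time step, the context vector
    must identify the true quantized latent among distractors drawn from
    the other time steps of the same utterance."""
    T, D = context.shape
    # cosine similarity between every context vector and every quantized
    # latent, scaled by a temperature -> (T, T) similarity matrix
    sim = F.cosine_similarity(context.unsqueeze(1),
                              quantized.unsqueeze(0), dim=-1) / temperature
    # the true target for step t is quantized[t]; all other rows of
    # `quantized` act as distractors
    targets = torch.arange(T)
    return F.cross_entropy(sim, targets)

torch.manual_seed(0)
ctx = torch.randn(50, 256)  # context-network outputs at 50 masked steps
qt = torch.randn(50, 256)   # quantized latents (the targets)
loss = contrastive_loss(ctx, qt)
```

Minimizing this loss pushes each context vector toward its own quantized unit and away from the units at other time steps, which is what lets the shared codebook act as a bridge between languages.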

Note to Reader:
Reading the wav2vec 2.0 post is strongly recommended before reading this one.


To train XLSR models, they used three different multilingual speech datasets:

  • CommonVoice: a multilingual corpus of read speech comprising more than two thousand hours of speech data in 38 languages.

  • BABEL: a multilingual corpus of conversational telephone speech covering 14 Asian and African languages.

  • MLS: the Multilingual LibriSpeech dataset, a large corpus derived from read audiobooks of LibriVox consisting of 8 languages.

Pre-trained Models
------------------

During pre-training, they trained two different configurations of the XLSR model; both have the same architecture but different Transformer settings, as seen in the following table. During fine-tuning, a classifier representing the output vocabulary of the respective downstream task is built on top of the model and trained on the labeled data with a CTC loss.
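A minimal sketch of such a fine-tuning head in PyTorch (the dimensions, vocabulary size, and the `CTCHead` name are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class CTCHead(nn.Module):
    """Linear classifier placed on top of the pre-trained encoder
    outputs: one logit per vocabulary token plus the CTC blank symbol
    (index 0 by convention here)."""
    def __init__(self, hidden_dim=1024, vocab_size=32):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size + 1)  # +1 for blank

    def forward(self, encoder_out):            # (T, B, hidden_dim)
        return self.proj(encoder_out).log_softmax(-1)

torch.manual_seed(0)
head = CTCHead()
feats = torch.randn(120, 2, 1024)              # stand-in for encoder output
log_probs = head(feats)                        # (T, B, vocab + 1)
targets = torch.randint(1, 33, (2, 20))        # dummy character targets
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 120),  # input lengths
                           torch.full((2,), 20))   # target lengths
```

In practice the head and the encoder are trained jointly on the labeled data; only the head is new per downstream task.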

The following is a list of all pre-trained models in the paper:

  • **XLSR-English:** An XLSR-base model pre-trained on English audio data only.

  • **10x XLSR-Monolingual:** Ten different XLSR-base models, each pre-trained on audio data from a single language: Spanish (es), French (fr), Italian (it), Kyrgyz (ky), Dutch (nl), Russian (ru), Swedish (sv), Turkish (tr), Tatar (tt) and Chinese (zh).

  • **XLSR-10:** Multilingual XLSR-base and XLSR-large models pre-trained on the same 10 languages listed above.

  • **XLSR-53:** An XLSR-large model pre-trained on 56,000 hours of audio data from 53 different languages.

Experiments & Results
---------------------

In the first experiment, they compared XLSR-Monolingual and XLSR-10 pre-trained models fine-tuned individually on [CommonVoice](https://commonvoice.mozilla.org/en/languages). As we can see from the following table, XLSR-10 (base) obtains $13.6$ PER (Phoneme Error Rate) on average, which is a relative PER reduction of $49\%$ compared to XLSR-Monolingual. XLSR-10 (large) obtains $12.2$ PER on average, which is a relative PER reduction of $72\%$ compared to [mCPC](https://anwarvic.github.io/speech-recognition/mCPC):
The same experiment was run on different languages from the [BABEL](https://catalog.ldc.upenn.edu/byyear) dataset, and similar results were found. XLSR-10 (base) improves over XLSR-Monolingual by $18\%$ relative CER (Character Error Rate), and XLSR-10 (large) reduces CER by $38\%$ relative to the multi-BLSTMP+VGG baseline:
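The relative reductions quoted in these experiments all follow the same formula, $(\text{baseline} - \text{new}) / \text{baseline}$. A quick sanity check in Python (the baseline value below is illustrative only, not a number from the paper):

```python
def relative_reduction(baseline, new):
    """Relative error-rate reduction: (baseline - new) / baseline."""
    return (baseline - new) / baseline

# Illustrative numbers only: a PER that drops from 25.0 to 13.6
# is a 45.6% relative reduction.
print(round(relative_reduction(25.0, 13.6) * 100, 1))  # 45.6
```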
These results show that the approach used to train XLSR is very effective for learning generic cross-lingual representations in an unsupervised way. They also show that cross-lingual transfer is particularly effective on low-resource languages such as Swedish (sv) and Turkish (tr). To better assess the cross-lingual transfer of the learned representations, they evaluated the XLSR-10 model trained on the [BABEL](https://catalog.ldc.upenn.edu/byyear) dataset using four languages not seen during pre-training. They fine-tuned this model on each language and compared it to monolingual models pre-trained specifically on these languages. The following table shows that XLSR-10 still outperforms XLSR-Monolingual.
This further suggests that the learned representations capture generic features of the speech signal which transfer to many languages.

XLSR-53
-------

When we fine-tune the pre-trained model on each language individually, we end up with a different model for each language. Multilingual speech recognition, on the other hand, aims to build a single model for all languages that performs as well as or better than individual monolingual models. In the paper, they trained an XLSR-large model on 53 different languages using 56,000 hours of audio data. XLSR-53 outperformed all other pre-trained models on all languages, as shown in the following two tables (the first table for the [CommonVoice](https://commonvoice.mozilla.org/en/languages) dataset, and the other for the [BABEL](https://catalog.ldc.upenn.edu/byyear) dataset):
On the [MLS](https://www.openslr.org/94/) dataset, they fine-tuned XLSR-53 on 1h, 10h, 100h, and the full data to evaluate the few-shot capability of the model. As shown by the WER reported in the following table, XLSR-53 significantly outperforms the LibriVox [wav2vec 2.0](https://anwarvic.github.io/speech-recognition/wav2vec_2) model on all languages except English:
| Model Size | $$N$$ | $$d_{m}$$ | $$d_{\text{ff}}$$ | $$h$$ | Dropout | LR peak |
| --- | --- | --- | --- | --- | --- | --- |
| BASE | 12 | 768 | 3072 | 8 | 0.1 | $$1e^{-5}$$ |
| LARGE | 24 | 1024 | 4096 | 12 | 0.1 | $$1e^{-3}$$ |