XLM-E: XLM via ELECTRA
XLM-E stands for “Cross-lingual Language Modeling via ELECTRA”, a cross-lingual language model pretrained on two ELECTRA-style tasks, as we will see later. XLM-E was proposed by Microsoft in 2021 and published in their paper: Cross-lingual Language Model Pre-training via ELECTRA. The official code for this paper can be found in Microsoft’s GitHub repository: microsoft/unilm.
Pretraining a language model and then fine-tuning it on downstream tasks has become the de facto approach in the field. However, pretraining techniques such as Masked Language Modeling (MLM) and Translation Language Modeling (TLM) usually require massive computational resources. As shown in the following figure, XLM-E (red line) achieves a 130x speedup compared with XLM-R augmented with TLM, and around a 100x speedup compared with XLM-R, XLM-Align, and InfoXLM:
Pretraining
Similar to ELECTRA, XLM-E has two Transformer components: a generator and a discriminator. The generator predicts the masked tokens given the masked sentence or translation pair, and the discriminator decides whether each token has been replaced by the generator. XLM-E was pretrained on two tasks: multilingual replaced token detection (MRTD) and translation replaced token detection (TRTD). The overall training objective is to minimize:
\[\mathcal{L} = \mathcal{L}_{\text{MLM}} + \lambda\mathcal{L}_{\text{MRTD}} + \mathcal{L}_{\text{TLM}} + \lambda\mathcal{L}_{\text{TRTD}}\]
Multilingual RTD
This pretraining task is a multilingual form of the Replaced Token Detection (RTD) task introduced with ELECTRA. As a short recap, the RTD task requires the model to distinguish real input tokens from replaced ones in corrupted sentences. The input sentences are corrupted by the generator, and the discriminator must tell the real tokens apart from the replaced ones, as shown in the following figure:
Multilingual RTD works the same way, with a few differences:

The input text can be in various languages.

Both the generator and the discriminator are shared across languages. The vocabulary is also shared for different languages.

Masking is done uniformly, while it was only $15\%$ in the ELECTRA paper.
Note:
They also tried span masking, but it significantly weakened the generator’s prediction accuracy, which in turn harmed the pretraining.
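The corruption step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `[MASK]` symbol, the `mask_prob` parameter, the helper names, and the stand-in generator that always proposes the same token are all hypothetical, and the label convention (1 for a replaced token, 0 for an original one) is one possible reading of the RTD labels.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol

def uniform_mask(tokens, mask_prob, rng):
    """Choose positions uniformly at random and mask them."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(positions)

def corrupt(tokens, positions, sample_token):
    """Fill every masked position with a token proposed by the generator."""
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = sample_token(i)
    return corrupted

def rtd_labels(original, corrupted):
    """Label a position 1 if its token differs from the original, else 0."""
    return [int(o != c) for o, c in zip(original, corrupted)]

rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, positions = uniform_mask(tokens, mask_prob=0.5, rng=rng)
# Stand-in generator: always proposes "dog" (a real generator samples from p_G).
corrupted = corrupt(tokens, positions, lambda i: "dog")
labels = rtd_labels(tokens, corrupted)
```

Because the labels are computed by comparing the corrupted sequence with the original, a position where the generator happens to propose the correct token is labeled as original, which matches how ELECTRA treats such cases.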
Given an input sequence $x$ that was masked at a set of random positions $M_{e}$, the loss function of the generator $G$ is:
\[\mathcal{L}_{\text{MLM}}\left( x;\theta_{G} \right) = - \sum_{i \in M_{e}}{\log p_{G}\left( x_{i} \middle| x^{\text{masked}} \right)}\]
The loss function of the discriminator $D$ is the following, where $n$ is the length of $x$ and $r_{i}$ is the label of the output (1 for “yes” and 0 for “no”):
\[\mathcal{L}_{\text{MRTD}}\left( x;\theta_{D} \right) = - \sum_{i = 1}^{n}{\log p_{D}\left( r_{i} \middle| x^{\text{corrupt}} \right)}\]
Translation RTD
Translation RTD is a novel discriminative pretraining task which aims to distinguish real input tokens from replaced ones in corrupted translation pairs. An input translation pair $\left( e,f \right)$ is concatenated into a single sequence and then treated the same way as in MRTD, as shown in the following figure:
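As a rough sketch of the input construction for a translation pair, the snippet below masks each side separately and then concatenates the two sides. The separator token `</s>`, the `[MASK]` symbol, the `mask_prob` parameter, and the function name are assumptions made for illustration and are not taken from the paper.

```python
import random

def mask_translation_pair(e, f, mask_prob, rng, mask="[MASK]", sep="</s>"):
    """Mask positions in e and f separately, then concatenate the two sides."""
    def pick(n):
        # Uniformly choose which positions of a side to mask.
        k = max(1, round(n * mask_prob))
        return set(rng.sample(range(n), k))

    m_e, m_f = pick(len(e)), pick(len(f))
    masked_e = [mask if i in m_e else tok for i, tok in enumerate(e)]
    masked_f = [mask if i in m_f else tok for i, tok in enumerate(f)]
    return masked_e + [sep] + masked_f, sorted(m_e), sorted(m_f)

rng = random.Random(0)
e = ["I", "like", "green", "tea"]
f = ["j'aime", "le", "thé", "vert"]
sequence, m_e, m_f = mask_translation_pair(e, f, mask_prob=0.25, rng=rng)
```

Since both sides sit in one sequence, the generator can use the unmasked translation as context when filling a masked position.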
Given a concatenated translation pair $\left\lbrack e;f \right\rbrack$ that was masked using $M_{e}$ and $M_{f}$ sets of random positions for $e$ and $f$ respectively, the loss function of the generator $G$ is:
\[\mathcal{L}_{\text{TLM}}\left( e,f;\theta_{G} \right) = - \sum_{i \in M_{e}}{\log p_{G}\left( e_{i} \middle| \left\lbrack e;f \right\rbrack^{\text{masked}} \right)} - \sum_{i \in M_{f}}{\log p_{G}\left( f_{i} \middle| \left\lbrack e;f \right\rbrack^{\text{masked}} \right)}\]
The loss function of the discriminator $D$ is the following, where $n$ is the length of the concatenation and $r_{i}$ is the label of the output (1 for “yes” and 0 for “no”):
\[\mathcal{L}_{\text{TRTD}}\left( e,f;\theta_{D} \right) = - \sum_{i = 1}^{n}{\log p_{D}\left( r_{i} \middle| \left\lbrack e;f \right\rbrack^{\text{corrupt}} \right)}\]
Note:
The generators in these pretraining tasks act like language models. That’s why their loss functions were marked as $\mathcal{L}_{\text{MLM}}$ and $\mathcal{L}_{\text{TLM}}$.
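To make the four loss terms concrete, here is a minimal Python sketch that treats each one as a negative log-likelihood to be minimized. The function names and the input format (lists of probabilities already produced by the models) are hypothetical; the default weight of 50 for $\lambda$ is the value the paper reports for pretraining.

```python
import math

def generator_loss(probs_of_true_token):
    """-sum_i log p_G(x_i | x^masked), summed over the masked positions."""
    return -sum(math.log(p) for p in probs_of_true_token)

def discriminator_loss(p_replaced, labels):
    """-sum_i log p_D(r_i | x^corrupt), where p_replaced[i] = P(r_i = 1)."""
    total = 0.0
    for p, r in zip(p_replaced, labels):
        total -= math.log(p if r == 1 else 1.0 - p)
    return total

def total_loss(l_mlm, l_mrtd, l_tlm, l_trtd, lam=50.0):
    """L = L_MLM + lam * L_MRTD + L_TLM + lam * L_TRTD."""
    return l_mlm + lam * l_mrtd + l_tlm + lam * l_trtd

# Toy numbers: two masked positions, two discriminator decisions.
l_mlm = generator_loss([0.5, 0.25])
l_mrtd = discriminator_loss([0.9, 0.2], [1, 0])
```

The same two functions cover the translation tasks, since $\mathcal{L}_{\text{TLM}}$ and $\mathcal{L}_{\text{TRTD}}$ only differ in that the input is a concatenated translation pair.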
Gated Relative Position Bias
The paper proposes using a gated relative position bias, inspired by GRU cells, in the self-attention mechanism. First, let’s recap how self-attention works. Given input tokens $\left\{ x_{1},\ldots,x_{n} \right\}$, the self-attention mechanism uses $q_{i}$, $k_{i}$, and $v_{i}$ for each input to compute the head output ${\widetilde{h}}_{i}$ as shown below:
\[{\widetilde{h}}_{i} = \text{softmax}\left\lbrack \frac{q_{i}W_{i}^{Q}\left( k_{i}W_{i}^{K} \right)^{T}}{\sqrt{d_{k}}} \right\rbrack v_{i}W_{i}^{V}\]
where $W_{i}^{Q},W_{i}^{K} \in \mathbb{R}^{d_{m} \times d_{k}}$ and $W_{i}^{V} \in \mathbb{R}^{d_{m} \times d_{v}}$ are learned matrices. In this paper, the self-attention output is slightly different, as shown below:
\[{\widetilde{h}}_{i} = \text{softmax}\left\lbrack \frac{q_{i}W_{i}^{Q}\left( k_{i}W_{i}^{K} \right)^{T}}{\sqrt{d_{k}}} + b_{i - j} \right\rbrack v_{i}W_{i}^{V}\]
where $b_{i - j}$ denotes the gated relative position bias, which is computed via:
\[g^{\left( \text{update} \right)},\ g^{\left( \text{reset} \right)} = \sigma\left( q_{i} \cdot u \right),\ \sigma\left( q_{i} \cdot v \right)\]
\[b_{i - j} = d_{i - j} + g^{\left( \text{update} \right)}d_{i - j} + \left( 1 - g^{\left( \text{update} \right)} \right) \cdot wg^{\left( \text{reset} \right)}d_{i - j}\]
where $d_{i - j}$ is a learnable relative position bias, the vectors $u,\ v \in \mathbb{R}^{d_{k}}$ are parameters, $\sigma$ is the sigmoid function, and $w \in \mathbb{R}$ is a learnable value.
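The gating can be illustrated with a small pure-Python sketch. It assumes the query and key vectors are already projected, works on a single query position, and uses hypothetical function names; it is meant to show how the two gates rescale the relative position bias before the softmax, not to reproduce the official implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gated_bias(q_i, u, v, w, d_rel):
    """b = d + g_update * d + (1 - g_update) * w * g_reset * d."""
    g_update = sigmoid(dot(q_i, u))
    g_reset = sigmoid(dot(q_i, v))
    return d_rel + g_update * d_rel + (1.0 - g_update) * w * g_reset * d_rel

def attention_weights(q_i, keys, biases):
    """Softmax over scaled dot products plus one bias per key position."""
    d_k = len(q_i)
    scores = [dot(q_i, k) / math.sqrt(d_k) + b for k, b in zip(keys, biases)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# With a zero query both gates equal 0.5, so the bias is easy to check by hand.
b = gated_bias([0.0, 0.0], u=[1.0, 2.0], v=[3.0, 4.0], w=2.0, d_rel=1.0)
```

Because the gates depend on the query $q_{i}$, the same relative distance can receive a different bias for different query tokens.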
Results
In the following experiments, they used a 12-layer Transformer as the discriminator, with a hidden size of 768 and an FFN hidden size of 3,072, while they used a 4-layer Transformer as the generator using the same hyperparameters. They used the same vocabulary as XLM-R, which consists of 250K subwords tokenized by SentencePiece.
They jointly pretrained the generator and the discriminator using the Adam optimizer for 125K training steps with dynamic batching of approximately 1M tokens. $\lambda$ was set to $50$. Check the following table for the full list of pretraining hyperparameters.
For pretraining, they used the CC-100 dataset for the MRTD task, which contains texts in 100 languages collected from CommonCrawl. They used parallel corpora in 100 languages, collected from MultiUN, IIT Bombay, OPUS, WikiMatrix, and CCAligned, for the TRTD task. For sampling, they used temperature sampling with $T = \frac{10}{7}$.
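The effect of the temperature can be sketched as follows. The snippet assumes the common formulation in which a language is sampled with probability proportional to its share of the data raised to the power $1/T$; this exact formula, the function name, and the corpus sizes are assumptions for illustration only.

```python
def language_sampling_probs(sizes, temperature=10 / 7):
    """Rescale language shares by the exponent 1/T and renormalize."""
    total = sum(sizes.values())
    weights = {
        lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()
    }
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Hypothetical corpus sizes: one high-resource and one low-resource language.
probs = language_sampling_probs({"en": 900, "sw": 100})
```

With $T > 1$ the exponent is below 1, so low-resource languages are sampled more often than their raw share of the data would suggest.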
After pretraining, the XLM-E model was fine-tuned on various tasks from the XTREME benchmark. The following table shows the hyperparameters used when fine-tuning. The XTREME benchmark contains seven cross-lingual understanding tasks, namely part-of-speech (POS) tagging on Universal Dependencies v2.5, named entity recognition (NER) on the WikiAnn dataset, cross-lingual natural language inference on XNLI, cross-lingual paraphrase adversaries from word scrambling (PAWS-X), and cross-lingual question answering on MLQA, XQuAD, and TyDiQA-GoldP.
The following table shows the evaluation results on the XTREME benchmark; the results of XLM-E and XLM-R$_{\text{base}}$ are averaged over five runs.
As seen from the previous table, XLM-E outperforms previous models on question answering and classification problems while achieving competitive performance on structured prediction, all while using substantially less computation: