mRASP2
mRASP2 stands for “multilingual Random Aligned Substitution Pretraining”. It is called mRASP2 because it extends the mRASP model proposed by the same lab (ByteDance AI Lab) a year earlier. The mRASP2 framework was proposed in 2021 and published in this paper: Contrastive Learning for Many-to-many Multilingual Neural Machine Translation. The official code for this paper can be found in this GitHub repository: mRASP2.
mRASP2, as shown in the following figure, is a framework for training many-to-many multilingual neural machine translation models using both parallel corpora and monolingual corpora. This framework is empowered by two techniques:

mCOLT: a contrastive learning scheme for the encoder to close the gap among representations of similar sentences across different languages.

Aligned Augmentation (AA): data augmentation on both parallel and monolingual data to create pseudo-pairs that improve multilingual translation quality.
The base architecture of mRASP2 is the state-of-the-art Transformer. Slightly different from mRASP, they chose a larger setting with a 12-layer encoder and a 12-layer decoder to increase the model capacity. The model dimension is $1024$ with $16$ attention heads. To ease the training of this deep model, they applied Layer Normalization to the word embeddings and pre-norm residual connections in both the encoder and the decoder.
More formally, $D$ denotes all parallel datasets involved in training, where $D_{i,j}$ denotes a parallel dataset of the $\left( L_{i},\ L_{j} \right)$ language pair. To distinguish different languages, they prepended an additional language identification token to each sentence, on both the source side and the target side.
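The token-prepending step can be sketched as follows. This is a minimal illustration; the function name and exact token spelling are assumptions, not the authors' code:

```python
# Hypothetical sketch of prepending language-ID tokens, as described above.
def add_lang_token(tokens, lang):
    """Prepend a language identification token such as <EN id>."""
    return [f"<{lang} id>"] + tokens

# An English source sentence and its French translation, both tagged:
src = add_lang_token(["How", "are", "you", "?"], "EN")
tgt = add_lang_token(["Comment", "allez", "-", "vous", "?"], "FR")
```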
mCOLT
mCOLT stands for “multilingual Contrastive Learning for Translation”, which is a contrastive loss function for the encoder. Its key idea is to minimize the representation gap between similar sentences of different languages and maximize that between irrelevant sentences. More formally, given a bilingual translation pair $\left( x^{i},\ x^{j} \right) \in D$, the pair $\left( x^{i},\ x^{j} \right)$ is a positive example, while $\left( x^{i},\ y^{j} \right)$ is a negative example, where $y^{j}$ is randomly sampled from the same language $L_{j}$. The objective of contrastive learning is to minimize the following loss:
\[\mathcal{L}_{\text{ctr}} = - \sum_{\left( x^{i},x^{j} \right) \in D}^{}{\log\left( \frac{e^{\text{sim}^{+}\left( \mathcal{R}\left( x^{i} \right),\ \mathcal{R}\left( x^{j} \right) \right)/t}}{\sum_{y^{j}}^{}e^{\text{sim}^{-}\left( \mathcal{R}\left( x^{i} \right),\ \mathcal{R}\left( y^{j} \right) \right)/t}} \right)}\]Where:

$\text{sim}()$ calculates the cosine similarity of different sentences. $\text{sim}^{+}$ and $\text{sim}^{-}$ denote positive and negative similarity respectively. To simplify implementation, the negative samples are sampled from the same training batch.

$\mathcal{R}\left( x \right)$ denotes the encoded output of an arbitrary sentence $x$.

$t$ is the temperature. A higher temperature increases the difficulty of distinguishing positive samples from negative ones. In the paper, the temperature was set to $0.1$.
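A minimal pure-Python sketch of this contrastive loss for a single anchor sentence, following the formula above (the function names are illustrative; real implementations operate on batched encoder outputs rather than single vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, t=0.1):
    """-log( exp(sim+ / t) / sum over negatives of exp(sim- / t) ),
    where anchor/positive/negatives stand in for encoder outputs R(x)."""
    pos = math.exp(cosine(anchor, positive) / t)
    denom = sum(math.exp(cosine(anchor, n) / t) for n in negatives)
    return -math.log(pos / denom)
```

The loss drops as the anchor and its translation move closer in representation space, and rises as negatives move closer to the anchor.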
Now, the training loss $\mathcal{L}$ used for training mRASP2 is a combination of two loss functions: the contrastive loss $\mathcal{L}_{\text{ctr}}$ defined above and the cross entropy $\mathcal{L}_{\text{ce}}$:
\[\mathcal{L} = \mathcal{L}_{\text{ce}} + \lambda\left| s \right|\mathcal{L}_{\text{ctr}}\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \mathcal{L}_{\text{ce}} = - \sum_{\left( x^{i},x^{j} \right) \in D}^{}{\log\left( P_{\theta}\left( x^{i} \middle| x^{j} \right) \right)}\]Where

$\lambda$ is the coefficient to balance the two training losses. In the paper, it was set to $1.0$.

$\left| s \right|$ is the average sequence length, which rescales $\mathcal{L}_{\text{ctr}}$ since it is calculated at the sentence level while $\mathcal{L}_{\text{ce}}$ is calculated at the token level.

$x^{i}$ and $x^{j}$ represent sentences in language $L_{i}$ and $L_{j}$ respectively.

$\theta$ denotes the parameters of the multilingual Transformer model.
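Combining the two losses is then a one-liner; the sketch below simply restates the formula above (the function and parameter names are illustrative):

```python
def total_loss(ce_loss, ctr_loss, avg_seq_len, lam=1.0):
    """L = L_ce + lambda * |s| * L_ctr, where |s| (the average sequence
    length) rescales the sentence-level contrastive term so it is
    comparable to the token-level cross entropy."""
    return ce_loss + lam * avg_seq_len * ctr_loss
```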
Aligned Augmentation
Aligned Augmentation (AA) is a data augmentation technique that can be applied to both parallel and monolingual data in order to improve multilingual translation quality. Aligned Augmentation is considered an extension of RAS (Random Aligned Substitution), which was proposed in the mRASP paper.
For a bilingual sentence pair $\left( x^{i},\ x^{j} \right)$ in two languages $L_{i}$ and $L_{j}$, Aligned Augmentation creates a perturbed sentence $C\left( x^{i} \right)$ by replacing aligned words from a MUSE synonym dictionary with a probability of $90\%$, keeping them unchanged otherwise; this forms a pseudo-parallel training example $\left( C\left( x^{i} \right),\ x^{j} \right)$:
For a monolingual sentence $x^{i}$ of language $L_{i}$, Aligned Augmentation creates a perturbed sentence $C\left( x^{i} \right)$ in the same way as for the bilingual case; this forms a pseudo self-parallel example $\left( C\left( x^{i} \right),\ x^{i} \right)$:
Now, both the pseudo-parallel training example $\left( C\left( x^{i} \right),\ x^{j} \right)$ and the pseudo self-parallel example $\left( C\left( x^{i} \right),\ x^{i} \right)$ will be used to increase the training data and therefore boost the multilingual translation quality.
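The substitution step can be sketched as follows. This is a simplified stand-in for the paper's pipeline: the dictionary here is a toy example, whereas real AA substitutes words aligned through MUSE dictionaries:

```python
import random

def aligned_augment(tokens, synonym_dict, p=0.9, rng=random):
    """Replace each word found in the synonym dictionary with a random
    translation with probability p; keep it unchanged otherwise."""
    out = []
    for tok in tokens:
        if tok in synonym_dict and rng.random() < p:
            out.append(rng.choice(synonym_dict[tok]))
        else:
            out.append(tok)
    return out

# Toy dictionary mapping English words to possible translations:
toy_dict = {"hello": ["bonjour", "hallo"], "world": ["monde"]}
perturbed = aligned_augment(["hello", "world", "!"], toy_dict)
```

Pairing `perturbed` with the original target (or with the original sentence itself, in the monolingual case) yields the pseudo-parallel examples described above.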
Experiments
In their experiments, they used the Transformer model with 12 encoder layers and 12 decoder layers. The embedding size and FFN dimension were set to $1024$ with $16$ attention heads. For the multilingual vocabulary, they used a shared BPE vocabulary of $64,808$ tokens plus $59$ language tokens such as $\left\langle \text{EN\ id} \right\rangle$, $\left\langle \text{FR\ id} \right\rangle$, etc.
They also used a dropout rate of $0.1$, a learning rate of $3e^{-4}$ with polynomial decay scheduling, and $10k$ warm-up steps. For optimization, they used the Adam optimizer with $\epsilon = 1e^{-6}$ and $\beta_{2} = 0.98$. To stabilize training, they set the gradient-norm threshold to $5.0$ and clipped all gradients with a larger norm.
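Gradient clipping by global norm can be sketched in a few lines. This is a minimal single-vector illustration; frameworks apply the same rescaling jointly across all parameter tensors:

```python
import math

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads
```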
For bilingual data, they used the Parallel Corpus (PC32) proposed in the mRASP paper. PC32 contains 97.6 million parallel sentences across 32 English-centric language pairs. The following figure shows the languages found in the corpus along with the number of sentences available for each language:
For monolingual data, they created the Monolingual Corpus (MC24) of 1.01 billion sentences in 24 languages. MC24 is a subset of the Newscrawl dataset, restricted to the languages in PC32 plus three additional languages (Nl, Pl, Pt), highlighted in red in the following figure:
In order to balance the volume across different languages, they applied temperature sampling with $T = 5$. For a given language pair $l$ with $D_{l}$ parallel sentences, the probability of a sample being from language $l$ is:
\[p_{l} = \left( \frac{D_{l}}{\sum_{k}^{}D_{k}} \right)^{\frac{1}{T}}\]
Results
The following table shows the BLEU scores of different models (bilingual, pretrained then fine-tuned, and multilingual) on the evaluation sets of the WMT benchmark. As shown, mRASP2 clearly improves over the multilingual baselines by a large margin in 10 translation directions.
Note:
mTransformer is a many-to-many 12-layer standard Transformer model trained on the PC32 dataset and used as a baseline. As we can see from the previous table, this model achieves very competitive results, which they attributed to the following reasons:
They used a batch size of 3 million tokens; batch size plays a crucial role in successfully training multilingual NMT models.
They used gradient-norm clipping to stabilize training. Without it, large-scale training sometimes collapses.
The following table shows the BLEU scores of unsupervised translation on the IWSLT, WMT, and OPUS-100 evaluation sets. The compared models were trained on monolingual data only, and mRASP2 outperforms them by a huge margin on all language pairs:
The following table shows the BLEU scores of zero-shot translation on the OPUS-100 evaluation sets. As we can see, mRASP2 achieves consistent BLEU gains in zero-shot directions across different evaluation sets:
To understand what contributes to the performance gain, they conducted analytical experiments and reported the results in the following table:
And they found out the following:

③ performs comparably with ① in supervised and unsupervised scenarios, whereas it achieves a substantial BLEU improvement in zero-shot translation. This indicates that introducing the contrastive loss improves zero-shot translation quality without harming other directions.

① and ② perform poorly in zero-shot directions. This means the contrastive loss is crucial for the performance in zero-shot directions.

mRASP2 further improves BLEU over ③, ④, and ⑤, especially in unsupervised directions. Therefore, it is safe to say that mRASP2 learns a better representation space by using monolingual data.