Dr. NMT
DrNMT stands for “Discriminative Reranking for Neural Machine Translation” which is a reranking framework created by Facebook AI in 2021 and published in this paper: Discriminative Reranking for Neural Machine Translation. The official implementation for this paper can be found in this GitHub repository: DrNMT.
A reranking model is a model that is able to take a list of hypotheses generated by a baseline model and rank them based on a desired metric. In this paper, DrNMT takes as input both the source sentence as well as a list of hypotheses to output a ranked list of a desired metric, e.g. BLEU, over the nbest list.
Architecture
DrNMT is a transformer architecture which takes as input the concatenation of the source sentence $x$ and one of the hypothesis sentences generated by a NMT model $u \in \mathcal{U}(x)$. It outputs higher scores for hypotheses of better quality based on a metric $\mu(u,r)$ such as BLEU, where quality is measured with respect to a reference sentence $r$.
The architecture includes also position embeddings and language embeddings, to help the model represent tokens that are shared between the two languages. The final hidden state corresponding to the start of sentence token $\left\langle s \right\rangle$ is a one hidden feedforward network with $d$ hidden units with $\text{tanh}$ activation. It serves as the joint representation for $(x, u)$:
Given a source sentence $x$, $n$ of hypotheses generated from Trans, a reference sentence $r$, and an evaluation metric $\mu$, DrNMT parameters $\theta$ are learned by minimizing the KLdivergence over the training dataset according to the following formula:
\[\mathcal{L}\left( \theta \right) =  \sum_{j = 1}^{n}{p_{T}\left( u_{j} \right)\text{.log}\left( p_{M}\left( u_{j} \middle x;\theta \right) \right)}\] \[p_{M}\left( u_{i} \middle x;\theta \right) = \text{softmax}\left( o_{i}\left( u_{i} \middle x;\theta \right) \right) = \frac{\exp\left( o_{i}\left( u_{i} \middle x;\theta \right) \right)}{\sum_{j = 1}^{n}{\exp\left( o_{j}\left( u_{j} \middle x;\theta \right) \right)}}\] \[p_{T}\left( u_{i} \right) = \text{softmax}\left( \frac{\mu\left( u_{i},\ r \right)}{T} \right) = \frac{\exp\left( \frac{\mu\left( u_{i},\ r \right)}{T} \right)}{\sum_{j = 1}^{n}{\exp\left( \frac{\mu\left( u_{j},\ r \right)}{T} \right)}}\]Where $T$ is the temperature to control the smoothness of the distribution. In practice, we apply a minmax normalization on $\mu$ instead of using $T$ so that the best hypothesis scores 1 and the worst 0.
Results
In the paper, they experimented on four different language pairs: GermanEnglish (DeEn), EnglishGerman (EnDe), EnglishTamil (EnTa) and RussianEnglish (RuEn). The following table shows the number of sentences in each dataset used in the experiments after preprocessing:
They trained vanilla Transformer models using the bitext data to generate the nbest hypothesis list. To alleviate overfitting, BackTranslation data was generated from beam decoding with beam size equal to 5. They also used pretrained XLMR as the transformerpart of the DrNMT. Experiments on the four WMT directions show that DrNMT yields improvements of up to 4 BLEU over the beam search output:
The following are the models used in the previous table:

beam (fw): Feedforward transformer with beam decoding.

beam (fw) + MLM: This is the same as beam (fw) with pretrained masked language model (MLM) added. This technique was proposed by this paper: Masked Language Model Scoring. This takes a pretrained masked language model (MLM) on the target side, and iteratively masks one word of the hypothesis at the time and aggregates the corresponding scores to yield a score for the whole hypothesis. Then, this score is combined with the score of the forward model to rerank the nbest list.

beam (fw) + MLM: This is the same as beam (fw) + MLM where MLM is tuned on our target side monolingual dataset.

DrNMT: is the model described before.

DrNMT + NCD: This is the same as DrNMT with noisy channel decoding (NCD), a technique proposed in this paper: Simple and Effective Noisy Channel Modeling for Neural Machine Translation.

Oracle: The oracle is computed by selecting the best hypotheses based on BLEU with respect to the human reference. Of course, the oracle may be not achievable because of uncertainty in the translation task.
In the paper, they also examined the effect of training the reranker with different sizes of the nbest list. And they found out that; as the size of the nbest list during test time increases, the performance of all rerankers and NCD improve. The following figure shows the performance of DrNMT on DeEn validation sets from four rerankers trained with 5, 10, 20 and 50 hypotheses, respectively: