Improved RNN Transducer
Improved RNN-T, or Improved Recurrent Neural Network Transducer, is an improved version of the RNN-Transducer in which a normalized jointer network is introduced to improve performance. It was proposed by ByteDance AI Lab in 2020 and published in the paper: Improving RNN Transducer with Normalized Jointer Network. To further improve the performance of the RNN-T system, they used a masked Conformer model as the encoder network and Transformer-XL as the predictor network.
They introduced this normalized jointer network because they observed huge gradient variance during RNN-T training: the gradient is amplified $U$ times (the transcription length) on the encoder side and $T$ times (the acoustic length) on the predictor side. This makes the encoder and predictor hard to optimize. To address this issue, they proposed the normalized jointer network, which normalizes the gradients at the encoder and predictor by a factor of $U$ and $T$ respectively.
Masked Conformer Encoder
Recall that the encoder of the RNN-T model works as the acoustic model: the input to the encoder network is usually mel-filterbank features $X = \left\{ x_{1},\ \ldots,\ x_{T} \right\}$, and the encoder network converts these features into the high-level representations $H^{\text{enc}} = \left\{ h_{1}^{\text{enc}},\ \ldots,\ h_{T}^{\text{enc}} \right\}$. The encoder network used in this paper is a masked Conformer model.
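For concreteness, here is one way such features might be extracted with torchaudio; the toolkit choice, the file name, and the 80-bin setting are illustrative assumptions, not details from the paper:

```python
import torchaudio

# Load a mono waveform; shape: (channels, num_samples). File name is hypothetical.
waveform, sample_rate = torchaudio.load("utterance.wav")

# 80-dimensional mel-filterbank features; shape: (T, 80),
# one frame every 10 ms by default.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=sample_rate
)
```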
Recall that the Conformer's major component is a stack of Conformer blocks, each of which is a series of multi-headed self-attention, depthwise convolution, and feed-forward layers, as shown in the following figure:
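Below is a minimal PyTorch sketch of one such block, following the original Conformer design with its two half-step feed-forward modules. It is a sketch rather than the paper's implementation, and all hyperparameters (`d_model=256`, `kernel_size=15`, etc.) are illustrative; the `attn_mask` argument is where the mask discussed below enters.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Minimal Conformer block: half-step FFN -> MHSA -> conv module -> half-step FFN."""
    def __init__(self, d_model=256, n_heads=4, kernel_size=15, ff_mult=4):
        super().__init__()
        self.ff1 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model), nn.SiLU(),
            nn.Linear(ff_mult * d_model, d_model),
        )
        self.mhsa_norm = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Convolution module: pointwise -> GLU -> depthwise -> BatchNorm -> pointwise.
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, 1), nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),  # depthwise
            nn.BatchNorm1d(d_model), nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1),
        )
        self.ff2 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model), nn.SiLU(),
            nn.Linear(ff_mult * d_model, d_model),
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):  # x: (batch, T, d_model)
        x = x + 0.5 * self.ff1(x)
        h = self.mhsa_norm(x)
        x = x + self.mhsa(h, h, h, attn_mask=attn_mask)[0]
        # Conv1d expects (batch, channels, T), so transpose around the conv module.
        x = x + self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)
```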
The multi-head self-attention module is the same attention mechanism introduced in the Transformer model, given by the following equation:
\[\text{Attention}\left( Q,\ K,\ V \right) = \text{softmax}\left( \frac{QK^{\intercal}}{\sqrt{d}} \right)V\]In this paper, they modified the self-attention mechanism by adding a mask $M$ to the attention weights; in other words, the softmax is normalized only over the unmasked positions. The self-attention output at index $i$ can be calculated using the following equation:
\[\text{out}_{i} = \sum_{j}\frac{\exp\left( \frac{Q_{i} \cdot K_{j}}{\sqrt{d}} \right)M_{ij}}{\sum_{k}\exp\left( \frac{Q_{i} \cdot K_{k}}{\sqrt{d}} \right)M_{ik}}\, V_{j}\]This masking mechanism introduced in the Conformer has two advantages:

First, adding a mask to the self-attention in the Conformer helps convergence, especially when the training utterances are very long.

Second, with the mask introduced in self-attention, it is quite easy to change a non-streaming RNN-T system into a streaming one by masking out the right context in the self-attention part, as sketched below.
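Here is a small sketch of the masked self-attention equation above, together with a streaming-style mask that keeps $L$ frames of left context and no right context. The helper name and the choice of $L$ are illustrative, and this is a direct transcription of the equation rather than a numerically stable implementation:

```python
import torch

def masked_self_attention(Q, K, V, M):
    """Masked self-attention as in the equation above.
    Q, K, V: (T, d) tensors; M: (T, T) binary mask (1 = attend, 0 = block).
    Note: a stable version would subtract the row max before exp."""
    d = Q.shape[-1]
    scores = torch.exp(Q @ K.T / d ** 0.5) * M           # masked numerator terms
    weights = scores / scores.sum(dim=-1, keepdim=True)  # normalize over unmasked j only
    return weights @ V                                   # (T, d)

# Streaming-style mask: frame i may attend to frames i-L .. i, never ahead.
T, d, L = 6, 8, 2
M = torch.tril(torch.ones(T, T)) * torch.triu(torch.ones(T, T), -L)
out = masked_self_attention(torch.randn(T, d), torch.randn(T, d), torch.randn(T, d), M)
```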
Transformer-XL Predictor
Recall that the predictor/decoder network of the RNN-T model works as the language model: the input to the predictor network is the sequence of non-blank tokens $Y = \left\{ y_{1},\ \ldots,\ y_{U} \right\}$, and the predictor network converts them into the high-level representations $H^{\text{pre}} = \left\{ h_{1}^{\text{pre}},\ \ldots,\ h_{U}^{\text{pre}} \right\}$. The decoder network used in this paper is a Transformer-XL model.
Transformer-XL contains a segment-level recurrence mechanism that maintains an extra-long context. Transformer-XL also proposed a novel positional encoding scheme that adapts the sinusoid formulation to relative positional embeddings, which helps the model generalize to longer sequences during evaluation.
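The following is a rough sketch of the segment-level recurrence idea only (relative positional encoding is omitted for brevity, and the layer names and sizes are assumptions, not the paper's code): hidden states from the previous segment are cached, detached from the graph, and attended to as extra context by the next segment.

```python
import torch
import torch.nn as nn

class TransformerXLPredictorLayer(nn.Module):
    """One predictor layer with segment-level recurrence: keys/values span
    a cached memory of previous-segment states plus the current segment."""
    def __init__(self, d_model=256, n_heads=4, mem_len=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.mem_len = mem_len

    def forward(self, x, memory=None):  # x: (batch, U, d_model)
        context = x if memory is None else torch.cat([memory, x], dim=1)
        out = self.norm(x + self.attn(x, context, context)[0])
        # New memory: last mem_len states, detached so gradients stay in-segment.
        new_memory = context[:, -self.mem_len:].detach()
        return out, new_memory
```

The memory returned for one segment is fed back in as `memory` for the next, which is what lets the predictor condition on context far beyond a single segment.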
Normalized Jointer
The jointer network combines the high-level representations from the encoder $H^{\text{enc}}$ and the predictor network $H^{\text{pre}}$ using a fully-connected network like so:
\[z\left( t,\ u \right) = \text{FC}\left( \tanh\left( h_{t}^{\text{enc}} + h_{u}^{\text{pre}} \right) \right)\]To understand how this equation affects the gradients during backpropagation, let's write it with one index fixed at a time:
\[z\left( t,\ : \right) = \text{FC}\left( \tanh\left( h_{t}^{\text{enc}} + h_{:}^{\text{pre}} \right) \right)\] \[z\left( :,\ u \right) = \text{FC}\left( \tanh\left( h_{:}^{\text{enc}} + h_{u}^{\text{pre}} \right) \right)\]In the backward pass, the gradients $dh_{t}^{\text{enc}}$ and $dh_{u}^{\text{pre}}$ have the following relationship with the jointer network's gradient $dz\left( t,\ u \right)$:
\[{dh}_{t}^{\text{enc}} = \sum_{u = 1}^{U}{dz\left( t,\ u \right)},\ \ \ \ \ \ \ {dh}_{u}^{\text{pre}} = \sum_{t = 1}^{T}{dz\left( t,\ u \right)}\]In speech recognition tasks, $T$ is usually much bigger than $U$, so these two gradients end up on very different scales, which causes unhealthy optimization of the parameters. To overcome this problem, they simply divided $dh_{t}^{\text{enc}}$ by $U$ and $dh_{u}^{\text{pre}}$ by $T$ as shown below:
\[{dh}_{t}^{\text{enc}} = \frac{1}{U}\sum_{u = 1}^{U}{dz\left( t,\ u \right)},\ \ \ \ \ \ \ {dh}_{u}^{\text{pre}} = \frac{1}{T}\sum_{t = 1}^{T}{dz\left( t,\ u \right)}\]With this simple modification to the gradients flowing from the jointer network back into the encoder and predictor networks, the gradient norm during RNN-T training becomes more stable, and the validation loss decreases faster and reaches a lower value, as illustrated in the following figure:
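As a concrete illustration of this trick, here is a minimal PyTorch sketch written as a custom autograd function whose backward divides the summed gradients by $U$ and $T$. Shapes and names are illustrative assumptions; the tanh and the fully-connected layer are left to standard autograd, which is equivalent since they act pointwise on each $(t, u)$ cell of the joint grid:

```python
import torch
import torch.nn as nn

class NormalizedJoint(torch.autograd.Function):
    """Broadcast-add encoder/predictor states; in backward, average (rather
    than sum) the joint grid's gradient over u for the encoder and t for
    the predictor, per the normalized-jointer equations above."""
    @staticmethod
    def forward(ctx, h_enc, h_pre):
        # h_enc: (B, T, 1, D), h_pre: (B, 1, U, D) -> broadcast sum: (B, T, U, D)
        ctx.T, ctx.U = h_enc.shape[1], h_pre.shape[2]
        return h_enc + h_pre

    @staticmethod
    def backward(ctx, dz):  # dz: (B, T, U, D)
        dh_enc = dz.sum(dim=2, keepdim=True) / ctx.U  # (1/U) * sum over u
        dh_pre = dz.sum(dim=1, keepdim=True) / ctx.T  # (1/T) * sum over t
        return dh_enc, dh_pre

class Jointer(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, h_enc, h_pre):
        # h_enc: (B, T, D), h_pre: (B, U, D)
        z = NormalizedJoint.apply(h_enc.unsqueeze(2), h_pre.unsqueeze(1))
        return self.fc(torch.tanh(z))  # (B, T, U, vocab_size)
```

In the forward direction this computes exactly $z(t, u)$ from the equation above; only the backward behavior differs from a plain broadcast sum.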
With all three proposed improvements, they achieved state-of-the-art results on the Chinese AISHELL-1 dataset: $6.15\%$ and $5.37\%$ CER for streaming and non-streaming speech recognition respectively.