2020

  1. GPT-3

    GPT-3 is an enormous model built on the transformer-decoder architecture, published in 2020 by OpenAI in the paper "Language Models are Few-Shot Learners", whose title is very indicative of what the paper set out to show. The paper didn't introduce a new architecture; it reused the GPT-2 architecture, made it far bigger, and trained it on much more data. …
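
    To make the title concrete, here is a minimal, hypothetical few-shot prompt: the task is demonstrated with a handful of in-context examples inside the input text itself, and the model is expected to continue the pattern without any gradient updates or fine-tuning. The translation pairs follow an example from the paper; the exact wording is illustrative.

    ```python
    # A hypothetical few-shot prompt: the task is demonstrated with a few examples
    # inside the input itself; the model continues the pattern with no fine-tuning.
    few_shot_prompt = (
        "Translate English to French:\n\n"
        "sea otter => loutre de mer\n"
        "cheese => fromage\n"
        "car =>"
    )

    # The prompt is fed to the language model as ordinary text; its continuation
    # (e.g. "voiture") is taken as the answer.
    print(few_shot_prompt)
    ```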


  2. Adapter Fusion

    AdapterFusion is a variant of adapter layers that extends adapters from single-task to multi-task use. It was proposed by researchers at the UKP Lab (Technical University of Darmstadt) and New York University and published in May 2020 in their paper: AdapterFusion: Non-Destructive Task Composition for Transfer Learning. …


  3. ETC: Extended Transformer Construction

    ETC stands for "Extended Transformer Construction", a new Transformer architecture for modeling long inputs that achieves state-of-the-art performance on various long-input tasks, as shown in the following table. ETC was proposed by Google in 2020 and published in this paper: "ETC: Encoding Long and Structured Inputs in Transformers". The official code for this paper can be found on Google Research's official GitHub repository: research-etc-model. …


  4. Longformer: Long Transformer

    Transformer-based models are unable to process long sequences due to their self-attention operation, which has a time complexity of $O\left( n^{2} \right)$ where $n$ is the input length. Longformer stands for "Long Transformer", an encoder-side Transformer with a novel attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer was proposed by the Allen Institute in 2020 and published in their paper: Longformer: The Long-Document Transformer. The official code for this paper can be found on the Allen Institute's official GitHub page: allenai/longformer. …
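
    To illustrate where the linear scaling comes from (a toy sketch of the sliding-window idea, not the official allenai/longformer implementation), the snippet below lets each token attend only to its neighbours within a fixed window, so the cost grows linearly with sequence length; the window size and tensor shapes are illustrative.

    ```python
    import torch

    def sliding_window_attention(q, k, v, window_size=4):
        """Toy local self-attention: each token attends only to neighbours within
        +/- window_size positions, so cost grows linearly with sequence length
        instead of quadratically (a sketch of the idea, not the official code)."""
        seq_len, dim = q.shape
        out = torch.zeros_like(v)
        for i in range(seq_len):
            lo, hi = max(0, i - window_size), min(seq_len, i + window_size + 1)
            scores = q[i] @ k[lo:hi].T / dim ** 0.5     # scores over the local window
            weights = torch.softmax(scores, dim=-1)
            out[i] = weights @ v[lo:hi]
        return out

    x = torch.randn(16, 8)                 # 16 tokens, hidden size 8
    print(sliding_window_attention(x, x, x).shape)   # torch.Size([16, 8])
    ```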


  5. ELECTRA

    ELECTRA stands for "Efficiently Learning an Encoder that Classifies Token Replacements Accurately", a discriminative language model, unlike widely-used generative language models such as BERT and GPT. ELECTRA was proposed by Stanford University in collaboration with Google Brain in 2020 and published in their paper: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. The official code of this paper can be found on Google Research's official GitHub repository: google-research/electra. …
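
    As a rough illustration of the replaced-token-detection idea (a toy sketch, not the official google-research/electra code), the snippet below corrupts a few token positions with random tokens and trains a tiny discriminator to label every position as original or replaced; in the real model a small masked-LM generator proposes the replacements and the discriminator is a full Transformer encoder.

    ```python
    import torch
    import torch.nn as nn

    vocab_size, hidden = 100, 32
    tokens = torch.randint(0, vocab_size, (1, 10))        # a fake input sentence

    # "Generator" step, simplified: corrupt ~15% of positions with random tokens.
    corrupt_mask = torch.rand(tokens.shape) < 0.15
    corrupted = torch.where(corrupt_mask, torch.randint_like(tokens, vocab_size), tokens)

    # Discriminator: for every token, predict whether it was replaced or original.
    discriminator = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, 1))
    logits = discriminator(corrupted).squeeze(-1)          # (1, 10)
    labels = (corrupted != tokens).float()                 # 1 = replaced, 0 = original
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    print(loss.item())
    ```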


  6. DistilBERT

    DistilBERT is a smaller, faster, cheaper, and lighter version of BERT created by Hugging Face in March 2020 and published in this paper: "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". In the paper, they used knowledge distillation to reduce the size of BERT by 40% while retaining 97% of its language understanding capabilities and being 60% faster. This was made possible by a triple loss function that combines language modeling, distillation, and cosine-distance losses. …
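
    The snippet below is a rough sketch of how such a triple loss could be combined in PyTorch: a temperature-softened distillation term against the teacher's logits, a masked language modeling term on the student, and a cosine term aligning hidden states. The loss weights, temperature, and tensor shapes are illustrative assumptions, not the exact values used in the paper.

    ```python
    import torch
    import torch.nn.functional as F

    def distilbert_style_loss(student_logits, teacher_logits,
                              student_hidden, teacher_hidden,
                              labels, T=2.0, alpha=5.0, beta=2.0, gamma=1.0):
        """Sketch of the triple loss (weights and temperature are illustrative)."""
        # 1) distillation: soften teacher/student distributions with temperature T
        distill = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                           F.softmax(teacher_logits / T, dim=-1),
                           reduction="batchmean") * T * T
        # 2) masked language modeling loss on the student
        mlm = F.cross_entropy(student_logits, labels)
        # 3) cosine loss aligning student and teacher hidden states
        cos = 1 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
        return alpha * distill + beta * mlm + gamma * cos

    # toy tensors: 4 masked positions, vocabulary of 30, hidden size 16
    s_logits, t_logits = torch.randn(4, 30), torch.randn(4, 30)
    s_hid, t_hid = torch.randn(4, 16), torch.randn(4, 16)
    labels = torch.randint(0, 30, (4,))
    print(distilbert_style_loss(s_logits, t_logits, s_hid, t_hid, labels).item())
    ```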


2019

  1. BART

    BART stands for "Bidirectional Auto-regressive Transformer", a pre-training scheme for sequence-to-sequence models created by Facebook AI in 2019 and published in this paper: "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension". Pre-training is the process of training a model on one task so that it learns parameters that make other tasks easier. This is what we, human beings, do: we use the knowledge we learned in the past to understand new knowledge and handle a variety of new tasks. …


  2. Google's T5

    T5 stands for “Text-to-Text Transfer Transformer” which is a text-to-text framework proposed by Google in 2019 and published in this paper: “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. The official code for this paper can be found on Google Research’s official GitHub repository: google-research/text-to-text-transfer-transformer. …
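
    The core idea is that every task is cast as feeding in text and producing text, distinguished only by a short task prefix. The toy pairs below follow the style of the examples shown in the paper; the exact target strings are illustrative.

    ```python
    # Every task becomes "text in, text out"; only the task prefix changes.
    # Prefixes follow the style used in the T5 paper; targets are illustrative.
    examples = [
        ("translate English to German: That is good.", "Das ist gut."),
        ("summarize: state authorities dispatched emergency crews tuesday ...",
         "six people hospitalized after a storm."),
        ("cola sentence: The course is jumping well.", "not acceptable"),
    ]

    for source, target in examples:
        print(f"input : {source}\ntarget: {target}\n")
    ```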


  3. ALBERT

    ALBERT, which stands for "A Lite BERT", is a reduced version of BERT that is smaller, faster, cheaper, and easier to scale. ALBERT was created by Google and the Toyota Technological Institute at Chicago in 2019 and published in this paper: "ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations", and you can find the official code for this paper in Google Research's official GitHub repository: google-research/ALBERT. …


  4. TinyBERT

    TinyBERT is a distilled version of BERT that uses a novel knowledge distillation method called "Transformer distillation", specially designed for Transformer-based models such as BERT. TinyBERT was proposed in 2019 by Huawei Noah's Ark Lab and published in a paper of the same name: "TinyBERT: Distilling BERT for Natural Language Understanding". The official code for this paper can be found in the following GitHub repository: TinyBERT. …


  5. StructBERT

    StructBERT stands for “Structural BERT” which is an extension of BERT created by incorporating language structures into pre-training. StructBERT was proposed in 2019 by Alibaba Group and published in their “StructBERT: Incorporating Language Structures Into Pre-Training For Deep Language Understanding” paper. The official code for this paper can be found in the following GitHub repository: alibaba/StructBERT. …


  6. Big Models pollute Earth

    Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result, these models are costly to train both financially and environmentally. …


  7. RoBERTa

    RoBERTa, which stands for "Robustly Optimized BERT Approach", is an improved recipe for pretraining BERT created by Facebook AI in 2019 and published in this paper: "RoBERTa: A Robustly Optimized BERT Pretraining Approach". The official code for this paper can be found on Facebook's official fairseq GitHub repository: fairseq/roberta. …


  8. SpanBERT

    SpanBERT is a model created by Facebook AI and the Allen Institute in 2019 and published in this paper: "SpanBERT: Improving Pre-training by Representing and Predicting Spans". SpanBERT is an extension of BERT that better represents and predicts contiguous random spans of text, rather than individual random tokens. This is crucial since many NLP tasks involve spans of text rather than single tokens. SpanBERT differs from BERT in both the masking scheme and the training objectives: …
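
    As a toy illustration of the masking scheme (not the official implementation), the sketch below hides one contiguous span of tokens instead of scattered individual tokens; the real model samples span lengths from a geometric distribution, which is simplified here to a uniform choice.

    ```python
    import random

    def mask_random_span(tokens, mask_token="[MASK]", max_span_len=5):
        """Toy span masking: hide one contiguous span rather than scattered tokens.
        (SpanBERT samples span lengths from a geometric distribution; this sketch
        just picks a uniform length for illustration.)"""
        span_len = random.randint(1, min(max_span_len, len(tokens)))
        start = random.randint(0, len(tokens) - span_len)
        masked = list(tokens)
        masked[start:start + span_len] = [mask_token] * span_len
        return masked, (start, start + span_len)

    tokens = "an American football game was played in Denver".split()
    masked, span = mask_random_span(tokens)
    print(masked, span)
    ```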


  9. XLNet

    XLNet stands for "Extra-Long Net", a model that integrates ideas from both GPT (autoregressive modeling) and BERT (bidirectional context). It was introduced in 2019 by Carnegie Mellon University and Google Brain and published in this paper: "XLNet: Generalized Autoregressive Pretraining for Language Understanding", by the same authors as Transformer-XL. The official code for this paper can be found in the following GitHub repository: xlnet. …


  10. MASS

    MASS, which stands for "Masked Sequence to Sequence", is a pre-training scheme proposed by Microsoft in 2019 and published in this paper: "MASS: Masked Sequence to Sequence Pre-training for Language Generation"; the code is publicly available on Microsoft's official GitHub account. Inspired by BERT, the MASS encoder takes a sentence with a masked fragment as input, and its decoder predicts the masked fragment. …
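
    A minimal sketch of how such an encoder-input / decoder-target pair could be constructed (illustrative only, not the official Microsoft code):

    ```python
    def make_mass_pair(tokens, start, length, mask_token="[MASK]"):
        """Sketch of MASS-style pre-training data: the encoder sees the sentence with
        a contiguous fragment masked out, and the decoder must predict that fragment."""
        encoder_input = tokens[:start] + [mask_token] * length + tokens[start + length:]
        decoder_target = tokens[start:start + length]
        return encoder_input, decoder_target

    sentence = "the quick brown fox jumps over the lazy dog".split()
    enc_in, dec_out = make_mass_pair(sentence, start=2, length=3)
    print(enc_in)    # ['the', 'quick', '[MASK]', '[MASK]', '[MASK]', 'over', ...]
    print(dec_out)   # ['brown', 'fox', 'jumps']
    ```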


  11. GPT-2

    GPT-2 stands for "Generative Pre-trained Transformer 2", a language model published in this paper: "Language Models are Unsupervised Multitask Learners" by OpenAI in 2019. In the paper, they set out to demonstrate that language models can perform downstream tasks such as question answering, machine translation, reading comprehension, and summarization in a zero-shot setting, that is, without any parameter or architecture modification. …


  12. Adapter Layers

    At the moment, the norm in NLP is to download and fine-tune pre-trained models consisting of hundreds of millions, or even billions, of parameters. Modifying these models, no matter how simple the modification is, requires re-training the whole model. And re-training these huge models is expensive, slow, and time-consuming, which impedes progress in NLP. Adapters are one way to fix this problem. …
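
    Below is a minimal PyTorch sketch of a bottleneck adapter block, assuming the common down-project / non-linearity / up-project / residual design; the hidden and bottleneck sizes are illustrative. Only these few parameters are trained for a new task while the surrounding pre-trained model stays frozen.

    ```python
    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Minimal bottleneck adapter: down-project, non-linearity, up-project,
        plus a residual connection. Only these few parameters are trained; the
        big pre-trained model around it stays frozen."""
        def __init__(self, hidden_size=768, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(hidden_size, bottleneck)
            self.up = nn.Linear(bottleneck, hidden_size)
            self.act = nn.GELU()

        def forward(self, hidden_states):
            return hidden_states + self.up(self.act(self.down(hidden_states)))

    adapter = Adapter()
    x = torch.randn(2, 10, 768)          # (batch, seq_len, hidden)
    print(adapter(x).shape)              # torch.Size([2, 10, 768])
    ```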


  13. Transformer-XL

    Transformer-XL, which stands for "Transformer Extra Long", is a language model published in this paper: "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context" by Google Brain in 2019. The official code for this paper can be found in the following GitHub repository: transformer-xl. …


2018

  1. BERT

    BERT stands for “Bidirectional Encoder Representations from Transformers” which is a model published by researchers at Google in this paper: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” in 2018. It has caused a stir in the NLP community by presenting state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others. …


  2. GPT

    The Transformer is a state-of-the-art architecture for machine translation. OpenAI adapted this architecture for the language modeling task in the paper "Improving Language Understanding by Generative Pre-Training", published in 2018. Pre-training is the process of training a model on one task (language modeling in the paper) so that it learns parameters that make other tasks easier (four other tasks in the paper: natural language inference, question answering, semantic similarity, and text classification). …


2016

  1. GCNN: Gated CNN

    One of the major defects of Seq2Seq models is that they can't process words in parallel. For a large corpus of text, this increases the time spent translating the text. CNNs can help us solve this problem. In this paper: "Language Modeling with Gated Convolutional Networks", proposed by FAIR (Facebook AI Research) in 2016, the researchers developed a new architecture that uses a gating mechanism over stacked convolutional layers and outperforms the Seq2Seq model. …
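
    A minimal PyTorch sketch of that gating mechanism, the gated linear unit $h(X) = (X \ast W + b) \otimes \sigma(X \ast V + c)$, built from two parallel 1-D convolutions; this is a sketch of the idea, not the official implementation, and the channel and kernel sizes are illustrative.

    ```python
    import torch
    import torch.nn as nn

    class GatedConvBlock(nn.Module):
        """Gated linear unit over a 1-D convolution: one conv produces the
        candidate values, a second conv passed through a sigmoid gates them."""
        def __init__(self, channels=128, kernel_size=3):
            super().__init__()
            pad = kernel_size - 1                     # pad, then trim the right side
            self.conv = nn.Conv1d(channels, channels, kernel_size, padding=pad)
            self.gate = nn.Conv1d(channels, channels, kernel_size, padding=pad)

        def forward(self, x):                         # x: (batch, channels, seq_len)
            seq_len = x.size(-1)
            a = self.conv(x)[..., :seq_len]           # keep only the causal outputs
            b = self.gate(x)[..., :seq_len]
            return a * torch.sigmoid(b)

    block = GatedConvBlock()
    x = torch.randn(1, 128, 20)
    print(block(x).shape)                             # torch.Size([1, 128, 20])
    ```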


2011

  1. Tree Recursive Neural Network

    Tree Recursive Neural Network is a model created by Richard Socher et al. and published in this paper: Parsing Natural Scenes and Natural Language with Recursive Neural Networks. The main idea behind the Tree Recursive Neural Network is to provide a sentence embedding that can represent the meaning of the sentence, the same way we did with word embeddings. So, two sentences made of different words, like "the country of my birth" and "the place where I was born", will have similar vectors despite sharing almost no words. The meaning vector of a sentence is actually determined by two things: …
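
    A toy PyTorch sketch of the recursive composition: each parent vector is computed from its two children with a shared weight matrix and a non-linearity, applied bottom-up over the parse tree, so the root vector embeds the whole sentence. The dimensions and the toy tree below are illustrative, not the paper's exact formulation.

    ```python
    import torch
    import torch.nn as nn

    class TreeComposer(nn.Module):
        """Toy recursive composition: parent = tanh(W [child_left; child_right] + b),
        applied bottom-up over a binary parse tree."""
        def __init__(self, dim=50):
            super().__init__()
            self.W = nn.Linear(2 * dim, dim)

        def compose(self, left, right):
            return torch.tanh(self.W(torch.cat([left, right], dim=-1)))

        def forward(self, node, word_vecs):
            # node is either a word index (leaf) or a (left_subtree, right_subtree) pair
            if isinstance(node, int):
                return word_vecs[node]
            left, right = node
            return self.compose(self.forward(left, word_vecs),
                                self.forward(right, word_vecs))

    dim = 50
    words = torch.randn(4, dim)                  # toy embeddings for 4 words
    tree = ((0, 1), (2, 3))                      # a toy binary parse
    print(TreeComposer(dim)(tree, words).shape)  # torch.Size([50])
    ```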


2003

  1. Neural N-gram Language Model

    As we discussed before, the n-gram language model has a few problems, like data sparsity and large storage requirements. These problems were first tackled by Bengio et al. in 2003 in the paper "A Neural Probabilistic Language Model", which introduced the first large-scale deep learning model for natural language processing. This model learns a distributed representation of words, along with the probability function for word sequences expressed in terms of these representations. The idea behind this architecture is to treat the language modeling task as a classification problem where: …
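
    A minimal PyTorch sketch of this classification view, assuming a Bengio-style architecture: embed the previous $n-1$ words, concatenate the embeddings, pass them through one hidden layer, and output a distribution over the vocabulary for the next word. The original model also has direct embedding-to-output connections, which are omitted here; all sizes are illustrative.

    ```python
    import torch
    import torch.nn as nn

    class NeuralNgramLM(nn.Module):
        """Sketch of a Bengio-style model: embed the previous n-1 words, concatenate,
        one tanh hidden layer, then a softmax over the vocabulary (a classification
        over 'which word comes next')."""
        def __init__(self, vocab_size=10000, n=4, emb_dim=64, hidden=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.hidden = nn.Linear((n - 1) * emb_dim, hidden)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, context):                      # context: (batch, n-1) word ids
            e = self.emb(context).flatten(start_dim=1)   # concatenate the n-1 embeddings
            return self.out(torch.tanh(self.hidden(e)))  # logits over the next word

    model = NeuralNgramLM()
    context = torch.randint(0, 10000, (8, 3))            # 8 examples of 3 previous words
    print(model(context).shape)                          # torch.Size([8, 10000])
    ```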


1985

  1. RNN: Recurrent Neural Networks

    The neural n-gram language model we've seen earlier was trained using a fixed-size window over the previous tokens. This falls short on long sentences where the contextual dependencies are longer than the window size. Now, we need a model that is able to capture dependencies outside the window. In other words, we need a system that has some kind of memory to save these long dependencies. …
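
    A minimal PyTorch sketch of the recurrence that provides this memory: the hidden state at step $t$ is computed from the previous hidden state and the current input, $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$, so information can in principle be carried across arbitrarily long distances. The shapes below are illustrative.

    ```python
    import torch

    def rnn_forward(inputs, W_xh, W_hh, b_h):
        """Vanilla RNN recurrence: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
        The hidden state h_t acts as the 'memory' carried across the sequence."""
        hidden_size = W_hh.size(0)
        h = torch.zeros(hidden_size)
        states = []
        for x_t in inputs:                    # iterate over time steps
            h = torch.tanh(W_xh @ x_t + W_hh @ h + b_h)
            states.append(h)
        return torch.stack(states)

    input_size, hidden_size, seq_len = 10, 20, 5
    inputs = torch.randn(seq_len, input_size)
    W_xh = torch.randn(hidden_size, input_size)
    W_hh = torch.randn(hidden_size, hidden_size)
    b_h = torch.zeros(hidden_size)
    print(rnn_forward(inputs, W_xh, W_hh, b_h).shape)   # torch.Size([5, 20])
    ```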