GPT-3
GPT-3 is an enormous model built on the transformer-decoder architecture, published by OpenAI in 2020 in the paper “Language Models are Few-Shot Learners”, whose title is very indicative of what the paper set out to show. The paper didn’t introduce any new architecture; it used the same architecture as GPT-2, just made much bigger and trained on more data.
The whole purpose of the paper is to show that GPT-3 can be applied to a wide variety of tasks using zero-shot, one-shot, or few-shot learning schemes, even reaching competitiveness with prior state-of-the-art fine-tuned models. Before getting into more details about the model, let’s first discuss what these learning schemes mean and how they differ from fine-tuning:
- Few-shot (FS):
It’s the setting where the model is given K (typically 10 to 100) examples of the task at inference time as conditioning, but no weight updates are allowed. As we can see in the following figure, GPT-3 was given three different examples along with the task description (see also the prompt sketch after this list):

- One-shot (1S):
It’s the same as few-shot except that only one demonstration is allowed, in addition to the task description. The reason to distinguish one-shot from few-shot is that it most closely matches the way in which some tasks are communicated to humans:

- Zero-shot (0S):
It’s the same as one-shot except that no demonstrations are allowed, just the task description. This method provides maximum potential for robustness but is also the most challenging setting even for humans.

- Fine-Tuning (FT):
It has been the most common approach in recent years and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Its main drawbacks are the need for a new labeled dataset for every task and potentially poor out-of-distribution generalization:

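To make the difference between these settings concrete, here is a minimal sketch of how such prompts can be assembled, using the English→French translation example from the paper’s figures. The `build_prompt` helper is purely illustrative; the point is that only the text conditioning changes between settings, never the model weights.

```python
# Prompt construction for the three in-context learning settings.
# Only the prompt text changes between settings; the model weights are
# never updated.

TASK_DESCRIPTION = "Translate English to French:"

# Demonstrations used as conditioning (K of them in the few-shot setting).
EXAMPLES = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]

def build_prompt(query: str, k: int) -> str:
    """Build a zero-shot (k=0), one-shot (k=1) or few-shot (k>1) prompt."""
    lines = [TASK_DESCRIPTION]
    for source, target in EXAMPLES[:k]:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")      # the model is asked to complete this line
    return "\n".join(lines)

print(build_prompt("cheese", k=0))   # zero-shot: task description only
print(build_prompt("cheese", k=1))   # one-shot: one demonstration added
print(build_prompt("cheese", k=3))   # few-shot: K demonstrations added
```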
Model
As said earlier, they used the same model and architecture as GPT-2. To study the dependence of performance on model size, they trained 8 different model sizes, as shown in the following table:

Where:
- $n_{\text{params}}$: the total number of trainable parameters.
- $n_{\text{layers}}$: the total number of layers.
- $d_{\text{model}}$: the number of units in each bottleneck layer (the feed-forward layer is always four times the size of the bottleneck layer, $d_{\text{feedforward}} = 4 \times d_{\text{model}}$).
- $n_{\text{heads}}$: the number of attention heads in each layer.
- $d_{\text{head}}$: the dimension of each attention head.
As you can see, GPT-3 is massive: its context window is $n_{\text{ctx}} = 2048$ tokens wide, with about 175 billion learnable parameters spread over 96 transformer-decoder layers.
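As a rough sanity check on these numbers, the bulk of a transformer decoder’s parameters sit in the attention projections (about $4\,d_{\text{model}}^2$ per layer) and the feed-forward block (about $8\,d_{\text{model}}^2$ per layer, given $d_{\text{feedforward}} = 4 \times d_{\text{model}}$), i.e. roughly $12\,n_{\text{layers}}\,d_{\text{model}}^2$ in total. The sketch below plugs in the GPT-3 175B figures; it is an approximation, not an exact count.

```python
# Rough parameter count for a transformer decoder, counting only the
# attention projections and feed-forward matrices in each layer
# (embeddings, biases and layer norms are ignored as relatively small).

def approx_params(n_layers: int, d_model: int) -> int:
    attention = 4 * d_model ** 2            # Q, K, V and output projections
    feed_forward = 2 * 4 * d_model ** 2     # d_model -> 4*d_model -> d_model
    return n_layers * (attention + feed_forward)

# GPT-3 175B configuration: 96 layers, d_model = 12288 (96 heads of 128 dims).
print(f"{approx_params(96, 12288) / 1e9:.0f}B")   # prints "174B"
```

The small gap to the reported 175B comes mostly from the token and position embeddings, which this estimate ignores.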

The data used to train these models is summarized in the following table:

And the following is a comparison of the training compute used for BERT, RoBERTa, T5 and GPT-3. As we can see from the graph, training GPT-3 took several thousand petaflop/s-days of compute, far more than any of the other models.
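As a quick sanity check on that number, a common back-of-the-envelope estimate assumes roughly 6 FLOPs per parameter per training token; plugging in the roughly 175B parameters and roughly 300B training tokens reported for GPT-3 reproduces the right order of magnitude. The sketch below is just this approximation, not an exact accounting.

```python
# Back-of-the-envelope training-compute estimate: ~6 FLOPs per parameter
# per token (forward + backward pass), converted to petaflop/s-days.

PARAMS = 175e9          # GPT-3 parameter count
TOKENS = 300e9          # approximate number of training tokens
FLOPS = 6 * PARAMS * TOKENS

PFLOP_S_DAY = 1e15 * 86_400          # one petaflop/s sustained for one day
print(f"{FLOPS / PFLOP_S_DAY:,.0f} petaflop/s-days")   # ≈ 3,646
```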

Results
The following is a comparison between the different learning schemes used with GPT-3 and the state-of-the-art (fine-tuned) models on various tasks:
- Language Modeling:
  - Dataset: Penn Tree Bank
  - Evaluation Metric: perplexity (see the sketch after this list)
- Long-Range Language Modeling:
  - Dataset: LAMBADA
  - Evaluation Metric: perplexity / accuracy
- Story Completion:
  - Dataset: StoryCloze & HellaSwag
  - Evaluation Metric: accuracy
- Question Answering:
  - Dataset: NaturalQS, WebQS & TriviaQA
  - Evaluation Metric: accuracy
- Machine Translation:
  - Dataset: WMT’14 (Fr↔En), WMT’16 (De↔En) & WMT’16 (Ro↔En)
  - Evaluation Metric: BLEU
- Winograd-Style Tasks (determining which word a pronoun refers to):
  - Dataset: Winograd & WinoGrande (XL)
  - Evaluation Metric: accuracy
- Common Sense Reasoning:
  - Dataset: PIQA, ARC & OpenBookQA
  - Evaluation Metric: accuracy
- Reading Comprehension:
  - Dataset: CoQA, DROP, QuAC, SQuADv2, RACE-h & RACE-m
  - Evaluation Metric: accuracy for RACE-h & RACE-m, F1 for the rest
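Since perplexity is the headline metric for the language-modeling results above, here is a minimal sketch of how it is computed: it is the exponential of the average negative log-likelihood the model assigns to each token of a held-out text. The `token_log_probs` input is a hypothetical placeholder for whatever per-token log-probabilities the model produces.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token.

    `token_log_probs` holds log p(token_i | tokens_<i), in natural log,
    for every token of a held-out corpus.
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Toy example: three tokens, each assigned probability 0.1 by the model.
print(perplexity([math.log(0.1)] * 3))   # ≈ 10.0 (lower is better)
```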