GPT-3
GPT-3 is an enormous model built on the transformer-decoder architecture, published by OpenAI in 2020 in the paper “Language Models are Few-Shot Learners”, whose title is very indicative of what the paper set out to show. The paper didn’t introduce any new architecture; it used the same architecture as GPT-2, just made much bigger and trained on more data.
The whole purpose of the paper is to show that GPT-3 can be applied to a wide variety of tasks using zero-shot, one-shot, or few-shot learning schemes, even reaching competitiveness with prior state-of-the-art fine-tuned models. Before getting into more details about the model, let’s first discuss what these learning schemes mean and how they differ from fine-tuning:
- Few-shot (FS):
It’s the setting where the model is given K (typically 10 to 100) examples of the task at inference time as conditioning, but no weight updates are allowed. As we can see in the following figure, GPT-3 was given three different examples along with the task description (see also the prompt sketch after this list):

- One-shot (1S):
It’s the same as few-shot except that only one demonstration is allowed, in addition to the task description. The reason to distinguish one-shot from few-shot is that it most closely matches the way in which some tasks are communicated to humans:

- Zero-shot (0S):
It’s the same as one-shot except that no demonstrations are allowed, just the task description. This method provides maximum potential for robustness but is also the most challenging setting even for humans.

- Fine-Tuning (FT):
It has been the most common approach in recent years and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Its main drawbacks are the need for a new labeled dataset for every task and potentially poor out-of-distribution generalization:

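To make the difference between these settings concrete, here is a minimal sketch of how such prompts can be assembled, using the English→French translation example from the paper’s figures. The `build_prompt` helper is purely illustrative; the point is that only the text conditioning changes between settings, never the model weights.

```python
# Prompt construction for the three in-context learning settings.
# Only the prompt text changes between settings; the model weights are
# never updated.

TASK_DESCRIPTION = "Translate English to French:"

# Demonstrations used as conditioning (K of them in the few-shot setting).
EXAMPLES = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]

def build_prompt(query: str, k: int) -> str:
    """Build a zero-shot (k=0), one-shot (k=1) or few-shot (k>1) prompt."""
    lines = [TASK_DESCRIPTION]
    for source, target in EXAMPLES[:k]:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")      # the model is asked to complete this line
    return "\n".join(lines)

print(build_prompt("cheese", k=0))   # zero-shot: task description only
print(build_prompt("cheese", k=1))   # one-shot: one demonstration added
print(build_prompt("cheese", k=3))   # few-shot: K demonstrations added
```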
Model
As said earlier, they used the same model and architecture as GPT-2. To study the dependence of performance on model size, they trained 8 different model sizes, as shown in the following table:

Where:
- $n_{\text{params}}$: the total number of trainable parameters.
- $n_{\text{layers}}$: the total number of layers.
- $d_{\text{model}}$: the number of units in each bottleneck layer (the feed-forward layer is always four times the size of the bottleneck layer, $d_{\text{feedforward}} = 4 \times d_{\text{model}}$).
- $n_{\text{heads}}$: the number of attention heads in each layer.
- $d_{\text{head}}$: the dimension of each attention head.
As you can see, GPT-3 is massive: its context window is $n_{\text{ctx}} = 2048$ tokens wide, with about 175 billion learnable parameters spread over 96 transformer-decoder layers.
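As a rough sanity check on these numbers, the bulk of a transformer decoder’s parameters sit in the attention projections (about $4\,d_{\text{model}}^2$ per layer) and the feed-forward block (about $8\,d_{\text{model}}^2$ per layer, given $d_{\text{feedforward}} = 4 \times d_{\text{model}}$), i.e. roughly $12\,n_{\text{layers}}\,d_{\text{model}}^2$ in total. The sketch below plugs in the GPT-3 175B figures; it is an approximation, not an exact count.

```python
# Rough parameter count for a transformer decoder, counting only the
# attention projections and feed-forward matrices in each layer
# (embeddings, biases and layer norms are ignored as relatively small).

def approx_params(n_layers: int, d_model: int) -> int:
    attention = 4 * d_model ** 2            # Q, K, V and output projections
    feed_forward = 2 * 4 * d_model ** 2     # d_model -> 4*d_model -> d_model
    return n_layers * (attention + feed_forward)

# GPT-3 175B configuration: 96 layers, d_model = 12288 (96 heads of 128 dims).
print(f"{approx_params(96, 12288) / 1e9:.0f}B")   # prints "174B"
```

The small gap to the reported 175B comes mostly from the token and position embeddings, which this estimate ignores.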

The data used to train these models is summarized in the following table:

And the following is a comparison of the training compute used for BERT, RoBERTa, T5 and GPT-3. As we can see from the graph, training GPT-3 took several thousand petaflop/s-days of compute, far more than any of the other models.
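As a quick sanity check on that number, a common back-of-the-envelope estimate assumes roughly 6 FLOPs per parameter per training token; plugging in the roughly 175B parameters and roughly 300B training tokens reported for GPT-3 reproduces the right order of magnitude. The sketch below is just this approximation, not an exact accounting.

```python
# Back-of-the-envelope training-compute estimate: ~6 FLOPs per parameter
# per token (forward + backward pass), converted to petaflop/s-days.

PARAMS = 175e9          # GPT-3 parameter count
TOKENS = 300e9          # approximate number of training tokens
FLOPS = 6 * PARAMS * TOKENS

PFLOP_S_DAY = 1e15 * 86_400          # one petaflop/s sustained for one day
print(f"{FLOPS / PFLOP_S_DAY:,.0f} petaflop/s-days")   # ≈ 3,646
```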

Results
The following is a comparison between the different learning schemes used with GPT-3 and the state-of-the-art (fine-tuned) models on various tasks:
- Language Modeling:
  - Dataset: Penn Tree Bank
  - Evaluation Metric: perplexity (see the sketch after this list)
- Long-Range Language Modeling:
  - Dataset: LAMBADA
  - Evaluation Metric: perplexity / accuracy
- Story Completion:
  - Dataset: StoryCloze & HellaSwag
  - Evaluation Metric: accuracy
- Question Answering:
  - Dataset: NaturalQS, WebQS & TriviaQA
  - Evaluation Metric: accuracy
- Machine Translation:
  - Dataset: WMT’14 (Fr↔En), WMT’16 (De↔En) & WMT’16 (Ro↔En)
  - Evaluation Metric: BLEU
- Winograd-Style Tasks (determining which word a pronoun refers to):
  - Dataset: Winograd & WinoGrande (XL)
  - Evaluation Metric: accuracy
- Common Sense Reasoning:
  - Dataset: PIQA, ARC & OpenBookQA
  - Evaluation Metric: accuracy
- Reading Comprehension:
  - Dataset: CoQA, DROP, QuAC, SQuADv2, RACE-h & RACE-m
  - Evaluation Metric: accuracy for RACE-h & RACE-m, F1 for the rest
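Since perplexity is the headline metric for the language-modeling results above, here is a minimal sketch of how it is computed: it is the exponential of the average negative log-likelihood the model assigns to each token of a held-out text. The `token_log_probs` input is a hypothetical placeholder for whatever per-token log-probabilities the model produces.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token.

    `token_log_probs` holds log p(token_i | tokens_<i), in natural log,
    for every token of a held-out corpus.
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Toy example: three tokens, each assigned probability 0.1 by the model.
print(perplexity([math.log(0.1)] * 3))   # ≈ 10.0 (lower is better)
```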