GPT-3 is an enormous model built on the transformer-decoder architecture, published in 2020 by OpenAI in the paper “Language Models are Few-Shot Learners”, whose title is very indicative of what the paper set out to show. The paper didn’t introduce any new architecture; it used the same architecture as GPT-2, just made much bigger and trained on far more data.

The whole purpose of this paper is to show that GPT-3 can handle a variety of tasks using zero-shot, one-shot, or few-shot learning schemes, even reaching competitiveness with prior state-of-the-art fine-tuned models. Before getting into more details about the model, let’s first discuss what I mean by these learning schemes and how they differ from fine-tuning:

  • Few-shot (FS):
    It’s the setting where the model is given K (usually from 10 to 100) examples of the task at inference time as conditioning, but no weight updates are allowed. As we can see in the following figure, GPT-3 was given three different examples along with the task description:
  • One-shot (1S):
    It’s the same as few-shot except that only one demonstration is allowed, in addition to the task description. The reason to distinguish one-shot from few-shot is that it most closely matches the way in which some tasks are communicated to humans:
  • Zero-shot (0S):
    It’s the same as one-shot except that no demonstrations are allowed, just the task description. This method provides maximum potential for robustness but is also the most challenging setting even for humans.
  • Fine-Tuning (FT):
    It has been the most common approach in recent years and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. This setting suffers from poor out-of-distribution generalization:
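To make the distinction between these schemes concrete, here is a minimal sketch of how such prompts are assembled as plain text. The `build_prompt` helper and the `=>` separator are my own illustrative choices, not an API from the paper; the English-to-French example pairs are the ones shown in the paper’s figures:

```python
# Sketch of how zero-, one-, and few-shot prompts are assembled for GPT-3.
# build_prompt and the "=>" separator are illustrative, not from the paper.
def build_prompt(task_description, examples, query):
    """Concatenate a task description, K demonstrations, and the query.

    K = 0 -> zero-shot, K = 1 -> one-shot, K >= 2 -> few-shot.
    No gradient updates happen; the model only conditions on this text."""
    lines = [task_description]
    for src, tgt in examples:          # the K in-context demonstrations
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")        # the model completes after the arrow
    return "\n".join(lines)

# Few-shot (K = 3) English-to-French translation prompt:
demos = [("sea otter", "loutre de mer"),
         ("peppermint", "menthe poivrée"),
         ("plush giraffe", "girafe en peluche")]
prompt = build_prompt("Translate English to French:", demos, "cheese")
print(prompt)
```

With `demos = []` the same helper produces a zero-shot prompt, and with a single pair it produces the one-shot setting; the only thing that changes is the conditioning text, never the weights.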


As said earlier, they used the same model and architecture as GPT-2. To study the dependence of performance on model size, they trained eight different model sizes, as shown in the following table:


  • $n_{\text{params}}$: is the total number of trainable parameters.

  • $n_{\text{layers}}$: is the total number of layers.

  • $d_{\text{model}}$: is the number of units in each bottleneck layer (the feed-forward layer is always four times the size of the bottleneck layer, $d_{\text{feedforward}} = 4 \times d_{\text{model}}$).

  • $n_{\text{heads}}$: is the number of attention heads in each layer.

  • $d_{\text{head}}$: is the dimension of each attention head.

As you can see, GPT-3 is massive: its context window is $n_{\text{ctx}} = 2048$ tokens wide, and it has about 175 billion learnable parameters spread over 96 transformer-decoder layers.
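The table’s dimensions can be sanity-checked against each other. The snippet below uses the largest configuration’s values from the paper’s model-size table; the $12 \, n_{\text{layers}} \, d_{\text{model}}^2$ parameter estimate is a common back-of-the-envelope approximation (covering the attention and feed-forward weight matrices), not a formula from the paper:

```python
# Largest GPT-3 configuration, values from the paper's model-size table.
n_layers, d_model, n_heads, d_head = 96, 12288, 96, 128

# Each layer's feed-forward width is four times the model width.
d_ff = 4 * d_model

# The model width splits evenly across the attention heads.
assert d_model == n_heads * d_head

# Rough parameter count: ~12 * n_layers * d_model^2 (a standard
# back-of-the-envelope estimate, ignoring embeddings and biases).
approx_params = 12 * n_layers * d_model**2
print(f"~{approx_params / 1e9:.0f}B parameters")  # close to the reported 175B
```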

The data used to train this model is summarized in the following table:

And the following is a comparison of the training compute required for BERT, RoBERTa, T5, and GPT-3. As we can see from the graph, GPT-3 required by far the most compute of the four.


The following is a comparison among the different learning schemes used with GPT-3 and state-of-the-art (fine-tuned) models on various tasks:

  • Language Modeling:

    • Dataset: Penn Tree Bank

    • Evaluation Metric: Perplexity

  • Long-Range Language Modeling:

    • Dataset: LAMBADA

    • Evaluation Metric: Perplexity / Accuracy

  • Story Completion:

    • Dataset: StoryCloze & HellaSwag

    • Evaluation Metric: Accuracy

  • Question Answering:

    • Dataset: NaturalQS, WebQS & TriviaQA

    • Evaluation Metric: Accuracy

  • Machine Translation:

    • Dataset: WMT’14 (Fr↔En), WMT’16 (De↔En) & WMT’16 (Ro↔En).

    • Evaluation Metric: BLEU

  • Winograd-Style Tasks: determining to which word a pronoun refers

    • Dataset: Winograd & Winogrande

    • Evaluation Metric: Accuracy

  • Common Sense Reasoning:

    • Dataset: PIQA, ARC, OpenBookQA

    • Evaluation Metric: Accuracy

  • Reading Comprehension:

    • Dataset: CoQA, DROP, QuAC, SQuADv2, RACE-h, RACE-m.

    • Evaluation Metric: Accuracy for RACE-h & RACE-m, and F1 for the rest.
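
Since perplexity recurs as the metric for the language-modeling tasks above, here is a minimal sketch of how it is computed: it is the exponentiated average negative log-likelihood the model assigns to the held-out tokens. The `perplexity` helper is my own illustration, not code from the paper:

```python
import math

def perplexity(token_probs):
    """token_probs: the probability the model assigned to each correct token.

    Perplexity = exp(mean negative log-likelihood); lower is better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity ~4:
# on average it is as uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Intuitively, a perplexity of $k$ means the model is, on average, as uncertain as if it were choosing uniformly among $k$ tokens at each step, which is why lower perplexity on Penn Tree Bank or LAMBADA indicates a better language model.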