March 14 was a huge day for the AI community. OpenAI released GPT-4, a multi-modal language model (MLLM) that has commonsense reasoning for both text and images while being able to operate with a context length of 32,000 tokens. Incredibly, GPT-4 was released less than one hour after Anthropic announced their own model, Claude. Claude is a text-only model with a context window of ~9,000 tokens.
Since GPT-4 can perceive images as well as text, it demonstrates impressive behavior such as visual question answering and image captioning. Having a longer context length (up from GPT-3’s 4,096) is of major practical significance; a single prompt can cover hundreds of pages. This could be enough to contain a legal contract, a short story, or a company’s internal documents.
GPT-4 isn’t the first MLLM that has vision capabilities. In early March 2023, Microsoft released KOSMOS-1 which is trained on interleaved text and images. DeepMind trained a similar model, Flamingo, in April 2022. These models can engage in dialogue on images, image captioning, and visual question answering in a zero-shot manner, meaning they can solve problems they were not explicitly trained to solve.
This release follows several models from OpenAI that have been of interest to the ML community recently, including DALLE-2, Whisper, and ChatGPT. For those interested, we previously posted a deep-dive into Whisper and how it works.
OpenAI has not revealed information about the size, architecture, or training specifics of GPT-4*. Therefore, the purpose of this post is to present what is known and explain how models with similar capabilities (KOSMOS-1 and Flamingo) work.
Language Modelling with Transformers: A Recap
A generative language model (LM) is trained to predict the next token** in a sequence, based on some tokens that have already been provided. Through this task, the model learns to use past context to build representations of language. The training process incentivizes the model to maximize the predicted probability of the “true” token, and to minimize the probabilities of the other tokens, using cross-entropy loss. Intuitively, a good LM would generate a high probability for the token “mat” when given the context “The cat sat on the”.
GPT style models are decoder-only transformers which take in a sequence of tokens (in the form of token embeddings) and generate a sequence of output tokens, one at a time. Concretely, token embeddings are converted to a sequence of features that represent the input sequence. Each layer of the model refines this representation of the input using the features learnt from the previous layer. Finally, the features of the final layer are used to generate a sequence of output tokens.
The key innovation of the transformer architecture is the use of the self-attention mechanism. Self-attention allows the model to process all tokens in the input sequence in parallel, rather than sequentially and ‘attend to’ (or share information between) different positions in the sequence.
If you’d like to understand how this works in more detail, we recommend this amazing illustrated example.
Multi-Modal Language Modelling
GPT-4 is trained on both text and images. Its dataset is likely similar to that of KOSMOS-1, which is summarized in Table 1. GPT-3 was trained on text corpora totaling roughly 300 billion tokens. The implications of DeepMind’s Chinchilla LM showed that increasing the amount of data to 1.4 trillion tokens, as well as increasing parameter count, is necessary for improving performance. We speculate that OpenAI scaled up the dataset for GPT-4 to a similar size as used by Chinchilla, or more.
|The Pile, Common Crawl
|Image caption pairs
|LAION-2B, LAION-400M, COYO-700M
|<s><image>Image Embedding </image> </s>
|Interleaved text and images
|Collected from Common Crawl HTML
|<s>Text<image>Image Embedding </image> </s>
Table 1: The modalities, sources, and input structure of KOSMOS-1's training data.
For training, each modality must be converted to a representation in the same embedding space. In other words, we need a sequence of same-length vectors that are generated from text and images.
For text, this is straightforward since the tokens are already discretized. In KOSMOS-1, each token is assigned an embedding learned during training, the consequence being that words of similar semantic meaning become closer in the embedding space.
KOSMOS-1 deals with images using the MetaLM approach. This provides a general-purpose interface supporting natural language interactions with other non-causal models. A pre-trained image encoder generates embeddings that are passed through a connector layer, which projects to the same dimension as the text embeddings. KOSMOS-1 can then handle image embeddings while predicting text tokens, as shown in Figure 1.
To do this, the model must learn the relationship between text and images. Each image consists of multiple embeddings (positional locations 7-12 in Figure 1) which are passed through the transformer. During training, only the embedding predicted after seeing all the image embeddings (e.g. x9 in Figure 1) is used to calculate the loss. When predicting this token, the transformer can still attend to all the image embeddings, thus allowing the model to learn a relationship between text and images.
Figure 1: Diagram of MetaLM: a general-purpose interface used by KOSMOS-1 to enable visual language modelling. KOSMOS-1 only outputs text tokens, so predictions corresponding to tokens at the same position of intermediate pre-trained model embeddings are ignored during training (e.g. positions 7-11). At position 12, the multimodal input finishes and we then use token prediction x9. (image source)
The architecture used for the image encoder is a pre-trained Vision Transformer (ViT) . This is common for image processing tasks. The ViT applies a series of convolutional layers to an image to generate a set of “patches”, as shown in Figure 2. These image patches are flattened and transformed into a sequence of tokens, which are processed by the transformer to produce an output embedding. The ViT is encoder-only.
The approach used to train the ViT is the Contrastive Language-Image Pre-Training (CLIP) task. Roughly speaking, images and text share an embedding space, and the model is trained such that matching image-text pairs have a high cosine similarity.
During KOSMOS-1 training, the ViT parameters are frozen, except for the last layer. The exact model is CLIP ViT-L/14. GPT-4 may use this approach as well. Alternatively, it’s not unreasonable that with enough data, the image encoder can be trained from scratch.
Figure 2: Overview of the Vision Transformer (ViT) architecture used as the image encoder of KOSMOS-1. The image is split into patches and processed by the transformer. (image source)
Flamingo uses a different approach to multimodal language modelling. This could be a more likely architecture for GPT-4 since it was released in April 2022, and OpenAI’s GPT-4 pre-training was completed in August.
Flamingo also relies on a pre-trained image encoder, but instead uses the generated embeddings in cross-attention layers that are interleaved in a pre-trained LM (Figure 3).
Figure 3: Flamingo’s text and image processing. Each image embedding is generated by a pre-trained image encoder. These are passed through a resampler that ensures a fixed-length representation. The embeddings are used in cross attention layers that are inserted into a pre-trained LM. (image source)
To learn more about vision language models, we recommend this HuggingFace blog.
Aligning LMs with User Intent
Scaling LMs does not necessarily make them safer or more useful. This is because the next token prediction is not the same as “produce a helpful and harmless output”.
To align LMs with user intent, GPT-4 has been fine-tuned using Reinforcement Learning from Human Feedback (RLHF). OpenAI first applied this to LMs in InstructGPT. The method in this paper consists of three key steps:
1) Supervised LM training on hand-labeled examples, designed to demonstrate “good” behavior.
2) Gather human-labeled preference data on example outputs from the LM. Use these to train a reward model (RM).
3) Fine-tune the LM to maximize reward.
Steps 2 and 3 are iterated upon until the finetuning yields diminishing results.
There is little doubt that massive real-world usage of ChatGPT has allowed OpenAI to gain vast amounts of preference data. It has also been confirmed that GPT-4 is the model behind Bing’s AI-powered search engine.
Anthropic’s recent paper on Constitutional AI takes this further. The key idea is that human supervision comes from a set of principles – called a constitution – that we want an AI to follow. This is first used during a supervised learning phase, which has the following steps:
1) Provide the model with several prompts that are likely to produce undesirable (e.g. malicious) answers.
2) Ask the model to critique its response according to the constitution and then revise the original response.
3) Fine-tune the LM on the revised responses.
A second RL stage is then used to tune the model further. This mimics steps 2 & 3 of RLHF, except human preferences are replaced by a mixture of human and AI preferences (for more details, see the original paper). The key benefit of Constitutional AI over RLHF is that it substantially reduces the amount of human labeling required. Anthropic have confirmed that Claude was fine-tuned using this approach.
Despite this research, major issues remain. ChatGPT users will be familiar with hallucinations, wildly incorrect answers (delivered with high confidence), and the potential for offensive and/or dangerous responses when given an adversarial prompt. Indeed, examples of Bing Chat’s sometimes extremely aggressive outputs have become commonplace on social media. Nobody currently understands how to concretely align LMs with human values (or what this means in practice). See “What Next” for more discussion on AI safety.
Capabilities of GPT-4
The capabilities of GPT-4 can be broken down into two domains: text-only and text-vision tasks.
Since the performance of GPT-3.5 is so impressive, the improvements obtained by GPT-4 may not be immediately obvious to a user. However, OpenAI’s technical report provides a performance comparison on a variety of academic exams, as shown in Figure 4.
Figure 4: Performance of GPT-3.5 and GPT-4 on a variety of academic exams, taken from the GPT-4 technical report. (image source)
The technical report also provides evidence that GPT-4 “considerably outperforms existing language models” on traditional benchmarks language modeling benchmarks.
The extent of GPT-4's visual reasoning capabilities is less clear. OpenAI has not made image inputs available for public use, and the only production environment in which they’ve been deployed is in a partnership with Be My Eyes. The technical report is vague, describing the model as having “similar capabilities as it does on text-only inputs”, and providing a few examples. Figure 5 shows one taken from the appendix.
Figure 5: An example output of GPT-4's visual reasoning with an image input. (image source)
The model is capable of both image captioning and visual question answering, like KOSMOS-1 as shown in Figure 6.
Figure 6: Demonstrations of KOSMOS-1's capabilities. (image source)
This type of capability is typically measured on a Visual Question-Answering dataset. However, publicly available data on GPT-4 and KOSMOS-1 accuracies only overlap on one dataset, making direct comparison difficult.
Yet, in the developer live stream, Greg Brockman (OpenAI President) provided some more demonstrations of GPT-4's visual reasoning. We found the most impressive example to be that GPT-4 could take a hand-drawn mockup of a website and turn it into functioning HTML, including buttons to show jokes it had created (see this Tweet). The new playground also allows the user to specify the behavior of the system, separately from providing the main prompt. For example, Greg gives the system the following instructions:
This promotes chain-of-thought reasoning, which helps to boost performance for certain tasks.
We’ve learnt about how GPT-4 works. But what capabilities will future models possess?
The next generation of GPT models will likely be trained to understand audio, allowing the model to identify sounds or perform transcription. The MetaLM framework (Figure 1) allows the ability to add audio representations from a pre-trained audio encoder, such as that used by Whisper. With enough compute, it is possible to also train on video.
Multi-modal modelling can be extended further to generate images, audio and video. This requires that each signal is discretized into tokens which can be converted back to a coherent signal. Importantly, the lossy compression must not throw away significant information, otherwise, it would diminish the quality of the reconstructed signal. A VQ-VAE, such as that used by OpenAI’s Jukebox, allows the audio to be converted to tokens which can be modelled. VideoBERT uses hierarchical K-means clustering to generate tokens from visual features. This could allow future GPT models to be able to generate art (like DALL-E 2 ) or music (like AudioLM ) from text or speech prompts. Questions from the user can even be answered with memes or GIFs. Moreover, you’d be able to have a conversation and ask it to respond in Morgan Freeman’s voice.
As the growth of capabilities accelerates, there must be renewed focus on AI safety. Foundation models such as GPT-4 are good at generalizing unseen tasks – something which has traditionally been restricted to humans. If companies naïvely give systems agency without proper consideration, they could start to optimize for a goal we didn’t intend. This could lead to unintended and potentially harmful consequences.
This problem of “intent alignment” is a substantial open problem that we, as a community, need to solve. We need to ensure that capabilities grow in tandem with AI safety. This is likely a big reason why OpenAI has not released a paper with detailed implementation details for GPT-4.
Speechmatics are planning to utilize large LMs to extract useful information from transcription. In our latest release, Ursa, we deliver the world’s most accurate speech-to-text system by scaling our self-supervised model to over 2 billion parameters. Find out more and try it for free with just one click.