Oct 13, 2023

Are LLMs the Future of Image Generation?

A look at why visual tokenizers like MAGVIT-v2 make language-model-based image and video generation more competitive with diffusion models.

In image generation, diffusion models such as DALL-E and Midjourney are still considered state of the art (SOTA). But what if we used large language models (LLMs) for image generation instead?

Over the past two years, several approaches to using LLMs for image generation have emerged, yet none managed to surpass diffusion models. That changed with a recent paper titled Language Model Beats Diffusion - Tokenizer is Key to Visual Generation. The paper introduces a new visual tokenizer that lets a language model catch up with, and on certain benchmarks outperform, state-of-the-art diffusion models.

Tokenizer is Key to Visual Generation

The authors argue that the main obstacle to image generation with LLMs has been the lack of a strong visual representation, one that plays the role natural-language tokens play for text and that faithfully models the visual world. More specifically, they point to the difficulty previous approaches had when training models over a large vocabulary. They address this with a new visual tokenizer, MAGVIT-v2, which uses a lookup-free quantization (LFQ) method. A tokenizer's job is to map images and videos into compact tokens (a discrete numerical representation). MAGVIT-v2 combines the earlier MAGVIT work with the VQ-VAE framework and changes how the quantized representation is stored: instead of looking up the nearest embedding vector in a learned codebook, each latent dimension is quantized independently. The authors show that it is beneficial to shrink the codebook's embedding dimension while growing the vocabulary size.
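To make the lookup-free idea concrete, here is a minimal pure-Python sketch. It assumes the simplest binary variant, where each latent dimension is snapped to -1 or +1 by its sign and the token id is read off as a binary number. The function names `lfq_quantize` and `lfq_dequantize` are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
from typing import List, Tuple


def lfq_quantize(latent: List[float]) -> Tuple[List[int], int]:
    """Quantize one latent vector without a codebook lookup (illustrative sketch).

    Each dimension is quantized independently to -1 or +1 by its sign.
    The token id treats positive dimensions as set bits, so a d-dimensional
    latent yields a vocabulary of 2**d possible tokens.
    """
    codes = [1 if value > 0 else -1 for value in latent]
    token_id = sum(1 << i for i, code in enumerate(codes) if code > 0)
    return codes, token_id


def lfq_dequantize(token_id: int, num_dims: int) -> List[int]:
    """Recover the -1/+1 code from a token id; no embedding table is needed."""
    return [1 if (token_id >> i) & 1 else -1 for i in range(num_dims)]


codes, token_id = lfq_quantize([0.3, -1.2, 0.0, 2.5])
print(codes, token_id)
print(lfq_dequantize(token_id, num_dims=4))
```

In this sketch, adding one latent dimension doubles the vocabulary while the quantizer itself stores no embedding vectors, which is one way to read the trade-off between embedding dimension and vocabulary size described above.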

Technical Details

MAGVIT-v2 is a video tokenizer designed to produce concise, expressive tokens for both videos and images using a common token vocabulary. The authors report that an LLM equipped with this tokenizer outperforms diffusion models on standard image and video generation benchmarks. They explored two designs that combine different techniques, including causal temporal 3D convolution and temporal convolutional subsampling. In their empirical comparison, the causal 3D CNN performed best among the designs. They also make architectural modifications to the MAGVIT model, such as changing the encoder downsamplers and adding adaptive group normalization layers.
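The causal part can be illustrated along the time axis alone. The sketch below is a simplified one-dimensional stand-in, assuming each frame is reduced to a single number and that all padding is placed before the first frame; the function name `causal_temporal_conv` and the kernel values are hypothetical. A real 3D convolution would also span height and width.

```python
from typing import List, Sequence


def causal_temporal_conv(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Convolve over time so that output t depends only on frames 0..t (sketch).

    All temporal padding (kernel_size - 1 zeros) goes before the first frame,
    so no output ever looks at a future frame.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


kernel = [0.25, 0.25, 0.5]  # illustrative weights; the last one applies to the current frame
video = [1.0, 2.0, 3.0, 4.0]
image = [1.0]  # a single image is treated as a one-frame video

print(causal_temporal_conv(video, kernel))
print(causal_temporal_conv(image, kernel))
```

Because the first output depends only on the first frame, a single image passes through the same layer as a one-frame video, which is consistent with sharing one tokenizer and vocabulary across images and videos.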

Promising, Yet Challenges Ahead

While this result is promising, it's essential to remember that beating benchmarks doesn't always mean a better-performing model in every scenario. Nevertheless, this week has seen other intriguing developments in LLMs and diffusion models:

Ferret: Refer and Ground Anything Anywhere at Any Granularity introduces a multimodal large language model (MLLM) aimed at spatial language understanding within images.
ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Model explores the use of pre-trained vision-and-language models (VLMs) and LLMs for visual commonsense reasoning.

In conclusion, the discussion around the potential of LLMs in image generation is gaining momentum. The role of innovative technologies like MAGVIT-v2 and the exploration of multimodality and commonsense reasoning underline a very interesting path forward in the field of LLMs.
