Nov 8, 2023
The Technology Behind Illusion-Like Generated Images
A quick explanation of ControlNet and how diffusion models can generate QR-code and illusion-like images.
Recently, there has been a surge of interest in the generation of illusion-like images. Although their popularity peaked over the last few weeks, the trend traces back to the paper "Adding Conditional Control to Text-to-Image Diffusion Models" (https://arxiv.org/abs/2302.05543), published in February of this year. Around April, a new model was uploaded to HuggingFace, accompanied by a popular Reddit post. Both the model and the Reddit post showed how diffusion models can be used to craft creative QR codes. More recently, the internet has been buzzing with high-quality, attention-grabbing illusion-like images, ranging from intricate swirls to chessboards, all generated using these models.
How does it work?
The paper introduced a "supporting" model named ControlNet. As the name suggests, the role of a ControlNet is to guide and control the image generation process, e.g. to direct the diffusion model to generate photographs based on a provided condition image. In terms of architecture, ControlNet is a trainable copy of part of the diffusion model (its encoding blocks) that is connected to the original model through "zero convolution" layers, i.e. 1x1 convolutions whose weights and biases are initialized to zero. So, in essence, ControlNet is just an additional part of the diffusion model.
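To make the "zero convolution" idea more concrete, here is a minimal NumPy sketch of a single block. It is not the real ControlNet code: `block`, `conv1x1`, `zero_conv_params` and `controlled_block` are hypothetical names, and the "network" is just a channel mixing followed by a ReLU. The only point it illustrates is that, because the connecting 1x1 convolutions start at zero, attaching the trainable copy does not change the output of the original model before training begins.

```python
import numpy as np

def conv1x1(features, weight, bias):
    """1x1 convolution: mixes channels independently at every pixel.

    features: (channels_in, H, W), weight: (channels_out, channels_in), bias: (channels_out,)
    """
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]

def zero_conv_params(channels):
    """A 'zero convolution' is a 1x1 convolution whose weight and bias start at zero."""
    return np.zeros((channels, channels)), np.zeros(channels)

def block(features, weight):
    """Toy stand-in for one network block: channel mixing followed by a ReLU."""
    return np.maximum(np.einsum("oc,chw->ohw", weight, features), 0.0)

def controlled_block(x, condition, frozen_weight, copy_weight, zero_in, zero_out):
    """Frozen block plus a trainable copy, joined through two zero convolutions."""
    frozen_out = block(x, frozen_weight)
    control_in = x + conv1x1(condition, *zero_in)       # condition enters via a zero conv
    control_out = conv1x1(block(control_in, copy_weight), *zero_out)
    return frozen_out + control_out                     # control is added as a residual

rng = np.random.default_rng(0)
channels, height, width = 4, 8, 8
x = rng.normal(size=(channels, height, width))
condition = rng.normal(size=(channels, height, width))
frozen_weight = rng.normal(size=(channels, channels))
copy_weight = frozen_weight.copy()  # the trainable copy starts from the same weights

output = controlled_block(x, condition, frozen_weight, copy_weight,
                          zero_conv_params(channels), zero_conv_params(channels))
print(np.allclose(output, block(x, frozen_weight)))  # True: no influence before training
```

Once training moves the zero convolution weights away from zero, the copy starts contributing to the output, so the control signal is introduced gradually rather than disturbing the pretrained model from the first step.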
In my opinion, the huge advantage of ControlNet is that it is a separate module that can be easily attached to or detached from the diffusion model, and we can control through parameters how strongly it influences the output. This architecture results in a potent and controllable generation model.
How was it trained?
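The "how much influence" knob can be pictured as a simple weight on whatever the ControlNet adds to the base model's features. The snippet below is only an illustrative sketch with made-up numbers, and `apply_control` is a hypothetical helper. To my knowledge, the Hugging Face diffusers ControlNet pipelines expose a similar knob as `controlnet_conditioning_scale`, but the code here does not call that library.

```python
import numpy as np

def apply_control(base_features, control_residual, conditioning_scale=1.0):
    """Add the ControlNet residual to the base features, weighted by a scale."""
    return base_features + conditioning_scale * control_residual

base_features = np.array([0.2, 0.5, 0.8])      # what the diffusion model computes on its own
control_residual = np.array([1.0, -1.0, 0.5])  # what the ControlNet would like to add

for scale in (0.0, 0.5, 1.0):
    print(scale, apply_control(base_features, control_residual, scale))
```

A scale of 0 corresponds to detaching the ControlNet entirely, while larger values make the output follow the condition image more strictly, usually at the cost of some freedom for the prompt.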
First, let's look into how the training dataset was built.
- Dataset: We need pairs of images: a condition image that serves as the input, and a target image that should be generated from that condition. The original authors showed that such datasets can be built automatically by working backwards, e.g. taking a photograph and extracting its edges (e.g. with the Canny method), or simply taking segmentation masks from an existing open dataset.
- Training: During training, we freeze the original diffusion model's weights and train only the ControlNet. This way, we get a model that steers the original diffusion model to generate realistic-looking photographs that follow the shape of the drawn edges.
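The two steps above can be sketched in a few lines of NumPy. This is a toy under heavy assumptions, not the authors' training code: the "photographs" are random rectangles, `edge_map` is a crude thresholded gradient used in place of Canny, the frozen "diffusion model" is a single weight matrix that always predicts mid-gray, and the ControlNet is one zero-initialized linear layer. All function names are hypothetical.

```python
import numpy as np

def edge_map(image, threshold=0.2):
    """Crude edge detector: thresholded gradient magnitude (a stand-in for Canny)."""
    grad_rows, grad_cols = np.gradient(image.astype(float))
    return (np.hypot(grad_rows, grad_cols) > threshold).astype(float)

def random_photo(rng, size=8):
    """Toy 'photograph': one bright 3x3 rectangle on a dark background."""
    photo = np.zeros((size, size))
    top, left = rng.integers(0, size - 3, size=2)
    photo[top:top + 3, left:left + 3] = 1.0
    return photo

def build_dataset(rng, n_pairs=32):
    """Derive each condition image from its target image."""
    photos = [random_photo(rng) for _ in range(n_pairs)]
    conditions = np.stack([edge_map(p).ravel() for p in photos])
    targets = np.stack([p.ravel() for p in photos])
    return conditions, targets

def train_control(conditions, targets, base_weights, steps=200, lr=0.5):
    """Fit only the control weights; the base weights are never updated."""
    n_samples, n_pixels = targets.shape
    control_weights = np.zeros((conditions.shape[1], n_pixels))  # no influence at the start
    prompt = np.ones((n_samples, base_weights.shape[0]))  # dummy stand-in for a text prompt
    base_output = prompt @ base_weights                   # frozen path, computed once
    losses = []
    for _ in range(steps):
        error = base_output + conditions @ control_weights - targets
        losses.append(float(np.mean(error ** 2)))
        gradient = 2.0 * conditions.T @ error / error.size
        control_weights -= lr * gradient  # gradient step on the control branch only
    return control_weights, losses

rng = np.random.default_rng(0)
conditions, targets = build_dataset(rng)
base_weights = np.full((1, targets.shape[1]), 0.5)  # frozen "model": always mid-gray
frozen_copy = base_weights.copy()

control_weights, losses = train_control(conditions, targets, base_weights)
print(f"loss before: {losses[0]:.3f}, loss after: {losses[-1]:.3f}")
print("base weights unchanged:", np.array_equal(base_weights, frozen_copy))
```

Even in this toy setup the key property is visible: the base weights are identical before and after training, and all of the improvement in the loss comes from the control branch learning how edges relate to the target image.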
Possibilities and Limitations
In conclusion, the ability to control generated outputs not only through prompts but also with input images opens up exciting possibilities, not only in art or marketing but also in training data generation or modification. This technology allows a user to generate artistic yet still scannable QR codes, illusion-like images, and photographs from scribbles. However, to achieve great results we still have to experiment a lot with prompt engineering and parameter tuning. I've tried a couple of times to generate our own QR code for the KeyGen, but I ended up with QR codes that were either uninteresting or unreadable. I guess the technology is promising but still requires additional work to make it easier to use.
If you are interested in trying it yourself, I've prepared a Colab notebook for you [https://colab.research.google.com/drive/1MDNHwsWG_G357wRIXilbZqmfBpSEUkOc?usp=sharing]. Maybe you will have more luck and your illusion-like image will go viral?
Original post: LinkedIn