Diffusion models understood as water in search of the sea

Aná - Lógos

The title above is a transliteration of the Greek root of the word analogy: ἀναλογία. Back in those times it meant a correspondence or a proportion: "ana" = according to, "logos" = proportion, reason. I am an avid searcher for analogies, and I think the process of searching for them is also valuable. I want to share a few analogies here that helped me understand AI diffusion models when I first encountered them.

A few analogies have already been imagined to help us think about how diffusion models work when creating images. One of those analogies was the initial inspiration for diffusion models themselves: the physical process of diffusion, in which particles spread out through random (Brownian) motion, much as heat dissipates. I don't think I would have grasped these models, or been interested in them in the first place, if it weren't for some of those analogies that turn the maths into an image in my mind. I do enjoy getting into the fine details of the mathematical formulas and the code, but I need the visual analogies in order to follow those details and to remember them when I need to.

You will find below a summary of other analogies and metaphors that I have thought with, alongside my own contribution: an analogy based on water. When I studied electronics in my late teens, water was always used to explain all sorts of electric processes, like battery charging, capacitance, resistance and voltage. Like all analogies, it breaks at some point (is it even an analogy if it never breaks?), but it can take you along for a good part of the journey. I tried to do the same for diffusion models and I believe it works well.

A numbers game

In order to follow any analogy, we will need at least an overall knowledge of the process we are trying to ana-log: to find a correspondence we need to know what we are trying to correspond to. If you already know how diffusion models work at a high level, you can skip the next three paragraphs.

As with any machine-learning process, there are two main parts: training and inference. During training, the model learns to remove noise from images. Noise is added to a training image, the model makes its best guess of what that image would look like with less noise, and the guess is compared with the version that really does have less noise. Over many millions of passes, the model learns how to remove the noise in steps, so that it can eventually get from a pure noise image (or a pure black image, if that is easier to think with) to any image of the kinds it has been trained on, say, a cat on a mat. This "learning" basically means that all its moving parts (a collection of billions of numbers that serve as knobs to adjust it) are set in the best positions possible to de-noise images effectively. So eventually, if we ask for an image of a cat, it knows how to change the pixels of a noisy/blank image to slowly turn them in the direction of a cat image. Too abstract? That's expected, bear with me...
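If you like to see things in code, here is a minimal sketch of a single training pass. It assumes one common setup, where the model is asked to guess the noise that was mixed into a clean image and is scored on how far off the guess is. The tiny 4-pixel "image", the `keep` parameter and the function names are all made up for illustration, and the "untrained model" is just a list of zeros.

```python
import math
import random

def add_noise(clean, noise, keep):
    """Blend a clean image with noise: keep=1.0 is all image, keep=0.0 is all noise."""
    return [math.sqrt(keep) * c + math.sqrt(1.0 - keep) * n
            for c, n in zip(clean, noise)]

def mean_squared_error(guess, truth):
    """Average squared gap between the guess and the truth (the training score)."""
    return sum((g - t) ** 2 for g, t in zip(guess, truth)) / len(truth)

def training_example(clean, keep, rng):
    """Return (noisy image, the noise that was mixed in) for one training pass."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    return add_noise(clean, noise, keep), noise

rng = random.Random(0)
clean_image = [0.0, 0.5, 1.0, 0.5]  # a made-up 4-pixel "image"
noisy, true_noise = training_example(clean_image, keep=0.7, rng=rng)

# Stand-in for an untrained model: it guesses "there is no noise at all".
untrained_guess = [0.0] * len(clean_image)
loss = mean_squared_error(untrained_guess, true_noise)
print("how wrong the guess is:", loss)
```

Training is the slow business of nudging the knobs so that this number gets smaller, across millions of images and noise levels.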

Let's talk about the second part, inference, in more detail. Again, say you ask for "a cat on a mat": that sentence is translated into a bunch of numbers that convey the concept of your query (a mathematical vector). They point to the region of what the model learned from its training data that roughly corresponds to "cats on mats", or cats on sofas, or puppies on mats... It is a pointer in the right direction, but it doesn't specify any details of the image.
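To make the "pointer in the right direction" idea concrete, here is a toy sketch. The four-number vectors are entirely made up by hand (real prompt vectors are learned and far longer), and cosine similarity is just one common way of asking how closely two vectors point the same way.

```python
import math

def cosine_similarity(a, b):
    """1.0 means pointing the same way, 0.0 means nothing in common."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors, invented for this example: [cat-ness, dog-ness, mat-ness, boat-ness]
prompt = [0.9, 0.1, 0.8, 0.0]  # "a cat on a mat"
neighbours = {
    "cat on a sofa":  [0.9, 0.1, 0.1, 0.0],
    "puppy on a mat": [0.1, 0.9, 0.8, 0.0],
    "sailing boat":   [0.0, 0.0, 0.0, 1.0],
}

# Sort the neighbours from "closest direction" to "furthest direction".
ranked = sorted(neighbours,
                key=lambda name: cosine_similarity(prompt, neighbours[name]),
                reverse=True)
print(ranked)
```

The prompt lands close to cats on sofas and puppies on mats and far from boats, which is all it needs to do: give a direction, not a finished picture.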

Now the model starts creating the image of the cat on a mat from a pure noise image (or again, a pure black square, if that's easier), and removes noise from it. What does removing noise mean, you could ask, as I did? It means that pixels that initially have random values are changed to have meaningful values that start creating a visible image of a cat on a mat. While the whole image moves in the direction of your prompt, the values for each pixel (red, green and blue) are adjusted to look like lines, and surfaces, and shapes. At the beginning those shapes will look nothing like a cat or a mat, but little by little the pixels will converge into more defined shapes and eventually a cat and a mat. This was the part that I found hardest to grasp: how do pixels know how to converge into anything?
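The step-by-step removal of noise can be sketched as a simple loop. This is a toy, not a real diffusion model: the `toy_denoiser` below cheats by looking at a fixed `TARGET` pattern, whereas a trained model has to guess the noise from what it learned. The names, the 8-pixel pattern and the `strength` value are all assumptions made for illustration.

```python
import random

# Stands in for "a cat on a mat": a made-up row of 8 pixel values.
TARGET = [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]

def toy_denoiser(image):
    """Stand-in for the trained model: guess which part of each pixel is noise.
    A real model learned this guess from data; this toy simply peeks at TARGET."""
    return [pixel - wanted for pixel, wanted in zip(image, TARGET)]

def denoise(image, steps, strength=0.3):
    """Remove a fraction of the guessed noise at every step, keeping each stage."""
    history = [image]
    for _ in range(steps):
        noise_guess = toy_denoiser(image)
        image = [p - strength * n for p, n in zip(image, noise_guess)]
        history.append(image)
    return history

def distance(image):
    """How far the image still is from the target pattern."""
    return sum(abs(p - t) for p, t in zip(image, TARGET))

rng = random.Random(42)
start = [rng.random() for _ in TARGET]  # pure noise: random pixel values
history = denoise(start, steps=20)
distances = [distance(img) for img in history]
print("distance at start:", distances[0], "distance at end:", distances[-1])
```

No single step produces the picture. Each one only takes away a little of what looks like noise, and the pattern emerges from the repetition.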

Gravity and water

Now if you are anything like me, you might understand all that and still find it impossible to visualise how it works. What does moving a picture in the direction of a concept mean? How are pixels manipulated when there is no picture at all to recreate? Let's bring water into it.

My water analogy is to be found on the hills right after a morning of heavy rain. The water is running downhill, as we would expect. It wants to find the streams, the river banks, canyons and gorges, estuaries, and eventually the sea. But like pixels, this water knows nothing about streams or gravity. That rain water is the initial noisy image, the pixels with random values, which are pulled by gravity, that is, your prompt, into shapes (streams).

Remember, your prompt for the model is another bunch of numbers (a vector), and those numbers point to what you want to get (a cat on a mat). That is the gravity force pulling on the random image, as it pulls on water drops on the ground towards the nearest stream. Gravity is the concept or desired image identified in your prompt. The water on the ground is the initial noisy, random image, and it wants to go downhill, pulled by gravity. Gravity takes the water slowly into tiny streams, then larger ones, then rivers, with gorges and canyons and estuaries perhaps.

In the same way that gravity pulls water into those shapes (streams, banks, canals...), all running downhill, the model pulls the random numbers of the initial noise/blank image towards the vector that represents your desired output, the cat on the mat, and the pixels come together into lines and shapes like water drops come together into streams. Step by step, the random numbers take the shapes forced by the path they have to take towards that gravity pull of the concept of a cat on a mat.

Remember above when we said that the model learns to de-noise images in training? What it learns is how to guide those images to where they need to go: it learns how to be like gravity for water on the hills. That gravity force will be different depending on the training data and the algorithms used, but they all behave in similar ways. Once trained, the model can lead the numbers of the initially random noisy image down the path towards the images (numbers, pixel values) that look like cats on mats.

Like water comes together into a stream, pixels come together into a line, or the edge of a shape, or a gradient. The model doesn't know what the image is going to be like at any step; it is the pulling and pushing of the forces of those numbers that shape the pixels, like gravity shapes the random water drops into streams and rivers. The gravity force is the overall concept of your prompt, and it gives us the direction of travel. Water drops are individual pixels that just do what they need to do to travel in that direction, getting around pebbles and roots and puddles, to finally come together into a stream: a shape, a line, a gradient.
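The water picture itself can be played out in a few lines. This is a sketch of the analogy, not of a diffusion model: the hillside is a made-up one-dimensional landscape with two valleys (at -1 and 1), each drop only ever feels the slope under its feet, and the step size `rate` is an arbitrary choice.

```python
import random

def slope(x):
    """Steepness of the made-up landscape height(x) = (x**2 - 1)**2 at position x."""
    return 4.0 * x * (x * x - 1.0)

def roll_downhill(x, steps=200, rate=0.05):
    """Let one drop follow the local slope, one small step at a time."""
    for _ in range(steps):
        x = x - rate * slope(x)
    return x

rng = random.Random(7)
starts = [rng.uniform(-2.0, 2.0) for _ in range(20)]  # rain lands anywhere on the hill
final = [roll_downhill(x) for x in starts]

# Count how many drops ended up in each valley.
streams = {}
for x in final:
    valley = round(x)
    streams[valley] = streams.get(valley, 0) + 1
print(streams)
```

Twenty drops that started in twenty random places end up in just two streams, and no drop ever knew where the valleys were. In the analogy, the slope plays the part of what the model learned, and the valleys are the shapes the pixels settle into.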

To think is to think-with

This water analogy did it for me: I could finally put all the pieces in place and remember them when needed, to make sense of some parameter, or to think of how to improve an image that wasn't behaving as wanted. But we all think differently, and you might be better served by different analogies, or a combination of them. Thinking, I believe, has to be done with tools: sometimes physical ones, like pen and paper; sometimes procedural ones, like a long walk or meditation; and sometimes conceptual ones, like analogies. Try a few if it hasn't clicked for you yet. Here are some other analogies for diffusion models that I've come across: