The way someone explained it to me is that text-to-image models are essentially just denoisers.
They train them by taking an image with a label, e.g. "cat", adding some noise to it, running a training step, adding more noise, running another step, and so on until the image is total (or near-total) noise and the model is still being told it's a cat.
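If it helps to see that as code, here's a very rough sketch of one training step in numpy. The `model` callable, the mixing schedule, and everything else here are made-up stand-ins for illustration, not any real library's API:

```python
import numpy as np

def training_step(model, image, label, rng):
    # Pick a random noise level: how far along the "add more noise" process we are.
    t = rng.uniform(0.0, 1.0)
    noise = rng.standard_normal(image.shape)
    # Mix the clean image with noise; near t=1 it's almost pure noise,
    # but the label stays "cat" the whole time.
    noisy = np.sqrt(1.0 - t) * image + np.sqrt(t) * noise
    # The (hypothetical) model is trained to predict the noise that was added,
    # given the noisy image, the noise level, and the label.
    predicted_noise = model(noisy, t, label)
    loss = np.mean((predicted_noise - noise) ** 2)
    return loss
```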
Then, when you want to generate "cat", you start with noise, and the model finds a cat in the noise and repeatedly cancels some of it. If you're able to watch an image get generated, sometimes you'll even see two cats on top of each other, but one ends up fading away.
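And the generation side, equally rough: start from pure noise and keep asking the (again, hypothetical) model which noise to cancel:

```python
import numpy as np

def generate(model, label, shape, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)               # start with pure noise
    for i in range(steps, 0, -1):
        t = i / steps                            # current noise level, 1.0 down toward 0.0
        predicted_noise = model(x, t, label)     # "where is the cat in this noise?"
        x = x - (1.0 / steps) * predicted_noise  # cancel a little of the noise
    return x
```

Real samplers use a proper noise schedule instead of this fixed step size, but the shape of the loop is the same.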
It turns out these denoisers don't require that many parameters, and if the resulting image has a few pixels that are a tiny bit off-color, you won't even notice.