The issue with image models is that their style becomes identifiable and stale quite quickly, so you'll need a fresh intake of different, newer styles every so often, and that's going to be harder and harder to get.
The style becoming identifiable and stale has mostly to do with CFG (classifier-free guidance) and almost nothing to do with the dataset; the heavy use of CFG by most models trades diversity for coherency. You don't need a constant intake of new images and styles; that's like saying an image created two years ago is stale because it doesn't follow a new style or something.
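For context, here's a minimal sketch of how CFG combines the conditional and unconditional predictions at each denoising step; the `model` interface is hypothetical, but the extrapolation formula is the standard one, and the guidance scale `w` is what trades diversity for coherency:

```python
def cfg_denoise(model, x_t, t, cond, guidance_scale=7.5):
    """Classifier-free guidance, sketched.

    The model predicts noise twice, once unconditionally and once with the
    prompt conditioning, and the two predictions are extrapolated by the
    guidance scale. Higher scales push samples toward the conditional mode
    (better prompt adherence/coherency) at the cost of sample diversity.
    """
    eps_uncond = model(x_t, t, cond=None)  # unconditional prediction
    eps_cond = model(x_t, t, cond=cond)    # prompt-conditioned prediction
    # eps = eps_uncond + w * (eps_cond - eps_uncond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```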
There is the problem of literal style though. The aesthetics of, say, clothes do evolve over time; not big year-to-year changes, but every 3-5 years? Sure. Just laughing at the thought of a model where any image generated is stuck in 1990s grunge attire.
Jonathan Ho, one of the authors of the CFG paper, now works for Ideogram, and Ideogram 2 is one of the very few models (or perhaps the only one) where I don't see the artifacts caused by CFG; maybe he has achieved a breakthrough.
> Built on one of Mistral’s text models, Nemo 12B, the new model can answer questions about an arbitrary number of images of an arbitrary size given either URLs or images encoded using base64, the binary-to-text encoding scheme. Similar to other multimodal models such as Anthropic’s Claude family and OpenAI’s GPT-4o, Pixtral 12B should — at least in theory — be able to perform tasks like captioning images and counting the number of objects in a photo.
This is not a diffusion model -- it doesn't create images, it answers questions about them.
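For what it's worth, the base64 image input the quoted article mentions looks roughly like this; the request shape below is an assumption modeled on the common chat-completions style, not Mistral's documented API, and the file path and model name are placeholders:

```python
import base64
import json

# Read a local image and encode it as base64 text (the binary-to-text
# encoding scheme mentioned in the quote).
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Hypothetical chat-style request body: the model answers a question
# about the image, it does not generate one.
request_body = {
    "model": "pixtral-12b",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How many objects are in this photo?"},
                {"type": "image_url",
                 "image_url": f"data:image/jpeg;base64,{image_b64}"},
            ],
        }
    ],
}
print(json.dumps(request_body)[:200])  # would be POSTed to the provider's chat endpoint
```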
The issue is getting the data on newer aesthetic styles.
The more platforms lock down access to their data, the harder it'll be for models to stay up to date on art trends.
We just haven't had image gen around long enough to witness a major style change, like the shift from the skeuomorphic iPhone icons of old to the modern flat ones.
If an artist born today develops their own style that takes the world by storm in 20 years, the image generators of the time (for this thought experiment, imagine we're using the same image-gen techniques as today) would not know about it. They wouldn't be able to replicate it until they get enough training data on that style.