I am not sure it will be possible to get enough training data that way.

I don't know enough about diffusion models, but if LLMs of the current size can be trained only on public-domain data, they will be undertrained and we will see a significant degradation in performance. Not to mention that Codex will be effectively dead.