I believe the corpus of video data available for training far exceeds that of 3D data. Video is also much cheaper to produce. So I'd expect video to be the quickest way forward given where things stand today.
Additionally, video seems like a pretty straightforward output shape to me - a 2D image with a time component. If we were talking 3D assets and animations, I wouldn't even know where to start with modeling that as input data for training. That seems really hard to frame as a fixed input size problem.
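To make the fixed-size point concrete, here is a rough sketch in plain Python. The clip dimensions and the two toy meshes are made up for illustration and are not taken from any real model or dataset. It only shows that every clip at a given resolution and length has the same number of values, while the size of a mesh depends on the asset.

```python
from math import prod

# Hypothetical clip dimensions, chosen only for illustration.
FRAMES, HEIGHT, WIDTH, CHANNELS = 16, 64, 64, 3


def video_shape(frames=FRAMES, height=HEIGHT, width=WIDTH, channels=CHANNELS):
    """Every clip at a given length and resolution has the same shape."""
    return (frames, height, width, channels)


def mesh_size(vertices, triangles):
    """Count stored values: 3 coordinates per vertex plus 3 indices per face.

    The result depends on the asset, so there is no single fixed size.
    """
    return 3 * len(vertices) + 3 * len(triangles)


# Two toy meshes: a single triangle, and a quad built from two triangles.
triangle = ([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(0, 1, 2)])
quad = ([(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)], [(0, 1, 2), (0, 2, 3)])

clip_values = prod(video_shape())                       # identical for every clip
mesh_values = [mesh_size(*triangle), mesh_size(*quad)]  # differs per asset
```

Real 3D pipelines get around this with padding, voxel grids, point sampling or implicit representations, but each of those is an extra modeling decision that video does not force on you.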
If there were comparable 3D data available for training, I'd guess that we'd see different issues with different approaches.
A couple of examples that I could think of quickly: using these to build games might be easier if we could interact with the underlying "assets", while getting photorealistic results with intricate detail (e.g. hair, vegetation) might be easier with video-based solutions.
If the fidelity of the video is high enough, you could use SfM (structure from motion) to build point clouds from the generated video frames and essentially do photogrammetry on the assets from a Genie video.
Well, actually, image output is fixed-size and there's lots of training data. Neural networks can learn such structure in their latent space, so there is no need to impose 3D rendering constraints, and it's not evident that this is less efficient (for the model).
3D model rendering would be useful, however, for interfacing with robots.
You often view 3D games on a 2D screen. That doesn’t mean that a game is natively 2D and the 3D world is an inconvenient step that can be bypassed. Actually, it's the opposite: the 2D representation on screen is just a projection.
In VR, for example, the same 3D scene will be rendered twice, once for each eye, from two viewpoints roughly 6-7cm apart (about the distance between human eyes).
Without an internal 3D representation of the world, the AI would need to generate exactly the same scene from a very slightly different perspective for each eye, without any discrepancies or artefacts.
And that’s not even discussing physics, collisions or any form of consistent world logic that happens off-screen. Or multiplayer!
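On the stereo point, a minimal sketch of why the two eye images are tightly coupled once you have a 3D scene. It assumes an idealized pinhole camera and an eye separation of 0.064 m, both picked for illustration; the function names are hypothetical and this is not how any particular VR runtime or world model works.

```python
def project(point, eye_x, focal=1.0):
    """Pinhole projection of a 3D point (x, y, z) for a camera located at
    (eye_x, 0, 0) and looking down +z. Returns image-plane (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal * (x - eye_x) / z, focal * y / z)


def stereo_pair(point, ipd=0.064, focal=1.0):
    """Project the same 3D point once per eye; ipd is the assumed
    eye separation in metres."""
    half = ipd / 2
    left = project(point, -half, focal)
    right = project(point, +half, focal)
    return left, right


def disparity(point, ipd=0.064, focal=1.0):
    """Horizontal offset between the two projections of one point.
    For this setup it works out to focal * ipd / z."""
    left, right = stereo_pair(point, ipd, focal)
    return left[0] - right[0]


near = (0.0, 0.0, 0.5)   # half a metre away
far = (0.0, 0.0, 10.0)   # ten metres away
near_shift = disparity(near)
far_shift = disparity(far)
```

With a shared 3D scene, the horizontal shift of every point falls out of its depth for free. A model that produces each eye's image directly has to keep that relationship consistent for every pixel, in every frame, with nothing enforcing it.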