Hi HN, I hope you enjoy our research preview of interactive video!
We think it's a glimpse of a totally new medium of entertainment, where models imagine compelling experiences in real-time and stream them to any screen.
All of this is really great work, and I'm excited to see great labs pushing this research forward.
From our perspective, what separates our work is two things:
1. Our model can be experienced by anyone today, in real-time at 30 FPS.
2. Our data domain is the real world, meaning the model learns life-like pixels and actions. This is, from our perspective, more complex than learning from a video game.
If I had to choose one, I'd easily say maintaining video coherence over long periods of time. The typical failure case of world models attempting to generate diverse pixels (i.e. beyond a single video game) is that they degrade into a mush of incoherent pixels after 10-20 seconds of video.
We talk about this challenge in our blog post here (https://odyssey.world/introducing-interactive-video). There are specifics in there on how we improved coherence for this production model, and on our work to improve this further with our next-gen model. I'm really proud of our work here!
> Compared to language, image, or video models, world models are still nascent—especially those that run in real-time. One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics. Improving this is an area of research we're deeply invested in.
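To make that drift concrete, here's a deliberately tiny toy of autoregressive error compounding, written with made-up one-dimensional dynamics (a generic illustration, not anything from our stack). A one-step predictor that's off by roughly 0.1% per step still diverges from the true trajectory once it's fed its own outputs:

    # Toy stand-in for autoregressive drift (made-up 1D dynamics, not our model).
    def world(x):                          # "true" one-step dynamics
        return 3.9 * x * (1.0 - x)

    def predictor(x):                      # learned dynamics, off by ~0.1% per step
        return 3.9001 * x * (1.0 - x)

    x_real = x_gen = 0.2
    for t in range(1, 61):                 # a couple of seconds of "frames"
        x_real = world(x_real)             # ground truth keeps evolving
        x_gen = predictor(x_gen)           # the model only ever sees its own output
        if t in (5, 10, 20, 60):
            print(t, abs(x_gen - x_real))  # tiny at first, then the rollout decorrelates

A real world model is fighting the same effect in a state space of millions of pixels, which is why long-horizon coherence is the thing we obsess over.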
In second place would absolutely be model optimization to hit real-time. That's a gnarly problem, where you're delicately balancing model intelligence, resolution, and frame-rate.
Hi! CEO of Odyssey here. Thanks for giving this a shot.
To clarify: this is a diffusion model trained on lots of video, that's learning realistic pixels and actions. This model takes in the prior video frame and a user action (e.g. move forward), with the model then generating a new video frame that reflects the intended action. This loop happens every ~40ms, so real-time.
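As a rough sketch of that loop (hypothetical interface names, heavily simplified from what we actually run):

    import time

    FRAME_BUDGET_S = 0.040  # ~40ms per generated frame

    def run_session(model, controller, stream, first_frame):
        # Hypothetical interfaces: controller.latest_action() returns the user's
        # current input (e.g. "move_forward"), model(frame, action) returns the
        # next video frame, stream.send() pushes that frame to the screen.
        frame = first_frame
        while stream.is_open():
            start = time.monotonic()
            action = controller.latest_action()
            frame = model(frame, action)   # generated frame becomes the next context
            stream.send(frame)
            # Hold the frame rate steady by sleeping off any leftover budget.
            time.sleep(max(0.0, FRAME_BUDGET_S - (time.monotonic() - start)))

Everything inside that loop has to finish within the frame budget, which is where the real-time constraint bites.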
The reason you're seeing similar worlds with this production model is that one of the greatest challenges of world models is maintaining coherence of video over long time periods, especially with diverse pixels (i.e. not a single game). So, to increase reliability for this research preview—meaning multiple minutes of coherent video—we post-trained this model on video from a smaller set of places with dense coverage. With this, we lose generality, but increase coherence.
> One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics.
> To improve autoregressive stability for this research preview, what we’re sharing today can be considered a narrow distribution model: it's pre-trained on video of the world, and post-trained on video from a smaller set of places with dense coverage. The tradeoff of this post-training is that we lose some generality, but gain more stable, long-running autoregressive generation.
> To broaden generalization, we’re already making fast progress on our next-generation world model. That model—shown in raw outputs below—is already demonstrating a richer range of pixels, dynamics, and actions, with noticeably stronger generalization.
Why are you going all in on world models instead of basing everything on top of a 3D engine that could be manipulated / rendered with separate models? If a world model was truly managing to model a manifold of a 3D scene, it should be pretty easy to extract a mesh or SDF from it and drop that into an engine where you could then impose more concrete rules or sanity check the output of the model. Then you could actually model player movement inside of the 3D engine instead of trying to train the world model to accept any kind of player input you might want to do now or in the future.
Additionally, curious about what exactly the difference between the new mode of storytelling you’re describing and something like a crpg or visual novel is - is your hope that you can just bake absolutely everything into the world model instead of having to implement systems for dialogue/camera controls/rendering/everything else that’s difficult about working with a 3D engine?
> Why are you going all in on world models instead of basing everything on top of a 3D engine that could be manipulated / rendered with separate models?
I absolutely think there are going to be super cool startups that accelerate film and game dev as it is today, inside existing 3D engines. Those workflows could be made much faster with generative models.
That said, our belief is that model-imagined experiences are going to become a totally new form of storytelling, and that those experiences wouldn't be free to be as weird and wacky as they could be if constrained by the heuristics and limitations of existing 3D engines. This is our focus, and why the model is video-in and video-out.
Plus, you've got the very large challenge of learning a rich, high-quality 3D representation from a very small pool of 3D data. The volume of 3D data is just so small, compared to the volumes generative models really need to begin to shine.
> Additionally, curious about what exactly the difference between the new mode of storytelling you’re describing and something like a crpg or visual novel
To be clear, we don't yet know what shape these new experiences will take. I'm hoping we can avoid an awkward initial phase where these experiences resemble traditional game mechanics too much (although we have much to learn from them), and just fast-forward to enabling totally new experiences that just aren't feasible with existing technologies and budgets. Let's see!
> is your hope that you can just bake absolutely everything into the world model instead of having to implement systems for dialogue/camera controls/rendering/everything else that’s difficult about working with a 3D engine?
Yes, exactly. The model just learns better this way (instead of breaking it down into discrete components) and I think the end experience will be weirder and more wonderful for it.
> Plus, you've got the very large challenge of learning a rich, high-quality 3D representation from a very small pool of 3D data. The volume of 3D data is just so small, compared to the volumes generative models really need to begin to shine.
Isn’t the entire aim of world models (at least, in this particular case) to learn a very high quality 3D representation from 2D video data? My point is that if you manage to train a navigable world model for a particular location, that model has managed to fit a very high quality 3D representation of that location. There’s lots of research dealing with NeRFs that demonstrates how you can extract these 3D scenes as meshes once a model has managed to fit them. (NeRFs are another great example of learning a high quality 3D representation from sparse 2D data.)
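For example, once a NeRF-style model has fit a scene, extracting geometry is roughly: sample its density field on a grid and run marching cubes. A rough sketch (density_fn here is a stand-in for whatever query interface a trained model exposes):

    import numpy as np
    from skimage import measure  # marching cubes lives here

    def extract_mesh(density_fn, bounds=(-1.0, 1.0), res=128, level=10.0):
        # density_fn: stand-in for a trained NeRF queried at (N, 3) points,
        # returning one volume density per point.
        xs = np.linspace(bounds[0], bounds[1], res)
        grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
        densities = density_fn(grid.reshape(-1, 3)).reshape(res, res, res)
        # Treat an iso-level of the density field as the surface; tune per scene.
        verts, faces, normals, _ = measure.marching_cubes(densities, level=level)
        # marching_cubes returns voxel-index coordinates; map back to world space.
        verts = bounds[0] + verts * (bounds[1] - bounds[0]) / (res - 1)
        return verts, faces, normals

Once you have that mesh you can drop it into a normal engine and get collision, navigation, and sanity checks essentially for free.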
> That said, our belief is that model-imagined experiences are going to become a totally new form of storytelling, and that those experiences wouldn't be free to be as weird and wacky as they could be if constrained by the heuristics and limitations of existing 3D engines. This is our focus, and why the model is video-in and video-out.
There’s a lot of focus in the material on your site about the models learning physics by training on real-world video - wouldn’t that imply that you’re trying to converge on a physically accurate world model? I imagine that would make weirdness and wackiness rather difficult.
> To be clear, we don't yet know what shape these new experiences will take. I'm hoping we can avoid an awkward initial phase where these experiences resemble traditional game mechanics too much (although we have much to learn from them), and just fast-forward to enabling totally new experiences that just aren't feasible with existing technologies and budgets. Let's see!
I see! Do you have any ideas about the kinds of experiences that you would want to see or experience personally? For me it’s hard to imagine anything that substantially deviates from navigating and interacting with a 3D engine, especially given it seems like you want your world models to converge to be physically realistic. Maybe you could prompt it to warp to another scene?
> wouldn’t that imply that you’re trying to converge on a physically accurate world model?
I'm not the CEO or associated with them at all, but yes, this is what most of these "world model" researchers are aiming for. As a researcher myself, I do not think this is the way to develop a world model, and I'm fairly certain that this cannot be done through observations alone. I explain more in my response to the CEO[0]. This is a common issue with the way a lot of ML experimentation is run: you simply cannot rely on benchmarks to get you to AGI. Scaling of parameters and data only goes so far. If you're seeing slowing advancements, it is likely due to over-reliance on benchmarks and under-reliance on what those benchmarks intend to measure. But this is a much longer conversation (I think I made a long comment about it recently; I can dig it up).
This is super cool. I love how you delivered this as an experience, with very cool UI, background music, etc. It was a real trip. Different music, and an ambient mode where the atmosphere is designed for the particular world, would level this up artistically. But I gotta say, I haven’t been this intrigued by anything like this online in a while.
Does it store a model of the world (like, some memory of the 3D structure) that goes beyond the pixels that are shown?
Or is the next frame a function of just the previous frame and the user input? Like (previous frame, input) -> next frame
I'm asking because, if some world has two distinct locations that look exactly the same, will the AI distinguish them, or will they get coalesced into one location?
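In rough pseudocode, the two alternatives I'm imagining look like this (the names are placeholders, not anything from Odyssey):

    # Option 1: purely Markovian -- the next frame depends only on the last
    # frame and the current input, so two identical-looking places would collapse.
    def step_markov(model, prev_frame, user_input):
        return model(prev_frame, user_input)

    # Option 2: some persistent state (a latent map / memory) rides along,
    # so frames that look the same can still be distinct locations.
    def step_with_memory(model, memory, prev_frame, user_input):
        next_frame, memory = model(memory, prev_frame, user_input)
        return next_frame, memory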
> one of the greatest challenges of world models is maintaining coherence of video over long time periods
To be honest most of the appeal to me of this type of thing is the fact that it gets incoherent and morph-y and rotating 360 degrees can completely change the scenery. It's a trippy dreamlike experience whereas this kind of felt like a worse version of existing stuff.
Thanks for the reply and adding some additional context. I'm also a vision researcher, fwiw (I'll be at CVPR if you all are).
(Some of this will be for the benefit of other HN non-researcher readers)
I'm hoping you can provide some more detail. Are these trained on single videos moving through these environments, where the camera is not turning? What I am trying to understand is what is being generated vs what is being recalled.
It may be a more contentious view, but I do not think we're remotely ready to call these systems "world models" if they are primarily performing recall. Maybe this is bias from an education in physics (I have a degree), but world modeling is not just about creating consistent imagery; it is about actually being capable of recovering the underlying physics of the videospace (as opposed to the reality the videos come from). I've yet to see a demonstration of a model that comes anywhere near this, or that convinces me we're on the path towards it.
The key difference here is: are we building Doom, which has system requirements of 100MB of disk and 8MB of RAM with minimal computation, or are we building an extremely decompressed version that requires 4GB of disk and a powerful GPU to run only the first level, and can't even get critical game dynamics right, like shooting the right enemy (GameNGen)?
The problem is not the ability to predict future states based on previous ones, the problem is the ability to recover /causal structures/ from observation.
Critically, a p̶h̶y̶s̶i̶c̶s̶ world model is able to process a counterfactual.
Our video game is able to make predictions, even counterfactual predictions, with its engine. Of course, this isn't generated by observation and environment interaction; it is generated through directed programming and testing (where the testing includes observing and probing the environment). If the goal were just that, then our diffusion models would comparatively be a poor contender. It's the wrong metric. The coherence is a consequence of the world modeling (i.e. the game engine), but coherence can also be developed from recall. Recall alone will be unable to make a counterfactual.
Certainly we're in the research phase and need to make tons of improvements, but we can't make these improvements if we're blindly letting our models cheat the physics and they are only capable of picking up "user clicks fire" correlating with "monster usually dies when user shoots". LLMs have tons of similar problems with making such shortcuts, and the physics will tell you that you are not going to be able to pick up such causal associations without some very specific signals to observe. Unfortunately, causality cannot be determined from observation alone (a well-known physics result![0]). You end up with many models that generate accurate predictions, and these become indistinguishable without careful factorization, probing, and often careful integration of various other such models. It is this much harder and more nuanced task that is required of a world model, rather than memory.
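To make that concrete with toy numbers (a generic statistics illustration, nothing to do with Odyssey's model): two structural models can produce exactly the same observational data and still disagree the moment you intervene, which is what a counterfactual requires.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    # Model A assumes X causes Y; Model B assumes Y causes X.
    # Both yield the same joint: standard-normal marginals, correlation 0.8.
    def sample_A():
        x = rng.normal(0.0, 1.0, n)
        return x, 0.8 * x + rng.normal(0.0, 0.6, n)

    def sample_B():
        y = rng.normal(0.0, 1.0, n)
        return 0.8 * y + rng.normal(0.0, 0.6, n), y

    for name, sample in (("A", sample_A), ("B", sample_B)):
        x, y = sample()
        print(name, "corr(X, Y) =", round(np.corrcoef(x, y)[0, 1], 3))  # ~0.8 for both

    # Now intervene: force X = 2 instead of merely observing it.
    # Model A predicts E[Y | do(X=2)] = 1.6; Model B predicts it stays 0.
    y_do_A = 0.8 * 2.0 + rng.normal(0.0, 0.6, n)  # Y still responds to X
    y_do_B = rng.normal(0.0, 1.0, n)              # Y ignores the intervention
    print("E[Y | do(X=2)]:", round(y_do_A.mean(), 2), "vs", round(y_do_B.mean(), 2))

Observation alone can't tell A from B; only interventions (or prior causal knowledge) can, which is exactly the gap between recall and a world model.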
Essentially, do we have "world models" or "cargo cult world models" (recall or something else)?
That's the context of my data question. To help us differentiate the two. Certainly the work is impressive and tbh I do believe there is quite a bit of utility in the cargo cult setting, but we should also be clear about what is being claimed and what isn't.
I'm also interested in how you're trying to address the causal modeling problem.
[0] There is much discussion on the Duhem-Quine thesis, which is a much stronger claim than I stated. There's the famous Michelson-Morley experiment, which actually did not rule out an aether, but rather only showed that it had no directionality. Or we could even use the classic Heisenberg Uncertainty Principle, which revolutionized quantum mechanics by showing that there are things that are unobservable, leading to Schrodinger's Cat (and some weird hypotheses of multiverses). And we even have String Theory, where the main gripe remains that it is indistinguishable from other TOEs because the differences in predictions are non-observable.
We've been working on the prediction problem for some time. We should have a blog post coming out about our approach (it's pretty neat and in-line with this post) soon.
That's great to hear! I think upon fully reading the blog post it's clear that you meant the next leap forward is solving prediction, to the degree that perception has supposedly been solved. It will be interesting to see how y'all do. I for one want my self-driving car yesterday.
Voyage’s mission is to super-charge communities with driverless vehicles. Our fleets power essential, everyday services designed to enhance each resident’s quality of living. At Voyage, we strive to become part of every community we serve.
Voyage’s first product is an autonomous taxi service located within a 160,000 resident retirement community in Florida. Here, our fleet delivers on the promise of autonomous driving - solving the mobility needs of residents who need it most. Whether a resident faces mobility restrictions, or just wants to take a ride, we take pride in getting every Voyage passenger to their destination safely, efficiently, and affordably.
We're a team of 40 engineers who have raised $23m from world-class VCs to build a massive and meaningful transportation company. We're growing the team rapidly, and are searching for engineers across multiple disciplines (machine learning, robotics, consumer software, devops, and more). If you love to ship, I think you'll love working at Voyage.
We are going to market with autonomous vehicles in a very different way, focusing on large private cities first and foremost. We intend The Villages, Florida to be the first (retirement) city that's traversable end-to-end (all 750 miles of road) in an autonomous vehicle.
We'll eventually make the leap to public cities, and it will feel gradual when it does happen.
We think about our technology quite differently, leaning on lots of partners for the infrastructure (mapping, simulation, sensors, tele-operation, middleware, and more) behind the scenes. This enforces a real focus on the unsolved autonomy algorithms.
Later this year, we'll also be sharing a project we're in the middle of that's dramatically different technologically from what we've seen elsewhere, utilizing the community itself to make a leap in autonomous performance.
> From what I understand, you pick canonical routes inside private communities.
We design our autonomous systems to traverse _any_ point-to-point route within an entire private (retirement) city. We intentionally don't just focus on a single, shuttle-like route. It turns out that pretty much any route in a place like The Villages is far less complex than routes in other city-like environments, but the business opportunity is just as large.
> What prevents Google from coming in, mapping the area in a week, and running you out of business?
Voyage has exclusivity clauses in our agreements with our communities, where we also grant the community a slice of Voyage in the form of equity. Contracts are unfortunately meant to be broken, which means that we put a lot of effort into making sure relationships with these locations are great. We frequently host Town Halls and make sure the community is heard. This is crucial.
> I'm a self-driving car engineer, why would I pick Voyage over other big players who have a lot more capital and a much bigger team with a lot more people like Drew Gray?
It's a lot of fun here. Contrary to the hype, there are relatively few full-stack self-driving car startups at the Series A level. We believe our people, our technology, and our go-to-market to be the best of that group.
Most importantly, when searching for new Voyage team members, we don't optimize for specific degrees or backgrounds. One of our greatest strengths is the team we've built with that philosophy.
Once you've taken the research preview for a whirl, you can learn a lot more about our technical work behind this here (https://odyssey.world/introducing-interactive-video).