What I usually test with is getting the model to build a full, scalable SaaS application from scratch... Using Antigravity, its early code organization seemed very impressive, but then at some point it suddenly started getting stuck, constantly stopped producing output, and I had to trigger continue or babysit it. I don't know if I could've been doing something better, but that was just my experience. Impressive at first, but at least compared to Antigravity, Codex and Claude Code scale more reliably.
Just an early anecdote from trying to build that one SaaS application, though.
It sounds like an API issue more than anything. I was working with it through Cursor on a side project, and it did better than all previous models at following instructions and refactoring, and UI-wise it has some crazy skills.
What really impressed me was when I told it I wanted a particular component's UI cleaned up but didn't know exactly how; I just asked it to use its deep design expertise to figure it out. It came up with a UX I never would have thought of, and that was amazing.
Another important point: the error rate in my session yesterday was significantly lower than with any other model I've used.
Today I'll see how it does at work, where we have a massive codebase with particular coding conventions. Curious how it handles that.