This has been my experience too. Gemini might be better for vibe coding or architecture or whatever, but Claude consistently feels better for serious coding. That is, when I know exactly how I want something implemented in a large existing codebase, and I go through the full cycle of implementation, refinement, bug fixing, and testing, guiding the AI along the way.
It also seems to be better at incorporating knowledge from documentation and existing examples when provided.
My experience has been exactly the opposite - Sonnet did fine on trivial tasks, but couldn't e.g. fix a bug end-to-end (from bug description in the tracker to implementing the fix and adding tests) properly, because it couldn't understand how the relevant code worked, whereas Gemini would consistently figure out the root cause and write a decent fix and tests.
Perhaps this is down to specific tools and their prompts? In my case, this was Cursor used in agent mode.
Or perhaps it's about the languages involved - my experiments were with TypeScript and C++.
> Gemini would consistently figure out the root cause and write decent fix & tests.
I feel like you might be using it differently to me. I generally don't ask AI to find the cause of a bug, because it's quite bad at that. I use it to identify relevant parts of the code that could be involved in the bug, and then I come up with my own hypotheses for the cause. Then I use AI to help write tests to validate these hypotheses. I mostly use Rust.
I used to use them mostly in "smart code completion" mode myself until very recently. But with all the AI IDEs adding agentic mode, I was curious to see how well that fares if I let it drive.
And we aren't talking about trivial bugs here. For TypeScript, the most impressive bug it handled to date was an async race condition due to a missing await, which caused a property to be overwritten with an invalid value. For that one I actually had to do some manual debugging and tell it what I observed, but given that info, it was able to locate the problem in the code all by itself, fix it correctly, and come up with a way to test it as well.
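To illustrate the kind of bug this was, here's a minimal TypeScript sketch of a missing-await race; the names (Session, refresh, fetchToken) are made up for illustration, not taken from the actual codebase:

    const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

    class Session {
      token = "initial";

      // Simulates a slow network call that resolves with a soon-to-be-stale token.
      private async fetchToken(): Promise<string> {
        await sleep(50);
        return "stale-token";
      }

      // BUG: the promise from fetchToken() is never awaited, so refresh() returns
      // immediately and the assignment happens later, clobbering whatever value
      // was written to `token` in the meantime.
      refresh(): void {
        this.fetchToken().then((t) => {
          this.token = t;
        });
      }
    }

    async function main() {
      const s = new Session();
      s.refresh();          // kicks off the slow fetch but doesn't wait for it
      s.token = "fresh";    // a valid token is set...
      await sleep(100);
      console.log(s.token); // ...yet this prints "stale-token"
    }

    void main();

The fix is the obvious one: make refresh() await the fetch (or return the promise) so callers can sequence the writes.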
For C++, the codebase in question was gdb, and the bug was a test issue; it correctly found the problematic code based solely on the test log (but I had to prod it a bit in the right direction for the fix).
I should note that this is Gemini Pro 2.5 specifically. When I tried Google's models previously (for all kinds of tasks), I was very unimpressed - it was noticeably worse than other SOTA models, so I was very skeptical going into this. Indeed, I started with Sonnet precisely because my past experience indicated that it was the best option, and I only tried Gemini after Sonnet fumbled.
I use it for basically everything I can, not just code completion, including end-to-end bug fixes when it makes sense. But most of the time even the current Gemini and Claude models fail at the hard things.
It might be because most bugs that you would encounter in other languages don't occur in the first place in Rust, thanks to the stronger type system. The race condition you mentioned wouldn't be possible, for example. If something like that did occur, it would be a compiler error, and the AI would fix it while still in the initial implementation stage by looking at the linter errors. I also put a lot of effort into using coding patterns that do as much validation as possible within the type system. So in the end, all that's left are the more difficult bugs where a human is needed to assist (for now at least; I'm confident the models are only going to get better).
Race conditions can also span processes (think async inter-process communication).
That said I do wonder if the problems you're seeing are simply because there isn't that much Rust in the training set for the models - because, well, there's relatively little of it overall when you compare it to something like C++ or JS.
I've found that I need to point it to the right bit of logs or test output and narrow its attention by selectively adding to its context. Claude 3.7 at least works well this way; if you don't, it'll fumble around. Gemini hasn't worked as well for me, though.
I partly wonder if different people's prompt styles will lead to better results with different models.