That doesn't make sense. The free part is the marketing: the more people like it, the faster it spreads. I run a freemium business, and all the motivation internally is to increase growth by improving the free product. Once you achieve a good conversion rate to pro, pushing it any higher just slows growth. At that point, all you care about is improving the product for free users to generate word of mouth, and building features that will do so.
I have had a few epic refactoring failures with Gemini relative to Claude.
For example: I asked both to change a bunch of code into functions to pass into a `pipe` type function, and Gemini truly seemed to have no idea what it was supposed to do, and Claude just did it.
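For context, the kind of transformation I mean is roughly this - a hypothetical TypeScript sketch, not the actual code or the real `pipe` helper I was using:

```typescript
// Minimal `pipe` helper: composes unary functions left to right.
const pipe = <T>(...fns: Array<(x: T) => T>) => (x: T): T =>
  fns.reduce((acc, fn) => fn(acc), x);

// Before: one function doing an imperative sequence of steps.
function shoutBefore(s: string): string {
  const trimmed = s.trim();
  const upper = trimmed.toUpperCase();
  return upper + "!";
}

// After: the same steps extracted as small functions passed into pipe.
const trim = (s: string) => s.trim();
const upper = (s: string) => s.toUpperCase();
const exclaim = (s: string) => s + "!";
const shoutAfter = pipe(trim, upper, exclaim);

console.log(shoutBefore("  hello "), shoutAfter("  hello ")); // HELLO! HELLO!
```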
Maybe there was some user error or something, but after that I haven’t really used Gemini.
I’m curious whether people who are using Gemini and loving it are using it mostly for one-shotting, or whether they're working with it more closely, like a pair programmer? I could buy that it might be good at one but bad at the other.
This has been my experience too. Gemini might be better for vibe coding or architecture or whatever, but Claude consistently feels better for serious coding. That is, when I know exactly how I want something implemented in a large existing codebase, and I go through the full cycle of implementation, refinement, bug fixing, and testing, guiding the AI along the way.
It also seems to be better at incorporating knowledge from documentation and existing examples when provided.
My experience has been exactly the opposite - Sonnet did fine on trivial tasks, but couldn't e.g. fix a bug end-to-end (from bug description in the tracker to implementing the fix and adding tests) properly because it couldn't understand how the relevant code worked, whereas Gemini would consistently figure out the root cause and write decent fix & tests.
Perhaps this is down to specific tools and their prompts? In my case, this was Cursor used in agent mode.
Or perhaps it's about the languages involved - my experiments were with TypeScript and C++.
> Gemini would consistently figure out the root cause and write decent fix & tests.
I feel like you might be using it differently to me. I generally don't ask AI to find the cause of a bug, because it's quite bad at that. I use it to identify relevant parts of the code that could be involved in the bug, and then I come up with my own hypotheses for the cause. Then I use AI to help write tests to validate these hypotheses. I mostly use Rust.
I used to use them mostly in "smart code completion" mode myself until very recently. But with all the AI IDEs adding agentic mode, I was curious to see how well that fares if I let it drive.
And we aren't talking about trivial bugs here. For TypeScript, the most impressive bug it handled to date was an async race condition due to a missing await, causing a property to be overwritten with an invalid value. For that one I actually had to do some manual debugging and tell it what I observed, but given that info, it located the problem in the code all by itself, fixed it correctly, and came up with a way to test it as well.
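For anyone curious, here's a minimal TypeScript sketch of that class of bug - not the actual code, just an illustration of how a missing `await` lets a stale write land after a newer one:

```typescript
// A slow async "save" that writes into shared state after `delayMs`.
const save = (state: { value: string }, v: string, delayMs: number) =>
  new Promise<void>(resolve =>
    setTimeout(() => { state.value = v; resolve(); }, delayMs));

async function buggy() {
  const state = { value: "" };
  save(state, "first", 50);        // missing await: this write is still in flight...
  await save(state, "second", 10); // ...when the newer value lands
  await new Promise(r => setTimeout(r, 100));
  console.log(state.value);        // "first" - the stale write overwrote the newer one
}

async function fixed() {
  const state = { value: "" };
  await save(state, "first", 50);  // awaiting keeps the writes ordered
  await save(state, "second", 10);
  console.log(state.value);        // "second"
}

buggy().then(fixed);
```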
For C++, the codebase in question was gdb, the bug was a test issue, and it correctly found problematic code based solely on the test log (but I had to prod it a bit in the right direction for the fix).
I should note that this is Gemini Pro 2.5 specifically. When I tried Google's models previously (for all kinds of tasks), I was very unimpressed - it was noticeably worse than other SOTA models, so I was very skeptical going into this. Indeed, I started with Sonnet precisely because my past experience indicated that it was the best option, and I only tried Gemini after Sonnet fumbled.
I use it for basically everything I can, not just code completion, including end-to-end bug fixes when it makes sense. But most of the time even the current Gemini and Claude models fail with the hard things.
It might be because most bugs that you would encounter in other languages don't occur in the first place in Rust, thanks to the stronger type system. The race condition you mentioned wouldn't be possible, for example. If something like that did occur, it would be a compiler error, and the AI would fix it while still in the initial implementation stage by looking at the linter errors. I also put a lot of effort into using coding patterns that do as much validation as possible within the type system. So in the end all that's left are the more difficult bugs where a human is needed to assist (for now at least; I'm confident the models are only going to get better).
Race conditions can span across processes (think async process communication).
That said I do wonder if the problems you're seeing are simply because there isn't that much Rust in the training set for the models - because, well, there's relatively little of it overall when you compare it to something like C++ or JS.
I've found that I need to point it to the right bit of logs or test output and narrow its attention by selectively adding to its context. Claude 3.7 at least works well this way. If you don't, it'll fumble around. Gemini hasn't worked as well for me, though.
I partly wonder if different people's prompt styles will lead to better results with different models.
Two things interest me about Claude being better than GPT-4:
1) We are all breathless that it is better. But a year has passed since GPT4. It’s like we’re excited that someone beat Usain Bolt’s 100 meter time from when he was 7. Impressive, but … he’s twenty now, has been training like a maniac, and we’ll see what happens when he runs his next race.
2) It’s shown AI chat products have no switching costs right now. I now use mostly Claude and pay them money. Each chat is a universe that starts from scratch, so … very easy to switch. Curious if deeper integrations with my data, or real chat memory, will change that.
The current version of GPT-4 is 3 months old, not 1 year old. Anthropic are legitimately ahead on performance for cost right now, but I don't think their API latency matches OpenAI's.
We’ll see what GPT4.5 looks like in the next 6 months.
I don't think it's just that Claude-3 seems on par with GPT-4, but rather the development timescales involved.
Anthropic as a company was only created, with some of the core LLM team members from OpenAI, around the same time GPT-3 came out (Anthropic CEO Dario Amodei's name is even on the GPT-3 "few-shot learners" paper). So, roughly speaking, in the same time it took OpenAI (a big established company with lots of development momentum) to go from GPT-3 to GPT-4, Anthropic have gone from a start-up with nothing to Claude-3 (via 1 & 2), which BEATS GPT-4. Clearly the pace of development at Anthropic is faster than that at OpenAI, and there is no OpenAI magic moat in play here.
Sure GPT-4 is a year old at this point, and OpenAI's next release (GPT-4.5 or 5) is going to be better than GPT-4 class models, but given Anthropic's momentum, the more interesting question is how long it will take Anthropic to match it or take the lead?
Inference cost is also an interesting issue... OpenAI have bet the farm on Microsoft, and Anthropic have gone with Amazon (AWS), who have built their own ML chips. I'd guess Anthropic's inference cost is cheaper, maybe a lot cheaper. Can OpenAI compete with the cost of Claude-3 Haiku, which is getting rave reviews? Its input tokens are crazy cheap - $300 to input every word you'll ever speak in your entire life!
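Rough back-of-envelope on that figure, assuming ~16,000 spoken words a day over 80 years (~470M words) and ~1.3 tokens per word: that's roughly 600M input tokens, which at Haiku's $0.25 per million input tokens comes to about $150 - so $300 covers it with room to spare.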
Claude may beat GPT-4 right now, but I remember ChatGPT in March 2023 being leagues better. Over the past year it's regressed, but gotten faster.
Claude is also lacking web browsing and a code interpreter. I'm sure those will come, but where will GPT be by then? ChatGPT also offers an extensive free tier with voice. Claude's free plan caps you at a few messages every few hours.
Of course GPT-next should take the lead for a while, but with Anthropic, from a standing start, putting out 3 releases in the same time it took OpenAI to put out 1, how long is this lead going to last?
It'll be interesting to see if Anthropic choose to match OpenAI feature-for-feature or just follow their own path.
Yeah, it's a good point, but I think our intuitions differ on this one. I don't have a horse in this race, but my assumption is that the next OpenAI release will be a massive leap that makes GPT-4/Claude 3 Opus look like toys. Perhaps you're right, though, and Anthropic's curves will bend upward even more quickly, so that they keep catching up faster until eventually they're ahead.
Honestly who knows, but outside of Q-star rumors there's no indication that either company is doing anything much different from the other one, so I'd not expect any long-lasting difference in capability to open up. Maybe it will, though!
FWIW, Sam Altman has fairly recently said that the jump from GPT-4 to GPT-5 will be similar to that from GPT-3 to GPT-4, and also (recent Lex Fridman interview) that their goal is explicitly NOT to have releases that are shocking - but rather they want to have ones of incremental capability to give society time to adapt. Could be misdirection - who knows.
Amodei for his part has said that what Anthropic will release in 2024 will be a "sharper, more refined" (or words to that effect) version of what they have now, and not a "reality bender" (which he seemed to be implying maybe is coming, but not for a year or two).
They're comparing against gpt-4-0125-preview, which was released at the end of January 2024. So they really are beating the market leader for this test.
What matters here is what I can use today. I can either use Claude 3 or GPT-4. If Claude is better, it is the best on the market. Let's see what the story is tomorrow.
Go ahead, no one is saying to stay with GPT-4. But it's disingenuous to compare a gpt-4-march-update to a completely new pretrained model like Claude 3 Opus.
It is not that disingenuous. We can only make claims based on the current data.
There could be even bigger competitors in the market, but because they stay quiet and do not publish results, we do not know about their capabilities. Who knows what Apple has been doing all this time? They certainly have the capability, even if they make the occasional comment about using Gemini.
Until the data and proof has been provided, it is accurate to claim "the best model on the market". Everything else is hypothetical.
So you think whatever process produces a GPT4 update is completely equivalent to pretraining and RLHF'ing a brand new model with new architecture, more data, etc??
ChatGPT does have at least a year's head start, so this doesn't seem surprising. It does show that OpenAI doesn't really have any secret sauce that others can't reproduce.
I suppose size will become the moat eventually but atm it looks like it could become anyone's game.
Size is absolutely not going to become the moat unless there's some hardware revolution that makes running big models very very cheap, but that requires a very large up-front capital cost to deploy. Big models are inefficient, and as smaller models improve there will be very few use cases where the big models are worth the compute.
I imagine that going forward, the typical approach would be a multi-level LLM, such that there's a relatively small and quick model in front of the user, which can in turn decide to consult an "expert" larger model as part of its "system 2".
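Something like this, conceptually - a hypothetical sketch where `small` and `expert` stand in for whatever cheap and expensive models you'd actually call:

```typescript
// Hypothetical model interface: prompt in, completion out.
type Model = (prompt: string) => Promise<string>;

// The small model answers directly, or escalates when it isn't confident.
async function answer(question: string, small: Model, expert: Model): Promise<string> {
  const triage = await small(
    `Answer the question if you are confident, otherwise reply exactly "ESCALATE".\n\nQ: ${question}`
  );
  if (triage.trim() === "ESCALATE") {
    // "System 2": hand the hard case to the larger model.
    return expert(question);
  }
  return triage;
}
```

The interesting design question is how the front model decides to escalate - a sentinel reply like this, a confidence score, or a separate router model.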
Absolutely, that is 100% the way things are going to go. What's going to happen is that eventually there will be an online model directory that a local agent knows how to query to identify other models to call in order to build up an answer. Local agents will be empowered with online learning since it won't be possible to pre-train on the model catalog.
> as smaller models improve there will be very few use cases where the big models are worth the compute
I see very little evidence of this so far. The use cases I'm interested in just barely work on GPT-4, and lesser models give mostly garbage - i.e. function calling and inferring stuff like SQL queries. If there are smaller models that can do passable work on such use cases, I'd be very interested to know.
Claude Haiku can do a LOT of the things you'd think you need GPT4 for. It's not as good at complex code and really tricky language use/abstractions, but it's very close for more superficial things, and you can call haiku like 60 times for each gpt4 call.
I bet you could do multiple prompt variations with haiku and then do answer combining to compete with GPT4-T/Opus at a fraction of the price.
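A minimal sketch of that idea, assuming the `@anthropic-ai/sdk` messages API and simple majority voting over paraphrased prompts (the combining step could just as well be another model call):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Send one prompt to Haiku and return the text of the first content block.
async function ask(prompt: string): Promise<string> {
  const msg = await client.messages.create({
    model: "claude-3-haiku-20240307",
    max_tokens: 256,
    messages: [{ role: "user", content: prompt }],
  });
  const block = msg.content[0];
  return block.type === "text" ? block.text.trim() : "";
}

// Ask the same question phrased a few different ways, then take the most common answer.
async function askWithVoting(variants: string[]): Promise<string> {
  const answers = await Promise.all(variants.map(ask));
  const counts = new Map<string, number>();
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}
```

Exact-match voting only really works for short, constrained answers; for free-form output you'd want a judge model or similarity-based merging instead.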
Interesting! I just discovered that Anthropic do indeed officially support commercial API access in (at least) some EU countries. They just don't support GUI access in all of those countries.
> We are all breathless that it is better. But a year has passed since GPT4. It’s like we’re excited that someone beat Usain Bolt’s 100 meter time from when he was 7.
Sounds like some sort of siding with closedAI (OpenAI). When I need to use an LLM, I use whatever performs best at the moment. It doesn't matter to me who's behind it; at the moment it is Claude.
I am not going to stick to ChatGPT just because closedAI have been pioneers or because their product was one of the best.
I hope I didn’t sound too harsh, excuse me in that case.
Is this supposed to be clever? It's like saying M$ back in the 90s. Yeah, OpenAI doesn't deserve its own name, but maybe we can let that dead horse rest.
Claude has way too many safeguards for what it believes is correct to talk about and what isn’t. Not saying ChatGPT is better, it also got dumbed down a lot, but Claude is very heavy on being politically correct on everything.
Ironically the one I find the best for responses currently is Gemini Advanced.
I agree with you that there is no switching cost currently, I bounce between them a lot
If you’re on macOS, give BoltAI[0] a try. Other than supporting multiple AI services and models, BoltAI also allows you to create your own AI tools: highlight the text, press a shortcut key, then run a prompt against that text.
I use an app called MindMac for macOS that works with nearly "all" of the APIs. I currently am using OpenAI, Anthropic and Mistral API keys with it, but it seems to support a ton of others as well.
MSFT trying to hedge their bets makes it seem like there's a decent chance OpenAI might have hit a few roadblocks (either technical or organizational).
I agree with your analogy. Also, there is quite a bit of "standing on the shoulders of giants" going on. Every company's latest release will/should be a bit better than the models released before it. AI enthusiasts are getting a bit annoying - "we got a new leader boys!!!" for each new model released.
GRUB is a common bootloader for Linux systems: it gives you a menu of boot options when you turn on your machine, and boots whichever installed operating system you choose.
So with this theme, that menu for choosing which OS to boot looks like the Minecraft menu!
"The Art of Action" It has changed how I approach everything, not just how I run my business. Is applying approaches used in militaries to organize action around the leader's intent, taking action in the right general direction, and delegating the "what" (the intent) but not the "how" of the way it is actually achieved.
My best friend's dad was high up in Woodstock 99 corporate, which means he was responsible for making much of it run (yeah, I guess there were some issues). We had backstage passes, a house, and a ... weird experience. Did some stupid things. Some highlights:
- We were on stage during the famous closing Red Hot Chili Peppers show (like in the wings, looking out on the audience from behind the band). That was surreal. We could see the fires start in the distance, stuff getting ripped down, things starting to sort of go to hell. All the while, the most amazing band of the moment is playing the most amazing songs 30 feet away from us. Quite a contrast between awesomeness and ... whatever the strange mix of emotions riots give.
- We learned that there was no Mountain Dew in the whole event because Coke got the contract, so we went out and bought many cases, drove it in through our special entrance, and basically auctioned it off. People were willing to pay $5 a can. We made a lot of money. Pretty sketchy.
- Worse/weirder, we had backpacks full of ice to hold the Mountain Dew while we walked around the crowd selling it, and people started offering to buy the ice from us to cool off. So once we ran out of Mountain Dew, we started yelling "Ice is nice! We got ice!" and selling that. That ... well, I feel like that was me experiencing a real microcosm of capitalism and the allure of artificial scarcity ... and not acting the way I would hope. Give away the damn ice, man.
- We kind of just walked around backstage, and there were so many super famous people that none of them felt very special, and they'd just kind of talk to you while you were in line for food. Ice Cube had some cool sneakers that my friend chatted with him about. George Clinton was chill. I think we talked with Erykah Badu for a while at some point. (Stars are not at all like this backstage at a normal concert, btw, which we also went to a lot because of his father. At their normal concerts, these same performers are the center of the universe and don't have time to chat with a bunch of high schoolers running around.)
Anyway, it was a very strange event, and we had a very strange vantage point.
Honestly, no. Your type of attitude is how bad products get made: products that cater to the absolute lowest tech-illiterate user, products that hide and simplify every useful tool to the point of annoyance and dysfunction. GitHub's warnings are plenty. It doesn't matter what GitHub did; this post would still have been made, and you people would still be thinking of even more absurd ways to stop the user from doing something. At a certain point, a tool has to do the function you asked it to do.
It was really fun, but maybe throw in a few combos closer in weight? I played for maybe 5 minutes and didn't get any wrong. I would have played longer if I had gotten some wrong.