More

nullbio · 2025-12-02T17:01:30 1764694890

Benchmarks are never to be believed, and that has been the case since day 1.

nullbio · 2025-12-02T17:00:04 1764694804

Google games benchmarks more than anyone, hence Gemini's strong bench lead. In reality though, it's still garbage for general usage.

nullbio · 2025-12-02T16:58:16 1764694696

Anyone else find that despite Gemini performing best on benches, it's actually still far worse than ChatGPT and Claude? It seems to hallucinate nonsense far more frequently than any of the others. Feels like Google just bench maxes all day every day. As for Mistral, hopefully OSS can eat all of their lunch soon enough.

apexalpha · 2025-12-02T17:00:56 1764694856

No, I've been using Gemini for help while learning / building my onprem k8s cluster and it has been almost spotless.

Granted, this is a subject that is very well present in the training data but still.

Synthetic7346 · 2025-12-02T17:17:36 1764695856

I found gemini 3 to be pretty lackluster for setting up an onprem k8s cluster - sonnet 4.5 was more accurate from the get go, required less handholding

mvkel · 2025-12-02T17:02:49 1764694969

Open weight LLMs aren't supposed to "beat" closed models, and they never will. That isn’t their purpose. Their value is as a structural check on the power of proprietary systems; they guarantee a competitive floor. They’re essential to the ecosystem, but they’re not chasing SOTA.

cmrdporcupine · 2025-12-02T17:47:37 1764697657

This may be the case, but DeepSeek 3.2 is "good enough" that it competes well with Sonnet 4 -- maybe 4.5 -- for about 80% of my use cases, at a fraction of the cost.

I feel we're only a year or two away from hitting a plateau with the frontier closed models having diminishing returns vs what's "open"

troyvit · 2025-12-02T20:35:42 1764707742

I think you're right, and I feel the same about Mistral. It's "good enough", super cheap, privacy friendly, and doesn't burn coal by the shovel-full. No need to pay through the nose for the SOTA models just to get wrapped into the same SaaS games that plague the rest of the industry.

barrell · 2025-12-02T17:06:49 1764695209

I can attest to Mistral beating OpenAI in my use cases pretty definitively :)

theshrike79 · 2025-12-03T12:18:14 1764764294

In my use cases mistral has been next to useless.

Granted my uses have been programming related. Mistral prints the answer almost immediately and is also completely and utterly hallucinating everything and producing just something that looks like code but could never even compile...

re-thc · 2025-12-02T17:22:55 1764696175

> Open weight LLMs aren't supposed to "beat" closed models, and they never will. That isn’t their purpose.

Do things ever work that way? What if Google did Open source Gemini. Would you say the same? You never know. There's never "supposed" and "purpose" like that.

lowkey_ · 2025-12-02T17:51:44 1764697904

Not the above poster, but:

OpenAI went closed (despite open literally being in the name) once they had the advantage. Meta also is going closed now that they've caught up.

Open-source makes sense to accelerate to catch up, but once ahead, closed will come back to retain advantage.

mvkel · 2025-12-03T00:42:13 1764722533

I continue to be surprised that the supposed bastion of "safe" AI, anthropic, has a record of being the least-open AI company

pants2 · 2025-12-02T18:10:54 1764699054

> Their value is as a structural check on the power of proprietary systems

Unfortunately that doesn't pay the electricity bill

array_key_first · 2025-12-03T01:08:18 1764724098

It kind of does, because the proprietary systems are unacceptable for many usecases because they are proprietary.

There's a lot of businesses who do not want to hand over their sensitive data to hackers, employees of their competitors, and various world governments. There's inherent risk in choosing a propreitary option, and that doesn't just go for LLMs. You can get your feet swept up from underneath you.

dchest · 2025-12-02T18:11:26 1764699086

Nope, Gemini 3 is hallucinating less than GPT-5.1 for my questions.

mrtksn · 2025-12-02T17:29:38 1764696578

Yep, Gemini is my least favorite and I’m convinced that the hype around it isn’t organic because I don’t see the claimed “superiority”, quite the opposite.

cmrdporcupine · 2025-12-02T17:49:39 1764697779

I think a lot of the hype around Gemini comes down to people who aren't using it for coding but for other things maybe.

Frankly, I don't actually care about or want "general intelligence" -- I want it to make good code, follow instructions, and find bugs. Gemini wasn't bad at the last bit, but wasn't great at the others.

They're all trying to make general purpose AI, but I just want really smart augmentation / tools.

erichocean · 2025-12-07T20:04:45 1765137885

I exclusively use Gemini Pro for coding, and it's been writing ~100% of the code I produce since July.

It's great.

tootie · 2025-12-02T18:26:21 1764699981

No? My recent experience with Gemini was terrific. The last big test I gave of Claude it spun an immaculate web of lies before I forced it to confess.

llm_nerd · 2025-12-02T17:39:57 1764697197

What does your comment have to do with the submission? What a weird non-sequitur. I even went looking at the linked article to see if it somehow compares with Gemini. It doesn't, and only relates to open models.

In prior posts you oddly attack "Palantir-partnered Anthropic" as well.

Are things that grim at OpenAI that this sort of FUD is necessary? I mean, I know they're doing the whole code red thing, but I guarantee that posting nonsense like this on HN isn't the way.

cmrdporcupine · 2025-12-02T17:37:43 1764697063

I also had bad luck when I finally tried Gemini 3 in the gemini CLI coding tool. I am unclear if it's the model or their bad tooling/prompting. It had, as you said, hallucination problems, and it also had memory issues where it seemed to drop context between prompts here and there.

It's also slower than both Opus 4.5 and Sonnet.

bluecalm · 2025-12-02T17:31:15 1764696675

My experience is the opposite although I don't use it to write code but to explore/learn about algorithms and various programming ideas. It's amazing. I am close to cancelling my ChatGPT subscription (I would only use Open Router if it had nicer GUI and dark mode anyway).

minimaxir · 2025-12-02T17:30:49 1764696649

For noncoding tasks, Gemini atleast allows for easier grounding with Google Search.

alfalfasprout · 2025-12-02T17:01:10 1764694870

If anything it's a testament to human intelligence that benchmarks haven't really been a good measure of a model's competence for some time now. They provide a relative sorting to some degree, within model families, but it feels like we've hit an AI winter.

gunalx · 2025-12-03T08:56:44 1764752204

Have used gemini3 to GEW shot a few problems GPT5 struggled on.

moffkalast · 2025-12-02T18:15:37 1764699337

Yes, and likewise with Kimi K2. Despite being on the top of open source benches it makes up more batshit nonsense than even Llama 3.

Trust no one, test your use case yourself is pretty much the only approach, because people either don't run benchmarks correctly or have the incentive not to.

VeejayRampay · 2025-12-03T04:49:42 1764737382

no, I find Gemini to be the best

nullbio · 2025-11-25T01:41:33 1764034893

When will folks stop trusting Palantir-partnered Anthropic is probably a better question.

Anthropic has weaponized the safety narrative into a marketing and political tool, and it is quite clear that they're pushing this narrative both for publicity from media that love the doomer narrative because it brings in ad-revenue, and for regulatory capture reasons.

Their intentions are obviously self-motivated, or they wouldn't be partnering with a company that openly prides itself on dystopian-level spying and surveillance of the world.

OpenAI aren't the good guys either, but I wish people would stop pretending like Anthropic are.

khafra · 2025-11-25T06:15:52 1764051352

All of the leading labs are on track to kill everyone, even Anthropic. Unlike the other labs, Anthropic takes reasonable precautions, and strives for reasonable transparency when it doesn't conflict with their precautions; which is wholly inadequate for the danger and will get everyone killed. But if reality graded on a curve, Anthropic would be a solid B+ to A-.

nullbio · 2025-11-25T01:24:31 1764033871

You're forgetting the step where they write a nefarious paper for their marketing team about the "world-ending dangers" of the capabilities they've discovered within their new model, and push it out to their web of media companies who make bank from the ad-revenue from clicks on their doomsday articles while furthering the regulatory capture goals of the hypocritically Palantir-partnered Anthropic.

bn-l · 2025-11-25T04:40:12 1764045612

And then Dario gives an interview on why open source models should be banned due to _____.

rvz · 2025-11-25T04:45:19 1764045919

The predictable business of selling fear; with the goal of control and regulatory capture.

nullbio · 2025-11-25T01:19:54 1764033594

Novel solutions require some combination of guided brute-force search over a knowledge-database/search-engine (NOT a search over the models weights and NOT using chain of thought), combined with adaptive goal creation and evaluation, and reflective contrast against internal "learned" knowledge. Not only that, but it also requires exploration of the lower-probability space, i.e. results lesser explored, otherwise you're always going to end up with the most common and likely answers. That means being able to quantify what is a "less-likely but more novel solution" to begin with, which is a problem in itself. Transformer architecture LLMs do not even come close to approaching AI in this way.

All the novel solutions humans create are a result of combining existing solutions (learned or researched in real-time), with subtle and lesser-explored avenues and variations that are yet to be tried, and then verifying the results and cementing that acquired knowledge for future application as a building block for more novel solutions, as well as building a memory of when and where they may next be applicable. Building up this tree, to eventually satisfy an end goal, and backtracking and reshaping that tree when a certain measure of confidence stray from successful goal evaluation is predicted.

This is clearly very computationally expensive. It is also very different to the statistical pattern repeaters we are currently using, especially considering that their entire premise works because the algorithm chooses the next most probable token which is a function of the frequency of which that token appears in the training data. In other words, the algorithm is designed explicitly NOT to yield novel results, but rather return the most likely result. Higher temperature results tend to reduce textual coherence rather than increase novelty, because token frequency is a literal proxy for textual coherence in coherent training samples, and there is no actual "understanding" happening, nor reflection of the probability results at this level.

I'm sure smart people have figured a lot of this out already - we have general theory and ideas to back this, look into AIXI for example, and I'm sure there is far newer work. But I imagine that any efficient solutions to this problem will permanently remain in the realm of being a computational and scaling nightmare. Plus adaptive goal creation and evaluation is a really really hard problem, especially if text is your only modality of "thinking". My guess would be that it would require the models to create simulations of physical systems in text-only format, to be able to evaluate them, which also means being able to translate vague descriptions of physical systems into text-based physics sims with the same degrees of freedom as the real world - or at least the target problem, and then also imagine ideal outcomes in that same translated system, and develop metrics of "progress" within this system, for the particular target goal. This is a requirement for the feedback loop of building the tree of exploration and validation. Very challenging. I think these big companies are going to chase their tails for the next 10 years trying to reach an ever elusive intelligence goal, before begrudgingly conceding that existing LLM architectures will not get them there.

nullbio · 2025-11-25T00:08:59 1764029339

You're right, that is the underlying problem. Not only of this, but of our entire economy.

nullbio · 2025-10-27T04:55:32 1761540932

I do cardio for the majority of the day.

measurablefunc · 2025-10-27T19:25:45 1761593145

Then you have nothing to worry about.

nullbio · 2025-05-25T05:38:51 1748151531

No. The entirety of an LLMs output is predicated on the frequencies of patterns in its training data, moulded by the preferences of its trainers through RLHF. They're not capable of reasoning, but they can hallucinate language that sounds and flows like reasoning. If those outputs are fed into an interpreter, that can result in automated behavior. They're not capable of out of distribution behavior or generation (yet), despite what the AI companies would like you to believe. They can only borrow and use concepts from that which they've been trained on, which is why despite LLMs seemingly getting progressively more advanced, we haven't really seen them invent anything novel of note.

comp_throw7 · 2025-05-25T06:10:55 1748153455

This is a non-answer that doesn't explain any difference in capabilities between GPT-3.5 and Claude 4 Opus.

andrewflnr · 2025-05-25T05:58:41 1748152721

Yes, I too am familiar with the 101 level of understanding, but I've also heard of LLMs doing things that stretch that model. Perhaps that's just a matter of combining things in their training data in unexpected ways, hence the second half of my question.

nullbio · 2025-04-12T23:58:16 1744502296

Can we please outlaw advertising with AI chatbots before it becomes a plague? Once it starts, there is no turning back. But if we can get ahead of this now based on what we've already learned about the internet then we can perhaps prevent the carnage that is going to happen.

zipmapfoldright · 2025-04-13T00:02:43 1744502563

what we need is not more regulation