Mistral 3 family of models released

barrell · 2025-12-02T16:04:42 1764691482

I use large language models in http://phrasing.app to format data I can retrieve in a consistent skimmable manner. I switched to mistral-3-medium-0525 a few months back after struggling to get gpt-5 to stop producing gibberish. It's been insanely fast, cheap, reliable, and follows formatting instructions to the letter. I was (and still am) super super impressed. Even if it does not hold up in benchmarks, it still outperformed in practice.

I'm not sure how these new models compare to the biggest and baddest models, but if price, speed, and reliability are a concern for your use cases I cannot recommend Mistral enough.

Very excited to try out these new models! To be fair, mistral-3-medium-0525 still occasionally produces gibberish in ~0.1% of my use cases (vs gpt-5's 15% failure rate). Will report back if that goes up or down with these new models.

mrtksn · 2025-12-02T16:39:56 1764693596

Some time ago I canceled all my paid subscriptions to chatbots because they are interchangeable so I just rotate between Grok, ChatGPT, Gemini, Deepseek and Mistral.

On the API side of things my experience is that the model behaving as expected is the greatest feature.

There I also switched to Openrouter instead of paying directly so I can use whatever model fits best.

The recent buzz about ad-based chatbot services is probably because the companies no longer have an edge despite what the benchmarks say; users are noticing it and canceling paid plans. Just today OpenAI offered me a 1 month free trial as if I wasn’t using it two months ago. I guess they hope I forget to cancel.

barrell · 2025-12-02T16:51:18 1764694278

Yep I spent 3 days optimizing my prompt trying to get gpt-5 to work. Tried a bunch of different models (some Azure some OpenRouter) and got a better success rate with several others without any tailoring of the prompt.

Was really plug and play. There are still small nuances to each one, but compared to a year ago prompts are much more portable

distalx · 2025-12-03T20:09:46 1764792586

What tools or process do you use to optimize your prompts?

amy_petrik · 2025-12-03T23:09:55 1764803395

usually either use Grok to optimize a mistral prompt, or you can use gemini to optimize a chatGPT prompt. It's best to keep those pairs of AIs and not cross streams!

barbazoo · 2025-12-02T16:49:39 1764694179

> I guess they hope I forget to cancel.

Business model of most subscription based services.

viking123 · 2025-12-03T07:43:12 1764747792

For me it's just that I am too lazy to start switching from my GPT subscription, I use it with codex and it's very good for my use-case. And the price at least here in Asia is not expensive at all for the plus tier. The number of tokens is so large that I usually cannot even spend the weekly quota, although I use context smartly and know my codebase so I can always point it to the right place right away.

I feel like at least for normies if they are familiar with ChatGPT, it might be hard to make them switch especially if they are subscribed.

b3ing · 2025-12-03T13:54:38 1764770078

I estimate that 10% of Meetup runs like that

acuozzo · 2025-12-02T18:14:44 1764699284

> because they are interchangeable

What is your use-case?

Mine is: I use "Pro"/"Max"/"DeepThink" models to iterate on novel cross-domain applications of existing mathematics.

My interaction is: I craft a detailed prompt in my editor, hand it off, come back 20-30 minutes later, review the reply, and then repeat if necessary.

My experience is that they're all very, very different from one another.

mrtksn · 2025-12-02T18:49:06 1764701346

my use case is Google replacement, things that I can do by myself so I can verify and things that are not important so I don’t have to verify.

Sure, they produce different output so sometimes I will run the same thing on a few different models when I’m not sure or happy, but I don’t delegate the thinking part actually, I always give a direction in my prompts. I don’t see myself running 30min queries because I will never trust the output and will have to do all the work myself. Instead I like to go step by step together.

giancarlostoro · 2025-12-02T18:15:03 1764699303

Maybe give Perplexity a shot? It has Grok, ChatGPT, Gemini, Kimi K2. I don't think it has Mistral, unfortunately.

mrtksn · 2025-12-02T18:51:00 1764701460

I like Perplexity actually but haven’t used it for some time. Maybe I should give it a go :)

ecommerceguy · 2025-12-02T23:58:09 1764719889

I use their browser called Comet for finance related research. Very nice. I use pretty much all of the main AIs (chat, deep, gem, claude) and have found a little niche use case for each that I'm sure will rotate at some point in an upgrade cycle. There are so many AIs I don't see the point in paying for one. I'm convinced they will need ads to survive.

excited to add mistral to the rotation!

giancarlostoro · 2025-12-03T02:47:33 1764730053

Oh man I use Comet nearly daily, I tried setting perplexity as my new tab page on other browsers and for some reason it's not the same. I mostly use it that boring way too.

VHRanger · 2025-12-03T02:57:00 1764730620

Kagi has Mistral as well

druskacik · 2025-12-02T17:06:41 1764695201

This is my experience as well. Mistral models may not be the best according to benchmarks and I don't use them for personal chats or coding, but for simple tasks with pre-defined scope (such as categorization, summarization, etc.) they are the option I choose. I use mistral-small with batch API and it's probably the best cost-efficient option out there.

leobg · 2025-12-03T08:45:05 1764751505

Did you compare it to gemini-2.0-flash-lite?

leobg · 2025-12-03T11:01:57 1764759717

Answering my own question:

Artificial Analysis ranks them close in terms of price (both 0.3 USD/1M tokens) and intelligence (27 / 29 for gemini/mistral), but ranks gemini-2.0-flash-lite higher in terms of speed (189 tokens/s vs. 130).

So they should be interchangeable. Looking forward to testing this.

[0] https://artificialanalysis.ai/?models=o3%2Cgemini-2-5-pro%2C...

druskacik · 2025-12-03T22:14:45 1764800085

I did some vibe-evals only and it seemed slightly worse for my use case, so I didn't change it.

mbowcut2 · 2025-12-02T17:46:04 1764697564

It makes me wonder about the gaps in evaluating LLMs by benchmarks. There almost certainly is overfitting happening which could degrade other use cases. "In practice" evaluation is what inspired the Chatbot Arena right? But then people realized that Chatbot Arena over-prioritizes formatting, and maybe sycophancy(?). Makes you wonder what the best evaluation would be. We probably need lots more task-specific models. That seems to have been fruitful for improved coding.

pants2 · 2025-12-02T18:02:49 1764698569

The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job.

airstrike · 2025-12-02T18:15:11 1764699311

If you and others have any insights to share on structuring that benchmark, I'm all ears.

There's a new model seemingly every week so finding a way to evaluate them repeatedly would be nice.

The answer may be that it's so bespoke you have to handroll every time, but my gut says there's a set of best practices that are generally applicable.

pants2 · 2025-12-02T20:00:01 1764705601

Generally, the easiest:

1. Sample a set of prompts / answers from historical usage.

2. Run that through various frontier models again and if they don't agree on some answers, hand-pick what you're looking for.

3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set.

4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
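
The steps above can be sketched as a small harness. This is only a rough illustration, assuming each candidate model is wrapped in a plain Python callable; the `ModelFn` type, the `run_benchmark` helper, the per-call cost figures, and the toy models are all hypothetical. In practice each callable would wrap an API request (for example through OpenRouter), which is left out so the sketch runs offline.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical: a "model" is any callable mapping a prompt to an answer.
ModelFn = Callable[[str], str]


@dataclass
class Score:
    accuracy: float     # fraction of test cases answered exactly right
    avg_seconds: float  # mean wall-clock latency per prompt
    cost_usd: float     # total assumed cost for the run


def run_benchmark(
    models: Dict[str, Tuple[ModelFn, float]],
    test_set: List[Tuple[str, str]],
) -> Dict[str, Score]:
    """Score each model on accuracy, speed, and cost.

    `models` maps a name to (callable, assumed cost per call in USD).
    `test_set` is a list of (prompt, expected answer) pairs.
    """
    if not test_set:
        raise ValueError("test_set must not be empty")
    results = {}
    for name, (fn, cost_per_call) in models.items():
        correct = 0
        started = time.perf_counter()
        for prompt, expected in test_set:
            if fn(prompt).strip() == expected:
                correct += 1
        elapsed = time.perf_counter() - started
        results[name] = Score(
            accuracy=correct / len(test_set),
            avg_seconds=elapsed / len(test_set),
            cost_usd=cost_per_call * len(test_set),
        )
    return results


# Toy stand-ins for real API-backed models (illustration only).
def upper_model(prompt: str) -> str:
    return prompt.upper()


def echo_model(prompt: str) -> str:
    return prompt


# Step 1 and 2: a hand-checked set of (prompt, expected answer) pairs.
test_set = [("abc", "ABC"), ("Hi", "HI"), ("OK", "OK")]

# Step 3: score every candidate along accuracy / speed / cost.
scores = run_benchmark(
    {"upper": (upper_model, 0.002), "echo": (echo_model, 0.001)},
    test_set,
)

# Step 4: pick the most accurate model as the starting point for prompt tuning.
best = max(scores, key=lambda name: scores[name].accuracy)
```

Exact string matching is the simplest possible scorer; free-text outputs would need a looser comparison.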

dotancohen · 2025-12-03T12:42:57 1764765777

How do you find and decide which obscure models to test? Do you manually review the model card for each new model on Hugging Face? Is there a better resource?

pants2 · 2025-12-05T04:49:38 1764910178

Just grab the top ~30 models on OpenRouter[1] and test them all. If that's too expensive make a sample 'screening' benchmark that's just a few of the hardest problems to see if it's even worth the full benchmark.

1. https://openrouter.ai/models?order=top-weekly&fmt=table
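
A rough sketch of that two-stage "screening" idea, under the assumption that you already know from past runs which cases fail most often. The helper names, the failure rates, and the toy models are made up for illustration; real models would be API calls.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (prompt, expected answer)


def hardest_cases(cases: List[Case], fail_rates: List[float], k: int) -> List[Case]:
    """Pick the k cases with the highest historical failure rate."""
    ranked = sorted(zip(fail_rates, cases), key=lambda pair: pair[0], reverse=True)
    return [case for _, case in ranked[:k]]


def passes_screen(
    model: Callable[[str], str], screen: List[Case], min_accuracy: float
) -> bool:
    """Run only the screening cases; True if the model earns a full run."""
    correct = sum(1 for prompt, expected in screen if model(prompt) == expected)
    return correct / len(screen) >= min_accuracy


cases = [("1+1", "2"), ("2*3", "6"), ("10/4", "2.5"), ("7-9", "-2")]
fail_rates = [0.01, 0.05, 0.40, 0.30]  # made-up share of past models failing each case
screen = hardest_cases(cases, fail_rates, k=2)

# Toy stand-in models (hypothetical).
ANSWERS = {"1+1": "2", "2*3": "6", "10/4": "2.5", "7-9": "-2"}


def strong_model(prompt: str) -> str:
    return ANSWERS[prompt]


def weak_model(prompt: str) -> str:
    return "2"  # always gives the same answer


candidates: Dict[str, Callable[[str], str]] = {
    "strong": strong_model,
    "weak": weak_model,
}
# Only models that clear the cheap screen go on to the full benchmark.
shortlist = [
    name for name, fn in candidates.items() if passes_screen(fn, screen, 0.5)
]
```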

dotancohen · 2025-12-05T07:59:06 1764921546

Thank you! I'll see about building a test suite.

Do you compare models' output subjectively, manually? Or do you have some objective measures? My use case would be to test diagnostic information summaries - the output is free text, not structured. The only way I can think to automate that would be with another LLM.

Advice welcome!

pants2 · 2025-12-05T16:53:51 1764953631

Yeah - things are easy when you can objectively score an output, otherwise as you said you'll probably need another LLM to score it. For summaries you can try to make that somewhat more objective, like length and "8/10 key points are covered in this summary."

This is a real training method (like Group Relative Policy Optimization), so it's a legitimate approach.
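
The "key points covered" idea can be sketched as a simple scorer. This is a minimal illustration, not the commenter's actual setup: `coverage_score` and `within_length` are hypothetical helpers, the example summary is invented, and the naive substring match stands in for the LLM judge that free-text summaries would probably need.

```python
from typing import List


def coverage_score(summary: str, key_points: List[str]) -> float:
    """Return the fraction of key points mentioned in the summary.

    Matching is a naive case-insensitive substring check, standing in
    for an LLM judge.
    """
    if not key_points:
        raise ValueError("key_points must not be empty")
    text = summary.lower()
    covered = sum(1 for point in key_points if point.lower() in text)
    return covered / len(key_points)


def within_length(summary: str, max_words: int) -> bool:
    """Objective length check: True if the summary has at most max_words words."""
    return len(summary.split()) <= max_words


# Invented example data.
summary = "Disk usage is at 91% and the backup job failed twice overnight."
key_points = ["disk usage", "backup job failed", "memory leak", "overnight"]
score = coverage_score(summary, key_points)  # 3 of 4 points are mentioned
```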

dotancohen · 2025-12-05T17:30:50 1764955850

Thank you. I will google Group Relative Policy Optimization to learn about that and the other training methods. If you have any resources handy that I should be reading, that would be appreciated as well. Have a great weekend.

pants2 · 2025-12-05T19:44:54 1764963894

Nothing off the top of my head! If you find anything good let me know. GRPO is a training technique, likely not exactly what you'd do for benchmarking, but it's interesting to read about anyway. Glad I could help

Legend2440 · 2025-12-02T20:53:08 1764708788

I don’t think benchmark overfitting is as common as people think. Benchmark scores are highly correlated with the subjective “intelligence” of the model. So is pretraining loss.

The only exception I can think of is models trained on synthetic data like Phi.

pembrook · 2025-12-02T19:19:15 1764703155

If the models from the big US labs are being overfit to benchmarks, then we also need to account for HN commenters overfitting positive evaluations to Chinese or European models based on their political biases (US big tech = default bad, anything European = default good).

Also, we should be aware of people cynically playing into that bias to try to advertise their app, like OP who has managed to spam a link in the first line of a top comment on this popular front page article by telling the audience exactly what they want to hear ;)

astrange · 2025-12-03T18:16:14 1764785774

Americans have an opposing bias via the phenomenon of "safe edgy", where for obvious reasons they're uncomfortable with being biased towards anyone who looks like a US minority, and redirect all that energy towards being racist to the French. So it's all balanced.

mentalgear · 2025-12-02T17:28:16 1764696496

Thanks for sharing your use case of the mistral models, which are indeed top-notch! I had a look at phrasing.app, and while it's a nice website, I found the copy of "Hand-crafted. Phrasing was designed & developed by humans, for humans." somewhat of a false virtue given your statements here of advanced LLM usage.

barrell · 2025-12-02T17:35:56 1764696956

I don't see the contention. I do not use llms in the design, development, copywriting, marketing, blogging, or any other aspect of the crafting of the application.

I labor over every word, every button, every line of code, every blog post. I would say it is as hand-crafted as something digital can be.

basilgohar · 2025-12-02T17:46:27 1764697587

I admire and respect this stance. I have been very AI-hesitant and while I'm using it more and more, I have spaces that I want to definitely keep human-only, as this is my preference. I'm glad to hear I'm not the only one like this.

barrell · 2025-12-02T18:06:33 1764698793

Thank you :) and you're definitely not the only one.

Full transparency, the first backend version of phrasing was 'vibe-coded' (long before vibe coding was a thing). I didn't like the results, I didn't like the experience, I didn't feel good ethically, and I didn't like my own development.

I rewrote the application (completely, from scratch, new repo new language new framework) and all of a sudden I liked the results, I loved the process, I had no moral qualms, and I improved leaps and bounds in all areas I worked on.

Automation has some amazing use cases (I am building an automation product at the end of the day) but so does doing hard things yourself.

Although most important is just to enjoy what you do; or perhaps do something you can be proud of.

metadat · 2025-12-02T16:31:23 1764693083

Are you saying gpt-5 produces gibberish 15% of the time? Or are you comparing Mistral gibberish production rate to gpt-5.1's complex task failure rate?

Does Mistral even have a Tool Use model? That would be awesome to have a new coder entrant beyond OpenAI, Anthropic, Grok, and Qwen.

barrell · 2025-12-02T16:47:33 1764694053

Yes. I spent about 3 days trying to optimize the prompt to get gpt-5 to not produce gibberish, to no avail. Completions took several minutes, had an above 50% timeout rate (with a 6 minute timeout mind you), and after retrying they still would return gibberish about 15% of the time (12% on one task, 20% on another task).

I then tried multiple models, and they all failed in spectacular ways. Only Grok and Mistral had an acceptable success rate, although Grok did not follow the formatting instructions as well as Mistral.

Phrasing is a language learning application, so the formatting is very complicated, with multiple languages and multiple scripts intertwined with markdown formatting. I do include dozens of examples in the prompts, but it's something many models struggle with.

This was a few months ago, so to be fair, it's possible gpt-5.1 or gemini-3 or the new deepseek model may have caught up. I have not had the time or need to compare, as Mistral has been sufficient for my use cases.

I mean, I'd love to get that 0.1% error rate down, but there have always been more pressing issues XD

data-ottawa · 2025-12-02T17:28:16 1764696496

With gpt5 did you try adjusting the reasoning level to "minimal"?

I tried using it for a very small and quick summarization task that needed low latency and any level above that took several seconds to get a response. Using minimal brought that down significantly.

Weirdly gpt5's reasoning levels don't map to the OpenAI api level reasoning effort levels.

barrell · 2025-12-02T18:25:09 1764699909

Reasoning was set to minimal and low (and I think I tried medium at some point). I do not believe the timeouts were due to the reasoning taking too long, although I never streamed the results. I think the model just fails often. It stops producing tokens and eventually the request times out.

barbazoo · 2025-12-02T16:50:40 1764694240

Hard to gauge what gibberish is without an example of the data and what you prompted the LLM with.

barrell · 2025-12-02T16:59:43 1764694783

If you wanted examples, you needed only ask :)

These are screenshots from that week: https://x.com/barrelltech/status/1995900100174880806

I'm not going to share the prompt because (1) it's very long (2) there were dozens of variations and (3) it seems like poor business practices to share the most indefensible part of your business online XD

barbazoo · 2025-12-02T17:45:52 1764697552

Surely reads like someone's brain transformed into a tree :)

Impressive, I haven't seen that myself yet, I've only used 5 conversationally, not via API yet.

barrell · 2025-12-02T18:11:28 1764699088

Heh it's a quote from Archer FX (and admittedly a poor machine translation, it's a very old expression of mine).

And yes, this only happens when I ask it to apply my formatting rules. If you let GPT format itself, I would be surprised if this ever happens.

sandblast · 2025-12-02T17:07:09 1764695229

XD XD

acuozzo · 2025-12-02T18:08:54 1764698934

I have a need to remove loose "signature" lines from the last 10% of a tremendous e-mail dataset. Based on your experience, how do you think mistral-3-medium-0525 would do?

barrell · 2025-12-02T18:18:04 1764699484

What's your acceptable error rate? Honestly ministral would probably be sufficient if you can tolerate a small failure rate. I feel like medium would be overkill.

But I'm no expert. I can't say I've used mistral much outside of my own domain.

acuozzo · 2025-12-02T19:15:09 1764702909

I'd prefer for the error rate to be as close to 0% as possible under the strict requirement of having to use a local model. I have access to nodes with 8xH200, but I'd prefer to not tie those up with this task. I'd, instead, prefer to use a model I can run on an M2 Ultra.

barrell · 2025-12-02T19:50:28 1764705028

If I cannot tolerate a failure rate, I do not use LLMs (or any ML models).

But in that case the larger the better. If mistral medium can run on your M2 Ultra then it should be up to the task. Should edge out ministral and be just shy of the biggest frontier models.

But I wouldn’t even trust GPT-5 or Claude Opus or Gemini 3 Pro to get close to a zero percent failure rate, and for a task such as this I would not expect mistral medium to outperform the big boys

mackross · 2025-12-03T12:32:40 1764765160

Cool app. I couldn’t see a way to report an error in one of the default expressions.

msp26 · 2025-12-02T16:33:46 1764693226

The new large model uses DeepseekV2 architecture. 0 mention on the page lol.

It's a good thing that open source models use the best arch available. K2 does the same but at least mentions "Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar to DeepSeek-V3".

---

vllm/model_executor/models/mistral_large_3.py

```
from vllm.model_executor.models.deepseek_v2 import DeepseekV3ForCausalLM

class MistralLarge3ForCausalLM(DeepseekV3ForCausalLM):
    ...  # class body not shown in the original comment
```

"Science has always thrived on openness and shared discovery." btw

Okay I'll stop being snarky now and try the 14B model at home. Vision is good additional functionality on Large.

Jackson__ · 2025-12-02T21:01:54 1764709314

So they spent all of their R&D to copy deepseek, leaving none for the singular novel added feature: vision.

To quote the hf page:

>Behind vision-first models in multimodal tasks: Mistral Large 3 can lag behind models optimized for vision tasks and use cases.

Ey7NFZ3P0nzAe · 2025-12-02T21:12:29 1764709949

Well, behind "models" not "language models".

Of course models purely made for image stuff will completely wipe it out. The vision language models are useful for their generalist capabilities

make3 · 2025-12-02T19:50:03 1764705003

Architecture differences wrt vanilla transformers and between modern transformers are a tiny part of what makes a model nowadays

halJordan · 2025-12-02T21:43:22 1764711802

I don't think it's fair to demand everything be open and then get mad when that openness is used. It's an obsessive and harmful double standard.

simonw · 2025-12-02T17:41:10 1764697270

The 3B vision model runs in the browser (after a 3GB model download). There's a very cool demo of that here: https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU

Pelicans are OK but not earth-shattering: https://simonwillison.net/2025/Dec/2/introducing-mistral-3/

troyvit · 2025-12-02T19:38:37 1764704317

I'm reading this post and wondering what kind of crazy accessibility tools one could make. I think it's a little off the rails but imagine a tool that describes a web video for a blind user as it happens, not just the speech, but the actual action.

GaggiX · 2025-12-02T20:02:21 1764705741

This is not local but Gemini models can process very long videos and provide description with timestamps if asked for.

https://ai.google.dev/gemini-api/docs/video-understanding#tr...

embedding-shape · 2025-12-02T21:30:37 1764711037

Nor would it be describing things as they happen, but instead needing pre-processing, so in the end, very different :)

user_of_the_wek · 2025-12-03T07:35:30 1764747330

> The image depicts and older man...

Ouch

mythz · 2025-12-02T16:05:05 1764691505

Europe's bright star has been quiet for a while, great to see them back and good to see them come back to the Open Source light with Apache 2.0 licenses - they're too far from the SOTA pack for exclusive/proprietary models to work in their favor.

Mistral had the best small models on consumer GPUs for a while, hopefully Ministral 14B lives up to their benchmarks.

rvz · 2025-12-02T16:14:09 1764692049

All thanks to the US VCs that actually have money to fund Mistral's entire business.

Had they gone to the EU, Mistral would have gotten a minuscule grant from the EU to train their AI models.

amarcheschi · 2025-12-02T16:54:59 1764694499

Mistral's biggest investor is ASML, although it became one later than the other VCs

crimsoneer · 2025-12-02T16:17:43 1764692263

I mean, one is a government, the other are VCs (also, I would be shocked if there isn't some French gov funding somewhere in the massive mistral pile).

kergonath · 2025-12-03T08:05:32 1764749132

> I would be shocked if there isn't some French gov funding somewhere in the massive mistral pile

There is a bit of it, yes, although how much exactly is difficult to know. It’s not all tax breaks and subsidies; several public agencies are using it, including the army, so finding out the details is not trivial.

whiplash451 · 2025-12-02T16:23:07 1764692587

1. so what 2. asml

rvz · 2025-12-02T16:36:10 1764693370

1. It matters.

2. Did ASML invest in Mistral in their first round of venture funding or was it US VCs all along that took that early risk and backed them from the very start?

Risk aversion is in the DNA and in almost every plot of land in Europe such that US VCs saw something in Mistral before even the european giants like ASML did.

ASML would have passed on Mistral from the start and Mistral would have instead begged to the EU for a grant.

apexalpha · 2025-12-02T16:30:18 1764693018

1. Big problem

2. ASML was propped up by ASM and Philips, stepping in as "VCs"

didibus · 2025-12-02T16:37:02 1764693422

For VC don't you need a lot of capital and people with too much money?

Isn't that then a chicken and egg?

JumpCrisscross · 2025-12-02T19:01:05 1764702065

> and people with too much money?

No. VC’s historical capital has come from institutional investors. Pensions. Endowments. Foundations.

didibus · 2025-12-03T03:19:03 1764731943

Interesting, is that still the case? And how is the decision to take those high risk investments made for things like pensions and such?

timpera · 2025-12-02T15:11:36 1764688296

Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things.

Youden · 2025-12-02T15:58:05 1764691085

They mentioned LMArena, you can get the results for that here: https://lmarena.ai/leaderboard/text

Mistral Large 3 is ranked 28, behind all the other major SOTA models. The delta between Mistral and the leader is only 1418 vs. 1491 though. I *think* that means the difference is relatively small.

jampekka · 2025-12-02T16:42:34 1764693754

1491 vs 1418 Elo means the stronger model wins about 60% of the time.
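
The "about 60%" figure follows from the standard Elo expected-score formula. A quick sketch, assuming the leaderboard ratings behave like classic Elo ratings on a 400-point scale and that ties count as half a win (the function name is just for illustration):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


leader = 1491
mistral_large_3 = 1418

p_leader = elo_expected_score(leader, mistral_large_3)
p_mistral = elo_expected_score(mistral_large_3, leader)

# Odds of the leader being preferred in a single head-to-head vote.
odds = p_leader / p_mistral

print(f"leader preferred {p_leader:.1%} of the time, odds {odds:.2f}:1")
```

The two probabilities sum to 1, so a 73-point gap is roughly a 60/40 split in head-to-head votes, or about 3:2 odds.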

supermatt · 2025-12-02T16:50:56 1764694256

Probably naive questions:

Does that also mean that Gemini-3 (the top ranked model) loses to mistral 3 40% of the time?

Does that make Gemini 1.5x better, or mistral 2/3rd as good as Gemini, or can we not quantify the difference like that?

esafak · 2025-12-02T16:54:10 1764694450

Yes, of course.

uejfiweun · 2025-12-03T01:28:18 1764725298

Wow. If all the trillions only produce that small of a diff... that's shocking. That's the sort of knowledge that could pop the bubble.

JustFinishedBSG · 2025-12-03T10:10:25 1764756625

I wouldn't trust LMArena results much. They measure user preference and users are highly skewed by style, tone etc.

You can literally "improve" your model on LMArena by just adding a bunch of emojis.

qznc · 2025-12-02T16:01:00 1764691260

I guess that could be considered comparative advertising then and companies generally try to avoid that scrutiny.

constantcrying · 2025-12-02T15:29:16 1764689356

The lack of the comparison (which absolutely was done), tells you exactly what you need to know.

bildung · 2025-12-02T16:48:59 1764694139

I think people from the US often aren't aware how many companies from the EU simply won't risk losing their data to the providers you have in mind, OpenAI, Anthropic and Google. They simply are no option at all.

The company I work for, for example, a mid-sized tech business, is currently investigating its local hosting options for LLMs. So Mistral certainly will be an option, among the Qwen family and Deepseek.

Mistral is positioning themselves for that market, not the one you have in mind. Comparing their models with Claude etc. would mean associating themselves with the data leeches, which they probably try to avoid.

adam_patarino · 2025-12-02T19:08:10 1764702490

We're seeing the same thing for many companies, even in the US. Exposing your entire codebase to an unreliable third party is not exactly SOC / ISO compliant. This is one of the core things that motivated us to develop cortex.build so we could put the model on the developer's machine and completely isolate the code without complicated model deployments and maintenance.

leobg · 2025-12-03T14:01:46 1764770506

Does your company use Microsoft Teams?

BoorishBears · 2025-12-02T17:37:24 1764697044

Mistral is founded by multiple Meta engineers, no?

Funded mostly by US VCs?

Hosted primarily on Azure?

Do you really have to go out of your way to start calling their competition "data leeches" for out-executing them?

sofixa · 2025-12-02T19:37:22 1764704242

Mistral are mostly focusing on b2b, and on customers that want to self-host (banks and stuff). So their founders being from Meta, or where their cloud platform is hosted, is entirely irrelevant to the story.

BoorishBears · 2025-12-02T19:49:11 1764704951

The fact they would not exist without the leeches and built their business on the leeches is irrelevant.

Pan-nationalism is a hell of a drug: a company that does not know you exist puts out an objectively awful release, and people take frank discussion of it as a personal slight.

Fnoord · 2025-12-03T05:37:19 1764740239

Those who crawled the web without consent, and then put their LLM in a blackbox without attribution, with secret prompt and secret weights -- i.e. all of this without giving back, while creating tons of CO2. Those are the leeches.

BoorishBears · 2025-12-04T09:14:28 1764839668

Ah, so "crawled the web without consent, and then put their LLM in a blackbox without attribution" is not being a leech once you release the weights of an underperforming model using someone else's arch.

I knew y'all's standards were lower but geez!

Fnoord · 2025-12-04T09:58:27 1764842307

At the very least it is a step in the right direction. Can't say the same for these proprietary models. And guess which country has all these proprietary models? USA.

BoorishBears · 2025-12-05T07:03:08 1764918188

Thank goodness for that, otherwise all we might have is useless copies of Deepseek.

baq · 2025-12-02T21:34:50 1764711290

If you want to allocate capital efficiently planet-scale you have to ignore nations to the largest extent possible.

sofixa · 2025-12-02T20:00:05 1764705605

> The fact they would not exist without the leeches and built their business on the leeches is irrelevant.

How so?

bildung · 2025-12-03T13:16:25 1764767785

I didn't mean to imply US bad EU good. As such, this isn't about which passport the VCs have, but about local hosting and open weight models. A closed model from a US company always comes with the risk of data exfiltration either for training or thanks to CLOUD Act etc (i.e. industrial espionage).

And personally I don't care at all about the performance delta - we are talking about a difference of 6 to at most 12 months here, between closed source SOTA and open weight models.

troyvit · 2025-12-02T19:55:07 1764705307

It's wayyyy too early in the game to say who is out-executing whom.

I mean why do you think those guys left Meta? It reminds me of a time ten years ago I was sitting on a flight with a guy who works for the natural gas industry. I was (cough still am) a pretty naive environmentalist, so I asked him what he thought of solar, wind, etc. and why should we be investing in natural gas when there are all these other options. His response was simple. Natural gas can serve as a bridge from hydrocarbons to true green energy sources. Leverage that dense energy to springboard the other sources in the mix and you build a path forward to carbon free energy.

I see Mistral's use of US VCs the same way. Those VCs are hedging their bets and maybe hoping to make a few bucks. A few of them are probably involved because they're buddies with the former Meta guys "back in the day." If Mistral executes on their plan of being a transparent b2b option with solid data protections then they used those VCs the way they deserve to be used and the VCs make a few bucks. If Europe ever catches up to the US in terms of data centers, would Mistral move off of Azure? I'd bet $5 that they would.

popinman322 · 2025-12-02T16:10:17 1764691817

They're comparing against open weights models that are roughly a month away from the frontier. Likely there's an implicit open-weights political stance here.

There are also plenty of reasons not to use proprietary US models for comparison: The major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.

A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus' real world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning and this lack of comparison is another symptom.

kalkin · 2025-12-02T16:59:49 1764694789

Scale AI wrote a paper a year ago comparing various models' performance on benchmarks to performance on similar but held-out questions. Generally the closed source models performed better, and Mistral came out looking pretty bad: https://arxiv.org/pdf/2405.00332

extr · 2025-12-02T16:41:54 1764693714

??? Closed US frontier models are vastly more effective than anything OSS right now, the reason they didn’t compare is because they’re a different weight class (and therefore product) and it’s a bit unfair.

We’re actually at a unique point right now where the gap is larger than it has been in some time. Consensus since the latest batch of releases is that we haven’t found the wall yet. 5.1 Max, Opus 4.5, and G3 are absolutely astounding models and unless you have unique requirements some way down the price/perf curve I would not even look at this release (which is fine!)

crimsoneer · 2025-12-02T15:36:59 1764689819

If someone is using these models, they probably can't or won't use the existing SOTA models, so not sure how useful those comparisons actually are. "Here is a benchmark that makes us look bad from a model you can't use on a task you won't be undertaking" isn't actually helpful (and definitely not in a press release).

constantcrying · 2025-12-02T15:50:34 1764690634

Completely agree that there are legitimate reasons to prefer comparison to e.g. DeepSeek models. But that doesn't change my point, we both agree that the comparisons would be extremely unfavorable.

Lapel2742 · 2025-12-02T16:10:22 1764691822

> that the comparisons would be extremely unfavorable.

Why should they compare apples to oranges? Mistral Large 3 costs ~1/10th of Sonnet 4.5. They clearly target different users. If you want a coding assistant you probably wouldn't choose this model for various reasons. There is a place for more than only the benchmark king.

constantcrying · 2025-12-02T16:22:00 1764692520

Come on. Do you just not read posts at all?

esafak · 2025-12-02T16:31:45 1764693105

Which lightweight models do these compare unfavorably with?

tarruda · 2025-12-02T16:00:04 1764691204

Here's what I understood from the blog post:

- Mistral Large 3 is comparable with the previous Deepseek release.

- Ministral 3 LLMs are comparable with older open LLMs of similar sizes.

constantcrying · 2025-12-02T16:03:15 1764691395

And implicit in this is that it compares very poorly to SOTA models. Do you disagree with that? Do you think these models are beating SOTA and they did not include the benchmarks because they forgot?

saubeidl · 2025-12-02T16:32:26 1764693146

Those are SOTA for open models. It's a separate league from closed models entirely.

supermatt · 2025-12-02T17:09:53 1764695393

> It's a separate league from closed models entirely.

To be fair, the SOTA models aren't even a single LLM these days. They are doing all manner of tool use and specialised submodel calls behind the scenes - a far cry from in-model MoE.

tarruda · 2025-12-02T16:10:58 1764691858

> Do you disagree with that?

I think that Qwen3 8B and 4B are SOTA for their size. The GPQA Diamond accuracy chart is weird: both Qwen3 8B and 4B have higher scores, so they used this weird chart where the "x" axis shows the number of output tokens. I missed the point of this.

meatmanek · 2025-12-02T19:40:31 1764704431

Generation time is more or less proportional to tokens * model size, so if you can get the same quality result with fewer tokens from the same size of model, then you save time and money.
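A rough sketch of that rule of thumb, in case it helps. The function name and the token counts are made up for illustration and are not read off the chart; the only thing taken from the comment above is the assumption that cost scales with output tokens times model size.

```python
def relative_generation_cost(output_tokens: int, params_billions: float) -> float:
    """Crude cost proxy in arbitrary units: output tokens * model size.

    Assumes cost is simply proportional to both factors, as in the
    rule of thumb above. Real latency also depends on hardware,
    batching, and prompt length, which are ignored here.
    """
    return output_tokens * params_billions


# Hypothetical numbers: two 8B models reaching the same answer quality,
# one using 2,000 output tokens and the other 10,000.
terse = relative_generation_cost(output_tokens=2_000, params_billions=8)
verbose = relative_generation_cost(output_tokens=10_000, params_billions=8)

# Same model size, so the ratio is just the ratio of token counts.
ratio = verbose / terse
print(f"verbose model costs {ratio:.1f}x as much per answer")
```

Under that assumption, a chart with output tokens on the x axis is effectively a cost axis for models of the same size.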

kergonath · 2025-12-03T08:11:52 1764749512

Thanks. That was not obvious to me either.

rvz · 2025-12-02T16:17:19 1764692239

> I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release,

Why would they? They know they can't compete against the heavily closed-source models.

They are not even comparing against GPT-OSS.

That is absolutely and shockingly bearish.

mrinterweb · 2025-12-02T19:51:56 1764705116

I don't like being this guy, but I think Deepseek 3.2 stole all the thunder yesterday. Notice that these comparisons are to Deepseek 3.1. Deepseek 3.2 is a big step up over 3.1, if benchmarks are to be believed. Just unfortunate timing of release. https://api-docs.deepseek.com/news/news251201

hiddencost · 2025-12-03T05:11:26 1764738686

Idk. They look like they're ahead on the saturated benchmarks and behind on the unsaturated ones. Looks more like they overfit to the benchmarks.

yvoschaap · 2025-12-02T15:28:13 1764689293

Upvoting for Europe's best efforts.

sebzim4500 · 2025-12-02T15:44:23 1764690263

That's unfair to Europe. A bunch of AI work is done in London (Deepmind is based here for a start)

p2detar · 2025-12-02T15:56:59 1764691019

That's ok. How could they know that there are companies like Aleph Alpha, Helsing or the famous DeepL. European companies are not that vocal, but that doesn't mean they aren't making progress in the field.

edit: typos

Glemkloksdjf · 2025-12-02T15:58:35 1764691115

That's not the point.

DeepMind is not a UK company; it's Google, aka US.

Mistral is a real EU based company.

gishh · 2025-12-02T16:34:16 1764693256

Using US VC dollars. Where their desks are isn’t really important.

data-ottawa · 2025-12-02T17:38:46 1764697126

Increasingly where the desks and servers are is critical.

The cloud act and the current US administration doing things like sanctioning the ICC demonstrate why the locations of those desks is important.

cycomanic · 2025-12-02T19:01:34 1764702094

That's such a silly argument. X, OpenAI and others have large Saudi investments. In the grand scheme of things the US is largely indebted to China and Japan.

vintermann · 2025-12-02T16:53:27 1764694407

Currency is interchangeable. Location might not be.

Glemkloksdjf · 2025-12-03T09:55:18 1764755718

An EU company pays taxes in the EU, has an EU mindset (worker laws etc.), and focuses more on the EU than on other countries.

And an EU company can't be forced by the US Gov to hand over data.

GaggiX · 2025-12-02T15:45:58 1764690358

London is not part of Europe anymore since Brexit /s

ot · 2025-12-02T15:49:34 1764690574

Is it so hard for people to understand that Europe is a continent, EU is a federation of European countries, and the two are not the same?

usrnm · 2025-12-02T15:54:41 1764690881

Europe isn't even a continent and has no real definition (none that would make any sense, anyway), so the whole thing is confusing by design

rc1 · 2025-12-03T19:50:36 1764791436

If Europe isn’t a continent, what continent are the EU member states sitting on?

rkomorn · 2025-12-03T19:52:57 1764791577

Eurasia is the widely accepted answer.

denysvitali · 2025-12-02T17:17:38 1764695858

I honestly think it is. The number of people who think Europe and the EU are the same thing is really concerning.

And no, it's not only americans. I keep hearing this thing from people living in Europe as well (or better, in the EU). I also very often hear phrases like "Switzerland is not in Europe" to indicate that the country is not part of the European Union.

MadDemon · 2025-12-02T18:23:50 1764699830

Switzerland has such close ties to the EU that I would consider them half in.

lostmsu · 2025-12-02T17:08:33 1764695313

Isn't London on an island, mr. Pedantic?

TulliusCicero · 2025-12-02T17:39:04 1764697144

So I guess Japan isn't Asian then?

layer8 · 2025-12-03T01:42:16 1764726136

While Japan is part of Asia, and Asia is a continent, Japan is also separated from the Asian continent: https://en.wikipedia.org/wiki/Geography_of_Japan#Location

lostmsu · 2025-12-03T02:56:05 1764730565

What's more interesting is that the comment you are replying to mistakenly asked me instead of asking the parent.

GaggiX · 2025-12-02T15:51:24 1764690684

I think you missed the joke

tmoravec · 2025-12-02T18:36:48 1764700608

Drifted to the Caribbean.

colesantiago · 2025-12-02T16:11:50 1764691910

Deepmind doesn't exist anymore.

Google DeepMind does exist.

LunaSea · 2025-12-02T16:54:41 1764694481

Upvoting Windows 11 as the US's best effort at Operating Systems development.

DarmokJalad1701 · 2025-12-02T17:21:21 1764696081

Wouldn't that be macOS? Or BSD? Or Unix? CentOS?

LunaSea · 2025-12-02T19:11:46 1764702706

What's the market share of those compared to Windows and Linux?

DarmokJalad1701 · 2025-12-02T23:55:23 1764719723

"best effort at Operating Systems development" doesn't imply anything about the market share.

simgt · 2025-12-02T15:27:27 1764689247

I still don't understand what the incentive is for releasing genuinely good model weights. What makes sense however is OpenAI releasing a somewhat generic model like gpt-oss that games the benchmarks just for PR. Or some Chinese companies doing the same to cut the ground from under the feet of American big tech. Are we really hopeful we'll still get decent open weights models in the future?

mirekrusin · 2025-12-02T15:54:16 1764690856

Because there is no money in making them closed.

Open weight means secondary sales channels like their fine tuning service for enterprises [0].

They can't compete with large proprietary providers but they can erode and potentially collapse them.

Open weights and open research build on each other, advancing the participants and creating an environment that has a shot against proprietary services.

Transparency, control, privacy, cost etc. do matter to people and corporations.

[0] https://mistral.ai/solutions/custom-model-training

talliman · 2025-12-02T15:42:48 1764690168

Until there is a sustainable, profitable and moat-building business model for generative AI, the competition is not to have the best proprietary model, but rather to raise the most VC money to be well positioned when that business model does arise.

Releasing a near state-of-the-art open model instantly catapults companies to a valuation of several billion dollars, making it possible to raise money to acquire GPUs and train more SOTA models.

Now, what happens if such a business model does not emerge? I hope we won't find out!

mirekrusin · 2025-12-02T15:58:04 1764691084

Explained well in this documentary [0].

[0] https://www.youtube.com/watch?v=BzAdXyPYKQo

simgt · 2025-12-02T16:21:24 1764692484

I was fully expecting that but it doesn't get old ;)

memming · 2025-12-02T15:54:11 1764690851

It’s funny how future money drives the world. Fortunately it’s fueling progress this time around.

NitpickLawyer · 2025-12-02T15:50:27 1764690627

> gpt-oss that games the benchmarks just for PR.

gpt-oss is killing the ongoing AIMO3 competition on Kaggle. They're using a hidden, new set of problems, IMO level, handcrafted to be "AI hardened". And gpt-oss submissions are at ~33/50 right now, two weeks into the competition. The benchmarks (at least for math) were not gamed at all. They are really good at math.

lostmsu · 2025-12-02T17:11:35 1764695495

Are they ahead of all other recent open models? Is there a leaderboard?

NitpickLawyer · 2025-12-02T17:23:03 1764696183

There is a leaderboard [1] but we'll have to wait till April for the competition to end to know what models they're using. The current number 3 on there (34/50) has mentioned in discussions that they're using gpt-oss-120b. There were also some scores shared for gpt-oss-20b, in the 25/50 range.

The next "public" model is qwen30b-thinking at 23/50.

Competition is limited to 1 H100 (80GB) and 5h runtime for 50 problems. So larger open models (deepseek, larger qwens) don't fit.

[1] https://www.kaggle.com/competitions/ai-mathematical-olympiad...
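For a sense of what those limits mean, here is a back-of-the-envelope sketch. The helper names are hypothetical, the 4 bits per parameter figure is an assumption about quantization rather than something stated in the thread, and the parameter counts are rounded (120B for gpt-oss-120b, 670B for the large Deepseek as listed elsewhere in this thread). It counts weights only and ignores KV cache and activations.

```python
def minutes_per_problem(total_hours: float, problems: int) -> float:
    """Average wall-clock budget per problem, in minutes."""
    return total_hours * 60 / problems


def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate memory for weights alone, in GB.

    params_billions * 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    simplifies to params_billions * bits_per_param / 8.
    """
    return params_billions * bits_per_param / 8


H100_MEMORY_GB = 80  # from the competition limit quoted above

budget = minutes_per_problem(total_hours=5, problems=50)
print(f"time budget: {budget:.0f} minutes per problem")

# Assumed 4-bit weights for both models (an illustration, not a spec).
for name, params_b in [("gpt-oss-120b", 120), ("Deepseek (670B)", 670)]:
    need = weight_memory_gb(params_b, bits_per_param=4)
    verdict = "fits" if need <= H100_MEMORY_GB else "does not fit"
    print(f"{name}: ~{need:.0f} GB of weights, {verdict} in {H100_MEMORY_GB} GB")
```

With those assumptions the time budget is also tight for models that burn a lot of thinking tokens, since every problem gets only a few minutes on average.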

data-ottawa · 2025-12-02T17:45:54 1764697554

I find the qwen3 models spend a ton of thinking tokens which could hamstring them on the runtime limitations. Gpt-oss 120b is much more focused and steerable there.

The token use chart in the OP release page demonstrates the Qwen issue well.

Token churn does help smaller models on math tasks, but for general purpose stuff it seems to hurt.

prodigycorp · 2025-12-02T15:37:32 1764689852

gpt-oss are really solid models. By far the best at tool calling, and performant.

nullbio · 2025-12-02T17:00:04 1764694804

Google games benchmarks more than anyone, hence Gemini's strong bench lead. In reality though, it's still garbage for general usage.

tucnak · 2025-12-02T15:46:19 1764690379

If the claims on multilingual and pretraining performance are accurate, this is huge! This may be the best-in-class multilingual stuff since the more recent Gemmas, which used to be unmatched. I know Americans don't care much about the rest of the world, but we're still using our native tongues, thank you very much; there is a huge issue with e.g. Ukrainian (as opposed to Russian) being underrepresented in many open-weight and weight-available models. Gemma used to be a notable exception; I wonder if it's still the case. On a different note: I wonder why scores on TriviaQA vis-a-vis 14b model lags behind Gemma 12b so much; that one is not a formatting-heavy benchmark.

NitpickLawyer · 2025-12-02T16:05:27 1764691527

> I wonder why scores on TriviaQA vis-a-vis 14b model lags behind Gemma 12b so much; that one is not a formatting-heavy benchmark.

My guess is the vast scale of google data. They've been hoovering data for decades now, and have had curation pipelines (guided by real human interactions) since forever.

nullbio · 2025-12-02T16:58:16 1764694696

Anyone else find that despite Gemini performing best on benches, it's actually still far worse than ChatGPT and Claude? It seems to hallucinate nonsense far more frequently than any of the others. Feels like Google just bench maxes all day every day. As for Mistral, hopefully OSS can eat all of their lunch soon enough.

apexalpha · 2025-12-02T17:00:56 1764694856

No, I've been using Gemini for help while learning / building my onprem k8s cluster and it has been almost spotless.

Granted, this is a subject that is very well present in the training data but still.

Synthetic7346 · 2025-12-02T17:17:36 1764695856

I found gemini 3 to be pretty lackluster for setting up an onprem k8s cluster - sonnet 4.5 was more accurate from the get go, required less handholding

mvkel · 2025-12-02T17:02:49 1764694969

Open weight LLMs aren't supposed to "beat" closed models, and they never will. That isn’t their purpose. Their value is as a structural check on the power of proprietary systems; they guarantee a competitive floor. They’re essential to the ecosystem, but they’re not chasing SOTA.

cmrdporcupine · 2025-12-02T17:47:37 1764697657

This may be the case, but DeepSeek 3.2 is "good enough" that it competes well with Sonnet 4 -- maybe 4.5 -- for about 80% of my use cases, at a fraction of the cost.

I feel we're only a year or two away from hitting a plateau with the frontier closed models having diminishing returns vs what's "open"

troyvit · 2025-12-02T20:35:42 1764707742

I think you're right, and I feel the same about Mistral. It's "good enough", super cheap, privacy friendly, and doesn't burn coal by the shovel-full. No need to pay through the nose for the SOTA models just to get wrapped into the same SaaS games that plague the rest of the industry.

barrell · 2025-12-02T17:06:49 1764695209

I can attest to Mistral beating OpenAI in my use cases pretty definitively :)

theshrike79 · 2025-12-03T12:18:14 1764764294

In my use cases mistral has been next to useless.

Granted my uses have been programming related. Mistral prints the answer almost immediately and is also completely and utterly hallucinating everything and producing just something that looks like code but could never even compile...

re-thc · 2025-12-02T17:22:55 1764696175

> Open weight LLMs aren't supposed to "beat" closed models, and they never will. That isn’t their purpose.

Do things ever work that way? What if Google did Open source Gemini. Would you say the same? You never know. There's never "supposed" and "purpose" like that.

lowkey_ · 2025-12-02T17:51:44 1764697904

Not the above poster, but:

OpenAI went closed (despite open literally being in the name) once they had the advantage. Meta also is going closed now that they've caught up.

Open-source makes sense to accelerate to catch up, but once ahead, closed will come back to retain advantage.

mvkel · 2025-12-03T00:42:13 1764722533

I continue to be surprised that the supposed bastion of "safe" AI, Anthropic, has a record of being the least-open AI company

pants2 · 2025-12-02T18:10:54 1764699054

> Their value is as a structural check on the power of proprietary systems

Unfortunately that doesn't pay the electricity bill

array_key_first · 2025-12-03T01:08:18 1764724098

It kind of does, because the proprietary systems are unacceptable for many usecases because they are proprietary.

There are a lot of businesses that do not want to hand over their sensitive data to hackers, employees of their competitors, and various world governments. There's inherent risk in choosing a proprietary option, and that doesn't just go for LLMs. You can get your feet swept out from under you.

dchest · 2025-12-02T18:11:26 1764699086

Nope, Gemini 3 is hallucinating less than GPT-5.1 for my questions.

mrtksn · 2025-12-02T17:29:38 1764696578

Yep, Gemini is my least favorite and I’m convinced that the hype around it isn’t organic because I don’t see the claimed “superiority”, quite the opposite.

cmrdporcupine · 2025-12-02T17:49:39 1764697779

I think a lot of the hype around Gemini comes down to people who aren't using it for coding but for other things maybe.

Frankly, I don't actually care about or want "general intelligence" -- I want it to make good code, follow instructions, and find bugs. Gemini wasn't bad at the last bit, but wasn't great at the others.

They're all trying to make general purpose AI, but I just want really smart augmentation / tools.

tootie · 2025-12-02T18:26:21 1764699981

No? My recent experience with Gemini was terrific. The last big test I gave Claude, it spun an immaculate web of lies before I forced it to confess.

llm_nerd · 2025-12-02T17:39:57 1764697197

What does your comment have to do with the submission? What a weird non-sequitur. I even went looking at the linked article to see if it somehow compares with Gemini. It doesn't, and only relates to open models.

In prior posts you oddly attack "Palantir-partnered Anthropic" as well.

Are things that grim at OpenAI that this sort of FUD is necessary? I mean, I know they're doing the whole code red thing, but I guarantee that posting nonsense like this on HN isn't the way.

cmrdporcupine · 2025-12-02T17:37:43 1764697063

I also had bad luck when I finally tried Gemini 3 in the gemini CLI coding tool. I am unclear if it's the model or their bad tooling/prompting. It had, as you said, hallucination problems, and it also had memory issues where it seemed to drop context between prompts here and there.

It's also slower than both Opus 4.5 and Sonnet.

bluecalm · 2025-12-02T17:31:15 1764696675

My experience is the opposite although I don't use it to write code but to explore/learn about algorithms and various programming ideas. It's amazing. I am close to cancelling my ChatGPT subscription (I would only use Open Router if it had nicer GUI and dark mode anyway).

minimaxir · 2025-12-02T17:30:49 1764696649

For noncoding tasks, Gemini at least allows for easier grounding with Google Search.

gunalx · 2025-12-03T08:56:44 1764752204

Have used gemini3 to few-shot a few problems GPT5 struggled on.

alfalfasprout · 2025-12-02T17:01:10 1764694870

If anything it's a testament to human intelligence that benchmarks haven't really been a good measure of a model's competence for some time now. They provide a relative sorting to some degree, within model families, but it feels like we've hit an AI winter.

moffkalast · 2025-12-02T18:15:37 1764699337

Yes, and likewise with Kimi K2. Despite being on the top of open source benches it makes up more batshit nonsense than even Llama 3.

Trust no one, test your use case yourself is pretty much the only approach, because people either don't run benchmarks correctly or have the incentive not to.

VeejayRampay · 2025-12-03T04:49:42 1764737382

no, I find Gemini to be the best

arnaudsm · 2025-12-02T16:00:11 1764691211

Geometric mean of MMMLU + GPQA-Diamond + SimpleQA + LiveCodeBench :

- Gemini 3.0 Pro : 84.8

- DeepSeek 3.2 : 83.6

- GPT-5.1 : 69.2

- Claude Opus 4.5 : 67.4

- Kimi-K2 (1.2T) : 42.0

- Mistral Large 3 (675B) : 41.9

- Deepseek-3.1 (670B) : 39.7

The 14B, 8B & 3B models are SOTA though, and do not have Chinese censorship like Qwen3.
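For anyone wanting to reproduce this kind of aggregate, a minimal sketch of a geometric mean over four benchmark scores follows. The per-benchmark numbers below are placeholders chosen so the arithmetic is easy to check; they are not the scores of any model in the list above.

```python
import math


def geometric_mean(scores: list[float]) -> float:
    """Geometric mean via logs; requires strictly positive scores."""
    if not scores or any(s <= 0 for s in scores):
        raise ValueError("scores must be non-empty and strictly positive")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Placeholder scores (hypothetical): strong on three benchmarks,
# weak on one.
hypothetical = {
    "MMMLU": 80.0,
    "GPQA-Diamond": 80.0,
    "SimpleQA": 5.0,
    "LiveCodeBench": 80.0,
}

gm = geometric_mean(list(hypothetical.values()))
am = sum(hypothetical.values()) / len(hypothetical)
print(f"geometric mean: {gm:.2f}, arithmetic mean: {am:.2f}")
```

A geometric mean punishes a single weak benchmark much harder than an arithmetic mean does, so a large gap in one test can produce a large gap in the aggregate.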

jasonjmcghee · 2025-12-02T16:15:33 1764692133

How is there such a gap between Gemini 3 vs GPT 5.1/Opus 4.5? What is Gemini 3 crushing the others on?

arnaudsm · 2025-12-02T16:50:30 1764694230

Could be optimized for benchmarks, but Gemini 3 has been stellar for my tasks so far.

Maybe an architectural leap?

netdur · 2025-12-02T19:28:48 1764703728

I believe it is the system instructions that make the difference for Gemini, as I use Gemini on AI Studio with my system prompts to get it to do what I need it to do, which is not possible with gemini.google.com's gems

gishh · 2025-12-02T16:35:16 1764693316

Gamed tests?

rdtsc · 2025-12-02T16:43:11 1764693791

I always joke that Google pays for a dedicated developer to spend their full time just to make pelicans on bicycles look good. They certainly have the cash to do it.

tootyskooty · 2025-12-02T16:52:16 1764694336

Since no one has mentioned it yet: note that the benchmarks for large are for the base model, not for the instruct model available in the API.

Most likely reason is that the instruct model underperforms compared to the open competition (even among non-reasoners like Kimi K2).

esafak · 2025-12-02T16:04:46 1764691486

Well done to France's Mistral team for closing the gap. If the benchmarks are to be believed, this is a viable model, especially at the edge.

nullbio · 2025-12-02T17:01:30 1764694890

Benchmarks are never to be believed, and that has been the case since day 1.

hnuser123456 · 2025-12-02T15:32:01 1764689521

Looks like their own HF link is broken or the collection hasn't been made public yet. The 14B instruct model is here:

https://huggingface.co/mistralai/Ministral-3-14B-Instruct-25...

The unsloth quants are here:

https://huggingface.co/unsloth/Ministral-3-14B-Instruct-2512...

janpio · 2025-12-02T15:43:30 1764690210

Seems fixed now:

https://huggingface.co/collections/mistralai/mistral-large-3

https://huggingface.co/collections/mistralai/ministral-3

andhuman · 2025-12-02T15:32:35 1764689555

This is big. The first really big open weights model that understands images.

yoavm · 2025-12-02T15:47:16 1764690436

How is this different from Llama 3.2 "vision capabilities"?

https://www.llama.com/docs/how-to-guides/vision-capabilities...

Havoc · 2025-12-02T16:01:10 1764691270

Guessing the GP commenter considers Apache more "open" than Meta's license. Which to be fair isn't terrible, but also not quite as clean as straight Apache

mesebrec · 2025-12-02T18:48:39 1764701319

Llama's license explicitly disallows its usage in the EU.

If that doesn't even meet the threshold for "terrible", then what does?

CamperBob2 · 2025-12-02T21:55:54 1764712554

Why does it disallow usage in the EU?

Terretta · 2025-12-05T14:57:20 1764946640

You'd have to ask EU's regulators why they wanted Meta to disallow it.

Much like you'd have to ask UK lawmakers why they wanted UK citizens to be unable to keep their own Apple iCloud backups secure.

trvz · 2025-12-02T17:00:24 1764694824

Sad to see they've apparently fully given up on releasing their models via torrent magnet URLs shared on Twitter; those will stay around long after Hugging Face is dead.

ThrowawayTestr · 2025-12-02T17:54:11 1764698051

How does HF manage to serve such big files?

nikcub · 2025-12-02T19:13:08 1764702788

s3 + cloudfront

https://huggingface.co/blog/rearchitecting-uploads-and-downl...

ThrowawayTestr · 2025-12-02T21:28:42 1764710922

I meant more how do they pay for all that bandwidth. I can download a 20 GB model in like 2 minutes

accrual · 2025-12-03T16:15:11 1764778511

Congrats on the release, Mistral team!

I haven't used Mistral much until today but am impressed. I normally use Gemma 3 27B locally, but after regenerating some responses with Mistral 3 14B, the output quality is very similar despite generating much faster on my hardware.

The vision aspect also worked fine, and actually was slightly better on the same inputs versus qwen3 VL 8B.

All in all impressive small dense model, looking forward to using it more.