There are many lists, but I find all of them outdated, containing wrong information, or missing the actual benchmarks I'm looking for.
I was thinking that maybe it's better to make my own benchmarks with the questions/things I'm interested in, and whenever a new model comes out, run those tests against it via OpenRouter.
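Something like this is all it really takes (a minimal sketch, assuming an OpenRouter API key and its OpenAI-compatible chat completions endpoint; the questions, model id, and grading step are placeholders):

    # benchmark_runner.py - minimal sketch of a personal benchmark harness.
    # Assumes an OpenRouter key in OPENROUTER_API_KEY; questions and model id
    # below are placeholders, and grading/judging happens separately.
    import json
    import os
    import requests

    QUESTIONS = [
        {"id": "date-math", "prompt": "What date is 90 days after 2024-03-01?"},
        {"id": "regex", "prompt": "Write a regex that matches ISO 8601 dates."},
    ]

    def ask(model: str, prompt: str) -> str:
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    def run(model: str) -> None:
        # Store raw answers per model; grade them later (manually or LLM-judged).
        results = {q["id"]: ask(model, q["prompt"]) for q in QUESTIONS}
        with open(f"results_{model.replace('/', '_')}.json", "w") as f:
            json.dump(results, f, indent=2)

    if __name__ == "__main__":
        run("some-provider/new-model")  # rerun with each new model id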
But in the end, isn't this the same idea as MoE?
Where we have more specialized "jobs" that the model is actually trained for.
I think the main difference with an agent swarm is the ability to run them in parallel. I don't see how this adds much compared to simply sending multiple API calls in parallel with your desired tasks. I guess the only difference is that you let the AI decide how to split those requests and what each task should be.
Nope. MoE is strictly about model parameter sparsity. Agents are about running multiple small-scale tasks in parallel and aggregating the results for further processing - it saves a lot of context length compared to having it all in a single session, and context length has quadratic compute overhead so this matters. You can have both.
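Rough numbers on the quadratic part, just to make it concrete (this ignores KV caching, sliding-window attention, and the linear terms, so treat it as an upper-bound sketch):

    # Toy illustration of why splitting context helps when attention cost ~ n^2.
    # Upper-bound sketch only; real stacks have caching and attention variants.
    def attention_cost(tokens: int) -> int:
        return tokens ** 2  # relative units

    single = attention_cost(100_000)    # one agent holding the whole context
    split = 4 * attention_cost(25_000)  # four subagents with a quarter each
    print(single / split)               # -> 4.0, i.e. ~4x less attention compute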
One positive side effect of this is that if subagent tasks can be dispatched to cheaper, more efficient edge-inference hardware that can be deployed at scale (think Nvidia Jetsons, or even Apple Macs or AMD APUs), even though a single node might be highly limited in what it can fit, then complex coding tasks ultimately become a lot cheaper per token than generic chat.
My point was that this is just a different way of creating specialised task solvers, the same as with MoE.
And, as you said, with MoE it's about the model itself, and it's done at the training level, so that's not something we can easily do ourselves.
But with an agent swarm, isn't it simply splitting a task into multiple sub-tasks and sending each one in a different API call? So this could be done with any of the previous models too, only that the user has to manually define those tasks/contexts for each query.
Or is this at a much more granular level than this, which would not be feasible to be done by hand?
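For reference, the manual version I'm describing is basically this (a sketch assuming any OpenAI-compatible endpoint; the base URL, model id, and sub-prompts are placeholders):

    # Manual "swarm": split a task into sub-prompts, fan out in parallel, aggregate.
    # Endpoint, key, and model are placeholders for any OpenAI-compatible API.
    import os
    from concurrent.futures import ThreadPoolExecutor
    import requests

    BASE_URL = "https://openrouter.ai/api/v1/chat/completions"
    MODEL = "some-provider/some-model"

    def call_llm(prompt: str) -> str:
        resp = requests.post(
            BASE_URL,
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={"model": MODEL, "messages": [{"role": "user", "content": prompt}]},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    subtasks = [
        "Summarize module A of this repo: ...",
        "Summarize module B of this repo: ...",
        "List the external dependencies: ...",
    ]

    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        partials = list(pool.map(call_llm, subtasks))

    # The aggregation call only ever sees the short partial results,
    # never the full context of each subtask.
    report = call_llm("Combine these notes into one report:\n\n" + "\n\n".join(partials))
    print(report)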
I was already doing this in n8n, creating different agents with different system prompts for different tasks. I'm not sure automating this (with a swarm) would work well in most of my cases; I don't see how it fully complements Tools or Skills.
MoE has nothing whatsoever to do with specialized task solvers. It always operates per token within a single task; you can think of it, perhaps, as a kind of learned "attention" over model parameters as opposed to context data.
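To make "per token" concrete, here's a toy sketch of top-k expert routing (shapes and numbers are made up; a real MoE layer does this inside every MoE block of the transformer, not as a choice between specialized models):

    # Toy top-k MoE routing: each *token* picks its own experts via a learned gate.
    # Made-up shapes; the point is that routing is per token, inside a layer,
    # not a per-task selection of specialized models.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, top_k = 8, 4, 2
    tokens = rng.normal(size=(5, d_model))          # 5 token embeddings
    W_gate = rng.normal(size=(d_model, n_experts))  # learned router weights

    logits = tokens @ W_gate                          # (5, n_experts) router scores
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # top-2 experts per token
    print(chosen)  # each row: the indices of that token's two chosen experts

    # Each token's output is a weighted sum of only its chosen experts' FFNs;
    # the other experts' parameters are never touched for that token (sparsity).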
An LLM only outputs tokens, so this could be seen as an extension of tool calling, where it has been trained on the knowledge and use cases for "tool-calling" itself as a sub-agent.
Sort of. It’s not necessarily a single call. In the general case it would be spinning up a long-running agent with various kinds of configuration — prompts, but also coding environment and which tools are available to it — like subagents in Claude Code.
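Conceptually, the configuration looks something like this (a made-up spec for illustration only, not Claude Code's actual subagent format):

    # Hypothetical subagent spec, just to show the kind of configuration involved.
    # NOT Claude Code's real format; field names are invented for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class SubagentSpec:
        name: str
        system_prompt: str                                       # specialized instructions
        allowed_tools: list[str] = field(default_factory=list)   # e.g. shell, browser
        working_dir: str = "."                                    # sandboxed coding environment
        max_turns: int = 20                                       # long-running, not one call

    tester = SubagentSpec(
        name="test-runner",
        system_prompt="Run the test suite, summarize failures, propose minimal fixes.",
        allowed_tools=["bash", "read_file", "write_file"],
        working_dir="./repo",
    )
    # The parent agent dispatches this spec, lets it run many steps, and reads back
    # only its final summary, which keeps the parent's own context small.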
It's basically a CLI for controlling a browser. The idea is that an agent like Claude Code would use it to validate something it just did, like a change it made to the UI.
What browser? My question comes from a security angle: adding that skill just provides a line of bash, with no further info. I checked the .md file, but it's just a list of commands using agent-browser.
agent-browser is more lightweight than Playwright MCP. Claude Chrome requires some manual setup, and works better in cases that require your actual browser rather than a headless one.
Every time I've tried to actually use gpt-oss 20b, it's just gotten stuck in weird feedback loops reminiscent of the time when HAL got shut down back in the year 2001. And these are very simple tests, e.g. I try to get it to check today's date from the time tool so it can fetch more recent search results from the arxiv tool.
It actually seems worse. gpt-oss-20b is only 11 GB because it is prequantized in mxfp4. GLM-4.7-Flash is 62 GB; in that sense GLM is closer to, and actually slightly larger than, gpt-oss-120b, which is 59 GB.
Also, according to the gpt-oss model card, 20b scores 60.7 on SWE-Bench Verified (GLM claims they got 34 for that model) and 120b scores 62.7, versus the 59.7 GLM reports for its own model.
This happened a few months ago, but I just found out about it, so I thought it was worth sharing. I was surprised to see this, as I was under the impression the project was doing well and growing (I did assume their revenue was lacking, though, since most people who knew about the platform were self-hosters).
Yes, there was much depression amongst InvokeAI users when that shift happened back in January. Luckily, in the past two months a rather genius contributor has added the superb Z-Image turbo model, so the open source app is now on a tear.
https://www.reddit.com/r/StableDiffusion/comments/1q3ruuo/re...
"Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver"
It is far more than an LLM, and math != "language".
> Aristotle integrates three main components (...)
The second one being backed by a model.
> It is far more than an LLM
It's an LLM with a bunch of tools around it, and a slightly different runtime than ChatGPT. It's "only" that, but people - even here, of all places - keep underestimating just how much power there is in that.
Transformer != LLM. See my edited top-level post. Just because Aristotle uses a transformer doesn't mean it is an LLM, just as Vision Transformers and AlphaFold use transformers but are not LLMs.
LLM = Large Language Model. "Large" refers to the number of parameters (and in practice, depth) of the model, and implicitly to the amount of data used for training, and "language" means human (i.e. written, spoken) language. A Vision Transformer is not an LLM because it is trained on images, and AlphaFold is not an LLM because it is trained on molecular configurations.
Aristotle works heavily with formalized Lean statements and expressions. While you can certainly argue this is a language of sorts, it is not at all the same "language" as the "language" in LLMs. Calling Aristotle an "LLM" just because it has a transformer is more misleading than truthful, because every single other aspect of it is far more clever and involved.
Sigh. If I start with a pre-trained LLM architecture, and then do extensive further training / fine-tuning with different data and loss functions, custom similarity metrics for specialized search, specialized training procedures, and feedback from other automated systems, we end up with far, far more than an LLM. That's the point. Calling something like this an LLM is as deeply misleading as calling AlphaFold an LLM. These tools go far beyond simple LLMs. The special losses and metrics are really that important here, and are why these tools can be so game-changing.
In this context, we're not even talking about "math" (as a broad, abstract concept). We're strictly talking about converting English to Lean. Both are just languages. Lean isn't just something that can be a language. It's a language.
There is no reason or framing where you can say Aristotle isn't a language model.
That's true, and a good fundamental point. But here it's much simpler than that: math is a language the same way code is, and if there's one thing LLMs excel at, it's reading and writing code and translating back and forth between code and natural language.
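As a trivial, concrete example of that translation (nothing like the hard statements Aristotle actually targets), the English sentence "adding zero to any natural number gives the same number back" becomes, in Lean 4:

    -- "For every natural number n, n + 0 = n", stated and proved in Lean 4.
    -- (Trivial by the definition of Nat addition; real targets are vastly harder.)
    theorem add_zero_example (n : Nat) : n + 0 = n := rfl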