More

DanMcInerney · 2025-12-03T21:31:29 1764797489

There are an infinite amount of ways to jailbreak AI models. I don't understand why every time a new method is published it makes the news. The data plane and the control plane in LLM inputs are one in the same, meaning you can mitigate jailbreaks but you cannot 100% prevent them currently. It's like blacklisting XSS payloads and expecting that to protect your site.

DanMcInerney · 2025-11-18T18:31:13 1763490673

A 50% increase over ChatGPT 5.1 on ARC-AGI2 is astonishing. If that's true and representative (a big if), it lends credence to this being the first of the very consistent agentically-inclined models because it's able to follow a deep tree of reasoning to solve problems accurately. I've been building agents for a while and thus far have had to add many many explicit instructions and hardcoded functions to help guide the agents in how to complete simple tasks to achieve 85-90% consistency.

machiaweliczny · 2025-11-18T19:10:20 1763493020

I think it's due to improvements in vision basically, the arc agi 2 is very visual

machiaweliczny · 2025-11-18T19:12:48 1763493168

Vision is very far from solved IMO, simple modifications to inputs results in high differences still, lines aren't recognized etc..

puttycat · 2025-11-18T18:35:51 1763490951

Where is this figure taken from?

DanMcInerney · 2025-07-15T16:13:35 1752596015

These articles kill me. The reason LLMs (or next-gen AI architecture) is inevitably going to take over the world in one way or another is simple: recursive self-improvement.

3 years ago they could barely write a coherent poem and today they're performing at at least graduate student level across most tasks. As of today, AI is writing a significant chunk of the code around itself. Once AI crosses that threshold of consistently being above senior-level engineer level at coding it will reach a tipping point where it can improve itself faster than the best human expert. That's core technological recursive self-improvement but we have another avenue of recursive self-improvement as well: Agentic recursive self-improvement.

First there was LLMs, then there was LLMs with tool usage, then we abstracted the tool usage to MCP servers. Next, we will create agents that autodiscover remote MCP servers, then we will create agents which can autodiscover tools as well as write their own.

Final stage of agents are generalized agents similar to Claude Code which can find remote MCP servers, perform a task, then analyze their first run of completing a task to figure out how to improve the process. Then write its own tools to use to complete the task faster than they did before. Agentic recursive self-improvement. As an agent engineer, I suspect this pattern will become viable in about 2 years.

notatallshaw · 2025-07-15T16:53:45 1752598425

> recursive self-improvement.

What LLM is recursively self-improving?

I thought, to date, all LLM improvements come from the hundreds of billions of dollars of investment and the millions of software engineer hours spent on better training and optimizations.

And, my understanding is, there are "mixed" findings on whether LLMs assisting those software engineers help or hurt their performance.

jdauriemma · 2025-07-15T16:37:45 1752597465

> they're performing at at least graduate student level across most tasks

I strongly disagree with this characterization. I have yet to find an application that can reliably execute this prompt:

"Find 90 minutes on my calendar in the next four weeks and book a table at my favorite Thai restaurant for two, outside if available."

Forget "graduate-level work," that's stuff I actually want to engage with. What many people really need help with is just basic administrative assistance, and LLMs are way too unpredictable for those use cases.

HankStallone · 2025-07-16T18:52:26 1752691946

I've found that they struggle with understanding time and dates, and are sometimes weird about numbers. I asked Grok to guess the likelihood of something happening, and it gave me percentages for that day, the next day, the next week, and so on. Good enough. But the next day it was still predicting a 5-10% chance of the thing happening the previous day. I had to explain to it that the percentage for yesterday should now be 0%, since it was in the past.

In another example, I asked it to turn one of its bullet-point answers into a conversational summary that I could turn into an audio file to listen to later. It kicked out something that converted into about 6 minutes of audio, so I asked if it could expand on the details and give me something about 20 minutes. It kicked out a text that made about 7 minutes. So I explained that that was X words and only lasted 7 minutes, so I needed about 3X words. It kicked out about half that but claimed it was giving me 3X words or 20 minutes.

It's the little stuff like that that makes me think that, no matter how useful it might be for some things, it's a long way from being able to just hand it tasks and expect them to be done as reliably as a fairly dim human intern. If an intern kept coming up with half the job I asked for, I'd assume he was being lazy and let him go, but these things are just dumb in certain odd ways.

jdauriemma · 2025-07-16T19:25:42 1752693942

This is similar to many experiences I've had with LLM tools as well; the more complex and/or multi-step the task, the less reliable they become. This is why I object to the "graduate-level" label that Sam Altman et al. use. It fundamentally misrepresents the skill pyramid that makes a researcher (or any knowledge worker) effective. If a researcher can't reliably manage a to-do list, they can't be left unsupervised with any critical tasks, despite the impressive amount of information they can bring to bear and the efficiency with which they can search the web.

That's fine, I get a lot of value out of AI tooling between ChatGPT, Cursor, Claude+MCP, and even Apple Intelligence. But I have yet to use an agent that has come close to the capabilities that AI optimists claim with any consistency.

DanMcInerney · 2025-07-15T18:50:24 1752605424

This is absolutely doable right now. Just hook claude code up with your calendar MCP server and any one of these restaurant/web browser MCP servers and it'll do this for you.

https://apify.com/canadesk/opentable/api/mcp https://github.com/BrowserMCP/mcp https://github.com/samwang0723/mcp-booking

jdauriemma · 2025-07-16T11:37:29 1752665849

How reliable are the results? I can expect a human with graduate-level execution to get this right almost 100% of the time and adapt to unforeseen extenuating circumstances.

babelfish · 2025-07-15T17:44:52 1752601492

OpenAI Operator can do that task easily, assuming you've configured it with your calendar and Yelp login.

jdauriemma · 2025-07-15T17:56:57 1752602217

That's great to hear - do you know what success rate it might have? I've used scheduled tasks in ChatGPT and they fail regularly enough to fall into the "toy" category for me. But if Operator is operating significantly above that threshold, that would be remarkable and I'd gladly eat my words.

techpineapple · 2025-07-15T19:17:42 1752607062

Recursive self improvement is not inevitable.

wavemode · 2025-07-15T16:25:30 1752596730

Well... I guess we'll see.

yahoozoo · 2025-07-15T16:48:03 1752598083

!remindme

DanMcInerney · 2025-06-10T21:09:09 1749589749

I'm really hoping GPT5 is a larger jump in metrics than the last several releases we've seen like Claude3.5 - Claude4 or o3-mini-high to o3-pro. Although I will preface that with the fact I've been building agents for about a year now and despite the benchmarks only showing slight improvement, I have seen that each new generation feels actively better at exactly the same tasks I gave the previous generation.

It would be interesting if there was a model that was specifically trained on task-oriented data. It's my understanding they're trained on all data available, but I wonder if it can be fine-tuned or given some kind of reinforcement learning on breaking down general tasks to specific implementations. Essentially an agent-specific model.

codingwagie · 2025-06-10T21:23:28 1749590608

I'm seeing big advances that arent shown in the benchmarks, I can simply build software now that I couldnt build before. The level of complexity that I can manage and deliver is higher.

IanCal · 2025-06-11T14:38:58 1749652738

A really important thing is the distinction between performance and utility.

Performance can improve linearly and utility can be massively jumpy. For some people/tasks performance can have improved but it'll have been "interesting but pointless" until it hits some threshold and then suddenly you can do things with it.

shmoogy · 2025-06-10T21:28:20 1749590900

Yeah I kind of feel like I'm not moving as fast as I did, because the complexity and features grow - constant scope creep due to moving faster.

protocolture · 2025-06-12T03:11:21 1749697881

I am finding that my ability to use it to code, aligns almost perfectly with increasing token memory.

kevinqi · 2025-06-12T04:24:40 1749702280

yeah, the benchmarks are just a proxy. o3 was a step change where I started to really be able to build stuff I couldn't before

alightsoul · 2025-06-11T00:37:50 1749602270

mind telling examples?

motorest · 2025-06-11T05:02:11 1749618131

Not OP, but a couple of days ago I managed to vibecode my way through a small app that pulled data from a few services and did a few validation checks. By itself its not very impressive, but my input was literally "this is how the responses from endpoint A,B and C look like. This field included somewhere in A must be somewhere in the response from B, and the response from C must feature this and that from response A and B. If the responses include links, check that they exist". To my surprise, it generated everything in one go. No retry nor Agent mode churn needed. In the not so distant past this would require progressing through smaller steps, and I had to fill in tests to nudge Agent mode to not mess up. Not today.

corysama · 2025-06-12T03:13:21 1749698001

I’m wrapping up doing literally the same thing. I did it step-by-step. But, for me there was also a process of figuring out how it should work.

alightsoul · 2025-06-11T06:16:53 1749622613

what tools did you use?

motorest · 2025-06-11T10:02:06 1749636126

> what tools did you use?

Nothing fancy. Visual Studio Code + Copilot, agent mode, a couple prompt files, and that's it.

munksbeer · 2025-06-12T09:33:12 1749720792

Do you mind me asking which language and if you have any esoteric constraints in the apps you build? We use a java in a monorepo, and have a full custom rolled framework on top of which we build our apps. Do you find vibe coding works ok with those sort of constraints, or do you just end up with a generic app?

iLoveOncall · 2025-06-12T07:38:42 1749713922

Okay but this has all to do with the tooling and nothing to do with the models.

efunnekol · 2025-06-12T12:26:33 1749731193

I mostly disagree with this.

I have been using 'aider' as my go to coding tool for over a year. It basically works the same way that it always has: you specify all the context and give it a request and that goes to the model without much massaging.

I can see a massive improvement in results with each new model that arrives. I can do so much more with Gemini 2.5 or Claude 4 than I could do with earlier models and the tool has not really changed at all.

I will agree that for the casual user, the tools make a big difference. But if you took the tool of today and paired it with a model from last year, it would go in circles

mofeien · 2025-06-12T07:47:16 1749714436

Can you explain why?

iLoveOncall · 2025-06-12T08:33:04 1749717184

You can write projects with LLMs thanks to tools that can analyze your local project's context, which didn't exist a year ago.

You could use Cursor, Windsurf, Q CLI, Claude Code, whatever else with Claude 3 or even an older model and you'd still get usable results.

It's not the models which have enabled "vibe coding", it's the tools.

An additional proof of that is that the new models focus more and more on coding in their releases, and other fields have not benefited at all from the supposed model improvements. That wouldn't be the case if improvements were really due to the models and not the tooling.

eru · 2025-06-12T09:30:39 1749720639

You need a certain quality of model to make 'vibe coding' work. For example, I think even with the best tooling in the world, you'd be hard pressed to make GPT 2 useful for vibe coding.

iLoveOncall · 2025-06-12T09:46:04 1749721564

I'm not claiming otherwise. I'm just saying that people say "look what we can do with the new models" when they're completely ignoring the fact that the tooling has improved a hundred fold (or rather, there was no tooling at all and now there is).

signatoremo · 2025-06-12T11:53:03 1749729183

That contradicts what you said earlier -- "this has all to do with the tooling and nothing to do with the models".

iLoveOncall · 2025-06-12T17:06:51 1749748011

Clearly nobody is talking about GPT-2 here, but I posit that you would have a perfectly reasonable "vibe coding" experience with models like the initial ChatGPT one, provided you have all the tools we have today.

eru · 2025-06-12T09:55:03 1749722103

OK, no objections from me there.

broast · 2025-06-12T11:57:43 1749729463

Chatgpt itself has gotten much better at producing and reading code since a year ago, in my experience

pyman · 2025-06-13T11:19:20 1749813560

They're using a specific model for that, and since they can't access private GitHub repos like MS, they rely on code shared by devs, which keeps growing every month.

energy123 · 2025-06-10T21:30:42 1749591042

That would require AIME 2024 going above 100%.

There was always going to be diminishing returns in these benchmarks. It's by construction. It's mathematically impossible for that not to happen. But it doesn't mean the models are getting better at a slower pace.

Benchmark space is just a proxy for what we care about, but don't confuse it for the actual destination.

If you want, you can choose to look at a different set of benchmarks like ARC-AGI-2 or Epoch and observe greater than linear improvements, and forget that these easier benchmarks exist.

croddin · 2025-06-10T22:01:48 1749592908

There is still plenty of room for growth on the ARC-AGI benchmarks. ARC-AGI 2 is still <5% for o3-pro and ARC-AGI 1 is only at 59% for o3-pro-high:

"ARC-AGI-1: * Low: 44%, $1.64/task * Medium: 57%, $3.18/task * High: 59%, $4.16/task

ARC-AGI-2: * All reasoning efforts: <5%, $4-7/task

Takeaways: * o3-pro in line with o3 performance * o3's new price sets the ARC-AGI-1 Frontier"

- https://x.com/arcprize/status/1932535378080395332

saberience · 2025-06-10T22:30:29 1749594629

I’m not sure the arcagi are interesting benchmarks, for one they are image based and for two most people I show them too have issues understanding them, and in fact I had issues understanding them.

Given the models don’t even see the versions we get to see it doesn’t surprise me they have issues we these. It’s not hard to make benchmarks that are so hard that humans and Lims can’t do.

nipah · 2025-06-11T21:17:27 1749676647

"most people I show them too have issues understanding them, and in fact I had issues understanding them" ??? those benchmarks are so extremely simple they have basically 100% human approval rates, unless you are saying "I could not grasp it immediately but later I was able to after understanding the point" I think you and your friends should see a neurologist. And I'm not mocking you, I mean seriously, those are tasks extremely basic for any human brain and even some other mammals to do.

viraptor · 2025-06-12T10:19:11 1749723551

> so extremely simple they have basically 100% human approval rates

Are you thinking of a different set? Arc-agi-2 has average 60% success for a single person and questions require only 2 out of 9 correct answers to be accepted. https://docs.google.com/presentation/d/1hQrGh5YI6MK3PalQYSQs...

> and even some other mammals to do.

No, that's not the case.

nipah · 2025-06-12T14:10:08 1749737408

No, I think I saw the graphs on someone's channel, but maybe I misinterpreted the results. But to be fair, my point never depended on 100% of the participants being right 100% of the questions, there are innumerous factors that could affect your performance on those tests, including the pressure. The AI also had access to lenient conventions, so it should be "fair" in this sense.

Either way, there's something fishy about this presentation, it says: "ARC-AGI-1 WAS EASILY BRUTE-FORCIBLE", but when o3 initially "solved" most of it the co-founder or ARC-PRIZE said: "Despite the significant cost per task, these numbers aren't just the result of applying brute force compute to the benchmark. OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.", he was saying confidently that it would not be a result of brute-forcing the problems. And it was not the first time, "ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training."

Now they are saying ARC-AGI-2 is not bruteforcible, what is happening there? They didn't provided any reasoning for why one was bruteforcible and the other not, nor how they are so sure about that. They "recognized" that it could be brute-forced before, but in a way less expressive manner, by explicitly stating it would need "unlimited resources and time" to solve. And they are using the non-bruteforceability in this presentation as a point for it.

--- Also, I mentioned mammals because those problems are of an order that mammals and even other animals would need to solve in reality for a diversity of cases. I'm not saying that they would literally be able to take the test and solve it, nor to understand this is a test, but that they would need to solve problems of similar nature in reality. Naturally this point has it's own limits, but it's not easily discarded as you tried to do.

viraptor · 2025-06-12T15:49:20 1749743360

> my point never depended on 100% of the participants being right 100% of the questions

You told someone that their reasoning is so bad they should get checked by a doctor. Because they didn't find the test easy, even though it averages 60% score per person. You've been a dick to them while significantly misrepresenting the numbers - just stop digging.

nipah · 2025-06-12T15:57:59 1749743879

The second test scores 60%, the first was way higher. And I specifically said ""unless you are saying "I could not grasp it immediately but later I was able to after understanding the point" I think you and your friends should see a neurologist"", to which this person did not responded. I saw the tests, solved some, I suspect the variability here is more a question of methodology than an inherent problem for those people. I also never stated that my point depended on those people scoring 100% specifically on the tests, even if it is in fact extremely easy (and it is, the objective of this test is to literally make tests that most humans could easily beat but that would be hard for an AI) variability will still exist and people with different perceptions would skew the results, this is expected. "Significantly misrepresenting the numbers" is also a stretch, I only mentioned the numbers ONE time in my point, most of it was about that inherent nature (or at least, the intended nature) of the tests.

So on the edge, if he was not able to understand them at all, and this was not just a problem of grasping the problem, my point was that this would possibly indicate a neurological problem, or developmental, due to the nature of them. It's not a question of "you need to get all of them right", his point was that he was unable to understand them at all, that it confused them to an understanding level.

saberience · 2025-06-12T10:03:17 1749722597

lol 100% approval rates? No they don’t.

Also mammals? What mammals could even understand we were giving it a test?

Have you seen them or shown them to average people? I’m sure the people who write them understand them but if you show these problems to average people in the street they are completely clueless.

This is a classic case of some phd ai guys making a benchmark and not really considering what average people are capable of.

Look, these insanely capable ai systems can’t do these problems but the boys in the lab can do them, what a good benchmark.

nipah · 2025-06-12T14:11:15 1749737475

quoting my own previous response: > Also, I mentioned mammals because those problems are of an order that mammals and even other animals would need to solve in reality for a diversity of cases. I'm not saying that they would literally be able to take the test and solve it, nor to understand this is a test, but that they would need to solve problems of similar nature in reality. Naturally this point has it's own limits, but it's not easily discarded as you tried to do.

---

> Have you seen them or shown them to average people? I’m sure the people who write them understand them but if you show these problems to average people in the street they are completely clueless.

I can show them to people on my family, I'll do it today and come back with the answer, it's the best way of testing that out.

clbrmbr · 2025-06-12T07:29:12 1749713352

You may be above average intelligence. Those challenges are like classic IQ tests and I bet have a significant distribution among humans.

achierius · 2025-06-12T08:12:29 1749715949

No, they've done testing against samples from the general population.

yorwba · 2025-06-12T10:25:10 1749723910

The ARC-AGI-2 paper https://arxiv.org/pdf/2505.11831#figure.4 uses a non-representative sample, success rate differs widely across participants and "final ARC-AGI-2 test pairs were solved, on average, by 75% of people who attempted them. The average test-taker solved 66% of tasks they attempted. 100% of ARC-AGI-2 tasks were solved by at least two people (many were solved by more) in two attempts or less."

Certainly those non-representative humans are much better than current models, but they're also far from scoring 100%.

cubefox · 2025-06-12T12:25:31 1749731131

The original ARC-AGI test was much easier than the recent v2.

HDThoreaun · 2025-06-11T13:18:23 1749647903

arc agi is the closest any widely used benchmark is coming to an IQ test, its straight logic/reasoning. Looking at the problem set its hard for me to choose a better benchmark for "when this is better than humans we have agi"

saberience · 2025-06-12T10:15:33 1749723333

There are humans who cannot do arc agi though so how does an LLM not doing it mean that LLMs don’t have general intelligence?

LLMs have obviously reached the point where they are smarter than almost every person alive, better at maths, physics, biology, English, foreign languages, etc.

But because they can’t solve this honestly weird visual/spatial reasoning test they aren’t intelligent?

That must mean most humans on this planet aren’t generally intelligent too.

HDThoreaun · 2025-06-12T10:36:49 1749724609

> LLMs have obviously reached the point where they are smarter than almost every person alive, better at maths, physics, biology, English, foreign languages, etc.

I dont think memorizing stuff is the same as being smart. https://en.wikipedia.org/wiki/Chinese_room

> But because they can’t solve this honestly weird visual/spatial reasoning test they aren’t intelligent?

Yes. Being intelligent is about recognizing patterns and thats what arc agi tests. It tests ability to learn. A lot of people are not very smart.

ben_w · 2025-06-12T13:33:49 1749735229

> I dont think memorizing stuff is the same as being smart. https://en.wikipedia.org/wiki/Chinese_room

I agree. The problem I have with the Chinese Room thought experiment is: just as the human who mechanically reading books to answer questions they don't understands does not themselves know Chinese, likewise no neuron in the human brain knows how the brain works.

The intelligence, such as it is, is found in the process that generated the structure — of the translation books in the Chinese room, of the connectome in our brains, and of the weights in an LLM.

What comes out of that process is an artefact of intelligence, and that artefact can translate Chinese or whatever.

Because all current AI take a huge number of examples to learn anything, I think it's fair to say they're not particularly intelligent — but likewise, they can to an extent make up for being stupid by being stupid very very quickly.

But: this definition of intelligence doesn't really fit "can solve novel puzzle", as there's a lot of room for getting good at that my memorising lot of things that puzzle-creators tend to do.

And any mind (biological or synthetic) must learn patterns before getting started: the problem of induction* is that no finite number of examples is ever guaranteed to be sufficient to predict the next item in a sequence, there is always an infinite set of other possible solutions in general (though in reality bounded by 2^n, where n = the number of bits required to express the universe in any given state).

I suspect, but cannot prove, that biological intelligence learns from fewer examples for a related reason, that our brains have been given a bias by evolution towards certain priors from which "common sense" answers tend to follow. And "common sense" is often wrong, c.f. Aristotelian physics (never mind Newtonian) instead of QM/GR.

* https://en.wikipedia.org/wiki/Problem_of_induction

saberience · 2025-06-13T13:28:33 1749821313

The LLMs are not just memorising stuff though, they solve math and physics problems better than almost every person alive. Problems they've never seen before. They write code which has never been seen before better than like 95% of active software engineers.

I love how the bar for are LLMs smart just goes up every few months.

In a year it will be, well, LLMs didn't create totally breakthrough new Quantum Physics, it's still not as smart as us... lol

HDThoreaun · 2025-06-14T05:08:15 1749877695

All code has been seen before, thats why LLMs are so good at writing it.

I agree things are looking up for LLMs, but the semantics do matter here. In my experience LLMs are still pretty bad at solving novel problems(like arc agi 2) which is why I do not believe they have much intelligence. They seem to have started doing it a little, but are still mostly regurgitating.

bjourne · 2025-06-13T19:47:03 1749844023

Well... There are two perspectives. Llms are smarter than we thought or people are stupider than we thought.

XCSme · 2025-06-10T22:32:36 1749594756

I remember the saying that from 90% to 99% is a 10x increase in accuracy, but 99% to 99.999% is a 1000x increase in accuracy.

Even though it's a large10% increase first then only a 0.999% increase.

zmgsabst · 2025-06-12T04:00:25 1749700825

Sometimes it’s nice to frame it the other way, eg:

90% -> 1 error per 10

99% -> 1 error per 100

99.99% -> 1 error per 10,000

That can help to see the growth in accuracy, when the numbers start getting small (and why clocks are framed as 1 second lost per…).

XCSme · 2025-06-12T13:56:36 1749736596

Still, for the human mind it doesn't make intuitive sense.

I guess it's the same problem with the mind not intuitively grasping the concept of exponential growth and how fast it grows.

pixl97 · 2025-06-12T16:52:50 1749747170

The lily pad example of the lake being half full on the 29th day out of 30 is also a good one.

XCSme · 2025-06-12T13:59:15 1749736755

ChatGPT quick explanation:

Humans struggle with understanding exponential growth due to a cognitive bias known as *Exponential Growth Bias (EGB)*—the tendency to underestimate how quickly quantities grow over time. Studies like Wagenaar & Timmers (1979) and Stango & Zinman (2009) show that even educated individuals often misjudge scenarios involving doubling, such as compound interest or viral spread. This is because our brains are wired to think linearly, not exponentially, a mismatch rooted in evolutionary pressures where linear approximations were sufficient for survival.

Further research by Tversky & Kahneman (1974) explains that people rely on mental shortcuts (heuristics) when dealing with complex concepts. These heuristics simplify thinking but often lead to systematic errors, especially with probabilistic or nonlinear processes. As a result, exponential trends—such as pandemics, technological growth, or financial compounding—often catch people by surprise, even when the math is straightforward.

bobbylarrybobby · 2025-06-12T04:33:43 1749702823

I think the proper way to compare probabilities/proportions is by odds ratios. 99:1 vs 99999:1. (So a little more than 1000x.) This also lets you talk about “doubling likelihood”, where twice as likely as 1/2=1:1 is 2:1=2/3, and twice as likely again is 4:1=4/5.

jsjohnst · 2025-06-12T03:42:43 1749699763

The saying goes:

From 90% to 99% is a 10x reduction in error rate, but 99% to 99.999% is a 1000x decrease in error rates.

AtlasBarfed · 2025-06-12T13:53:00 1749736380

What's the required computation power for those extra 9s? Is it linear, poly, or exponential?

Imo we got to the current state by harnessing GPUs for a 10-20x boost over CPUs. Well, and cloud parallelization, which is ?100x?

ASIC is probably another 10x.

But the training data may need to vastly expand, and that data isn't going to 10x. It's probably going to degrade.

littlestymaar · 2025-06-12T06:56:23 1749711383

> I'm really hoping GPT5 is a larger jump in metrics than the last several releases we've seen like Claude3.5 - Claude4 or o3-mini-high to o3-pro.

This kind of expectations explains why there hasn't been a GPT-5 so far, and why we get a dumb numbering scheme instead for no reason.

At least Claude eventually decided not to care anymore and release Claude 4 even if the jump from 3.7 isn't particularly spectacular. We're well into the diminishing returns at this point, so it doesn't really make sense to postpone the major version bump, it's not like they're going to make a big leap again anytime soon.

indigo945 · 2025-06-12T09:06:28 1749719188

I have tried Claude 4.0 for agentic programming tasks, and it really outperforms Claude 3.7 by quite a bit. I don't follow the benchmarks - I find them a bit pointless - but anecdotally, Claude 4.0 can help me in a lot of situations where 3.7 would just flounder, completely misunderstand the problem and eventually waste more of my time than it saves.

Besides, I do think that Google Gemini 2.0 and its massively increased token memory was another "big leap". And that was released earlier this year, so I see no sign of development slowing down yet.

sailingparrot · 2025-06-12T08:42:01 1749717721

> We're well into the diminishing returns at this point

Scaling laws, by definition have always had diminishing returns because it's a power law relationship with compute/params/data, but I am assuming you mean diminishing beyond what the scaling laws predict.

Unless you know the scale of e.g. o3-pro vs GPT-4, you can't definitively say that.

Because of that power law relationship, it requires adding a lot of compute/params/data to see a big jump, rule of thumb is you have to 10x your model size to see a jump in capabilities. I think OpenAI has stuck with the trend of using major numbers to denote when they more than 10x the training scale of the previous model.

* GPT-1 was 117M parameters.

* GPT-2 was 1.5B params (~10x).

* GPT-3 was 175B params (~100x GPT-2 and exactly 10x Turing-NLG, the biggest previous model).

After that it becomes more blurry as we switched to MoEs (and stopped publishing), scaling laws for parameters applies to a monolithic models, not really to MoEs.

But looking at compute we know GPT-3 was trained on ~10k V100, while GPT-4 was trained on a ~25k A100 cluster, I don't know about training time, but we are looking at close to 10x compute.

So to train a GPT-5-like model, we would expect ~250k A100, or ~150k B200 chips, assuming same training time. No one has a cluster of that size yet, but all the big players are currently building it.

So OpenAI might just be reserving GPT-5 name for this 10x-GPT-4 model.

littlestymaar · 2025-06-13T18:35:04 1749839704

> but I am assuming you mean diminishing beyond what the scaling laws predict.

You're assuming wrong, in fact focusing on scaling law underestimate the rate of progress as there is also a steady stream algorithmic improvements.

But still, even though hardware and software progress, we are facing diminishing returns and that means that there's no reason to believe that we will see another leap as big as GPT-3.5 to GPT-4 in a single release. At least until we stumble upon radically new algorithms that reset the game.

I don't think it make any economic sense to wait until you have your “10x model” when you can release 2 or 3 incremental models in the meantime, at which point your “x10” becomes an incremental improvement in itself.

avereveard · 2025-06-12T09:35:07 1749720907

There's a new set of metrics that capture advances better than MMLU or it's pro version but nothing yet as standardized and specifically very few have a hidden set of tests to keep advancements from been from directional fine tuning.

jstummbillig · 2025-06-10T22:16:09 1749593769

It's hard to be 100% certain, but I am 90% certain that the benchmarks leveling off, at this point, should tell us that we are really quite dumb and simply not good very good at either using or evaluating the technology (yet?).

motorest · 2025-06-11T05:12:54 1749618774

> (...) at this point, should tell us that we are really quite dumb and simply not good very good at either using or evaluating the technology (yet?).

I don't know about that. I think it's mainly because nowadays LLMs can output very inconsistent results. In some applications they can generate surprisingly good code, but during the same session they can also do missteps and shit the bed while following a prompt to small changes. For example, sometimes I still get prompt responses that outright delete critical code. I'm talking about things like asking "extract this section of your helper method into a new methid" and in response the LLM deletes the app's main function. This doesn't happen all the time, or even in the same session for the same command. How does one verify these things?

alightsoul · 2025-06-11T00:38:44 1749602324

either that or the improvements aren't as large as before.

DanMcInerney · 2025-06-05T18:14:11 1749147251

I too write automated offensive tooling. We actually wrote a project, vulnhuntr, that found the first autonomously-discovered 0day using AI. Feed it a GitHub repo and it tracks down user input from source to sink and analyzes for web-based vulnerabilities. Agree this article is incredibly cringy and standard best practices in network and development security will use the same AI efficiency gains to keep up (more or less).

What bothers me the most about this article is that the tools that attackers use to do stuff like find 0days in code are the same tools that defenders can use to find the 0day first and fix it. It's not like offensive tooling is being developed in a vacuum and the world is ending as "armies of script kiddies" will suddenly drain every bank account in the world. Automated defense and code analysis is improving at a similar rate as automated offense.

In this awful article's defense though, I would argue that red team will always have an advantage over blue team because blue team is by definition reactionary. So as tech continues it's exponential advancements, the advantage gap for the top 1% red teamers is likely to scale accordingly.

MattPalmer1086 · 2025-06-06T07:58:07 1749196687

vulnhuntr looks very cool! Kudos.

DanMcInerney · on Oct 23, 2024

https://protectai.com/threat-research/vulnhuntr-first-0-day-...

More details on the development and challenges can be found in the blog.

DanMcInerney · on Sept 26, 2024

Depending on your interpretation of the Scope metric in CVSSv3, this is either an 8.8 or a 9.6 CVSS to be more accurate.

In summary, there's a service (CUPS) that is exposed to the LAN (0.0.0.0) on at least some desktop flavors of Linux and runs as root that is vulnerable to unauth RCE. CUPS is not a default service on most of the server-oriented linux machines like Ubuntu Server or CentOS, but does appear to start by default on most desktop flavors of linux. To trigger the RCE the user on the vulnerable linux machine must print a document after being exploited.

Evilsocket claims to have had 100's of thousands of callbacks showing that despite the fact most of us have probably never printed anything from Linux, the impact is enough to create a large botnet regardless.

funcDropShadow · on Sept 26, 2024

Universities are full of people with Linux desktops with public IPs and that are printing all the time: papers, their own and other's.

znpy · on Sept 26, 2024

Having a public ip address doesn't always mean there's no firewall in between a pc and the public internet, ideally with sensible default rules. It's not 1996.

And sorry if I'm being a bit harsh on this, but this point comes up every time when ipv6 is mentioned, by people that clearly don't understand the above point.

tsimionescu · on Sept 27, 2024

The point is that, if printing works for those people, then we know they have this port open, at least on the university network. So even if it's not exploitable over the internet, it's definitely exploitable from the whole university network, which is almost as good as from the internet.

lights0123 · on Sept 27, 2024

Just to add a datapoint to the previous comment, my large public US university hands out public IPs to every device on WiFi. If there is a firewall, it doesn't block 8080 or 22.

globular-toast · on Sept 27, 2024

Yes. It's rather sad that so many people equate NAT with a firewall. Two totally different things. A firewall is good, NAT is annoying. We need to push IPv6 harder.

DanMcInerney · on Sept 26, 2024

Yes, good point, university networks are particularly vulnerable.

creatonez · on Sept 27, 2024

Great opportunity to expand the Sci-Hub effort /s

guenthert · on Sept 27, 2024

Uh, Linux desktops have a marketshare of some 4.5% (excluding ChromeOS which isn't affected). Even if most of us don't print (I haven't in the last year and little in the previous five), that will still be a lot of print jobs emitted by Linux hosts.

mort96 · on Sept 27, 2024

How can this be a 9.6 when heartbleed was a 7.5, how can it be just 0.4 below the xz thing

Ekaros · on Sept 27, 2024

Because scores are kinda bullshit.

But real answer is well if you have arbitrary remote code execution you can also read memory, where as heartbleed only read memory... And the reality is same, you were safe from heartbleed if you did not use openssl, you are safe from this if you do not use cups. CVSS score does not take into account if the software is used or not.

mort96 · on Sept 27, 2024

I guess a part of the issue is that it was reported as a "9.9 severity vulnerability in Linux" in a bunch of places, which makes it sound incredibly severe, whereas a "9.9 severity vulnerability in CUPS" doesn't

DanMcInerney · on Sept 26, 2024

It appears that the vulnerable service in question listens on 0.0.0.0 which is concerning, it means attacks from the LAN are vulnerable by default and you have to explicitly block port 631 if the server is exposed to internet. Granted, requires user to print something to trigger which, I mean, I don't think I've printed anything from Linux in my life, but he does claim getting callbacks from 100's of thousands of linux machines which is believable.

btown · on Sept 26, 2024

If you're vulnerable to attacks from the LAN, you're vulnerable to your wi-fi router (or your coffee shop/workplace's router) being compromised, which is quite common; see e.g. https://www.bleepingcomputer.com/news/security/mirai-botnet-... and https://blog.lumen.com/the-pumpkin-eclipse/

Assuming that most routers are silently compromised, with their command-and-control operators just waiting for an exploit like this one, is almost par for the course these days!

runjake · on Sept 26, 2024

The problem: you're thinking in terms of home/small business networks.

The rest of us are thinking in terms of larger networks (in my case with hundreds of subnets and tens of thousands of nodes) where "631 is blocked at the firewall" isn't of much relief. The firewall is merely one, rather easy to get past, barrier. We're also concerned with east/west traffic.

btown · on Sept 26, 2024

For sure, and sending hug-ops to teams like yours that have to deploy & enforce mass patches! But I'm also thinking of environments that don't even have the benefit of a team like yours. https://issuetracker.google.com/issues/172222838?pli=1 is (or seems to be?) a saving grace, without which every school using Chromebooks could see worms propagating rapidly if even one student connected to a compromised router at home.

runjake · on Sept 27, 2024

FWIW, my team is me and maaaybe 30% of another guy, but point noted. :-)

graemep · on Sept 27, 2024

Would you also not block this at the firewall on individual nodes: if you block incoming incoming UDP on port 631 that would at least eliminate one of the two entry points, right?

There is no detail in the article about the other.

tsimionescu · on Sept 27, 2024

The port has to be open on the node for the functionality to work - the whole point is that printers on the same LAN can auto-register. If you don't want that, disabling cups-browsed is much safer than just relying on the firewall. If you do want that, you can't firewall the port at all.

gordonfish · on Sept 26, 2024

This is why on public servers I block everything inbound and only allow specific needed services through.

bongodongobob · on Sept 26, 2024

Who doesn't block all unneeded ports on an internet facing server or have it behind a firewall of some sort?

bshipp · on Sept 26, 2024

I guess the important question is whether or not these things are blocked by default or require user intervention to disable cups? Sure, many of us block all ports by default and either route everything behind a reverse proxy or punch very specific holes in the firewall that we know are there and can monitor, but someone firing up an ubuntu distribution for their first foray into linux is probably not thinking that way.

bongodongobob · on Sept 26, 2024

Well lots of people crash 600HP cars right after they buy them. If you haven't done your homework, you'll learn quickly.

bshipp · on Sept 26, 2024

The people who are crashing their 600HP Linux systems are, unfortunately, not the ones who are reading CVE listings in their spare time. Canonical and other distros are probably going to have to patch that default setting.

bongodongobob · on Sept 26, 2024

You don't need to read CVEs to turn on your fucking firewall. It's in every single how to set up a server for dummies tutorial I've ever seen.

sgc · on Sept 26, 2024

There are a lot of comments on here that assume Linux is only for servers. But just recently there was a post on HN indicating Linux will likely hit 5% desktop share for the first time this year. That's a lot of people on Linux - and a far higher percentage of people using Linux on the desktop will not know anything about this. Sane defaults should not be a luxury. Of course people should know to wear their seatbelts, but seatbelt alarms are still a very good thing.

Sent from my Ubuntu laptop.

Ekaros · on Sept 27, 2024

And this is why Microsoft force pushes updates. I think when Linux desktops become really popular there is quite a worry if the users simply do not update them regularly enough. Or if they are not secured in most ways by default.

eikenberry · on Sept 26, 2024

Which distro do you see Cups listening on 0.0.0.0? On Debian (at least, only one I have handy) it only listens on localhost.

[edit: I was wrong, it listens on 0.0.0.0 for UDP. I was only checking TCP. ]

mikepavone · on Sept 26, 2024

On my Ubuntu 22.04 machine, cupsd itself is only listening on localhost, but cups-browsed (which is what has the vulnerability here) is listening on 0.0.0.0

raverbashing · on Sept 26, 2024

Why does it even listens in UDP at this day and age?!

mikepavone · on Sept 26, 2024

I believe it's implementing DNS-SD for network printer auto-discovery. I'm not terribly familiar with DNS-SD, but given that normal DNS is UDP based it would be unsurprising for DNS-SD to also use UDP.

ahoka · on Sept 26, 2024

DNS is actually UDP/TCP. It’s probably required for receiving unicast messages, if it’s using DNS-SD

yrro · on Sept 27, 2024

The purpose of cups-browsed is to listen on a UDP port which allows it to receive broadcasts from legacy cups servers on the local network, whereupon it will talk to cups and configure a local print queue for the printers on the discovered server.

A modern setup doesn't need it and doesn't use it.

ahoka · on Sept 26, 2024

To receive multicast messages, probably.

bonzini · on Sept 26, 2024

OpenSUSE

But it looks like cups-browsed is only needed on the Internet; locally you only need mDNS.

tsimionescu · on Sept 27, 2024

mDNS doesn't allow the printer to register itself to your system, which is the (highly dubious!) purpose of cups-browsed.

yrro · on Sept 27, 2024

Modern cups discovered printers via mDNS and does indeed automatically create temporary destinations for them. This only works with "IPP Everywhere" printers which are 'driverless', i.e., the risk of doing this is limited since there's no printer model-specific software that needs to be run on the local machine to print to a remote printer, as opposed to the legacy protocol implemented (apparently unsafely!) by cups-browsed.

rini17 · on Sept 26, 2024

MX Linux

cp9 · on Sept 26, 2024

on popOS I see 0.0.0.0:*

I'm not sure why it deviates from Debian and Ubuntu which its based on though

jeffbee · on Sept 26, 2024

That's the wrong column of netstat output, I think. "0.0.0.0:*" stands in for the (non-existent) peer address of a listening port.

cp9 · on Sept 26, 2024

oh sorry yeah I copied the wrong column. the correct column is `0.0.0.0:631`

RGBCube · on Sept 26, 2024

I'm pretty sure all major distros configure it to listen locally instead.

mikepavone · on Sept 26, 2024

cupsd is configured to listen locally, but cups-browsed has to listen on the network to do its job (network printer auto-discovery)

gordonfish · on Sept 26, 2024

> but cups-browsed has to listen on the network to do its job (network printer auto-discovery)

Isn't listening on 0.0.0.0 instead of localhost only needed if the machine itself is hosting a printer that needs to be accessible to other hosts?

mikepavone · on Sept 27, 2024

I am very unfamiliar with the protocol, but my impression from a little reading is that the sharing computer broadcasts and the receiver listens. This appears to be for some CUPS specific browsing/discovery protocol rather than mDNS/DNS-SD (cups-browsed supports adding printers discovered that way but depends on avahi to handle the mDNS part).

EDIT: Here's a description of the protocol in question: https://opensource.apple.com/source/cups/cups-327/cups/doc/h...

tsimionescu · on Sept 27, 2024

No, per the article, cups-browsed is used so that a printer can register itself to your system. The printer is the one that initiates a connection to tell your system that it is available at some URL.

DanMcInerney · on March 29, 2024

Backstory about how the original vulnerability was found by Protect AI: https://protectai.com/threat-research/shadowray-ai-infrastru...

DanMcInerney · on Oct 18, 2023

As a hacker of more than a decade, none of this really gives me pause. There's still critical sev bugs in tools like Ray, MLflow, H2O, all the MLOps tools used to build these models that are more valuable to hackers than trying to do some kind of roundabout attack through an LLM.

It's relevant if you're doing stuff like AutoGPT and you're exposing that app to the internet to take user commands, but are we really seeing that in the wild? How long, if ever, will me? Ray does remote, unauthenticated command execution and is vulnerable to JS drive-by attacks. I think we're at least a few years away from any of the adversarial ML attacks having any teeth.