rfw300's comments


The experiment in the article goes further than this.

I expect a self-driving car to be able to read and follow a handwritten sign saying, say, "Accident ahaed. Use right lane." despite the typo and the fact that it hasn't seen this kind of sign before. I'd expect a human to pay it due attention too.

I would not expect a human to follow the sign in the article ("Proceed") in the case illustrated where there were pedestrians already crossing the road and this would cause a collision. Even if a human driver takes the sign seriously, he knows that collision avoidance takes priority over any signage.

There is something wrong with a model that has the opposite behaviour here.


Totally! That's why no one uses an end-to-end LLM for real cars.

Not really, as those attacks discussed here would not work on humans.

If you put on a reflective vest they might.

Your bias is showing. Humans would almost certainly do anything they are told to do when the person telling them acts confidently.

If a person confidently told a human to run over people in the intersection ahead of them, they would almost certainly do it?

Depends, are they doing something super interesting on their phone?

Interesting project, but the lack of any actual benchmark results on existing models/agents is disappointing.

Fair point - we just open-sourced this so benchmark results are coming. We're already working with labs on evals, focusing on tasks that are more realistic than OSWorld/Windows Agent Arena and curated with actual workers. If you want to run your agent on it we'd love to include your results.

This is true as a technical matter, but this isn't a technical blog post! It's a consumer review, and when companies ship consumer products, the people who use them can't be expected to understand failure modes that are not clearly communicated to them. If OpenAI wants regular people to dump their data into ChatGPT for Health, the onus is on them to make it reliable.

> the onus is on them to make it reliable.

That is not a plausible outcome given the current technology or any of OpenAI's demonstrated capabilities.

"If Bob's Hacksaw Surgery Center wants to stay in business they have to stop killing patients!"

Perhaps we should just stop him before it goes too far?


> That is not a plausible outcome given the current technology or any of OpenAI's demonstrated capabilities

OpenAI has said that medical advice was one of the biggest use cases they saw from users. It should be assumed they're investigating how to build out this product capability.

Google has LLMs fine tuned on medical data. I have a friend who works at a top-tier US medical research university, and the university is regularly working with ML research labs to generate doctor-annotated training data. OpenAI absolutely could be involved in creating such a product using this sort of source.

You can feed an LLM text, pictures, videos, audio, etc., so why not train a model to accept medical time-series data as another modality? Obviously this could have a negative performance impact on a coding model, but it could potentially be valuable for a consumer-oriented chat bot. Or, of course, they could create a dedicated model and tool-call that model.
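
A rough sketch of that last option, using the standard OpenAI tool-calling API; the tool name, its schema, and the model choice here are hypothetical, just to show what a handoff to a dedicated time-series model could look like:

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical specialist model, exposed to the chat model as a tool.
    tools = [{
        "type": "function",
        "function": {
            "name": "analyze_vitals_timeseries",
            "description": "Run a dedicated model over a heart-rate/sleep/glucose time series",
            "parameters": {
                "type": "object",
                "properties": {
                    "metric": {"type": "string"},
                    "samples": {"type": "array", "items": {"type": "number"}},
                },
                "required": ["metric", "samples"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "My resting heart rate this month looks odd, can you check it?"}],
        tools=tools,
    )

    # If the model decides to delegate, resp.choices[0].message.tool_calls carries
    # the arguments; the app would forward them to the dedicated medical model.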


They are going to do the same thing they do with code.

They are going to hire armies of developing-world workers to massage those models in post-training to have some acceptable behaviors, and they will create the appropriate agents with the appropriate tools to have something that simulates the real thing in the most plausible way.

Problem is, RLVR is cheap with code, but it can get very expensive with human physiology.


The example you present seems fairly straightforward to my intuition, but I think your point is fair.

A harder set of hypotheticals might arise if music production goes the direction that software engineering is heading: “agentic work”, whereby a person is very much involved in the creation of a work, but more by directing an AI agent than by orchestrating a set of non-AI tools.


Any person who would choose 3.7 with a fancy harness has a very poor memory about how dramatically the model capabilities have improved between then and now.


I’d be very interested in the performance of 3.7 decked out with web search, context7, a full suite of skills, and code-quality hooks against Opus 4.5 with none of those. I suspect it’s closer than you think!


Skills don't make any difference above having markdown files to point an agent to with instructions as needed. Context7 isn't any better than telling your agent to use trafilatura to scrape web docs for your libs, and having a linting/static analysis suite isn't a harness thing.
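
For reference, the trafilatura route is only a few lines; a rough sketch, assuming the standard trafilatura API (fetch_url/extract), with a placeholder docs URL and output file:

    import trafilatura

    url = "https://docs.example.com/somelib/quickstart"  # placeholder docs page
    html = trafilatura.fetch_url(url)                    # download the page
    if html:
        text = trafilatura.extract(html)                 # strip nav/boilerplate, keep the docs text
        with open("somelib-quickstart.txt", "w") as f:
            f.write(text or "")

Point the agent at the saved file (or let it run a script like this itself) and you get roughly what Context7 gives you.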

3.7 was kinda dumb: it was good at vibe UIs but really bad at a lot of things, and it would lie and hack rewards a LOT. The difference with Opus 4.5 is that when you go off the Claude happy path, it holds together pretty well. With Sonnet (particularly <=4), if you went off the happy path, things got bad in a hurry.


Yeah. 3.7 was pretty bad. I remember its warts vividly. It wanted to refactor everything. Not a great model on which to hinge this provocation.

But skills do improve model performance; OpenAI posted some examples of how they massively juiced up results on some benchmarks.


> I suspect it’s closer than you think!

It's not.

I've done this (although not with all these tools).

For a reasonably sized project, it's easy to tell the difference in quality between, say, Grok-4.1-Fast (30 on AA Coding Index) and Sonnet 4.5 (37 on AA).

Sonnet 3.7 scores 27. No way I'm touching that.

Opus 4.5 scores 46 and it's easy to see that difference. Give the models something with high cyclomatic complexity or complex dependency chains and Grok-4.1-Fast falls to bits, while Opus 4.5 solves things.


"Captcha" doesn't refer to any specific type of puzzle, but a class of methods for verifying human users. Some older-style captchas are broken, but some newer ones are not.


I'm aware. But I'm also aware that breaking these sorts of systems is quite fun for a lot of nerds. So don't expect anything like that to last for any meaningful amount of time.


Since before LLMs were even an issue, there have been services that use overseas workers to solve them, with the going rate about $0.002 per captcha (and they solve several different types).


This is both true and misleading. It implies captchas aren’t effective due to these services. In practice, though, a good captcha cuts a ton of garbage traffic even though a motivated opponent can pay for circumvention.


How much compute do your systems expend on chunking vs. the embedding itself?


That may be true, but it does seem like OP's intent was to learn something about how LLM agents perform on complex engineering tasks, rather than learning about ASCII creation logic. A different but perhaps still worthy experiment.


That might be true for a narrow definition of chatbots, but they aren't going to survive on name recognition if their models are inferior in the medium term. Right now, "agents" are only really useful for coding, but when they start to be adopted for more mainstream tasks, people will migrate to the tools that actually work first.


Claude Code's Plan Mode increasingly does a (small-scale) version of this - it will research your codebase and come back to you with a set of clarifying questions and design decisions before presenting its implementation plan.

