mnk47's comments | Hacker News

In my experience, the model's performance in silly tasks like these is usually (not always) correlated with its performance in other areas except tool use/agent stuff.


You can just use the official Claude Code, OpenAI Codex, and Gemini extensions on VS Code. You get diffs just like in Cursor now. The performance of these models can vary wildly depending on the agent harness they're on.

The official tools won't necessarily give you the best performance, but they're a safer bet for now. This is merely anecdotal as I haven't bothered to check rigorously, but I and others online have found that GPT-5-Codex is worse in Cursor than in the official CLI/extension/web UI.


Go is still going strong after 15 years. Dart, the language of Flutter, is 13 years old.


> In Obsidian, open the daily file and copy the contents from yesterday

What's the point of this? Isn't it easier to just keep reusing the same note?


Did you read the article? Basically all it says is that OpenAI faced struggles this past year -- specifically with GPT-5, aka Orion. And now they have o3, and other labs have made huge strides. So, sure, show me where AI progress is slowing down!


Awesome, now they can get me wrong answers even quicker :)


> So LLMs finally hit the wall

Not really. Throwing a bunch of unfiltered garbage at the pretraining dataset, throwing in RLHF of questionable quality during post-training, and other current hacks - none of that was expected to last forever. There is so much low-hanging fruit that OpenAI left untouched and I'm sure they're still experimenting with the best pre-training and post-training setups.

One thing researchers are seeing is resistance to post-training alignment in larger models, but that's almost the opposite of a wall, and they're figuring that out as well.

> Now someone has to have a new idea

OpenAI already has a few, namely the o* series in which they discovered a way to bake Chain of Thought into the model via RL. Now we have reasoning models that destroy benchmarks that they previously couldn't touch.

Anthropic has a post-training technique, RLAIF, which supplants RLHF, and it works amazingly well. Combined with countless other tricks we don't know about in their training pipeline, they've managed to squeeze a lot of performance out of Sonnet 3.5 for general tasks.

Gemini is showing a lot of promise with their new Flash 2.0 and Flash 2.0-Thinking models. They're the first models to beat Sonnet at many benchmarks since April. The new Gemini Pro (or Ultra? whatever they call it now) is probably coming out in January.

> The current level of LLM would be far more useful if someone could get a conservative confidence metric out of the internals of the model. This technology desperately needs to output "Don't know" or "Not sure about this, but ..." when appropriate.

You would probably enjoy this talk [0] by an independent researcher who IIRC is a former employee of DeepMind or some other lab. They're exploring this exact idea. It's actually not hard to tell when a model is "confused" (just look at the probability distribution over likely tokens); the challenge is in steering the model to either get back on the right track or give up and say "you know what, idk".

[0] https://www.youtube.com/watch?v=4toIHSsZs1c
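
A minimal sketch of the "look at the distribution" part, assuming a Hugging Face causal LM. The model, the prompt, and the use of next-token entropy as the confusion signal are my own illustrative choices, not taken from the talk; the hard part the talk is about (getting the model to act on that signal and actually say "I don't know") is not captured here.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # small stand-in model, purely illustrative
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "The capital of Australia is"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token

    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()  # in nats

    # A flat distribution (high entropy) suggests the model is unsure of the
    # next token; a sharply peaked one (low entropy) suggests it is confident.
    top = probs.topk(5)
    print(f"next-token entropy: {entropy.item():.2f} nats")
    print("top candidates:", [tokenizer.decode([i]) for i in top.indices.tolist()])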


> Not really. Throwing a bunch of unfiltered garbage at the pretraining dataset, throwing in RLHF of questionable quality during post-training, and other current hacks - none of that was expected to last forever. There is so much low-hanging fruit that OpenAI left untouched and I'm sure they're still experimenting with the best pre-training and post-training setups.

Exactly! Llama 3 and its .x iterations have shown that, at least for now, the idea of using previous models to filter the pre-training dataset and using a small number of seeds to create synthetic datasets for post-training still holds. We'll see with L4 if it continues to hold. A rough sketch of the filtering idea is below.
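
To make that concrete: a sketch of model-in-the-loop data filtering, where an existing model grades candidate pretraining documents and only highly rated ones are kept. The function names, grading prompt, and threshold are my assumptions for illustration; the actual Llama 3 pipeline isn't public at this level of detail.

    from typing import Callable, Iterable

    QUALITY_PROMPT = (
        "Rate the educational quality of the following text from 0 to 5. "
        "Respond with only the number.\n\nText:\n{doc}\n\nScore:"
    )

    def score_document(doc: str, generate: Callable[[str], str]) -> float:
        # Ask an existing (previous-generation) model to grade a candidate
        # pretraining document; `generate` is any prompt-in, text-out callable.
        reply = generate(QUALITY_PROMPT.format(doc=doc[:2000]))
        try:
            return float(reply.strip().split()[0])
        except (ValueError, IndexError):
            return 0.0  # unparseable reply -> treat as low quality

    def filter_corpus(docs: Iterable[str], generate: Callable[[str], str],
                      threshold: float = 3.0) -> Iterable[str]:
        # Keep only documents the older model rates at or above the threshold.
        for doc in docs:
            if score_document(doc, generate) >= threshold:
                yield doc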


> Then there is the matter of actually defining general intelligence. It may also be the definition of consciousness, or at least require it. But currently, there is no mutually agreed upon definition of "general intelligence".

Here lies the problem. We should have a rule that any time we discuss AGI, we preface with the arbitrary definition that we choose to operate on. Otherwise, these discussions will inevitably devolve into people talking past each other, because everyone has a different default definition of AGI, even within the SF AI scene.

If you ask Yann LeCun, he'll say that no LLM system is even close to being generally intelligent, and that the best LLMs are still dumber than a cat.

If you ask Sam Altman, he'll say that AGI = an AI system that can perform any task as well as the average human or better.

If you ask Dario Amodei, he'll say that he doesn't like that term, mostly because by his original definition AGI is already here, since AGI = AI that is meant to do any general task, as opposed to specialized AI (e.g. AlphaGo).


The definitions are one of the major sticking points.

We don't have good, clear definitions of either intelligence or consciousness.

They need to be generally agreeable: include everything we accept as intelligent or conscious, and exclude everything we accept as not.




Sam Altman just replied: https://x.com/sama/status/1849661093083480123

> fake news out of control

