Hacker News

> I’ve been feeling pretty good about my benchmark! It should stay useful for a long time... provided none of the big AI labs catch on.

> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.

Yeah, this touches on an issue that makes it very difficult to have a public discussion about AI capabilities. Any specific test you talk about, no matter how small, will be RLHF'd away if the big companies get wind of it, sometimes to the point of absurdity. See the old "count the 'r's in strawberry" canard for one example.



Honestly, if my stupid pelican riding a bicycle benchmark becomes influential enough that AI labs waste their time optimizing for it and produce really beautiful pelican illustrations I will consider that a huge personal win.


"personal" doing a lot of work there :-)

(And I'd be envious of your impact, of course)


Just tried that canard on GPT-4o and it failed:

"The word "strawberry" contains 2 letter r’s."


I tried

strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said three

strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said four

stawberrry -> DeepSeek, GeminiPro all correctly said three

ChatGPT4o, even in a new chat, incorrectly said the word "stawberrry" contains 4 letter "r" characters. It even provided this useful breakdown to let me know :-)

Breakdown: stawberrry → s, t, a, w, b, e, r, r, r, y → 4 r's

And then it asked if I meant "strawberry" instead, saying that one has 2 r's....
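For reference, the ground-truth counts for the three spellings above can be checked directly with `str.count`:

```python
# Ground truth for the letter-counting test words
for word in ["strawberry", "strawberrry", "stawberrry"]:
    print(word, "->", word.count("r"))
# strawberry -> 3
# strawberrry -> 4
# stawberrry -> 3
```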


This is why things like the ARC Prize are better ways of approaching this: https://arcprize.org


Well, ARC-1 did not end well for competitors of the tech giants, and it's very unclear that ARC-2 won't follow the same trajectory.


This doesn’t make ARC a bad benchmark. Tech giants will have a significant advantage in any benchmark they are interested in, _especially_ if the benchmark correlates with true general intelligence.


You push SHA-512 hashes of things to a GitHub repo, along with a short sentence:

x8 version: still shit . . x15 version: we are closing, but overall a shit experience :D

This way they won't know what to improve upon. Of course, they could buy access. ;P

When they finally solve your problem, you can reveal what the benchmark was.
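The commit-then-reveal scheme described above can be sketched in a few lines with Python's `hashlib` (the prompt string is illustrative, and a real version should append a random salt so short prompts can't be brute-forced from the digest):

```python
import hashlib

def commit(prompt: str) -> str:
    """Return a SHA-512 digest to publish now (e.g. in a GitHub repo)."""
    return hashlib.sha512(prompt.encode("utf-8")).hexdigest()

def reveal(prompt: str, published_digest: str) -> bool:
    """Later, prove the revealed prompt matches the earlier commitment."""
    return commit(prompt) == published_digest

# Hypothetical hidden benchmark prompt
secret = "Generate an SVG of a pelican riding a bicycle"
digest = commit(secret)            # push this hex string publicly today
assert reveal(secret, digest)      # on reveal day, anyone can verify
assert not reveal("other prompt", digest)
```

Publishing only the digest keeps the benchmark out of training data while still letting you prove, later, that you committed to it before the labs solved it.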



