Hacker News

> I’ve been feeling pretty good about my benchmark! It should stay useful for a long time... provided none of the big AI labs catch on.

> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.

Yeah, this touches on an issue that makes it very difficult to have a public discussion about AI capabilities. Any specific test you talk about, no matter how small, will be RLHF'd away if the big companies get wind of it, sometimes to the point of absurdity. See the old "count the 'r's in strawberry" canard for one example.



Honestly, if my stupid pelican riding a bicycle benchmark becomes influential enough that AI labs waste their time optimizing for it and produce really beautiful pelican illustrations I will consider that a huge personal win.


"personal" doing a lot of work there :-)

(And I'd be envious of your impact, of course)


Just tried that canard on GPT-4o and it failed:

"The word "strawberry" contains 2 letter r’s."


I tried

strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said three

strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said four

stawberrry -> DeepSeek, GeminiPro all correctly said three

ChatGPT4o, even in a new chat, incorrectly said the word "stawberrry" contains 4 letter "r" characters. It even provided this useful breakdown to let me know :-)

Breakdown: stawberrry → s, t, a, w, b, e, r, r, r, y → 4 r's

And then it asked if I meant "strawberry" instead, saying that one has 2 r's....
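For reference, the ground-truth counts for the three spellings above can be checked directly with `str.count`:

```python
# Ground truth for the letter-counting test words
for word in ["strawberry", "strawberrry", "stawberrry"]:
    print(word, "->", word.count("r"))
# strawberry -> 3
# strawberrry -> 4
# stawberrry -> 3
```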


This is why things like the ARC Prize are better ways of approaching this: https://arcprize.org


Well, ARC-1 did not end well for competitors of the tech giants, and it's very unclear that ARC-2 won't follow the same trajectory.


This doesn’t make ARC a bad benchmark. Tech giants will have a significant advantage in any benchmark they are interested in, _especially_ if the benchmark correlates with true general intelligence.


You push SHA-512 hashes of things to a GitHub repo, along with a short sentence:

x8 version: still shit . . x15 version: we are closing, but overall a shit experience :D

This way they won't know what to improve upon. Of course, they could buy access. ;P

When they finally solve your problem, you can reveal what the benchmark was.
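The commit-then-reveal scheme described above can be sketched in a few lines with Python's `hashlib` (the prompt string is illustrative, and a real version should append a random salt so short prompts can't be brute-forced from the digest):

```python
import hashlib

def commit(prompt: str) -> str:
    """Return a SHA-512 digest to publish now (e.g. in a GitHub repo)."""
    return hashlib.sha512(prompt.encode("utf-8")).hexdigest()

def reveal(prompt: str, published_digest: str) -> bool:
    """Later, prove the revealed prompt matches the earlier commitment."""
    return commit(prompt) == published_digest

# Hypothetical hidden benchmark prompt
secret = "Generate an SVG of a pelican riding a bicycle"
digest = commit(secret)            # push this hex string publicly today
assert reveal(secret, digest)      # on reveal day, anyone can verify
assert not reveal("other prompt", digest)
```

Publishing only the digest keeps the benchmark out of training data while still letting you prove, later, that you committed to it before the labs solved it.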



