| | LMArena Is a Cancer on AI (surgehq.ai) |
| 5 points by EvgeniyZh 11 days ago | 1 comment |
|
| | LMArena Is a Cancer on AI (surgehq.ai) |
| 3 points by jumploops 12 days ago |
|
| | LMArena Is a Cancer on AI (surgehq.ai) |
| 4 points by holdingunsteady 13 days ago | 1 comment |
|
| | LMArena Is a Plague on AI (surgehq.ai) |
| 3 points by cui 14 days ago | 1 comment |
|
| | RL Environments and the Hierarchy of Agentic Capabilities (surgehq.ai) |
| 4 points by echen 41 days ago |
|
| | Wall Street Experts Tested GPT-5 and Claude. Both Struggled – Even with Excel (surgehq.ai) |
| 5 points by holdingunsteady 46 days ago | 1 comment |
|
| | Is Sonnet 4.5 the best coding model in the world? (surgehq.ai) |
| 2 points by egillie 68 days ago |
|
| | A Product Take on Sonnet 4.5 (surgehq.ai) |
| 1 point by gk1 70 days ago |
|
| | Unsexy AI Failures: The PDF That Broke ChatGPT (surgehq.ai) |
| 4 points by gk1 80 days ago |
|
| | The Human/AI Frontier: A Conversation with Bogdan Grechuk (surgehq.ai) |
| 1 point by gk1 83 days ago |
|
| | Unsexy AI Failures: Still Confidently Hallucinating Image Text (surgehq.ai) |
| 2 points by gk1 3 months ago |
|
| | AI agents still can't solve 1/3 of SWE-Bench problems. Why not? (A Case Study) (surgehq.ai) |
| 1 point by egilliehhc 3 months ago |
|
| | SWE-Bench Failures: When Coding Agents Spiral into 693 Lines of Hallucinations (surgehq.ai) |
| 22 points by landonxi 3 months ago | 1 comment |
|
| | Extracting text from a pdf broke ChatGPT (surgehq.ai) |
| 7 points by landonxi 3 months ago | 2 comments |
|
| | The PDF That Broke ChatGPT (surgehq.ai) |
| 2 points by jasong 3 months ago |
|
| | SurgeAI Blog: Human Evals vs. Academic Benchmarks (surgehq.ai) |
| 1 point by Olshansky 3 months ago |
|
| | Unsexy AI Failures: The PDF That Broke ChatGPT (surgehq.ai) |
| 1 point by pr337h4m 3 months ago |
|
| | Introduction to Reinforcement Learning with Human Feedback (surgehq.ai) |
| 1 point by CarrieLab on Jan 25, 2023 |
|
| | Explaining Reinforcement Learning with Human Feedback (RLHF) (surgehq.ai) |
| 11 points by echen on Jan 5, 2023 |
|
| | We Evaluated ChatGPT vs. Google on 500 Search Queries (surgehq.ai) |
| 25 points by amrrs on Dec 26, 2022 | 11 comments |
|
| | We Evaluated ChatGPT vs. Google on 500 Search Queries (surgehq.ai) |
| 5 points by holdingunsteady on Dec 23, 2022 |
|
| | ChatGPT vs. Google Search (surgehq.ai) |
| 3 points by antman on Dec 22, 2022 |
|
| | ChatGPT Crushes Google on Coding Queries, and Matches It at General Information (surgehq.ai) |
| 11 points by echen on Dec 21, 2022 | 1 comment |
|
| | AI Red Teams for Adversarial Training: Making ChatGPT and LLMs More Robust (surgehq.ai) |
| 9 points by echen on Dec 13, 2022 |
|
| | HellaSwag: 36% of this popular large language model benchmark contains errors (surgehq.ai) |
| 49 points by echen on Dec 6, 2022 | 8 comments |
|
| | The Violence, Racism, & Sexism Uncaught by Twitter's Content Moderation Systems (surgehq.ai) |
| 3 points by echen on Nov 17, 2022 |
|
| | Twitter’s Egregious Content Moderation Failures (surgehq.ai) |
| 15 points by JerryRorsch on Nov 10, 2022 |
|
| | How TikTok Is Evolving the Next Generation of Search (surgehq.ai) |
| 2 points by valehock on Nov 1, 2022 | 1 comment |
|
| | Move Over, Google: The TikTokification of Next-Gen Search (surgehq.ai) |
| 13 points by echen on Oct 26, 2022 | 4 comments |
|
| | Evaluating Image Generation Intelligence: Humans vs. Imagen vs. DALL-E (surgehq.ai) |
| 1 point by aenimel on Oct 8, 2022 |
|