
IMHO, the real problem is that they create an even greater dissonance between online life and IRL.

Think about dating apps: the pictures could already be fake, and now the words exchanged can be fake too.

You thought you were arguing with a gentle and smart colleague over chat and email; too bad, when you meet them at a conference or at a restaurant you find them very unpleasant.


This.

For me, navigating with shortcuts feels like I can keep my inner monologue going; it becomes part of it, maybe because I can spell it out?

Dunno, but reaching for the mouse and navigating around breaks that, even if it can be more convenient for some actions.


Interesting argument.

But isn't it the correction of those errors that is valuable to society and gets us a job?

People can tell you they found a bug or describe what they want from a piece of software, yet it takes skill to fix the bugs and to build the software. Though LLMs can speed up the process, expert human judgment is still required.


I think there are different levels to look at this.

If you know that you need O(n) "contains" checks and O(1) retrieval for items, for a given order of magnitude, it feels like you have all the pieces of the puzzle needed to keep the LLM on the straight and narrow, even if you didn't know off the top of your head that you should choose ArrayList.

Or if you know that string manipulation might be memory intensive so you write automated tests around it for your order of magnitude, it probably doesn't really matter if you didn't know to choose StringBuilder.

That feels different to e.g. not knowing the difference between an array list and a linked list (or the concept of time/space complexity) in the first place.
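
To make that concrete, here is a rough standalone sketch (my own toy example, not a rigorous benchmark) of the kind of difference that matters: contains() is a linear scan on both lists, but indexed access is O(1) on ArrayList and O(n) on LinkedList.

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class ListComparison {
        public static void main(String[] args) {
            int n = 100_000;
            List<Integer> arrayList = new ArrayList<>();
            List<Integer> linkedList = new LinkedList<>();
            for (int i = 0; i < n; i++) {
                arrayList.add(i);
                linkedList.add(i);
            }

            // contains() is O(n) for both: a linear scan either way.
            // get(i) is O(1) for ArrayList (array indexing) but O(n) for
            // LinkedList (walks the nodes), so indexed access is where the
            // choice really bites.
            long t0 = System.nanoTime();
            for (int i = 0; i < 1_000; i++) arrayList.get(n / 2);
            long t1 = System.nanoTime();
            for (int i = 0; i < 1_000; i++) linkedList.get(n / 2);
            long t2 = System.nanoTime();

            System.out.printf("ArrayList.get:  %d us%n", (t1 - t0) / 1_000);
            System.out.printf("LinkedList.get: %d us%n", (t2 - t1) / 1_000);
        }
    }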


My gut feeling is that, without wrestling with data structures at least once (e.g. during a course), that knowledge about complexity will be cargo cult.

When it comes to fundamentals, I think it's still worth the investment.

To paraphrase, "months of prompting can save weeks of learning".


I think the kind of judgement required here is to design ways to test the code without inspecting it manually line by line; inspecting it manually would be walking a motorcycle, and you would only be vibe-testing. That is why we have seen the FastRender browser and the JustHTML parser: the testing part was solved upfront, so the AI could go nuts implementing.

I partially agree, but I don’t think “design ways to test the code without inspecting it manually line by line” is a good strategy.

Tests only cover cases you already know to look for. In my experience, many important edge cases are discovered by reading the implementation and noticing hidden assumptions or unintended interactions.

When something goes wrong, understanding why almost always requires looking at the code, and that understanding is what informs better tests.


Another possibility is to implement the same spec twice and do differential testing; you can catch diverging assumptions and clarify them.
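
A minimal sketch of the idea (the spec here, "k-th smallest element", and both implementations are just stand-ins I made up; in practice the two versions would come from two independent prompts or people):

    import java.util.Arrays;
    import java.util.Random;

    public class DifferentialTest {
        // Two independent implementations of the same spec
        // (here: "return the k-th smallest element of the array").
        static int implA(int[] xs, int k) {
            int[] copy = xs.clone();
            Arrays.sort(copy);
            return copy[k];
        }

        static int implB(int[] xs, int k) {
            // Naive selection: count how many elements are smaller/equal.
            for (int candidate : xs) {
                int smaller = 0, equal = 0;
                for (int x : xs) {
                    if (x < candidate) smaller++;
                    else if (x == candidate) equal++;
                }
                if (smaller <= k && k < smaller + equal) return candidate;
            }
            throw new IllegalArgumentException("k out of range");
        }

        public static void main(String[] args) {
            Random rng = new Random(42);
            for (int iter = 0; iter < 10_000; iter++) {
                int[] xs = rng.ints(1 + rng.nextInt(50), -100, 100).toArray();
                int k = rng.nextInt(xs.length);
                int a = implA(xs, k), b = implB(xs, k);
                if (a != b) {
                    // A divergence means at least one implementation (or the
                    // spec as each of them understood it) is wrong.
                    throw new AssertionError("diverged on " + Arrays.toString(xs)
                        + ", k=" + k + ": " + a + " vs " + b);
                }
            }
            System.out.println("10000 random cases agreed");
        }
    }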

Isn't that too much work?

Instead, just learning concepts with AI and then using HI (Human Intelligence) and AI to solve the problem at hand (by going through code line by line and writing tests) is a better approach productivity-, correctness-, efficiency-, and skill-wise.

I can only think of LLMs as fast typists with some domain knowledge.

Like typists of government/legal documents who know how to format documents but cannot practice law. Likewise, LLMs are code typists who can write good/decent/bad code but cannot practice software engineering - we need, and will need, a human for that.


Its ego won't get in the way, but its lack of intelligence will.

Whereas a junior might be reluctant at first, but if they are smart they will learn and get better.

So maybe LLMs are better than not-so-smart people, but you usually try to avoid hiring those people in the first place.


That's exactly the thing. Claude Code with Opus 4.5 is already significantly better at essentially everything than a large percentage of devs I had the displeasure of working with, including learning when asked to retain a memory. It's still very far from the best devs, but this is the worst it'll ever be, and it has already significantly raised the bar for hiring.

> but this is the worst it'll ever be

And even if the models themselves for some reason were to never get better than what we have now, we've only scratched the surface of harnesses to make them better.

We know a lot about how to make groups of people achieve things individual members never could, and most of the same techniques work for LLMs, but it takes extra work to figure out how to most efficiently work around limitations such as the lack of integrated long-term memory.

A lot of that work is in its infancy. E.g. I have a project I'm working on now where I'm up to a couple dozen agents, and every day I'm learning more about how to structure them to squeeze the most out of the models.

One learning that feels relevant to the linked article: instead of giving an agent the whole task across a large dataset that'd overwhelm its context, it often helps to have an agent - one that can use Haiku, because it's fine if it's dumb - comb the data for <information relevant to the specific task>, generate a list of that information, and have the bigger model use it as a guide.
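
Roughly the shape it takes, with a made-up LlmClient interface standing in for whatever model API you actually call (none of these names are real library calls):

    import java.util.List;

    // Made-up interface: wrap whatever model API you actually use.
    interface LlmClient {
        String complete(String model, String prompt);
    }

    public class TwoStageAgent {
        private final LlmClient llm;

        TwoStageAgent(LlmClient llm) {
            this.llm = llm;
        }

        String run(String task, List<String> dataChunks) {
            // Stage 1: a cheap model combs each chunk and keeps only what
            // looks relevant to the task. It's fine if it's dumb; it only
            // has to filter, not reason.
            StringBuilder notes = new StringBuilder();
            for (String chunk : dataChunks) {
                String relevant = llm.complete(
                    "cheap-model",
                    "Task: " + task + "\n"
                    + "From the text below, list only facts relevant to the task.\n"
                    + "If nothing is relevant, answer NONE.\n\n" + chunk);
                if (!relevant.strip().equals("NONE")) {
                    notes.append(relevant).append('\n');
                }
            }

            // Stage 2: the bigger model gets the distilled notes instead of
            // the whole dataset, so its context isn't overwhelmed.
            return llm.complete(
                "big-model",
                "Task: " + task + "\n\nRelevant notes gathered from the data:\n" + notes);
        }
    }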

So the progress we're seeing is not just raw model improvements, but work like the one in this article: Figuring out how to squeeze the best results out of any given model, and that work would continue to yield improvements for years even if models somehow stopped improving.


It's like pot.

Back in the day, it was much less concentrated and less dangerous than what you can get today.


It's much easier to forbid something to a subset of the population than to the population at large.

Imho it's because you did the work before asking the LLM for input, so you already have information and an opinion about what the code should look like. You can recognize good suggestions and quickly discard bad ones.

It's like reading: for better learning and understanding, it is advised that you think about and question the text before reading it, and then again after just skimming it.

Whereas if you ask for the answer first, you are less prepared for the topic, and it is harder to form a different opinion.

It's my perception.


It's also because they are only as good as their given skills. If you tell them "code <advanced project> and make no x and y mistakes", they will still make those mistakes. But if you say "perform a code review and look specifically for x and y", then they may have some notion of what to do. That's my experience with using them for both writing and reviewing the same code in different passes.

Have these shortcomings of LLMs been addressed by better models or by better integration with other tools? Like, are they better at coding because the models are truly better, or because the agentic loops are better designed?


Fundamentally these shortcomings cannot be addressed.

They can be and are improved (papered over) over time, for example by improving and tweaking the training data; adding in new data sets is the usual fix. A prime example: 'count the number of R's in Strawberry' caused quite a debacle at a time when LLMs were meant to be intelligent. Because they aren't, they can trip up over simple problems like this. Continue to use an army of people to train them and these edge cases may become smaller over time, but fundamentally the LLM tech hasn't changed.

I am not saying that LLMs aren't amazing, they absolutely are. But WHAT they are is an understood thing, so let's not confuse ourselves.


100% by better models. Since his talk, models have gained larger context windows (up to a usable 1M), and RL (reinforcement learning) has been amazing at picking out good traces and has taught the LLMs how to backtrack and overcome earlier wrong tokens. On top of that, RLAIF (RL with AI feedback) made earlier models better, and RLVR (RL with verifiable rewards) has made them very good at both math and coding.

The harnesses have helped in training the models themselves (i.e. every good trace was "baked into" the model) and have improved at enabling test-time compute. But at the end of the day this is all put back into the models, and they become better.

The simplest proof of this is on benchmarks like terminal-bench and SWE-bench with simple agents. The current top models are much better than their previous versions when put in a loop with just a "bash tool". There's a ~100 LoC harness called mini-swe-agent [1] that does just that.
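
To be clear, this is not mini-swe-agent's actual code, just a sketch of the general shape of such a loop, with a made-up Model interface standing in for the model API: the model proposes one bash command per turn, you run it, and feed the output back until it says it's done.

    // Made-up Model interface standing in for a model API; this is NOT
    // mini-swe-agent's code, just the general shape of such a loop.
    interface Model {
        String next(String transcript);
    }

    public class BashLoopAgent {
        static String runBash(String command) throws Exception {
            Process p = new ProcessBuilder("bash", "-lc", command)
                    .redirectErrorStream(true).start();
            String out = new String(p.getInputStream().readAllBytes());
            p.waitFor();
            return out;
        }

        static void solve(Model model, String task) throws Exception {
            StringBuilder transcript = new StringBuilder(
                "Task: " + task + "\nReply with one bash command per turn, "
                + "or DONE when finished.\n");
            for (int step = 0; step < 50; step++) {       // hard cap on turns
                String reply = model.next(transcript.toString()).strip();
                if (reply.equals("DONE")) return;
                String output = runBash(reply);           // the single "bash tool"
                transcript.append("\n$ ").append(reply)
                          .append('\n').append(output);   // feed the result back
            }
        }
    }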

So current models + a minimal loop >> previous-gen models with human-written harnesses + lots of glue.

> Gemini 3 Pro reaches 74% on SWE-bench verified with mini-swe-agent!

[1] - https://github.com/SWE-agent/mini-swe-agent


I don't want my doctor to vibe-script some conversion only to realize weeks or months later that it made a subtle error in my prescription. I want both of them to have enough funds to hire someone to do it properly. But wanting is not enough, unfortunately...


Humans make subtle errors all the time too though. AI results still need to be checked over for anything important, but it's on a vector toward being much more reliable than a human for any kind of repetitive task.

Currently, if you ask an LLM to do something small and self-contained like solve leetcode problems or implement specific algorithms, they will have a much lower rate of mistakes, in terms of implementing the actual code, than an experienced human engineer. The things it does badly are more about architecture, organization, style, and taste.


But with a software bug, the error rapidly becomes widespread and systematic, whereas human errors often are not. Getting a couple of prescriptions wrong because the doc worked 12+ hours is different from systematically getting a significant number of prescriptions wrong until someone double-checks the results.


An error in a massive hand-crafted Excel sheet also becomes systematic and widespread.

Because Excel has no way of doing unit tests or any kind of significant validation. Big BIG things have gone to shit because of Excel.

Things that would have never happened if the same thing was a vibe-coded python script and a CSV.


I agree about the Excel thing, not with thinking it can't happen with vibe-coded Python.

I think handling sensitive data should be done by professionals. A lawyer handles contracts, a doctor handles health issues, and a programmer handles data manipulation through programs. This doesn't remove the risk of errors completely, but it reduces it significantly.

In my home, it's me who's impacted if I screw up a fix in my plumbing, but I won't try to do it at work or in my child's school.

I don't care if my doctor vibe codes an app to manipulate their holiday pictures; I care if they do it to manipulate my health or personal data.


Of course issues CAN happen with Python, but at least with Python we have tools to check for the issues.

A bunch of your personal data is most likely going through some Excel sheet made by a now-retired office worker somewhere 15 years ago. Nobody understands how the sheet works, but it works, so they keep using it :) A replacement system (a massive SaaS application) has been "coming soon" for 8 years and has cost millions, but it still doesn't work as well as the Excel sheet.


When you don't know the answer to a question you ask an LLM, do you verify it or do you trust it?

Like, if it tells you merge sort is better on that particular problem, do you trust it or do you go through an analysis to confirm it really is?

I have a hard time trusting what I don't understand. And even more so if I realize later I've been fooled. Note that it's the same with humans, though. I think I only trust a technical decision I don't understand when I deem the risk of being wrong low enough. Otherwise I'll invest in learning and understanding enough to trust the answer.


For all these "open questions" you might have, it is better to ask the LLM to write a benchmark and actually see the numbers. Why rush? Spend 10 minutes and you will have a decision backed by some real feedback from code execution.
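
For instance, for the merge-sort question above, the kind of throwaway benchmark I would ask for might look like this (my own rough sketch, not a proper JMH harness, so treat the numbers as indicative only):

    import java.util.Arrays;
    import java.util.Random;

    public class SortBench {
        // Simple top-down merge sort, the alternative being evaluated.
        static void mergeSort(int[] a, int[] buf, int lo, int hi) {
            if (hi - lo < 2) return;
            int mid = (lo + hi) >>> 1;
            mergeSort(a, buf, lo, mid);
            mergeSort(a, buf, mid, hi);
            int i = lo, j = mid, k = lo;
            while (i < mid && j < hi) buf[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
            while (i < mid) buf[k++] = a[i++];
            while (j < hi) buf[k++] = a[j++];
            System.arraycopy(buf, lo, a, lo, hi - lo);
        }

        public static void main(String[] args) {
            int n = 1_000_000;
            int[] data = new Random(1).ints(n).toArray();

            int[] a = data.clone();
            long t0 = System.nanoTime();
            Arrays.sort(a);                        // library dual-pivot quicksort
            long t1 = System.nanoTime();

            int[] b = data.clone();
            mergeSort(b, new int[n], 0, n);
            long t2 = System.nanoTime();

            System.out.printf("Arrays.sort: %d ms, mergeSort: %d ms, equal: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, Arrays.equals(a, b));
        }
    }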

But this is just a small part of a much grander testing activity that needs to wrap the LLM code. I think my main job has moved to 1. architecting and 2. ensuring the tests are well done.

What you don't test is not reliable yet. Looking at code is not testing, it's "vibe-testing" and should be an antipattern: no LGTM for AI code. We should not rely on our intuition alone, because it is not strict enough and it makes everything slow - we should not "walk the motorcycle".


Ok. I also have the intuition that more tests and formal specifications can help there.

So far, my biggest issue is that when the produced code is incorrect, with a subtle bug, I feel I have wasted time prompting for something I should have written myself, because now I have to understand it deeply to debug it.

If the test infrastructure is sound, then maybe there is a gain after all even if the code is wrong.


Often those kinds of performance things just don't matter.

Right now, for example, I am working on algorithms for computing heart rate variability and only looking at a 2-minute window with maybe 300 data points at most, so whether it is N, N log N, or N^2 is beside the point.
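
To put a number on it, here is a toy illustration (made-up RR-interval data): a deliberately O(n^2) pass over 300 points is only 90,000 iterations, which finishes in well under a millisecond on any modern machine.

    public class SmallNDemo {
        public static void main(String[] args) {
            int n = 300;                        // ~2 minutes of RR intervals
            double[] rr = new double[n];
            for (int i = 0; i < n; i++) rr[i] = 800 + Math.sin(i) * 50;

            // Deliberately O(n^2): all pairwise absolute differences.
            long t0 = System.nanoTime();
            double sum = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    sum += Math.abs(rr[i] - rr[j]);
            long t1 = System.nanoTime();

            // 300^2 = 90,000 iterations: negligible at this scale.
            System.out.printf("O(n^2) pass: %.3f ms (sum=%.1f)%n",
                (t1 - t0) / 1e6, sum);
        }
    }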

When I know I'm computing the right thing for my application, know I've coded it up correctly, and am feeling some pain about performance, that's another story.


> I have a hard time trusting what I don't understand

Who doesn't? But we have to trust them anyway; otherwise everyone would need a PhD in everything.

Also, people who "have a hard time trusting" might just give up when encountering things they don't understand. With AI, at least there is a path for them to keep digging deeper and actually verify things to whatever level of satisfaction they want.


Sure, but then I rely on an actual expert.

My issue is, LLMs have fooled me more than a couple of times with stupid but difficult-to-notice bugs. At that point, I have a hard time trusting them (but I keep trying with some stuff).

If I asked someone for something and found out several times that they were failing, I'd just stop working with them.

Edit: and to avoid just anthropomorphizing LLMs too much, the moment I notice a tool I use has a bug to the point of losing data, for example, I reconsider real hard before using it again.


I tell it to write a benchmark, and I learn from how it does that.


IME I don't learn by reading or watching, only by wrestling with a problem. ATM, I will only delegate it if the problem does not feel worth learning about (like Jenkinsfiles or Gradle scripting).

But yes, the benchmark results will tell you something true.

