Yep, 100% correct. We're still reviewing and advising on test cases. We also write a PRD beforehand (with the LLM interviewing us!) so the scope and expectations tend to be fairly well-defined.
It doesn't require removing them if you think you'll need them. It just requires writing tests for those edge cases so you have confidence that the code will work correctly if/when those branches do eventually run.
I don't think anyone wants production code paths that have never been tried, right?
I suspect it will still fall on humans (with machine assistance?) to move the field forward and innovate, but in terms of getting an LLM up to speed on genuinely new concepts, they tend to be pretty nimble on that front (in my experience).
Especially with the massive context windows modern LLMs have. The core idea that the GPT-3 paper introduced was (summarizing):
A sufficiently large language model can perform new tasks it has never seen using only a few examples provided at inference time, without any gradient updates or fine-tuning.
I never claim that 100% coverage has anything to do with code breaking. The only claim made is that anything less than 100% does guarantee that some piece of code is not automatically exercised, which we don't allow.
It's a footnote on the post, but I expand on this with:
> 100% coverage is actually the minimum bar we set. We encourage writing tests for as many scenarios as possible, even if it means the same lines get exercised multiple times. It gets us closer to 100% path coverage as well, though we don’t enforce (or measure) that.
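A toy illustration of the line-vs-path gap (not from the post): the two calls below exercise every line and both arms of each branch, yet only two of the four possible paths through the method.

    def shipping_cost(express:, gift:)
      cost = express ? 20 : 5   # branch A
      cost += 3 if gift         # branch B
      cost
    end

    shipping_cost(express: true,  gift: true)    # path: A true,  B true
    shipping_cost(express: false, gift: false)   # path: A false, B false
    # 100% line (and branch) coverage, yet the (true, false) and (false, true)
    # paths never run; that is the extra assurance path coverage would add.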
> I never claim that 100% coverage has anything to do with code breaking.
But what I care about is code breaking (or rather, it not breaking). I'd rather put effort into ensuring my test suite provides a useful benefit in that regard than chase an arbitrary target that isn't a good measure of it.
SimpleCov in Ruby has two metrics: line coverage and branch coverage. If you really want to be strict, get to 100% branch coverage. This really helps you flesh out all the various scenarios.
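If you want to enforce that, a minimal sketch of what it looks like in a spec_helper.rb (assuming SimpleCov >= 0.18, which added branch coverage; the exact file and thresholds are up to you):

    # Load SimpleCov before your application code so coverage is tracked.
    require 'simplecov'

    SimpleCov.start do
      enable_coverage :branch                   # track branches as well as lines
      minimum_coverage line: 100, branch: 100   # fail the suite below 100% on either
    end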
Brakes in cars here in Germany are integrated with less than 50% coverage in the testing of the final model that goes to production.
Seems like even where people could potentially die, 100% isn't really realistic by industry standards. (Also, redundancy in production is more of a solution than having some failures and recalls, which get solved with money.)
Can you say more? I see a lot of teams struggling with getting AI to work for them. A lot of folks expect it to be a little more magical and "free" than it actually is. So this post is just me sharing what works well for us on a very seasoned eng team.
As someone who struggles to realise productivity gains with AI (see recent comment history), I appreciate the article.
100% coverage for AI generated code is a very different value proposition than 100% coverage for human generated code (for the reasons outlined in the article).
Hi, the reason I have this expectation is that on a (cognitively) diverse team there will be a range of reactions that all need to be accommodated.
Some (many?) devs don't want agents, whether because the agent takes away the 'fun' part of their work, because they don't trust the agent, or because they truly do not find a use for it in their process.
I remember being on teams which only remained functional because two devs tried very hard to stay out of one another's way. Nothing wrong with either of them, their approach to the work was just not very compatible.
In the same way, I expect diverse teams to struggle to find a mode of adoption that does not negatively impact the existing styles of some members.
I was thinking it was more that LLMs, when used personally, can make huge refactorings and code changes that you review yourself and just check in, but with a team it's harder to make the kind of sweeping changes an LLM makes possible, because now everyone's changes start to conflict... but I guess that's not much of an issue in practice?
Reranking is definitely the way to go. We personally found common reranker models to be a little too opaque (can't explain to the user why this result was picked) and not quite steerable enough, so we just use another LLM for reranking.
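Roughly like the sketch below (purely illustrative; llm_complete is a hypothetical helper standing in for whatever chat-completion client you use). Asking for a one-sentence reason per result is what makes the ranking explainable to the user, and the prompt itself is where the steering happens.

    require 'json'

    # Rerank candidate documents for a query with a general-purpose LLM.
    # llm_complete(prompt) is a hypothetical helper that returns the model's
    # text response; swap in your provider's client.
    def rerank(query, candidates)
      numbered = candidates.each_with_index.map { |doc, i| "#{i}: #{doc}" }.join("\n")
      prompt = <<~PROMPT
        Rank these documents by relevance to the query "#{query}", most relevant first.
        #{numbered}
        Reply with JSON only: [{"index": <int>, "reason": "<one sentence>"}, ...]
      PROMPT

      JSON.parse(llm_complete(prompt)).map do |item|
        { doc: candidates[item['index']], reason: item['reason'] }
      end
    end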
We use Claude Code pretty aggressively at our startup, so we were naturally curious to compare the coding examples that OpenAI published today to Opus 4.1.
When you asked it to choose by picking a random number between 1 and 4, it skewed the results heavily toward 2 and 3. It could have interpreted your instructions to mean literally between 1 and 4, i.e. not inclusive of the endpoints.