Yep, 100% correct. We're still reviewing and advising on test cases. We also write a PRD beforehand (with the LLM interviewing us!) so the scope and expectations tend to be fairly well-defined.
It doesn't require removing them if you think you'll need them. It just requires writing tests for those edge cases so you have confidence that the code will work correctly if/when those branches do eventually run.
I don't think anyone wants production code paths that have never been tried, right?
I suspect it will still fall on humans (with machine assistance?) to move the field forward and innovate, but in terms of getting an LLM up to speed on genuinely new concepts, they tend to be pretty nimble on that front (in my experience).
Especially with the massive context windows modern LLMs have. The core idea that the GPT-3 paper introduced was (summarizing):
A sufficiently large language model can perform new tasks it has never seen using only a few examples provided at inference time, without any gradient updates or fine-tuning.
I never claim that 100% coverage has anything to do with code breaking. The only claim made is that anything less than 100% does guarantee that some piece of code is not automatically exercised, which we don't allow.
It's a footnote on the post, but I expand on this with:
> 100% coverage is actually the minimum bar we set. We encourage writing tests for as many scenarios as possible, even if it means the same lines get exercised multiple times. It gets us closer to 100% path coverage as well, though we don’t enforce (or measure) that.
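A toy illustration of the line-vs-path gap (not from the post): the two calls below exercise every line and both arms of each branch, yet only two of the four possible paths through the method.

    def shipping_cost(express:, gift:)
      cost = express ? 20 : 5   # branch A
      cost += 3 if gift         # branch B
      cost
    end

    shipping_cost(express: true,  gift: true)    # path: A true,  B true
    shipping_cost(express: false, gift: false)   # path: A false, B false
    # 100% line (and branch) coverage, yet the (true, false) and (false, true)
    # paths never run; that is the extra assurance path coverage would add.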
> I never claim that 100% coverage has anything to do with code breaking.
But what I care about is code breaking (or rather, it not breaking). I'd rather put effort into ensuring my test suite provides a useful benefit in that regard than chase an arbitrary target that isn't a good measure of it.
SimpleCov in Ruby has two metrics: line coverage and branch coverage. If you really want to be strict, get to 100% branch coverage. This really helps you flesh out all the various scenarios.
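If you want to enforce that, a minimal sketch of what it looks like in a spec_helper.rb (assuming SimpleCov >= 0.18, which added branch coverage; the exact file and thresholds are up to you):

    # Load SimpleCov before your application code so coverage is tracked.
    require 'simplecov'

    SimpleCov.start do
      enable_coverage :branch                   # track branches as well as lines
      minimum_coverage line: 100, branch: 100   # fail the suite below 100% on either
    end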
Brakes in cars here in Germany are integrated with less than 50% coverage in the testing of the final model that goes to production.
Seems like even where people could potentially die, 100% isn't really realistic by industry standards. (Also, redundancy in production is more of a solution than having some failures and recalls, which get solved with money.)
Can you say more? I see a lot of teams struggling with getting AI to work for them. A lot of folks expect it to be a little more magical and "free" than it actually is. So this post is just me sharing what works well for us on a very seasoned eng team.
As someone who struggles to realise productivity gains with AI (see recent comment history), I appreciate the article.
100% coverage for AI generated code is a very different value proposition than 100% coverage for human generated code (for the reasons outlined in the article).
Hi, the reason I have this expectation is that on a (cognitively) diverse team there will be a range of reactions that all need to be accommodated.
Some (many?) devs don't want agents, whether because the agent takes away the 'fun' part of their work, because they don't trust the agent, or because they truly do not find a use for it in their process.
I remember being on teams which only remained functional because two devs tried very hard to stay out of one another's way. Nothing wrong with either of them, their approach to the work was just not very compatible.
In the same way, I expect diverse teams to struggle to find a mode of adoption that does not negatively impact the existing styles of some members.
I was thinking it was more that LLMs, when used personally, can make huge refactorings and code changes that you review yourself and just check in, but with a team it's harder to make the kind of sweeping changes an LLM makes possible, because now everyone's changes start to conflict... but I guess that's not much of an issue in practice?
Reranking is definitely the way to go. We personally found common reranker models to be a little too opaque (can't explain to the user why this result was picked) and not quite steerable enough, so we just use another LLM for reranking.
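Roughly like the sketch below (purely illustrative; llm_complete is a hypothetical helper standing in for whatever chat-completion client you use). Asking for a one-sentence reason per result is what makes the ranking explainable to the user, and the prompt itself is where the steering happens.

    require 'json'

    # Rerank candidate documents for a query with a general-purpose LLM.
    # llm_complete(prompt) is a hypothetical helper that returns the model's
    # text response; swap in your provider's client.
    def rerank(query, candidates)
      numbered = candidates.each_with_index.map { |doc, i| "#{i}: #{doc}" }.join("\n")
      prompt = <<~PROMPT
        Rank these documents by relevance to the query "#{query}", most relevant first.
        #{numbered}
        Reply with JSON only: [{"index": <int>, "reason": "<one sentence>"}, ...]
      PROMPT

      JSON.parse(llm_complete(prompt)).map do |item|
        { doc: candidates[item['index']], reason: item['reason'] }
      end
    end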
We use Claude Code pretty aggressively at our startup, so we were naturally curious to compare the coding examples that OpenAI published today to Opus 4.1.
When you asked it to choose by picking a random number between 1 and 4, it skewed the results heavily toward 2 and 3. It could have interpreted your instructions to mean literally between 1 and 4, i.e. not inclusive of the endpoints.