Upshot: Steve thinks he’s built a quality task tracker/work system (beads), and is iterating on architectures, and has gotten convinced an architecture-builder is going to make sense.
Meanwhile, work output is going to improve independently. The bet is that leverage on the top side is going to be the key factor.
To co-believe this with Steve, you have to believe that workers can self-stabilize (e.g. with something like the Wiggum loop you can get some actual quality out of them, unsupervised by a human), and that their coordinators can self-stabilize.
If you believe those to be true, then you’re going to be eyeing 100-1000x productivity just because you get to multiply 10 coordinators by 10 workers.
I’ll say that I’m generally bought into this math. Anecdotally, over the last two months I’ve spent about half my coding-agent time asking for easy in-roads into what’s already been done; a year ago I spent 10% specifying and 90% complaining about bugs.
For example, I just pulled up an old project and asked for a status report, and got one based on the existing beads. I asked it to verify, and it ran the program and came back with a fairly high-quality verified status report. I then asked it to read the output (a PDF); it read the PDF, noticed my main complaints, and issued 20 or so beads to get things into the right shape. I had no real complaints about the response or the workplan.
I haven’t said “go” yet, but I presume when I do, I’m going to be basically checking work, and encouraging that work checking I’m doing to get automated as well.
There’s a sort of not-obvious thing that happens as we move from 0.5 9s to, say, 3 9s in terms of effectiveness: the amount of work you get between interventions grows by roughly 2.5 orders of magnitude, a few hundred times as much work per intervention. It’s a little hard to believe unless you’ve really poked around, but I think it’s coming pretty soon, as does Steve.
Who, nota bene, is working at such a pace that he’s turning down 20 VCs a week, selling memecoin earnings in the hundreds of thousands of dollars, and randomly ‘napping’ in the middle of the day. Stay rested Steve, keep on this side of the manic curve please, we need you. I’d say it’s a good sign he didn’t buy any GAS token himself.
> Stay rested Steve, keep on this side of the manic curve please, we need you
This is my biggest takeaway. He may or may not be on to something really big, but regardless, it's advancing the conversation and we're all learning from it. He is clearly kicking ass at something.
I would definitely prefer to see this be a well paced marathon rather than a series of trips and falls. It needs time to play out.
Yep, it works. Like anything, getting the most out of these tools is its own (human) skill.
With that in mind, a couple of comments: think of the coding agents as personalities with blind spots. A code review by all of them plus a synthesis step is a good idea. In fact, currently popular is the “rule of 5”, which suggests having the LLM review five times and varying the focus of each pass, e.g. bugs, architecture, structure, etc. Anecdotally, I find this extremely effective.
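To make that concrete, here’s a minimal sketch of the five-pass review in Python. It assumes a non-interactive agent CLI; I’m using `claude -p "<prompt>"` (print mode) as the example invocation, so treat the command and flag as assumptions and swap in whatever harness you actually run:

```python
# Five focused review passes plus a synthesis pass ("rule of 5").
# The `claude -p` print-mode invocation is an assumption; substitute your harness.
import subprocess

REVIEW_PASSES = [
    "bugs and incorrect logic",
    "architecture and module boundaries",
    "code structure and duplication",
    "tests that are missing, incorrect, or tautological",
    "spec compliance and usability",
]

reviews = []
for focus in REVIEW_PASSES:
    prompt = (
        f"Review this repository strictly for {focus}. "
        "List concrete findings with file and line references."
    )
    result = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, check=True
    )
    reviews.append(f"--- {focus} ---\n{result.stdout}")

# Synthesis step: merge the five passes into one prioritized list of findings.
synthesis = (
    "Synthesize these review passes into a single prioritized list of findings:\n\n"
    + "\n\n".join(reviews)
)
subprocess.run(["claude", "-p", synthesis], check=True)
```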
Right now, Claude is in my opinion the best coding agent out there. With Claude Code, the best harnesses are starting to automate the review/PR process a bit, but the hand-holding around bugs is real.
I also really like Yegge’s beads for helping LLMs keep state and track what they’re doing. Upshot: I suggest you install beads, load Claude, run ‘!bd prime’, and say “Give me a full, thorough code review for all sorts of bugs, architecture, incorrect tests, specification, usability, code bugs, plus anything else you see, and write out beads based on your findings.” Then you could have Claude (or Codex) work through them. But you’ll probably find a fresh eye saves time, e.g. give Claude a try for a day.
Your ‘duplicated code’ complaint is likely an artifact of how Codex interacts with your codebase: Codex in particular likes to load smaller chunks of code to work on, and sometimes it gets too little context. You can always just cat the relevant files right into the context, which can be helpful.
Finally, iOS is a tough target; I’d expect a few more bumps. The vast bulk of iOS apps are not up on GitHub, so the coding models have less facility with them.
And front-end work doesn’t really have good native visual harnesses set up (although Claude has the Claude Chrome extension for web UIs). So there’s going to be more back and forth.
Anyway - if you’re a career engineer, I’d tell you - learn this stuff. It’s going to be how you work in very short order. If you’re a hobbyist, have a good time and do whatever you want.
I still don't get what beads needs a daemon for, or a db. After a while of using 'bd --no-daemon --no-db' I got sick of it and switched to beans, and my agents seem to make much better use of it: on the one hand it's directly editable by them, since it's just markdown; on the other hand the CLI still gives them structure and makes the thing queryable.
Steve runs beads across like 100 coding environments simultaneously. So, you need some sort of coordination, whether that's your db or a daemon. Realistically with 100 simultaneous connections, I would probably reach for both myself. I haven't tried beans, thanks for the reference.
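To illustrate why (a sketch, not beads' actual schema or implementation): with many concurrent workers you want something that serializes writes and makes claims atomic, which a db gives you almost for free and a pile of markdown files does not.

```python
# Sketch of why a shared db (or daemon) matters once many agents run at once.
# Not beads' actual implementation: just showing that SQLite serializes writers,
# so 100 workers can claim issues atomically, unlike concurrent edits to a
# shared markdown file.
import sqlite3

conn = sqlite3.connect("work.db", timeout=30)
conn.execute(
    "CREATE TABLE IF NOT EXISTS issues (id TEXT PRIMARY KEY, claimed_by TEXT)"
)
conn.commit()

def claim(issue_id: str, worker: str) -> bool:
    """Atomically claim an issue; False means another worker got there first."""
    with conn:  # wraps the UPDATE in a transaction
        cur = conn.execute(
            "UPDATE issues SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
            (worker, issue_id),
        )
    return cur.rowcount == 1
```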
Yeah, it does make sense that these choices come from beads being a big part of Gas Town. Still, I feel it would be much more sensible to make a different abstraction, separating beads' core features from the coordination layer.
It's 100% both sides. We haven't had a president work to roll back his own power, since ... Hmm. Maybe Gerald Ford? I guess Carter was fairly principled on some of this.
This part of the system - executive power grabs - is supposed to be curtailed by the courts first and congress second in the US system.
> We haven't had a president work to roll back his own power,
This is just not true. For example, all of the following happened under the Obama administration:
* The order to close Guantanamo Bay and the CIA black sites, the prohibition of torture as an interrogation method (including updates to the Army Field Manual), and mandatory Red Cross access to any detainee all represented a significant reduction in executive power in how we treat detainees.
* Following the Snowden leaks, several actions were taken to curtail executive power in applying surveillance programs to both US citizens and non-US persons. These also rolled back several components of the PATRIOT Act (passed under his predecessor we all know and love, Dubya).
* The signing-statements reform meant the executive no longer had an effective line-item veto.
* The AG under Obama implemented a new DoJ policy limiting the use of the "state secrets" privilege in litigation.
I agree those things happened, but they were not rollbacks of executive power. That was Obama using executive power to rein in bad policy, not ceding the power entirely.
Of course perhaps he couldn’t. Congress needs to do that, and the courts, and neither seem interested in doing their job. Lower courts sometimes step up but the Supreme Court seems to be on the side of a dictatorial executive for some time now.
What does Congress even do these days? Seems like half crackpot debate club and half hospice care facility.
There are light years of space between the behavior we're seeing now and "a president working to roll back his own power," and even that has arguably happened in many presidencies, depending on what you mean. You would need much more than that to demonstrate anything approaching behavioral parity on this dimension. Otherwise - yes, politicians from every party, forever, everywhere, exhibit some similar faults.
A lot of this depends on your workflow. A language with great typing, type checking, and good compiler errors will work better in a loop than one with a lot of surface overhead and syntax complexity, even if the latter is well represented. This is the instinct behind, e.g., https://github.com/toon-format/toon, a JSON-alternative format. They test LLM accuracy with the format against JSON (and generally come out slightly ahead of JSON).
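For a feel of where the savings come from, here’s an illustrative comparison. The TOON string below is an approximation from memory of the project’s README, so check the repo for the exact syntax; the point is just that the tabular array form states the keys once instead of repeating them per record:

```python
# Illustrative only: the same records as JSON vs. an approximate TOON rendering.
# The TOON string is an assumption based on my reading of the toon-format README;
# check https://github.com/toon-format/toon for the exact syntax.
import json

records = {"users": [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]}

as_json = json.dumps(records, indent=2)

as_toon = """\
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
"""

# The tabular array form declares the keys once, which is where most of the
# token savings over repeated JSON keys come from.
print(len(as_json), len(as_toon))
```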
Additionally just the ability to put an entire language into context for an LLM - a single document explaining everything - is also likely to close the gap.
I was skimming some nano files and while I can't say I loved how it looked, it did look extremely clear. Likely a benefit.
Thanks for sharing this! A question I've grappled with is "how do you make the DOM of a rendered webpage optimal for complex retrieval in both accuracy and tokens?" This could be a really useful transformation to throw in the mix!
Looks like solid incremental improvements. The UI one-shot demos are a big improvement over 4.6. Open models continue to lag roughly a year behind on benchmarks; pretty exciting over the long term. As always, GLM is really big: 355B parameters with 31B active, so it’s a tough one to self-host. It’s a good candidate for a Cerebras endpoint in my mind; getting Sonnet 4.x (x<5) quality with ultra-low latency seems appealing.
I tried Cerebras with GLM-4.7 (not Flash) yesterday using paid API credits ($10). They have per-minute rate limits and count cached tokens against them, so you get limited in the first few seconds of every minute and then have to wait out the rest of the minute. So they're "fast" at 1000 tok/sec, but not really for practical usage. With the rate limits and the penalty for cached tokens, you effectively get <50 tok/sec.
They also charge full price for the same cached tokens on every request/response, so I burned through $4 for one relatively simple coding task. It would've cost <$0.50 using GPT-5.2-Codex or any other model that supports caching (besides Opus and maybe Sonnet), and it would've been much faster.
The pay-per-use API sucks. If you end up on the $50/mo plan, it's better, with caveats:
1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count in full, so if you have 100,000 tokens of context you can burn a minute's worth of tokens in a few requests.
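Back-of-the-envelope with those numbers (the 100k context size is just an assumed figure for a long-running agent session):

```python
# The quoted plan limits, with cached context counted at full weight per request.
TOKENS_PER_MINUTE = 1_000_000
TOKENS_PER_DAY = 24_000_000
CONTEXT_TOKENS = 100_000  # assumed size of a long-running agent context

requests_per_minute = TOKENS_PER_MINUTE // CONTEXT_TOKENS  # 10
requests_per_day = TOKENS_PER_DAY // CONTEXT_TOKENS        # 240

print(requests_per_minute, requests_per_day)
# So even at 1000 tok/s of raw generation speed, a 100k-token context caps you
# at roughly 10 requests before you sit out the rest of the minute.
```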
Not really worth it, in general. It does reduce latency a little. In practice, you do have a continuing context, though, so you end up using it whether you care or not.
In general, with per minute rate limiting you limit load spikes, and load spikes are what you pay for: they force you to ramp up your capacity, and usually you are then slow to ramp down to avoid paying the ramp up cost too many times. A VM might boot relatively fast, but loading a large model into GPU memory takes time.
I use GLM 4.7 with DeepInfra.com and it's extremely reasonable, though maybe a bit on the slower side. But faster than DeepSeek 3.2 and about the same quality.
It's even cheaper to just use it through z.ai themselves I think.
I know this might not be the most effective use case, but I ended up using the 'try AI' feature on Cerebras, which opens a window in the browser.
Yes, it has some restrictions as well, but it still works for free. I have a private repository where I ended up creating a Puppeteer instance so I can input something from a CLI and get the output back in the CLI as well.
With current agents, I don't see why I couldn't just expand that with a cheap model (I think MiniMax 2.1 is pretty good for agents) and have the agent write the files, do the work, and run in a loop.
I think the repository might have gotten deleted after I reset my old system, but I can look for it if this interests you.
Cerebras is such a good company. I talked to their CEO on Discord once and have been following them for the past year or two. I hope they don't get enshittified by the recent OpenAI deal, and that they improve their developer experience, because people want to pay them; instead I had to resort to a shenanigan that was free (though honestly I was mostly curious whether the Puppeteer idea was even possible, and I didn't use it much after building it).
I hear this said, but never substantiated. Indeed, I think our big issue right now is making actual benchmarks relevant to our own workloads.
Due to US foreign policy, I quit Claude yesterday and picked up MiniMax M2.1. We wrote a whole design spec for a project I’d previously written a spec for with Claude (with some changes to the architecture this time; adjacent, not the same).
My gut feel? I prefer MiniMax M2.1 with opencode to Claude. Easiest boycott ever.
(I even picked the $10 plan; it was fine for now.)
FWIW this is what Linux and the early open-source databases (e.g. PostgreSQL and MySQL) did.
They usually lagged for large sets of users: Linux was not as advanced as Solaris, PostgreSQL lacked important features contained in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up or 2) improvement continues, but smaller subsets of people need the further improvements so the OSS becomes "good enough." (This is similar to how most people now do not pay attention to CPU speeds because they got "fast enough" for most people well over a decade ago.)
DeepSeek 3.2 scores gold at the IMO and others. Google had to use parallel reasoning to do that with Gemini, and the public version still only achieves silver.
I wasn't judging, I was asking how it works. Why would OpenAI/Anthropic/Google let a competitor scrape their results in sufficient amounts to let them train their own thing?
I think the point is that they can't really stop it. Let's say that I purchase API credits and then resell them to DeepSeek.
That's going to be pretty hard for OpenAI to figure out, and even if they do figure it out and stop me, there will be thousands of other companies willing to do that arbitrage. (Just for the record, I'm not doing this, but I'm sure people are.)
They would need to be very restrictive about who is allowed to use the API, and that would kill their growth, because then customers would just go to Google or another provider that is less restrictive.
Speculation, I think, because for one, those supposed proxy providers would have to offer some kind of pricing advantage over the original provider. Maybe I missed them, but where are the X0%-cheaper SOTA-model proxies?
Number two, I'm not sure that random samples collected across even a moderately large number of users make a great base of training examples for distillation. I would expect they need more focused samples over very specific areas to achieve good results.
Thanks. In that case my conclusion is that all the people saying these models are "distilling SOTA models" are, by extension, also speculating. How can you distill what you don't have?
The only way I can think of is paying to synthesize training data using SOTA models yourself. But yeah, I'm not aware of anyone publicly sharing that they did, so that's also speculation.
The economics probably work out, though: collecting, cleaning, and preparing original datasets is very cumbersome.
What we do know for sure is that the SOTA providers are distilling their own models, I remember reading about this at least for Gemini (Flash is distilled) and Meta.
Note that this is the Flash variant, which is only 31B parameters in total.
And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.
Probably the wrong attitude here - beads is infra for your coding agents, not you. The most I directly interact with it is by invoking `bd prime` at the start of some sessions if the LLM hasn’t gotten the message; maybe very occasionally running `bd ready` — but really it’s a planning tool and work scheduler for the agents, not the human.
What agent do you use it with, out of curiosity?
At any rate, to directly answer your question, I used it this weekend like this:
“Make a tool that lets me ink on a remarkable tablet and capture the inking output on a remote server; I want that to send off the inking to a VLM of some sort, and parse the writing into a request; send that request and any information we get to nanobanana pro, and then inject the image back onto the remarkable. Use beads to plan this.”
We had a few more conversations, but got a workable v1 out of this five hours later.
Counterpoint: you can go much faster if you get lots of people engaging with something and testing it. This is exploratory work, not some sort of ivory-tower rationalism exercise (if those even truly exist); there's no compulsion involved, so everyone engaged does so for self-motivated reasons.
Don’t be mad!
Also, beads is genuinely useful. In my estimation, Gas Town, or a successor built on a similar architecture, will not only be useful but will likely be considered ‘state of the art’ for at least a month sometime in the future. We should be glad this stuff is developed in the open, in my opinion.
It is worth an install; it works very differently than an agent in a single loop.
Beads formalizes building a DAG for a given workload. This has a bunch of implications, but one is that you can specify larger workloads and the agents won’t get stuck or confused. At some level, Gas Town is a bunch of scaffolding around the benefits of beads; an orchestrator that is native to beads opens up many more benefits than one that isn’t custom-coded for it.
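A toy illustration of what “formalizing the DAG” buys you; this is not beads’ actual data model, just the dependency/ready-set idea that a command like `bd ready` presumably surfaces:

```python
# Toy sketch of a beads-style work DAG: issues with blocking dependencies and a
# "ready" computation (roughly what a command like `bd ready` surfaces).
# Illustrative only, not beads' actual data model.
from dataclasses import dataclass, field

@dataclass
class Bead:
    bead_id: str
    title: str
    blocked_by: set = field(default_factory=set)
    done: bool = False

beads = {
    "b1": Bead("b1", "Design schema"),
    "b2": Bead("b2", "Implement parser", blocked_by={"b1"}),
    "b3": Bead("b3", "Write parser tests", blocked_by={"b2"}),
    "b4": Bead("b4", "Update docs", blocked_by={"b1"}),
}

def ready(graph):
    """Open beads whose blockers are all done; safe to hand to worker agents."""
    return [
        b for b in graph.values()
        if not b.done and all(graph[dep].done for dep in b.blocked_by)
    ]

beads["b1"].done = True
print([b.title for b in ready(beads)])  # parser and docs can now run in parallel
```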
Think of a human needing to be interacted with as a ‘fault’ in an agentic coding system. A copilot agent might be at 0.5 9s or so, meaning very roughly half of tasks can complete without intervention, given a certain set of tasks. All the Gas Town scaffolding is trying to increase the number of 9s, and the size of the task that can be given.
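The nines framing is just failure-rate arithmetic; a quick illustrative sketch of how the space between interventions scales (going from 0.5 nines to 3 nines multiplies it by roughly 10^2.5, a few hundred times):

```python
# "Nines" of unattended completion -> expected tasks between human interventions.
def success_rate(nines: float) -> float:
    return 1 - 10 ** (-nines)

for nines in (0.5, 1.0, 2.0, 3.0):
    p = success_rate(nines)
    tasks_between_interventions = 1 / (1 - p)
    print(f"{nines} nines: {p:.1%} unattended, "
          f"~{tasks_between_interventions:.0f} tasks per intervention")
```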
My take: Gas Town (as an architecture) certainly has more nines in it than a single agent; the rest is just a lot of fun experimentation.
Yes he is on an extended manic episode right now - we can only sit back and enjoy the fruits of his extreme labor. I expect the dust will settle at some point, and I think he’s right that he’s on to some quality architecture.