I have a similar but opposite experience. Since around 2015 I've mostly been working with people who primarily use Emacs. In 2014 I was the only weird one, then the next team had about 3-5, then a dozen, then there was a team of a few dozen where only two were using Vim. On my current team, too, most of the devs are Emacs users. However, a lot of people use Emacs with Evil-mode, so I guess they can be considered vimmers.
Also, I don't remember the last time when I worked with anyone who writes code and uses Windows.
Anecdotal experiences can lead to a warped understanding of reality; in mine, Windows and non-emacs users are niche.
Don't y'all have a #emacs slack channel or equivalent at your company? I work for a medium-sized tech company and I feel like we have a single-digit number of emacs users. The channel is mostly dead except for a few tips and tricks and the occasional question about how we each install it on our macbooks.
Anecdotally a lot of managers use Emacs, though that may be an age thing.
(I use emacs for Real Work, unless that Real Work involves a JVM. Still do all the git stuff in emacs/magit, though)
Reddit was an interesting case here. They knew that they had particularly good AI training data, and they were able to hold it hostage from the Google crawler, which was an awfully high-risk play given how important Google search results are to Reddit ads, but they likely knew that Reddit search results were also really important to Google. I would love to be able to watch those negotiations from each side; what a crazy high-stakes negotiation that must've been.
Say what you will, but there are a lot of good answers on Reddit to real questions people have. There's a whole thing where people say "oh, Google search results are bad, but if you append the word 'REDDIT' to your search, you'll get the right answer." You can see that most of these agents rely pretty heavily on stuff they find on Reddit.
Of course, that's also a big reason why Google search results suggest putting glue on pizza.
This is an underrated comment. Yes, it's a big advantage and probably a measurable pain point for Anthropic and OpenAI. In fact you could just survey a 1% sample of robots.txt files out there and get a reasonable picture. Maybe a fun project for an HN'er.
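A rough sketch of what that survey could look like, using the Python stdlib's robotparser. The domain list here is a placeholder (you'd swap in a real 1% sample of whatever top-sites list you prefer), and the bot names are the crawlers' documented user-agent tokens:

    # Sample some domains' robots.txt and see which crawlers each one allows.
    # DOMAINS is a placeholder; swap in a real sample of domains.
    from urllib import robotparser

    DOMAINS = ["example.com", "example.org"]
    BOTS = ["Googlebot", "GPTBot", "ClaudeBot", "CCBot"]

    def crawler_access(domain):
        rp = robotparser.RobotFileParser(f"https://{domain}/robots.txt")
        try:
            rp.read()
        except Exception:
            return {bot: None for bot in BOTS}  # unreachable or no robots.txt
        return {bot: rp.can_fetch(bot, f"https://{domain}/") for bot in BOTS}

    for domain in DOMAINS:
        print(domain, crawler_access(domain))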
This is right on. I work for a company with somewhat of a data moat and AI aspirations. We spend a lot of time blocking everyone's bots except for Google's. We have people whose entire job it is to make it faster for Google to access our data. We exist because Google accesses our data. We can't not let them have it.
We've (ex-Google DeepMind researchers) been doing research on increasing the reliability of agents and realized it's pretty non-trivial, but there are a lot of techniques to improve it. The most important thing is doing rigorous evals that are representative of what your users do in your product. Often this is not the same as academic benchmarks. We made our own benchmarks to measure progress.
> The most important thing is doing rigorous evals that are representative of what your users do in your product. Often this is not the same as academic benchmarks.
OMFG thank you for saying this. As a core contributor to RA.Aid, optimizing it for SWE-bench seems like it would actively go against perf on real-world tasks. RA.Aid came about in the first place as a pragmatic programming tool (I created it while making another software startup, Fictie.) It works well because it was literally made and tested by making other software, and these days it mostly creates its own code.
Do you have any tips or suggestions on how to do more formalized evals, but on tasks that resemble real world tasks?
I would start by making the examples yourself initially, assuming you have a good sense of what that real-world task is. If you can't articulate what a good task is and what a good output is, it's not ready to be outsourced to crowd workers.
And before going to crowd-workers (maybe you can skip them entirely) try LLMs.
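To make the LLM step concrete, here's a minimal LLM-as-grader sketch, assuming an OpenAI-style client; the rubric, model name, and example task are all placeholders:

    # Minimal sketch: use an LLM to grade agent outputs against a rubric before
    # (or instead of) crowd-workers. Rubric, model name, and examples are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    RUBRIC = "Reply PASS if the output fully solves the task, otherwise reply FAIL."

    def llm_grade(task, output, model="gpt-4o-mini"):
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Task:\n{task}\n\nAgent output:\n{output}"},
            ],
        )
        return resp.choices[0].message.content.strip().startswith("PASS")

    examples = [("Rename function foo to bar across the repo", "agent transcript goes here")]
    passed = sum(llm_grade(task, output) for task, output in examples)
    print(f"{passed}/{len(examples)} passed")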
> I would start by making the examples yourself initially
What I'm doing right now is this:
1) I have X problem to solve using the coding agent.
2) I ask the agent to do X
3) I use my own brain: did the agent do it correctly?
If the agent did not do it correctly, I then ask: should the agent have been able to solve this? If so, I try to improve the agent so it's able to do that.
The hardest part about automating this is #3 above: each evaluation is a one-off, and it would be hard to even formalize the evaluation.
SWE-bench, for example, uses unit tests for this, and the agent is blind to them: it has to make a red test (which it has never seen) go green.
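A bare-bones harness for that setup, in the spirit of SWE-bench, might look like the sketch below. The task list and agent command are purely illustrative stand-ins:

    # Bare-bones eval harness: the agent never sees the hidden test; the harness
    # copies it in and runs it only after the agent finishes.
    # The task list and agent command are illustrative stand-ins.
    import shutil, subprocess, tempfile

    TASKS = [
        # (repo fixture, task prompt, hidden pytest file)
        ("fixtures/todo-app", "Fix the off-by-one bug in pagination", "hidden/test_pagination.py"),
    ]

    def run_agent(workdir, prompt):
        # Invoke your agent however you normally do; this CLI name is made up.
        subprocess.run(["my-coding-agent", "--cwd", workdir, prompt], check=True)

    def evaluate(repo, prompt, hidden_test):
        workdir = tempfile.mkdtemp()
        shutil.copytree(repo, workdir, dirs_exist_ok=True)
        run_agent(workdir, prompt)               # agent edits the working copy
        shutil.copy(hidden_test, workdir)        # only now does the test appear
        result = subprocess.run(["pytest", "-q", workdir], capture_output=True)
        return result.returncode == 0            # did the red test go green?

    passed = sum(evaluate(*task) for task in TASKS)
    print(f"{passed}/{len(TASKS)} tasks passed")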
Author of the post here. I'd say most of the examples generated by the best model were good. However, we chose examples that were not too gruesome, as news can be :)
We encourage you to try the code and see for yourself.
How does the model deal with dangling anaphora[1]? I wrote a summarizer for Spanish following a recent paper as a side project, and it looks as if I'll need a month of work to solve the issue.
[1] That is, the problem of selecting a sentence such as "He approved the motion" and then realising that "he" is now undefined.
Wouldn't it suffice to do a coreference pass before extracting sentences? Obviously you'll compound coref errors with the errors in your main logic, but that seems somewhat unavoidable.
I am working on this in my kbsportal.com NLP demo. With accurate coreference substitutions (e.g., substituting a previous NP like 'San Francisco' for 'there' in a later sentence, substituting full previously mentioned names for pronouns, etc.), extractive summarization should provide better results, and my intuition is that this preprocessing should help abstractive summarization also.
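Roughly, the pipeline I mean looks like the sketch below; the coref step is stubbed out since the choice of model varies (neuralcoref, fastcoref, an LLM call, ...), and the extraction here is just a toy frequency heuristic:

    # Resolve coreferences first, then extract, so a pulled-out sentence like
    # "He approved the motion" keeps its referent. resolve_coreferences() is a
    # stub; plug in whatever coref system you actually use.
    import re
    from collections import Counter

    def resolve_coreferences(text):
        # Stand-in for a real coref model: return text with pronouns and other
        # anaphors replaced by their antecedents.
        return text

    def extract_summary(text, n_sentences=3):
        resolved = resolve_coreferences(text)
        sentences = re.split(r"(?<=[.!?])\s+", resolved)
        freqs = Counter(w.lower() for w in re.findall(r"\w+", resolved))

        def score(sentence):
            tokens = re.findall(r"\w+", sentence.lower())
            return sum(freqs[t] for t in tokens) / (len(tokens) or 1)

        # Toy extractive scoring: prefer sentences built from frequent words.
        return sorted(sentences, key=score, reverse=True)[:n_sentences]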
>>"In those tasks training from scratch with this model architecture does not do as well as some other techniques we're researching, but it serves as a baseline."
Can you elaborate a little on that? Is the training the problem or is the model just not good at longer texts?