I live in that city. There are hardly any homeless people at all here. Not like other cities, at least. I could see it being a major problem in other places.
It does seem that it should be possible to offer "free buses" without having to also offer "free hotels inside of the free buses". As an example, I can go to a local store and experience free parking or go to my nearby town and park for free downtown. I can't, however, park and sleep overnight in my car in that shopping centre or in that town.
Who’s supposed to enforce it? Is the driver supposed to pull over and wake up a sleeping person who has a small but real chance of stabbing them? Any situation where they call the police could be quite a hassle for the other passengers.
Because you can’t make a subjective judgement about whether a particular passenger is worthy of a public resource. A car on private property eventually becomes trespassing.
Can you please not post in the flamewar style here, regardless of how wrong someone is or you feel they are?
It's always possible to make your substantive points thoughtfully, so please do that instead. You may not owe people who are wrong about use of buses by the homeless better, but you owe this community better if you're participating in it.
> also offer "free hotels inside of the free buses".
That is the spark that incited the flamewar style. There is nothing in the article about this specific point. Isn't it a bad-faith argument to insist that buses everywhere are used as hotels just because of a few bad experiences? The way the commenter paints all homeless people as dangerous, addicted to drugs, smelly, etc. is incredibly flamewar-ish and intended to push stigma onto the topic.
If the people who are pushing unfounded claims can’t be called out for it, then I guess the FUD spreaders win. The community doesn’t need me. Please scramble this username to something random. I’m out.
I hear you that there was a provocation in that bit. But it's a matter of degree. From my point of view, the GP comment may have been wrong and even bad (let's assume that's so), but by itself that doesn't break the site guidelines. People are allowed to be wrong in comments; it's up to the community to debate what's right vs. wrong and sort that out.
The way to respond to wrong comments is to refute them with better arguments and better information. This can be done without breaking the site guidelines. Of course there are downsides to this approach—it's a lot more time-consuming to patiently refute wrongness than to post it in the first place. But the downsides of breaking the site guidelines are much greater—that path basically leads to conflagration, and we'd like to avoid having this place burn to a crisp. Scorched earth is not interesting (https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...).
I would be in favour of (for example) someone who is attempting to “reside” on a bus being referred to a social worker who then sees to it that the person ends up in an appropriate shelter.
We are not. I don’t believe homeless people are “using the bus as a hotel”, because I actually ride buses, unlike the commenter, who is afraid and has probably never volunteered or talked to someone less fortunate in their community in their life.
Their username is literally trollbridge! I mean come on.
This has definitely been true for my work. LLMs have absolutely been useful; I even forked an IDE (Zed) to add my own custom copilot and leverage a deeper integration for my work.
But even if we consider AI beyond just NLP, there's been so much ML you can apply to other more banal day to day tasks. In my org's case, one of the big ones was anomaly detection and fault localization in aggregate network telemetry data. Worked far better than conventional statistical modeling.
I usually assume there is a caricature of "AI tools" that all of the detractors are working backwards from, one that often reads as nothing more than a canard to the folks who are actually using AI tooling successfully in their work.
We never have any proof or source for that, and when we have one (like the Devin thing) it’s always a ridiculous project in JS with a hundred lines of code that I could write in one day.
Give me some refactoring in a C++ code base with 100k lines of code, and we’ll be able to talk.
Anything involving tools you are not an expert with. If you know how to do things and only use one specific language or framework -- there is nothing to use AI for.
This whole area is so drenched in bullshit, it's no wonder that the generation of BS and fluff is still the most productive use. Just nothing where reliable facts matter. I do believe that machines can vomit crap 10x as fast as humans.
I had to sign a 140-page contract of foreign-language legalese. Mostly boilerplate, but I had specific questions about it.
Asking questions to an AI to get the specific page answering it meant I could do the job in 2 hours. Without an AI, it would have taken me 2 days.
For programming, it's very good at creating boilerplate, tests, docs, generic API endpoints, script argument parsing, script one-liners, etc. Basically anything where I, as a human, don't add much value.
It's much faster to generate imperfect things with AI and fix them than to write them myself when there is a lot of volume.
It's also pretty good at fixing typos, translating, giving word definitions, and so on. Meaning if you are already in the chat, there's no need to switch to a dedicated tool.
I don't personally get 10x on average (although on specific well-suited tasks I can), but I can get a good 3x on a regular basis.
But what you're doing isn't a real job. Who hands someone who doesn't speak the language a contract to sign? Don't you have a legal department that does this job for you and has people who are specialists in that?
Also, what are you going to do if the AI answered inaccurately and you signed a contract that says something different from what you thought?
I am actually pretty sure that the thing described literally isn’t a real job, at least not working for a serious employer. I can’t imagine a company telling someone to sign contracts in a language they can’t speak and somehow try to make sense of them.
Either it’s their own company and they’re doing something unwise, they are doing it without the knowledge of their superior or their company shouldn’t be trusted with anything.
The point was that „AI helps me translate the contracts I want to sign“ isn’t a good example of „AI increases my productivity“ because that’s not something you should ever do.
But you shouldn't do some stuff you can't do properly at all, not quickly and not slowly. As a layman, you can't sign a contract in a language you don't speak, even if you have a whole year, unless you can become more-than-fluent in that language in a single year. That's just not something you should do, and the AI isn't reliable enough to help you with it. That's what a legal department is for.
I would never in my whole life sign anything in a foreign language that I don’t understand. It’s the perfect example of what AI is: let’s do anything that looks like a job well done and fuck it. That is not convincing. It’s suicidal.
HTMX and Shoelace are an awesome combo. Super fast to prototype things and tweak as needed. Being able to copy-paste snippets and directly inject data in a straightforward way is a nice way of working. It limits cognitive overhead so you can focus on the domain logic rather than fighting JavaScript dependencies.
Don't forget to finetune the reranker too if you end up doing the embedding model. That tends to have outsized effects on performance for out of distribution content.
I don't think autoregressive models have a fundamental difference in reasoning capability in latent space vs token space. Latent space enables abstract reasoning and pattern recognition, while token space acts as both the discrete interface for communication and an interaction medium to extend, refine and synthesize higher-order reasoning over latent space.
Intuitively speaking, most people think of writing as a communication tool. But actually it's also a thinking tool that helps create deeper connections over discrete thoughts, which can only occupy a fixed slice of our attention at any given time. Attentional capacity is the primary limitation-- for humans and LLMs. So use the token space as extended working memory. Besides, even the Coconut paper got mediocre results. I don't think this is the way.
If uncertainty is an important signal, then a model RL-conditioned to perform good CoT should be expected to learn how to encode an uncertainty side-channel in its CoT.
If we're fortunate it'll do so using language choice that would also convey uncertainty to humans. Before you complain that English uncertainty has poor precision, consider that nothing prevents the LLM from overloading it with a more precise meaning. Like how "MAY" in an RFC means something much more concrete than in general English. Though unless somehow conditioned for it the uncertainty signal could be something else entirely (including, perhaps, sounding more certain).
This also goes for pretty much any other side information you might hope could be conveyed.
The fundamental challenge of using log probabilities to measure LLM certainty is the mismatch between how language models process information and how semantic meaning actually works. Current models analyze text token by token-- fragments that don't necessarily align with complete words, let alone complex concepts or ideas.
This creates a gap between the mechanical measurement of certainty and true understanding, much like mistaking the map for the territory or confusing the finger pointing at the moon with the moon itself.
I've done some work before in this space, trying to come up with different useful measures from the logprobs, such as Shannon entropy over a sliding window, or even the bzip compression ratio as a proxy for information density. But I didn't find anything semantically useful or reliable to exploit.
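For reference, here's a rough sketch of the sliding-window entropy idea (not the code I actually used): it assumes you already have, for each generated token, a dict of top-k logprobs from whatever API you're calling, and the function names are just illustrative.

    import math

    def token_entropy(top_logprobs):
        # Shannon entropy (in nats) of one token's top-k distribution,
        # renormalized because only the top-k probability mass is available.
        probs = [math.exp(lp) for lp in top_logprobs.values()]
        total = sum(probs)
        return -sum((p / total) * math.log(p / total) for p in probs)

    def sliding_window_entropy(per_token_top_logprobs, window=16):
        # Mean per-token entropy over a sliding window, as a rough
        # "how uncertain was the model around here" trace.
        ents = [token_entropy(t) for t in per_token_top_logprobs]
        return [sum(ents[i:i + window]) / len(ents[i:i + window])
                for i in range(max(1, len(ents) - window + 1))]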
The best approach I found was just multiple choice questions: "Does X entail Y? Please output [A] True or [B] False." Then measure the logprobs of the next token, which should be `[A` (90%) or `[B` (10%). Then we might make a statement like: the LLM thinks there is a 90% probability that X entails Y.
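That trick is only a few lines once you have the top logprobs for the single next token after the prompt; here's a minimal sketch (the helper name and the example numbers are made up, and the answer tokens obviously depend on your prompt format).

    import math

    def entailment_probability(top_logprobs, yes_token="[A", no_token="[B"):
        # top_logprobs: {token: logprob} for the next token after the
        # multiple-choice prompt. Normalize over just the two answer tokens.
        p_yes = math.exp(top_logprobs.get(yes_token, float("-inf")))
        p_no = math.exp(top_logprobs.get(no_token, float("-inf")))
        total = p_yes + p_no
        return p_yes / total if total > 0 else 0.5

    # e.g. {"[A": -0.105, "[B": -2.303} -> ~0.90, i.e. "the LLM thinks
    # there is a 90% probability that X entails Y"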
That has been my understanding too. More generally, a verifier at the end certainly helps.
In our paper [1], we find that asking a follow-up question like "Is the answer correct?" and taking the normalized probability of the "Yes" or "No" token (or, more generally, any such token trained for) seems to be the best bet so far to get well-calibrated probabilities out of the model.
In general, the log-probability of tokens is not a good indicator of anything other than satisfying the pre-training loss function of predicting the "next token" (it likely is very well-calibrated on that task, though). Semantics of language are a much less tamable object, especially when we don't quite have a good way to estimate a normalizing constant, because every answer can be paraphrased in many ways and still be correct. The volume of correct answers in the generation space of a language model is just too small.
There is work that shows one way to approximate the normalizing constant via SMC [2], but I believe we are more likely to benefit from having a verifier at train-time than any other approach.
And there are stop-gap solutions to make log probabilities more reliable by only computing them on "relevant" tokens, e.g. only final numerical answer tokens for a math problem [3]. But this approach kind of side-steps the problem of actually trying to find relevant tokens. Perhaps something more in the spirit of System 2 attention which selects meaningful tokens for the generated output would be more promising [4].
You and the OP talk a lot of smack about logprobs but we show that using them in even the simple case of dynamic truncation of your cutoff point (min_p sampler vs static top_p/top_k) leads to extreme performance improvements (especially on small models) and unlocks very high temperature sampling (for more creativity/less slop/better synthetic data-gen): https://arxiv.org/abs/2407.01082 [1].
Indeed, ultra-high-temperature sampling in its own right should be studied. I can do top_k = 2 and temperature = system.maxint and get decent results which are extraordinarily creative (with increasing probability of token-related spelling issues as top_k goes up).
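For anyone who hasn't seen it, min-p truncation itself is only a few lines. This is my rough sketch over a raw next-token logit vector, not the paper's reference code, and the filter-before-temperature ordering is my reading of why the very high temperatures stay coherent.

    import numpy as np

    def min_p_sample(logits, min_p=0.1, temperature=3.0, rng=None):
        # logits: 1-D numpy array of next-token logits.
        # Keep only tokens whose probability is at least min_p * p_max
        # (so the cutoff scales with how confident the model is), then
        # sample the survivors at the given temperature.
        rng = rng or np.random.default_rng()
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        keep = probs >= min_p * probs.max()
        scaled = np.where(keep, logits, -np.inf) / temperature
        scaled -= scaled.max()
        p = np.exp(scaled)
        p /= p.sum()
        return int(rng.choice(len(logits), p=p))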
I'm convinced that the model's logprobs hold so much bloody value and knowledge that I unironically do not care about how many "theoretical guarantees" they lack or about their non-correspondence to our usage of language.
[1]: Btw, this paper is now ICLR 2025 accepted and likely going to get an oral/honorable mention since we are ranked #18 out of all submissions by scores and have extremely favorable meta-review. Peer review seems to agree with our claims of extreme performance improvements.
Congratulations on the strong reception of min-p. Very clever!
We may be talking about two orthogonal things here. And also to be clear, I don't care about theoretical guarantees either.
Now, min-p is solving for the inadequacies of standard sampling techniques. It is almost like a clever adaptive search which other sampling methods fail at (despite truncations like top-k/top-p).
However, one thing that I noticed in the min-p results was that lower temperatures were almost always better in the final performance (and, quite expectedly, the inverse for creative writing). This observation makes me think that the underlying model is generally fairly good at ranking the best tokens. What sampling allows us is a margin for error in cases where the model ranked a relevant next token not at the top, but slightly lower.
Therefore, my takeaway from min-p is that it solves for deficiencies of current samplers but its success is not in contradiction to the fact that logprobs are bad proxies for semantics. Sampling is the simplest form of search, and I agree with you that better sampling methods are a solid ingredient to extract information from logprobs.
Interesting, I had never heard about min-p until now. From what I understand, it's like a low-pass filter for the token sampling pool which boosts semantic coherence. Like removing static from the radio.
Do you have any benchmarks of min-p sampling with the new reasoning models, such as QwQ and R1?
Unfortunately, LLMs are a gigantic monster to understand. We were considering the same sliding-window approach, and we will try to keep the library updated with better and more reliable approaches based on new research papers and our internal tests.
My take: the distills under 32B aren’t worth running. Quantization seems to impact quality much more than it does for other models. 32B and 70B unquantized are very good. 671B is SOTA.