Depending on the shape of the data, a slightly different kernel implementation (e.g. for matrix multiplication) will be optimal, and those implementations can give slightly different results. There can also be other sources of non-determinism depending on the implementation (e.g. some kernels are inherently non-deterministic because they use tricks to go faster, such as atomic adds whose ordering varies between runs).
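For anyone who wants to see this concretely, here's a minimal NumPy sketch (no inference stack involved, just the underlying floating-point behavior): the same dot product summed in different chunk sizes, which is roughly what different kernel tilings do internally, lands on slightly different values.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

# Same mathematical dot product, three different summation orders,
# mimicking what different kernel tilings/splits do internally.
full = np.dot(a, b)
chunked_64 = sum(np.dot(a[i:i+64], b[i:i+64]) for i in range(0, 4096, 64))
chunked_512 = sum(np.dot(a[i:i+512], b[i:i+512]) for i in range(0, 4096, 512))

print(full, chunked_64, chunked_512)
# float32 addition is not associative, so these typically differ in the
# last bits -- enough to flip a near-tied argmax over logits.
```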
Yep, this. I see a lot of other worryingly confident answers in the thread that are wrong.
SGLang finally has at least some notes[0], but I’m always surprised there isn’t more of a community-wide effort to track down the sources of non-determinism.
In my experience with other regular models, once the context starts to fill up, quality starts to degrade.
Wouldn't getting batched at the end of a batch have a similar effect on the results, where your prompt might receive less attention overall, if the context window is almost full?
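It wouldn't change how much attention your prompt gets, but batch composition can change the numerics: the matmul shapes depend on the whole batch, and a different shape can select a different kernel or tiling. A rough PyTorch sketch of the idea (hypothetical shapes; on CPU the rows often do match bitwise, while on GPU at realistic sizes they frequently don't):

```python
import torch

torch.manual_seed(0)
W = torch.randn(1024, 1024)
x = torch.randn(1, 1024)
others = torch.randn(31, 1024)

alone = x @ W                                 # your request on its own
in_batch = (torch.cat([x, others]) @ W)[0:1]  # same request, batch of 32

# Mathematically identical rows; whether they're bitwise identical
# depends on which kernel the backend picks for each shape.
print(torch.equal(alone, in_batch))
```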