LLMs don't "read" text sequentially, right?

olliepro · 2025-10-22T23:14:03 1761174843

The causal masking means future tokens don’t affect previous tokens embeddings as they evolve throughout the model, but all tokens a processed in parallel… so, yes and no. See this previous HN post (https://news.ycombinator.com/item?id=45644328) about how bidirectional encoders are similar to diffusion’s non-linear way of generating text. Vision transformers use bidirectional encoding b/c of the non-causal nature of image pixels.

Merik · 2025-10-23T00:22:38 1761178958

Didn’t anthropic show that the models engage in a form of planning such that it is predicting a possible future subsequent tokens that then affects prediction of the next token: https://transformer-circuits.pub/2025/attribution-graphs/bio...

ACCount37 · 2025-10-23T01:06:10 1761181570

Sure, an LLM can start "preparing" for token N+4 at token N. But that doesn't change that the token N can't "see" N+1.

Causality is enforced in LLMs - past tokens can affect future tokens, but not the other way around.

anon291 · 2025-10-23T10:09:38 1761214178

If the attention is masked, then yes they do.