
Good questions.

The first one: transformers are "permutation invariant" by nature, so if you permute the input and apply the opposite permutation to the output you get exactly the same thing. The transformer itself has no positional information. RNNs, by comparison, have positional information by design: they go token by token, whereas the transformer is parallel and all tokens are just independent "channels". So what can be done? You put positional embeddings in it, either by adding them to the tokens (concatenation was also ok, but less efficient) or by inserting relative-distance biases into the attention matrix. It's a fix to make the model understand time. It's still puzzling that this works, because mixing text information with position information in the same vector seems like it should cause a conflict, but in practice it doesn't. The model learns to use the embedding vector for both, perhaps specialising one part for semantics and another for position.
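For concreteness, a minimal NumPy sketch of the "add sinusoidal embeddings to the tokens" variant from the original paper (dimensions are illustrative, and it assumes an even d_model):

  import numpy as np

  def sinusoidal_positions(seq_len, d_model):
      # angle(pos, i) = pos / 10000^(2i / d_model), as in "Attention Is All You Need"
      pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
      i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
      angles = pos / np.power(10000.0, 2 * i / d_model)
      pe = np.zeros((seq_len, d_model))
      pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
      pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
      return pe

  # Adding (rather than concatenating) keeps the model width unchanged.
  seq_len, d_model = 16, 64
  token_embeddings = np.random.randn(seq_len, d_model)  # stand-in for learned embeddings
  x = token_embeddings + sinusoidal_positions(seq_len, d_model)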

The second question: neural nets find a way to differentiate the keys from the queries simply by doing gradient descent. If we tell the model it should generate a specific token here, then it has to adjust the keys and queries to make that happen. The architecture is pretty dumb, the secret is the training data - everything the transformer learns comes from the training set. We should think about the training data when we marvel at what transformers can do. The architecture alone doesn't tell us why they work so well.
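To illustrate the point: in a single attention head, the query, key and value projections are all just linear maps of the same shape, and nothing but the training signal pushes W_q and W_k into different roles. A rough NumPy sketch (shapes and init scale are made up for illustration):

  import numpy as np

  d_model, d_k = 64, 64
  rng = np.random.default_rng(0)

  # Three structurally identical matrices; only gradient descent differentiates them.
  W_q = rng.normal(size=(d_model, d_k)) * 0.02
  W_k = rng.normal(size=(d_model, d_k)) * 0.02
  W_v = rng.normal(size=(d_model, d_k)) * 0.02

  def attention(x):
      q, k, v = x @ W_q, x @ W_k, x @ W_v
      scores = q @ k.T / np.sqrt(d_k)                   # (seq, seq) similarities
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
      return weights @ v

  x = rng.normal(size=(10, d_model))
  out = attention(x)                                    # (10, d_k)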



With regard to the "It's still puzzling this works" comment about positional encoding, I have developed an intuition (that may be very wrong ;-). If you take the Fourier transform of a linear or sawtooth function (akin to the progress of time), I think you get something that resembles the positional encoding in the original transformer. EDIT: fixed typo


This is a good intuition. It reminds me a bit of old-school hand-rolled feature engineering used in time-series modelling: assuming the signal is made up of a stationary component plus a sine wave. Though I haven't managed to figure out mathematically whether the two are equivalent.


> The architecture is pretty dumb, the secret is the training data

If this were true, we could throw the same training data at any other "dumb" architecture and it would learn language at least as well and as fast as transformers do. But we don't see that happening, so the architecture must be smartly designed for the purpose.


Actually there are alternatives by the hundreds, with similar results: Reformer, Linformer, Performer, Longformer... None is better than vanilla attention overall; they each have an edge in some use case.

And then there is MLP-Mixer, which doesn't do "attention" at all; MLP is all you need. A good solution for edge models.



Other dumb architectures don't parallelize as well. Other architectures that parallelize at similar levels (RWKV, which is an RNN; H3; S4; etc.) do perform well at similar parameter counts and data sizes.



Regarding the positional encoding: why not include a scalar in the range (0..1) with every token, where the scalar encodes the position of the token? This adds a small amount of complexity to the network, but it could aid comprehensibility, which to me seems preferable if you're still doing research on these networks.
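Concretely, something like this hypothetical sketch (NumPy, with made-up dimensions), where each token just gets one extra feature:

  import numpy as np

  def add_scalar_position(token_embeddings):
      # token_embeddings: (seq_len, d_model)
      seq_len = token_embeddings.shape[0]
      pos = np.linspace(0.0, 1.0, seq_len)[:, None]     # one scalar per token in [0, 1]
      return np.concatenate([token_embeddings, pos], axis=-1)  # width grows by 1

  x = add_scalar_position(np.random.randn(16, 64))      # -> (16, 65)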


That's a valid 1D position encoding :)


I'm still not clear on the second question. If lalaithion's original statement "the Q and K matrices aren't structurally distinct" is true, then once the neural network is trained, how can we look at the two matrices and confidently say that one is the query matrix instead of it being the key matrix (or vice versa)? To put it another way: is the distinction between query and key roles "real" or is it just an analogy for humans?


I am not an expert, but I think they are structurally identical only in decoder-only transformers like GPT. The original transformers were used for translation, and so the encoder-decoder layers use Q from the decoder layer and K from the encoder layer. The "Attention Is All You Need" paper has an explanation:

> In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as...


> transformers are "permutation invariant" by nature

Surely that does not apply to GPT models, which use causal masking.
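A quick sketch of what the mask does (NumPy, illustrative only): position i can only attend to positions <= i, so the ordering is baked into the attention pattern itself.

  import numpy as np

  seq_len = 5
  scores = np.random.randn(seq_len, seq_len)            # raw attention scores
  future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
  scores[future] = -np.inf                              # block attention to future tokens
  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights /= weights.sum(axis=-1, keepdims=True)        # upper triangle is now zero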


Would this not imply that if I "encrypt" the input by permuting it and then apply the inverse permutation to the output, I would get the correct result (i.e. what I would have gotten if I had used the plaintext input)?


>> concatenation was also ok, but less efficient

That's my question: why is this so?



