
1. It works, and the direct alternative (concatenation) would allocate a smaller dimension to the token embedding itself. Also, added positional embeddings are no longer commonly used in newer Transformers; schemes like RoPE and ALiBi are more common. A rough sketch of the trade-off is below.
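To make the dimension argument concrete, here is a minimal sketch (illustrative shapes only, not from any particular model) contrasting addition with concatenation:

    import numpy as np

    d_model, seq_len = 512, 16
    tok_emb = np.random.randn(seq_len, d_model)   # token embeddings
    pos_emb = np.random.randn(seq_len, d_model)   # positional embeddings

    # Addition: token content keeps the full d_model dimensions.
    x_add = tok_emb + pos_emb                     # (seq_len, 512)

    # Concatenation: tokens and positions must split the budget,
    # e.g. 448 dims for tokens + 64 for positions to stay at d_model = 512.
    tok_small = np.random.randn(seq_len, d_model - 64)
    pos_small = np.random.randn(seq_len, 64)
    x_cat = np.concatenate([tok_small, pos_small], axis=-1)  # (seq_len, 512)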

2. I'm not 100% sure I understand your question. The Ks correspond one-to-one with the Vs, so the Q–K similarities are used to compute the weighted sum over the Vs. This is easiest to see in an encoder-decoder model (the Qs come from the decoder, the KVs come from the encoder), or during decoding in a decoder-only model (there is one Q and multiple KVs); see the sketch below.
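Here is a minimal sketch of scaled dot-product attention for the decoding case mentioned above, with one query attending over several keys/values (shapes and names are illustrative, not from any particular library):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    d_k = 64
    q = np.random.randn(1, d_k)      # 1 query (current decoding step)
    K = np.random.randn(10, d_k)     # 10 keys (e.g. encoder outputs or a KV cache)
    V = np.random.randn(10, d_k)     # values aligned 1:1 with the keys

    scores  = q @ K.T / np.sqrt(d_k) # (1, 10): how well q matches each key
    weights = softmax(scores)        # (1, 10): attention distribution over positions
    out     = weights @ V            # (1, d_k): weighted sum over the values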


