Mixture of Logits was actually already deployed on 100M+ scale datasets at Meta and at LinkedIn (https://arxiv.org/abs/2306.04039, https://arxiv.org/abs/2407.13218, etc.). The crucial departure from traditional embedding/multi-embedding approaches is learning a query-/item-dependent gating function, which makes MoL a universal high-rank approximator (assuming we care about recall@1) even when the input embeddings are low rank.
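
For intuition, here's a minimal sketch of the gated mixture in PyTorch. The score is sum_p pi_p(q, i) * <q_p, i_p>, where the gate pi is a softmax over P component dot products, conditioned on both the query and the item. Names like num_components and the gate MLP sizes are my illustrative assumptions, not the papers' actual API:

    import torch
    import torch.nn as nn

    class MixtureOfLogits(nn.Module):
        def __init__(self, dim: int, num_components: int):
            super().__init__()
            self.p, self.dim = num_components, dim
            # Project query/item into P low-rank component embeddings each.
            self.query_proj = nn.Linear(dim, num_components * dim)
            self.item_proj = nn.Linear(dim, num_components * dim)
            # Gate conditioned on BOTH query and item features --
            # this (q, i)-dependent gating is the crucial departure.
            self.gate = nn.Sequential(
                nn.Linear(2 * dim, 128), nn.ReLU(),
                nn.Linear(128, num_components),
            )

        def forward(self, q: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
            # q: (B, D) queries; x: (N, D) items -> (B, N) scores
            B, N = q.size(0), x.size(0)
            qe = self.query_proj(q).view(B, self.p, self.dim)  # (B, P, D)
            xe = self.item_proj(x).view(N, self.p, self.dim)   # (N, P, D)
            # Per-component dot-product logits: (B, N, P)
            logits = torch.einsum('bpd,npd->bnp', qe, xe)
            # Gate weights pi(q, i): softmax over components per pair.
            pair = torch.cat([
                q.unsqueeze(1).expand(B, N, self.dim),
                x.unsqueeze(0).expand(B, N, self.dim),
            ], dim=-1)                                          # (B, N, 2D)
            pi = torch.softmax(self.gate(pair), dim=-1)         # (B, N, P)
            # Mixture of logits: gated sum of component dot products.
            return (pi * logits).sum(dim=-1)                    # (B, N)

Because pi depends on the (q, i) pair, the B x N score matrix isn't capped at rank D the way a single dot product is, which is where the high-rank approximation claim comes from.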

