It might not be compressing more (haven't yet looked at the paper). You can have fewer but larger tokens for the same amount of data.
Fewer tokens would still decrease the workload: there are fewer things to compare against, balanced against more work per comparison. For normal N² attention that trade-off makes sense, but the page says:
"We introduce a new linear DiT, replacing vanilla quadratic attention and reducing complexity from O(N²) to O(N)."
So not sure what's up there.
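To make the trade-off concrete, here's a rough back-of-the-envelope sketch. It assumes vanilla attention costs on the order of N²·d and a kernel-style linear attention costs on the order of N·d² (typical for linear-attention variants; the actual constants and scaling for SANA's Linear DiT may differ, and the token counts below are made up for illustration):

```python
# Rough FLOP comparison: fewer-but-larger tokens under quadratic vs. linear attention.
# Assumed cost models (not taken from the paper):
#   vanilla attention ~ N^2 * d   (pairwise scores over all token pairs)
#   linear attention  ~ N * d^2   (kernelized: cost dominated by d x d feature maps)

def vanilla_attention_cost(n_tokens: int, dim: int) -> int:
    """Approximate cost of vanilla quadratic attention: O(N^2 * d)."""
    return n_tokens**2 * dim

def linear_attention_cost(n_tokens: int, dim: int) -> int:
    """Approximate cost of kernel-based linear attention: O(N * d^2)."""
    return n_tokens * dim**2

base_n, base_d = 4096, 64                # baseline: many small tokens
big_n, big_d = base_n // 4, base_d * 4   # 4x fewer but 4x larger tokens, same data

print("vanilla:", vanilla_attention_cost(base_n, base_d),
      "->", vanilla_attention_cost(big_n, big_d))   # drops 4x
print("linear: ", linear_attention_cost(base_n, base_d),
      "->", linear_attention_cost(big_n, big_d))    # grows 4x
```

Under the quadratic model, trading token count for token size is a clear win (cost drops 4x in this toy example); under the linear model the same trade makes things worse, which is why the O(N) claim muddies the fewer-tokens intuition.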