To be fair, unless I missed it, that article is mostly just saying "The L1 and L2 are much faster than L3 and beyond". That's correct - but it doesn't imply that the L2 is somehow as fast as the L1 (if that were true, why have the distinction between L1 and L2 at all?).
Once you go to the L3, you suffer a large jump in latency (from something like 12 cycles for the L2 to ~40 cycles for the L3, and on some multi-socket configurations it gets worse) and you are now sharing the cache with all the other cores on the chip, so the advice to "Keep in L1 and L2" makes a lot of sense.
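If you want to see that jump on your own machine, a dependent pointer-chase over working sets of different sizes is the standard trick. Here is a minimal sketch - the 16 KiB / 128 KiB / 4 MiB / 64 MiB sizes are just my assumptions for landing in L1, L2, L3 and DRAM on a typical client part:

    /* Pointer-chase latency sketch: every load depends on the previous one, so
       time per iteration approximates the load-to-use latency of whichever
       level the working set fits in. Sizes are assumptions for a typical
       client chip; within-line hits flatten the numbers a little. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double chase_ns(size_t bytes, size_t iters) {
        size_t n = bytes / sizeof(size_t);
        size_t *buf = malloc(n * sizeof(size_t));
        size_t *order = malloc(n * sizeof(size_t));
        /* Random cyclic permutation so the hardware prefetchers can't help. */
        for (size_t i = 0; i < n; i++) order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i < n; i++) buf[order[i]] = order[(i + 1) % n];

        struct timespec t0, t1;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            p = buf[p];                       /* serially dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        volatile size_t sink = p; (void)sink; /* keep the chain alive */
        free(order); free(buf);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
    }

    int main(void) {
        size_t sizes[] = {16 << 10, 128 << 10, 4 << 20, 64 << 20};
        for (int i = 0; i < 4; i++)
            printf("%6zu KiB: %5.1f ns/load\n", sizes[i] >> 10,
                   chase_ns(sizes[i], 20u * 1000 * 1000));
        return 0;
    }

Compile with -O2; the absolute numbers will wobble, but the step from the L2-sized to the L3-sized working set should be obvious.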
Intel used to claim their L2 had 64 bytes per cycle of bandwidth - and indeed there is probably a bus between the L1 and L2 capable of transferring 64 bytes per cycle in some direction - but you could never achieve that: unlike the core-to-L1 interface, which has a "simple" performance model, the L1<->L2 interface is complicated by the fact that (a) it has to accept evictions from L1, (b) the L1 itself doesn't have an unlimited number of ports to simultaneously accept incoming 64-byte lines from L2, and (c) everything is cache-line based.
The upshot is that even ignoring (c) you never get more than 32 bytes per cycle from L2 and usually less. Intel recently changed all the mentions of "64 bytes per cycle for L2" into "32 bytes max, ~2x bytes sustained" in their optimization manual in recognition of this.
Once you consider (c), the bandwidth you get from L2 can drop even further: in L1, on modern Intel, your requests can be scattered around more or less randomly and you'll still get the expected performance (there is a small penalty for splitting a cache line: it counts "double" against your read/write limits), but with L2 scattered reads and writes perform worse, since every access is essentially amplified to a full cache line.
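To make the amplification concrete, here's a minimal sketch: it reads a buffer assumed to fit in L2 but not L1 (192 KiB is my assumption for a 32 KiB L1 / 256 KiB L2 part) once densely and once touching only one word per 64-byte line, and reports useful bytes per nanosecond - the sparse pass pays for a full line fill per 8 useful bytes:

    /* Dense vs sparse reads over an L2-resident buffer.
       (a) dense:  touch every 8-byte word, so each 64-byte line pulled from L2
           supplies 64 useful bytes;
       (b) sparse: touch one word per line, so the same line fill supplies only
           8 useful bytes - the access is amplified to a full cache line.
       192 KiB is an assumed size: misses a 32 KiB L1, fits a 256 KiB L2. */
    #define _POSIX_C_SOURCE 200112L
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>

    enum { BUF_BYTES = 192 * 1024, WORDS = BUF_BYTES / 8, REPS = 20000 };

    /* Empty asm as an optimization barrier (GCC/Clang) so the repeated passes
       aren't collapsed into one. */
    #define BARRIER() __asm__ volatile("" ::: "memory")

    static double useful_bytes_per_ns(const uint64_t *buf, size_t step) {
        struct timespec t0, t1;
        uint64_t sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++) {
            for (size_t i = 0; i < (size_t)WORDS; i += step)
                sum += buf[i];
            BARRIER();
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        volatile uint64_t sink = sum; (void)sink;
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        double useful = (double)REPS * (WORDS / step) * 8;  /* bytes actually read */
        return useful / ns;
    }

    int main(void) {
        uint64_t *buf;
        if (posix_memalign((void **)&buf, 64, BUF_BYTES)) return 1;
        for (size_t i = 0; i < (size_t)WORDS; i++) buf[i] = i;
        printf("dense  (every word)       : %.2f useful bytes/ns\n",
               useful_bytes_per_ns(buf, 1));
        printf("sparse (one word per line): %.2f useful bytes/ns\n",
               useful_bytes_per_ns(buf, 8));
        free(buf);
        return 0;
    }

The "useful bytes" framing is the point: both passes pull roughly the same number of lines out of L2, but the sparse one only uses an eighth of each line.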
It turns out that L2 writes are especially bad:
https://stackoverflow.com/q/47851120/149138
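For a quick feel on your own hardware, the same kind of sketch works for stores: time a read-only pass and a write-only pass over a buffer assumed to sit in L2 (192 KiB again, a guess for a 32 KiB L1 / 256 KiB L2 part). A store that misses L1 still has to pull the line in, and the dirtied line later has to be evicted back to L2, so the write pass typically moves more data over the L1<->L2 path:

    /* Read-only pass vs write-only pass over a buffer assumed to live in L2
       (192 KiB: misses a 32 KiB L1, fits a 256 KiB L2). */
    #define _POSIX_C_SOURCE 200112L
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>

    enum { BUF_BYTES = 192 * 1024, WORDS = BUF_BYTES / 8, REPS = 20000 };

    #define BARRIER() __asm__ volatile("" ::: "memory")  /* GCC/Clang barrier */

    static double elapsed_ns(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    }

    int main(void) {
        uint64_t *buf;
        if (posix_memalign((void **)&buf, 64, BUF_BYTES)) return 1;
        for (size_t i = 0; i < (size_t)WORDS; i++) buf[i] = i;

        struct timespec t0, t1;
        uint64_t sum = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++) {          /* read-only pass */
            for (size_t i = 0; i < (size_t)WORDS; i++) sum += buf[i];
            BARRIER();
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double rd = elapsed_ns(t0, t1);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++) {          /* write-only pass */
            for (size_t i = 0; i < (size_t)WORDS; i++) buf[i] = i;
            BARRIER();
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double wr = elapsed_ns(t0, t1);

        volatile uint64_t sink = sum; (void)sink;
        double bytes = (double)REPS * BUF_BYTES;
        printf("read : %.2f bytes/ns\n", bytes / rd);
        printf("write: %.2f bytes/ns\n", bytes / wr);
        free(buf);
        return 0;
    }

This is only a rough probe - the linked question goes into the actual mechanics in much more detail than a wall-clock sketch can show.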