The performance of software is influenced by the workings of tens of units (plus local caches) connected (non-linearly) by FIFOs, out-of-order buffers, and replay mechanisms.
There are multiple versions of the same unit to save chip space (like light and heavy Integer/FP).
Units/domains have different clock speeds.
And from generation to generation, port connections and the assignment of instructions to pipelines can be reshuffled.
I suppose what keeps performance changes relatively straightforward for developers is the set of benchmarks used to evaluate the hardware early on.
I appreciate the article focusing on the I-Cache, and the nice intro to decoding.
I would have preferred a concrete example that actually gets improved, instead of abstract talk about problems, effects of code and optimizations, and possible workarounds.
Tangent: I wonder if we will be seeing specialized instructions sometime that span multiple units and multiple cores in order to reduce data-movement. Think matrix-matrix multiplication. The potential improvements for power and speed seem huge.
This is, in a nutshell, why high-performance systems engineering is a rare skill set. It entails writing C++ (or whatever) with full understanding of the machine code that is likely to generate and how that machine code will interact with the incredibly complex internals of modern microarchitectures. It essentially requires de-abstracting two levels of abstraction below the programming language, which exist to reduce cognitive load, in your software design and implementation.
It is unfortunate that this is still so useful in practice, given what it implies about the magnitude of waste in typical software systems.
That skill set is helpful, but you can go a long way with profiling tools: find what your bottleneck is spending the most time on, and try to do less of that. And, in general, try to work with the least amount of abstraction and make sure the abstractions you use match the underlying reality as much as possible. Almost always, you can't add a layer to get performance, just like you can't usually add a layer to get code quality, and you can't usually add a layer to get security.
Obviously, abstractions are useful, but it can be very hard to dig out of wrong abstractions, when they've influenced the design of the whole project.
What's interesting is that with the death of Moore's Law there's suddenly increased interest in accelerators (software or hardware assisted): specialized pieces of software/hardware that can perform certain tasks really fast, integrated into a PC system to boost the performance of certain common tasks. Thus I found myself having to spend a lot more time lately learning how cache affects the code I write, branch prediction, speculative execution, etc. It feels like starting out learning to program all over again.
One of the criticisms of the C++ standardization efforts is how many of the improvements are quite arcane. But those arcane features aren't necessarily for user code; they exist to make library code efficient. Implementations of the standard library are often almost unreadable because of all the weird corner cases and performance optimizations they have to transparently handle.
This can go a long way to improving user code...but of course it's no magic bullet.
I really dislike this mindset in the C++ community. The people writing libraries are normal programmers too. They're not some kind of supermen on whom you can load infinite complexity. And it's not like the STL is this amazing piece of software that mere mortals can't beat. Quite the contrary, any sufficiently large piece of software tends to replace the STL at least in parts with their own implementation.
Herb Sutter mentioned once that an advantage of the STL is that it gets you good general-purpose performance initially, but tries to stay out of your way if you want to replace it with something more specialized.
I think you’re right about a codebase replacing parts of STL as necessary. One I see frequently is `std::array` being wrapped with iterators to be essentially a stack-allocated vector.
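Something along these lines is what I mean, as a minimal sketch (the name `static_vector` and the exact interface are just illustrative; polished versions of this exist in libraries):

```cpp
#include <array>
#include <cstddef>

// Hypothetical minimal "static_vector": std::array storage plus a size,
// giving vector-like push_back and iteration without any heap allocation.
template <typename T, std::size_t N>
class static_vector {
    std::array<T, N> storage_{};
    std::size_t size_ = 0;
public:
    void push_back(const T& value) { storage_[size_++] = value; } // no bounds check in this sketch
    T*          begin()        { return storage_.data(); }
    T*          end()          { return storage_.data() + size_; }
    const T*    begin() const  { return storage_.data(); }
    const T*    end()   const  { return storage_.data() + size_; }
    std::size_t size()  const  { return size_; }
};
```

Range-for works because `begin`/`end` are provided, and everything lives on the stack.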
Practice is the main ingredient. Build and debug increasingly complex systems (down to below the assembly level with e.g. VTune; learn about weak memory ordering, ...).
Learn how high level language code is transformed into "what the hardware shall do" to be able to predict, to some degree, whether that particular code you review will be optimized well or not (then measure and confirm/extend/correct your prediction).
Working with FPGAs helped me learn a lot about how to get the right data to the right execution unit at the right point in time. Solving this "space-time problem" is exactly what optimizing is about in CPUs as well.
For the engineering part I'd say: being strict about separation of concerns and single-source-of-truth principles, and not too religious about abstractions, certainly helps.
And finally: read, read, read. Books, Papers, other people's code...
Oh, and one more thing (tm): try to be the worst, not the best team member. Make sure you always can learn from peers. You need discussions, especially those wandering up and down the abstraction stack, asking "why" on every level.
> There are multiple versions of the same unit to save chip space (like light and heavy Integer/FP).
I've literally never heard of this. I know you can have multiple execution units for parallel instruction execution, but 'light' and 'heavy' - can you give some info or a link? TIA
I believe this shows differences between FP and Integer units.
In order to achieve a certain performance goal you don't necessarily need another integer divider when you want a new adder/multiplier. So you add a slimmer unit instead.
I listed this in my original comment, because this is a giant can of worms for the compiler and decision-maker on where to execute what.
Ah, thanks. The slide is interesting for extra reasons.
With respect, I think you're misunderstanding. I thought you meant light/heavy versions of eg. adders, for some definition of light and heavy addition.
I'm not an expert but... CPUs will put in extra execution units according to need (will typical code get faster with an extra X?) and cost.
Shifters are used very often and are simple. So are adders, though more complex. IIRC recent Intel x64 will have several of each[0]. Multipliers are less cheap, so there are fewer of them (and often you can turn multiplies into adds in certain cases, such as progressive array lookups). Division is slow and very expensive in transistors, so they have one (division can often be turned into reciprocal multiplication anyway; see the sketch below). Sqrt is even worse.
And to repeat, I'm no expert and any corrections welcome.
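To illustrate the reciprocal-multiplication point above, a small hand-written sketch (the function name is made up; compilers only do this transformation themselves for floats under -ffast-math or similar, because the result is not bit-identical to true division):

```cpp
#include <vector>

// Many divisions by the same divisor become one expensive division plus
// cheap multiplications inside the loop.
void scale_down(std::vector<float>& values, float divisor) {
    const float inv = 1.0f / divisor;  // single expensive division
    for (float& x : values)
        x *= inv;                      // cheap multiplication per element
}
```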
Do ports (as in the picture) have independent pipelines, or do they execute certain pipeline stages of a big pipeline?
I suppose, either way you can't issue to the same port in the same cycle.
This paper sheds some light on how instructions are divvied up between units on NVIDIA GPUs.
http://www.stuffedcow.net/files/gpuarch-ispass2010.pdf
Table IV.
Notice that fp32 mul is in the SFU and SP, while others are not.
I'm the wrong guy to answer this, but I believe the ports are available to one pipeline, so no independent pipelines; instructions behind the 'head' can hop over the head if the head is stalled and there are execution units free. Instructions can get reordered over quite a wide window, something like 200 instructions (look up the reorder buffer, ROB, although there's another window which affects this, something to do with retiring instructions).
Yep, but also while aligned. Bits x and y are going to be the same distance apart, always, unless shifted off the end (where they can wrap or be lost). Thinking of this in the pub it seemed a butterfly thingy would be appropriate <https://en.wikipedia.org/wiki/Butterfly_network>.
Further thinking suggested there'd be a ton of wires doing this, and perhaps it's the wiring that's taking up the silicon?
I've been wondering about the impact of instruction cache misses and the potential downsides of executable bloat caused by C++ templates, but have never seen a real case of this being a problem.
In 2010 Endeca's analytics engine exe was growing substantially via introduction of template instantiations. The engine eventually crossed the threshold of instruction cache misses outweighing the benefits of avoiding type lookups at runtime, and the team agreed to stop pursuing template instantiation so aggressively as a performance enhancement.
I wonder if modern compilers would have allowed the team to keep at it.
It's been a while, but if I recall correctly, the optimized, stripped exes at the time were a bit under 100 MB, and over 500 MB with debug symbols, etc.
Executable bloat can be reduced by stripping the executable afterwards, if that's what you mean. Although the function body will have been emitted if instantiated and used, the compiler will probably have inlined a huge amount of any code that is actually used - a lot of templated stuff is made up of one-liners or single uses, which compilers are very good at reasoning through these days.
I guess a naive compiler could have issues, but every time I check something on Compiler Explorer, the modern big boys (GCC, LLVM, ICC?) are pretty shrewd (especially if you optimise for size).
While aggressive inlining eliminates function call overhead, it can actually exacerbate instruction cache misses because it makes the executable larger.
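One way to keep the benefit of inlining without blowing up the hot path's footprint is to split out the cold parts explicitly. A minimal sketch (function names are hypothetical; the attributes are GCC/Clang extensions):

```cpp
#include <climits>
#include <cstdio>
#include <cstdlib>

// Cold, out-of-line slow path: noinline/cold keep this rarely executed code
// out of the hot callers' instruction footprint.
[[gnu::cold]] [[gnu::noinline]]
void report_overflow(long long bad_sum) {
    std::fprintf(stderr, "overflow: %lld\n", bad_sum);
    std::abort();
}

// The hot path stays tiny, so inlining it into callers costs little I-cache.
inline int checked_add(int a, int b) {
    const long long sum = static_cast<long long>(a) + b;
    if (sum > INT_MAX || sum < INT_MIN)
        report_overflow(sum);
    return static_cast<int>(sum);
}
```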
I don't understand where this sentiment comes from. I once saw a complete high-level networking algorithm implemented with boost.asio reduced to fewer than 250 instructions...
Modern C++ compilers are pretty good at minimizing template bloat. As with most things in C++ it requires some awareness and intent to optimize this but it isn't the issue it used to be in many cases.
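One example of that kind of intent is the old "thin template" idiom: hoist the type-independent work into a non-template function so only a small shim is instantiated per type. A rough sketch (names hypothetical):

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Non-template worker: compiled once, shared by every instantiation below.
inline void fill_bytes(unsigned char* dst, const unsigned char* src,
                       std::size_t count, std::size_t elem_size) {
    for (std::size_t i = 0; i < count; ++i)
        std::memcpy(dst + i * elem_size, src, elem_size);
}

// Thin per-type shim: only these few lines get stamped out for each T.
template <typename T>
void fill_n_copies(T* dst, const T& value, std::size_t count) {
    static_assert(std::is_trivially_copyable_v<T>, "sketch handles trivial types only");
    fill_bytes(reinterpret_cast<unsigned char*>(dst),
               reinterpret_cast<const unsigned char*>(&value),
               count, sizeof(T));
}
```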
perf-tools are my favorite. The overhead is negligible, and thus any metrics you gather are very accurate. Valgrind is rarely useful, considering the execution time disadvantage.
Another big issue is iTLB misses. Today, if you profile a real-life application, the iTLB miss rate can be 5%+ due to increased program image size. Hot/cold splitting like this article describes also helps with this problem, but I was wondering if we can use transparent huge pages for code while disabling them for mmapped data pages.
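Roughly what I have in mind for the code side is something like this (purely a speculative, Linux-specific sketch; the helper name is made up, the text range would have to come from /proc/self/maps or linker symbols, and whether file-backed .text is even eligible depends on the kernel build, so the call may be a no-op):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Ask the kernel to back a hot code region with transparent huge pages,
// leaving data mappings alone.
void request_huge_code_pages(void* text_start, std::size_t text_len) {
    (void)madvise(text_start, text_len, MADV_HUGEPAGE); // return value ignored in this sketch
}
```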