
> that will have a higher impact than register allocation and instruction selection

The most powerful JIT compiler that I have worked with - Graal - generates code that often, on the face of it, seems puzzlingly inefficient in terms of registers allocated and instructions selected. Turns out maybe it's another thing that's not as important as all the textbooks say?

The important bit is removing object allocation, removing boxing, deep inlining, and specialisation... when you've done all that work, the exact registers and instructions don't seem to really make a huge difference.
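
For a concrete flavour, here's a hypothetical Java sketch (the names are mine, not Graal's) of the kind of code where those optimisations dominate: once distanceTo() is inlined and escape analysis proves the temporary Point never escapes, the allocation disappears and the loop runs on plain doubles, more or less regardless of which registers get picked.

    // Hypothetical sketch: the kind of code where inlining and escape analysis
    // matter far more than which registers the loop body ends up using.
    final class Distances {
        record Point(double x, double y) {
            double distanceTo(Point other) {
                double dx = x - other.x, dy = y - other.y;
                return Math.sqrt(dx * dx + dy * dy);
            }
        }

        static double sumDistances(double[] xs, double[] ys) {
            double total = 0.0;
            Point origin = new Point(0.0, 0.0);
            for (int i = 0; i < xs.length; i++) {
                // Once distanceTo() is inlined, escape analysis can prove this
                // Point never escapes, so the allocation and its field loads
                // melt away and only the double arithmetic remains.
                total += new Point(xs[i], ys[i]).distanceTo(origin);
            }
            return total;
        }
    }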



Even with inlining and escape analysis to remove unnecessary boxing, unboxing, and heap allocations, most Java programs still do a lot of pointer chasing. All of that (often poor locality) memory traffic tends to hide the overhead of unnecessary register moves and even some unnecessary register spills.
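
A hypothetical sketch of that pointer chasing (not from any particular benchmark): in sumList every iteration has to wait for the load of n.next before it even knows which cache line to touch next, which is exactly the kind of latency that makes a stray register move invisible; sumArray shows the dense, predictable case for contrast.

    final class Chase {
        static final class Node {
            final int value;
            Node next;
            Node(int value) { this.value = value; }
        }

        // Latency-bound: a chain of dependent loads, often with poor locality.
        static long sumList(Node head) {
            long total = 0;
            for (Node n = head; n != null; n = n.next) {
                total += n.value;
            }
            return total;
        }

        // Throughput-bound: predictable, prefetch-friendly accesses.
        static long sumArray(int[] values) {
            long total = 0;
            for (int v : values) {
                total += v;
            }
            return total;
        }
    }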


Could that be due to the cleverness of modern heavyweight CPUs, with techniques like register renaming? Would things change if you used less sophisticated processors?


This is the explanation I get when I dig into these things, yes.

For example, I'll see what seem to my less-experienced eyes to be some completely redundant 'mov' instructions shifting things around. I'll ask 'should we fix this?' and the response is usually 'doesn't really matter, it's basically free'.


Interesting, so (to a rough approximation) the CPU is smart enough to boil off many 'shallow' inefficiencies. Doesn't cache behaviour still matter though? I'd have thought code-compactness would still be worth carefully optimising for.


> Doesn't cache behaviour still matter though? I'd have thought code-compactness would still be worth carefully optimising for.

Well that's partly why I'm still surprised, but I think for a lot of dynamic languages there's such a huge volume of code anyway... that these little 'mov's are pretty insignificant.

Removing them would have a downside: extra compilation time.


yes and no.

Many instructions have latencies of 3-20 cycles, and memory loads can be far worse. A lot of what CPUs do in many programs is wait for an instruction to complete so they can use its result to chase some pointer - and in the meantime, find an instruction that doesn't depend on it to execute. So many "simple" moves and register renames are practically free, as they happen in these waiting-for-result slots.

HPC code is most definitely not like that - but it's only found in specialized parts of software (game rendering paths, maths stuff, some NumPy code, etc). But most code spends most of its time with the CPU waiting.
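
A rough, hypothetical illustration of those waiting-for-result slots (the effect is much larger with memory loads than with integer adds, but the shape is the same): the single-accumulator loop is one serial dependency chain, while splitting it into two independent chains gives the out-of-order core something to do while it waits.

    final class Accumulators {
        // One serial dependency chain: each add has to wait for the previous one.
        static long serialSum(long[] a) {
            long sum = 0;
            for (int i = 0; i < a.length; i++) {
                sum += a[i];
            }
            return sum;
        }

        // Two independent chains: the CPU can overlap their latencies, doing
        // useful work in the slots it would otherwise spend waiting.
        static long pairedSum(long[] a) {
            long s0 = 0, s1 = 0;
            int i = 0;
            for (; i + 1 < a.length; i += 2) {
                s0 += a[i];
                s1 += a[i + 1];
            }
            if (i < a.length) s0 += a[i];
            return s0 + s1;
        }
    }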


How fast is code generated by Graal?

In a language like Java, with no value types and virtual methods everywhere, interprocedural analysis is very important (see the sketch after this comment). Even otherwise, I believe interprocedural optimizations are more important than small register fiddling or cute integer arithmetic tricks. But when you reach a dead end in already-optimized code, all these things matter.

Sure, it may not matter for Erlang / Elixir on the server.
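
As a hypothetical sketch of why that interprocedural work matters so much in Java: every call below is virtual in the bytecode, so without profiling, devirtualisation and inlining the JIT can't see through area() to optimise the loop at all.

    interface Shape {
        double area();
    }

    record Circle(double r) implements Shape {
        public double area() { return Math.PI * r * r; }
    }

    record Square(double side) implements Shape {
        public double area() { return side * side; }
    }

    final class Areas {
        static double totalArea(Shape[] shapes) {
            double sum = 0.0;
            for (Shape s : shapes) {
                // If profiling shows this call site is monomorphic (say, all
                // Circles), the JIT can devirtualise and inline area() here
                // and then optimise across the former call boundary.
                sum += s.area();
            }
            return sum;
        }
    }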


Are there any papers or articles about the mentioned findings?


About what findings, sorry?

Linear Scan is an example I reach for when I talk about what parts of the compiler are really important, if that's what you mean.

http://web.cs.ucla.edu/~palsberg/course/cs132/linearscan.pdf

> The linear scan algorithm is considerably faster than algorithms based on graph coloring, is simple to implement, and results in code that is almost as efficient as that obtained using more complex and time-consuming register allocators based on graph coloring.
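
To make that concrete, here's a compact, hypothetical Java sketch of the linear scan idea from the paper (real allocators, Graal's included, add lifetime holes, fixed registers, interval splitting, and move resolution on top of this):

    import java.util.*;

    // Assign registers to live intervals in one pass over their start points,
    // spilling whichever conflicting interval ends furthest in the future.
    final class LinearScan {
        record Interval(String name, int start, int end) {}

        static Map<String, String> allocate(List<Interval> intervals, int numRegs) {
            List<Interval> byStart = new ArrayList<>(intervals);
            byStart.sort(Comparator.comparingInt(Interval::start));

            Map<String, String> assignment = new HashMap<>();   // name -> register or "spill"
            Deque<String> freeRegs = new ArrayDeque<>();
            for (int i = 0; i < numRegs; i++) freeRegs.add("r" + i);

            // Active intervals, ordered by increasing end point.
            TreeSet<Interval> active = new TreeSet<>(
                Comparator.comparingInt(Interval::end).thenComparing(Interval::name));

            for (Interval cur : byStart) {
                // Expire intervals that ended before the current one starts,
                // returning their registers to the free pool.
                while (!active.isEmpty() && active.first().end() < cur.start()) {
                    freeRegs.add(assignment.get(active.pollFirst().name()));
                }
                if (freeRegs.isEmpty()) {
                    // No register free: spill whichever interval ends last.
                    Interval victim = active.last();
                    if (victim.end() > cur.end()) {
                        active.remove(victim);
                        assignment.put(cur.name(), assignment.get(victim.name()));
                        assignment.put(victim.name(), "spill");
                        active.add(cur);
                    } else {
                        assignment.put(cur.name(), "spill");
                    }
                } else {
                    assignment.put(cur.name(), freeRegs.poll());
                    active.add(cur);
                }
            }
            return assignment;
        }
    }

For example, allocate(List.of(new Interval("a", 0, 8), new Interval("b", 2, 4), new Interval("c", 3, 10)), 2) gives 'a' and 'b' registers and spills 'c', because at the point of register pressure 'c' is the interval that lives longest.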

Also things like escape analysis and inlining are often called 'the mothers of optimisation' because they fundamentally enable so many other optimisations. Not sure there's really a citation for it but I doubt anyone would dispute it.

I wrote about the impact on optimising Ruby.

https://chrisseaton.com/truffleruby/pushing-pixels/


> About what findings, sorry?

These findings: 'The important bit is removing object allocation, removing boxing, deep inlining, and specialisation... when you've done all that work the exact registers and instructions don't seem to really make a huge difference.'

> Linear Scan is an example I reach for when I talk about what parts of the compiler are really important

Sure, but you just said that register allocation is not a relevant factor compared to other achievements when using GraalVM. Or did I get that wrong?


> These findings:

Right - didn't I cover those? I gave the example that Graal is extremely powerful, with a huge amount of money and effort behind it, but doesn't really bother at all with serious register allocation or clever instruction selection as textbooks suggest you should. It uses the most basic algorithm and doesn't see any need to do any more, even though the team is still keen on tiny improvements to the output. It just doesn't seem to be worth it.

But it does put a lot of effort into object allocation, boxing, inlining, specialisation, etc. So those seem in practice to be worth it.


> But it does put a lot of effort into object allocation, boxing, inlining, specialisation, etc. So those seem in practice to be worth it.

Well, as I understand it, this is an assumption, not the result of a dedicated study. I made similar observations when using LuaJIT as a SOM backend (see https://github.com/rochus-keller/Som), but it's not clear why the Graal-based SOM implementations are that much faster.


> Well, as I understand this is an assumption, not the result of a dedicated study.

I'm not sure what you're angling for?

Some kind of falsifiable proof about where it makes sense to put engineering effort and complexity? You're not going to find that, sorry - nobody's ever been able to seriously quantify those kinds of things.

> I made similar observations

Well then why are we arguing about it if it's apparent to both of us?


> I'm not sure what you're angling for?

You assume that 'The important bit is removing object allocation, removing boxing, deep inlining, and specialisation... when you've done all that work the exact registers and instructions don't seem to really make a huge difference.'

But it would be very interesting to have something like a scientific study of why Graal is indeed faster than other approaches.

> Well then why are we arguing about it if it's apparent to both of us?

Because I would like to understand the true reason to be able to improve my implementation (if feasible).

EDIT: you claim that textbooks about compiler design are wrong or out of date, so my desire to see that claim properly substantiated seems understandable, doesn't it?


I don't think it's correct to say I'm just assuming.

Linear Scan produces a program with apparently less efficient register allocation. In practice, it does not matter for the wider performance of the code. Is this not evidence to support the assumption that sophisticated register allocation does not matter as much as we thought?

When you enable Graal's more sophisticated escape analysis algorithms you get real-world speed ups in workloads such as Twitter, worth hundreds of thousands of dollars in costs saved. Is this not evidence to support the assumption that sophisticated escape analysis algorithms do matter?

The first is a formal scientific study. The second is not, but it's still a very large-scale empirical result measured by an expert. They aren't falsifiable, but as I said I don't think that's a realistic expectation, and I think these are enough for it to be more than an assumption.


> Is this not evidence to support the assumption that sophisticated register allocation does not matter as much as we thought?

It's an indication but it doesn't sufficiently support the conclusion. There are so many other things to consider.

> Is this not evidence to support the assumption that sophisticated escape analysis algorithms do matter?

Would you as a scientist seriously accept this as sufficient evidence for your claims?

But let's leave it at that for the moment. As far as I know there are ongoing research projects which could deliver more insight into why, specifically, a Smalltalk VM runs faster on Graal than anywhere else.


> Would you as a scientist seriously accept this as a sufficient evidence for your claims?

It was a comment on a web page dude... I didn't claim it in a research paper for publication!

If we discourage others from casually sharing their observations, as you're doing, we'll miss opportunities to find things to form hypotheses from in the first place! Casual discussion in the community is part of science, something to be encouraged, and you're sadly missing something by dismissing it like this.


Ok, that sounds like a response to my initial question 'Are there any papers or articles about the mentioned findings?'

Casually sharing observations is a good thing, and it's even better when there is some detailed information available which makes it possible to understand the propositions sufficiently well and to assess how certain the conclusions are.



