My take on it is that I like most of the design choices they've made (Been tinke...

chrisseaton · on Nov 4, 2020

> that will have a higher impact than register allocation and instruction selection

The most powerful JIT compiler that I have worked with - Graal - generates code that often on the face of it seems puzzlingly inefficient in terms of registers allocated and instructions selected. Turns out maybe it's another thing that's not as important as all the text books say?

The important bit is removing object allocating, removing boxing, deep inlining, and specialisation... when you've done all that work the exact registers and instructions don't seem to really make a huge difference.

KMag · on Nov 5, 2020

Even with inlining and escape analysis to remove unnecessary boxing, unboxing, and heap allocations, most Java programs still do a lot of pointer chasing. All of that (often poor locality) memory traffic tends to hide the overhead of unnecessary register moves and even some unnecessary register spills.

MaxBarraclough · on Nov 4, 2020

Could that be due to the cleverness of modern heavyweight CPUs, with techniques like register renaming? Would things change if you used less sophisticated processors?

chrisseaton · on Nov 4, 2020

This is the explanation I get when I dig into these things, yes.

For example I'll see what seems to my less-experienced eyes some completely redundant 'mov' instructions shifting things around. I'll ask 'should we fix this' and the response is usually 'doesn't really matter it's basically free'.

MaxBarraclough · on Nov 4, 2020

Interesting, so (to a rough approximation) the CPU is smart enough to boil off many 'shallow' inefficiencies. Doesn't cache behaviour still matter though? I'd have thought code-compactness would still be worth carefully optimising for.

chrisseaton · on Nov 4, 2020

> Doesn't cache behaviour still matter though? I'd have thought code-compactness would still be worth carefully optimising for.

Well that's partly why I'm still surprised, but I think for a lot of dynamic languages there's such a huge volume of code anyway... that these little 'mov's are pretty insignificant.

Removing them would have a down-side - extra compilation time.

beagle3 · on Nov 5, 2020

yes and no.

Most instructions have 3-20 cycle latency. A lot of what CPU do on many programs is wait for an instruction to complete so they can use its result to chase some pointer - and in the mean time, find an instruction that doesn't depend on it to execute. So many "simple" moves and register renames are practically free, as they happen in these waiting-for-result slots.

HPC code is most definitely not like that - but it's only found in specialized parts of software (game rendering paths, maths stuff, some NumPy code, etc). But most code spends most of its time with the CPU waiting.

higerordermap · on Nov 5, 2020

How fast is code generated by graal?

In languages like Java w/o value types & virtual methods everywhere, interprocedural analysis is very important. Even otherwise I believe interprocedural optimizations are more important than small register fiddling or cute integer arithmetic tricks. But when you reach a dead-end in optimized code, all these things matter.

Sure, it may not matter for Erlang / Elixir on server.

Rochus · on Nov 4, 2020

Are there any papers or articles about the mentioned findings?

chrisseaton · on Nov 4, 2020

About what findings, sorry?

Linear Scan is an example I reach for when I talk about what parts of the compiler are really important, if that's what you mean.

http://web.cs.ucla.edu/~palsberg/course/cs132/linearscan.pdf

> The linear scan algorithm is considerably faster than algorithms based on graph coloring, is simple to implement, and results in code that is almost as efficient as that obtained using more complex and time-consuming register allocators based on graph coloring.

Also things like escape analysis and inlining are often called 'the mothers of optimisation' because they fundamentally enable so many other optimisations. Not sure there's really a citation for it but I doubt anyone would dispute.

I wrote about the impact on optimising Ruby.

https://chrisseaton.com/truffleruby/pushing-pixels/

Rochus · on Nov 4, 2020

> About what findings, sorry?

These findings: The important bit is removing object allocating, removing boxing, deep inlining, and specialisation... when you've done all that work the exact registers and instructions don't seem to really make a huge difference.

> Linear Scan is an example I reach for when I talk about what parts of the compiler are really important

Sure, but you just said that register allocation is not a relevant factor compared to other achievements when using Gral VM. Or did I get that wrong?

chrisseaton · on Nov 4, 2020

> These findings:

Right - didn't I cover those? I gave the example that Graal is extremely powerful with a huge amount of money and effort behind it, but doesn't really bother at all with serious register allocation or clever instruction selection as suggested you should in text books. It uses the most basic algorithm and doesn't see any need to do any more, even when they're still keen on tiny improvements to the output. It just doesn't seem to be worth it.

But it does put a lot of effort into object allocation, boxing, inlining, specialisation, etc. So those seem in practice to be worth it.

Rochus · on Nov 4, 2020

> But it does put a lot of effort into object allocation, boxing, inlining, specialisation, etc. So those seem in practice to be worth it.

Well, as I understand this is an assumption, not the result of a dedicated study. I made similar observations when using LuaJIT as a SOM backend (see https://github.com/rochus-keller/Som), but it's not clear why the Gral based SOM implementations are that much faster.

chrisseaton · on Nov 4, 2020

> Well, as I understand this is an assumption, not the result of a dedicated study.

I'm not sure what you're angling for?

Some kind of falsifiable proof about where it makes sense to put engineering effort and complexity? You're not going to find that sorry nobody's ever been able to seriously quantify those kind of things.

> I made similar observations

Well then why are we arguing about it if it's apparent to both of us?

Rochus · on Nov 4, 2020

> I'm not sure what you're angling for?

You assume that The important bit is removing object allocating, removing boxing, deep inlining, and specialisation... when you've done all that work the exact registers and instructions don't seem to really make a huge difference.

But it would be very interesting to have something like a scientific study about why Gral is indeed faster than other approaches.

> Well then why are we arguing about it if it's apparent to both of us?

Because I would like to understand the true reason to be able to improve my implementation (if feasible).

EDIT: as you claim textbooks about compiler design are wrong or not up to date, so my desire to have someone change that seems understandable, isn't it?

chrisseaton · on Nov 4, 2020

I don't think it's correct to say I'm just assuming.

Linear Scan produces a program with apparently less efficient register allocation. In practice, it does not matter for the wider performance of the code. Is this not evidence to support the assumption that sophisticated register allocation does not matter as much as we thought?

When you enable Graal's more sophisticated escape analysis algorithms you get real-world speed ups in workloads such as Twitter, worth hundreds of thousands of dollars in costs saved. Is this not evidence to support the assumption that sophisticated escape analysis algorithms do matter?

The first is a formal scientific study. The second is not but it's still a very large-scale empirical result measured by an expert. They aren't falsifiable but as I said I don't think that's a realistic expectation, and I think these are enough for it to be more than an assumption.

Rochus · on Nov 4, 2020

> Is this not evidence to support the assumption that sophisticated register allocation does not matter as much as we thought?

It's an indication but it doesn't sufficiently support the conclusion. There are so many other things to consider.

> Is this not evidence to support the assumption that sophisticated escape analysis algorithms do matter?

Would you as a scientist seriously accept this as a sufficient evidence for your claims?

But let's leave it at that for the moment. As far as I know there are ongoing research projects which could deliver more insights why specifically a Smalltalk VM runs faster on Gral than anywhere else.

chrisseaton · on Nov 4, 2020

> Would you as a scientist seriously accept this as a sufficient evidence for your claims?

It was a comment on a web page dude... I didn't claim it in a research paper for publication!

If we discourage others from more casually sharing our observations as you're doing we'll miss opportunities to find things to form hypotheses from in the first place! Casual discussion in the community is part of science, something to be encouraged, and you're sadly missing something by dismissing it like this.

Rochus · on Nov 4, 2020

Ok, that sounds like a response to my initial question Are there any papers or articles about the mentioned findings?

Casually sharing observations is a good thing, and even better when there is some detail information available which makes it possible to understand the propositions sufficiently well and to assess how certain the conclusions are.

kannanvijayan · on Nov 5, 2020

> Some might mention that V8 moved away from the always-JIT

They did, but their always-JIT was a bit heavyweight. They generated full method-jit code on first invocation at first. This was a massive improvement over the existing state of the art, but it was also relatively early in the whole development of the "fast JS" ecosystem we live in now.

The _size_ of JS grew, and the size of functions grew, and the complexity of logic grew. The amount of cold and run-once code grew.

That issue was brought into focus by the growth of the webapp space and the size of the payloads.

In this case (erlang folks) - they're going about the whole thing in a very good way. Their "jit-everything" is actually "jit one instruction at a time", which is _amazingly_ perceptive of the challenges people in other JIT teams have faced (e.g. JS JIT teams). What they're doing is tightly scoped, easy to bootstrap and test with fallbacks to VM-calls into the VM for slowpaths or complicated stuff.

That's a solid base you can slowly layer with higher tiers later if it matters. They're keeping their abstraction layers strong by having a well-specified bytecode system, and hopefully they will strive to keep it relatively independent of the runtime, and avoid leakages of runtime semantics into instructions.

I was personally very impressed by their description of their approach and motivation behind each decision.

nickjj · on Nov 4, 2020

> Erlang is mostly used in servers i think they might see this as a good compromise.

I'm not sure what the differences will be but start up time is still a very big deal for web servers.

For example I can start up a non-Dockerized Python / gunicorn app with 8,000 lines of code and 50+ package dependencies in about 400ms on a $5 a month DigitalOcean server without doing any type of optimizations.

For someone who just wants to deploy a project to 1 server that means there's only 400ms of down time when I restart my server.

If a runtime takes let's say 10 seconds to start up, that could be enough of a deterrent to not use it if you know you're going down the route of a single server deploy.

dnautics · on Nov 4, 2020

> For someone who just wants to deploy a project to 1 server that means there's only 400ms of down time when I restart my server.

Erlang already gives you robust primitives to do things like blue-green deploys, and even graceful transfer of server state even across versioning changes. If downtime between releases is an something you care about, you should use those, and it's likely that your downtime can be in the microseconds range regardless of the vm startup latency.

nickjj · on Nov 5, 2020

I don't think hot reloads in Elixir are used that often for web apps due to how much complexity / state changes need to take place. I'm basing that on replies I've seen from a lot of folks in the forums who say it's not really worth the trouble due to how much can go wrong and how involved it would be to do such a thing.

Which blue-green deploy primitives are you referencing btw?

dnautics · on Nov 6, 2020

Not referring to hot reloads.

To do a simple blue green deploy you just have to use the builtin node system to make a cluster, and use :global to detect if there are any downversion nodes, if the current node is up version, it slays a downversion node and your external restart mechanism (either erlangs heart, systemd, kubernetes, or Amazon elb, or whatever brings back the node but with the newer code).

nickjj · on Nov 6, 2020

Do you happen to have a working example of that posted somewhere? A blog post with code perhaps? I'm unable to find a single example of that because every search around this topic turns up things like hot reloading and a bunch of people saying not to use it.

cmrdporcupine · on Nov 4, 2020

I don't think we're talking "start time" so much as "warm up time"; if I understand these things correctly it's likely the VM would start almost immediately, but would take a little bit of time for hot code paths to JIT and become highly performant. I don't think that would be much of a concern in your example.

lawik · on Nov 4, 2020

From my conversation with them on Elixir Mix I don't believe warm-up is the issue. In this case the JIT is simpler than that. But the BEAM VM isn't the quickest starter in the world.