
If we can know exactly where a particular object’s lifetime ends, then indeed we can write more efficient programs than is possible with a GC, but in the general case that is not something we can determine at compile time (Rust’s lifetimes are a strict subset of the general case).

Anything more dynamic will require some “ad hoc, informally-specified, bug-ridden, slow implementation of a GC”.



Rust also relies on a GC for the more dynamic lifetimes in Rust programs (reference counting); it just doesn't call it a GC. And while Rust's GC has a lower footprint than tracing GCs, it has worse performance and it suffers from cycle problems. I wouldn't call it "bug ridden" (perhaps "limited" is a better term), but it is slow.
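To illustrate the cycle problem, here's a minimal Rust sketch (names like `Node` and `make_cycle` are just for illustration): two `Rc` nodes that point at each other keep each other alive forever, even once they become unreachable.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// A node that can point to another node, allowing a reference cycle.
struct Node {
    next: RefCell<Option<Rc<Node>>>,
}

// Build two nodes that point at each other: a -> b -> a.
fn make_cycle() -> (Rc<Node>, Rc<Node>) {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(Some(Rc::clone(&a))) });
    *a.next.borrow_mut() = Some(Rc::clone(&b));
    (a, b)
}

fn main() {
    let (a, b) = make_cycle();
    // Each node is held by our local handle plus the other node.
    assert_eq!(Rc::strong_count(&a), 2);
    assert_eq!(Rc::strong_count(&b), 2);
    // Dropping both handles leaves each count at 1, never 0: the cycle
    // is unreachable but never freed. A tracing GC would reclaim it.
    drop(a);
    drop(b);
}
```

(Rust's workaround is `Weak`, but that requires the programmer to know where the cycles are; a tracing GC doesn't.)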

Some think that reference counting gives better performance predictability than a tracing GC, but that is no longer true, certainly compared to the generational ZGC presented here. If you could determine when a collection would take place with a refcounting GC, i.e. when the last reference is cleared, you wouldn't need a refcounting GC in the first place. It is unpredictable pretty much by definition.


> Rust's GC (...) has worse performance [than tracing GC]

That's an unsubstantiated claim.

While reference counting may be slower than tracing in the general case, Rust's reference counting is not the same thing as general reference counting in languages such as Swift or Python, for two reasons:

- The vast majority (99%+) of objects in Rust don't need to be managed by reference counting at all.

- Rust relies heavily on move semantics to avoid updating refcounts, so the vast majority of refcounted objects see very few (sometimes just one) refcount updates.
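The second point is easy to see in a small sketch: passing an `Arc` by value moves it, so ownership transfers without touching the counter; only an explicit `Arc::clone` increments it.

```rust
use std::sync::Arc;

// Taking the Arc by value moves it: ownership transfers with no
// refcount increment, and the single decrement happens only when
// `s` is dropped at the end of this function.
fn consume(s: Arc<String>) -> usize {
    s.len()
}

fn main() {
    let shared = Arc::new(String::from("hello"));
    assert_eq!(Arc::strong_count(&shared), 1);

    let other = Arc::clone(&shared); // explicit clone: count 1 -> 2
    assert_eq!(Arc::strong_count(&shared), 2);

    let n = consume(other); // moved in: no increment, then count drops to 1
    assert_eq!(n, 5);
    assert_eq!(Arc::strong_count(&shared), 1);
}
```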

> reference counting gives better performance predictability than a tracing GC, but that is no longer true, certainly compared to the generational ZGC presented here.

Where does ZGC claim nanosecond latency? But even if it did, it would still be worse, because refcounting doesn't pause at all. A pause means stopping all application threads; spending X nanoseconds freeing memory on one thread does not count as a pause, because the rest of the system can still make progress.


Even then, updating the counter is still more work than tracing collectors do, which is nothing at all almost all of the time.

> refcounting doesn't do pauses at all

At the cost of throughput.

BTW, I'm not comparing Java to Rust, just ZGC to Rust's GC. I am well aware that it is not used for most objects.


But reference counting is just a tiny subcomponent of Rust's GC. The majority of Rust's automatic memory management is based on affine types and static code analysis, not refcounting. You're just picking subcomponents to make the Java solution look better, but what matters for developers is the whole picture.

And BTW, I wouldn't be so sure about throughput either. ZGC is capable of allocation rates on the order of only ~5 GB/s, which is way below the theoretical memory bandwidth. I bet jemalloc can do much better.


I thought I made it clear in my last comment that I'm not claiming that Java's memory management is somehow "better" than Rust's. I was merely pointing out that even Rust has a GC, and it's not a good one, but it may well be good enough for Rust's needs, since Rust doesn't depend on it much. Overall, Rust and Java obviously choose very different tradeoffs in their memory management strategies.

> ZGC is capable of allocation rates of the order of only ~5 GB/s

That was the non-generational ZGC. Indeed, ZGC's max throughput was well below that of ParallelGC and G1. One of the main goals of generational ZGC is to increase the throughput, and as you can see in the video, they got a 4x increase in throughput.


The statement was that ZGC (or any JVM GC, for that matter) is better than Rust’s GC, which is just naive refcounting. This is true: if we wrote a program that allocated the same amount in both implementations, Java would win by a huge margin.


Have you ever tested it yourself? I have, and no, it didn't win by a huge margin. malloc+free was only a tad slower than allocation under Java's default GC (stop-the-world, parallel), but the default GC is much faster than G1 and ZGC, so I would not be surprised if malloc could beat those modern collectors in an allocation-rate benchmark.
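A rough sketch of the malloc side of such a benchmark (assuming a churn of small short-lived allocations, the pattern young-generation GCs are optimized for; the numbers obviously depend on the allocator — Rust uses the system malloc by default, and jemalloc can be swapped in):

```rust
use std::hint::black_box;
use std::time::Instant;

// Allocate and immediately free `n` objects of `size` bytes through
// the system allocator; returns the achieved rate in GB/s.
fn churn(n: usize, size: usize) -> f64 {
    let start = Instant::now();
    for _ in 0..n {
        let b = vec![0u8; size].into_boxed_slice();
        black_box(&b); // keep the optimizer from eliding the allocation
        // `b` is freed right here by the allocator
    }
    (n * size) as f64 / start.elapsed().as_secs_f64() / 1e9
}

fn main() {
    let rate = churn(1_000_000, 64);
    println!("malloc+free churned ~{rate:.2} GB/s of 64-byte objects");
}
```

A fair comparison would also have to account for the zeroing that Java's `new` does, and for the fact that a real workload's allocations aren't all the same size.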


Do you have a public benchmark for that? It’s quite hard to write a benchmark that actually measures what we’re interested in. Not doubting you, I’d just be interested.


This has been on my list of next blog posts for so long. But unfortunately no, not yet.


That last line was just a play on this ( https://en.m.wikipedia.org/wiki/Greenspun%27s_tenth_rule ), but you are right of course.


It does give better performance predictability! Decrementing a refcount may be either a no-op or expensive, but crucially the operation can happen only during that particular call, nowhere else.
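In Rust terms, the point is that the deallocation can only ever occur at the drop of the last handle; a small sketch:

```rust
use std::rc::Rc;

fn main() {
    let a = Rc::new(vec![0u8; 1024]);
    let b = Rc::clone(&a);

    drop(a); // cheap: just decrements the count, 2 -> 1
    assert_eq!(Rc::strong_count(&b), 1);
    assert_eq!(b.len(), 1024); // the buffer is still alive

    drop(b); // count hits 0: the buffer is freed at exactly this call,
             // and at no other point in the program
}
```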


It does not. Maintaining the counter requires more work than tracing GCs do, and you don't know in advance when the last reference is dropped. True, you get more promptness in deallocation (not the same as predictability), which reduces the footprint but comes at the expense of compaction; the lack of compaction in most refcounting GCs also adds unpredictability. In contrast, in the common case, allocation with a compacting, tracing GC is a pointer bump, and there's no overhead at all to copying references or other memory operations. During collection cycles, GC barriers are triggered, which add overhead to things like mutating pointers or reading pointers from objects in the heap.

There's a great paper (linked from this post https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/unifie...) showing how refcounting GCs can match the performance of tracing GCs -- in fact, how they can each have the properties of the other -- but that requires sophistication in the implementation of refcounting GCs that most of them don't have. So it is true that a sophisticated refcounting GC and a sophisticated tracing GC can exhibit similar behaviour, but while OpenJDK's tracing GCs are very sophisticated (especially ZGC and G1), most refcounting GCs in the wild are quite unsophisticated. It may certainly be true that unsophisticated refcounting GCs have more desirable properties than unsophisticated tracing GCs, but in practice the choice offered is between unsophisticated refcounting GCs and very sophisticated tracing GCs. The latter beat the former on most counts except for footprint overhead.


Ah, it's that topic again. The whole point of "refcounting requires more maintenance per object than tracing" is totally moot when only ~0.01% of objects are refcounted (in Rust or C++) versus 100% of objects traced by the GC in Java.


I never claimed otherwise. Java's and Rust's overall memory management strategies are far too different to compare. I was talking specifically about refcounting vs tracing.


But what you call promptness is predictability from the PoV of the programmer, no?

I don't think I've mentioned performance anywhere. Thanks for the paper, I'll take a look.


> But what you call promptness is predictability from the PoV of the programmer, no?

I don't think so. You don't know when memory will be freed or how long that will take, only that it will be freed as soon as possible. The end result is a lower footprint, but that's about it.


Rust's lifetimes are a strict subset of the general case, but they cover the vast majority of use cases with affine types, which means trivial allocation and deallocation. The number of times I've had to use lifetime specifiers or dynamic lifetimes (RefCell/Rc/Arc) is extremely low, so low that it is not even a blip on the profile. Also, unlike most GC languages, Rust favors stack allocation over heap allocation, which makes most Rust programs' allocation rates several orders of magnitude lower than any Java program I've seen (at the expense of more inter-stack copying, but that is extremely cheap because the top of the stack is almost always in L1 cache).
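What "trivial allocation and deallocation via affine types" looks like in practice (a minimal sketch; `sum` is a made-up helper): the single owner is moved into the callee, and the compiler inserts the free at a point known entirely at compile time.

```rust
// `v` has a single owner; passing it by value moves it.
fn sum(v: Vec<u8>) -> u32 {
    v.iter().map(|&b| u32::from(b)).sum()
    // `v` is dropped here: the compiler inserts the free statically,
    // with no refcount and no GC involved
}

fn main() {
    let buf = vec![1u8, 2, 3]; // one heap allocation, one owner
    let total = sum(buf); // `buf` is moved; using it afterwards
                          // would be a compile-time error
    assert_eq!(total, 6);
}
```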


In my experience most applications do need an escape hatch from a strictly tree-like lifetime pattern, it really is limiting.

Object layout is the primary reason for the performance differences between low-level languages and managed ones. Java uses a thread-local allocation buffer, which is for all practical purposes just as hot as the stack. What is more expensive is arrays of references and the like, but that will hopefully be solved by value types eventually.

EDIT: What I’m trying to say is that there is very little overhead to creating short-lived objects in Java.


> EDIT: What I’m trying to say is that there is very little overhead to creating short-lived objects in Java.

I've been working on high-performance Java software (databases) for many years now, and the overhead of creating short-lived objects is bearable only because we go to great lengths not to create those objects, writing Java in a kind of C style: few objects, mostly primitives wherever possible.

The total overhead is not just bumping the pointer and zeroing memory, but everything that happens when the GC runs out of the slab. By allocating short-lived objects you make the GC collect earlier and do more work overall. Also, the faster you allocate, the more objects survive each collection and the more objects must be copied. That process is very expensive and also very complex.



