Crazy to me that garbage collection systems are a subject of continual evolution and discussion. You'd think it would be "solved" by now... Or is that just in Java world?
Not an expert, but bear in mind that there can't really be one single thing called a 'good' garbage collector, because the behaviour of the programs that produce the garbage can be hugely varied, the garbage so produced can likewise be hugely varied, and how you want to recover it can depend on the platform you're running on (if embedded, you probably want reference counting given the tight memory; in other circumstances you might be very happy just to leak everything and let the OS clean up at the end. Edit: yes, it's sometimes ok to leak).
As somebody says elsewhere, it's a multidimensional issue and a single-dimensional garbage collector can't be a brilliant match for everything.
The book "Garbage Collection" by Richard Jones, recently discussed on HN, is something I'd highly recommend you buy and read.
It’s just as “solved” as regular memory allocators are for C/C++ — it’s constantly evolving, with many competing solutions. TBB, tcmalloc, jemalloc, mimalloc, etc etc etc.
Memory management is hard, and in a world where memory sizes keep increasing, it's expected and perfectly normal that new techniques keep being developed.
I'm not sure what "solved" would mean for GC. Can't there always be improvements in maintainability, pause duration, correctness, throughput, etc.? Many dimensions to continually improve on.
Unless someone comes up with a no-pause, no-compute-power, no-code, fully correct GC?
There are some dimensions yes, but I do wonder if GC research is starting to approach the point of being "solved" well enough that we'll see a drop in investment. Not yet, clearly, but approaching that territory.
If you look at the very latest JVMs you have three GCs which are spread over throughput-optimized (Parallel), a middle-ground jack of all trades (G1), and latency-optimized (ZGC). They're a lot more self-configuring than in the past, so no matter what needs your app has, it's usually just one CLI switch to get something fairly optimal.
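For concreteness, the kind of switch being talked about looks roughly like this (these are the standard HotSpot selection flags; app.jar is just a placeholder, and which collector is the default depends on the JDK version):

    java -XX:+UseSerialGC   -jar app.jar   # tiny heaps / single-core machines
    java -XX:+UseParallelGC -jar app.jar   # throughput-optimized
    java -XX:+UseG1GC       -jar app.jar   # middle-ground default
    java -XX:+UseZGC        -jar app.jar   # latency-optimized

Everything beyond that one flag is largely left to the collector's own heuristics.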
But ZGC is going generational, at which point it will have significantly higher throughput than before. Also, all the JVM GCs can handle massive heaps these days (like, terabyte scale). So at that point the choice might become even simpler: just use ZGC all the time unless you can tolerate higher pauses and want to reclaim some CPU for the app, in which case go with parallel. And in many cases the right tradeoff between latency and performance could be inferred automatically from the frameworks and libraries that are on the classpath, meaning that for most apps even needing to make that choice could go away.
If you have a GC that doesn't pause, can handle terabyte sized heaps, has great throughput, is largely or completely self-configuring, is portable, open source, supports dozens of languages and has clean code, there are really very few metrics left to optimize after that. It seems like HotSpot is nearly there, especially when combined with GraalVM for language support.
> If you have a GC that doesn't pause, can handle terabyte sized heaps, has great throughput, is largely or completely self-configuring, is portable, open source, supports dozens of languages and has clean code
That's basically ZGC Generational + GraalVM. So, that's code that's available now, albeit you'd have to compile it from version control (GraalVM stable releases don't support ZGC yet, and released ZGC isn't generational yet). Give it another year or so and that'll be shipping out of the box in ordinary JVMs.
The hardware still evolves, so at the least, garbage collectors will have to be tuned or self-tune themselves for that.
I also think it’s unlikely that a GC can be written that handles both terabyte sized heaps on machines with hundreds of cores and megabyte sized heaps on single-CPU systems well.
Single GC, no. Single VM, yes. HotSpot serial GC is designed for the latter use case. I forgot to mention it before but it's still there and supported, I think even selected automatically in some cases.
But there aren't so many single core systems with megabyte sized heaps anymore. Even watches are multi-core these days.
Are there any production-ready GCs that have strictly bounded latency? I'm not aware of any, and that means that GC is still too immature for use in about 95% of the computers in my house, the deeply embedded ones.
Metronome did this for embedded hard realtime use cases. I don't know much about it. ZGC has bounded pause latency at least but it's not really meant for very small machines. Android has used a pauseless GC for a while.
Strictly bounded anything (hard real-time) is an entirely different category, with very few uses for the average person - and contrary to what one might think from its name, it is not “fast” at all. It’s mostly about “calculating the trajectory of that incoming projectile should definitely happen in 0.3s”, which is many orders of magnitude more than what a modern computer would take to do 3 multiplications.
But to answer your question, there is JamaicaVM and many niche JVM implementations with hard real time GCs — but it’s simply false that a GC would be somehow a problem for most of your devices.
Hard real-time has a huge amount of uses for the average person. I just made toast -- the software-controlled thermostat in my toaster oven has (slow) hard bounds on its sense-to-control latency to avoid dangerous overheating. Before that, I brushed my teeth -- the brushed motor controller detects a motor stall by increased current if I brush too hard and stall it, shuts down the motor to avoid overheating (with a hard real-time sense-to-actuate constraint), and beeps to inform me. Then I spent some time on HN, with a laptop with PMICs and other microcontrollers that have hard real-time constraints all over the place. By far the majority of my computer interactions so far today have been with computers that have hard real-time constraints, and I haven't even gotten in my car yet!
I think it was a tongue in cheek reply, as it is indeed a no-code, no-pause, no-compute power GC that does eventually reclaim the garbage (at process exit).
It’s meant for short-lived apps where you simply wouldn’t get an OOM either way. I don’t know of an objective definition of what a GC is; one I quite like is “gives the illusion of infinite memory”, and by that definition it isn’t a GC.
But there are also conservative GCs, RCs, etc., that fail other definitions (refcounting leaves circular references behind, conservative GCs have false positives). Like many things in CS, there isn’t an objective definition.
Theoretically someone could perhaps one day discover the theoretically optimal GC algorithm, and someone else could write the TeX of GCs using that algorithm, and then perhaps we'd have the final GC.
Extremely unlikely that such an algorithm exists, of course, and even if it did, looking at sorting quickly suggests that improvements will keep being found even after the best asymptotic complexity has been achieved.
Lots of things in CS are actually cyclic: RAM is expensive, so implement swapping to HDD. RAM gets cheaper, so load all the data there and instead optimize your code for the CPU. Data gets bigger, so you need to go back to swapping again. Suddenly we get SSDs and drive access is a lot faster, so invent something new but similar to the old stuff.
And also in system design, computer architecture and engineering. T. H. Myer and I. E. Sutherland coined this cyclical behaviour the "Wheel of Reincarnation" in their paper "On the Design of Display Processors." IMHO a paper that is still well worth reading:
Optimal memory behavior and lifetime is highly application specific, so the behavior of an automatic system likewise will be better for some applications than others.
In Java, the goal is to have something that performs reasonably well, without degenerate cases, for a wide variety of applications and services. It also has to handle cases where a monolithic code base results in a multitude of different memory behaviors. Requiring application code to think too hard about ideal memory behavior would itself be a negative impact.
The research field of garbage collectors is a study of trade-offs.
A highly tuned, application-specific memory strategy will likely outperform Java, but the goal is to not require anyone to build such a thing for most software.
> You'd think it would be "solved" by now... Or is that just in Java world?
I'm not sure what you mean but I'll try to relate. Automatic garbage collection is a problem not just in the Java world. It's also needed in JavaScript, Python, Go, Haskell, etc.
Citation needed, but I suspect the Java Virtual Machine ecosystem has the most advanced GC algorithms in the world. Even if that's not true, I can tell you with certainty that other language environments have adopted features many years after Java did, such as JavaScript V8 introducing concurrent generational copying in 2017 ( https://v8.dev/blog/orinoco-parallel-scavenger ). I'm sure there are examples in Go as well where its GC gained features that Java had a decade earlier.
Memory management for C is not itself a solved problem. Not only is there a lot of performance to squeeze out of malloc itself (the benchmarks at https://github.com/microsoft/mimalloc exemplify the variance between implementations), but it's up to the programmer to implement memory management in the large in an efficient way, which is not an easy task. One sure mark of a slow C program is one with a ton of mallocs and frees strewn all over.
If we can know exactly where a particular object’s lifetime ends, then indeed we can write more efficient programs than possible with a GC, but that in the general case is not something we can determine at compile time (Rust’s lifetimes are a strict subset of the general case).
Anything more dynamic will require some “ad hoc, informally-specified, bug-ridden, slow implementation of a GC”.
Rust also relies on a GC for the more dynamic lifetimes in Rust programs (reference counting), it just doesn't call it a GC, and while Rust's GC has a lower footprint than tracing GCs, it has worse performance and it suffers from cycle problems. I wouldn't call it "bug ridden" (perhaps "limited" is a better term), but it is slow.
Some think that reference counting gives better performance predictability than a tracing GC, but that is no longer true, certainly compared to the generational ZGC presented here. If you could determine the time when collection would take place with a refcounting GC, i.e. when the last reference is cleared, you wouldn't need a refcounting GC. It is unpredictable pretty much by definition.
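To illustrate the cycle problem mentioned above, here's a minimal sketch of naive reference counting written as a hypothetical counted node in Java (RcNode is invented purely for illustration; Java's own GC is tracing and reclaims such cycles just fine):

    // Hypothetical illustration: with naive refcounting the two counts below
    // never reach zero, so neither node would ever be freed.
    final class RcNode {
        final String name;
        int refCount = 1;      // one reference held by the creator
        RcNode other;          // a counted reference to another node

        RcNode(String name) { this.name = name; }

        void pointTo(RcNode target) {
            target.refCount++; // taking a new counted reference
            this.other = target;
        }

        void dropExternalReference() {
            refCount--;
            // the count stays at 1 because of the reference held by the other node
        }

        public static void main(String[] args) {
            RcNode a = new RcNode("a");
            RcNode b = new RcNode("b");
            a.pointTo(b);
            b.pointTo(a);               // cycle: a -> b -> a
            a.dropExternalReference();
            b.dropExternalReference();
            System.out.println(a.refCount + " " + b.refCount); // prints "1 1": leaked
        }
    }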
> Rust's GC (...) has worse performance [than tracing GC]
That's an unsubstantiated claim.
While reference counting may be slower than tracing in the general case, Rust's reference counting is not the same thing as general reference counting in languages like Swift or Python. There are two reasons:
- The vast majority (99%+) of objects in Rust don't need to be managed by reference counting at all.
- Rust relies heavily on move semantics to avoid updating the ref-counts, so the vast majority of refcounted objects see very few (sometimes just one) refcount updates.
> reference counting gives better performance predictability than a tracing GC, but that is no longer true, certainly compared to the generational ZGC presented here.
Where does ZGC claim nanosecond latency?
But even if it did, it would still be worse, because refcounting doesn't do pauses at all. A pause means stopping all application threads. Spending X nanoseconds freeing memory on one thread does not count as a pause, because the system can still make progress.
But reference counting is just a tiny subcomponent of Rust's GC. The majority of Rust's automatic memory management is based on affine types and static code analysis, not refcounting. You're just picking subcomponents to make the Java solution look better, but what matters for developers is the whole picture.
And BTW, I wouldn't be so sure about throughput either.
ZGC is capable of allocation rates of the order of only ~5 GB/s, which is way below the theoretical memory bandwidth. I bet jemalloc can do much better.
I thought I made it clear in my last comment that I'm not claiming that Java's memory management is somehow "better" than Rust's. I was merely pointing out that even Rust has a GC, and it's not a good one, but it might certainly be good enough for Rust's needs as it doesn't depend on it. Overall, obviously Rust and Java choose very different tradeoffs in their memory management strategy.
> ZGC is capable of allocation rates of the order of only ~5 GB/s
That was the non-generational ZGC. Indeed, ZGC's max throughput was well below that of ParallelGC and G1. One of the main goals of generational ZGC is to increase the throughput, and as you can see in the video, they got a 4x increase in throughput.
The statement was that ZGC (or any JVM GC for that matter) is better than Rust’s GC, which is just naive ref counting. This is true - if we were to create a program that allocates the same amount in both implementations, Java would win by a huge margin.
Have you ever tested it yourself? I tested it and no, it didn't win by a huge margin. malloc+free was only a tad slower than new with the default Java GC (stop-the-world, parallel), but the default GC is much faster than G1 and ZGC, so I would not be surprised if malloc could beat those modern ones in an allocation-rate benchmark.
Do you have a public benchmark for that? It’s quite hard to write a benchmark that actually measures what we are interested in. Not doubting you, I would just be interested.
It does give better performance predictability! Decrementing a refcount may be either nearly free or expensive, but crucially the work can happen only during that particular call, nowhere else.
It does not. Maintaining the counter requires more work than tracing GCs do, and you don't know in advance when the last reference is dropped. True, you get more promptness in deallocation (not the same as predictability), which reduces the footprint but comes at the expense of compaction, the lack of which in most refcounting GCs also adds unpredictability. In contrast, in the common case, allocation with a compacting, tracing GC is a pointer bump and there's no overhead at all to copying references or other memory operations. During collection cycles, GC barriers are triggered, which add overhead to things like mutating pointers or reading pointers from objects in the heap.
There's a great paper (linked from this post https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/unifie...) showing how refcounting GCs can match the performance of tracing GCs -- in fact, how they can each have the properties of the other -- but that requires sophistication in the implementation of refcounting GCs that most of them don't have. So it is true that a sophisticated refcounting GC and a sophisticated tracing GC can exhibit similar behaviour, but while OpenJDK's tracing GCs are very sophisticated (especially ZGC and G1), most refcounting GCs in the wild are quite unsophisticated. It may certainly be true that unsophisticated refcounting GCs have more desirable properties than unsophisticated tracing GCs, but in practice the choice offered is between unsophisticated refcounting GCs and very sophisticated tracing GCs. The latter beat the former on most counts except for footprint overhead.
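To make the "pointer bump" point above concrete, here is a rough conceptual sketch in Java of what a thread-local allocation buffer's fast path amounts to (this is a model for illustration, not HotSpot's actual code):

    // Conceptual TLAB-style fast path: allocation is one addition plus a bounds
    // check; the slow path (refill or collection) only runs when the buffer is full.
    final class BumpAllocator {
        private long top;       // next free "address" in the buffer (simulated)
        private final long end; // end of the buffer

        BumpAllocator(long start, long size) {
            this.top = start;
            this.end = start + size;
        }

        // Returns the start of the new object, or -1 to signal the slow path.
        long allocate(long size) {
            long newTop = top + size;
            if (newTop > end) return -1;
            long obj = top;
            top = newTop;
            return obj;
        }
    }

The refcounting counterpart has to touch a counter on every ownership change, which is the asymmetry the comment above is describing.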
Ah, it's that topic again. The whole point "but refcounting requires more maintenance per object than GC" is totally moot when there are only 0.01% of objects being refcounted (in Rust or C++) vs 100% objects traced by GC in Java.
I never claimed otherwise. Java's and Rust's overall memory management strategies are far too different to compare. I was talking specifically about refcounting vs tracing.
> But what you call promptness is predictability from the PoV of the programmer, no?
I don't think so. You don't know when memory will be freed or how long that will take, only that it will be freed as soon as possible. The end result is lower footprint, but that's about it.
Rust's lifetimes are a strict subset of the general case, but they cover a good majority of use-cases with affine types - which means trivial allocation and deallocation. The number of times I had to use lifetime specifiers or dynamic lifetimes (RefCell/Rc/Arc) is extremely low, so low that it is not even a blip on the profile. Also, unlike most GC languages, Rust favors stack allocation over heap allocation, which makes most Rust programs' allocation rates several orders of magnitude lower than those of any Java program I've seen (at the expense of more inter-stack copying, but that is extremely cheap because the top of the stack is almost always in L1 cache).
In my experience most applications do need an escape hatch from a strictly tree-like lifetime pattern, it really is limiting.
Object layout is the primary reason for the performance differences between low-level languages and managed ones — Java uses a thread-local allocation buffer, which is for all practical purposes just as hot as the stack. What is more expensive is arrays of references and the like, but that will hopefully be solved by value types at some point.
EDIT: What I’m trying to say is that there is very little overhead to creating short-lived objects in Java.
> EDIT: What I’m trying to say is that there is very little overhead to creating short-lived objects in Java.
I've been working on high-performance Java software (databases) for many years now, and the overhead of creating short-lived objects is bearable only because we go to great lengths not to create those objects, writing Java in a kind of C style: no objects, just primitives as much as you can.
The total overhead is not just bumping the pointer and zeroing memory but everything that happens when the GC runs out of the slab. By allocating short-lived objects you're making the GC collect earlier and do more work overall. Also, the faster you allocate, the more objects survive the collection and the more objects must be copied. That process is very expensive and also very complex.
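As a hypothetical illustration of that style (all names invented for the example): instead of allocating an object per reading, the data is packed into primitive arrays, so the hot path allocates nothing at all.

    // "C-style Java": the add() hot path performs no allocation, so it creates
    // no short-lived garbage and no extra GC pressure.
    final class Readings {
        // The object-per-reading alternative would be something like
        // record Reading(long timestamp, double value) {} allocated per sample.

        private final long[] timestamps;
        private final double[] values;
        private int count;

        Readings(int capacity) {
            this.timestamps = new long[capacity];
            this.values = new double[capacity];
        }

        void add(long timestamp, double value) {
            timestamps[count] = timestamp;
            values[count] = value;
            count++;
        }

        double valueAt(int i) {
            return values[i]; // no boxing, no object headers, cache-friendly layout
        }
    }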
This can be an interesting inquiry. Sad that you have to ruin it with snark.
I guess in some sense it was "solved" more than a decade ago in Azul's C4 (Continuously Concurrent Compacting Collector). I wonder how a generational ZGC will compare against it.
Java was released in 1995. I wonder if you have ever updated any software that came out before then? Surely you don't update your OS. Windows/Linux/Mac all came out before 1995, and must surely be "solved" by now. In fact you probably still use Windows 95. Surely Windows 98 or any later version can't be any better?
What's crazy to me is the sort of cognitive dissonance that you must have to hold a view like that, and still apply software updates at all.
Since garbage-collection-based memory models are much more complex and require a lot of software engineering, it is not a surprise that we have many iterations.
That's backwards. One proof of Rice's theorem relies on the fact that, if you could decide any interesting property of programs in general, then you could solve the halting problem; it doesn't follow that, if you could solve the halting problem, then you could decide any interesting property.
It happens to be true that, if you could solve the halting problem, then you could decide any interesting property of programs; but this is simply because the antecedent is false.
Being able to turn an interesting property solver into a halting problem solver is not the same thing as being able to turn a halting property solver into an interesting problem solver.
My claim, or at least my intent, was that you can solve an interesting-property decider about as well as you can solve the halting problem, i.e. not in a satisfying way (no false negatives or false positives, even with unlimited iterations).
You can solve it to some extent (e.g. does the program terminate within X steps).
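A sketch of that decidable variant, assuming the program is modelled as a step function (all names here are invented for illustration):

    import java.util.function.BooleanSupplier;

    // Run a computation for at most maxSteps steps. "HALTED" is a real answer;
    // "UNKNOWN" says nothing about what happens beyond maxSteps, which is why
    // this doesn't contradict the halting problem.
    final class BoundedChecker {
        enum Verdict { HALTED, UNKNOWN }

        // `step` performs one step of the program and returns true once it has finished.
        static Verdict haltsWithin(BooleanSupplier step, long maxSteps) {
            for (long i = 0; i < maxSteps; i++) {
                if (step.getAsBoolean()) return Verdict.HALTED;
            }
            return Verdict.UNKNOWN;
        }
    }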
But this is vacuous and unrelated to the halting problem or Rice's Thm. A program that emits "Yes" on every input will decide some property of programs with zero false negatives.
"Interesting property", as defined by Rice's Thm, is "any semantic property that is not either always true or always false for all program executions." It does not mean "interesting" as in "interesting to humans."
And I'm talking about the decider, not the program being fed as input.
I'm not a computer scientist, but I am aware Interesting doesn't mean interesting to humans.
> And I'm talking about the decider
EDIT: If by decider you mean the program being tested for having property P, that's a particular case and not applicable in general, ergo not important (a program that's a NOP can be shown to leak no memory as well).
So the property P you are testing always returns true? Then it's not interesting per your definition. Leaking memory however is interesting.
No that's not what I mean by decider. The decider is the program that takes the other program as input and tells you whether that input program has the semantic property you want.
The only non-interesting semantic properties of programs are the trivial ones: those that hold for every program or for no program.
> A program that emits "Yes" on every input will decide some property of programs with zero false negatives.
Followed by
> any semantic property that is not either always true or always false for all program executions
So what you said is your decider always outputs true. That's not an interesting property, because it's true for all inputs.
Btw what is the point of your argument? It still doesn't refute my initial statement. If you want to determine an interesting property of a program, like leaks, automatic lifetime calculation, or whether the output is the square of the input, you run into Rice's theorem, which makes the problem undecidable. In other words, it's impossible to get a correct yes and no algorithmically.