Don't do that. good data design and architecture is always needed.

rbehrends · on March 31, 2024

1. Such data structures (or more generally, std::vector<std::string, std::vector<std::string>> or something like that) are the natural way to represent e.g. dictionaries. So, you absolutely often need to do that and "don't do that" doesn't help here.

2. This general issue extends to virtually all collections. The idea that you should avoid large collections is not a practical solution to real world problems.

3. An alternative solution would be lazy destruction, but that comes with its own issues, such as a really bad worst-case memory overhead or making RC sufficiently more complex that it's not really a win over tracing GC anymore [1].

[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...

pkolaczk · on March 31, 2024

It’s not an issue in practice because, contrary to tracing GC, it doesn’t affect the other threads. So it doesn’t really count as pausing.

rbehrends · on April 1, 2024

It would then follow that a stop-the-world collector is A-OK for single-threaded programs, because that doesn't count as pausing?

pkolaczk · on April 1, 2024

Still no, because STW GC can pause innocent code in arbitrary unexpected moments.

If you’re pausing the thread because the thread is doing some work eg calling into system to do I/O, this is not considered a pause (even though technically it is, as the thread may be simply paused and waiting for data) but rather just the cost of the operation. If the thread is being interrupted and paused because some unrelated operation has to run - now that’s bad and considered a true pause.

rbehrends · on April 1, 2024

> Still no, because STW GC can pause innocent code in arbitrary unexpected moments.

Well, yes, that's the problem, isn't it? And that's the point I was making. Pauses in the main thread absolutely count for latency purposes. Note that in most stop-the-world GCs you also can have pretty good control over when the GC is invoked, and that doesn't make things better.

The idea that you can always predict when cascading deletes happen in RAII code is also misleading. They can easily be hidden behind an abstraction barrier. Do you exactly know what's happening under the hood in all third-party libraries that you use, just for an example?

> If you’re pausing the thread because the thread is doing some work eg calling into system to do I/O, this is not considered a pause

In real-time scenarios (soft or hard), it absolutely is. "Regular" operations are absolutely part of your latency budget, and if "regular" operations can exceed it, you absolutely have a problem. See e.g. pretty much any of Gil Tene's talks on the subject.

pron · on April 3, 2024

The kernel (as long as it's not realtime) also pauses innocent code in arbitrary, unexpected moments, and modern concurrent collectors like ZGC pause such "innocent" code for no longer than the OS does. These synchronisation pauses are of roughly constant duration (regardless of working set/heap size) and overall introduce unexpected latency to a lesser degree than the freeing work done by refcounting collections.

bluGill · on March 31, 2024

There is a lot in that statement and it deserves challenge. Note two things: there is no universal right answer here, sometimes you are correct, sometimes you are wrong; my responses will be contradictory (and also contradict things I didn't think to write) each project will need to figure out which of the combinatorical points is important their project to come up with the right also. Not doing this is premature pessimization - getting your data structures wrong up front will force slow runtime on you in a way that no micro-optimization (which is what premature optimization is really referring to) can ever fix and getting out will cost a very expensive rewrite.

> 1. Such data structures (or more generally, std::vector<std::string, std::vector<std::string>> or something like that) are the natural way to represent e.g. dictionaries.

They are natural, and easy but that doesn't mean they are the right way.

Often I find with a little thinking that there is a enum key under it all that I can hard code - and the enum key is also type safe so that the compiler can prove my code is correct against some class of bugs in a way a string as key cannot.

Why are your keys and values std::string which (baring small string optimization) will allocate more? Often you can place a maximum length on one of both that is small enough that you can replace std::string with a struct containing a char array (or std::string_view - I still have to support C++14 so I haven't been able to try it) and avoid a lot of allocations.

Failing that, often the correct data structure is a database (I reach for sqlite first, but there are plenty of options with pros and cons) which is fast lookup, allows for more complex structures easially, it persists the data to disk so that next startup I don't have to spend all the time reconstructing all that data (realistically performance of creating that large data dictionary is of far greater concern that getting rid of it).

> 2. This general issue extends to virtually all collections. The idea that you should avoid large collections is not a practical solution to real world problems.

This article is about systems programming. In systems programming you rarely have such a large dictionary in that format. You often will in something like the filesystem, but there the point is to have the data persisted from disk and typically you only read the subset you care about. You may have a copy in cache (and cache invalidation becomes an issue), but the in memory portion of data itself is never in a dictionary of that format.

> 3. An alternative solution would be lazy destruction, but that comes with its own issues, such as a really bad worst-case memory overhead or making RC sufficiently more complex that it's not really a win over tracing GC anymore [1].

That is one option, there are more.

Intentionally leak that whole dictionary. There is no reason to care if all the destructors run for any of the examples you presented and realistically if you are getting rid of the whole thing you are probably in shutdown anyway so who cares if you leak memory? Many systems programs and libraries do this.

Move the data to a different thread that only handles reclaiming memory and then don't worry about it. (in discussion we talk about making that thread idle time priority so it won't affect performance - but in practice all schedulers I'm aware of do this poorly in some way)

Finally, if after reading all that and consideration all your options (including things I didn't think of!) if you still conclude a large dictionary of strings to strings is your correct answer, then you probably should demand good tracing garbage collector. I'm not against using them where they are appropriate - I'm just arguing that if you think about your data a little more you will discover they are not needed nearly as much as it seems - and this time spending thinking is usually worth it because it results in much faster programs anyway (even if there is one large dictionary in the code how many others did you get rid of?).

rbehrends · on April 1, 2024

> They are natural, and easy but that doesn't mean they are the right way.

> Often I find with a little thinking that there is a enum key under it all that I can hard code - and the enum key is also type safe so that the compiler can prove my code is correct against some class of bugs in a way a string as key cannot.

There's a fundamental misunderstanding here, I think. By dictionaries I mean language dictionaries, e.g. English -> French. You won't find an underlying "enum key" here. Nor am I sure how an enum would ever get large.

> This article is about systems programming. In systems programming you rarely have such a large dictionary in that format.

First, the article is, but the subthread that started this discussion was more general than that.

Second, even in systems programming, you will commonly have arrays or other collections. Caches are a very common thing in systems programming, after all.

> That is one option, there are more.

It is trivially true that there are alternative options, but all these are workarounds around the problem, i.e. you have to complicate your design or make silent assumptions that may not hold in the future (e.g. when disposing of memory at process exit does no longer work because you now need the code as a library).

bluGill · on April 2, 2024

> By dictionaries I mean language dictionaries, e.g. English -> French. You won't find an underlying "enum key"

That is a niche case not a common one though. There are a lot of niches each either different requirements.

>It is trivially true that there are alternative options, but all these are workarounds around the problem, i.e. you have to complicate your design

This attitude of non-systems programmers is why people argue garbage collection is slow. Garbage collection can be faster, but people who work in garbage collected languages think that solves all problems and so they don't build efficient data structures. Sure they are not leaking memory, but garbage collection is not enough more efficient than reference counted as to make up for thousands of destruction's when the more complex reference counted version only had a couple. Yes the code is more complex, but it is also faster and that is a trade off I'll take.

rbehrends · on April 3, 2024

> That is a niche case not a common one though.

It was an example meant as an illustration. The general case of "collection of objects" is hardly niche.

> This attitude of non-systems programmers is why people argue garbage collection is slow.

I started programming writing Z80 assembler in the 1980s, counting T-states in order to make hard realtime code work. I wrote a graphics driver for Atari ST/TT hardware not to soon after. I think I have a pretty good idea what working in a real-time and/or resource-constrained environment means.

> This attitude of non-systems programmers is why people argue garbage collection is slow.

That is an incorrect generalization. In fact, I see plenty of inefficient code in C++ and Rust (e.g. because a lot of the workarounds for not having GC require additional copying).

> Sure they are not leaking memory, but garbage collection is not enough more efficient than reference counted as to make up for thousands of destruction's when the more complex reference counted version only had a couple.

This is some really unclear statement. If you're trying to make a connection between absence of (tracing) GC and having value types, they are not inherently related. You can have tracing GC and value types (e.g. C#) or reference counting and a lack of value types (e.g. Python).

What is true is that in general memory allocation is relatively expensive, so you want to avoid unnecessarily allocating objects, but that's true regardless of whether you use malloc()/free() or a garbage collector and the strategies for dealing with that are the same in both cases.

> Yes the code is more complex, but it is also faster and that is a trade off I'll take.

Again, this is an untrue generalization.

neonsunset · on April 3, 2024

C# has structs with generics/explicit layout/object references/fixed buffers/etc. and byrefs which are special pointers that can point to managed memory, stack or unmanaged one yet are still tracked by GC despite allowing pointer arithmetic on them, and regular C pointers yet successfully uses GC.

Much damage has been dealt to non-JVM languages by stereotypes and perception that the way JVM languages work is the way all GC languages work.