snaury's comments

snaury · on Sept 24, 2023

Unfortunately in 2023 c++20 coroutines are just barely starting to become usable. Compilers have bugs, especially with symmetric transfer, e.g. msvc doesn't suspend until await_suspend returns (which causes races and double-frees), gcc stops using tail calls in some modes (which can cause stack overflow), I ended up emulating symmetric transfer because you can't rely on it, and emulating it is of course slower. Compilers spill local variables from await_suspend to coroutine frames, which again causes races, which necessitates workarounds like marking await_suspend noinline (in a non-portable way). Having noinline anything in the code path disables heap allocation elision, so all coroutines start to allocate from the heap. When comparing a function call to a coroutine call there's a vast difference in performance. And you need to use coroutines all the way down, which really adds up.

Fibers have their own problems: you need to allocate stacks (and that may be expensive), you can't switch fibers between system threads (compilers cache thread local addresses, unfortunately with x86 linux tls often uses fs: segment addressing, so it may appear to work until it doesn't, and weirdly this problem existed in c++20 coroutines too until recent clang versions), widely used open source implementations (e.g. boost context / boost coroutine / boost fiber) are not even exception safe, because you need to save/restore exception globals when switching fibers, and nobody seems to care. :-/

One huge upside to fibers is that a function call is just a normal function call, and it's fast as a result. I wish c++ taken fibers more seriously and added steps for making them safe to use (portable way to handle global state, portable way to mark switching functions as invalidating tls addresses, etc.), we could have something similar to java virtual threads or go goroutines then.

gpderetta · on Sept 24, 2023

> exception safe, because you need to save/restore exception globals when switching fibers,

Do you have more details about this issue? What exception globals? And for which ABI? You mean some sort of current pending exception when switching during unwinding?

[I'm a huge fan of stackful coroutines and I wish were blessed by the standard]

snaury · on Sept 24, 2023

C++ ABI has some per-thread globals: the number that is returned from std::uncaught_exceptions(), and the chain of currently caught exceptions. For example in llvm this is available with a cxa_get_globals call:

https://github.com/llvm/llvm-project/blob/b05f1d93469fbd6451...

These need to be saved/restored when switching fibers, otherwise fiber switches from catch clauses (and destructors!) are unsafe, throw without argument may rethrow incorrect exception, code that commits/rollbacks based on uncaught exceptions counter will not work correctly, etc.

One example I know where this save/restore is implemented is the userver framework, but it seems to be unexpectedly rare in fiber implementations last time I looked.

gpderetta · on Sept 24, 2023

Thanks. It is unsurprising that most libraries punt on that. It seems that switching this state would easily dominate over the cost switching. IIRC on gcc the current exception is protected by a global mutex, at least until recently.

It is another reason for having this built into the language/standard library so that it can be implemented optimally.

indigoabstract · on Sept 24, 2023

Yes, in my (limited) use case with coroutines for a recursive descent generator I got it working by following this tutorial:

https://www.scs.stanford.edu/~dm/blog/c++-coroutines.html

but I was unpleasantly surprised by much extra code it required to make it run.

The Tiny Fiber library in the article looks pretty hacky right now and the amount of x64/arm specific code might cause trouble with portability, but the concept itself is intriguing. Thank you for the details.

snaury · on Sept 11, 2023

When you protect an std::deque with a mutex you would need at least two atomic operations: to lock the queue before pushing, and to unlock the queue after pushing. Because you're using an std::deque it may need to allocate memory during a push, which would happen under the lock, which makes it more likely for a thread to suspend with the lock taken. While the queue is locked other threads will have to wait, possibly even suspend on a futex, and then the unlocking thread would have to wake another thread up.

The most expensive part of any mutex/futex is not locking, it's waking other threads up when the lock is contended. I'm actually surprised you only get 10 million messages per second, is that for a contended or an uncontended case? I would expect more, but it probably depends on the hardware a lot, these numbers are hard to compare.

My actor framework currently uses a lockfree intrusive mailbox [1]_, which consists of exactly two atomic exchange operations, so pushing a node is probably cheaper than with a mutex. But the nicest part about it is how I found a way to make it "edge triggered". A currently unowned (empty) queue is locked by the first push (almost for free, compared to a classic intrusive mpsc queue [2]_ the second part of push uses an exchange instead of a store), which may start dequeueing nodes or schedule it to an executor. The mailbox will stay locked until it is drained completely, after which it is guaranteed that a concurrent (or some future) push will lock it. This enables very efficient wakeups (or even eliding them completely when performing symmetric transfer between actors).

I actually get ~10 million requests/s in a single-threaded uncontended case (that's at least one allocation per request and two actor context switches: a push into the target mailbox, and a push into the requester mailbox on the way back, plus a couple of steady_clock::now() calls when measuring latency of each request and checking for soft preemption during context switches). Even when heavily contended (thousands of actors call the same actor from multiple threads) I still get ~3 million requests/s. These numbers may vary depending on hardware though, so like I said it's hard to compare.

In conclusion it very much depends on how lockfree queues are actually used, and how they are implemented, they can be faster and more scalable than a mutex (mutex is a lockfree data structure underneath anyway).

I'd agree with you in that mutexes are better when protecting complex logic or data structures however, because using lockfree interactions to make it "scalable" often makes the base performance so low, that you'd maybe need thousands of cores to justify the resulting overhead.

.. [1] https://github.com/snaury/coroactors/blob/a599cc061d754eefea... .. [2] https://www.1024cores.net/home/lock-free-algorithms/queues/i...

snaury · on Sept 11, 2023

The lockfree scheduler certainly looks interesting (especially linearizability of event broadcasts), but I was surprised to see benchmark results in the paper with a peak of 43500 messages/s for 12 pairs of actors (and 12 cores?), with a graph showing ~5000 messages/s for a single core, which is surprisingly low for that kind of benchmark. Unfortunately the engine requires linux and more importantly x86 (due to asm intructions) so I wasn't able to replicate them yet, but I would expect at least ~1 million requests/s per a pair of actors (e.g. with Erlang), otherwise the overhead is prohibilitely low.

The engine also focuses on message passing, but from experience it's very difficult to work with (state machines are hard, especially when working with multiple downstream actors), and at the core actors are more about isolating state without locks than message passing. Swift actors did it right in my opinion, method calls instead of messages are not only easier to reason about, they give additional hints to the runtime when context may switch without involving a scheduler at all (any shared state is slow and inhibits scalability).

I actually wrote a header-only library recently (search for "coroactors" if you're interested) that implements something similar to Swift actors with C++20 coroutines, and I thought ~10 million requests/s (when uncontended) or ~1-3 million/s (when contended and depending on a scheduler) was a way too high of an overhead, especially when compared to normal method calls with shared state protected with mutexes. Coroutines tend to go viral (with more and more functions becoming "async" coroutines), and any non-trivial code base would have a lot of coroutine calls (or messages passed), and that overhead needs to be as low as possible, otherwise you'd spend more time task switching than doing useful work.