I think the schedulers have a few bugs. The multi-core one is the scheduler under:
> The simplest way to achieve multi-core scheduling, is to do exactly as before. Having a global queue of tasks that are ready to run, and run them once a core is ready:
I also think the deadline scheduler has a bug: sometimes the pink blob will just skip the queue entirely. It seems like its deadline is "0s away" until it has moved all the way to the queue, so if it is mid-move and the CPU becomes available, it looks like the most runnable task. (This happens with the other colors too, under the same conditions, but you'll see it with pink more often, since pink spends far more time in that state, moving from I/O to waiting to run. I did catch other colors doing it as well.)
And the one at the top would occasionally double-schedule a core (multiple dots would animate onto a single core): https://i.imgur.com/bPHgPJ0.mp4
This is great information. Especially the libuv internals.
The article uses longjmp to explain coroutines. The native stack is doing a lot of heavy lifting there... What would the implementation look like if you reified the stack?
This aside also left me wanting more:
> Rust's async transforms a block of code into a state machine that is not run until you await it.
Would be interesting to read about the internals of that!
I recently asked a question related to all this on the new language implementation stack exchange.
There is a 1:1 correspondence between a generator/coroutine and an iterator with internal state. I suggest looking at the PHP generator RFC[1] and the Concurrency from the Ground Up talk[2].
Excellent overview of scheduling with code examples and animations.
One minor correction about Erlang: the scheduler doesn't get invoked only on function calls. It will be invoked as soon as the lightweight process consumes a certain number of allowed operations. So even if you have one huge function that never calls others and just computes something, it will still consume operations and will be preempted. Some internal C utility functions consume a virtual number of ops as well and may yield.
Even some BIFs and NIFs (a BIF is mostly just a built-in NIF) can yield [1] and choose to reschedule themselves later. Inside the C code it is voluntary, of course. An example of this can be seen when running hash functions [2].
Another interesting idea: a NIF that cannot reschedule itself with a yield can still signal the scheduler that it consumed some number of "reductions" after doing lengthy work. So if, say, it ran for 10 msec, it might want to bump the reduction counter a bit with enif_consume_timeslice. Otherwise a tight recursive loop calling the same NIF could still effectively block the VM.
> Running a thread per client is famously known as The C10K Problem.
Gonna have to be that guy: the C10K problem was "how to serve 10,000 clients from a single server", specifically in the 2000-ish time frame. The article links to the C10K problem page, which discusses the tradeoffs involved in at least five high-level strategies[1], only one of which is running a thread per client[2].
Yeah, there was talk of the C10M problem. Just like C10K was considered solved with event loops like epoll, kqueue, and IOCP, C10M is generally considered solved with techniques that colocate your network stack and your business-layer code.
So DPDK on one side, keeping everything in user space and talking directly to the NIC; or, for static resources, Netflix-style techniques with optimized pathways for kTLS + sendfile to keep the data plane entirely in kernel space. With that you avoid the copies from a socket buffer into the network device's packet buffers.
I like the animation at the top, except that it's meaningless unless you know what the different parts represent. If you scroll down far enough, you will eventually find this:
> Each circle is a task. The white progress circle around tasks is the time left to run until the task is blocked.
> Purple box - The queue holding tasks ready to run.
> Green box - The CPU.
> Gray box - Tasks blocked on something (e.g. I/O).
It would be great to have that description right at the top with the first animation.
I have my own animation here, if you scroll down. You can see the parsing I do for an async syntax (scroll to the heading named Definition); you can edit the syntax and it reparses and displays the parsed output. Sorry for the mess, it's kind of a place where I work on ideas. Further down is my graph renderer.
https://processes3.replit.app/
I want to auto parallelize async tasks.
I am working on a multithreaded runtime in C. It is incomplete, but it can communicate between threads in roughly 60 nanoseconds at 6 million requests per second, or closer to LMAX throughput at 20 million requests per second with a latency of 90-120 nanoseconds.
In my experience, mutexes do not scale.
I am inspired by bulk synchronous parallel, where threads synchronize in a rhythm without mutexes, and by a lock-free algorithm built on semaphores and observation rather than mutation.
I run two io_urings in different threads: one thread does send and another thread does recv. One io_uring talks to an epoll instance for network buffer readiness with EPOLLOUT. This means sending and receiving are full duplex; you can send to and receive from different clients in parallel. I call it split IO.
I plan to write up all my learnings soon.
My HTTP server, adapted from the loti examples, gets a throughput that scales with the number of connections; I am at 7,000 requests per second. I am not nginx.
I use mailboxes and double buffering for thread-safe communication. Each thread can communicate with another thread through its mailbox, like Erlang.
I plan to add coroutines based on Marce Coll's coroutines. I can deschedule a coroutine when there is an await.