How the end-to-end back-pressure mechanism inside Wallaroo works (wallaroolabs.com)
78 points by scottlf on April 3, 2018 | hide | past | favorite | 5 comments


Howdy. I'm the author of both parts of this two-part series on overload mitigation. Let me know if I've overlooked an important technique for handling too-big workloads, or if you have questions (e.g., how we've stitched together TCP's sliding-window flow control with Wallaroo's internal flow control).

-Scott


I hope we hear something about this in your talk at DataEngConf SF!


I don't recall which of us is planning on attending DataEngConf SF, but I'm asking around now.

EDIT: Ah, it will be Vid Jain, the CEO of Wallaroo Labs.


Backpressure gets more "interesting" when you throw QoS into the mix. If the client has hard latency requirements, the server can't afford to "stop the world" when it gets backed up (à la TCP).

Keeping your queues shallow can be a solution, but only if your system doesn't need the benefits of batching that deep queues can bring. You then have to be a bit smarter about crediting the client, only handing credits out when you know you'll be able to complete that work within the QoS requirements, instead of simply whenever you have space in your queues. (Simply rate-limiting the source will work in a pinch too, at the cost of potential throughput.)
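To make the crediting idea concrete, here's a minimal sketch of QoS-aware credit-based flow control. It isn't Wallaroo's implementation; all class, method, and parameter names are invented for illustration. The key difference from plain space-based crediting is the extra deadline bound on the backlog.

```python
from collections import deque

class QoSCreditServer:
    """Sketch: grant a credit only when the work it admits can still
    finish inside the latency budget, not merely when queue space is
    free.  A real system would track service time as a moving average."""

    def __init__(self, queue_depth, latency_budget_ms, est_service_ms):
        self.queue = deque()
        self.queue_depth = queue_depth        # hard cap on queued work
        self.latency_budget_ms = latency_budget_ms
        self.est_service_ms = est_service_ms  # assumed per-item service time
        self.outstanding = 0                  # credits granted but unused

    def credits_to_grant(self):
        backlog = len(self.queue) + self.outstanding
        # Space-only crediting would stop here:
        by_space = self.queue_depth - backlog
        # QoS-aware crediting also bounds the backlog so the newest
        # admitted item still drains within the deadline:
        by_deadline = self.latency_budget_ms // self.est_service_ms - backlog
        grant = max(0, min(by_space, by_deadline))
        self.outstanding += grant
        return grant

    def on_arrival(self, item):
        self.outstanding -= 1   # client spent one credit to send this
        self.queue.append(item)

    def on_complete(self):
        self.queue.popleft()    # worker finished one item
```

With a 1-second budget and ~100 ms per item, the server hands out at most 10 credits even if the queue could hold 100, and hands out more only as completed work frees up deadline headroom.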


No disagreement from me.

It's definitely not easy to keep queues shallow end-to-end. In my experience with Erlang systems, queue management is an area where the BEAM VM's runtime isn't actively helpful enough. One anecdote: within the last few months, the weak scheme that the BEAM used for runtime back-pressure was removed because it wasn't effective enough to justify the complexity of its code.

When a system does have the ability to keep queues shallow end-to-end, then I think it's a good base to build on, adding additional features to allow deeper queues where and when we want them.

Regarding Wallaroo: Today, you can use Kafka as a data source, allowing Kafka to be your deep-as-you-wish buffer upstream and/or downstream of Wallaroo. Tomorrow (a.k.a. vapor, though we've discussed the feature internally), it's certainly feasible to add an option to (for example) `TCPSource` that would add a large buffer at the entry to a Wallaroo cluster, with flexibility to queue in RAM, on disk, or elsewhere. It would also make failure recovery more complex, which is one reason we've deferred implementing it.
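As a rough sketch of how a buffered source option could compose with TCP's own flow control: a bounded queue sits between the socket reader and the pipeline, and when it fills, the reader simply stops pulling from the socket, so the kernel receive buffer fills and TCP's sliding window pushes back on the sender. This is not Wallaroo's `TCPSource`; the class and names below are hypothetical.

```python
import queue

class BufferedSource:
    """Sketch: a bounded in-memory buffer between a TCP source and the
    processing pipeline (names invented for illustration).  A disk- or
    external-backed variant would swap the queue implementation."""

    def __init__(self, max_buffered_msgs):
        # Bounded queue: put() blocks when full, pausing socket reads,
        # which lets TCP's window propagate back-pressure upstream.
        self.buffer = queue.Queue(maxsize=max_buffered_msgs)

    def reader_loop(self, recv_msg):
        # recv_msg() models reading one framed message from the socket;
        # it returns None when the connection closes.
        while True:
            msg = recv_msg()
            if msg is None:
                break
            self.buffer.put(msg)  # blocks when the buffer is full

    def next_for_pipeline(self):
        return self.buffer.get()
```

In a real deployment the reader loop would run on its own thread, and the buffer's persistence choice (RAM vs. disk) is exactly where the failure-recovery complexity mentioned above comes in: anything buffered only in RAM is lost on a crash unless it can be replayed from the source.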



