> That third subset means the third frontend only has a subset of three backends, even though you want it to have four.
This explanation is correct, thanks.
Alas, word limits demand brevity.
> implicit assumption about the requirement that the final algorithm require no dynamic/runtime coordination
An earlier iteration of this article included coordination as one of the properties, but this unfortunately had to be cut. AFAICS, the only other two kinds of coordination are “frontend tasks talk to each other” or “frontend tasks ask a subsetting service for their subset”. Within Google, both of these options are unacceptable: we either introduce the risk that a rogue frontend task brings down all the frontends, or introduce new unappealing failure modes (what do you do if the subsetting service is unavailable?). There is potential for other subsetting algorithms in this space, and while I’d be excited to see them, I’m mildly sceptical about their practicality at scale.
Yeah, the brevity thing is always tricky- the classic problem with any academic paper is that you need to assume your reader has some level of background that enables them to follow your reasoning; otherwise you end up derailing your paper by explaining too much of the background material.
FYI your claim re coordination isn't true across the board. The second option is used for some services (Slicer has an opt-in integration on the L2s).
> For instance, in table 3, it looks like they excluded backend tasks {0,1} (for frontend tasks {0, 1}) then {2,3} (for frontend tasks {2,3}) in the N=10 case, but backend tasks {1,2} then {3,4} in the N=11. Why the discrepancy?
With N = 10, there will be N mod k = 10 mod 4 = 2 leftover tasks, and so the round-robin fashion excludes {0, 1} then {2, 3}.
However for N = 11, there will be N mod k = 11 mod 4 = 3 leftover tasks, so the round-robin fashion excludes {0, 1, 2} then {3, 4, 5}.
But as joatmon-snoo correctly said, the more important point is demonstrating how bad backend churn is with this algorithm.
OK, that makes a lot of sense. Thanks for taking the time to clarify!
> But as joatmon-snoo correctly said, the more important point is demonstrating how bad backend churn is with this algorithm.
Yes, again the overall point came across clearly, but faced with specific examples I like to dive into the details to check my understanding of how things work. Otherwise, it's easy to overlook key but subtle details.
> That third subset means the third frontend only has a subset of three backends, even though you want it to have four.
This explanation is correct, thanks. Alas, word limits demand brevity.
> implicit assumption about the requirement that the final algorithm require no dynamic/runtime coordination
An earlier iteration of this article included coordination as one of the properties, but this unfortunately had to be cut. AFAICS, the only other two kinds of coordination are “frontend tasks talk to each other” or “frontend tasks ask a subsetting service for their subset”. Within Google, both of these options are unacceptable: we either introduce the risk that a rogue frontend task brings down all the frontends, or introduce new unappealing failure modes (what do you do if the subsetting service is unavailable?). There is potential for other subsetting algorithms in this space, and while I’d be excited to see them, I’m mildly sceptical about their practicality at scale.