The load balancing they describe in their article is only used for inbound requests, similar to a traditional HTTP LB setup. If I understand it correctly they don't implement the outgoing message delivery to the called user themselves, but instead use SMS or a Google Service to deliver push notifications.
So the problem of "how do I route request X to user Y" has to be solved by either Google or the provider that delivers the short message. Actually, with their current setup they don't even need to know where a user is located geographically: they simply choose a proxy server by asking Route53 for a list of nearby servers (I presume near the calling user) and then use their connect-then-disconnect hack to choose the "best" server from that list.
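To make the "connect, then disconnect" selection concrete, here is a rough sketch of how I read it, not their actual client code. It assumes the candidate list has already come back from DNS, probes the candidates one after another (a real client may well race them in parallel), and the names `tcp_connect_time` and `pick_fastest` are made up for illustration.

```python
import socket
import time
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def tcp_connect_time(addr: Address, timeout: float = 1.0) -> Optional[float]:
    """Open a TCP connection, close it right away, and return the elapsed
    seconds. Returns None if the server could not be reached in time."""
    start = time.monotonic()
    try:
        with socket.create_connection(addr, timeout=timeout):
            pass  # connect-then-disconnect: we only care how long it took
    except OSError:
        return None
    return time.monotonic() - start


def pick_fastest(
    candidates: Iterable[Address],
    probe: Callable[[Address], Optional[float]] = tcp_connect_time,
) -> Optional[Address]:
    """Return the candidate with the lowest probe time, or None if every
    candidate failed. `probe` is injectable so the logic can be tested
    without touching the network."""
    best: Optional[Address] = None
    best_time: Optional[float] = None
    for addr in candidates:
        elapsed = probe(addr)
        if elapsed is None:
            continue  # unreachable candidate, skip it
        if best_time is None or elapsed < best_time:
            best, best_time = addr, elapsed
    return best
```

The point is that the client never needs to know its own location: DNS narrows the list, and a cheap measurement picks the winner.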
> Instead we decided to write our own minimal signaling protocol
> and use push notifications (at first SMS, then eventually GCM when
> it was introduced) in order to initiate calls.
So I imagine you could just stick the address of the switch that was chosen by the caller into the message that gets delivered to the called device when a call is initiated.
Obviously this doesn't solve the hard part of "how not to drop a call when your switch goes down mid-call" though.
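As a sketch of what "stick the address of the switch into the message" could look like: the caller serializes a small invite and the push channel carries it opaquely. The field names, the JSON encoding, and the `CallInvite` type are all my assumptions, not their wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CallInvite:
    """Hypothetical call-initiation payload delivered via SMS or push."""
    session_id: str   # identifies the call on the switch
    caller: str       # who is calling
    switch_host: str  # the switch the caller already picked
    switch_port: int


def encode_invite(invite: CallInvite) -> str:
    """Serialize compactly, since push and SMS payloads are size-limited."""
    return json.dumps(asdict(invite), separators=(",", ":"), sort_keys=True)


def decode_invite(payload: str) -> CallInvite:
    """Parse the payload on the called device, which then connects to
    (switch_host, switch_port) instead of choosing a switch itself."""
    return CallInvite(**json.loads(payload))
```

The called device just dials whatever switch the invite names, so both ends meet at the same relay without any server-side lookup of where the callee is.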
The signaling is the first part of the article. The challenge there is avoiding your own persisted connection or polling mechanism. (The solution is to use someone else's persisted connection or polling!)
In the second part they're describing TURN, a 3rd party packet relay that you bounce your packets through when you can't route directly between two endpoints (usually because of NAT). As in: the call has been signaled, the keys exchanged, and now I just need to get packets of audio between Alice and Bob every 20 ms or so. How nice would it be if they could do that between themselves and my servers could stay out of it?!
Broken NAT (the need for TURN) is probably the thing that frustrates me most about the Internet. If any two endpoints could always simply and easily connect just through direct routing, it would make a lot of applications' lives much easier, and make possible many applications which otherwise end up centralized.
This is just one case in point. Just like with Skype, the IP (intellectual property) isn't even really in the actual "product" but in the hacks it takes to make the product work in the reality of pathologically NAT'd networks.
Yes, this seems like a bunch of work to keep up and running and I agree that most of the meat of their solution is actually in Google's or Amazon's systems running the GeoDNS/push stuff. Hopefully IPv6 will fix it all ™.
However, until then, GCM [1] seems like a really good workaround. And I believe it is actually free of charge and available for both iOS and Android.
Just centralize it :-). That is exactly the reason I think IPv6 won't fix it: arbitrary inbound traffic will still be firewalled upstream, either as a ToS/AUP matter or as a "feature", because too much money is at stake for the centralized services.
I think solid decentralized service discovery and direct routing would be as pivotal as the Blockchain itself. One actually helps solve the other, e.g. using an altcoin for anti-spam. But the direct routing (a better NAT hole punch) apparently is not possible without service provider cooperation.
Tor is probably the biggest reliable semi-centralized overlay network. I'm not sure if there are any better options for punching through NAT that don't involve running your own public relay or trusting a 3rd party. But I assume Tor is much too slow to support realtime voice between an arbitrary client and hidden service?
Bitcoin miners faced a similar problem of needing to build a higher-speed relay network to shuttle large blocks faster than the existing P2P relay network. In that case there was a group of P2P nodes configured to allow a much larger number of peers, combined with dedicated fast paths between them, each node located in a high-speed hub. I'm not sure if it was ever deployed.