I wonder how they actually found the reason for the failure. They had a system which worked (almost) perfectly and could probably be tested in every standard way without showing the problem. They must've had a seriously good logging system that showed something suspicious, or someone had a really interesting "a-ha" moment...
I'd like to hear the story of debugging this one, and also how they managed to identify that this incident was caused by that specific bug.