I wonder how they actually found the reason for the failure. They had a system that worked (almost) perfectly and could probably be tested in every standard way without showing the problem. They must've had a seriously good logging system that showed something suspicious, or someone had a really interesting "a-ha" moment...

I'd like to hear the story of debugging this one, and also how they managed to identify that this incident was caused by that specific bug.