I'm Stripe's CTO and wrote a good deal of the RCA (with the help of others, including a lot of the engineers who responded to the incident). If you've any specific feedback on how to make this more useful, I'd love to hear it.
I don't think either one is particularly "useful" to me as a customer of the business, other than knowing that "we have top people working on it right now" and that there's a plan in place to try to avoid future problems.
What's fun for a software person is that there are a lot of interesting digressions and things to learn in the Cloudflare one. The whole explanation of the regexp at the end is something no one on the business side cares about, but it's an interesting read in and of itself.
It's worth noting that yours came out a bit more than a week faster than theirs, which jgrahamc clearly spent a lot of time writing. No idea if anyone cares about the speed with which these things are released...
Hi Dave, you probably won't remember me (we only spent about 2 months together at Stripe), but I bet Mr Larson remembers.
The first question is who this is written for: it lacks the detail I would write for the incident review meeting audience, while also lacking a simpler story for non-technical readers. As it stands at the time I read it, I don't think it serves any audience very well.
I understand that the internal report's level of detail might be excessive for a public write-up, but if technical readers are the target, some more detail would have helped. For example, the monitoring details that Will described in another thread are a key missing piece that, if anything, would make Stripe look better, since problems like that happen all the time. I bet the internal report has more details that are equally useful and would not reveal anything sensitive. In general, the only reason I could follow the document well is that I remember how the Stripe storage system worked last year and could handwave a year's worth of changes. Since this part of Stripe's infrastructure is relatively unique, it's difficult to understand from the outside, and the write-up reads as if it doesn't have enough information.
In particular, the remediations say very little that is understandable from the outside: most of the text could apply to pretty much any incident on any storage or queuing subsystem I was ever a part of: more alerts, an extra chart in an ever-growing dashboard, some circuit breakers to deal with the specific failure shape... It's all real, but without details it says very little.
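To be concrete about why "circuit breakers" on its own says so little: the pattern is generic enough that it looks roughly like the minimal Python sketch below (all names here are hypothetical, not anything from the Stripe write-up). What actually matters, and what the report omits, is what the breaker trips on and what the system does while it is open.

    import time

    class CircuitBreaker:
        """Stop calling a flaky dependency after repeated failures,
        then allow a trial call once a cool-down period has passed."""

        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means the breaker is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open: skipping call")
                # Cool-down elapsed: allow one trial call ("half-open").
                self.opened_at = None
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                raise
            else:
                self.failures = 0  # a healthy call resets the count
                return result

Every postmortem can promise one of these; the interesting detail is the failure shape it is wrapped around.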
I understand why you might not want to divulge that level of detail, though. If fewer details are the goal, then the article could cut all kinds of low-information sections and instead focus more on the response and on the things that will change in the future. The most interesting bit is the quick version rollback, which, in retrospect, might not have been the right call. A more detailed view of the alternatives, and of why the actions that ultimately led to the second incident were taken, would be enlightening, and would humanize the piece.
Thank you for not just providing a public root cause analysis, but also coming here to discuss it on HN.
I work at Stripe, on the marketing team, and assisted a bit here. My last major engineering work was writing the backend to a stock exchange.
If anyone on HN knows anyone with the sort of interesting life story where they both know what can cause a cluster election to fail and like writing about that sort of thing, we would be eager to make their acquaintance.
Kyle used to work at Stripe and left. Unfortunately, I don't think he would come back. That guy is absolutely amazing, especially with regard to distributed DBs and writing about them.
For starters, maybe provide more detail besides the vague statement that some feature of some database didn't work as expected. Imagine you are giving this to your employees (especially new ones) to learn from. How much actually useful knowledge is being shared here?
Unexpected things are bound to happen. But one thing that stuck out to me is that you don't seem to have a safe way to test changes (which would have prevented the second failure). Are there no other environments to test changes on? Is there no way to roll out incrementally? Is there not another environment that can step in in place of a failing one while you investigate? These seem like fairly common industry practices that help you deal with unexpected failures, but I don't see any mention of whether/why these practices failed and whether/how that is being remediated.
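By "incremental roll-out" I mean roughly the shape sketched below (a hedged Python sketch; the deploy/health-check/rollback callables are hypothetical stand-ins, not anything Stripe has described): push the change to a small slice of the fleet, soak, and stop before the blast radius grows.

    import time

    # Fraction of the fleet to cover at each stage, and how long to watch it.
    STAGES = [0.01, 0.10, 0.50, 1.00]
    SOAK_SECONDS = 600

    def staged_rollout(version, deploy_fraction, fleet_is_healthy, rollback):
        """Roll a new version out in stages, bailing out on the first sign
        of trouble instead of letting it reach the whole fleet."""
        for fraction in STAGES:
            deploy_fraction(version, fraction)
            deadline = time.monotonic() + SOAK_SECONDS
            while time.monotonic() < deadline:
                if not fleet_is_healthy():
                    rollback(version)  # stop before the blast radius grows
                    return False
                time.sleep(30)
        return True  # reached 100% with no alarms

Nothing exotic, which is exactly why the absence of any mention of it in the write-up is what I'd want explained.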
It would be great if, in these types of situations, the CC token validity period were extended, or at least the actual period were documented; the documentation only states that it is short. For our app, if the tokens were valid longer, we could write this up as a non-event and retry when things were better.
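For context, the retry we'd like to be able to do looks roughly like this (a sketch assuming the classic token + Charge flow and the Python client's error classes; the function and order_id naming are mine). It only works if the token is still valid by the time the API recovers, which is exactly why the validity period matters:

    import time
    import stripe

    def charge_with_retry(token, amount_cents, order_id, attempts=5):
        """Retry a charge during an outage window, reusing the same token.
        Only safe if the token hasn't expired by the time Stripe recovers."""
        for attempt in range(attempts):
            try:
                return stripe.Charge.create(
                    amount=amount_cents,
                    currency="usd",
                    source=token,
                    # Same key on every attempt so retries can't double-charge.
                    idempotency_key="order-%s" % order_id,
                )
            except (stripe.error.APIConnectionError, stripe.error.APIError):
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
        raise RuntimeError("API still unavailable; token may expire before a retry succeeds")

With a longer (or at least documented) token lifetime, this kind of loop turns an outage into a delayed charge rather than a failed checkout.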