> [2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.
> [2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.
There's a 20 minute gap between investigation and "rollback". Why did they roll back if the service was back to normal? How could they decide on, and document, the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly, not all variables were considered.
To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem; then you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.
Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many that touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes.
In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however, we lacked an efficient way to fully validate this change on the order of minutes. We're investing in building tooling to increase robustness in rapid response mechanisms and to help responding engineers understand the potential impact of configuration changes or other remediation efforts they're pushing through an accelerated process.
I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.
Thank you for taking the time to respond to my questions.
I believe the high potential of causing a follow-up incident was left out of the post (or maybe I missed it?).
I hope that lessons are learned from this operational event, and that you invest in building metrics and tooling that allow you to, first of all, prevent issues and, second, shorten outage/mitigation times in the future.
I'm happy you guys are being open about the issue, and taking feedback from people outside your company. I definitely applaud this.
They very likely have continuous deployment, so each change could potentially be released as a separate deploy. If the changes alter the data model, they've gotta run a migration. So hundreds a week seems reasonable to me.
From the outside it sounds like, whatever the database is, it has far too many critical services tightly bound within it. E.g. leader election implemented internally instead of as a service with separate lifecycle management, so pushing the database query processor's minor version forward forces me to move the leader-election code or replica-config handling forward too... ick.
From the description/comment it also sounds like the database operates directly on files rather than file leases, as there's no notion of a separate local, cluster-scoped, byte-level replication layer below it. That makes it harder to shoot a stateful node... And it sounds like it's tricky to externally cross-check various rates, i.e. to monitor replication RPCs and notice when certain nodes are stepping away from the expected numbers, without depending on the health of the nodes themselves.
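On that cross-checking point, even something crude like comparing externally observed per-node replication-RPC rates against the fleet median would go a long way. A rough sketch only; the rate source, node names and threshold here are all invented for illustration:

    # Sketch: flag nodes whose externally observed replication-RPC rate
    # deviates from the fleet, without asking the nodes how they feel.
    # `observed_rates` would come from network- or proxy-level counters.
    from statistics import median

    def replication_outliers(observed_rates: dict[str, float],
                             tolerance: float = 0.5) -> list[str]:
        """Return nodes whose rate differs from the fleet median by more
        than `tolerance` as a fraction of that median."""
        if not observed_rates:
            return []
        fleet_median = median(observed_rates.values())
        if fleet_median <= 0:
            return []
        return [
            node
            for node, rate in observed_rates.items()
            if abs(rate - fleet_median) / fleet_median > tolerance
        ]

    # Example: a node replicating at a fraction of its peers gets flagged.
    print(replication_outliers({"db-1": 980.0, "db-2": 1010.0, "db-7": 120.0}))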
Hopefully the database doesn't also mix geo-replication for local-access/sovereignty requirements in among the same mechanisms, rather than separating it out into aggregation layers above purely cluster-scoped zones!
Of course, this is all far far easier said than done given the available open source building blocks. Fun problems while scaling like crazy :)
In my experience customers deeply detest the idea of waiting around for a failure case to re-occur so that you can understand it better. When your customers are losing millions of dollars in the minutes you're down, mitigation would be the thing, and analysis can wait. All that is needed is enough forensic data so that testing in earnest to reproduce the condition in the lab can begin. Then get the customers back to working order pronto. 20 minutes seems like a lifetime if in fact they were concerned that the degradation could happen again at any time. 20 minutes seems like just enough time to follow a checklist of actions on capturing environmental conditions, gather a huddle to make a decision, document the change, and execute on it. Commendable actually, if that's what happened.
> In my experience customers deeply detest the idea of waiting around for a failure case to re-occur so that you can understand it better.
Bryan Cantrill has a great talk[0] about dealing with fires where he says something to the effect of:
> Now you will find out if you are more operations or development - developers will want to leave things be to gather data and understand, while operations will want to rollback and fix things as quickly as possible
I understand it. I've worked at AWS, and now at OCI, dealing with systems that affect hundreds to thousands of customers whose businesses are at stake.
Mitigation is your top priority: bringing the system back to a good state.
If follow-up actions are needed, take the least-impactful steps to prevent another wave.
If there was a deployment, roll back.
My concern here is that the deployment had been made months ago, and many other changes that could make things worse had been introduced since, which is the case here. Taking an extra 10-20 minutes to make sure everything is fine, versus making a hot call and causing another outage, makes a big difference.
I'm just asking questions based on the documentation provided; I do not have more insights.
I am happy Stripe is being open about the issue; that way the industry learns and matures regarding software-caused outages. Cloudflare's outage documentation is really good as well.
> My concern here is that the deployment had been made months ago, and many other changes that could make things worse had been introduced since.
Make every bit of software in your stack export its build date as a monitoring metric. Have an alert if any bit of software goes over 1 month old. Manually or automatically re-build and redeploy that software.
That prevents the kind of 'bit rot' where you daren't rebuild or roll back something.
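As a minimal sketch of the first part, assuming a Prometheus-style setup (the metric name, port and environment variable are just illustrative):

    # Sketch: expose the build time as a gauge so an external rule can alert on age.
    # Assumes the build pipeline bakes BUILD_TIMESTAMP (unix seconds) into the environment.
    import os
    import time

    from prometheus_client import Gauge, start_http_server

    BUILD_TIMESTAMP = int(os.environ.get("BUILD_TIMESTAMP", "0"))  # e.g. `date +%s` in CI

    build_time = Gauge(
        "service_build_timestamp_seconds",
        "Unix time at which this binary/image was built",
    )

    if __name__ == "__main__":
        build_time.set(BUILD_TIMESTAMP)
        start_http_server(9102)  # scrape target for the monitoring system
        while True:
            time.sleep(60)

    # An alerting rule along the lines of
    #   time() - service_build_timestamp_seconds > 30 * 24 * 3600
    # then flags anything older than roughly a month.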
In a lot of environments this is a terrible idea. In private environments exposing build manifest information is a good idea, but not so that you can alert at 1 month. Where I work, software that's 2-3 years old is considered good - mature, tested, thoroughly operationalized, and understood by all who need to interact with it on a daily basis. Often, consistency of the user experience is better than being bug free.
I think this is a good point. Don't roll back if you don't know why your new code is giving you problems. You may fix things with the rollback, or you may put yourself in a worse situation where the forward/backward compatibility has a bug in it. The issue may even be coincidental to the new code.
However, it's hard to say whether this is a poor decision unless we know that they didn't analyze the path and determine that it would most likely be fine. If they did do that, then it's just a mistake and those happen. 20 minutes is enough time to make that call for the team that built it.
A rollback without understanding is definitely risky. An uninformed rollback is one of the factors that killed Knight Capital Group in 2012. For those not familiar, the actual problem was they failed to update one of a cluster of eight servers, and the server on the old version was making bad trades. They attempted to mitigate with a rollback, which made all eight servers start to make bad trades. In the end they lost $460 million over the course of about 45 minutes.
Knight Capital also didn't know what version of software their servers were running, didn't know which servers were originating the bad requests, had abandoned code still in the codebase, and reused flags that controlled that abandoned code (another summary: https://sweetness.hmmz.org/2013-10-22-how-to-lose-172222-a-s...). I'm not sure what you can infer about the risk of a rollback in a less crazy environment.
If rollbacks are not safe then you have a change management problem.
If you have a good CM system, you should have a timeline of changes that you can correlate against incidents. Most incidents are caused by changes, so you can narrow down most incidents to a handful of changes.
Then the question is, if you have a handful of changes that you could roll back, and rollbacks are risk free, then does it make sense to delay rolling back any particular change until the root cause is understood?
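That correlation step can be very simple in practice; something like the sketch below, where the data shape and the 24-hour lookback window are just illustrative assumptions:

    # Sketch: narrow an incident down to recently deployed changes from a CM timeline.
    from datetime import datetime, timedelta
    from typing import NamedTuple

    class Change(NamedTuple):
        change_id: str
        deployed_at: datetime
        description: str

    def candidate_changes(changes: list[Change],
                          incident_start: datetime,
                          lookback: timedelta = timedelta(hours=24)) -> list[Change]:
        """Changes deployed within `lookback` of the incident start, newest first."""
        window_start = incident_start - lookback
        return sorted(
            (c for c in changes if window_start <= c.deployed_at <= incident_start),
            key=lambda c: c.deployed_at,
            reverse=True,
        )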
It's not always as simple as that. What if the problem was that something in a change didn't behave as specified and wound up writing important data in an incorrect but retrievable format? The rolled-back version might not recognise that data properly and could end up either modifying it further, so the true data could no longer be retrieved, or causing data loss elsewhere as a consequence.
In that case you would probably still roll back to prevent further data corruption and restore the corrupted records from backups.
There are certainly changes that cannot be rolled back such that the affected users are magically fixed, which is not what I am suggesting. In the context of mission critical systems, mitigation is usually strongly preferred. For example, the Google SRE book says the following:
> Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct!
> Instead, your course of action should be to make the system work as well as it can under the circumstances. This may entail emergency options, such as diverting traffic from a broken cluster to others that are still working, dropping traffic wholesale to prevent a cascading failure, or disabling subsystems to lighten the load. Stopping the bleeding should be your first priority; you aren’t helping your users if the system dies while you’re root-causing. [...] The highest priority is to resolve the issue at hand quickly.
I have seen too many incidents (one in the last 2 days in fact) that were prolonged because people dismissed blindly rolling back changes, merely because they thought the changes were not the root cause.
> In that case you would probably still roll back to prevent further data corruption and restore the corrupted records from backups.
OK, but then what if it's new data being stored in real time, so there isn't any previous backup with the data in the intended form? In this case, we're talking about Stripe, which is presumably processing a high volume of financial transactions even in just a few minutes. Obviously there is no good option if your choice is between blocking some or all of your new transactions and losing data about some of your previous transactions, but it doesn't seem unreasonable to do at least some cursory checking about whether you're about to cause the latter before you roll back.
I think you guys are considering this from the wrong angle...
Rollbacks should always be safe, and they should always be automatically tested. So a software release should do a gradual rollout (e.g. 1, 10, 100, 1000 servers), but it should also restart a few servers with the old software version just to check that a rollback still works.
The rollout should fail if health checks (including checks on business metrics like conversion rates) on the new release or the old release fail.
If only the new release fails, a rollback should be initiated automatically.
If only the old release fails, the system is in a fragile but still working state for a human to decide what to do.
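Roughly the control loop I mean, as a sketch only; deploy, health_ok and rollback are hypothetical hooks into whatever your deploy tooling actually provides:

    # Sketch of a staged rollout with automatic rollback, as described above.
    from typing import Callable, Sequence

    def staged_rollout(
        stages: Sequence[int],                 # e.g. (1, 10, 100, 1000) servers
        deploy: Callable[[int, str], None],    # deploy `version` to `n` servers
        health_ok: Callable[[str], bool],      # health + business-metric checks per version
        rollback: Callable[[], None],
        new_version: str,
        old_version: str,
    ) -> bool:
        # Prove the old version still starts and passes checks, so a later
        # rollback is known to be safe before widening the new rollout.
        deploy(1, old_version)
        if not health_ok(old_version):
            # Only the old release fails: fragile but working, hand off to a human.
            raise RuntimeError("rollback path unhealthy; manual decision required")

        for n in stages:
            deploy(n, new_version)
            if not health_ok(new_version):
                rollback()                     # only the new release fails: auto-rollback
                return False
        return True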
This is one of those ideas that looks simple enough until you actually have to do it, and then you realise all the problems with it.
For example, in order to avoid any possibility of data loss at all using such a system, you need to continue running all of your transactions through the previous version of your system as well as the new version until you're happy that the performance of the new version is satisfactory. In the event of any divergence you probably need to keep the output of the previous version but also report the anomaly to whoever should investigate it.
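To be fair, the dual-run itself is easy enough to sketch (old_handler, new_handler and report_divergence below are hypothetical placeholders); it's everything around it, the doubled capacity and deciding when the comparison is conclusive, that hurts:

    # Sketch: run each transaction through both versions, keep the old result,
    # and flag any divergence for investigation. All names are placeholders.
    from typing import Any, Callable

    def dual_run(
        request: Any,
        old_handler: Callable[[Any], Any],
        new_handler: Callable[[Any], Any],
        report_divergence: Callable[[Any, Any, Any], None],
    ) -> Any:
        old_result = old_handler(request)      # authoritative path: this output is kept
        try:
            new_result = new_handler(request)  # shadow path: compared, then discarded
            if new_result != old_result:
                report_divergence(request, old_result, new_result)
        except Exception as exc:               # a crash in the new path also counts as divergence
            report_divergence(request, old_result, exc)
        return old_result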
But then if you're monitoring your production system, how do you make that decision about the performance of the new version being acceptable? If you're looking at metrics like conversion rates, you're going to need a certain amount of time to get a statistically significant result if anything has broken. Depending on your system and what constitutes a conversion, that might take seconds or it might take days. And you can only make a single change, which can therefore be rolled back to exactly the previous version without any confounding factors, during that whole time.
And even if you provide a doubled-up set of resources to run new versions in parallel and you insist on only rolling out a single change to your entire system during a period of time that might last for days in case extended use demonstrates a problem that should trigger an automatic rollback, you're still only protecting yourself against problems that would show up in whatever metric(s) you chose to monitor. The real horror stories are very often the result of failure modes that no-one anticipated or tried to guard against.
My point was that it's all but impossible for any rollback to be entirely risk-free in this sort of situation. If everything was understood well enough and if everything was working to spec well enough for that to happen, you wouldn't be in a situation where you had to decide whether to make a quick rollback in the first place.
I'm not saying that the decision won't be to do the rollback much of the time. I'm just saying it's unlikely to be entirely without risk and so there is a decision to be considered. Rolling back on autopilot is probably a bad idea no matter how good a change management process you might use, unless perhaps we're talking about some sort of automatic mechanism that could do so almost immediately, before there was enough time for significant amounts of data to be accumulated and then potentially lost by the rollback.
The odds of you understanding all of the constraints and moving variables in play, and doing situation analysis better than the seasoned ops team at a multibillion dollar company are pretty low. Maybe hold off on the armchair quarterbacking.