
That is interesting. I would imagine that this leads to extremely risk-averse deploys? You can't know how much downtime a particular failure will incur, so one small error could easily expend the failure budget for a significant period of time. Also, if ~4 minutes per month is the monthly failure budget, then 1 hour of downtime uses up over a year of failure budget! How can a team so heavily "in debt" ever hope to push code again?


There are 2 points that you seem to have overlooked.

First is budget. The 4-minutes-per-month goal is not arbitrary; management has come up with this number based on the monetary cost of downtime, and (ideally) is committed to expending a commensurate amount of resources (personnel, time, infrastructure, etc.) to make it happen. It is OK to ask for a bulletproof application, but you have to be willing to pay big $$$ for Kevlar. Issuing cardboard vests and then acting surprised when they fail to stop bullets is both stupid and unprofessional.

Second is rollbacks. If everything was working OK until today and things begin failing a few hours after you push a new version, save the trace data and go back to the previous version! This won't work every time, but more often than not it will save you time.

Besides those points, you ask...

> ...1 hour of downtime uses up over a year of failure budget! How can a team so heavily "in debt" ever hope to push code again?

The whole point is that you don't want a single big failure to bring you to your knees. If your max downtime is 4 minutes per month, you don't try to run your shop so that the average is 3:55. You play it safe for a couple of years and try to make it 3:15. Once you have built up a cushion (say, 15 minutes) you may get bolder and aim for a 3:30 or 3:40 average, but never more than 3:45.

This way, the big 1-hour event wipes out your surplus... but you can play things conservative for the next 3 months and then go back to operating as you have always done.
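
To put rough numbers on that strategy, here is the arithmetic as a small sketch (the figures are illustrative; your own budget defines the real ones):

    # Rough error-budget arithmetic for the scenario above (illustrative numbers only).

    MONTHLY_BUDGET_S = 4 * 60            # 4:00 of allowed downtime per month
    CONSERVATIVE_ACTUAL_S = 3 * 60 + 15  # actually running at ~3:15 per month

    surplus_per_month_s = MONTHLY_BUDGET_S - CONSERVATIVE_ACTUAL_S  # 45 s banked each month

    months_played_safe = 24              # "a couple of years" of conservative operation
    cushion_s = surplus_per_month_s * months_played_safe
    print(f"cushion after {months_played_safe} months: {cushion_s / 60:.0f} min")   # ~18 min

    outage_s = 60 * 60                   # the big one-hour event
    print(f"share of the outage the cushion absorbs: {cushion_s / outage_s:.0%}")   # ~30%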


I'm not with Google, but the SRE book has a whole chapter on this (Ch. 3, Embracing Risk). For most services you don't need reliability quite as high as you'd think. There's a background error rate caused by networks and devices you can't control, and once your service is more reliable than that background rate, you should shift your focus to feature development instead.


The book severely simplifies this concept, and borders on error when talking about the general reliability of the network vs. your service. It's an interesting point to consider, though.


Deployment shouldn't normally be a huge cause of serious outages, because the problems should be caught early, while they are affecting only a small fraction of traffic. If you deploy to one of ten thousand replicas, then even if the new build is completely broken, serving errors on 0.01% of your queries for a few minutes isn't going to eat much of your failure budget. Even better if you can deploy first to testing, internal, or other non-public-facing replicas that don't count against your failure budget at all.
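
To make that concrete, a back-of-the-envelope sketch using the fractions above (the 5-minute detection window is an assumed figure):

    # Cost of a completely broken canary replica, using the fractions above.

    REPLICAS = 10_000
    canary_fraction = 1 / REPLICAS       # one replica, ~0.01% of traffic
    broken_minutes = 5                   # assumed time before detection and rollback

    # Downtime charged to the error budget, weighted by affected traffic:
    charged_s = broken_minutes * 60 * canary_fraction
    print(f"budget consumed: {charged_s:.3f} s")          # 0.030 s

    MONTHLY_BUDGET_S = 4 * 60
    print(f"fraction of a 4-minute monthly budget: {charged_s / MONTHLY_BUDGET_S:.2e}")  # ~1e-4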

There are very large services at Google that deploy builds daily. We have a lot of monitoring, and many of our services are rigged to automatically detect and roll back bogus pushes.

The largest source of outages in these large systems is not software deployment but configuration changes, especially in systems that aren't rigged for regular automatic pushes. The second largest source is, somewhat ironically, incorrect response to small outages.


The big hint in the post you are responding to was "rollback".

Why would you ever attempt any upgrade without a good option for rolling it back?

Today it is easier than ever. A small shop might host their service on a VM and spin up a second VM for the new deployment. Just switch the service IP address from the old VM to the new one; if customers are busted, switch it back. Once the new version has been up for a convincingly long time, take the old VM down and store it offline somewhere.

That is clearly a hack, but it allows nearly instant rollback for the smallest shops. Coming up with something similar at a medium-sized shop shouldn't be too hard either.
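
A minimal sketch of that swap, assuming hypothetical helpers (provision_vm, move_ip, health_ok, and archive_offline stand in for whatever your cloud or hypervisor API actually provides):

    import time

    SERVICE_IP = "203.0.113.10"   # the address customers hit (example value)

    def deploy_with_swap(new_build, old_vm):
        new_vm = provision_vm(new_build)    # hypothetical: bring the new build up on a fresh VM
        move_ip(SERVICE_IP, to=new_vm)      # the switch: customers now hit the new VM

        # Watch it; if customers are busted, "unswitch" and you're back on
        # the old build in seconds.
        deadline = time.time() + 30 * 60
        while time.time() < deadline:
            if not health_ok(new_vm):
                move_ip(SERVICE_IP, to=old_vm)   # nearly instant rollback
                return False
            time.sleep(30)

        archive_offline(old_vm)             # keep the old VM around, powered off
        return True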

Once rollback is solved, a new deployment might eat minutes of failure budget instead of hours.


Try that when your deployment requires a schema change of a database table with a billion rows :-)

That being said, we've automated deploy and rollback-on-error to the point where it's a script, and we run it 50 times a day. The "immune system" catching problems within a few minutes of a partial deploy saves us weekly.
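
For flavor, a deploy-then-watch script of that shape might look roughly like this (not the commenter's actual tooling; deploy_canary, error_rate, rollback, and continue_rollout are hypothetical hooks into your own systems):

    import time

    ERROR_THRESHOLD = 0.01    # assumed tolerable error rate during a partial deploy
    WATCH_SECONDS = 5 * 60    # "a few minutes of partial deploy"

    def deploy(version):
        deploy_canary(version)                 # push to a small slice of replicas first
        deadline = time.time() + WATCH_SECONDS
        while time.time() < deadline:
            if error_rate(version) > ERROR_THRESHOLD:
                rollback(version)              # the "immune system" kicks in
                return False
            time.sleep(10)
        continue_rollout(version)              # canary looked healthy; finish the push
        return True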


DB schema updates don't have to break your ability to roll back.

Need to add a column? Go for it. Then do your deployment that uses the new column. If you have to roll back, you can, and the new column will just sit there, again unused.

Have a column you don't need anymore? Do the deployment that stops using it, and let it sit for a while. After you're convinced it's safe, drop the column.

Have a column that needs to change size/purpose/etc.? Nope, don't do that. Just create a new one, and deal with differences between the old and new column in code until you can safely drop the old column, as above.

Does your ALTER statement take too long to run and cause downtime? Do it on a replica that isn't serving traffic, and then promote that to be the new master. Actually, in general, you should be doing this for any kind of change anyway, just in case you screw up the schema update.
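
Written out as ordered steps, the add/deploy/backfill/drop sequence above looks something like this (table and column names are invented; email_v2 just stands in for whichever column needs to change shape):

    # The add/deploy/backfill/drop sequence above, as ordered migration steps.
    # Table and column names are made up for illustration.

    MIGRATION_STEPS = [
        # 1. Add the new column. App rollback stays safe at every point after this.
        "ALTER TABLE users ADD COLUMN email_v2 VARCHAR(512) NULL",

        # 2. Deploy app code that writes both columns and prefers email_v2 on reads.
        #    Rolling that deploy back just leaves email_v2 sitting there unused.

        # 3. Backfill existing rows (ideally in small batches, off-peak).
        "UPDATE users SET email_v2 = email WHERE email_v2 IS NULL",

        # 4. Only once you're convinced nothing reads the old column: drop it.
        "ALTER TABLE users DROP COLUMN email",
    ]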


Spin up a second database VM and point a fresh frontend VM at it. Then atomically switch the frontend VM like I was describing before.

If that doesn't work, come up with some kind, any kind, of rollback solution. There is no server problem that cannot have a rollback solution made for it.

Some places are cheap and would rather just upgrade in place in the middle of the night. Those places are bad at engineering. Do not work for or do business with them if you actually care about uptime.



