There are two points that you seem to have overlooked.
The first is budget. The goal of 4 minutes per month is not arbitrary; management has come up with this number based on the monetary cost of downtime, and (ideally) is committed to spending a commensurate amount of resources (personnel, time, infrastructure, etc.) to make it happen. It is OK to ask for a bulletproof application, but you have to be willing to pay big $$$ for Kevlar. Issuing cardboard vests and then acting surprised when they fail to stop bullets is both stupid and unprofessional.
The second is rollbacks. If everything was working fine until today and things begin failing a few hours after you push a new version, save the trace data and go back to the previous version! This won't work every time, but more often than not it will save time.
Besides those points, you ask...
> ...1 hour of down time uses up over a year in failure budget! How can a team so heavily "in debt" ever hope to push code again?
The whole point is that you don't want a single big failure to bring you to your knees. If your maximum downtime is 4 minutes per month, you don't try to run your shop so that the average is 3:55. You play it safe for a couple of years and aim for 3:15. Once you have built up a 15-minute surplus, you may get bolder and aim for a 3:30 or 3:40 average, but never more than 3:45.
This way, the big 1-hour event wipes out your surplus... but you can play it conservative for the next 3 months and then go back to operating as you always have.
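To put rough numbers on the surplus idea, here is a minimal sketch of the bookkeeping. It assumes that unused budget simply carries over from month to month as a running total, which may not match how your organisation actually accounts for it; the function names are made up for illustration.

```python
# Assumption: the downtime budget is tracked as a running total, so any
# seconds not used in one month are banked for later months.
BUDGET_SEC_PER_MONTH = 4 * 60  # the 4 min/month goal from the question


def months_of_budget(outage_sec: int) -> float:
    """How many months of budget a single outage consumes."""
    return outage_sec / BUDGET_SEC_PER_MONTH


def banked_surplus(months: int, avg_downtime_sec: int) -> int:
    """Seconds of unused budget accumulated after `months` at a given average."""
    return months * (BUDGET_SEC_PER_MONTH - avg_downtime_sec)


def remaining_debt(outage_sec: int, surplus_sec: int) -> int:
    """Seconds of the outage that the banked surplus does not cover."""
    return max(0, outage_sec - surplus_sec)


if __name__ == "__main__":
    one_hour = 60 * 60
    # The figure quoted in the question: one hour is more than a year of budget.
    print(months_of_budget(one_hour))
    # Two years of averaging 3:15 (195 s) against a 4:00 budget.
    surplus = banked_surplus(24, 195)
    print(surplus / 60)
    # What is left of the one-hour outage once the surplus is spent.
    print(remaining_debt(one_hour, surplus) / 60)
```

How quickly the remainder is paid off depends entirely on how far below the budget you run afterwards, which is why the conservative period matters.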