Who's on call? (susanjfowler.com)
165 points by emilong on Sept 7, 2016 | 112 comments


Most interaction at Google between SRE and developer teams is mediated within the context of a "failure budget". For example, let's say the agreement between the engineers and the product and budget people is that the service needs to have four nines of reliability; that's the amount of computing and human power they're willing to pay for.

Well, that means the service is allowed to be down for four minutes every month. Let's say for the past three months, the service has actually only been out of SLA for about 30 seconds per month. That means the devs have a bit of failure budget saved up that they can work with.
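
For concreteness, here's a back-of-the-envelope sketch of that arithmetic in Python (assuming a 30-day month; the numbers are illustrative, not Google's actual figures):

    # Back-of-the-envelope error budget math: four nines of availability over an
    # assumed 30-day month, with some downtime already spent this month.
    SLO = 0.9999                       # 99.99% availability target
    MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes in a 30-day month

    budget = MINUTES_PER_MONTH * (1 - SLO)   # ~4.3 minutes of allowed downtime
    spent = 0.5                              # e.g. the ~30 seconds mentioned above

    print(f"monthly budget: {budget:.2f} min, spent: {spent:.2f} min, "
          f"remaining: {budget - spent:.2f} min")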

How do you spend a failure budget? Well, let's say you're a developer and you have a new feature that you just finished writing late Thursday night, but the SREs have a rule that no code can be deployed on a Friday. If you have a lot of failure budget saved up, you have more negotiating power to get the SREs to make a special exception.

But let's say that this Friday deployment leads to an outage late Saturday night, and the service is down for sixteen minutes before it can be rolled back. Well, you now have a negative failure budget, and you can expect the SREs to be much more strict in the coming months about extensive unit and cluster testing, load tests, canarying, quality documentation, etc, at least until your budget becomes positive.

The beauty of this system is that it aligns incentives properly; without it, the devs always want to write cool new code and ship it as fast as possible, and the SREs don't ever want anything changing. But with it, the devs have an incentive to avoid shipping bad code, and the SREs have reason to trust them.


Sounds great, even utopian.

In almost all environments I've ever worked in, telling executives that you can't deploy a new feature because you've exhausted your error budget for the month is not acceptable, which renders this system of "failure budgets" moot.

I've also spoken with Googlers about their devops/SRE responsibilities and I've gotten the distinct impression that what is advertised as being the organization's "best practice" is not in fact the common practice.

Measuring and tracking error budgets comes at a large cost, and enforcing failure budgets is rarely a politically effective initiative as an engineering manager.


"I've also spoken with Googlers about their devops/SRE responsibilities and I've gotten the distinct impression that what is advertised as being the organization's "best practice" is not in fact the common practice."

--This is exactly my gripe with this article and the accompanying comments: portraying the Google workplace in an almost mythical fashion and failing to see the marketing ploy behind it. I suspect that SREs who work(ed) there don't really want to equate themselves with DevOps because then they would lose their prestige and devalue their market worth. Google management also doesn't want to do that, because otherwise no one would take the job. Call it as you see it, people. See my other comment about this point, which adds more information: https://news.ycombinator.com/item?id=12446385

(edit: throwaway account because I am revealing personal information in the other comment)


They don't talk much about the challenges. It is a ton of work, even on the backend, to get the amount of SLI coverage in the shape needed to effectively calculate an error budget. Even then you can introduce changes that break functionality in stealthy ways. Doubly so for the front end.


That is interesting. I would imagine that this leads to extremely risk-averse deploys? You can't know how much downtime a particular failure will incur, so one small error could easily result in expending the failure budget for a significant period of time. Also, if ~4 minutes per month is the monthly failure budget, then 1 hour of down time uses up over a year in failure budget! How can a team so heavily "in debt" ever hope to push code again?


There are 2 points that you seem to have overlooked.

First is budget. The 4 min per month goal is not arbitrary; management has come up with this number based on the monetary cost of downtime, and (ideally) is committed to expending a commensurate amount of resources (e.g. personnel, time, infrastructure) to make this happen. It is OK to ask for a bulletproof application, but you have to be willing to pay big $$$ for kevlar. Issuing cardboard vests and then acting surprised when those fail to stop bullets is both stupid and unprofessional.

Second is rollbacks. If everything was working fine until today and things begin failing a few hours after you push a new version, save your trace data and go back to the previous version! This won't work every time, but more often than not it will save time.

Besides those points, you ask...

> ...1 hour of down time uses up over a year in failure budget! How can a team so heavily "in debt" ever hope to push code again?

The whole point is that you don't want a single big failure to bring you to your knees. If your max downtime is 4 minutes per month, you don't try to run your shop so that the average is 3:55. You play it safe for a couple of years and try to make it 3:15. When you have built up 15 minutes or so of surplus you may get bolder and strive for a 3:30 or 3:40 average, but never more than 3:45.

This way, the big 1-hour event wipes out your surplus... but you can play things conservatively for the next 3 months and then go back to operating as you always have.


I'm not with Google but the SRE book has a whole chapter on this (Ch. 3 Embracing Risk). For most services you don't need reliability quite as high as you'd think. There's a background error rate caused by networks and devices you can't control, and once your service is more reliable than the background rate then you should shift focus to feature development instead.


The book severely simplifies this concept and borders on error when talking about the general reliability of the network vs your service. It's an interesting point to consider though.


Deployment shouldn't normally be a huge cause of serious outages, because the problems should be caught early, when they are affecting only a small fraction of traffic. If you deploy to one of ten thousand replicas, then even if the new build is completely broken, serving errors to 0.01% of your queries for a few minutes isn't going to eat much of your failure budget. Even better if you can deploy to testing, internal, or other non-public-facing replicas that don't count against your failure budget at all.

There are very large services at Google that deploy builds daily. We have a lot of monitoring, and many of our services are rigged to automatically detect and roll back bogus pushes.

The largest source of outages in these large systems is not software deployment but configuration changes, especially in systems that aren't rigged for regular automatic pushes. The second largest source is, somewhat ironically, incorrect responses to small outages.
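
A minimal sketch of the kind of canary gate being described; the function names, thresholds, and rollback hook are invented for illustration and are not Google's actual tooling:

    # Hypothetical canary gate: serve a new build from a tiny slice of replicas,
    # compare its error rate to the stable fleet, and roll back automatically if
    # it looks worse. Thresholds and names are illustrative only.
    def error_rate(errors: int, requests: int) -> float:
        return errors / requests if requests else 0.0

    def canary_healthy(canary: dict, baseline: dict,
                       max_ratio: float = 2.0, min_requests: int = 1000) -> bool:
        if canary["requests"] < min_requests:
            return True   # not enough traffic yet to judge either way
        c = error_rate(canary["errors"], canary["requests"])
        b = error_rate(baseline["errors"], baseline["requests"])
        return c <= max(b * max_ratio, 0.001)   # allow a small absolute floor

    canary = {"errors": 40, "requests": 1000}          # 4% errors on the canary
    baseline = {"errors": 50, "requests": 1_000_000}   # 0.005% errors on the fleet
    if not canary_healthy(canary, baseline):
        print("canary unhealthy -- rolling back")      # stand-in for the real rollback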


The big hint in the post you are responding to was "rollback".

Why would you ever attempt any upgrade without a good option for rolling it back?

Today it is easier than ever. A small shop might host their service on a VM and spin up a second VM for the new deployment. Swap the IP addresses between the VMs, and if customers are busted, just swap them back. Once the new service has been up for a convincingly long time, take the old VM down and store it off-line somewhere.

That is clearly a hack, but it allows nearly instant rollback for the smallest shops. Coming up with something similar at medium shops shouldn't be too hard either.

Once rollback is solved, a new deployment might eat minutes of failure budget instead of hours.
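
A toy sketch of that swap, under the assumption that your VM host or load balancer gives you some way to reassign the public address; the assign_ip helper below is hypothetical, not a real API:

    # Toy blue/green switch in the spirit of the VM swap above. Nothing here
    # talks to a real provider; assign_ip is a hypothetical placeholder.
    LIVE_IP = "203.0.113.10"   # the address customers hit

    def assign_ip(vm: str, ip: str) -> None:
        print(f"(pretend) pointing {ip} at {vm}")   # swap in your provider's API/CLI

    def deploy(new_vm: str, old_vm: str) -> str:
        assign_ip(new_vm, LIVE_IP)   # cut traffic over to the freshly built VM
        return old_vm                # keep the old VM around as the rollback target

    def rollback(old_vm: str) -> None:
        assign_ip(old_vm, LIVE_IP)   # near-instant rollback: point traffic back

    previous = deploy(new_vm="web-v2", old_vm="web-v1")
    # ...and if customers turn out to be busted during the soak period:
    rollback(previous)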


Try that when your deployment requires a schema change of a database table with a billion rows :-)

That being said, we've automated deploy and rollback on error to the point where it's a script, and we run it 50 times a day. The "immune system" from a few minutes of partial deploy saves us weekly.


DB schema updates don't change your rollback ability.

Need to add a column? Go for it. Then do your deployment that uses the new column. If you have to roll back, you can, and the new column will just sit there, again unused.

Have a column you don't need anymore? Do the deployment that stops using it, and let it sit for a while. After you're convinced it's safe, drop the column.

Have a column that needs to change size/purpose/etc.? Nope, don't do that. Just create a new one, and deal with differences between the old and new column in code until you can safely drop the old column, as above.

Does your ALTER statement take too long to run and cause downtime? Do it on a replica that isn't serving traffic, and then promote that to be the new master. Actually, in general, you should be doing this for any kind of change anyway, just in case you screw up the schema update.
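
A sketch of that expand/backfill/contract sequence as ordered migration steps; the table and column names are invented, and the SQL is MySQL-flavored purely for illustration:

    # Phase 1 ("expand"): additive change only. Old code ignores the new column,
    # so rolling the application back stays safe.
    EXPAND = "ALTER TABLE users ADD COLUMN email_normalized VARCHAR(255) NULL"

    # Phase 2: deploy code that writes both columns and reads the new one with a
    # fallback to the old one, then backfill in small off-peak batches.
    BACKFILL = (
        "UPDATE users SET email_normalized = LOWER(email) "
        "WHERE email_normalized IS NULL LIMIT 10000"
    )

    # Phase 3 ("contract"): only after you're convinced nothing reads the old
    # column anymore -- this is the step you sit on for a while.
    CONTRACT = "ALTER TABLE users DROP COLUMN email"

    for step in (EXPAND, BACKFILL, CONTRACT):
        print(step)   # stand-in for running each step through your migration tool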


Spin up a second database VM and point a fresh frontend VM at it. Then atomically switch the frontend VM like I was describing before.

If that doesn't work come up with some kind, any kind, of rollback solution. There is no server problem that cannot have a rollback solution made for it.

Some places are cheap and would rather just upgrade in place in the middle of the night. Those places are cheap and bad at engineering. Do not work for or do business with them if you actually care about uptime.


I'd say the most important interaction between SRE and dev teams at Google is when the developers come to the design review with their dumb idea and the SREs say "we're not doing that".


"Nope" is not a strategy.


Nope is often a great strategy when asked to do something impractical. It's better when it's followed up with an alternative, but stopping someone from building something terrible early on is much better than having them try to fix it after the fact.


Here's an idea: pay extra for on-call work. As a professional I want to fix stuff if I break it, but there's a limit to demands on my time.

It's especially infuriating to spend an evening away from my family to fix a problem that someone else caused and could have fixed in 5 minutes, but I had to spend several hours getting familiar with.

At this point, if management asked me to start on an on-call rotation I'd want to know how I was going to be compensated for the additional time and opportunity cost of being on call, or I'd start looking around for a new gig.


Because orgs don't want to pay extra for requiring you to be on call.

Unless it's codified in labor law, they can extract that off hours work from you for free (in other professions, you're compensated just for being on call, and then further if a call comes in).

As others in this thread have mentioned, people are wising up to the perils of being on call, along with the lack of compensation that goes with it.

Source: 15 years of ops experience.


There are companies that pay for being on-call, and additional sums if you get called.

source: I work for Ubisoft and they pay me for being on-call and pay me per hour for when I get called.


I believe they exist (as you mention), but it's not as common as it should be (as it would be if labor law required it).


This probably comes from Ubisoft's French roots. It sounds like a very French attitude to take towards being called after hours (which is illegal there).

I like that they do this for you; good on them!


> people are wising up to the perils of being on call

I did too late.

It started with being asked to sleep in the guest bedroom while on shift and only got worse from there. Take it from me: it takes active investment to keep a family healthy while on call, especially when you're on call at a shaky startup with several pages per day that require extensive remediation.

I'm not stupid and I know on call was not the reason in itself, but it was a significant catalyst and debit upon my time and mental health -- not sleeping adds up. Be aware of it lest you end up like me, finally off call with an empty home to show for it.


It is not as bad if you can rely on a strong team (3.5 years there, no significant consequences in personal life).

Some things that help:

1. Everybody is in the rotation, not just the new guy(s).

2. The team is big enough for everyone to have a reasonable ratio of on-call vs off-call days. Merging two or more small teams into one single on-call rotation of death does not count; people who are not knowledgeable about the problem at hand (e.g. everyone, eventually) will just fuck up and end up calling someone from the correct team anyway (after the problem has grown worse and the customer is angrier).

3. Team is encouraged to trade days or cover for each other if needed.

4. The on-call guy has veto power over deployments. If you want to push something urgent at the end of the day, you had better make yourself available to the guy who can veto your deployment.

5. Management understands that developers' productivity will slow down while on-call, and plans accordingly.


Yeah, I worked at a hedge fund for nearly a decade. I got called back to the office from vacation after driving across the country for a development - not production - issue. I started getting 2-4 am calls for outages in dev by our offshore support who were too lazy and/or incompetent to read a log file and take action. Glad I don't work there anymore.


Every policy that gets put in place has the possibility of influencing behavior of those involved.

If your organization pays extra for on-call work, that incentivizes buggier code, because you will get paid to fix or solve it later (even if it's not intentionally buggy).

You see this all the time with contract work and it's a huge issue with low bids that end up costing a lot more because of the sunk cost.

Edit: In fact, you already have developers fighting this incentive at your organization. If developers are already writing buggy code hoping that it will be someone else's issue, and not caring when it's their turn to be on call, imagine what happens when they get paid extra to fix the bugs they introduced to begin with.


Instead of on call, have coverage.

Rotate through different employees and give them times when they are responsible for "being awake and ready to work on short notice"; i.e. this is a normal hourly wage with the bonus that you don't have to be at work.

For both the above and 'on call', you also charge the department responsible for the failure's root cause (as determined later) a penalty, and pay out a bonus to those needed to respond to the incident.


Why not pay a flat rate for taking the on-call responsibility, regardless of actual work performed?


That's fine, but unless your on-call scheduling is erratic and unpredictable, it's no different from pay that's included with your salary.

Like the article says, you really get the best results when the people responsible for the code are also responsible for issues that come up after hours. That means your team should be rotating through so that everyone feels ownership of and responsibility for the deployment.


The Earth rotates and has people all over. We shouldn't have "after hours"


Placing folks around the globe doesn't really help with the weekend or holiday situation.


> Placing folks around the globe doesn't really help with the weekend or holiday situation.

Given that non-working "weekends" and "holidays" aren't global standards, it kind of can, though arrangements to simultaneously mitigate those situations and working-hours situations may be more complex than ones intended to handle only one or the other problem.


This works. We did this for developer on-call rotations (which was also a development opportunity for rising engineers). We also did this when I ran Ops.

It's not perfectly fair in the micro sense, but it's fair enough to not trip most people's frustration meter and it's very easy to administer.


The way it used to work in the organisation I used to be part of (generally a good bunch) was that there was a flat-rate, plus an extra payment if we were required to do something.

Occasionally there was time in lieu given as well, if we were actually called and it was overnight/took a while.

Worked well, but as a general rule we were delivering a quality product, had good guys working for us and so we all felt responsible if something went wrong out of hours and would look at what happened and how we would prevent it going forward.


We also gave comp time for significant disruptions/outages, but that was mostly recognition of the practical reality that someone who normally worked days and unexpectedly worked 1-4 AM responding to the pager was going to be useless the next day anyway.

I specifically wanted to avoid variable pay, the associated timesheet tracking and approval processes, the HR and finance/payroll integration, and any temptation (beyond normal professional responsibility) to either decrease or increase the hours spent on a problem response. I'm sure it's different for different businesses, but I judged that paying on-call bonuses per week on-call, settled quarterly, was a low enough finance-integration effort and still gave the employees a sense that they were being paid some differential for being on-call.

Your last sentence is the key outcome to strive for and as long as you have decent leadership and culture, I think that's fairly easy to achieve in a small group. I don't need to be paid "extra" to do a little extra to help my team and company.


IMHO, the problem with on-call rotation is that people just try to push problems along until it is someone else's turn. No one is looking at the root causes. No one is making changes to the systems to avoid problems in the future. No one is building operational support into the applications (mentioned by Johnny555 in another comment).


If you are an hourly employee, there is already extra pay for nights/weekend/long shifts.

If you are salaried, you're expected to do the job that's needed, without particular daily time keeping.

Easiest is to build robust infrastructure that recovers from failure, uses circuit breakers, soaks in forked workloads before being live, and such, so being on call is a low risk venture.
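
For the circuit-breaker piece, a minimal sketch of the idea (thresholds are illustrative; real implementations add per-dependency state, metrics, and smarter half-open probing):

    # Minimal circuit breaker: after enough consecutive failures, stop calling the
    # flaky dependency and fail fast until a cool-down elapses. Illustrative only.
    import time

    class CircuitBreaker:
        def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None   # cool-down elapsed: let one call probe
                self.failures = 0
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()   # trip the breaker
                raise
            self.failures = 0   # any success resets the count
            return result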


As an in-demand software engineer with oncall experience at a well-regarded company, I will not consider jobs that require me to be on call.

Jobs with oncall don't offer more compensation than jobs that don't.

Scheduling my life around being able to answer a page is inconvenient, and waking up in the middle of the night is something I'd rather avoid.

Operational work is often not considered as important as feature development for promotions, so you feel like you're wasting your time when doing it.

In my experience, system quality is completely independent of whether the developers do oncall or not. But I'd welcome objective data that proves otherwise.

There is no upside for me as an individual to take a job with oncall responsibilities.


Companies need to evaluate the loss of man-days when they demand or even suggest on-call duties for engineers. Also the effect of these duties on number of sick days taken, and overall health. Because when I'm up from 3-3:30am fixing something, I'm going to be less useful and emotionally testy that day at the office. When I'm a Good Marine and am up eyeballing a problem from 1am-6am, the same will happen but over a 1-2 week span. And I'll probably get sick.

I really think engineers end up on-call because companies give short shrift to documentation. Maybe if you afforded time to document systems and processes, you could outsource on-call duties, or trust sysadmins to remediate without developer intervention.


"Operational work is often not considered as important as feature development for promotions, so you feel like you're wasting your time when doing it."

Which is why developers need to be on call; otherwise they don't build operational support into apps (monitoring hooks, useful logging, etc).

At my company devops is primary on-call, with developer(s) as backup; if devops can't fix the problem quickly, they escalate to the developer. If they can't reach the developer or the developer can't fix it, then the entire release is rolled back and the developers bear the brunt of the release schedule changes.

Without developers in that loop, it's hard for devops to get the tools they need to diagnose and fix problems.
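
A small example of the kind of operational support that's being asked for, using stdlib logging and a toy counter; the payment-gateway helper and metric names are hypothetical:

    # Toy "operational support" baked into application code: a counter ops can
    # scrape and log lines with enough context to diagnose failures without
    # paging the author. Names are illustrative.
    import logging
    from collections import Counter

    log = logging.getLogger("payments")
    metrics = Counter()   # stand-in for your real metrics client

    def _call_payment_gateway(order_id: str, amount_cents: int) -> bool:
        return True       # hypothetical: the real gateway integration lives here

    def charge(order_id: str, amount_cents: int) -> bool:
        metrics["charge.attempts"] += 1
        try:
            approved = _call_payment_gateway(order_id, amount_cents)
        except TimeoutError:
            metrics["charge.gateway_timeouts"] += 1
            log.error("gateway timeout order_id=%s amount_cents=%d", order_id, amount_cents)
            raise
        if not approved:
            metrics["charge.declines"] += 1
            log.warning("charge declined order_id=%s amount_cents=%d", order_id, amount_cents)
        return approved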


> At my company devops is primary on-call, with a developer(s) as backup

Sort of defeats the purpose of Dev ops, if you have a Dev ops team and a Dev team, doesn't it?


Does it?

Dev Ops is focused more on operations (tooling around releases, monitoring, etc) as well as doing initial triage for system problems to figure out where the problem is, and if they can't fix it themselves, determine which engineering group can help resolve an issue. Most of the Devops staff does development, but more around operational needs, product features are implemented by Dev.

Dev Ops means different things to different companies.


I think it's fair to say that you're just ops since you have the exact same functional role. And you can suffer from the same organizational issues that prompted the creation of the devops movement.

That said, there's a need for a way to divide the industry between traditional ops people and those with the ability to handle modern infrastructure and develop custom tooling where needed. And for better or worse the industry has decided that "devops" is going to be the way to determine that. Now the problem is all of the fakers (both individuals and teams) adopting the moniker. So now we're back to square one and we actually need to evaluate people and teams based on their knowledge and skill.


We're more integrated with Dev than a traditional "ops" role where Dev throws the code over the fence at Ops and says "Deploy it!" and Ops throws it back if there's a problem and says "Fix it!"

Our devops team sits in on design and code reviews to help ensure that operational needs are met early on in the process, and we'll change code to fix bugs or support operational needs. We're not full developers and won't rearchitect entire systems, but we will do bug fixes when we can.

"So now we're back to square one and we actually need to evaluate people and teams based on their knowledge and skill"

Have we ever left that square?


Sounds like your devops and dev teams do devops which is great. I wasn't trying to imply that they don't or that they had the same problems that prompted the movement, only that your roles are divided the same way as traditional dev and ops even if you do in fact do them smarter than average. What I'm getting at is that "devops" as a title is an (internal) marketing term.

You're right that every org defines devops differently but it's totally and unashamedly coopted from a movement which, despite being very generally defined, defines it radically differently. And that's not to say it's always a bad thing either. Sometimes it takes a title to help signal change and make an organization better which in the end accomplishes the goal of the movement.

And I agree, I don't think we ever left that square and we never will. But if prepending "dev" to "ops" wasn't at least somewhat effective in getting something out of people in decision making positions it wouldn't be tacked on to job titles and team names left and right. So at least some people think the marketing terms in a job title are a meaningful shortcut even if smart people like you know better.


Yep. This is how my org does it. DevOps builds automation/tooling to help the operations focused developers on the dev teams do their jobs better.


The word Devops means that the Dev team is the Ops team. If you have both teams, you are defeating the purpose of DevOps


Actually devops is supposed to be a culture where the dev team and ops team work together. What people seem to want it to mean is you're an ops team with automation skills. I've yet to see this be a dev team doing ops.


It goes a bit beyond "an ops team with development skills" in my company -- someone from Devops sits in on design and code reviews to suggest changes for operational needs (which almost always means adding metrics so we know how busy it is, and when it breaks or is near breaking, followed closely by adding enough logging so when it does break, we know why and don't need to page a developer to read a stack dump to try to figure it out).

But ultimately, the developer is the one that best understands how their code may break, so they need to instrument the code accordingly. And when the developer knows that giving the right information to Ops may avoid a 3am phone call, they have a good incentive to do so.


Typically a DevOps team would mean a team that works on things that make it easier for the whole organization to do DevOps, e.g. tooling for easier building, deploying, testing, configuration and monitoring. It doesn't defeat the purpose of DevOps unless the people at the organization think that it's the only team that needs to be doing DevOps.


If you have a DevOps team you are doing it wrong.


So much this. I don't trust an engineer who's never done maintenance work. And I really don't trust an engineer who's never been on call. I find that they write vastly different (usually much simpler) systems. Personally I will now only build a system that I know I can troubleshoot at 4am while drunk.


I 100 percent agree. I'm sorry you feel the need to post under a throwaway, but I'll put my name on this to back your point up.

On call is a deal breaker for my future job searches, and I am considering leaving the company I just joined because they sprung it on me while never mentioning it in the interview process.

On call is justified in my opinion only if someone will die if the problem isn't handled. Or if compensation is dramatically increased and agreed upon in advance.


"On call is justified in my opinion only if someone will die if the problem isn't handled. Or if compensation is dramatically increased and agreed upon in advance."

What if the company will die if the problem isn't handled? Many companies (mine included) provide a service that needs to be highly available 24x7 (customers run automated tasks 24x7 against our service). If our site regularly went down for hours at night (or even for an entire weekend) because of a software problem that no one could fix because the developers weren't picking up the phone, we'd lose customers and eventually, the company would go out of business.

Even a 10 minute outage is a significant event and requires a full RCA for customers. We try hard to architect for high availability, but bugs do happen.


The answer here seems simple. Include on-call time in compensation, and fire developers.

Note that for some developers, you're never going to be able to compensate 'on-call' hours appropriately due to their evaluation of the opportunity cost of their time.


> Include on-call time in compensation...

I'm glad you brought this up - there's a quick conclusion to this conversation:

"On-call time compensation is part of your salary."


Yeah, the labor laws are broken.

Salary shouldn't be a thing that a company can ever hide behind.

Labor laws really do need to cover the maximums that an employee can be expected to work. They should also make going above those maximums exponentially more expensive in 'bonus time'; and the accumulation rate shouldn't magically reset after time, but only after time back on normal/reduced work load and duration.

Also, while I'm on this subject, 'full time' work should really begin at more like 24 hours/week. Benefits for part-time work should be pro-rated. (It should never be more cost-effective to split a full-time job into part-time jobs. That is defrauding the economy and making others pay for the costs of your labor.)


That's just fine, as long as on-call expectations are fixed as part of the job offer, just like salary is. If the on-call expectations change for the worse later, that is exactly the same as a salary cut.

IANAL but I believe (in the US or in NYS at least) this would be grounds for quitting with full unemployment coverage (you are not usually eligible for unemployment benefits if you quit, exceptions include the basic nature/hours/wage of the work being changed without your consent, since that is philosophically similar to having your old job terminated and being offered a new, worse one)

Get any on call expectations in writing when taking an offer if you can, just like you'd expect salary in writing


The real problem then is engineers being unwilling to vote with their feet.

In a market as friendly to software developers as the one we're currently in, the correct response to that is totally "great, consider this my two weeks' notice, I'll go to this other company across the road that pays the same salary but doesn't demand I be on call".

(No, the response to every problem with your work environment shouldn't be to get a new job, nor is that even a realistic possibility for many. The economic argument still stands.)


It's a valid answer provided that on-call was part of the conditions known during hiring and salary negotiations. If we've negotiated a salary of $X for doing Y, but then you suddenly want Y+on-call, then you're going to either pay more or find another employee, because $X is simply too little for that.


"What if the company will die if the problem isn't handled?"

Then they should be paying huge bonuses to be on call, or simply hiring for multiple shifts. It's pretty easy.


Maybe they "should" do one of those, but it seems unrealistic to expect a small 10 - 30 person company to do either.

And when the company grows large enough that they can support 24x7 engineering staffing, they usually do it by having an overseas team.


If you can't pay enough to find someone competent to be on-call, you have to do without. That's life.


"Maybe they "should" do one of those, but it seems unrealistic to expect a small 10 - 30 person company to do either."

Why? What's so special about a tiny company that they get things for free? What's so hard about, "If you want something, then pay for it"?


Because, math? That company may have a team of 5 developers; how do you split those 5 developers across three 8-hour shifts that cover 24 hours/day, 7 days/week, plus holidays, vacations, and sick days, to get 24x7x365 coverage without doing an on-call rotation?

When you're in a 100 or 1000 person company, then it's easier to have dedicated after-hours support staff (or staff working from multiple timezones around the world)

No one is saying that it should be "free", it's built into the salary - every company that I've worked at that's had an on-call is very clear about on-call rotations during the interview.


"Because, math?"

Except you're saying that small businesses should get to ignore that math, and get stuff for free.

" That company may have a team of 5 developers, how do you split those 5 developers across 3 8 hour shifts that cover 24 hours/day 7 days/week + holidays + vacations + sickdays to get 24x7x365 coverage without doing an on-call rotation."

You hire more people so that you can, or you don't make deals you can't afford. Honestly, why is this so hard to understand? Why do you feel so entitled to things you can't afford?


Again, who said anyone is getting anything for free?

My company has 100+ engineers, all of whom do an on-call rotation; they are all aware of it when they sign up, so on-call duty is included in their salary. The employee can evaluate for themselves whether or not they think the salary is sufficient to cover that, but as far as I know, we've never had a candidate turn down an offer due to the on-call requirement (though I don't know that they'd always tell us that's why).

But really, does any engineer join a small company (in the USA) without assuming that they'll be on-call after hours? Even in early stage companies that haven't launched a product yet, there's still after-hours support to be done to keep dev systems running, fixing broken builds, etc.


And that overseas team often has little-to-no ownership so they call people while they're sleeping or half ass everything. This problem is magnified if they're contracted and not actually employees.


As someone who runs an on-call team, and is still woken up frequently as a co-founder: if someone sprang that on you, you should quit ASAP, and tell them why. People need to know that on-call is a decision, not a mandate.

The only reason my team puts up with it is because we're 100% remote. They never leave their families, so that's the trade-off.


Grumble grumble. There are times I'm treated as oncall despite being hourly. Figure that out.


Consultants charge double during on-call hours. It's standard practice.


Unplanned == big $$. If you have to travel to the customer you can charge double your hourly rate and an emergency trip fee + expenses.


No, just an hourly employee. A support technician at that.


That sucks, but hey, at least you get compensated for it. I think in some states they even have to pay you for a certain amount of time at minimum even if you only work for 10 minutes.


It happens very rarely. The times it does, they do it under the guise of my being potentially fired if I answer wrong, i.e. they want immediate clarification of what I did in some crisis but really just want info. Not paid, and I'm actively searching for work.


Find a new job, buddy. Nobody is going to stick up for you but yourself.


Did you ever ask about it during the interview?


I am compensated, beyond normal salary, for every hour of oncall outside of normal working hours, up to a cap of 15% salary.


"Jobs with oncall don't offer more compensation than jobs that don't."

This sounds particular to your experience. There are definitely companies where on-call time is compensated.


I think it's more: comparing multiple companies, my total yearly income won't be any higher if I pick a company that requires oncall than if I pick one that doesn't, even if the one that requires it compensates for it. I can do the same job at company X for 200k without being on call, or for 175k at another that requires it, or vice versa... there's no correlation.

So if it's not obviously worth my time, I just won't do it.


As someone who is occasionally told by people here on HN that my job is becoming irrelevant, I hope you folks have some regard for sysadmins who do on call regularly and (likely) get paid drastically less to do it. ;)

I've been told here that now developers can and should do everything, but it doesn't seem like a lot of developers want to be called in at 2am.


I think the term developer is a bit loaded. If you're properly developing an application, you should know what your dependencies are, where you store your data, what the failure conditions are, and how to prove your application is actually functioning.

In my world, developers barely understand the programming language or framework they're using and are more interested in using buzzword foo rather than solving the problem at hand or dealing with actually running the systems that solve said problems.

As someone who does both dev and ops jobs, together and separately, I think your job will be safe for a very long time.


"In my world, developers barely understand the programming language or framework they're using and are more interested in using buzzword foo rather than solving the problem at hand or dealing with actually running the systems that solve said problems."

Oh, man. A thousand times this. Many devs I work with have been writing code that runs on AWS for years, and don't know what RDS means.


> As someone who is occasionally told by people here on HN that my job is becoming irrelevant

HN is an enormous echo chamber; Take it with a liberal amount of skepticism.


Do you write production code?


A few years ago I took a job where all engineers took turns carrying the pager. The reasons were that we were too small for dedicated ops resources (justified), and that the head of engineering wanted us to feel like a family restaurant (not justified.)

Shortly after joining, I gravitated towards our desktop client and just couldn't keep up with all the changes on the server environment. When the pager went off, I just didn't know what to do. What was more frustrating is that our system had a few chicken littles in it, and I really wasn't up to date on the context about when "the sky is falling" really means "the sky is falling."

Probably the bigger problem is that I don't consider myself an "ops" person. I prided myself on making the desktop product stable and performant; I didn't have the time to learn the ins and outs of service packs and when to reboot.

I agree with the article completely: developers should be on call when their code is shipped, and while their code is immature. But just keeping developers on call, or rotating in developers who just aren't involved with the servers, is a complete waste of time. It fundamentally misunderstands why successful companies rely on specialization and division of labor in order to grow.

I think the author is spot-on when she states "Who should be on-call? Whoever owns the application or service, whoever knows the most about the application or service, whoever can resolve the problem in the shortest amount of time."


There is nothing more frustrating than being responsible for software you didn't write or had no part in writing.

I don't get why management finds this so hard to get. I know there is some spreadsheet somewhere with my name next to this application, but I have never touched it and until now didn't know it existed. Please don't expect me to figure it out in an emergency situation.


Try being the ops person expected to be able to troubleshoot this dumpster fire in the middle of the night; they didn't write a single line of the breaking code either. Why exactly are they expected to know what broke and how to fix it while you can't seem to be bothered learning how the rest of your stack works?


"Bothered" is a bit of a strong word there. When the company I worked at had 30 engineers, fewer services, and fewer lines of code, I did know how almost everything worked. Now we have over 3x as many engineers, and enough services such that it would be a full time job and then some just to keep track of how everything works.

Fortunately we do team/product-based on-call, so people are (for the most part) only responsible for services they work on or at least are familiar with.


This was a pretty interesting article that hits very close to home (I'm an SRE at Google). I think the central thesis (that developers are better at running rapidly changing products because they are able to find and fix bugs more quickly) is a bit flawed, however.

The reason is that I think the most valuable contribution of the SRE is not in responding quickly to outages, but in improving the system to avoid outages in the first place. SREs tend to be better at this than developers because (a) they have better knowledge of best practices by virtue of doing this kind of work all day every day and (b) they are more incentivized to prioritize this kind of work.

Because of this, the dynamic I commonly observe is that SRE-run services have fewer and smaller release-related outages because techniques like canarying, gradual rollouts, automated release evaluation, and so forth are deployed to a great extent. On the other hand, developer run services tend to have more frequent and larger release-related outages because these techniques are not used or are used ineffectively. So even though the developers can diagnose the cause of a release-related bug more efficiently than SREs can, the SRE service is still more reliable.

In my view, the main reasons to have developers support their own services fall into (a) there aren't enough SREs to support everything, (b) the service is small enough that investing the kind of manpower SRE would into implementing these best practices would not be cost effective, and (c) SRE support can be used as a carrot to get developers to improve their own services.

Edit: I would add that if the role of oncall is expected to include only carrying the pager, and not making substantial contributions to improve the reliability of the system, then the author is absolutely right that having an SRE or similar carry the pager has next to no benefit.


Exactly. Even the most trivial of bugs can't be fixed as quickly as Google would need it to be fixed for that to be their go-to strategy. You simply cannot use that as your response to a show-stopping bug if you have stringent up-time requirements.


"The number one cause of outages in production systems at almost every company is bad deployments" [refering to code deployments]

When I read post mortems from companies posted or linked here (e.g. Google, Facebook, ...) it does not seem that outages result from code deployments.

In my 10 years of experience as a CTO/VPE I've seen only a few outages resulting from deployments (mostly because test data sets were too small and processing in production took much longer, resulting in slow responses and then an outage).

The majority of outages linked and experienced are either from growing load, introducing new technologies (databases, deployments but the outage was not from code and usually developers could not help) or rolling out configuration changes.

What would be your main reason for outages?


Background: I work on products that are less than 3 years old.

Deployments are the #1 cause of outages. We don't write up fancy postmortems for the vast majority of these outages, because it's things like "We forgot to ship the config files before pushing new code, guess we should automate that now" or "We automatically pushed the config files before code and the code wasn't backward-compatible, guess we should code review for that now." They're easy fixes, the downtime is almost always small, and it's relatively easy to fix in production.

The main cause of multi-day, turns-your-hair-grey outages is cascading failures that start with a deployment. My worst one was a migration that happened at the same time as a bad IOPS day on EBS, where we also ran out of disk space because we weren't rotating logs properly. That took some 18 straight hours to clean up, which sucked.

If you ask my customers, AWS issues are the main cause of outages. But really, deployments are the thing that cause the most loss. It's just that most people never catch that downtime because it's an extra 5, 10, 20 minutes of slowness or a maintenance page during off-peak hours.


"maintenance page during off-peak hours."

This is not directed at you in any way, but I always get angry when there is maintenance in off-peak hours, because usually that means daytime in Europe.


That's dependent on how often you ship.

Take your example of a load-based outage. In many applications there is a time-based peak (eg 11am-2pm daily for Facebook, or right before a concert sale opens for Ticketmaster) where load might cause an outage. In the worst case you might have a load-based outage per day. If you only deploy code once a week, then of course for you the majority of outages will be from load, and that's where you focus.

OTOH Facebook pushes their entire front-facing code base twice (three times?) a day. They also twiddle thousands of A/B bucket test switches that turn code on and off, every day. After years of stomping on load-based outages, now their biggest danger is deployment.


> introducing new technologies (databases, deployments but the outage was not from code and usually developers could not help) or rolling out configuration changes.

Are those things not deployments? Rolling out something is a synonym for deploying something in my book. Configuration and code are not such different concepts in practice.


I've worked in an on-call rotation at one company and won't do it again. I was paid time and a half for all time spent dealing with issues while on call as well as a small base amount that was something like 15% base salary for the days you were on call to account for the inconvenience of having to be near a computer, within cell service and able to respond within 20 minutes at any time of the day or night.

I felt like this was fair compensation but I still wouldn't do it again. Getting woken up at 2 A.M. and having to troubleshoot something for an hour and then not being able to fall back asleep or having to interrupt a date or just not planning dates when you're on call is not worth it.

Now my situation was multiple small systems deployed onsite at customer locations and subject to inconsistencies in their networks, weather related outages, failed microwave towers and computer illiterate users. So being on call meant you were almost certain to actually get called. A company with a more centralized failure stack probably goes days or weeks between the on-call person being called.


The article talks about services at Google being relatively stable and SREs there focusing on automating the instability away. A previous comment here also mentions that. In my experience, I find that to be not really true and more of a marketing image. The stress that being a Google SRE places on family and relationships is huge, and the job really is just like DevOps at other large service-based companies. The amount of code you write is orders of magnitude less compared to a software engineer, since there is significant tooling available (Google being a mature company). Most of the work done is just operating those tools.

I had an SRE girlfriend who became an ex because of the stresses it placed on our relationship. Although I was the one who helped her land the job and was with her through previous hardships, there were just too many missed dates and too little respect for my time and other stress related issues that breaking up was the only way out for me to regain peace of mind.

Maybe you need a certain sort of person to handle that kind of stress.


Google is a big enough organization that experience varies. When I was an SRE I wrote C++ code pretty much all day every day, even while being oncall, with only the occasional intervention necessary on my part. Teams like you describe are considered to be in "operational overload" and are actively dismantled by senior leadership, with responsibility for the dodgy systems devolving on the people who wrote them in the first place.


"Teams like you describe are considered to be in "operational overload" and are actively dismantled by senior leadership.." -In theory, maybe... But with the expectation to stay in said team for atleast a year or two before switching teams and the associated difficulty in actually doing it, I think it is a pretty bad postion for said SREs to be in.


The perspective here is interesting and totally different from my experience as a sysadmin.

If bad code is the most common problem, then maybe it's time to tighten up the testing and deployment procedures first. The reason operations is a different job is that they take care of much much different parts of the stack. Developers aren't going to be effective at their jobs if they have to also worry about tuning Java GC settings, analyzing database I/O bottlenecks, ensuring network security, worrying about network drivers, open file limits, and MTU size.

In my experience, the stuff that happens in the middle of the night more often involves infrastructural problems that ultimately have nothing to do with the code. And so it makes sense for the developers to sleep. By all means, assign an on-call developer that the operations staff can page when it's determined there's a code problem, but if that has to happen very often, then something else is wrong in your procedures.


I think the most frustrating part of this problem is how disengaged a lot of developers are from operational work. I don't think it's enough that we figure out who to delegate responsibility to. Both SRE/DevOps and developers should be always working together to avoid outages. There are usually things that make this hard, as described by Susan, but there has got to be a way to get people on the same page.

As an operational person I want you to have features, but I don't want you to break things. Developers want to focus on pushing features instead of getting bogged down fixing the work of yesterday. I don't think it's enough to try to make these things work as the teams exist; I think there needs to be this mentality from the get-go to make things good on both sides. Teams need to be engaging either side throughout the entire engineering process, not just when they think they're ready for the other side.


Close and shorten the pain loop as much as possible. If it causes availability pain, it should inflict pain on those who caused it.

If the developer wrote terrible code, the developer should be paged when the code/stack/framework breaks.

If ops/SRE/whatever chose a terrible server platform or cloud provider, they should be paged when the server crashes or goes offline.

Two decades of history have shown that the carrot doesn't work in this age of Internet companies. You gotta use the stick. I wish the carrot worked, and there are those altruists who have only worked in ideal environments where the carrot works, but they are the extreme outliers. The average lifespan of companies these days is too short for employees to stick around and actually care too much. All jobs these days are gigs, and most people are looking for the next one. Why would you waste time fixing your problems in this context?

Close the pain loop.


The key point here is ownership.

Now there are multiple ways to define and transfer ownership. The primary reason for the split of dev and operations teams was so that dev teams are not held back maintaining systems when there's more dev work to be done. However, the split only works when deployment is a weekly or monthly activity. For continuous deployment the dev team should be on call until the knowledge transfer can be done.

Where I work we have a split and our process works (in theory).

1. Developers go through the build, test, deploy process

2. Before deployment the dev and operations team meet and the dev team walks the ops team through the code, key changes, key functionality implemented.

3. Ops team poses their questions e.g. what assumptions were made, what are the possible values for a particular config attribute, etc.

4. Once Ops is comfortable they understand the changes the dev team turns over the application to the ops team.

5. This knowledge transfer happens in an hour long meeting with key stakeholders from both teams present.

6. This process is for weekly or biweekly deployments.

7. For a brand new project/product the dev team does a complete walkthrough with the ops team over a period of 1 week, and the dev team provides a 6-week "warranty" period for the application wherein the dev team is on call.


One of the challenges I have had, particularly with small teams (aka startups), is deciding what counts as a failure and how to avoid fatigue from being too aggressive about what a failure is.

I have found that if you are not aggressive with what a failure is (aggressive meaning classifying things that are not really fatal as outages... the system is up but there are lots of errors), it will bite you in the ass in the long run. The small errors become frequent big errors.

The problem is if you are too aggressive you will eventually get alerting fatigue.

I don't have a foolproof solution. I have done things like fingerprinting exceptions and counting them, all the way to the extreme of failing really fast (i.e. crashing on any error).
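
One way exception fingerprinting can look in practice, as a rough sketch; the paging threshold and alert hook are made up:

    # Group errors by type and raising location, and only page once a fingerprint
    # crosses a threshold instead of alerting on every single exception.
    import traceback
    from collections import Counter

    counts = Counter()

    def fingerprint(exc: BaseException) -> tuple:
        frames = traceback.extract_tb(exc.__traceback__)
        last = frames[-1] if frames else None
        where = (last.filename, last.name) if last else ("unknown", "unknown")
        return (type(exc).__name__, *where)

    def alert(message: str) -> None:
        print("PAGE:", message)   # stand-in for the real pager integration

    def record(exc: BaseException, page_threshold: int = 50) -> None:
        key = fingerprint(exc)
        counts[key] += 1
        if counts[key] == page_threshold:   # page once per fingerprint per process
            alert(f"{key} seen {page_threshold} times")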

In large part this is because small teams just don't have the resources to get this right but still have the demands to deliver more functionality.

I wish the article delved into this more because there are different levels of "its down".


I turned down a job offer with Amazon, partly because of their on-call rotation. They do not have a devops team, or even a unified tech stack. Each team is responsible for the creation, deployment and maintenance of their own code. I have dealt with too much shoddy legacy code in my lifetime, there is no way I will be woken up at 3am to support it.

Many years ago, there was no devops/SRE. You had the developers and a sysadmin team if your company was big enough. The sysadmins did not know about anything at the application level, so developers were always on call. With the advent and rise of the devops role, developers can now focus on their main task.

I used Hadoop very early on (version 0.12 perhaps?), and I removed Hadoop as a skill on my resume since I did not want to admin a cluster, just do the cool MapReduce programming. Once again, devops to the rescue.


My team doesn't have too much of a problem with DevOps. Of course, there are 2 significant mitigating factors: 1) we have a large team of about 8 developers thus we do on-call for one week every other month, 2) we're a moderately strong XP shop so we pair program & TDD code plus factor integration tests into stories (i.e. we avoid shitty code and "only so-and-so knows that" B.S. problems in the 1st place). I would NOT agree to DevOps on a 3-4 person team w/o some sort of significant stipend/bonus program, and I would _NEVER_ do DevOps in a team that didn't pair & didn't have good testing practices. YMMV.


I know from experience that Amazon engineers are responsible for the services they build. Amazon's motto is to push more operational tasks to the owners while providing them with great(?) monitoring and debugging tools to ease the load.

There was also a Netflix talk about its approach to operations which was very similar to Amazon's. I feel the way Netflix organizes its general software processes mirrors Amazon's... maybe partly due to the AWS influence.


Previously I worked in a place where it wasn't the case that deploys of new code often brought down the service (a webapp). If the service went down it was usually because some random DB or Kafka topic or something else I don't understand at all took a shit in the middle of the night.

So we just kept deployments on weekdays before about 4pm and if the site went down outside of that, well, it wasn't because of a deploy. And if it was, we were there to fix.


This doesn't really work for a lot of applications. Deploying code can introduce latent faults that only arise during peak (which may happen tomorrow or overnight).


What the article calls "devops" I call just "ops." If the engineers writing the system also run the system, then they are "devops."

"Ops" proper is much older than "at least 20 years" because they trace a direct line back to sysadmins, which have been around since forever.

Our system works OK. We have devs, and ops, and devops. Devs run a new service with help from devops until it's stable and a runbook exists. Then it's handed off to ops to keep running, with support from devs if it breaks while still in active development, or from dedicated maintenance devops if it's mature.

Not perfect, but pretty good, and efficiently runs the business as well as letting us iterate.


Most companies I worked at had a layered on call structure where level 1 would be someone like a Customer Relationship Manager, Level 2 would be someone from the "sysadmin" side, Layer 3 - an engineer from the dev side.

Once the issue would get escalated to L3 it became a crapshoot as even having someone from the dev side does not guarantee them knowing anything about the system or a part of the system that is having an issue.


I'm currently a student at Holberton[1], and we did a project where we had to be on call ensuring optimal uptime for a server. It was really sweet to emulate a real work experience, but super stressful. I can't imagine if I had to be on call more than a night or two a month. [1] https://www.holbertonschool.com



