I work at a telco with a lot of aging systems and servers.
A couple of years ago we had a total power failure at our major data center and central office, including the backup generator, so literally everything failed and an entire province of Canada (and most of two more) was without any telco, data, or wireless service.
It was eye opening and extremely scary to see what happens when everything is turned off at once and you need to bring stuff back up. Some of the equipment had never been turned off, and certainly not all at once.
Over the following weeks we encountered many, many catch-22 scenarios like "Before we turn on System A, it needs System B to be online. B needs C, and C needs A." Oh.
To give a sense of how severe this was: paramedics were stationed at major intersections in towns and communities in case people needed help, and radio stations in half of Canada were telling people what to do. When even 911 doesn't work, you have problems.
This is how "if it ain't broke, don't fix it" calcifies. People point to old systems like this (when they are working, that is) all the time as proof that developers are too obsessed with shiny new things. And I'm not saying that isn't true, but there is such a thing as being too conservative.
I suspect that the Federal Regulations also require the planes not to fail at 248 days (or words to that effect). Specifications are not the thing specified.
The problem isn't necessarily that specific issue, because now that it's known, practical measures can be taken to compensate for the risk.
The problem is that the existence of that one issue demonstrates a serious failure in the tools and development processes and quality checks used to create the control software. There could therefore be other issues in the code that haven't been found yet that could cause similarly catastrophic failures.
Failing to reliably identify this kind of arithmetic error is a huge deal when you're working with high-reliability software. It simply shouldn't happen, and there should have been several different stages in the development process that would have prevented it.
So, the development team should have the book thrown at them until someone figures out exactly how it did happen, closes every related tool and process loophole so it can never happen again, and audits all potentially affected code to make sure the same combination of loopholes hasn't allowed any similar mistakes into production elsewhere in the system.
It is an issue because it suggests that both the code and the requirements were not adequately reviewed. Unless the requirements explicitly stated that the generators would never run for more than 2^31 hundredths of a second, allowing that counter to overflow without coping with it properly is a failure.
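For anyone who wants to sanity-check the arithmetic, here's the back-of-the-envelope version, assuming a signed 32-bit counter that ticks every hundredth of a second as described:

```python
# Assumed: a signed 32-bit counter incremented every 1/100th of a second.
ticks = 2**31                     # first count that no longer fits in a signed 32-bit int
seconds = ticks / 100             # one tick per hundredth of a second
days = seconds / (60 * 60 * 24)
print(f"Overflow after roughly {days:.1f} days")  # ~248.6 days
```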
I have some production code that uses ints as serial numbers, with a factory method that just hands out the next one in sequence; however, I take care to ensure that there will be no negative consequences if they roll over (they exist only in debug builds).
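Just to illustrate the kind of thing I mean (this is a hypothetical sketch, not the actual production code), the factory can handle the wrap-around explicitly:

```python
import itertools

class SerialFactory:
    """Hands out serial numbers in sequence, wrapping explicitly instead of
    letting a fixed-width integer overflow into negative values."""

    def __init__(self, modulus=2**31):
        self._counter = itertools.count()
        self._modulus = modulus

    def next_serial(self):
        # Wrap back to 0 at the modulus, so callers never see a "negative"
        # serial; serials are identifiers only and must not be compared for order.
        return next(self._counter) % self._modulus
```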
Instead of rebooting all the affected systems, create a monitoring check that alerts the ops team a few days or weeks before the error condition (system will hang/reboot, SSL cert expires) comes to pass.
Then it's at their discretion what to do (for example, decide when to reboot).
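A minimal sketch of such a check (the 300-day figure is just the article's example, and the exit codes follow the usual Nagios plugin convention; adjust to whatever your alerting system expects):

```python
#!/usr/bin/env python3
"""Warn the ops team well before a known 'hangs after N days of uptime' bug fires."""

HANG_AFTER_DAYS = 300    # the known failure point (example figure from the article)
WARN_MARGIN_DAYS = 14    # how much advance notice the ops team wants

def uptime_days():
    # Linux-specific: the first field of /proc/uptime is the uptime in seconds.
    with open("/proc/uptime") as f:
        return float(f.read().split()[0]) / 86400

def main():
    remaining = HANG_AFTER_DAYS - uptime_days()
    if remaining <= WARN_MARGIN_DAYS:
        print(f"WARNING: about {remaining:.0f} days until the {HANG_AFTER_DAYS}-day hang bug")
        return 1   # WARNING in Nagios plugin terms
    print(f"OK: about {remaining:.0f} days of headroom left")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```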
"Everybody" knows that no good Web 3.0 company has an ops team. We're too expensive and too easily replaced with a handful of shell scripts or can just be fobbed off as another responsibility of the devs who wrote the code. (After all, if the code was perfect, there wouldn't need to be an ops team, right?) Silly rabbit, ops teams are for lumbering companies who "Just Don't Get It."
Signed,
Slightly miffed ops person who thinks he does a lot for his company but feels woefully underappreciated in the new IT.
Developer who wants to develop rather than being a part-time, under-trained, unsupported sysadmin and who finds too many stories about modern devops environments funny for all the wrong reasons
Meh...you want underappreciated? Try being in infosec. Those new-fangled companies don't think there's a thing wrong with "We'll worry about security after we get acquired...", regardless of how sensitive the data they're holding is.
And sadly that will continue to be the case until data protection regulators have real teeth, at which point due diligence before any acquisition should obviously include a thorough audit of these areas. A potential acquisition target that hasn't looked after its data properly and is a large regulatory action waiting to happen should then, rightly, be unlikely to exit successfully until they get their house in order.
Gosh, I'm apparently old fashioned here. I would think that "I'm a startup that handles the sensitive information of our users" would immediately segue to "we should take prudent efforts to secure that data", not "fuck security till I'm mandated to do it by regulators, cause fuck those users until there's an exit". Get off my lawn and all that.
I would think that "I'm a startup that handles the sensitive information of our users" would immediately segue to "we should take prudent efforts to secure that data"
I would hope for that, too. That's certainly how my businesses operate.
Sadly, reading sites frequented by the start-up community, including HN, taught me long ago that plenty of entrepreneurial types will feel absolutely no guilt about skipping things like security and privacy safeguards if it gets them significantly more/quicker money. They just hope that they will be able to handle any PR fall-out if it ever becomes necessary, and it's one more risk to manage, nothing more.
If something really bad happens, their back-up plan is simply to fold the business and start a new one. They'll write off the loss without much regard to the customers who had supported them or any damage that might be caused to those customers by the leakage of that sensitive information. In short, your second characterisation is all too realistic.
I think this is almost inevitable as long as the start-up culture is focussed around either having an outside shot at being the next Google/Facebook/Apple, having a realistic chance of being acquired by the current Google/Facebook/Apple within a fairly short period, or throwing it all away and starting again. By its nature, this business attracts gamblers. Lacking any meaningful penalty for not taking proper precautions, not just for the start-up but also for the founders/leadership of the start-up and their investors, the odds are more in favour of those who cut corners. Looking out for your customers can even be a direct competitive disadvantage.
To change the culture, you need to change the attitude of either the founders or the funders. The former would take something like piercing the corporate shield and making the officers of a company personally responsible for negligent data leaks, probably not just in monetary terms but also something they can't shake like barring them from being officers of any other company for some significant period afterwards (thus killing the dump-it-and-start-over strategy). The latter just needs a direct financial penalty severe enough to make cutting the corners at the risk of user data not a good bet, which in practice is probably much easier to achieve, and without the negative side effect of making honest but nervous founders more reluctant to take a risk on starting a business.
Mutual admiration society going on here. Being in Ops, I love our InfoSec guys. Trusting in InfoSec means never having to say "my team leaked what?! to the Internet?"
I've worked at places where we managed lots of systems, and we weren't quite organized enough to jot down when known issues would crop up far in the future and remember them. SSL certs specifically bit us once or twice.
Stick an SSL expiry warning into your alerting system. We have one in our nagios system - checking once a day, it gives a 45-day warning for an impending expiry.
As long as your alerting system allows custom alerts, you'll always be able to run such a check.
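If a built-in cert check isn't available, a bare-bones custom one (the host name below is just a placeholder) only needs the standard library:

```python
import socket, ssl, sys, time

WARN_DAYS = 45  # matches the 45-day warning mentioned above

def days_until_expiry(host, port=443):
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # cert_time_to_seconds parses the certificate's notAfter timestamp string.
    expiry = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expiry - time.time()) / 86400

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "example.com"  # placeholder host
    remaining = days_until_expiry(host)
    status = "WARNING" if remaining < WARN_DAYS else "OK"
    print(f"{status}: certificate for {host} expires in {remaining:.0f} days")
```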
Not guaranteed to be a valid assumption. (In fact, anecdotal evidence suggests it's the exception rather than the rule, at least for businesses below a certain size.)
That's a very big red flag, then. If you run even a single service, you need something that monitors that service, and that something needs to be on a different bit of hardware and be able to reach you even when it can no longer reach its own network uplink.
If you run even a single service, you need something that monitors that service, and that something needs to be on a different bit of hardware and be able to reach you even when it can no longer reach its own network uplink.
I think that's too much of a generalisation. If you're talking about an established public service that you're charging real money for, where something that actually matters will be affected by even minor downtime, sure. But if you're talking about a small team or individual running a new service that does something simple to help someone do something else, you probably have many higher priorities than that level of monitoring and alerting, but you might still get messed around by something like all your certs expiring overnight.
The Heartbleed reference doesn't make much sense. Allowing active certs to expire has little to do with Heartbleed. Either people have a process in place to replace expiring certificates or they don't.
It may have produced a global phenomenon where people observe multiple expired certificates around that date; however, there is absolutely no difference to an end user of a particular site whether the cert expired in April or October. They are both equally bad.
In response to heartbleed, a system admin would have gained no advantage by waiting to react and would have been exposing his/her users to MiTM attacks by waiting longer.
It makes sense to me - how many of the top stories here do you suppose are Heartbleed cert replacement issues (and look at the dates, and see who looks like they updated their certs days or weeks before the rest of us found out about it):
I suspect the real issue is that in "emergency situations" people get the important shit done (and replace certs immediately), but don't always do the non-emergency process type stuff that'd normally get done when doing important shit by the schedule (updating reminders about cert expiry dates).
My guess is Instagram, senate.gov, gmail, and Docker (amongst others) are going to have ops people wandering around over the next few months saying "Hey, we just got the SSL cert update reminder, but someone already renewed the cert months ago and didn't update 'the system(tm)'. What gives, people?"
If they replaced all certs in their organisation because they could have been compromised, then all the replacements are likely to expire at the same time. The article isn't saying that you should wait longer, but that if you do nothing then one day all of the replacements will expire and cause a massive workload to update.
A fix would be to renew those new certificates proactively, spreading out the workload and the new expiry times, but also to mark the calendar for mass-expiry day, because there are bound to be some that were forgotten.
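As a toy illustration of spreading it out (cert names and dates made up), you could schedule proactive renewals evenly across the run-up to the mass-expiry date:

```python
import datetime

certs = ["www", "mail", "api", "vpn", "billing"]   # made-up cert names
mass_expiry = datetime.date(2015, 4, 8)            # hypothetical mass-expiry day
window_days = 90                                   # spread renewals over ~3 months

step = window_days // len(certs)
for i, name in enumerate(certs):
    renew_on = mass_expiry - datetime.timedelta(days=window_days - i * step)
    print(f"{name}: renew proactively on {renew_on}")
```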
> We'll also assume they have a bug which makes the machine lock up after 300 days of uptime. Nobody knows about this yet, but the bug exists.
> So here's the trick: any time you see an announcement on date X of something bad that happens after item Y has been up for more than Z days, calculate what X + Z is and make a note in your calendar. That's the first possible date you should see a cluster of events beginning.
What? Doesn't this assume the announcement date X was when the bug was introduced into the software (in the example the Linux OS) and installed on your servers?
No, she's assuming that on date X a whole bunch of people are going to kneejerk react and reboot things (like Boeing 787s) without enough thought or putting any process in place to actually manage or mitigate the problem, and then on date X+Z all those things are going to crash at the same time (possibly into many smoking holes in the ground).
> So here's the trick: any time you see an announcement on date X of something bad that happens after item Y has been up for more than Z days, calculate what X + Z is and make a note in your calendar.
I thought she was making this suggestion to the 'kneejerk' admins, not us clever admins watching for failures externally.
Let me try restating one sentence. "That's the first possible date you should see a news story or otherwise notice a cluster of events around the world as sysadmins everywhere inadvertently synchronized their systems by rebooting on day X"
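Spelling out the X + Z arithmetic with the article's own 300-day example (the announcement date here is made up for illustration):

```python
import datetime

announcement = datetime.date(2015, 5, 1)   # date X: hypothetical announcement date
days_until_failure = 300                   # Z: the article's 300-day hang example

cluster_start = announcement + datetime.timedelta(days=days_until_failure)
print(f"Expect the first cluster of failures around {cluster_start}")  # 2016-02-25
```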
Interesting. Where I work, we have a scheduled reboot every night for some development Tomcat servers. At the same time, we apply automatic database schema updates and automatically update to the latest version from our local repository. Also, users don't touch any of these servers, so there shouldn't be any problem.
But I gather there's a shared view here that rebooting a Tomcat server after X amount of time is a good idea, since over time they begin to do funny things; we don't do that.