I work at a telco with a lot of aging systems and servers.
A couple of years ago we had a total power failure at our major data center and central office, including the backup generator, so literally everything failed and an entire province of Canada (and most of two more) was without any telco, data, or wireless service.
It was eye opening and extremely scary to see what happens when everything is turned off at once and you need to bring stuff back up. Some of the equipment had never been turned off, and certainly not all at once.
Over the following weeks we encountered many, many catch-22 scenarios like "Before we turn on System A, it needs System B to be online. B needs C, and C needs A." Oh.
To give a sense of how severe this was: paramedics were stationed at major intersections in towns and communities in case people needed help, and radio stations in half of Canada were telling people what to do. When even 911 doesn't work, you have problems.
This is how "if it ain't broke, don't fix it" calcifies. People point to old systems like this (when they are working, that is) all the time as proof that developers are too obsessed with shiny new things. And I'm not saying that isn't true, but there is such a thing as being too conservative.
I suspect that the Federal Regulations also require the planes not to fail at 248 days (or words to that effect). Specifications are not the thing specified.
The problem isn't necessarily that specific issue, because now that it's known, practical measures can be taken to compensate for the risk.
The problem is that the existence of that one issue demonstrates a serious failure in the tools and development processes and quality checks used to create the control software. There could therefore be other issues in the code that haven't been found yet that could cause similarly catastrophic failures.
Failing to reliably identify this kind of arithmetic error is a huge deal when you're working with high-reliability software. It simply shouldn't happen, and there should have been several different stages in the development process that would have prevented it.
So, the development team should have the book thrown at them until someone figures out exactly how it did happen, closes every related tool and process loophole so it can never happen again, and audits all potentially affected code to make sure the same combination of loopholes hasn't allowed any similar mistakes into production elsewhere in the system.
It is an issue because it suggests that both the code and the requirements were not adequately reviewed. Unless the requirements explicitly stated that the generators would never run for more than 2^31 hundredths of a second, allowing that counter to overflow without coping with it properly is a failure.
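For anyone who wants to sanity-check the arithmetic, here's the back-of-the-envelope version, assuming a signed 32-bit counter that ticks every hundredth of a second as described:

```python
# Assumed: a signed 32-bit counter incremented every 1/100th of a second.
ticks = 2**31                     # first count that no longer fits in a signed 32-bit int
seconds = ticks / 100             # one tick per hundredth of a second
days = seconds / (60 * 60 * 24)
print(f"Overflow after roughly {days:.1f} days")  # ~248.6 days
```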
I have some production code that uses ints as serial numbers, with a factory method that just hands out the next one in sequence; however, I take care to ensure that there will be no negative consequences if they roll over (they exist only in debug builds).
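Just to illustrate the kind of thing I mean (this is a hypothetical sketch, not the actual production code), the factory can handle the wrap-around explicitly:

```python
import itertools

class SerialFactory:
    """Hands out serial numbers in sequence, wrapping explicitly instead of
    letting a fixed-width integer overflow into negative values."""

    def __init__(self, modulus=2**31):
        self._counter = itertools.count()
        self._modulus = modulus

    def next_serial(self):
        # Wrap back to 0 at the modulus, so callers never see a "negative"
        # serial; serials are identifiers only and must not be compared for order.
        return next(self._counter) % self._modulus
```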
Instead of rebooting all the affected systems, create a monitoring check that alerts the ops team a few days or weeks before the error condition (system will hang/reboot, SSL cert expires) comes to pass.
Then it's at their discretion what to do (for example, decide when to reboot).
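A minimal sketch of such a check (the 300-day figure is just the article's example, and the exit codes follow the usual Nagios plugin convention; adjust to whatever your alerting system expects):

```python
#!/usr/bin/env python3
"""Warn the ops team well before a known 'hangs after N days of uptime' bug fires."""

HANG_AFTER_DAYS = 300    # the known failure point (example figure from the article)
WARN_MARGIN_DAYS = 14    # how much advance notice the ops team wants

def uptime_days():
    # Linux-specific: the first field of /proc/uptime is the uptime in seconds.
    with open("/proc/uptime") as f:
        return float(f.read().split()[0]) / 86400

def main():
    remaining = HANG_AFTER_DAYS - uptime_days()
    if remaining <= WARN_MARGIN_DAYS:
        print(f"WARNING: about {remaining:.0f} days until the {HANG_AFTER_DAYS}-day hang bug")
        return 1   # WARNING in Nagios plugin terms
    print(f"OK: about {remaining:.0f} days of headroom left")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```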
"Everybody" knows that no good Web 3.0 company has an ops team. We're too expensive and too easily replaced with a handful of shell scripts or can just be fobbed off as another responsibility of the devs who wrote the code. (After all, if the code was perfect, there wouldn't need to be an ops team, right?) Silly rabbit, ops teams are for lumbering companies who "Just Don't Get It."
Signed,
Slightly miffed ops person who thinks he does a lot for his company but feels woefully underappreciated in the new IT.
Developer who wants to develop rather than being a part-time, under-trained, unsupported sysadmin and who finds too many stories about modern devops environments funny for all the wrong reasons
Meh...you want underappreciated? Try being in infosec. Those new-fangled companies don't think there's a thing wrong with "We'll worry about security after we get acquired...", regardless of how sensitive the data they're holding is.
And sadly that will continue to be the case until data protection regulators have real teeth, at which point due diligence before any acquisition should obviously include a thorough audit of these areas. A potential acquisition target that hasn't looked after its data properly and is a large regulatory action waiting to happen should then, rightly, be unlikely to exit successfully until they get their house in order.
Gosh, I'm apparently old fashioned here. I would think that "I'm a startup that handles the sensitive information of our users" would immediately segue to "we should take prudent efforts to secure that data", not "fuck security till I'm mandated to do it by regulators, cause fuck those users until there's an exit". Get off my lawn and all that.
I would think that "I'm a startup that handles the sensitive information of our users" would immediately segue to "we should take prudent efforts to secure that data"
I would hope for that, too. That's certainly how my businesses operate.
Sadly, reading sites frequented by the start-up community, including HN, taught me long ago that plenty of entrepreneurial types will feel absolutely no guilt about skipping things like security and privacy safeguards if it gets them significantly more/quicker money. They just hope that they will be able to handle any PR fall-out if it ever becomes necessary, and it's one more risk to manage, nothing more.
If something really bad happens, their back-up plan is simply to fold the business and start a new one. They'll write off the loss without much regard to the customers who had supported them or any damage that might be caused to those customers by the leakage of that sensitive information. In short, your second characterisation is all too realistic.
I think this is almost inevitable as long as the start-up culture is focussed around either having an outside shot at being the next Google/Facebook/Apple, having a realistic chance of being acquired by the current Google/Facebook/Apple within a fairly short period, or throwing it all away and starting again. By its nature, this business attracts gamblers. Lacking any meaningful penalty for not taking proper precautions, not just for the start-up but also for the founders/leadership of the start-up and their investors, the odds are more in favour of those who cut corners. Looking out for your customers can even be a direct competitive disadvantage.
To change the culture, you need to change the attitude of either the founders or the funders. The former would take something like piercing the corporate shield and making the officers of a company personally responsible for negligent data leaks, probably not just in monetary terms but also something they can't shake like barring them from being officers of any other company for some significant period afterwards (thus killing the dump-it-and-start-over strategy). The latter just needs a direct financial penalty severe enough to make cutting the corners at the risk of user data not a good bet, which in practice is probably much easier to achieve, and without the negative side effect of making honest but nervous founders more reluctant to take a risk on starting a business.
Mutual admiration society going on here. Being in Ops, I love our InfoSec guys. Trusting in InfoSec means never having to say "my team leaked what?! to the Internet?"
I've worked at places where we managed lots of systems, and we weren't quite organized enough to jot down when known issues would crop up far in the future and remember them. SSL certs specifically bit us once or twice.
Stick an SSL expiry warning into your alerting system. We have one in our nagios system - checking once a day, it gives a 45-day warning for an impending expiry.
As long as your alerting system allows custom alerts, you'll always be able to run such a check.
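If a built-in cert check isn't available, a bare-bones custom one (the host name below is just a placeholder) only needs the standard library:

```python
import socket, ssl, sys, time

WARN_DAYS = 45  # matches the 45-day warning mentioned above

def days_until_expiry(host, port=443):
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # cert_time_to_seconds parses the certificate's notAfter timestamp string.
    expiry = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expiry - time.time()) / 86400

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "example.com"  # placeholder host
    remaining = days_until_expiry(host)
    status = "WARNING" if remaining < WARN_DAYS else "OK"
    print(f"{status}: certificate for {host} expires in {remaining:.0f} days")
```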
Not guaranteed to be a valid assumption. (In fact, anecdotal evidence suggests it's the exception rather than the rule, at least for businesses below a certain size.)
That's a very big red flag, then. If you run even a single service, you need something that monitors that service, and that something needs to be on a different bit of hardware and be able to reach you even when it can no longer reach its own network uplink.
If you run even a single service, you need something that monitors that service, and that something needs to be on a different bit of hardware and be able to reach you even when it can no longer reach its own network uplink.
I think that's too much of a generalisation. If you're talking about an established public service that you're charging real money for, where something that actually matters will be affected by even minor downtime, sure. But if you're talking about a small team or individual running a new service that does something simple to help someone do something else, you probably have many higher priorities than that level of monitoring and alerting, but you might still get messed around by something like all your certs expiring overnight.
The Heartbleed reference doesn't make much sense. Allowing active certs to expire has little to do with Heartbleed. Either people have a process in place to replace expiring certificates or they don't.
It may have produced a global phenomenon where people observe multiple expired certificates around that date; however, there is absolutely no difference to an end user of a particular site whether the cert expired in April or October. They are both equally bad.
In response to heartbleed, a system admin would have gained no advantage by waiting to react and would have been exposing his/her users to MiTM attacks by waiting longer.
It makes sense to me - how many of the top stories here do you suppose are Heartbleed cert replacement issues (and look at the dates, and see who looks like they updated their certs days or weeks before the rest of us found out about it):
I suspect the real issue is that in "emergency situations" people get the important shit done (and replace certs immediately), but don't always do the non-emergency process type stuff that'd normally get done when doing important shit by the schedule (updating reminders about cert expiry dates).
My guess is Instagram, senate.gov, gmail, and Docker (amongst others) are going to have ops people wandering around over the next few months saying "Hey, we just got the SSL cert update reminder, but someone already renewed the cert months ago and didn't update 'the system(tm)'. What gives, people?"
If they replaced all certs in their organisation because they could have been compromised, then all the replacements are likely to expire at the same time. The article isn't saying that you should wait longer, but that if you do nothing then one day all of the replacements will expire and cause a massive workload to update.
A fix would be to renew those new certificates proactively, spreading out the workload and the new expiry times, but also to mark the calendar for mass-expiry day, because there are bound to be some that were forgotten.
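As a toy illustration of spreading it out (cert names and dates made up), you could schedule proactive renewals evenly across the run-up to the mass-expiry date:

```python
import datetime

certs = ["www", "mail", "api", "vpn", "billing"]   # made-up cert names
mass_expiry = datetime.date(2015, 4, 8)            # hypothetical mass-expiry day
window_days = 90                                   # spread renewals over ~3 months

step = window_days // len(certs)
for i, name in enumerate(certs):
    renew_on = mass_expiry - datetime.timedelta(days=window_days - i * step)
    print(f"{name}: renew proactively on {renew_on}")
```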
> We'll also assume they have a bug which makes the machine lock up after 300 days of uptime. Nobody knows about this yet, but the bug exists.
> So here's the trick: any time you see an announcement on date X of something bad that happens after item Y has been up for more than Z days, calculate what X + Z is and make a note in your calendar. That's the first possible date you should see a cluster of events beginning.
What? Doesn't this assume the announcement date X was when the bug was introduced into the software (in the example the Linux OS) and installed on your servers?
No, she's assuming that on date X a whole bunch of people are going to kneejerk react and reboot things (like Boeing 787s) without enough thought or putting any process in place to actually manage or mitigate the problem, and then on date X+Z all those things are going to crash at the same time (possibly into many smoking holes in the ground).
> So here's the trick: any time you see an announcement on date X of something bad that happens after item Y has been up for more than Z days, calculate what X + Z is and make a note in your calendar.
I thought she was making this suggestion to the 'kneejerk' admins, not us clever admins watching for failures externally.
Let me try restating one sentence. "That's the first possible date you should see a news story or otherwise notice a cluster of events around the world as sysadmins everywhere inadvertently synchronized their systems by rebooting on day X"
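Spelling out the X + Z arithmetic with the article's own 300-day example (the announcement date here is made up for illustration):

```python
import datetime

announcement = datetime.date(2015, 5, 1)   # date X: hypothetical announcement date
days_until_failure = 300                   # Z: the article's 300-day hang example

cluster_start = announcement + datetime.timedelta(days=days_until_failure)
print(f"Expect the first cluster of failures around {cluster_start}")  # 2016-02-25
```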
Interesting. Where I work, we have a scheduled reboot every night for some development Tomcat servers. At the same time, we apply automatic database schema updates and automatically update to the latest version from our local repository. Also, users don't touch any of these servers, so there shouldn't be any problem.
But I gather there's a shared view here that rebooting a Tomcat server after X amount of time is a good idea, since over time they begin to do funny things; we don't do that.