Hacker News | laCour's comments

I'm with Hund. It was only monitoring the cached, unauthenticated news page, which was up throughout the downtime. However, it now monitors the authenticated news landing page as well.

This was monitoring the unauthenticated news page, which is why it didn't catch it. It now monitors authentication as well. It is not official, and was made by a co-founder years ago.

Thanks! I checked that page and wondered why it stayed green. I resorted to checking https://downforeveryoneorjustme.com/hacker-news

I'm with Hund. Our hn.hund.io page did not catch this because it was requesting the cached, unauthenticated page. It now monitors authentication as well.

Thank you. I was thinking I or my corporate IP had been shadowbanned.

Is this a mistake by Hund, or in HN's configuration of Hund?

Mistake on our part (Hund) for not monitoring authentication. This page is unofficial and was made by a co-founder several years ago.

You should add a graph of visitors per minute to the status page for the past 24 hours or so. It would really help in situations like this.

[flagged]


Yikes, you can't do this here.

We've banned this account for repeatedly breaking the site guidelines and ignoring our requests to stop.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.


What a rude thing to post. Hund, don't listen to this entitled nonsense. There is a reason it's called human error. Companies 100x your size and 10,000x your revenue, like AWS, Microsoft, CloudFlare, and CrowdStrike, can't figure out how to provide accurate status dashboards. At least you took the time to explain your mistakes. If anything, you've gained another supporter for your honesty.

This just in: people make mistakes on occasion.

"[Four days prior to the incident] Two nodes became stalled for yet-to-be-determined reasons."

How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.


(Stripe infra lead here)

This was a focus in our after-action review. The nodes responded as healthy to active checks while silently dropping updates on their replication lag; together, this created the impression of a healthy node. The missing piece was verifying the absence of lag updates (which we have now).
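
To make that concrete, here's a rough sketch, illustrative only and not our actual tooling, of what alerting on the absence of lag updates can look like (the names and threshold are made up):

    import time

    # Illustrative only: a "dead man's switch" on replication-lag metrics.
    # A node that passes active health checks but has stopped emitting lag
    # updates gets flagged. The threshold is a made-up value.
    STALE_AFTER_SECONDS = 300

    def silent_nodes(last_lag_report, now=None):
        """Return nodes whose most recent replication-lag report is too old."""
        now = now if now is not None else time.time()
        return [
            node
            for node, reported_at in last_lag_report.items()
            if now - reported_at > STALE_AFTER_SECONDS
        ]

    # Example: node-b answers health checks but hasn't reported lag in 20 minutes.
    last_seen = {"node-a": time.time() - 30, "node-b": time.time() - 1200}
    for node in silent_nodes(last_seen):
        print(f"ALERT: {node} has stopped reporting replication lag")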


You might want to clarify this in the post. To me it reads like you knowingly had degraded infra for days leading up to an incident that might have been preventable had you recovered those instances.


Thanks for the suggestion, we’re adding a clarifying note to the report’s timeline.


I am a curious and very amateur person, but do you think that if "100%" uptime were your goal, this:

"[Three months prior to the incident] We upgraded our databases to a new minor version that introduced a subtle, undetected fault in the database’s failover system."

could have been prevented if you had stopped upgrading minor versions, i.e. froze on one specific version and not even applied security fixes, instead relying on containing it as a "known" vulnerable database?

The reason I ask is that I've heard of ATMs still running Windows XP or the like. But if it's not networked, could it actually have better uptime than anything you can do on Windows 7 or 10?

What I mean is, even though it is hilariously out of date to be using Windows XP, by any measure it's had a billion device-days to expose its failure modes.

When you upgrade to the latest minor version of a database, don't you sacrifice the known bad for an unknown good?

Excuse my ignorance on this subject.


> could have been prevented if you had stopped upgrading minor versions, i.e. froze on one specific version and not even applied security fixes, instead relying on containing it as a "known" vulnerable database?

This is a valid question.

As a database and security expert, I carefully weigh database changes. However, developers and security zealots typically charge ahead "because compliance."

Email me if you need help with that.


You could use that same logic to argue that they should never write any new code, just live forever on the existing code.

But customers want new features, so Stripe makes changes.


How do you have an ATM that's not networked?


Same user (sorry, I guess I didn't enter my password carefully, as I can't log in).

Well I mean they're not exactly on the Internet with an IP address and no firewall, are they? (Or they would have been compromised already.)

Whatever it is, it must be separated off as an "insecure enclave".

So that's why I'm wondering about this technique. You don't just miss out on security updates, you miss performance and architecture improvements, too, if you stop upgrading.

But can that be the path toward 100% uptime? Known bad and out of date configurations, carefully maintained in a brittle known state?


Secure... enclave? I'm sorry, but I think you're throwing buzzwords around hoping to hit a home run here.


No, it's a fair question. The word "enclave" has a general meaning in English as a state surrounded entirely by another, or metaphorically a zone with some degree of isolation from its surroundings.

So the legit question is, can insecure systems (e.g. ancient mainframes) be wrapped by a security layer (WAF, etc.) to get better uptime than patching an exposed system?


Yes, thank you.


If you can think of every possible failure and create monitoring and reporting for it before it happens, then you're the best dev on the planet.


And also have the greatest bosses in the history of the earth giving you unlimited time to do this.


And then filtering out a lot of the crap and false alarms that the tools and supporting infrastructure throw.

I've kinda lost count of how many times Nagios barfed and reported an error while the application was fine.


In this environment:

Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes.

having a few nodes down is perfectly acceptable. I guess they would have had an alert if the number of down nodes exceeded some threshold.
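
As a toy illustration (purely hypothetical, none of these names come from Stripe), that kind of alert could be as simple as counting down nodes per shard:

    from collections import Counter

    # Hypothetical sketch: flag any shard where the number of down nodes
    # exceeds what the redundancy is expected to tolerate.
    MAX_DOWN_PER_SHARD = 1

    def shards_needing_attention(node_status):
        """node_status: iterable of (shard_id, is_up) pairs."""
        down = Counter(shard for shard, is_up in node_status if not is_up)
        return {shard: n for shard, n in down.items() if n > MAX_DOWN_PER_SHARD}

    status = [("shard-7", True), ("shard-7", False), ("shard-7", False), ("shard-9", False)]
    print(shards_needing_attention(status))  # {'shard-7': 2}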


In this case that doesn't sound like it was the issue; it was the lack of promotion of a new master due to the bug in the shard-promotion system.


In many HA setups, you're not supposed to have to care if any single thing goes down, because it should auto-recover.

The article said that the node stalled in a way that was unforeseen, which may have caused standard recovery mechanisms to silently fail.


Right, but they didn't recover speedily. Having the cluster in such a state for so long sounds like poor monitoring to me, because that state can, as we now know, interfere with an election later.


The health check said it was ok. How would they know it needed to be recovered?

The fault was the bad health check. Not the process.


They only just clarified that monitoring was in place and that the nodes were reporting as healthy. See the comments above.


If I'm not mistaken, Rdio's engineering team helped build Pandora's on-demand service. I switched to it from Spotify, having used Rdio in the past. Its recommendations feel similar to Rdio's, but the interface is a bit lacking. There are no general recommendations; instead, you must build a playlist, and then you can add a set of recommended songs to it. Of course, there are also radio stations.



Lightsail instances are just T2 instances, and they are not exempt from CPU credits.

From https://aws.amazon.com/lightsail/faq/

> Lightsail uses burstable performance instances that provide a baseline level of CPU performance with the additional ability to burst above the baseline.
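
Since Lightsail doesn't expose the underlying instances, here's a rough sketch of watching the same credit mechanics on a plain EC2 T2 instance via boto3 and CloudWatch's CPUCreditBalance metric (the instance ID is a placeholder):

    from datetime import datetime, timedelta
    import boto3

    # Sketch: read a T2 instance's CPU credit balance from CloudWatch over the
    # last 24 hours. Watching this drain toward zero is how you spot throttling.
    cloudwatch = boto3.client("cloudwatch")
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=datetime.utcnow() - timedelta(hours=24),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"])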


i5-4690K (L1 4x32KB; L2 4x256KB; L3 6MB): https://i.imgur.com/FUs1isW.png

LG v30 (L2 2MB): https://i.imgur.com/q5R3bLY.png


13:14 UTC through 15:54 UTC here.


The biggest benefit over StatusPage.io is that our platform's design focuses on automation through integrations with third parties. StatusPage.io pages can only be automated through their direct integration with PagerDuty, or by parsing email notifications (from services like New Relic or Pingdom). We have the ability to add countless integrations with third parties for both monitoring and notifications.

Branding is important for companies, so we've chosen to offer complete white-labeling on our single plan. We leverage Let's Encrypt so that status pages can be instantly configured to use a secure custom domain.

These are just a few of the key differences; our features section outlines these and more.


FYI, my first question was also "how is this different than StatusPage.io?". It might be a good idea to have a comparison page or make the differentiation very clear in some other way. Your target audience has most likely heard of StatusPage.io (they are the incumbent in my mind) and will likely ask themselves the same question as soon as they land on your site.

If someone is using StatusPage.io, why should they switch to your service? Or, are you only targeting customers who do not already have a status page solution in place?

Best of luck!

