Facebook Workplace co-founder launches downtime fire alarm Kintaba

jedberg · on Feb 10, 2020

At Netflix we built internal tools and systems that basically did all this stuff, so it looks interesting out of the box.

One of the nicest things was that we used a single deployment tool for almost all deployments, and it could insert the deployments into the timeline so we could see every deploy both during and before the incident.

But the biggest issue was voice. We always had a conference call running for incidents because some people would be in a place where they couldn't open chat (driving to the office or similar) and it was always a pain to integrate the voice data.

We got to a point where a recording of the incident call and a transcript could be added to the incident, but during the call the Call Leader usually had to quickly type the voice highlights into the chat.

I'd love to see a real time voice transcription as a feature. But it would have to be pretty good to not just get in the way.

quartz · on Feb 10, 2020

Conference line + transcription support is an awesome idea and something we can likely add without too much trouble!

Netflix is definitely on the good end of companies with custom tooling in this space. Would love to chat more about how you do things if you don't mind me reaching out personally?

jedberg · on Feb 10, 2020

Email is in my profile. Would love to chat with you!

startledmarmot · on Feb 10, 2020

I'd also love to help out on this! We power exactly this kind of voice + transcription + voice application platform stuff at SignalWire and it'd be rad if there was a way we could help get this off the ground.

thedance · on Feb 10, 2020

Why is it better to have the deployment actions added by the deployment tool, rather than having them be implied by counting the population of different build artifacts? It seems like the latter would be more robust against changes in the deployment process, or a proliferation of new deployment processes, or rogue deployments using non-canonical means.
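For illustration, the artifact-counting approach could be sketched roughly as below. This is a minimal example, not how any particular deployment tool works: the `infer_rollouts` function and the snapshot format (one build-artifact ID per running instance) are assumptions made up for the sketch.

```python
from collections import Counter


def infer_rollouts(before, after):
    """Compare two fleet snapshots and report per-artifact instance-count changes.

    Each snapshot is a list with one build-artifact ID per running instance
    (a hypothetical format chosen for this sketch).
    """
    old, new = Counter(before), Counter(after)
    changes = {}
    for artifact in sorted(set(old) | set(new)):
        delta = new[artifact] - old[artifact]  # Counter returns 0 for missing keys
        if delta != 0:
            changes[artifact] = delta
    return changes


# Hypothetical snapshots taken five minutes apart.
snapshot_10_00 = ["api-1.4.2"] * 6
snapshot_10_05 = ["api-1.4.2"] * 2 + ["api-1.5.0"] * 4

print(infer_rollouts(snapshot_10_00, snapshot_10_05))
```

A diff like this catches any change in what is running, however it was deployed, but it only shows what changed, not who triggered it or why.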

jedberg · on Feb 10, 2020

We had both. But knowing that someone tried a deployment, and who it was, was critical because we could ask them why and have them explain what they were deploying.

giancarlostoro · on Feb 10, 2020

Interesting, are these systems made by the same team, what kind of technical stack do they use? I'm always curious as to what some use for internal tooling. I've been using Django / CherryPy for internal and external solutions at work for a few years. Python helps me to stay productive.

jedberg · on Feb 10, 2020

I was on the reliability team, and we made some of them. One of our sister teams (same Director) was the monitoring team, and they made some features in the monitoring tools that enabled real time export. And then the team that made the deployment tool was also a sister team that made exports easy.

So it was a few teams all reporting to the same Director. We even had the customer service tools reporting to the same director so we could get some real time info from them (and to them).

Each team built in what they were most comfortable with. Most of the tools were built in Java, but my team built in Python (mainly because I write in Python and was biased towards hiring people who also like to write in Python).

giancarlostoro · on Feb 10, 2020

Heh, we're a Java and Python shop to a point, we have done a bit of everything here, but my first project here was to rewrite a legacy Java JSP application and I was given some freedom as to what to rewrite it in, I chose Python and CherryPy. No regrets since.

Sidenote: The other reason I asked is because apparently both Hulu and Netflix are listed on the CherryPy docs as orgs using it.

jedberg · on Feb 10, 2020

We used CherryPy for a lot of our Python tools at Netflix.

I was going to link to some open source with CherryPy, but it looks like they've all been rewritten in Go or Gradle. :(

giancarlostoro · on Feb 10, 2020

Oh interesting, Go is a good choice too I think, I've been fairly impressed with what it has available out of the box.

gonehome · on Feb 10, 2020

I think Google does real time voice transcription for all VTCs that gets indexed and is searchable later.

auspex · on Feb 11, 2020

Try chorus.ai

ryanjodonnell · on Feb 10, 2020

What does it mean to be co-founder of Facebook Workplace? Aren't you just an employee of Facebook?

rdslw · on Feb 10, 2020

Yup. Ego massage. Here clickbait.

Paul Graham wrote: never do things for money or ego, but TechCrunch clickbaiters writing about Facebook employees, sorry, founders, supposedly don't agree.

P.S. this whole title reads like from git manpage generator.

duxup · on Feb 10, 2020

I suppose it implies a leadership role where that department was somewhat autonomous and operated a bit 'like' a founder of a startup. That's what I would assume.

AgloeDreams · on Feb 10, 2020

Exactly, Much like Jon Rubinstein being the inventor of the iPod or Panos Panay being the leader of the Surface Group.

zeroonetwothree · on Feb 11, 2020

It’s some good title inflation for sure

crispyporkbites · on Feb 10, 2020

No. Today I realised that I am also a co-founder, in fact you probably are too!

My investors are a bit tough, I get no equity and no seats on the board (there is no board). Vacation and pension’s good though.

duxup · on Feb 10, 2020

Man in that screenshot it LOOKS like facebook a bit... too much IMO.

Anyway it's an interesting idea. I worked in a support center for decades, and the demands for updates and managing P1-type situations, giving updates to the dozen interested parties (each of whom was competent in some ways, less so in others), managing misinformation, and the varying politics were a constant hassle for folks who were technical.

There were times where "oh man my phone just went down" (I unplugged it).

That's no small thing to try to deal with, with software.

quartz · on Feb 10, 2020

Kintaba cofounder here-- thanks for the feedback!

We really wanted the interface to feel comfortable for non-technical folks who needed to stay up to date on incidents so we focused on bringing as much of a human element as possible into the dashboard. Hopefully we can find the right balance of friendliness and informativeness, we certainly don't intend to become a social network (but we would love it if we were as easy to use as something like Slack or Discord)...

We experienced the same pain you're describing with keeping others updated and the politics (and subpolitics and subsubpolitics) that come out of major incidents over our careers. Luckily we also saw companies that did it right, usually with custom-built tooling. Our biggest discovery was that the more open you make the whole incident process, the better everyone understands the work being done, and the fewer of those insidious little subpolitical conversations happen where facts are skewed and people are blamed for process challenges.

It's definitely a big hill to climb.

duxup · on Feb 11, 2020

Yeah we used to do the political stuff via email updates at a place I was at.

It wasn't the worst way but also not a "system" for sure.

Half the battle with political stuff was really defining the problem in a way that everyone outside engineering understood, and keeping everyone updated / aware of what was factually (not rumors or panic) going on, what was known, not known, what was happening, and when they would get their next update.

And that's not even the technical stuff where folks here are asking about recording conference calls and etc ;)

Lammy · on Feb 10, 2020

Not as much as Phabricator does :) https://phacility.com/phabricator/

giancarlostoro · on Feb 10, 2020

> Man in that screenshot it LOOKS like facebook a bit... too much IMO.

Looks like StackOverflow to me but clean. I like the UI. I don't know if I will ever see that utility where I work since we wikify things like this post-chaos. The office I work in is a small enough team to where we have plenty of flexibility.

duxup · on Feb 10, 2020

To me Facebook and most social media is a sort of comment chaos where nobody really reads anything and just spouts some stuff. Thus my cognitive dissonance with that and this system.

Like when I think of dealing with a situation I see that and think of some sales guy going off in his comment about

OMG THIS IS JUST LIKE THAT OTHER TIME WHY ISN'T THIS FIXED!?!!?

Now the whole comment thread is off track because sales guy brought his own FUD to the game, and god help us all if someone brings it up on a conference call as fact.

Now obviously that is not the fault of the UI if you've got a sales guy who does that, but as things scale up organizationally that kinda thing ... happens ;)

Granted this system might be intended for engineering teams, but it's also something that inevitably leaks out when folks find out it is there.

giancarlostoro · on Feb 10, 2020

Heh interesting point there, I wonder if an HN like comment system would help (let users upvote highly relevant material), and maybe allowing supervisors to pin things. We use pinning a lot on Slack so people can quickly find important links or messages.

duxup · on Feb 10, 2020

That might be handy, also probably something like an "authoritative user" could respond and even minimize the view or visibility or something.

Even if they don't have an answer some way of making sure FUD doesn't take off too quickly.

I'd say a high % of managing any situation and the politics that follow is accurately defining / revising the problem and making sure everyone is on the same page with it. Dodging / handling the inaccurate information leaks is sometimes a big job.

Granted that's all office politics stuff and not just technical... still gotta solve the problem ;)

mherdeg · on Feb 10, 2020

Yeah feels like there's a lot of room for startups in this space (Blameless has a great demo, too).

No one wants to rewrite PagerDuty internally -- why are people all writing their bespoke incident management, response, review, and reporting toolchains internally too?

jedberg · on Feb 10, 2020

> why are people all writing their bespoke incident management, response, review, and reporting toolchains internally too?

Because the good ones are tightly integrated with the rest of the internal tooling. To use a 3rd party incident management tool usually means you have to run your operations the way they expect to get maximum value. A lot of times it's easier to build it yourself based on how you operate.

However, as the way people operate becomes more standardized, the third party tools will become more useful.

mherdeg · on Feb 10, 2020

Yeah, agree, this is probably the key sticking point: the 3rd party toolchain doesn't do [some key thing] the stakeholders in my process have found valuable. So I end up running some scenario analysis and find that the 3rd party toolchain doesn't shave off enough developer time & ongoing support cost from the rest of the operational work I'm going to have to do anyway to manage an incident. And I load the vendor's web page every couple of months and wistfully wish my process were uncomplicated enough to use their tools.

Here's hoping that the calculus changes either as these tools grow more robust or, like you say, as people begin to manage their software systems in less bespoke ways.

For example, during an incident we'd like to know "what changed between time [X] and [Y]" (deployments, configuration, experiments, other service outages) while we're trying to determine a root cause and fix the problem. And much later, after the incident, we'd like to auto-compute a metric like "what is the change success rate of [Y service / services owned by team Z]?".

These aren't complicated concepts -- it's not like we're trying to reason about causality with machine learning to reduce time-to-resolve for incidents! Still, our incident-management tool will really behave better if it's aware of our what-changed-at-time-X tooling and our incidents-caused-by-changes reporting. If this is an external tool, yikes, now our incident tool has sprung awareness of changes and reporting?!
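To make the two questions above concrete, here is a minimal sketch of "what changed between [X] and [Y]" and a per-service change success rate. Everything in it is an assumption for illustration: the `Change` record, the `caused_incident` flag (imagined as being set during incident review), and the sample data are hypothetical, not any real tool's schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Change:
    service: str
    kind: str                      # e.g. "deploy", "config", "experiment"
    at: datetime
    caused_incident: bool = False  # assumed to be filled in during incident review


def changes_between(changes, start, end):
    """Return the changes with start <= at <= end, oldest first."""
    return sorted((c for c in changes if start <= c.at <= end), key=lambda c: c.at)


def change_success_rate(changes, service):
    """Fraction of a service's changes not linked to an incident (None if it has no changes)."""
    mine = [c for c in changes if c.service == service]
    if not mine:
        return None
    ok = sum(1 for c in mine if not c.caused_incident)
    return ok / len(mine)


# Hypothetical change log for one morning.
log = [
    Change("checkout", "deploy", datetime(2020, 2, 10, 9, 0)),
    Change("checkout", "config", datetime(2020, 2, 10, 9, 40), caused_incident=True),
    Change("search", "deploy", datetime(2020, 2, 10, 9, 45)),
    Change("checkout", "deploy", datetime(2020, 2, 10, 11, 0)),
]

window = changes_between(log, datetime(2020, 2, 10, 9, 30), datetime(2020, 2, 10, 10, 0))
print([(c.service, c.kind) for c in window])
print(change_success_rate(log, "checkout"))
```

The logic is simple; the hard part is that the change log and the incident links have to come from the deployment, configuration, and reporting systems, which is where the integration cost sits.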

quartz · on Feb 10, 2020

For a lot of companies I think the internal processes and tools for major incidents have been sort of cobbled together over time. It's somewhat recent that organizations (especially smaller companies) are thinking about major incidents more diligently as an end-to-end process that's connected to but separate from more general ops-alerts.

One of the reasons we started Kintaba was that we're noticing a shift as more and more employees are leaving successful companies that have invested tons of time into their incident management process (most of the bigtech companies) and are now joining/starting their own companies where they want to spin up a mature process quickly.

chadlavi · on Feb 11, 2020

I know all TechCrunch articles are paid placements but this one felt the paid-placementest in a long time.

echan00 · on Feb 11, 2020

Not true at all. My previous startup was written about and we did not pay a dime.

mjayhn · on Feb 10, 2020

Being able to take a snapshot of the chat/repo pushes/CI+CD jobs/grafanas and everything else going on during an outage and separate it by an incident type/tag/hostname/etc for later reference (to write RFOs, to more quickly solve it if it occurs again, to write documentation on, etc) is something I've wanted for awhile so this seems really interesting.

I'm sure that stuff exists (I think Datadog sort of does this, etc) but I've yet to work anywhere that doesn't just create some #shtf-$date slack channel which eventually gets lost in the black hole due to cost prohibition or time required to get it going while a fire is going on.

quartz · on Feb 10, 2020

100%. Finding ways to identify the relevant data related to an incident is something that requires a lot of additional integrations that we're actively working on.

One thing Kintaba is really good at right now is wiring your slack channel directly into the activity log so it's properly attached to the incident and ultimately the postmortem. This helps avoid that channel getting lost over time, but there's still lots to do for sure in organizing all that data to make it more useful after the fact (one thing we currently support is #tagging for quick search within the log).

duxup · on Feb 10, 2020

Agreed, a repo snapshot, every log you can think of, all that good stuff that you do manually to figure out if this = that.

Few things are as annoying as going back in repo time and trying to figure out if you're in quite the same place / right place.

gooeyblob · on Feb 10, 2020

Looks interesting!

A couple notes:
- the verification email went to my spam folder on Gmail
- acknowledged is misspelled on this image https://kintaba.com/images/collab_splash.png

quartz · on Feb 10, 2020

Yikes-- fixed the spelling, thank you!

Thanks for the spamboxing report. Seems gmail isn't a fan of us today... working on it now.

ThePowerOfFuet · on Feb 10, 2020

Get your SPF, DKIM, and DMARC up to snuff. Dmarcian is a good tool for that. (No connection, just a happy customer.)
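As a rough illustration of what checking one of these looks like, here is a small sketch that parses a DMARC TXT record (semicolon-separated `tag=value` pairs) and tests whether its policy is stricter than `none`. The helper names and the `example.com` record are made up for the example, and it works on a record string you already have rather than doing a DNS lookup.

```python
def parse_dmarc(record):
    """Split a DMARC TXT record into a {tag: value} dict."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, _, value = part.partition("=")
        tags[key.strip().lower()] = value.strip()
    return tags


def is_enforcing(record):
    """True if the record is DMARC1 and its policy is quarantine or reject."""
    tags = parse_dmarc(record)
    return tags.get("v") == "DMARC1" and tags.get("p") in ("quarantine", "reject")


# Hypothetical record, as it might be published at _dmarc.example.com.
example = "v=DMARC1; p=none; rua=mailto:dmarc-reports@example.com"

print(parse_dmarc(example))
print(is_enforcing(example))
```

A `p=none` policy only asks receivers for reports; `quarantine` or `reject` is what tells them how to treat mail that fails the checks.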

jamisteven · on Feb 11, 2020

I can see someone like ServiceNow or Splunk wanting to acquire this. Decent idea for organizations that don't have existing frameworks around these types of things. Where I see it not working would be in places where the use of 3rd party apps is prevalent. We have many of these at my place of work (Finance, 70k employees). One of the top apps used by the business to make money is a total black box; when something goes wrong with it there isn't much we can do besides send off the core dumps to the vendor and wait for analysis, with no 2 scenarios ever having the same root cause.

wackget · on Feb 10, 2020

You should know that Techcrunch URLs are blocked for anyone using uMatrix, as per the following discussion: https://news.ycombinator.com/item?id=22228159

Just making you aware, as the site causes lots of problems and is not GDPR compliant. Top comment on the thread above: "Yahoo/Verizon is cancer and should die in a fire."

lostlogin · on Feb 11, 2020

Every thread has a “this website breaks x” section and usually I bypass it. Breaking the ability to go back (iOS) is so gross and many sites seem to do it. Why?

sbryant8 · on Feb 11, 2020

How does this differentiate from something like Opsgenie? They already have a built-in incident command center that can notify everyone in an org, Slack integration and postmortems. Would be interesting to see how all of the features stack up to one another.

motakuk · on Feb 11, 2020

There is one more concept in this field: embrace the collaborative nature of incidents and work on them in Slack directly (implemented in https://amixr.io)

lainga · on Feb 11, 2020

With former employees from Kinja and Altaba?

Jestar342 · on Feb 10, 2020

> Verizon Media works with select partners that do not participate in the Interactive Advertising Bureau's Transparency and Consent Framework, or Google framework. All our foundational partners require you to manage your choices directly through their privacy policies. Click on each partner below to access their privacy policy

Yeah. <closes-tab/>

lucb1e · on Feb 10, 2020

This site has some gdpr-violating we-want-it-all compulsory tracking wall it seems, but from the comments, do I understand correctly that a "downtime fire alarm" is a monitoring service that alerts sysadmins at home when a service is down?