Looks like the bug was in a monkey-patched `window.fetch` https://github.com/Pos...

ricardobeat · on Aug 20, 2024

I was poking around to understand how this was not caught in a test. Any ordinary fetch call could have triggered the error, and besides the poor coverage of all the ways `fetch` can be used, it seems excessive mocking may have played a part: https://github.com/PostHog/posthog-js/blob/main/src/__tests_...

The fetch and XHR functions are mocked in their entirety and become no-ops, so obviously this won't catch any issues when interacting with the underlying (native or otherwise) libraries. They have Cypress set up, so I don't see why you'd want to mock the browser APIs.
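
One way around the no-op trap, as a rough sketch: test the patch against a stand-in that validates its input the way a real implementation would, and drive it through several call shapes. Everything here (`instrumentFetch`, `strictFakeFetch`, the simplified types) is hypothetical and not taken from posthog-js.

```typescript
// Hypothetical sketch; names and simplified types are illustrative, not from posthog-js.
type UrlLike = { href: string };
type RequestLike = { url: string; method: string };
type FetchInput = string | UrlLike | RequestLike;
type FetchInit = { method?: string };
type FetchLike = (input: FetchInput, init?: FetchInit) => Promise<{ status: number }>;

// Normalises the call shapes a fetch wrapper has to cope with.
function describeCall(input: FetchInput, init?: FetchInit): string {
  if (typeof input === "string") return `${init?.method ?? "GET"} ${input}`;
  if ("href" in input) return `${init?.method ?? "GET"} ${input.href}`;
  return `${init?.method ?? input.method} ${input.url}`;
}

// The "monkey patch": records each call, then delegates to the original.
function instrumentFetch(original: FetchLike, log: string[]): FetchLike {
  return (input, init) => {
    log.push(describeCall(input, init));
    return original(input, init);
  };
}

// A stand-in that is NOT a no-op: it throws on input it cannot resolve to an
// https URL, so a wrapper that mangles its arguments fails the test.
const strictFakeFetch: FetchLike = (input) => {
  const url = typeof input === "string" ? input : "href" in input ? input.href : input.url;
  if (!url.startsWith("https://")) throw new TypeError(`bad URL: ${url}`);
  return Promise.resolve({ status: 200 });
};

const calls: string[] = [];
const patched = instrumentFetch(strictFakeFetch, calls);

// Exercise each call shape, not just the plain-string one.
patched("https://example.test/a");
patched({ href: "https://example.test/b" }, { method: "POST" });
patched({ url: "https://example.test/c", method: "PUT" });
```

A no-op mock would accept all three calls no matter what the wrapper passed through; the strict stand-in at least fails loudly when the arguments arrive in the wrong shape.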

sethammons · on Aug 20, 2024

I have seen so many mocked tests where you end up asserting the logic in the mock works; effectively testing 1=1.

The number of issues that can be prevented with an acceptance-level test that has a user log in and do one simple interaction is amazing. Where I can convince the powers that be, PRs to main are gated by a build that runs, among others, that simple kind of AC test. If it was merged to main, you _know_ it will not totally break production.

We had regular outages with our internal emailing system at a small e-commerce shop. I stepped in and added one test that actually sent an email to a known sink that we could verify, and had that test run pre-deploy. We went to zero email outages. The test had the occasional flake, which was auto-retried. Also, if your acceptance tests are flaky, how do you know your software isn't? Flakiness is a bad excuse to avoid acceptance-level testing.
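
A rough sketch of that kind of pre-deploy check, with an in-memory stand-in for the mail sink and a simulated flaky mailer. All names are hypothetical; a real version would be async and talk to the actual mail system.

```typescript
// Hypothetical sketch: the sink, mailer and retry helper are illustrative only.
type Email = { to: string; subject: string };

// Stand-in for a real mail sink that a pre-deploy check would query.
class InMemorySink {
  private inbox: Email[] = [];
  deliver(email: Email): void {
    this.inbox.push(email);
  }
  has(subject: string): boolean {
    return this.inbox.some((e) => e.subject === subject);
  }
}

// Re-runs a check up to `attempts` times; returns how many tries it took.
function retry(attempts: number, check: () => void): number {
  let lastError: unknown;
  for (let i = 1; i <= attempts; i++) {
    try {
      check();
      return i;
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

const sink = new InMemorySink();
let sends = 0;

// Simulated mailer whose first send fails, to mimic an occasional flake.
function sendEmail(email: Email): void {
  sends += 1;
  if (sends === 1) throw new Error("transient failure");
  sink.deliver(email);
}

// The acceptance check: send one message through the mailer and verify it reached the sink.
const subject = "deploy-smoke-test";
const tries = retry(3, () => {
  sendEmail({ to: "sink@example.test", subject });
  if (!sink.has(subject)) throw new Error("email never arrived at the sink");
});
```

The point of the retry is only to absorb the occasional flake; a check that still fails after every attempt blocks the deploy.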

pbasista · on Aug 20, 2024

Thank you for pointing this out. I did not read the post in detail, but I was wondering how a monitoring library could cause the entire application to go down. At worst, I thought, it should have failed to process the monitoring events, assuming that it was integrated in a reasonable way.

PostHog patching a very important global function is a feature that should be well-documented so that the people who are using it are aware of it and can be reasonably expected to have it in mind when debugging these seemingly unexplainable issues.
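
To illustrate the "at worst, lose the monitoring events" idea, here is a minimal sketch of a patch that isolates the instrumentation from the function it wraps. The helper `withSafeHook` and the other names are hypothetical; this is not how PostHog or any specific library implements its patch.

```typescript
// Hypothetical sketch; not how PostHog or any specific library implements its patch.
function withSafeHook<A extends unknown[], R>(
  original: (...args: A) => R,
  hook: (...args: A) => void,
): (...args: A) => R {
  return (...args: A): R => {
    try {
      hook(...args); // monitoring side effect
    } catch {
      // Swallow monitoring errors: losing an event beats breaking the caller.
    }
    // The original always runs with the caller's untouched arguments.
    return original(...args);
  };
}

// A deliberately broken hook standing in for buggy instrumentation.
const brokenHook = (_url: string): void => {
  throw new Error("instrumentation bug");
};

// Stand-in for the patched global function.
const originalRequest = (url: string): string => `response from ${url}`;

const safeRequest = withSafeHook(originalRequest, brokenHook);
const result = safeRequest("https://example.test/checkout");
```

Even with the hook throwing on every call, `safeRequest` still returns whatever the original function returns.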

ataru · on Aug 20, 2024

I think it worked as defined, it hogged the POST requests?

x0x0 · on Aug 20, 2024

It's common with all these analytics toolkits. When you're monkeying with such a core API, I have no idea how you can actually test everything.

E.g. Heap analytics still (as of this month) bad-touches something inside Hotwire and randomly breaks Hotwire entirely, causing every click to do a full page load. IME, it affects 30-60% of page loads. You can fix it (but thanks for the 50+ hours of debugging) by making Heap load after all the Hotwire JS.

jjnoakes · on Aug 20, 2024

> I have no idea how you can actually test everything.

While that is a worthy goal (if sometimes unattainable), in this case folks would settle for "test anything", which would have caught the regression.

x0x0 · on Aug 20, 2024

They've been hanging out with the CrowdStrike people.

"Test strategy: yolo"