
Sometimes, a "bug" can be caused by nasty architecture with intertwined hacks. Particularly in games, where you can easily have event A that triggers B unless C is in X state...

What I want to say is that I've seen what happens in a team with a history of quick fixes and inadequate architecture design to support the complex features. In that case, a proper bugfix could create significant rework and QA.



> Sometimes, a "bug" can be caused by nasty architecture with intertwined hacks

The joys of enterprise software. When searching for the cause of a bug lets you discover multiple "forgotten" servers, ETL jobs, crons all interacting together. And no one knows why they do what they do, or how they do it, because the people who built them left many years ago.


> searching for the cause of a bug lets you discover multiple "forgotten" servers, ETL jobs, crons all interacting together. And no one knows why they do [..]

And then comes the "beginner's" mistake. They don't seem to be doing anything. Let's remove them, what could possibly go wrong?


If you follow the prescribed procedure and involve all required management, it stops being a beginner's mistake. Given reasonable rollback provisions, it stops being a mistake at all: if nobody knows what the thing is, it cannot be very important, and a removal attempt is the most effective and cost-efficient way to find out whether the thing can be removed.


> a removal attempt is the most effective and cost-efficient way to find out whether the thing can be removed

Cost-efficient for your team’s budget, sure, but a 1% chance of a 10+ million dollar issue is worth significant effort. That’s the thing with enterprise systems: the scale of minor blips can justify quite a bit. If one person working for 3 months could figure out what something is doing, there are scales where that’s a perfectly reasonable thing to do.

Enterprise covers a whole range of situations, and there are a lot more billion-dollar orgs than trillion-dollar orgs, so your mileage may vary.
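
To put rough numbers on that trade-off, here is a minimal sketch. The 1% and $10M figures are the ones from this comment; the $200k/year fully loaded cost of an engineer is purely my assumption, so plug in your own.

```python
def expected_loss(probability: float, impact_usd: float) -> float:
    """Expected cost of an incident: probability times impact."""
    return probability * impact_usd


def investigation_cost(people: int, months: float, annual_cost_usd: float) -> float:
    """Cost of having `people` investigate for `months`."""
    return people * (months / 12.0) * annual_cost_usd


# Figures from the comment: a 1% chance of a $10M issue, 1 person for 3 months.
risk = expected_loss(0.01, 10_000_000)
# ASSUMPTION: $200k/year fully loaded cost per engineer (not from the thread).
cost = investigation_cost(people=1, months=3, annual_cost_usd=200_000)

print(f"expected loss: ${risk:,.0f}")
print(f"investigation: ${cost:,.0f}")
print("investigate first" if cost < risk else "just try removing it")
```

Under those assumptions the investigation costs less than the expected loss. At a smaller org, where the worst case is far below $10M, the same arithmetic can easily point the other way.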


If there is a risk of a 10+ million dollar issue there is also some manager whose job is to overreact when they hear the announcement that someone wants to eliminate thing X, because they know that thing X is a useful part of the systems they are responsible for.

In a reasonable organization only very minor systems can be undocumented enough to fall through the cracks.


In an ideal world sure, but knowledge gets lost every time someone randomly quits, dies, retires etc.

Stuff that’s been working fine for years is easy for a team to forget about, especially when it’s a hidden dependency in some script that’s going to make some process quietly fail.


The OP explicitly said "if you involve all required management", and that is key here. Having a process that is responsible for X million dollars of revenue yet is owned by no manager is a liability for the business (as is having an asset in operation that serves no purpose). Identifying that situation in a controlled manner is much better than letting it linger until it surfaces at a moment of Murphy's choosing.

> Stuff that’s been working fine for years is easy for a team to forget about

That's why serious companies have a documentation system describing their processes, tools and dependencies.


The basic premise was it’s no longer obvious if a system is still doing anything useful. If the system had easy to locate documentation saying everything that used it then there wouldn’t be an issue, but that’s very difficult to maintain.

Documentation on every possible system that could use the resource would need to be accurate and complete, and someone would have to locate it, actually read it, remember it, and communicate it to someone in a relevant meeting, which may be taking place multiple levels of management above the reader. Part of that chain breaks when a new manager shows up and faces endless seemingly minor details: even if they did encounter that information at some point, there’s nothing that particularly calls it out as worth remembering at the time.

That’s a lot of individual points of failure, which is why I’m saying that in the real world even well-run companies mess this stuff up.


Well, maybe. See Chesterton's Fence [1]

[1] https://theknowledge.io/chestertons-fence-explained/


I have had several things over the course of my career that:

1) I was (temporarily) the only one still at the company who knew why it was there

2) I only knew myself because I had reverse engineered it, because the person who put it there had left the company

Now, some of those things had indeed become unnecessary over time (and thus were removed). Some of them, however, have been important (and thus were documented). In aggregate, it's been well worth the effort to do that reverse engineering to classify things properly.


I've fixed more than enough bugs by just removing the code and doing it the right way.

Of course you can get lost on the way but worst case is you learn the architecture.


If it’s done in a controlled manner with the ability to revert quickly, you’ve just instituted a “scream test[0].”

____

[0] https://open.substack.com/pub/lunduke/p/the-scream-test

(Obviously not the first description of the technique as you’ll read, but I like it as a clear example of how it works)
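
For anyone who hasn't seen one, the shape of a scream test can be sketched roughly like this. `Service` and `ScreamTest` are hypothetical names of mine, not from the linked post, and a real version would flip a firewall rule or a systemd unit rather than a boolean.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical stand-in for a forgotten server, cron, or ETL job."""
    name: str
    enabled: bool = True


@dataclass
class ScreamTest:
    service: Service
    complaints: list[str] = field(default_factory=list)

    def start(self) -> None:
        # Disable rather than delete, so reverting is a single step.
        self.service.enabled = False

    def report(self, complaint: str) -> None:
        # Someone screamed: record who and why, then revert immediately.
        self.complaints.append(complaint)
        self.service.enabled = True

    def finish(self) -> str:
        # Call this at the end of the agreed observation window.
        if self.complaints:
            return "keep and document"
        return "safe to decommission"


legacy = Service("nightly-export")
test = ScreamTest(legacy)
test.start()
test.report("finance: month-end report is empty")
print(legacy.enabled, test.finish())
```

The important design choice is that `start` only disables. Nothing gets deleted until `finish` has been called with no complaints on record.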


That's a management/cultural problem. If no one knows why it's there, the right answer is to remove it and see what breaks. If you're too afraid to do anything, for nebulous cultural reasons, you're paralyzed by fear and no one's operating with any efficiency. It hits different when it's the senior expert that everyone reveres, who invented everything the company depends on, that does it, vs a summer intern, vs Elon Musk after he bought your company (Twitter). Hate the man for doing it messily and ungraciously, but you can't argue with the fact that it gets results.


This does depend on a certain level of testing (automated or otherwise) for you to even be able to identify what breaks in the first place. The effect might be indirect several times over and you don't see what has changed until it lands in front of a customer and they notice it right away.

Move fast and break things is also a managerial/cultural problem in certain contexts.


> It hits different when it's the senior expert that everyone reveres that invented everything the company depends on that does it, vs a summer intern vs Elon Musk bought your company (Twitter). Hate the man for doing it messily and ungraciously, but you can't argue with the fact that it gets results.

You can only say that with a straight face if you're not the one responsible for cleaning up after Musk or whatever CTO sharted across the chess board.

C-levels love the "shut it down and wait until someone cries out" method because it gives easy results on some arbitrary KPI metric without exposing them to the actual fallout. In the worst case the loss is catastrophic, requiring weeks' worth of ad-hoc emergency mode cleanup across multiple teams - say, some thing in finance depends on that server doing a report at the end of the year and the C-level exec's decision was made in January... but by that time, if you're in real bad luck, the physical hardware got sold off and the backup retention has expired. But when someone tries to blame the C-level exec, said C-level exec will defend themselves with "we gave X months of advance warning AND 10 months after the fact no one had complained".
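
The year-end example boils down to a simple check: silence only tells you something if the observation window is at least as long as the slowest cadence of anything that might depend on the box. A minimal sketch follows; the function name is mine, and treating "10 months" as 300 days is an approximation.

```python
from datetime import timedelta


def silence_is_meaningful(observation: timedelta, longest_cadence: timedelta) -> bool:
    """True only if every periodic job had a chance to run (and fail) in the window."""
    return observation >= longest_cadence


ten_months = timedelta(days=10 * 30)   # "10 months after the fact no one had complained"
yearly_report = timedelta(days=365)    # the year-end finance report

print(silence_is_meaningful(ten_months, yearly_report))  # prints False
```

The catch, of course, is that with a forgotten system nobody knows the longest cadence, which is exactly why the window gets picked by gut feeling.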


It can also be dangerous to be the person who blames execs. Other execs might see you as a snake who doesn't play the game, and start treating you as a problem child who needs to go, your actual contributions to the business be damned. Even if you have the clout to piss off powerful people, you can make an enemy for life there, who will be waiting for an opportunity to blame you for something, or use their influence to deny raises and resources to your team.

Also with enterprise software a simple bug can do massive damage to clients and endanger large contracts. That's often a good reason to follow the Chesterton's fence rule.


> C-levels love the "shut it down and wait until someone cries out" method because it gives easy results on some arbitrary KPI metric without exposing them to the actual fallout

It's not in the C-level's job description to manage the daily operations of the company, they have business managers to do that. If there's an expensive asset in the company that's not (actively) owned by any business manager, that's a liability -- and it is in the C-level's job description to manage liabilities.

> said C-level exec will defend themselves with "we gave X months of advance warning AND 10 months after the fact no one had complained"

And that's a perfectly valid defense, they're acting true to their role. The failure lies with the business/operations manager not being in control of their process tooling.


The next mistake is thinking that completely re-writing the system will clean out the cruft.


Plus report servers and others that run on obsolete versions of Windows/Unix/IBM OS, plus obsolete software versions.

And you just look at this and think: one day, all of this is going to crash and it will never, ever boot again.


I still have nightmares of load-bearing Perl scripts and comlink interops, and then of course our dear friend the GAC.


And then it turns out the bug is actually very intentional behavior.


In that case, maybe having bug fixing be a two-step process (identify, then fix) might be sensible.


I do this frequently. But sometimes identifying and/or fixing takes more than 2 days.

But you hit on a point that seems to come up a lot. When a user story takes longer than the allotted points, I encourage my junior engineers to split it into two bugs. Exactly like what you say... One bug (or issue or story) describing what you did to typify the problem and another with a suggestion for what to do to fix it.

There doesn't seem to be a lot of industry best practice about how to manage this, so we just do whatever seems best to communicate to other teams (and to ourselves later in time after we've forgotten about the bug) what happened and why.

Bug fix times are probably a Pareto distribution. The overwhelming majority will be identifiable within a fixed time box, but not all. So in addition to saying "no bug should take more than 2 days" I would add "if the bug takes more than 2 days, you really need to tell someone, something's going on." And one of the things I work VERY HARD to create is a sense of psychological safety so devs know they're not going to lose their bonus if they randomly picked a bug that was much more wicked than anyone thought.
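
If you want to see what a Pareto assumption implies for a time box, here is a small sketch using the standard Pareto (Type I) CDF, 1 - (x_m / x)^alpha. The shape and minimum below are made-up illustration values, not measurements from any real tracker.

```python
def fraction_fixed_within(days: float, shape: float, minimum_days: float) -> float:
    """Pareto (Type I) CDF: share of bugs fixed in at most `days`."""
    if days < minimum_days:
        return 0.0
    return 1.0 - (minimum_days / days) ** shape


# ASSUMPTIONS: both parameters are invented for illustration, not measured.
SHAPE = 1.5          # smaller values mean a heavier tail
MINIMUM_DAYS = 0.25  # the quickest fix takes about two hours of an 8-hour day

for box in (1, 2, 5, 10):
    share = fraction_fixed_within(box, SHAPE, MINIMUM_DAYS)
    print(f"{box:>2} days: {share:.1%} fixed, {1 - share:.1%} still open")
```

Whatever parameters you pick, the "still open" share shrinks but never reaches zero, which is the argument for pairing the time box with a "tell someone" rule instead of a hard limit.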


I like to do this as a two-step triage because one aspect is the impact seen by the user and how many it reaches, but the other is how much effort it would take to fix and how risky that is.

Knowing all of those aspects and where an issue lands makes it possible to prioritise it properly, but it also gives the developer the opportunity to hone their investigation and debugging skills without the pressure to solve it at the same time. A good write-up is great for knowledge sharing.
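
As a rough sketch of how those aspects could be combined into one ordering: the field names, the 1 to 5 scales, and the benefit-over-cost formula are all my own assumptions for illustration, not an established practice.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    title: str
    severity: int        # 1 (cosmetic) .. 5 (data loss), as seen by the user
    users_affected: int  # how many users it reaches
    effort_days: float   # estimated after the investigation step
    risk: int            # 1 (isolated change) .. 5 (touches shared code)

    def priority(self) -> float:
        # Benefit over cost; the formula is an illustrative assumption.
        return (self.severity * self.users_affected) / (self.effort_days * self.risk)


bugs = [
    Triage("typo on settings page", 1, 500, 0.5, 1),
    Triage("export drops rows", 5, 200, 3.0, 4),
    Triage("crash on login", 5, 5000, 2.0, 2),
]
for bug in sorted(bugs, key=Triage.priority, reverse=True):
    print(f"{bug.priority():>8.1f}  {bug.title}")
```

The effort and risk fields can only be filled in once someone has done the investigation step, which is the point of splitting triage in two.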


You sound like a great team leader.

Wish there were more like you, out there.



