I can read that "submission and deference" at the casino as conflict avoidance. The arresting officer says to his peers at the station that he "kind of believes" the suspect. He also states at some point that he can't cite (and, I infer, then release) the suspect because he is not certain who he is, and therefore has to arrest him as a "John Doe" so that his identity can be established. The fact (?) that the suspect now has a police record over this possible farce won't be settled until after the facts are determined in a court of law.
This video demonstrates that, when it comes down to it, the blunt end of law enforcement is oftentimes a shit show of "seems to work for me", and that goes for facial recognition, ShotSpotter, contraband dogs, drug & DNA tests, you name it.
Upvoted because I think it's an important topic, but this take causes me to question the motive for the article... which ironically is my big concern with using LLMs to write stuff generally (the unconscious censoring / proctoring of voice and viewpoint):
> That means that if an officer is caught lying on the stand – as shown by a contradiction between their courtroom testimony and their earlier police report – they could point to the contradictory parts of their report and say, "the AI wrote that."
IANAL but if they signed off on it then presumably they own it. Same as if it was Microsoft Dog, an intern, whatever. If they said "the AI shat it" then I'd ask "what parts did you find unacceptable and edit?" and then expect we'd get to the juicy stuff: hallucinations or "I don't recall". Did they write this, or are they testifying to the veracity of hearsay?
From what I've seen reports written by / for lawyers / jurists / judges already "pull" to a voice and viewpoint; I'll leave it there.
I agree with this statement: "Instead of logging what your code is doing, log what happened to this request." but the impression I can't shake is that this person lacks experience, or more likely has a lot of experience doing the same thing over and over.
"Bug parts" (as in "acceptable number of bug parts per candy bar") logging should include the precursors of processing metrics. I think what he calls "wide events" I call bug parts logging in order to emphasize that it also may include signals pertaining to which code paths were taken, how many times, and how long it took.
Logging is not metrics is not auditing. In particular processing can continue if logging (temporarily) fails but not if auditing has failed. I prefer the terminology "observables" to "logging" and "evaluatives" to "metrics".
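Roughly what I mean, as a minimal sketch in Python (the queue, the sqlite table, and the request shape are all made up for illustration): the log write is best effort, the audit write is transactional, and processing stops only when the latter fails.

    import queue
    import sqlite3
    import time

    log_buffer = queue.Queue(maxsize=10_000)   # observables: best effort
    audit_db = sqlite3.connect("audit.db")     # audit trail: must commit
    audit_db.execute("CREATE TABLE IF NOT EXISTS audit (event TEXT, ts REAL)")

    def handle_request(request):
        # Logging is best effort: if the buffer is full, drop the event and keep going.
        try:
            log_buffer.put_nowait({"event": "request.received", "id": request["id"]})
        except queue.Full:
            pass

        # Auditing is transactional: if this commit fails, the exception propagates
        # and the request is not processed.
        with audit_db:
            audit_db.execute("INSERT INTO audit (event, ts) VALUES (?, ?)",
                             (f"request {request['id']} accepted", time.time()))

        return {"id": request["id"], "status": "processed"}   # stand-in for real work

    handle_request({"id": 42})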
In mature SCADA systems there is the well-worn notion of a "historian". Read up on it.
A fluid level sensor on CANbus sending events 10x a second isn't telling me whether or not I have enough fuel to get to my destination (a significant question); however, that granularity might be helpful for diagnosing a stuck sensor (or bad connection). It would be impossibly fatiguing and hopelessly distracting to try to answer the significant question from this firehose of low-information events. Even a de-noised fuel gauge doesn't directly diagnose my desired evaluative (will I get there or not?).
Does my fuel gauge need to also serve as the debugging interface for the sensor? No, it does not. Likewise, send metrics / evaluatives to the cloud not logging / observables; when something goes sideways the real work is getting off your ass and taking a look. Take the time to think about what that looks like: maybe that's the best takeaway.
I espouse a "grand theory of observability" that, like matter and energy, treats logs, metrics, and audits alike. At the end of the day, they're streams of bits, and so long as no fidelity is lost, they can be converted between each other. Audit trails are certainly carried over logs. Metrics are streams of time-series numeric data; they can be carried over log channels or embedded inside logs (as they often are).
How these signals are stored, transformed, queried, and presented may differ, but at the end of the day, the consumption endpoint and mechanism can be the same regardless of origin. Doing so simplifies both the conceptual framework and design of the processing system, and makes it flexible enough to suit any conceivable set of use cases. Plus, storing the ingested logs as-is in inexpensive long-term archival storage allows you to reprocess them later however you like.
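A sketch of a metric embedded inside a structured log event, assuming a JSON-lines channel (the field names are invented): the same record reads as a log line and carries the numerics a consumer can roll up into time series.

    import json
    import sys
    import time

    def emit_wide_event(stream, request_id, duration_ms, db_calls):
        # One structured "wide event" per unit of work: the same record is a log
        # line, and its numeric fields (duration_ms, db_calls) can be rolled up
        # into time-series metrics by whatever consumes the stream.
        event = {"ts": time.time(), "request_id": request_id,
                 "duration_ms": duration_ms, "db_calls": db_calls}
        stream.write(json.dumps(event) + "\n")

    emit_wide_event(sys.stdout, request_id="abc123", duration_ms=17.4, db_calls=3)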
Auditing is fundamentally different because it has different durability and consistency requirements. I can buffer my logs, but I might need to transact my audit.
For most cases, buffering audit logs on local storage is fine. What matters is that the data is available and durable somewhere in the path, not that it be transactionally durable at the final endpoint.
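As a sketch of "durable somewhere in the path": append to a local spool and fsync before acknowledging, then ship to central storage later. The file name and record shape are illustrative.

    import json
    import os
    import time

    class LocalSpool:
        """Append-only local buffer: a record is acknowledged only after fsync,
        so it survives a process crash; shipping to central storage happens later."""

        def __init__(self, path="audit.spool"):
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

        def append(self, record):
            os.write(self.fd, (json.dumps(record) + "\n").encode())
            os.fsync(self.fd)      # durable on local storage before we return
            return True            # the "ack": the record is on disk somewhere in the path

    spool = LocalSpool()
    spool.append({"ts": time.time(), "event": "user.login", "user": "alice"})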
What are we defining as “audit” here? My experience is with regulatory requirements, and “durable” on local storage isn’t enough.
In practice, the audit isn’t really a log, it’s something more akin to a database record. The point is that you can’t satisfy audit requirements by filtering your log stream.
Take Linux kernel audit logs as an example. So long as they can be persisted to local storage successfully, they are considered durable. That’s been the case since the audit subsystem was first created. In fact, you can configure the kernel to panic as soon as records can no longer be recorded.
Regulators have never dictated where auditable logs must live. Their requirement is that the records in scope are accurate (which implies tamper-proof) and that they are accessible. Provided those requirements are met, where the records can be found is irrelevant. It thus follows that if all logs across the union of centralized storage and endpoint storage meet the above requirements, then the regulator will be satisfied.
> Regulators have never dictated where auditable logs must live.
That’s true. They specify that logs cannot be lost, must be available for x years, must not be modifiable, must be accessible only in y ways, and cannot cross various boundaries/borders (depending on the government in question). Or bad things will happen to you (your company).
In practice, this means that durability of that audit record “a thing happened” cannot be simply “I persisted it to disk on one machine”. You need to know that the record has been made durable (across whatever your durability mechanism is, for example a DB with HA + DR), before progressing to the next step. Depending on the stringency, RPO needs to be zero for audit, which is why I say it is a special case.
I don’t know anything about Linux audit, and I doubt it has any relevance to regulatory compliance.
> In practice, this means that durability of that audit record “a thing happened” cannot be simply “I persisted it to disk on one machine”
As long as the record can be located when it is sought, it does not matter how many copies there are. The regulator will not ask so long as your system is a reasonable one.
Consider that technologies like RAID did not exist once upon a time, and backup copies were latent and expensive. Yet we still considered the storage (which was often just a hard copy on paper) to be sufficient to meet the applicable regulations. If a fire then happened and burned the place down, and all the records were lost, the business would not be sanctioned so long as they took reasonable precautions.
Here, I’m not suggesting that “the record is on a single disk, that ought to be enough.” I am assuming that in the ordinary course of business there is a working path to getting additional redundant copies made, but that those additional copies are temporarily delayed due to overload. No reasonable regulator is going to tell you this is unacceptable.
> Depending on the stringency, RPO needs to be zero for audit
And it is! The record is either in local storage or in central storage.
> And it is! The record is either in local storage or in central storage.
But it isn’t! Because there are many hardware failure modes that mean that you aren’t getting your log back.
For the same reason that you need acks=all in Kafka for zero data loss, or synchronous_commit = remote_flush in PostgreSQL, you need to commit your audit log to more than the local disk!
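For the Kafka half of that, a sketch with the kafka-python client (the broker address and topic are placeholders): with acks="all" the send isn’t considered successful until the full in-sync replica set has acknowledged it.

    from kafka import KafkaProducer   # kafka-python

    producer = KafkaProducer(
        bootstrap_servers="broker:9092",   # placeholder address
        acks="all",                        # leader waits for the full ISR before acking
        retries=5,
    )

    future = producer.send("audit-events", b'{"event": "payment.recorded"}')
    metadata = future.get(timeout=10)      # raises if the replicated write fails
    print(metadata.topic, metadata.partition, metadata.offset)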
If your hardware and software can’t guarantee that writes are committed when they say they are, all bets are off. I am assuming a scenario in which your hardware and/or cloud provider doesn’t lie to you.
In the world you describe, you don’t have any durability when the network is impaired. As a purchaser I would not accept such an outcome.
> In the world you describe, you don’t have any durability when the network is impaired.
Yes, the real world. If you want durability, a single physical machine is never enough.
This is standard distributed computing, and we’ve had all (most) of the literature and understanding of this since the 70’s. It’s complicated, and painful to get right, which is why people normally default to a DB (or cloud managed service).
The reason this matters for this logging scenario is that I normally don’t care if I lose a bit of logging in a catastrophic failure case. It’s not ideal, but I’m trading RPO for performance. However, when regs say “thou shalt not lose thy data”, I move the other way. Which is why the streams are separate. It does impose an architectural design constraint because audit can’t be treated as a subset of logs.
> If you want durability, a single physical machine is never enough.
It absolutely can be. Perhaps you are unfamiliar with modern cloud block storage, or RAID backed by NVRAM? Both have durability far above and beyond a single physical disk. On AWS, for example, EBS io2 Block Express offers 99.999% durability. Alternatively, you can, of course, build your own RAID 1 volumes atop ordinary gp3 volumes if you like, to design for similar loss probabilities.
Again, auditors do not care -- a fact you admitted yourself! They care about whether you took reasonable steps to ensure correctness and availability when needed. That is all.
> when regs say “thou shalt not lose thy data”, I move the other way. Which is why the streams are separate. It does impose an architectural design constraint because audit can’t be treated as a subset of logs.
There's no conflict between treating audit logs as logs -- which they are -- and having separate delivery streams and treatment for different retention and durability policies. Regardless of how you manage them, it doesn't change their fundamental nature. Don't confuse the nature of logs with the level of durability you want to achieve with them. They're orthogonal matters.
> It absolutely can be. Perhaps you are unfamiliar with modern cloud block storage, or RAID backed by NVRAM? Both have durability far above and beyond a single physical disk. On AWS, for example, ec2 Block Express offers 99.999% durability. Alternatively, you can, of course, build your own RAID 1 volumes atop ordinary gp3 volumes if you like to design for similar loss probabilities.
Certainly you can solve for zero data loss (RPO=0) at the infrastructure level. It involves synchronously replicating that data to a separate physical location. If your threat model includes “fire in the dc”, reliable storage isn’t enough. To survive a site catastrophe with no data loss you must maintain a second, live copy (synchronous replication before ack) in another fault domain.
In practice, in my experience, this is done at the application level rather than trying to do so with infrastructure.
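Something like this, as a sketch of doing it at the application level (the replica endpoints are hypothetical): the write isn’t acknowledged to the caller until every fault domain has confirmed it.

    import json
    import urllib.request

    # Hypothetical audit endpoints in two separate fault domains.
    REPLICAS = ["https://audit-a.example.internal/append",
                "https://audit-b.example.internal/append"]

    def record_audit_event(event):
        # Application-level synchronous replication: the event counts as recorded
        # only once every replica has acknowledged the write (RPO = 0 for this
        # record). Any failure raises, and the caller must not proceed.
        payload = json.dumps(event).encode()
        for url in REPLICAS:
            req = urllib.request.Request(url, data=payload,
                                         headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req, timeout=5):
                pass   # non-2xx responses raise HTTPError, which propagates

    # record_audit_event({"event": "funds.transferred", "amount": 100})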
> There's no conflict between treating audit logs as logs -- which they are -- with having separate delivery streams and treatment for different retention and durability policies
It matters to me, because I don’t want to be dependent on a sync ack between two fault domains for 99.999% of my logs. I only care about this when the regulator says I must.
> Again, auditors do not care -- a fact you admitted yourself! They care about whether you took reasonable steps to ensure correctness and availability when needed. That is all.
I care about matching the solution to the regulation; which varies considerably by country and use-case. However there are multiple cases I have been involved with where the stipulation was “you must prove you cannot lose this data, even in the case of a site-wide catastrophe”. That’s what RPO zero means. It’s DR, i.e., after a disaster. For nearly everything 15 minutes is good, if not great. Not always.
> It matters to me, because I don’t want to be dependent on a sync ack between two fault domains for 99.999% of my logs. I only care about this when the regulator says I must.
If you want synchronous replication across fault domains for a specific subset of logs, that’s your choice. My point is that treating them this way doesn’t make them not logs. They’re still logs.
I feel like we’re largely in violent agreement, other than whether you actually need to do this. I suspect you’re overengineering to meet an overly stringent interpretation of a requirement. Which regimes, specifically, dictated that you must have synchronous replication across fault domains, and for which set of data? As an attorney as well as a reliability engineer, I would love to see the details. As far as I know, no one - no one - has ever been held to account by a regulator for losing covered data due to a catastrophe outside their control, as long as they took reasonable measures to maintain compliance. RPO=0, in my experience, has never been a requirement with strict liability regardless of disaster scenario.
> I suspect you’re overengineering to meet an overly stringent interpretation of a requirement. Which regimes, specifically, dictated that you must have synchronous replication across fault domains, and for which set of data? As an attorney as well as a reliability engineer, I would love to see the details.
I can’t go into details about current cases with my current employer, unfortunately. Ultimately, the requirements go through legal and are subject to back and forth with representatives of the government(s) in question. As I said, the problem isn’t passing an audit, it’s getting the initial approval to implement the solution by demonstrating how the requirement will be satisfied. Also, cloud companies are in the same boat, and aren’t certified for use as a result.
This is the extreme end of when you need to be able to say “x definitely happened” or “y definitely didn’t happen.” It’s still a “log” from the application’s perspective, but really more of a transactional record that has legal weight. And because you can’t lose it, you can’t send it out the “logging” pipe (which for performance is going to sit in a memory buffer for a bit, a local disk buffer for longer, and then get replicated somewhere central); you send it out a transactional pipe and wait for the ack.
Having a gov tell us “this audit log must survive a dc fire” is a bit unusual, but dealing with the general requirement “we need this data to survive a dc fire”, is just another Tuesday. An audit log is nothing special if you are thinking of it as “data”.
You’re a reliability engineer, have you never been asked to ensure data cannot be lost in the event of a catastrophe? Do you agree that this requires synchronous external replication?
> have you never been asked to ensure data cannot be lost in the event of a catastrophe? Do you agree that this requires synchronous external replication?
I have been asked this, yes. But when I tell them what the cost would be to implement synchronous replication in terms of resources, performance, and availability, they usually change their minds and decide not to go that route.
Some kind of "Error" is of course one of the sane message types. "Warning" and "info" might be as well.
"Verbose", "debug", "trace" and "silly" are definitely not, as those describe a different thing altogether, and would probably be better instrumented through something like the npm "debug" package.
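A rough Python analogue of that separation, using the standard logging module rather than the npm package (the handler targets and DEBUG switch are just illustrative): error/warning/info always flow to the operational stream, while debug-level instrumentation is opted into separately.

    import logging
    import os
    import sys

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    # Operational messages (error / warning / info) always reach the main handler.
    ops = logging.StreamHandler(sys.stderr)
    ops.setLevel(logging.INFO)
    root.addHandler(ops)

    # Debug-level instrumentation is a separate, opt-in concern, roughly the way
    # DEBUG=foo enables the npm "debug" package per run.
    if os.environ.get("DEBUG"):
        dbg = logging.FileHandler("debug.log")
        dbg.setLevel(logging.DEBUG)
        root.addHandler(dbg)

    logging.getLogger(__name__).info("service started")
    logging.getLogger(__name__).debug("cache warm took 120ms")  # only recorded when DEBUG is set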
Saying they are all the same when no fidelity is lost is missing the point. The only distinction between logs, traces, and metrics is literally what to do when fidelity is lost.
If you have insufficient ingestion rate:
Logs are for events that can be independently sampled and still be coherent. You can drop arbitrary logs to stay within ingestion rate.
Traces are for correlated sequences of events where the entire sequence needs to be retained to be useful/coherent. You can drop arbitrary whole sequences to stay within ingestion rate.
Metrics are pre-aggregated collections of events. You pre-limited your emission rate to fit your ingestion rate at the cost of upfront loss of fidelity.
If you have adequate ingestion rate, then you just emit your events bare and post-process/visualize your events however you want.
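As a sketch of those three drop policies under backpressure (the sample rate and names are invented): drop individual log events, drop whole traces, and pre-aggregate metrics before emission.

    import random
    from collections import defaultdict

    SAMPLE_RATE = 0.1            # keep roughly 10% when the pipeline is saturated
    counters = defaultdict(int)

    def send(payload):           # stand-in for the real ingestion endpoint
        print(payload)

    def ship_log(event):
        # Logs: each event stands alone, so arbitrary events can be dropped.
        if random.random() < SAMPLE_RATE:
            send(event)

    def ship_trace(spans):
        # Traces: keep or drop the whole correlated sequence, never part of it.
        if random.random() < SAMPLE_RATE:
            for span in spans:
                send(span)

    def record_metric(name, value=1):
        # Metrics: aggregate up front; only the rollup is emitted, at a fixed rate.
        counters[name] += value

    def flush_metrics():
        for name, total in counters.items():
            send({"metric": name, "value": total})
        counters.clear()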
I would rather fix this problem than every other problem. If I'm seeing backpressure, I'd prefer to buffer locally on disk until the ingestion system can get caught up. If I need to prioritize signal delivery once the backpressure has resolved itself, I can do that locally as well by separating streams (i.e. priority queueing). It doesn't change the fundamental nature of the system, though.
> You can drop arbitrary logs to stay within ingestion rate.
Another way I've heard this framed in production environments ingesting a firehose is: you can drop individual logging events because there will always be more.
It depends. Some cases like auditing require full fidelity. Others don’t.
Plus, if you’re offering a logging service to a customer, the customer’s expectation is that once successfully ingested, your service doesn’t drop logs. If you’re violating that expectation, this needs to be clearly communicated to and assented to by the customer.
The right way to think about logs, IMO, is less like diagnostic information and more like business records. If you change the framing of the problem, you might solve it in a different way.
I'm sorry for your loss, but this sounds like an antipattern. Hundreds of emails between co-workers and it was all contemporaneously related to work in progress or cat pictures of your own cats, didn't contain PII or proprietary information of your employer or unaware third parties? And you want it back? From far enough away (that I might as well be in orbit) this seems preferable to an unencrypted drive ending up in somebody's hands for "refurbishment" (cough printers with hard drives).
No one is innocent. I refuse to use LE (Let's Encrypt) and operate my own CA instead, and as a consequence of scareware browser warnings I publish http: links instead of https: (if anyone cares, you know to add the "s", don't you?). I run my own mailserver which opportunistically encrypts, and at least when it gets to me it's on hardware which I own and somebody needs a search warrant to access... as opposed to y'all and your gmail accounts. I do have a PGP key, but I don't include it on the first email with every new correspondent because too many times it's been flagged as a "virus" or "malicious".
Clearly we live in a world only barely removed from crystals and ouija boards.
> Hundreds of emails between co-workers and it was all contemporaneously related to work in progress or cat pictures of your own cats, didn't contain PII or proprietary information of your employer or unaware third parties?
You're merely defining away the problem. You have no idea what was in those emails.
Who knew I’d need to do this? I’d never needed to do this with my emails in the decades prior.
You’ve also got no idea what was in those emails. Could be some valuable knowledge or logs about some crazy rare bug or scenario, and would be useful to review today.
We just turned on S/MIME by default, to “be secure”, whatever that means. There was no warning in the email client about losing access to the email if you lost your keys.
Citing BOFH is all well and good inside certain circles. In the real world, people don’t like spending time or effort on poorly thought out and implemented solutions.
IOW: who owns the backups owns the data... until proven otherwise. My default presumption from space is that 1) there are document management policies and 2) document management policies apply.
Ummm... Google, Amazon, eBay, and PayPal... Facebook, Airbnb, Uber, and the offspring of Y Combinator... doesn't look like a particularly virtuous trajectory to me.
One of the more interesting ways of detecting rail damage, and subsidence in general, is optically detecting noise / distortion in fiber optic cables. It's an applied case of observables which are the basis for an evaluative (the "signal"): originally utilized to diagnose possible maintenance issues, and then somebody going "hey there, wait a sec, there's a different evaluative we can produce from this exhaust and sell".
Location: Tacoma, WA, USA
Remote: Sure, that's an option.
Willing to relocate: No, but willing to drive around the greater Pacific Northwest.
Technologies: Python, Redis, C, DNS, Ignition SCADA / HMI, SQL, Bash, Linux, TCP/UDP,
old school ML & numerical methods, threat hunting, electronics, handy
with a wrench, dangerous with acetylene.
Résumé/CV: https://github.com/m3047 https://www.linkedin.com/in/fred-morris-03b6952/
Email: m3047-wantstobehired-t5b@m3047.net
Let's have a conversation, I like working with people. Not looking for the same old thing, have had a business license since 1984; prior to 2000 about a third of my work was firm bid (why did that go out of fashion?). Technologist and problem solver. Cloud skeptic, but I use what works. I prefer contracts to promises. W2 / contract: it depends, let's choose the correct vehicle for the scenario, and risk has a lot to do with it. Part time, seasonal, campaign... it's all good.
It's been over 20 years, but I used to build batch processing pipelines using SMTP (not Outlook). Biggest "choose your own adventure" aspect is accounting / auditing (making sure you didn't lose batches). I still joke about it as an option; although if somebody took me up on it I'd write up at least a semi-serious proposal.
In the middle are Mule, Rabbit, Kafka, ZMQ.
At the other end is UDP multicast, and I still use that for VM hosts or where I can trust the switch.
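For flavor, a minimal multicast sender sketch in Python (the group address and port are examples; the point is the datagram never leaves the trusted segment):

    import json
    import socket
    import struct

    MCAST_GROUP = "239.1.1.1"    # example administratively-scoped group
    MCAST_PORT = 5007

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL of 1 keeps the datagrams on the local segment; the switch is the trust boundary.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", 1))

    sock.sendto(json.dumps({"event": "batch.complete", "batch": 1047}).encode(),
                (MCAST_GROUP, MCAST_PORT))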