
I agree this doesn't seem too ambiguous - it's "you may do this.." and they said "or we may do the reverse". If I say you could prefix something.. the alternative isn't that you can suffix it.

But also.. the programmers working on the software running one of the most important (end-user) DNS servers in the world:

1. Changes logic in how CNAME responses are formed

2. I assume at least some tests broke, which meant they needed to be "fixed up" (y'know - "when a CNAME is queried, I expect this response")

3. No one saw these changes in test behaviour and thought "I wonder if this order is important", or "We should research more into this", or "Are other DNS servers changing order", or "This should be flagged for a very gradual release".

4. Ends up in a test environment for, what, a month.. nothing using getaddrinfo from glibc was used to test that environment, and no one noticed it was broken

Cloudflare seem to be getting into the swing of breaking things and then being transparent. But this really reads as a fun "did you know", not a "we broke things again - please still use us".

There's no real RCA except to blame an RFC - but honestly, for a large-scale operation like theirs, this seems a very big thing to slip through the cracks.

I would make a joke about South Park's oil-spill "I'm sorry".. but they don't even seem to be.


> 4. Ends up in a test environment for, what, a month.. nothing using getaddrinfo from glibc was used to test that environment, and no one noticed it was broken

"Testing environment" sounds to me like a real network real user devices are used with (like the network used inside CloudFlare offices). That's what I would do if I was developing a DNS server anyway, other than unit tests (which obviously wouldn't catch this unless they were explicitly written for this case) and maybe integration/end-to-end tests, which might be running in Alpine Linux containers and as such using musl. If that's indeed the case, I can easily imagine how noone noticed anything was broken. First look at this line:

> Most DNS clients don’t have this issue. For example, systemd-resolved first parses the records into an ordered set:

Now think about what real end-user devices are using: Windows/macOS/iOS obviously aren't using glibc, and Android also has its own C library even though it's Linux-based, so they all probably fall under "Most DNS clients don't have this issue".

That leaves GNU/Linux, where we could reasonably expect most software to use glibc for resolving queries, so presumably anyone using Linux on their laptop would catch this, right? Except most distributions started using systemd-resolved (the most notable exception is Debian, but not many people use that on desktops/laptops), which is a local caching resolver and as such acts as a middleman between glibc software and the network-configured DNS server: it would handle the 1.1.1.1 responses correctly itself, and then return results from its cache ordered by its own ordering algorithm.
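
As a quick illustration (my own sketch, not from the article): on a glibc box you can tell which camp you're in by looking at /etc/resolv.conf - if the nameserver is 127.0.0.53, glibc is talking to the local systemd-resolved stub rather than directly to the network-configured resolver.

  # Rough sketch: report whether glibc lookups on this host go through the
  # systemd-resolved stub (127.0.0.53) or straight to the configured resolver.
  from pathlib import Path

  def resolv_conf_nameservers(path="/etc/resolv.conf"):
      servers = []
      for line in Path(path).read_text().splitlines():
          parts = line.split()
          if parts and parts[0] == "nameserver":
              servers.append(parts[1])
      return servers

  if __name__ == "__main__":
      nameservers = resolv_conf_nameservers()
      if "127.0.0.53" in nameservers:
          print("glibc forwards to the local systemd-resolved stub")
      else:
          print("glibc talks to the configured resolver(s) directly:", nameservers)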


For the output of Cloudflare’s DNS server, which serves a huge chunk of the Internet, they absolutely should have a comprehensive byte-by-byte test suite, especially for one of the most common query/result patterns.

> other than unit tests (which obviously wouldn't catch this unless they were explicitly written for this case)

They absolutely should have unit tests that detect any change in output and manually review those changes for an operation of this size.
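
To make that concrete, here is a sketch of what such a golden test could look like (the myresolver module, resolve_raw helper and file layout are made up for illustration - this is not Cloudflare's actual suite):

  # Golden/snapshot test sketch: compare the resolver's wire-format answer for
  # a fixed query byte-for-byte against a checked-in golden file, so ANY output
  # change (including record order) fails and forces a reviewed update.
  # Assumes resolve_raw normalises volatile fields (message ID, TTLs) first.
  import unittest
  from pathlib import Path

  from myresolver import resolve_raw  # hypothetical: returns raw DNS response bytes

  GOLDEN_DIR = Path(__file__).parent / "golden"

  class GoldenResponseTest(unittest.TestCase):
      def check_golden(self, name, qname, qtype):
          got = resolve_raw(qname, qtype)
          want = (GOLDEN_DIR / f"{name}.bin").read_bytes()
          self.assertEqual(got, want)  # a re-ordered CNAME chain fails here

      def test_cname_chain(self):
          self.check_golden("cname_chain", "www.example.com", "A")

  if __name__ == "__main__":
      unittest.main()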


> Ends up in a test environment for, what, a month.. nothing using getaddrinfo from glibc was used to test that environment, and no one noticed it was broken

This is the part that is shocking to me. How is getaddrinfo not called in any unit or system tests?
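
Even a trivial end-to-end check would have done it: on CPython, socket.getaddrinfo is a thin wrapper around the C library's getaddrinfo, so running something like the following inside a glibc-based image pointed at the test resolver exercises exactly the code path that broke (a rough sketch of mine - the hostname is a placeholder):

  # Smoke-test sketch: on a glibc-based system this calls straight into
  # glibc's getaddrinfo, so a CNAME chain returned in an order glibc can't
  # follow shows up as a failed lookup instead of a list of addresses.
  import socket

  def test_cname_name_resolves():
      # "www.example.com" stands in for any name that is a CNAME to an
      # A/AAAA record, resolved via the environment's test resolver.
      results = socket.getaddrinfo("www.example.com", 443, proto=socket.IPPROTO_TCP)
      assert results, "expected at least one resolved address"

  if __name__ == "__main__":
      test_cname_name_resolves()
      print("ok")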


As black3r mentioned (https://news.ycombinator.com/item?id=46686096), it is likely rearranged by systemd, therefore only non-systemd glibc distributions are affected.

I would hazard a guess that their test environment has both the systemd variant and the Unbound variant (Unbound technically does not re-order them, but instead reconstructs the chain according to the RFC's "CNAME restart" logic, because it is a recursive resolver itself), but not a plain, directly-piped resolv.conf (presumably because who would run that in this day and age - sadly only a half-joke, because only a few people fall into that category).


> it is likely rearranged by systemd, therefore only non-systemd glibc distributions are affected.

systemd doesn't imply systemd-resolved is installed and running, though. I believe it's usually not enabled by default.


> I believe it's usually not enabled by default.

Just check modern OSes now - they definitely do mediate via systemd-resolved (including server OSes).


Probably Alpine containers, so musl's version instead of glibc's.

I was even more surprised to see that the RFC draft had original text from the author dating back to 2015. https://github.com/ableyjoe/draft-jabley-dnsop-ordered-answe...

We used to say at work that the best way to get promoted was to be the programmer that introduced the bug into production and then fix it. Crazy if true here...


What you're suggesting seems like a spectacular leap. I do not think it is very likely that the unnamed employee at Cloudflare that was micro-optimising code in the DNS resolver is also the author of this RFC, Joe Abley (the current Director of Engineering at the company, and formerly Director of DNS Operations at ICANN).

> I assume at least some tests broke, which meant they needed to be "fixed up"

OP said:

"However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC."

One could guess it's something like -- back when we wrote the tests, years ago, whoever did it missed that this was required, not helped by the fact that the spec preceded RFC 2119 standardising the all-caps "MUST"/"SHOULD" etc. language, which would have helped us translate specs to tests more completely.


You'd think that something this widely used would have golden tests that detect any output change to trigger manual review but apparently they don't.

Oh, they explain, if I understand right, that they made the output change intentionally, for performance reasons. It was based on the inaccurate assumption that order did not matter in DNS responses -- because there are OTHER aspects of DNS responses in which, by spec, order does not matter, and because there were no tests saying order mattered for this component.

> "The order of RRs in a set is not significant, and need not be preserved by name servers, resolvers, or other parts of the DNS." [from RFC]

> However, RFC 1034 doesn’t clearly specify how message sections relate to RRsets.

The developer(s) assumed order didn't matter in general, because the RFC said it didn't for one aspect, and intentionally made a change to the order for performance reasons. But it turned out that change did matter.

Mistakes of this kind seem unavoidable; this one doesn't necessarily say to me that the developers made a mistake I never could, or something.

I think the real conclusion is they probably need tests using actual live network stacks with common components - and why didn't they have those? Not just unit tests or tests with mocks, but tests that would have actually used the real getaddrinfo function in glibc and shown it failing.


Even if there weren't tests for the return order, I would have bet that there were tests of backbone resolvers like getaddrinfo. Is it really possible that the first time anyone noticed that it crashed, or that Ciscos bootlooped, was on a live query?

Yes, at least they should test the glibc case.

It's interesting seeing parts of life overlap.

I did music production at the same time as heavily using SVN and starting to use Git - I didn't cross the two over at the time. All (in my case) Cubase files just got -1, -2 suffixes and it worked. I had continuous backups, sure, and it just kinda worked at the time.

Given I now use Git heavily in my work/hobby life, when doing other projects (3D models for printing (questionable at best) and artwork (very very very questionable at best)) I definitely wanted to use some sort of SCM. I opted for Perforce for these - mostly to experiment, but also for the idea of having binaries in a distributed SCM. Yes, I know Git-LFS _exists_, but to me it breaks the idea of what Git is.. relying on a server for binaries in a situation where everything should be distributed.

If I now went back to audio production, I would probably consider either Perforce or SVN - Perforce only if it were for a single user (because of licensing). The ability to natively clone/checkout a single directory of a repo at a given point in time, make modifications and push them back is pretty much a necessity when dealing with very large files.

And I still use SVN for _some_ situations - particularly those where Perforce is overkill and all I want is to _always_ be at HEAD, with the rest as history (kept for manual preservation), and there's no real need for merging and branching (thinking wikis and other plain-text tooling).

As for any sort of binary merging - I _heavily_ assume this isn't expected in the poster's situation!


I originally wrote the speech in my blog repo, just for writing purposes.

My dad's funeral was yesterday and I wondered whether, maybe, someone might appreciate it - either because they've lost their dad or because it makes them appreciate their dad a little more.


I remember reading this and having a mini-midlife-crisis after every read

I documented it this time :sigh: https://github.com/MatthewJohn/terrareg/commit/2231ba733a7f5...


> effectively turning the developer's most trusted assistant into an unwitting accomplice

"Most trusted assistant" - that made me chuckle. The assistant that hallucinates packages, avoides null-pointer checks and forgets details that I've asked it.. yes, my most trusted assistant :D :D


My favorite is when it hallucinates documentation and API endpoints.


Well, "trusted" in the strict CompSec sense: "a trusted system is one whose failure would break a security policy (if a policy exists that the system is trusted to enforce)".


Well my most trusted assistant would be the kernel by that definition


I don't even trust myself, so why would anyone trust a tool? This is important because not trusting myself means I will set up loads of static tools - including security scanners, which Microsoft and GitHub are also actively promoting people use - that should also scan AI-generated code for vulnerabilities.

These tools should definitely flag up the non-explicit use of hidden characters, amongst other things.


I wonder which understands the effect of null-pointer checks in a compiled C program better: the state-of-the-art generative model or the median C programmer.


Given that the generative model was trained on the knowledge of the median C programmer (aka The Internet), probably the programmer, as most of them do not tend to hallucinate or make up facts.


This kind of nonsense prose has "AI" written all over it. Either way - whether your writing was AI generated/edited or you put that little thought into it - it reads as such, and it doesn't do its author any favours.


Are you talking about my comment or the article? :eyes:


Sure, you're right in most cases. In the use-case I had, it's a private registry with "immutable" tags (at least enough to stop accidental overwrites - and it is a homelab, so if someone else did it, I'd have worse problems ;))

The point was more about using null_resource triggers (or `terraform_data`, I see) and the trigger-based replacement, with the Docker resources purely as an illustration.


Absolutely - needs must - it's certainly not a nice thing.. but again, Terraform isn't a scripting language, so sometimes bits of hackery are needed!


Good point - I hadn't actually looked massively hard into solving it with this provider - I had to do it again for another use-case recently and decided to blog about it (and also try my hand at a short post).. but used this example from a while ago because it seemed much more relatable than the latest encounter :D

I guess, assuming you're not building the image, whether you use the data source or the image resource probably isn't too important (assuming the data source is able to look up images that aren't present on the local machine :thinking:).

Edit: and now I've seen that in the docker image resource, they reference using the data source to be able to track remote image SHA changes, in order to trigger an image re-pull :doh:

Feels like we've gone full-circle with this :D


Great find and post.

I've run into this exact thing. Luckily rebuilding a container doesn't cause downtime for us and 99% of our changes require rebuilding an image, so I've just left it as is...

It is annoying though when we make a small infra change and have to wait for the container image to build...


Similarly, older versions (<3.0) of the provider had a `build` attribute for the `docker_registry_image` resource, which made it possible to build and publish an image to a registry, without causing unnecessary rebuilds if there was no local version of the image on the build host.

Now you have to use the `docker_image` resource to build a local image on the build host, and then use the `docker_registry_image` resource to publish it to the registry. In a CI/CD scenario with ephemeral runners, there will never be a local version of the image on the build host, so the image will always be rebuilt on every Terraform run, even if there are no changes to it.

It's a tricky problem to solve from a provider design standpoint, since building a Docker image necessarily creates a local Docker image on the build host, which may not be a desirable side effect for the `docker_registry_image` resource to have, and it raises other design questions with no universal answers (Should it delete the local image after building? What if there's already a local image with the same name/tag, but it's not in the Terraform state; should it use the existing one or build a new one and overwrite it? If the `docker_registry_image` resource is removed, should any corresponding local images also be deleted? etc.)


I would say yes and no (leaning towards no)...

I think saying you don't have a right is fine... they are providing a service and dictating its usage, and you are using it.

So on the "closing your eyes". On one side, yes, allowing your browser to play the video and YT then being able to treat as a advert view means that youtube gets paid and the creator gets paid.

However... I would personally look at whether a person can do this as a generalisation, and there I would say "no", because if everyone did this (why does just one person have the right to close their eyes?), then (at least I'd imagine) the companies paying for advertising would see a drop in click-throughs and (I don't know what you call it.. but let's just say) the money they make. They'd then stop paying for adverts, then no companies would want to pay for adverts, and YT is no longer profitable (to YT or the creators).


Even entertaining the idea is extremely disturbing and dystopian. Having control over what we watch and what we listen to should be basic human rights. And those are inalienable, meaning we can't sign away those rights, not in a contract, not in any terms of service.

People who accept that as something a company should be allowed to do are a massive problem. Because of you, they might actually do it. It will start by making sure you cannot mute the sound in any way, designing hardware in a way to enforce that - devices will start overriding the use of external speakers and play ads from internal ones to make absolutely sure you haven't muted it. Next they will force always-on cameras on us which will make sure our eyes are open and looking at the ad. Next we will have brain implants to make sure you're actually paying attention and not thinking about something else.

I find it extremely disturbing that you don't feel disgusted about even thinking of "yes".


This is like saying you have the rights to mute your work meetings. Sure you do, but you just won't be employed anymore. I don't see that as a problem, because being employed and watching YouTube are not essential services nor human rights.


Then the correct solution would be to let everyone pay just $1 per month for watching YouTube without any ads. YouTube would be funded and users would see no ads. But I suspect that even in that case YouTube would still want to show ads - even though the users would be paying for NOT seeing ads...


Why is this correct? Youtube is a private company. It's very much allowed to charge $200 per month while playing tons of ads for you. Youtube sets the price and its policy, not the users.


I see what you mean.. but if you take a look at Vista vs 7..

Microsoft shoved glass panels, widgets and such down the user's throat in Vista. It was a new look and they wanted to make you realise it. Without spinning a fresh 7 machine now, I'm certain it was very toned down.

But, I could be very wrong about this :D Last time I used Windows was XP (I mean, granted last week) because nostalgia is a real thing :D

Edit: I can't reply (not sure why, thread too deep?) but @cogman10, you're right! My memory is bad :(


> Without spinning a fresh 7 machine now, I'm certain it was very toned down.

I think they were roughly the same

Vista: https://img.sysnettechsolutions.com/What-is-Windows-Vista-Ne...

Windows 7: https://blog.thinprint.com/uploads/tp/sites/6/2019/08/319410...


I never tried with Vista but I customized the appearance of Windows 7 on my old computer and it looked nothing like this. I remember lots of people hated the frosted glass look when it first released, and personally I never got used to it. The OS peaked aesthetically with Windows 2000 and has been on a steady decline since then.


>Without spinning a fresh 7 machine now, I'm certain it was very toned down

Your memories are not serving you well. Windows 7 was very glass and very aero when it was first released.

You can disable the transparency and fake glossiness, but then it's just a pale blue glossy glass colour instead of transparent.


  1) Install any Windows version from the past decade in any machine
  2) Go to 'performance' and remove all visuals
  3) download and install ClassicShell to have a decent Start button
  4) download and use (most are portable) any Privacy settings tool
  5) find, download and install WindowsFirewallControl v4.9.x.x and use it on MediumFiltering with "Display Notifications" (you get the 'ZoneAlarm experience')
  6) Uninstall all the crapware and disable many services (I use SysInternals Autoruns64)(Winternals for the older ones)
  7) Happy Days!!


Wow, that's a lot more work than my Fedora setup.

Windows is only free if you don't value your time, huh?


Microsoft put bread on my table. I started working in a company that ran 99% of their servers on MS OSes. I don't see it as "I am wasting time"; I see it as: there is this tech called "Windows", and the more I know about the Registry, DLLs, etc., the better I am at my job.

If Fedora setup puts food on your table, go for it!! I 'bet' on Microsoft for my career and it has taken me around the world multiple times, so yeah :)

Tbh I don't change machines 'that' often. My laptop is from 2015. My desktop is 3? 4? years old, and I won't be changing either in the next 5-6 years - they do 'ok' for home use.

But every now and then (every couple of years) I do buy some second-hand cheap Surface, or HP Elite, or similar Win tablets, set them up, and keep them on the side (OS and Firefox only) just in case someone needs a laptop urgently, or I travel somewhere and want a 'burner' machine with zero data/software on it.


> Microsoft shoved glass panels, widgets and such down the user's throat

Please listen to yourself. It's just a style, they didn't kill your dog. The UI was fine, beautiful even, especially coming from Lego XP — which if you think about it was really tacky, literally an RGB palette.

