Hacker News | Matthias247's comments

When I first read about it I assumed it was a "poison pill" - a bad config where ingesting the config causes the process to crash/restart. And because the process then crashes again on startup, there is no automated way to revert to a good config. These are the worst kinds of issues that all global control planes have to deal with.

The report actually seems to confirm this - it was indeed a crash on ingesting the bad config. However, I'm surprised that the long duration didn't come from "it takes a long time to restart the fleet manually" or "the tooling to restart the fleet was bad".

The problem mostly seems to have been "we didn't know what's going on". A look into the proxy logs would hopefully have shown the stack trace from the unwrap, and metrics about incoming requests would hopefully have shown that there was no abnormal amount of requests coming in.


As far as I remember from building these things with others within the async Rust ecosystem (hey Eliza!), there was a certain tradeoff: if you weren't able to select on references, you couldn't run into this issue. However, you also wouldn't be able to use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.
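
To illustrate the pattern that selecting on references enables, here is a minimal sketch (the function, the timer, and the values are made up for illustration): the lock future is created and pinned once, and the loop selects on a `&mut` reference to it, so the task keeps its place in tokio's fair mutex queue even when the timer branch wins.

  use std::sync::Arc;
  use std::time::Duration;
  use tokio::sync::Mutex;

  async fn wait_for_lock(state: Arc<Mutex<u64>>) {
      let mut interval = tokio::time::interval(Duration::from_millis(100));

      // Create the lock future once and pin it. Selecting on `&mut lock_fut`
      // keeps our place in the mutex's fair wait queue; recreating the future
      // in every iteration would push us to the back of the queue each time
      // the timer branch wins.
      let lock_fut = state.lock();
      tokio::pin!(lock_fut);

      loop {
          tokio::select! {
              guard = &mut lock_fut => {
                  println!("got the lock, value = {}", *guard);
                  return;
              }
              _ = interval.tick() => {
                  println!("still waiting for the lock...");
              }
          }
      }
  }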

I fully agree that this and the cancellation issues discussed before can lead to surprising issues even for seasoned Rust experts. But I'm not sure what really can be improved under the main operating model of async Rust (every future can be dropped).

But compared to working with callbacks the amount of surprising things is still rather low :)


Indeed, you are correct (and hi Matthias!). After we got to the bottom of this deadlock, my coworkers and I had one of our characteristic "how could we have prevented this?" conversations, and reached the somewhat sad conclusion that actually, there was basically nothing we could easily blame for this. All the Tokio primitives involved were working precisely as they were supposed to. The only thing that would have prevented this without completely re-designing Rust's async from the ground up would be to ban the use of `&mut future`s in `select!`...but that eliminates a lot of correct code, too. Not being able to do that would make it pretty hard to express a lot of things that many applications might reasonably want to express, as you described. I discussed this a bit in this comment[1] as well.

On the other hand, the coworker who had written the code where we found the bug wasn't to blame, either. It wasn't a case of sloppy programming; he had done everything correctly and put the pieces together the way you were supposed to. All the pieces worked as they were supposed to, and his code used them correctly, but the interaction of those pieces resulted in a deadlock that would have been very difficult for him to anticipate.
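
For readers who want to see the shape of the problem, here is a minimal sketch (made-up names, not the actual code from our incident): a `&mut future` polled from a select! loop holds a tokio::sync::Mutex guard across an await, while one of the select arm bodies awaits the same lock on the same task.

  use std::sync::Arc;
  use std::time::Duration;
  use tokio::sync::Mutex;

  async fn increment(counter: Arc<Mutex<u64>>) {
      // Holds the guard across an await point.
      let mut guard = counter.lock().await;
      tokio::time::sleep(Duration::from_millis(10)).await;
      *guard += 1;
  }

  #[tokio::main]
  async fn main() {
      let counter = Arc::new(Mutex::new(0u64));
      let mut interval = tokio::time::interval(Duration::from_millis(1));

      let fut = increment(counter.clone());
      tokio::pin!(fut);

      loop {
          tokio::select! {
              biased;
              _ = &mut fut => break,
              _ = interval.tick() => {
                  // `fut` holds the lock and is parked in its sleep. This
                  // await queues the *same task* behind it in the fair wait
                  // queue; the task never gets back to polling `fut`, so the
                  // guard is never released: a single-task deadlock.
                  let value = *counter.lock().await;
                  println!("current value: {value}");
              }
          }
      }
  }

Once `fut` owns the guard, the arm body's lock().await can never complete, because completing it would require the same task to keep polling `fut` so it can drop the guard.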

So, our conclusion was, wow, this just kind of sucks. Not an indictment of async Rust as a whole, but an unfortunate emergent behavior arising from an interaction of individually well-designed pieces. Just something you gotta watch out for, I guess. And that's pretty sad to have to admit.

[1] https://news.ycombinator.com/item?id=45776868


> All the Tokio primitives involved were working precisely as they were supposed to. The only thing that would have prevented this without completely re-designing Rust's async from the ground up would be to ban the use of `&mut future`s in `select!`...but that eliminates a lot of correct code, too.

But it still suggests that `tokio::select` is too powerful. You don't need to get rid of `tokio::select`, you just need to consider creating a less powerful mechanism that doesn't risk exhibiting this problem. Then you could use that less powerful mechanism in the places where you don't need the full power of `tokio::select`, thereby reducing the possible places where this bug could arise. You don't need to get rid of the fully powerful mechanism, you just need to make it optional.
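
As a sketch of what such a less powerful mechanism could look like (hypothetical, not an existing tokio API): a combinator that only accepts owned futures and always drops the loser, so it can never keep a half-polled future parked across loop iterations.

  use std::future::{poll_fn, Future};
  use std::pin::pin;
  use std::task::Poll;

  pub enum Either<A, B> {
      Left(A),
      Right(B),
  }

  /// Hypothetical restricted combinator: races two *owned* futures, returns
  /// the winner's output and drops (cancels) the loser. Because it refuses
  /// `&mut` references to futures, it cannot keep a half-polled future
  /// (e.g. one already sitting in a mutex wait queue) parked across loop
  /// iterations the way `select!` can. It also always polls `a` first;
  /// a real implementation would randomize for fairness.
  pub async fn race<A: Future, B: Future>(a: A, b: B) -> Either<A::Output, B::Output> {
      let mut a = pin!(a);
      let mut b = pin!(b);
      poll_fn(move |cx| {
          if let Poll::Ready(out) = a.as_mut().poll(cx) {
              return Poll::Ready(Either::Left(out));
          }
          if let Poll::Ready(out) = b.as_mut().poll(cx) {
              return Poll::Ready(Either::Right(out));
          }
          Poll::Pending
      })
      .await
  }

For the common timeout case specifically, tokio::time::timeout is already such a narrower tool.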


I feel like select!() is a good case study because the common future timeout use-case maps pretty closely to a select!(), so there is only so much room to weaken it.

The ways I can think of for making select!() safer all involve runtime checks and allocations (possibly this is just a failure of my imagination!). But if that's the case, I would find it bothersome if our basic async building blocks like select/timeout in practice turn out to require more expensive runtime checks or allocations to be safe.

We have a point in the async design space where we pay a complexity price, but in exchange we get really neat zero-cost futures. But I feel like we only get our money's worth if we can actually statically prove that correct use won't deadlock, without the expensive runtime checks! Otherwise, can we afford to spend all this complexity budget?

The implementation of select!() does feel way too powerful in a way (it's a whole mini scheduler that creates implicit future dependencies hidden from the rest of the executor, and then sometimes this deadlocks!). But the need is pretty foundational, it shows up everywhere as a building block.


It feels to me like there's plenty of design space to explore. Sure, it's possible to view "selection" as a basic building block, but even that is insufficiently precise IMO. There's a reason that Javascript provides all of Promise.any and Promise.all and Promise.allSettled and Promise.race. Selection isn't just a single building block, it's an entire family of building blocks with distinct semantics.


You must guarantee forward progress inside your critical sections and that means your critical sections are guaranteed to finish. How hard is that to understand? From my perspective this situation was basically guaranteed to happen.

There is no real difference between a deadlock caused by a single thread acquiring the same non-reentrant lock twice and a single thread with two virtual threads, where the first virtual thread calls the code of the second one inside its critical section. They are the same type of deadlock, caused by the same fundamental problem.

>Remember too that the Mutex could be buried beneath several layers of function calls in different modules or packages. It could require looking across many layers of the stack at once to be able to see the problem.

That is a fundamental property of mutexes. Whenever you have a critical section, you must be 100% aware of every single line of code inside that critical section.

>There’s no one abstraction, construct, or programming pattern we can point to here and say "never do this". Still, we can provide some guidelines.

The programming pattern you're looking for is guaranteeing forward progress inside critical sections. Only synchronous code is allowed to be executed inside a critical section. The critical section must be as small as possible. It must never be interrupted, ever.

Sounds like a pain in the ass, right? That's right, locks are a pain in the ass.
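
To make that concrete, an illustrative sketch (not code from the article): acquire the lock, do the synchronous work, and drop the guard before the next await point.

  use std::sync::Arc;
  use std::time::Duration;
  use tokio::sync::Mutex;

  async fn update(counter: Arc<Mutex<u64>>) {
      // Critical section: acquire, do the synchronous work, drop the guard.
      // No awaits while the lock is held.
      {
          let mut guard = counter.lock().await;
          *guard += 1;
      } // guard dropped here, before any further await point

      // Anything that can suspend happens outside the critical section.
      tokio::time::sleep(Duration::from_millis(10)).await;
  }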


> However you also wouldn’t be able run use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.

No, just have select!() on a bunch of owned Futures return the futures that weren't selected instead of dropping them. Then you don't lose state. Yes, this is awkward, but it's the only logically coherent way. There is probably some macro voodoo that makes it ergonomic. But even this doesn't fix the root cause because dropping an owned Future isn't guaranteed to cancel it cleanly.

For the real root cause: https://news.ycombinator.com/item?id=45777234
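
For the two-future case, the existing futures::future::select combinator already has roughly this "hand back the loser" shape (a sketch with made-up future names; note that it requires Unpin futures):

  use std::future::Future;
  use futures::future::{self, Either};

  async fn run(
      read_fut: impl Future<Output = String> + Unpin,
      tick_fut: impl Future<Output = ()> + Unpin,
  ) {
      match future::select(read_fut, tick_fut).await {
          Either::Left((line, tick_fut)) => {
              // The read finished first; `tick_fut` is handed back un-dropped
              // and could be stored and polled again later.
              println!("read: {line}");
              let _keep_for_later = tick_fut;
          }
          Either::Right(((), read_fut)) => {
              // The tick fired first; `read_fut` keeps its internal state
              // (buffered data, queue position, ...) and can still complete.
              let line = read_fut.await;
              println!("read after tick: {line}");
          }
      }
  }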


> No, just have select!() on a bunch of owned Futures return the futures that weren't selected instead of dropping them. Then you don't lose state.

How does that prevent this kind of deadlock? If the owned future has acquired a mutex, and you return that future from the select so that it might be polled again, and the user assigns it to a variable, then the future that has acquired the mutex but has not completed is still not dropped. This is basically the same as polling an `&mut future`, but with more steps.


> How does that prevent this kind of deadlock?

Like I said, it doesn't:

> even this doesn't fix the root cause because dropping an owned Future isn't guaranteed to cancel it cleanly.

It fixes this:

> However you also wouldn’t be able run use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.

If you want to fix the root cause, see https://news.ycombinator.com/item?id=45777234


Some other material that has been written by me on that topic:

- Proposal from 2020 about async functions which are forced to run to completion (and thereby would use graceful cancellation if necessary). Quite old, but I still feel that no better idea has come up so far. https://github.com/Matthias247/rfcs/pull/1

- Proposal for unified cancellation between sync and async Rust ("A case for CancellationTokens" - https://gist.github.com/Matthias247/354941ebcc4d2270d07ff0c6...)

- Exploration of an implementation of the above: https://github.com/Matthias247/min_cancel_token
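
For a flavor of what token-based cancellation looks like in practice today, here is a minimal sketch using the existing tokio_util::sync::CancellationToken (similar in spirit to, but not identical with, the proposal's API):

  use tokio_util::sync::CancellationToken;

  async fn worker(token: CancellationToken) {
      loop {
          tokio::select! {
              _ = token.cancelled() => {
                  // We get here cooperatively: run cleanup, flush buffers,
                  // release resources, then return.
                  break;
              }
              _ = do_one_unit_of_work() => {}
          }
      }
  }

  async fn do_one_unit_of_work() { /* ... */ }

  #[tokio::main]
  async fn main() {
      let token = CancellationToken::new();
      let handle = tokio::spawn(worker(token.clone()));

      // Later: request cancellation and wait for the worker to wind down
      // gracefully instead of just dropping its future.
      token.cancel();
      handle.await.unwrap();
  }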


Clarification question:

> When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database (ref) during the commit phase of the transaction, effectively serializing all commits.

It only serializes commits where NOTIFY was issued as part of the transaction, right? Transactions which did not call NOTIFY should not be affected?


I think the only guessing situation in the game is the starting point (first click). After that, everything can be solved by deduction.


That's not the case. I have run into such situations. And I suppose the program can identify those as well and put the bomb where you didn't click.


I've played that one a lot (hundreds of games on Evil difficulty). So far there's no guessing required after the initial click. Everything can be deduced, even though some deductions are really hard (you have to work through the mine constraints from multiple starting points and across multiple hops in order to determine the next safe block).


How does it work technically?

Does Whatsapp expose these messages via an API? If yes, then it seems like this is not only on Google.

If no: Are they reading data from raw UI widgets? Are they intercepting input controls? Are they intercepting network traffic? That seems unlikely, given it's probably end-to-end encrypted and the decryption happens within the scope of the Whatsapp process.


> If no: Are they reading data from raw UI widgets? Are they intercepting input controls?

Why not... they control the OS; it'd be trivial to add hooks to the "draw widget" command to detect that it's about to draw a text widget for WhatsApp, and then log the text.


My understanding (may be wrong):

WhatsApp data is encrypted, however, the keys are on the device itself and accessible on Android. There are many third-party apps that support transferring WhatsApp data from one phone to another, and some even claim to do so between Android and iOS devices. As I understand it, the chats are in some common database format. So anyone with access to the device can read the data even without WhatsApp being there itself (as long as the data is there).


I don't think it's quite as simple as that. The keys are stored in a storage area that Android locks off as WhatsApp's alone; no other app can get to those keys.

At the very least you'd need to root your device, but even that might not be quite enough going by my memory of trying to export my chats once. I remember the only documented working path included something like installing a shady, modified APK of a legacy WhatsApp version with an outdated encryption method to a second device and then somehow getting the new app to write a backup in the legacy format, to then restore to the fake second device and decrypt. I quit there because the risk of actually losing my entire backup seemed too high. And that was about five years ago, so I'd assume if anything, it's even more difficult today.


Maybe it uses Accessibility...

>When granted, an app with accessibility permission can:

  Read screen content (including text and buttons in other apps)
  Detect user interactions (like taps, swipes, or gestures)
  Navigate between apps and the system UI
  Monitor app launches and foreground/background changes
  Access and control other apps indirectly
  Perform gestures or clicks on behalf of the user


>Does Whatsapp expose these messages via an API?

Whatsapp has dark patterns that "guide" you to "archive" your chats on google drive.


No other app can get to that backup data though except the original one that made the backup. Not even the owner of the account is allowed access to it (which I'm almost sure is a GDPR violation)!

I'm not saying it's impossible that Google just grants their own app an (IMO indefensible) exception to this. But the potential shitstorm would be massive, so I assume they probably use some other way, such as screen recording or accessibility features.


I'm wondering too.

In general this seems to work only if there's a single offline client that accepts the writes.

With limitations on the data schema (e.g. distinct tables per client), it might work with multiple clients. However, those limitations would need to be documented, and I couldn't see anything about that in this blog post.


Besides what's called out by others: it would be nice to have the usual "press both mouse buttons" action on a number to open all non-flagged fields around it. That reduces the number of required clicks a bit.

But maybe it's intentional, to avoid one player scoring too many points at once?


That does work already. You don't even have to use both mouse buttons, just left click is enough.


If you want to stream data inside an HTTP body (on any HTTP version), then the ReadableStream/WritableStream APIs are the appropriate ones (https://developer.mozilla.org/en-US/docs/Web/API/Streams_API) - however, at least in the past they had not been fully standardized and implemented by browsers. Not sure what the latest state is.

WebTransport is a bit different: it offers raw QUIC streams that run concurrently with the streams carrying the HTTP/3 requests on the shared underlying HTTP/3 connection, and it also offers a datagram API.


Duplex streams are not really an HTTP/2-only feature. You can do the same bidirectional streaming with HTTP/1.1 too. The flow is always:

1. The client sends a header set.
2. It can then start streaming data to the server as an unlimited-length byte stream.
3. The server starts to send a header set back to the client.
4. The server can then start streaming data to the client as an unlimited-length byte stream.

There is not even a fixed order between 2) and 3). The server can start sending headers or body data before the client sent any body byte.
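
A minimal sketch of that flow over a raw socket (blocking std only, no error handling or response parsing, host and path are placeholders): the client streams a chunked request body from one thread while concurrently reading whatever the server already sends back.

  use std::io::{Read, Write};
  use std::net::TcpStream;
  use std::thread;

  fn main() -> std::io::Result<()> {
      let stream = TcpStream::connect("example.com:80")?;
      let mut writer = stream.try_clone()?;
      let mut reader = stream;

      // Steps 1 + 2: send the request headers, then stream the body as chunks.
      let send = thread::spawn(move || -> std::io::Result<()> {
          writer.write_all(b"POST /stream HTTP/1.1\r\n")?;
          writer.write_all(b"Host: example.com\r\n")?;
          writer.write_all(b"Transfer-Encoding: chunked\r\n\r\n")?;
          for i in 0..3 {
              let chunk = format!("tick {i}\n");
              writer.write_all(format!("{:x}\r\n", chunk.len()).as_bytes())?;
              writer.write_all(chunk.as_bytes())?;
              writer.write_all(b"\r\n")?;
          }
          writer.write_all(b"0\r\n\r\n")?; // end of the request body
          Ok(())
      });

      // Steps 3 + 4: the response may already be arriving while the request
      // body above is still being written. Read until the server closes.
      let mut buf = [0u8; 4096];
      loop {
          let n = reader.read(&mut buf)?;
          if n == 0 {
              break;
          }
          print!("{}", String::from_utf8_lossy(&buf[..n]));
      }

      send.join().unwrap()
  }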

What is correct is that a lot of servers and clients (including javascript in browsers) don't support this and make stricter assumptions regarding how HTTP requests are used - e.g. that the request bytes are fully sent before the response happens. I think ReadableStream/WritableStream APIs on browsers were supposed to change that, but I haven't followed the progress in the last few years.

NGINX falls into the same category. Its HTTP/2 support (and gRPC support) was built with a very limited use-case in mind. That's also why various CDNs and service meshes use different kinds of HTTP proxies - so that streaming workloads don't break in case the way the protocol is used is not strictly request->response.


No browser I'm aware of is planning on allowing the request and response bodies to be streamed simultaneously for the same request using ReadableStream and WritableStream. When using streaming request bodies, you have to set the request explicitly to half-duplex.

Anyways, yes, this is technically true, but the streaming semantics are not really that well-defined for HTTP/1.1, probably because it was simply never envisioned. The HTTP/1.1 request and response were viewed as unary entities and the fact that their contents were streamed was mostly an implementation detail. Most HTTP/1.1 software, not just browsers, ultimately treat the requests and responses of HTTP as different and distinct phases. For most uses of HTTP, this makes sense. e.g. for a form post, the entire request entity is going to need to be read before the status can possibly be known.

Even if we do allow bidirectional full-duplex streaming over HTTP/1.1, it will block an entire TCP connection for a given hostname, since HTTP/1.1 is not multiplexed. This is true even if the connection isn't particularly busy. Obviously, this is still an issue even with long-polling, but that's all the more reason why HTTP/2 is simply nicer.

NGINX may always be stuck in an old school HTTP/1 mindset, but modern software like Envoy shows a lot of promise for how architecting around HTTP/2 can work and bring advantages while remaining fully backwards compatible with HTTP/1 software.


> I think ReadableStream/WritableStream APIs on browsers were supposed to change that, but I haven't followed the progress in the last few years.

There has been a lot of pushback against supporting full-duplex streams[0].

[0]: https://github.com/whatwg/fetch/issues/1254


> a lot of servers and clients (including javascript in browsers) don't support this

To say nothing about the many http proxies in between.

