Hacker News | sujayakar's comments

switch to auto mode and it should still work!


GPT is working in agent mode, which kind of confirms that Claude is hosted on Google and GPT probably on MSFT servers / self-hosted.


If you want a stronger confirmation about Claude being hosted on GCP, this is about as authoritative as it gets: https://www.anthropic.com/news/anthropic-partners-with-googl...


That's nearly 2.5 years old, an eternity in this space. It may still be true, but that article is not good evidence.


Claude runs on AWS afaik. And OAI on Azure. Edit: oh okay maybe GCP too then. I’m personally having no problem using Claude Code though.


Here's an interesting negative result.

After watching this video, my first thought was whether recent results from columnar compression (e.g. https://docs.vortex.dev/references#id1) applied "naively" like QOI would have good results.

I started with a 1.79MiB sprite file for a 2D game I've been hacking on, and here are the results:

  PNG: 1.79 MiB
  QOI: 2.18 MiB
  BtrBlocks: 3.69 MiB
(Source: https://gist.github.com/sujayakar/aab7b4e9df01f365868ec7ca60...)

So, there's more magic to being Quite OK than just applying compression techniques from elsewhere :)
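Concretely, the "naive columnar" framing is something like the sketch below: decode the sprite to RGBA, transpose the pixels into one column per channel, and compress each column independently. The `compress_column` closure is a stand-in for a BtrBlocks/vortex-style cascading encoder, not a real API from either project.

  // Rough sketch of the "naive columnar" framing: decode the sprite to RGBA,
  // transpose the pixels into one column per channel, and compress each column
  // independently. `compress_column` is a placeholder for a cascading columnar
  // encoder, not an actual BtrBlocks or vortex API.
  fn columnar_layout(rgba: &[u8]) -> [Vec<u8>; 4] {
      let mut cols = [Vec::new(), Vec::new(), Vec::new(), Vec::new()];
      for px in rgba.chunks_exact(4) {
          for (col, &byte) in cols.iter_mut().zip(px) {
              col.push(byte);
          }
      }
      cols
  }

  fn compressed_size(rgba: &[u8], compress_column: impl Fn(&[u8]) -> Vec<u8>) -> usize {
      columnar_layout(rgba)
          .iter()
          .map(|col| compress_column(col.as_slice()).len())
          .sum()
  }

My guess at the gap: QOI exploits redundancy between consecutive pixels (runs, small diffs, an index of recently seen colors), and splitting each pixel across four independent columns throws a lot of that structure away.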


DeepSeek is already using SSDs for their KV cache: https://github.com/deepseek-ai/3FS


You are deeply misunderstanding what the KV cache referred to here is. It's not for storing data. This is the KV cache that's part of the model, used to reduce the quadratic compute complexity of self-attention to linear. This is not stored on SSD - it's in VRAM (or in CPU memory if you're not using a GPU).
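To make that concrete, here is a minimal sketch (made-up Rust types, single head, no scaling or masking) of what the per-layer cache does at each decode step:

  // Minimal sketch of a per-layer KV cache during autoregressive decoding:
  // the new token's key/value are appended once, and attention only scores
  // the query against what is already cached. That's what turns the per-step
  // cost from "recompute K/V for the whole prefix" into linear work.
  struct LayerKvCache {
      keys: Vec<Vec<f32>>,   // one key vector per previously generated token
      values: Vec<Vec<f32>>, // one value vector per previously generated token
  }

  impl LayerKvCache {
      fn decode_step(&mut self, q: &[f32], k: Vec<f32>, v: Vec<f32>) -> Vec<f32> {
          // Only the *new* token's K/V get computed and appended.
          self.keys.push(k);
          self.values.push(v);

          // softmax(q · K) over the cache, then a weighted sum of the values.
          let scores: Vec<f32> = self.keys.iter().map(|key| dot(q, key)).collect();
          let weights = softmax(&scores);
          let mut out = vec![0.0; self.values[0].len()];
          for (w, val) in weights.iter().zip(&self.values) {
              for (o, x) in out.iter_mut().zip(val) {
                  *o += w * x;
              }
          }
          out
      }
  }

  fn dot(a: &[f32], b: &[f32]) -> f32 {
      a.iter().zip(b).map(|(x, y)| x * y).sum()
  }

  fn softmax(xs: &[f32]) -> Vec<f32> {
      let m = xs.iter().copied().fold(f32::NEG_INFINITY, f32::max);
      let exps: Vec<f32> = xs.iter().map(|x| (x - m).exp()).collect();
      let sum: f32 = exps.iter().sum();
      exps.iter().map(|e| e / sum).collect()
  }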


They do, in fact, mention the inference KV cache as a use case in the readme. The most advanced KV caching setups use a hierarchy of GPU RAM / regular RAM / SSD. It seems they were able to use their storage abstraction for the last tier.


https://github.com/deepseek-ai/3FS?tab=readme-ov-file#3-kvca...

KVCache is a technique used to optimize the LLM inference process. It avoids redundant computations by caching the key and value vectors of previous tokens in the decoder layers. The top figure demonstrates the read throughput of all KVCache clients (1×400Gbps NIC/node), highlighting both peak and average values, with peak throughput reaching up to 40 GiB/s


That's because DeepSeek uses MLA which apparently does allow offloading the KV cache. That doesn't apply to all models, particularly the open-weight models that are primarily GQA AFAIK.


Any model allows offloading the KV cache; it's not a matter of model architecture, only of the implementation. The only somewhat different case is non-transformer models. For all attention models it's the same - a blob of data per token. It's much worse for older models with MHA because their KV cache is just too big, and it's better for DeepSeek because their KV cache is the smallest. But it's fine for the current generation of GQA models as well.


Are you sure about that? GQA applies self-attention to every KV cache entry. If you're offloading, then you have to dynamically page all the KV cache entries back into the GPU, which is quite slow since the CPU/GPU link only has so much bandwidth. My understanding is that MLA reduces the size of the KV cache & doesn't necessarily attend to every KV token at every step, which is why offloading to disk works (i.e. most of the tokens can remain on disk without ever being loaded into the GPU).


Offloading in this case doesn't mean keeping the KV cache on disk / in storage all the time; it means keeping it there when a request isn't actively being generated. While a request is being generated, its KV cache is indeed in VRAM.

As for MLA - DeepSeek, just like the others, attends to all historical tokens. The only difference is that instead of storing actual KV entries, it stores lower-dimensional KV entries, which are projected into full-blown KV entries on the fly during attention. It's similar to GQA, except that instead of just duplicating KV entries by the group size, it applies a linear transformation.
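Roughly, as a toy sketch (dims and weight layout made up, not DeepSeek's actual code):

  // What gets cached per token is only a small latent `c_t`; the full-size
  // key/value are rebuilt on the fly with learned up-projection matrices at
  // attention time. (GQA instead caches full-size K/V for a few heads and
  // reuses each one across a whole group of query heads.)
  fn mla_kv_for_token(
      c_t: &[f32],       // cached low-dimensional latent for one past token
      w_uk: &[Vec<f32>], // up-projection to a full key, d_head x d_latent
      w_uv: &[Vec<f32>], // up-projection to a full value, d_head x d_latent
  ) -> (Vec<f32>, Vec<f32>) {
      (matvec(w_uk, c_t), matvec(w_uv, c_t))
  }

  fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
      m.iter()
          .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
          .collect()
  }

That's also why its per-token cache is the smallest: only the small latent is stored.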


Ah OK. So this is for resuming chat context cheaply. What I said is still correct - 3FS is not part of the inference flow & not relevant to the paper which is about optimizing the KV cache usage at runtime.


I really love this space: Navarro's book is an excellent survey.

Erik Demaine has a few great lectures on succinct data structures too: L17 and L18 on https://courses.csail.mit.edu/6.851/spring12/lectures/


that's roughly what the wasm component model is aiming for!

https://hacks.mozilla.org/2019/11/announcing-the-bytecode-al...


can you specify the algorithm in more detail?

this looks to be solving a different problem than A*, which operates over discrete graphs; it operates in continuous 2D space instead.

so, what is the algorithm for finding the optimal point on the obstacle's outline for bypass (4)? is it finding the point on the outline nearest the destination?

then, how do you subsequently "backtrack" to a different bypass point on the obstacle if the first choice of bypass point doesn't work out?

there's something interesting here for trying to directly operate on 2D space rather than discretizing it into a graph, but I'm curious how the details shake out.


The algorithm for finding detour points is as follows. In fact, I’ve improved it a bit through research:

1. Detect a collision with an obstacle on the straight path connecting the starting point and the destination.
2. Decide which direction to explore along the obstacle's outline (for now, the side closer to the destination).
3. If the end of the visible outline is reached, search for an appropriate detour point around that outline.
4. Select a detour point where a straight-line movement from the starting point avoids the obstacle, preferably closer to the destination.

---

If the first detour point selection fails, I plan to search in the *opposite direction* along the outline where the obstacle was first encountered. I’m currently working on resolving this part.
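Here is a rough sketch of steps 2-4 (step 1 is assumed to have already detected the collision and produced the obstacle's visible outline; `Point`, `dist`, and the `line_is_clear` visibility test are placeholders, not the actual bw_path_finding code):

  // Rough sketch of steps 2-4: walk the visible outline from the end closer
  // to the goal and take the first point reachable in a straight line from
  // the start. `line_is_clear` is a placeholder visibility test.
  #[derive(Clone, Copy)]
  struct Point { x: f32, y: f32 }

  fn dist(a: Point, b: Point) -> f32 {
      ((a.x - b.x).powi(2) + (a.y - b.y).powi(2)).sqrt()
  }

  fn find_detour(
      start: Point,
      goal: Point,
      outline: &[Point],                            // obstacle outline, in order
      line_is_clear: impl Fn(Point, Point) -> bool, // straight segment avoids the obstacle?
  ) -> Option<Point> {
      if outline.is_empty() {
          return None;
      }

      // Step 2: walk the outline starting from the end that is closer to the goal.
      let forward = dist(outline[0], goal) <= dist(outline[outline.len() - 1], goal);
      let walk: Vec<Point> = if forward {
          outline.to_vec()
      } else {
          outline.iter().rev().copied().collect()
      };

      // Steps 3-4: take the first outline point the start can reach directly;
      // since the walk is biased toward the goal side, it doubles as the
      // "preferably closer to the destination" choice.
      walk.into_iter().find(|&p| line_is_clear(start, p))
  }

The opposite-direction fallback mentioned above would then just be retrying the walk with the outline reversed.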

You can check out my progress here: https://github.com/Farer/bw_path_finding


this is unbelievably cool. ~27ns overhead for searching for a u32 in a 4GB set in memory is unreal.

it's interesting that the wins for batching start diminishing at 8. I'm curious then how the subsequent optimizations fare with batch size 8 (rather than 128).

smaller batch sizes are nice, since the batch size determines how much request throughput we'd need to saturate this system. at batch size 8, we need 1s / ~30ns * 8 = 266M searches per second to fully utilize this algorithm.

the multithreading results are also interesting -- going from 1 to 6 threads only cuts the per-query overhead by about 4x. curious how this fares on a much higher core count machine.
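for intuition, here's a minimal sketch of the lockstep shape that makes batching pay off (nothing like the post's actual optimized implementation, and the branchless search here is just the textbook one): the B probes at each level are independent, so their cache misses can overlap instead of being paid one at a time.

  // Toy lockstep batch of branchless binary searches. All B probes for a
  // given level are issued back to back, so their memory latency overlaps;
  // a per-query loop would pay the misses one after another. A real
  // implementation would add a cache-friendly layout and explicit prefetching.
  fn batched_search<const B: usize>(sorted: &[u32], queries: [u32; B]) -> [usize; B] {
      let mut base = [0usize; B];
      let mut len = sorted.len();
      while len > 1 {
          let half = len / 2;
          for i in 0..B {
              // one independent probe per query at this level
              if sorted[base[i] + half - 1] < queries[i] {
                  base[i] += half;
              }
          }
          len -= half;
      }
      // base[i] is the candidate slot for queries[i]; membership is then just
      // checking sorted[base[i]] == queries[i].
      base
  }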


Just fyi: the throughput numbers with batching are per _query_, not per _batch_, so I think the *8 is too optimistic :)

I suspect that at higher core counts, we can still saturate the full RAM bandwidth with only 4-5 cores, so that the marginal gains with additional cores will be very small. That's good though, because that gives CPU time to work on the bigger problem to determine the right queries, and to deal with the outputs (as long as that is not too memory bound in itself, although it probably is).


with m/t, the algorithm is memory-bound, so the performance should be determined strictly by the memory throughput


I love playing with it at UltraHigh quality and 1 solver iteration. It reminds me of gradually incorporating one ingredient into another when cooking: like incorporating flour into eggs when making pasta.


+1. I'd be curious how much of a pessimization to uncontended workloads it'd be to just use `tokio::sync::RwLock`.

and, if we want to keep it as a spinlock, I'm curious how much the immediate wakeup compares to using `tokio::task::yield_now`: https://docs.rs/tokio/latest/tokio/task/fn.yield_now.html
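something like this toy sketch is what I have in mind for the `yield_now` variant (not the crate's actual lock; a real one would also want an RAII guard):

  use std::sync::atomic::{AtomicBool, Ordering};

  // Toy async spinlock: on contention, instead of spinning hot or waking
  // immediately, the task yields back to the tokio scheduler and retries.
  struct YieldingSpinLock {
      locked: AtomicBool,
  }

  impl YieldingSpinLock {
      fn new() -> Self {
          Self { locked: AtomicBool::new(false) }
      }

      async fn lock(&self) {
          // uncontended fast path: a single compare_exchange
          while self
              .locked
              .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
              .is_err()
          {
              // contended: let other tasks on this worker run before retrying
              tokio::task::yield_now().await;
          }
      }

      fn unlock(&self) {
          self.locked.store(false, Ordering::Release);
      }
  }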


This is an interesting idea. I am gonna try this out - especially with dashmap, I think that could perform very well.


You could also look into shamelessly "taking inspiration" from async-lock.


very cool stuff! I just read the SPFresh paper a few days ago and was wondering if it's been implemented in industry (e.g. Turbopuffer's implementation of SPANN).

I'd be curious how y'all represent the posting lists for each partition in InnoDB:

- what IDs are you storing in the posting lists?

- how are the posting lists represented on disk? are they using compression and/or some form of skip indexing? the paper seemed to use a pretty simple block-based representation, but I'm curious what works well in practice.

- how do the posting list data structures themselves handle incremental updates and MVCC?


These are very relevant questions! Thank you!

We're storing IDs from a ghost column that is created in the table where you're inserting vector data. This works very well in practice and allows updating the value of the vectors in the table, because updates are translated into a delete + insert in the vector index by updating the ghost ID.

We have abstracted away the quantization system from the index; for the initial release, vector data is stored in raw blocks, like in the paper. Query performance is good, but disk usage is high. We're actively testing different quantization algorithms to see which ones we end up offering on GA. We're hoping our beta users will help us guide this choice!

Incremental updates and MVCC are _extremely tricky_, for both correctness and performance. As you've surely noticed, the hard thing here is that the original paper is very focused on LSM trees, because it exploits the fact that LSM trees get compacted lazily to perform incremental updates to the posting lists ('merges'). MySQL (and Postgres, and all relational databases, really) are B-tree based, and in-place updates for B-trees are expensive! I think we came up with very interesting workarounds for the problem, but it's quite a bit to drill down into in an HN comment. Please stay tuned for our whitepaper. :)


looking forward to it!

I'd be curious if y'all end up supporting adding filter attributes to the inverted index that can then be pushed down into the posting list traversal.

for example, a restaurant search app may have (1) an embedding for each restaurant but also (2) a cuisine. then, if a restaurant has `cuisine = Italian`, we'd also store its ghost ID in a `cuisine:Italian` posting list.

at query time, the query planner could take a query like `SELECT * FROM t1 WHERE cuisine = 'Italian' ORDER BY DISTANCE(..)` and emit a plan that efficiently intersects the `cuisine:Italian` posting list with the union of the partitions' posting lists.

this feels to me like a potential strength of the inverted indexing approach compared to graph-based approaches, which struggle with general filtering (e.g. the Filtered-DiskANN paper).
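concretely, I'm imagining something like the sketch below (names made up; assumes posting lists are sorted lists of ghost IDs): intersect the filter's posting list with each probed partition's posting list, and only score the survivors against the query vector.

  // Toy sketch of the pushdown: posting lists are sorted ghost IDs, and the
  // filter list (e.g. `cuisine:Italian`) is sort-merge intersected with each
  // probed partition's posting list before any distance computation happens.
  fn filtered_candidates(partition_lists: &[&[u64]], filter_list: &[u64]) -> Vec<u64> {
      let mut out = Vec::new();
      for &partition in partition_lists {
          let (mut i, mut j) = (0, 0);
          while i < partition.len() && j < filter_list.len() {
              match partition[i].cmp(&filter_list[j]) {
                  std::cmp::Ordering::Less => i += 1,
                  std::cmp::Ordering::Greater => j += 1,
                  std::cmp::Ordering::Equal => {
                      out.push(partition[i]);
                      i += 1;
                      j += 1;
                  }
              }
          }
      }
      out // ghost IDs that are both in a probed partition and match the filter
  }

that's the pushdown win over post-filtering the top-k results.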


tpuf’s ANN index uses a variant of SPFresh, yup. These are the only two production implementations I am aware of. I don’t think it is in production at MSFT yet

