It could do either depending on what the planner decides. In practice, pgvector usually does post-filtering (filter after the vector search).
pgvector HNSW has the problem that retrieval is cut off at some constant C results, and if none of those match the filter then it won't find anything. I believe newer versions of pgvector address that. Also, pgvectorscale's StreamingDiskANN[1] doesn't have that problem to begin with.
[1]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...
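To make that concrete, a rough sketch (made-up table; assume an HNSW index on `embedding`):

```sql
-- HNSW scans roughly hnsw.ef_search candidates, and only THEN does
-- Postgres apply the WHERE clause, so a selective predicate can leave
-- zero survivors even though matches exist further out in the graph.
SET hnsw.ef_search = 40;

SELECT id
FROM items
WHERE category = 'rare'                   -- applied after the index scan
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'  -- uses the HNSW index
LIMIT 10;

-- If I remember right, pgvector 0.8.0+ can keep scanning past the
-- cutoff instead of giving up:
SET hnsw.iterative_scan = relaxed_order;
```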
As far as I can tell, Chroma can only store chunks, not the original documents. This is from your docs: `If the documents are too large to embed using the chosen embedding function, an exception will be raised`.
In addition, it seems that embedding happens at ingest time. So if, for example, the OpenAI endpoint is down, the insert will fail. That, in turn, means your users need a retry mechanism and a queuing system -- all the complexity we describe in our blog.
Obviously, I am not an expert in Chroma. So apologies in advance if I got anything wrong. Just trying to get to the heart of the differences between the two systems.
Chroma certainly doesn't have the most advanced API in this area, but you can for sure store chunks or documents; it's up to you. If your document is too large to embed in a single forward pass, then yes, you do need to chunk in that scenario.
Oftentimes, though, even if the document does fit, you choose to chunk anyway, or further transform the data with abstractive/extractive summarization techniques, to improve your search dynamics. This is why I'm not sure the complexity noted in the article is relevant to anything beyond a "naive RAG" stack. How it's stored or linked is an issue to some degree, but the bigger, more complex smell is in what happens before you even get to the point of inserting the data.
For more production-grade RAG, blindly inserting embeddings wholesale for full documents is rarely going to get you great results (this varies a lot with document size and domain). As a result, you're almost always going to do ahead-of-time chunking (or summarization/NER/etc.), not because you have to due to document size, but because your search performance demands it. Frequently this involves more than one embedding model, to capture different semantics or support different tasks, not to mention reranking after the initial sweep.
That's the complexity that I think is worth tackling in a paid product offering, but the current state of the module described in the article isn't really competitive with the rest of the field in that respect IMHO.
(Post co-author) We absolutely agree that chunking is critical for good RAG. What I think you missed in our post is that the vectorizer lets you configure a chunking strategy of your choice. So you store the full doc, but then the system will chunk and embed it for you. We don't blindly embed the full document.
I didn't miss that detail; I just don't think chunking alone is where the complexity lies, and the pgai feature set isn't really differentiated from other offerings in that context. My commentary about full documents was responding directly to your comment here in this thread more than to the article (you claimed Chroma can only insert chunks, which isn't accurate, and I expanded from there).
Yes, that is correct, but my position (which perhaps I've articulated poorly) is that in non-trivial instances it's a distinction without a difference in the greater context of the RAG stack and related pipelines.
Just allowing a chunking function to be defined and called at insertion time doesn't really alleviate the major pain points inherent to the process. It's a minor convenience, but in fact, as others have pointed out elsewhere in this thread, it's a convenience you can afford yourself in a handful of lines of code that you only ever have to write once.
The DB is the right layer from an interface point of view -- because that's where the data properties should be defined. We also use the DB for bookkeeping what needs to be done, because we can leverage transactions and triggers to make sure we never miss any data. From an implementation point of view, the actual embedding does happen outside the database, in a Python worker or cloud functions.
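To sketch the pattern (table/function names here are hypothetical, not our actual schema):

```sql
-- Inserts and updates enqueue work in the same transaction as the
-- source row, so the external worker can never miss data.
CREATE TABLE embedding_queue (
    doc_id    bigint PRIMARY KEY,
    queued_at timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION enqueue_embedding() RETURNS trigger AS $$
BEGIN
    INSERT INTO embedding_queue (doc_id)
    VALUES (NEW.id)
    ON CONFLICT (doc_id) DO UPDATE SET queued_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER docs_enqueue
AFTER INSERT OR UPDATE OF content ON docs
FOR EACH ROW EXECUTE FUNCTION enqueue_embedding();
```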
Merging the embeddings and the original data into a single view allows the full feature set of SQL rather than being constrained by a REST API.
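For example, something like this (made-up schema):

```sql
-- Source rows and their chunk embeddings joined back together, so one
-- query can mix relational predicates with vector search.
CREATE VIEW docs_with_embeddings AS
SELECT d.id, d.title, d.published_at, e.chunk, e.embedding
FROM docs d
JOIN doc_embeddings e ON e.doc_id = d.id;

-- e.g. semantic search restricted to recent documents:
SELECT title, chunk
FROM docs_with_embeddings
WHERE published_at > now() - interval '30 days'
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 5;
```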
That is arguable, because while it is a calculated field, it is not a pure one (IO is required), not necessarily idempotent, not atomic, and not guaranteed to succeed.
It is certainly convenient for the end user, but it hides things. What if the API calls to OpenAI fail or get rate limited? How is that surfaced? Will I see that in my observability? Will queries just silently miss results?
If the DB did the embedding itself, synchronously within the write, it would make sense. That would be more like Elasticsearch or a typical full-text index.
(Co-author here.) We automatically retry on failures after a while. We also log error messages in the worker (self-hosted) and have clear indicators in the cloud UI that something went wrong (with plans to add email alerts later).
The error handling is actually the hard part here. We don't believe failing on insert when the endpoint is down is the right thing, because that just moves the retry/error-handling logic upstream -- now you need to roll your own queuing system, backoffs, etc.
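For illustration, retry state can live in the same hypothetical queue table from the sketch upthread, instead of in app code:

```sql
-- Retry bookkeeping on the queue itself (hypothetical schema).
ALTER TABLE embedding_queue
    ADD COLUMN attempts        int         NOT NULL DEFAULT 0,
    ADD COLUMN next_attempt_at timestamptz NOT NULL DEFAULT now();

-- The worker claims a batch; rows that fail stay queued and get an
-- exponential backoff, while successes are simply deleted afterwards.
UPDATE embedding_queue
SET attempts        = attempts + 1,
    next_attempt_at = now() + interval '1 minute' * power(2, attempts)
WHERE doc_id IN (
    SELECT doc_id
    FROM embedding_queue
    WHERE next_attempt_at <= now()
    ORDER BY queued_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED
)
RETURNING doc_id;
```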
Thanks for the reply. These are compelling points.
By the way, I also agree the insert shouldn't fail. The insert is sort of an enqueuing action.
I was debating whether a microservice should process that queue.
Since you are a PaaS, the distinction might be almost moot -- an implementation detail. (It would affect the API, though.)
However, if Postgres added this feature generally, it would seem odd to me, because it feels like the DB doing app stuff: the DB fetching data for itself from an external source.
The advantage is that it's one less thing for the app to do, and it may take care of errands many teams otherwise have to roll their own code for.
A downside is that if I want to change how this is done, I probably can't. Say I have data residency or security requirements that affect the data I want to encode.
I think there is much to consider. Probably the "why not both" meme applies, though: use the built-in feature if you can, and roll your own where you can't.
We agree a lot of stuff still needs to be figured out, which is why we made the vectorizer very configurable. You can configure chunking strategies and formatting (which is a way to add context back into chunks). You can mix semantic and lexical search on the results. That handles your 1, 2, 3. Versioning can mean a different version of the data (in which case the versioning info lives with the source data) OR a different embedding config, which we also support[1].
Admittedly, right now we have predefined chunking strategies. But we plan to add custom-code options very soon.
Our broader point is that the things you highlight above are the right things to worry about, not the data workflow ops and babysitting your lambda jobs. That's what we want to handle for you.
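On mixing semantic and lexical search, the general pattern looks something like this (made-up schema and weights; not pgai's exact API):

```sql
-- Score each chunk lexically (full-text rank) and semantically
-- (cosine similarity), then blend the two for the final ordering.
WITH scored AS (
    SELECT id, chunk,
           ts_rank(to_tsvector('english', chunk),
                   plainto_tsquery('english', 'postgres vectors')) AS lexical,
           1 - (embedding <=> '[0.1, 0.2, 0.3]')                   AS semantic
    FROM doc_embeddings
)
SELECT id, chunk
FROM scored
ORDER BY 0.5 * lexical + 0.5 * semantic DESC
LIMIT 10;
```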
Hah! This was actually one of the main algorithmic challenges of adapting DiskANN to PostgreSQL. Yes, I think it's common for these algorithms to assume you know how many results to return ahead of time. But in PostgreSQL that's not how things work -- because of things like post-index-retrieval filtering, the right interface for Postgres is one that just keeps returning more and more results until all possible matches are exhausted. We solved this by creating a "streaming" version of the search algorithm that keeps state like which nodes in the graph have been visited, which have been returned, etc.
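From the SQL side, that contract is basically a cursor you can keep pulling from (made-up table):

```sql
BEGIN;
DECLARE nn CURSOR FOR
    SELECT id
    FROM items
    WHERE category = 'rare'
    ORDER BY embedding <=> '[0.1, 0.2, 0.3]';
FETCH 10 FROM nn;  -- if the filter rejected most candidates...
FETCH 10 FROM nn;  -- ...just keep pulling; the index resumes its search
CLOSE nn;
COMMIT;
```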
That's all to say: yes, we've solved this; there are no arbitrary limits on the number of results returned.
This article doesn't account for the fact that the role of government funding in science is to fund basic science that industry doesn't have the right incentives to fund. Renewables and energy efficiency do just fine with industry-sponsored R&D. Fusion would not.
> Renewables and energy efficiency do just fine with industry-sponsored R&D. Fusion would not.
If Commonwealth Fusion's approach is viable, we may finally be at the stage where industry-sponsored R&D can work. Though in fairness, the concept had significant development at MIT, which I assume gets a decent amount of public funding. And there is more expensive work to be done after their first demo reactor, SPARC.
For those not familiar with CFS: they are building a tokamak with new high-field-strength superconducting magnets, allowing them to build a system with similar physics to ITER on a MUCH smaller scale. According to them, a stronger magnetic field allows for a proportionally smaller machine, and indeed their new magnet design allows for the strongest field ever in a tokamak.
There are other fusion startups, but I can't keep track of them all. As a layperson, I feel like tokamaks are well understood -- JET has confirmed good physics for ITER and SPARC -- and while the engineering challenges in construction are significant, the smaller design allows for a much faster and cheaper design cycle.
However, I agree that fusion will not solve the impending climate crisis: even if we had the first working power plant in 10-15 years, it would still take way too long to build out all the plants we need. And the first fusion power plants will not be remotely cost-competitive with renewables; a fusion plant is still a very big and complex machine compared to the relatively commodified solar-and-storage solution. Remember that in 15 years, batteries will be much cheaper.
What may work well, if we get fusion working, is to build out all the renewables we can now, and then in 30 years when those systems need replacement, build out a mixture of fusion and renewables (which will continue to be very useful and cost competitive).
I would love to see more government investment both in renewables build out and in shoring up fusion research. While we MAY be able to get by with industry funding, that is very much not guaranteed. And the build out itself will need support if we ever get it working.
This looks really cool! I'm excited to see a production deployment of DiskANN.
According to the single-threaded QPS experiments, your DiskANN solution should clock in at about 4.5ms latency (1000ms/224QPS) whereas pgvector is about 5.8ms latency (1000ms/173QPS). How is that possible? My (very shallow) knowledge of DiskANN vs HNSW tells me that DiskANN should generally have higher latency than HNSW — DiskANN needs to touch the SSD while HNSW only touches RAM.
Also, compared to pgvector and HNSWPQ in faiss, how much less RAM does your DiskANN-based solution use?
(Blog author here.) Thanks for the question. In this case, the index for both DiskANN and pgvector HNSW is small enough to fit in memory on the machine (8GB RAM), so there's no need to touch the SSD. We plan to test on a config where the index is larger than memory (we couldn't this time due to limitations in ANN-Benchmarks [0], the tool we use).
To your question about RAM usage: we provide a graph of index size. With PQ enabled, our new index is 10x smaller than pgvector HNSW. We don't have numbers for HNSWPQ in FAISS yet.
Timescale is continuing to grow rapidly, and we’re hiring for many roles involving distributed systems and databases.
Folks might know TimescaleDB as “Postgres + time-series”: we’re implemented as an extension to PostgreSQL (not a fork), but focus on the scale, performance, and ease of use needed for time-series applications. Automated time/space partitioning, columnar compression, continuous aggregates (incrementally materialized views), time-series analytics, and now a distributed database with horizontal scale-out.
Our database code is all available on GitHub; we don’t have any “enterprise versions”, so all of it is free for the community. We currently see more than 3 million databases running TimescaleDB a month.
That’s because our commercial focus is on our fully-managed cloud platform, so in addition to folks implementing the core database (C, Rust), we’re also heavily hiring for folks with operational cloud & database experience at scale (Kubernetes with custom k8s operators, Golang, etc), as well as folks interested in providing highly technical support, database release engineering, and building database testing infrastructure (think performance analysis, distributed database correctness testing, etc.) And product folks to support these efforts & all our users!
We are also developing Promscale, which makes storing, analyzing, and managing observability (Prometheus metrics & Open-Telemetry traces now, other signals in development) data easier and allows users to use SQL on that data.
Lots of fun engineering and product roles/work, and an amazing company culture you can read more about here: https://www.timescale.com/careers
Feel free to drop me a DM here with any questions (I’m one of the original engineers at Timescale — and now lead the Promscale team), or apply here at https://www.timescale.com/careers
> Timescale is an all-remote organization; this is a full-time position and can be located anywhere across a wide range of time zones and locations (UTC-8 to UTC+5.5)