Hacker News
Show HN: LLM App – build a realtime LLM app in 30 lines, with no vector database (github.com/pathwaycom)
11 points by janchorowski on July 27, 2023 | hide | past | favorite | 8 comments
Hi HN, I am Jan, CTO and co-founder of Pathway.com.

We’ve built an LLM microservice that answers questions about a corpus of documents and automatically reacts as new documents are added. This single, self-contained service replaces a complex multi-system pipeline that scans for new documents in real time, indexes them into a specialized database, and queries it to generate answers. Everyone can have their own real-time vector index now.

GitHub: https://github.com/pathwaycom/llm-app

Demo video: https://youtu.be/kcrJSk00duw

I am eager to hear your thoughts and comments!



To quickly get to the application sources, please go to:

- https://github.com/pathwaycom/llm-app/blob/main/llm_app/path... for the simplest contextless app

- https://github.com/pathwaycom/llm-app/blob/main/llm_app/path... for the default app that builds a reactive index of context documents

- https://github.com/pathwaycom/llm-app/blob/main/llm_app/path... for the contextful app reading data from s3

- https://github.com/pathwaycom/llm-app/blob/main/llm_app/path... for the app using locally available models


Thanks for these links. I also had a thread going on LinkedIn today about alternatives to vector databases: https://www.linkedin.com/feed/update/urn:li:activity:7090376... . What are the criteria for choosing a vector index vs. a vector database?


An index is a software building block, which becomes a database when wrapped with a data management system. We will see more and more traditional databases add a vector-search index; for instance, pgvector turns PostgreSQL into a vector database.

The LLM App is meant to be self-sufficient and takes a "batteries included" approach to system development. Rather than combining several separate applications (databases, orchestrators, ETL pipelines) into a large deployment, it combines several software components, such as connectors and indexes, into a single app that can be deployed directly with no extra dependencies.

Such an approach should make deployments easier (there are fewer moving parts to monitor and service) while also being more hackable: for example, adding extra logic on top of nearest-neighbor retrieval takes only a few statements of code.
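To illustrate the kind of logic meant here (a hand-rolled conceptual sketch, not code from the LLM App or Pathway): a brute-force nearest-neighbor lookup over embedded documents, with a `min_score` filter as the "extra logic" layered on top of plain retrieval.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, docs, k=2, min_score=0.0):
    """Rank docs (dicts with 'vec' and 'doc' keys) by similarity.
    The min_score filter is the extra statement added on top of
    plain nearest-neighbor retrieval."""
    scored = [(cosine(query_vec, d["vec"]), d["doc"]) for d in docs]
    scored = [(s, doc) for s, doc in scored if s >= min_score]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:k]]

docs = [
    {"doc": "pathway streaming", "vec": [1.0, 0.0, 0.1]},
    {"doc": "vector indexes",    "vec": [0.9, 0.2, 0.0]},
    {"doc": "cooking recipes",   "vec": [0.0, 1.0, 0.0]},
]
print(top_k([1.0, 0.1, 0.0], docs, k=2, min_score=0.5))
```

Dropping low-similarity hits, deduplicating, or boosting recent documents would each be a similarly small change to `top_k`.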


That makes it much clearer, thanks. So this is much more programmer-extensible, and can presumably pull data from other sources too (not just unstructured data).


I see the ingested documents in the data folder don't have an id field, only a doc field.

{"doc": "Using Large Language Models in Pathway is simple: just call the functions from `pathway.stdlib.ml.nlp`!"}

What if I pass two contradictory statements? Is there a way to remove (or better update) a document with a new version?

For example, if I am ingesting some public docs and I update a doc page, how do I make it so that answers only come from the latest document version?


This depends on the data source used. Some track updatable collections; others are more "append-only" in nature. For instance, tracking a database table using CDC + Debezium supports reacting to all document changes out of the box.

For file sources, we are working on supporting file versioning and integration with S3's native object versioning. Then simply deleting a file or uploading a new version would be enough to trigger re-indexing of the affected documents.
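As a rough illustration of the idea (a hand-rolled stdlib sketch, not how Pathway implements it; `reindex` and `drop` are hypothetical callbacks): poll a directory's modification times, re-index anything that appeared or changed, and drop anything that disappeared.

```python
import os

def scan(directory):
    """Map each file in the directory to its last-modified time."""
    return {
        name: os.path.getmtime(os.path.join(directory, name))
        for name in os.listdir(directory)
    }

def diff(old, new):
    """Return the sets of added, updated, and deleted file names."""
    added = set(new) - set(old)
    deleted = set(old) - set(new)
    updated = {n for n in set(old) & set(new) if new[n] != old[n]}
    return added, updated, deleted

def poll_once(directory, state, reindex, drop):
    """One polling step: (re)index anything added or updated,
    drop anything deleted, and return the new snapshot."""
    new_state = scan(directory)
    added, updated, deleted = diff(state, new_state)
    for name in added | updated:
        reindex(name)
    for name in deleted:
        drop(name)
    return new_state
```

With native S3 object versioning, the version ID would replace the modification time as the change signal, but the re-index/drop reaction is the same.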


Hi, interesting!

> Then it processes and organizes these documents by building a 'vector index' using the Pathway package.

What is the Pathway package?


Pathway (https://github.com/pathwaycom/pathway) is a data processing framework we are developing that unifies stream and batch processing of large datasets. It lets developers concentrate on writing the data processing logic without worrying about tracking changes to the data and updating the results. The same code can then be run on batch data (e.g. during testing) or on real-time data streams (i.e. online query processing).

In the LLM app, Pathway allows concentrating on prompt building and querying the LLM APIs as if the corpus of documents were static, while all updates to it are handled by the framework itself.
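The batch/stream unification can be pictured like this (a conceptual stdlib sketch, not the Pathway API): the same transformation function is applied once to a static corpus and once to an incremental stream of events, and the final results agree.

```python
def transform(doc):
    """The application logic: here, just normalize a document."""
    return doc.strip().lower()

def run_batch(docs):
    """Batch mode: process a static corpus in one pass."""
    return [transform(d) for d in docs]

def run_streaming(events):
    """Streaming mode: process documents as they arrive,
    yielding the updated result set after each event."""
    state = []
    for doc in events:
        state.append(transform(doc))
        yield list(state)

corpus = ["  Hello ", "WORLD"]
batch_result = run_batch(corpus)
final_stream_result = list(run_streaming(iter(corpus)))[-1]
assert batch_result == final_stream_result  # same logic, two execution modes
```

The point of a framework like Pathway is that the developer only writes `transform`; keeping the streamed result consistent with the batch result as data changes is the framework's job.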



