Hacker News | johnwatson11218's comments

https://github.com/johnwatson11218/LatentTopicExplorer

You have to use Docker Compose to bring it up at localhost:8000. There are still bugs, but I'm working on them; there was interest expressed in this project on Hacker News a couple of weeks back.


I did something similar: I used pdfplumber to extract text from my PDF book collection and dumped it into PostgreSQL, then chunked the text into 100-character chunks with a 10-character overlap. These chunks were embedded directly into a 384-D space using Python sentence_transformers. Then I simply averaged all the chunks for a doc and wrote that single vector back to PostgreSQL. Finally I used UMAP + HDBSCAN to perform dimensionality reduction and clustering, which leaves a 2-D data set that I can plot with plotly to see my clusters. It is very cool to play with. It takes hours to import 100 PDF files, but I can take one folder that contains a mix of programming titles, self-help, math, science fiction, etc., and after the fully automated analysis you can clearly see the different topic clusters.
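For anyone wanting to try it, the chunk-and-embed step looks roughly like this (a minimal sketch; the model name is an assumption, any 384-D sentence-transformers model behaves the same way):

    from sentence_transformers import SentenceTransformer
    import numpy as np

    # Illustrative choice; any 384-D sentence-transformers model works here.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk_text(text, size=100, overlap=10):
        # Fixed-size character chunks with a small overlap between neighbours.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def embed_document(text):
        chunks = chunk_text(text)
        vectors = model.encode(chunks)           # one 384-D vector per chunk
        return np.asarray(vectors).mean(axis=0)  # average into a single doc-level vector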

I just spent time getting it all running on Docker Compose and moved my web UI from Express.js to Flask. I want to get the code cleaned up and open-source it at some point.



Thanks for the supportive comments. I'm definitely thinking I should release sooner rather than later. I have been using LLMs for specific tasks, and here is a sample stored procedure I had an LLM write for me.

--
-- Name: refresh_topic_tables(); Type: PROCEDURE; Schema: public; Owner: postgres
--

CREATE PROCEDURE public.refresh_topic_tables()
LANGUAGE plpgsql
AS $$
BEGIN
    -- Drop tables in reverse dependency order
    DROP TABLE IF EXISTS topic_top_terms;
    DROP TABLE IF EXISTS topic_term_tfidf;
    DROP TABLE IF EXISTS term_df;
    DROP TABLE IF EXISTS term_tf;
    DROP TABLE IF EXISTS topic_terms;

    -- Recreate tables in correct dependency order
    CREATE TABLE topic_terms AS
    SELECT
        dt.term_id,
        dot.topic_id,
        COUNT(DISTINCT dt.document_id) as document_count,
        SUM(frequency) as total_frequency
    FROM document_terms dt
    JOIN document_topics dot ON dt.document_id = dot.document_id
    GROUP BY dt.term_id, dot.topic_id;

    CREATE TABLE term_tf AS
    SELECT
        topic_id,
        term_id,
        SUM(total_frequency) as term_frequency
    FROM topic_terms
    GROUP BY topic_id, term_id;

    CREATE TABLE term_df AS
    SELECT
        term_id,
        COUNT(DISTINCT topic_id) as document_frequency
    FROM topic_terms
    GROUP BY term_id;

    CREATE TABLE topic_term_tfidf AS
    SELECT
        tt.topic_id,
        tt.term_id,
        tt.term_frequency as tf,
        tdf.document_frequency as df,
        -- Cast to numeric to avoid integer division in the IDF ratio
        tt.term_frequency * LN( (SELECT COUNT(id) FROM topics)::numeric / GREATEST(tdf.document_frequency, 1)) as tf_idf
    FROM term_tf tt
    JOIN term_df tdf ON tt.term_id = tdf.term_id;

    CREATE TABLE topic_top_terms AS
    WITH ranked_terms AS (
        SELECT
            ttf.topic_id,
            t.term_text,
            ttf.tf_idf,
            ROW_NUMBER() OVER (PARTITION BY ttf.topic_id ORDER BY ttf.tf_idf DESC) as rank
        FROM topic_term_tfidf ttf
        JOIN terms t ON ttf.term_id = t.id
    )
    SELECT
        topic_id,
        term_text,
        tf_idf,
        rank
    FROM ranked_terms
    WHERE rank <= 5
    ORDER BY topic_id, rank;

    RAISE NOTICE 'All topic tables refreshed successfully';
   
EXCEPTION WHEN OTHERS THEN
    RAISE EXCEPTION 'Error refreshing topic tables: %', SQLERRM;
END;
$$;

This sounds amazing, totally interested in seeing the approach and repo.

Sounds a lot like Bertopic. Great library to use.

Yes. Please publish. Sounds very interesting.

I've read that by the end of ancient Egyptian history they had used tricks like a picture of an eye for the letter or sound 'I', or a picture of a bee for the sound of 'B'; there was a complete alphabet embedded within the system. To be literate you had to know the tricks from the Old and Middle Kingdoms as well. The result was three complete alphabets, similar to modern Japanese. From that point of view the invention of the alphabet was more of a simplification. This always reminded me of the situation in modern enterprise development, where lots of infrastructure was written in-house.


That's a rather confused account of the matter.

The rebus principle where someone might use a depiction of an eye for the sound "I" and so forth is the very basis of the script and was there from the beginning. The complicated part is they'd use words with one to three consonants and strip the vowels. To continue the example, we might use 𓃠 to represent the consonants "ct" and thus use it to write "cat", "cot", and "cut."

There was an inventory of uniconsonantal or uniliteral signs dating back to the very beginning of the language which the ancient Egyptians could have used as an alphabet (or abjad if we want to be pedantic) if they had wanted to, but they never did—at least to write Egyptian. The basis of our alphabet, the Proto-Sinaitic script, seems to have come about when speakers of Canaanite languages in the Sinai Peninsula borrowed a small number of Egyptian hieroglyphs, assigned them the phonetic value for the thing depicted in their own Canaanite language, and didn't bother with anything other than uniconsonantal signs.

The "three different alphabets" thing is unrelated to any of this. Hieroglyphs and hieratic appear around the same time. Hieroglyphs were used for monuments and more formal contexts. Hieratic is a cursive form of hieroglyphs that was much faster to write with a brush pen and ink. It tended to be used for literature, correspondence, and record-keeping. From what we know of Egyptian scribal education, they started out with hieratic and then moved on to hieroglyphs, with not everyone progressing to the point where they started learning hieroglyphs. This is quite the reversal from how we approach things today, with virtually every student of ancient Egyptian language learning hieroglyphs (specifically, Middle Egyptian) first and then moving on to learning hieratic. Demotic was a later evolution of hieratic. And eventually, the Egyptians wrote their language using a modified Greek alphabet ultimately derived from their hieroglyphs (Coptic).


The alphabet really is a massive simplification: fewer symbols, fewer historical traps, lower onboarding cost.


If you want to know what you are up against, I highly recommend https://www.amazon.com/Recoding-America-Government-Failing-D...

This book discusses the IT systems at the IRS and VA and shows the kind of pushback you can expect from entrenched players.


I don't know the book, but I hope it isn't yet another complaint about bureaucracy in need of "business thinking". I.e., how deep are they digging to find the real players? Because you can bet your home this "AI" initiative is just another instance of Elite Capture¹ here. The last thing any government needs right now is letting its policies and implementations be steered by (and made dependent on) hallucinating "AI", whose ownership is ultimately in the hands of the democracy-destroying tech oligarchs.

1. https://en.wikipedia.org/wiki/Elite_capture


They talk about the specific systems in terms of legacy code and how far removed government agencies are from automated testing and other modern best practices. It has been a couple of years since I read it, but I recall a part about a business process at the IRS that people don't start learning until they have been there for about 17 years, due to the complexity. It talks about how there had been failed attempts to migrate to a new database; some of the data is now duplicated, but the upgrade is de-funded, so all the new code has to be aware that data may be duplicated.

I'm not sure if this book got into it, but I've also read that the IRS has assembly code from the 1960s that is highly optimized and that only a few devs can work on. ChatGPT knows a lot about this history as well.


It’s not that. It’s a several-years-old (and still very good) book by one of the original leaders of the USDS, a group which put many of the proposals described therein into practice.


I have a PDF of this book and was using an LLM to translate the old code into modern, idiomatic Python, and it is very cool. I wonder if somebody will re-release it with modern code and tooling? In fact, Google Gemini was able to do it on the fly using the posted links.


I have a pipeline in Docker Compose that starts up PostgreSQL in one container and Python in another. The Python scripts recursively read all the PDF files in a directory and use pdfplumber to extract the text and store it in a Postgres table. Then I use sentence_transformers to take 100-character chunks, with a 10-character overlap, and embed each chunk as a 384-D vector which is written back to the db. Then I average all the chunks to create a single embedding for the entire PDF file. I have used numpy as well as built-in Postgres functions to do the averaging, and it is fast either way.
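The numpy version of that averaging step is roughly the following (a sketch only; the connection string, table, and column names are simplified placeholders, and the chunk vectors are assumed to be stored as float8[] arrays):

    import numpy as np
    import psycopg2

    # Placeholder connection details.
    conn = psycopg2.connect("dbname=corpus user=postgres host=localhost")

    def average_document_embedding(document_id):
        # Fetch every chunk vector for the document and average them per dimension.
        with conn.cursor() as cur:
            cur.execute("SELECT embedding FROM chunks WHERE document_id = %s", (document_id,))
            vectors = np.array([row[0] for row in cur.fetchall()], dtype=np.float32)
        doc_vector = vectors.mean(axis=0)
        # Write the single document-level vector back to the documents table.
        with conn.cursor() as cur:
            cur.execute("UPDATE documents SET embedding = %s WHERE id = %s",
                        (doc_vector.tolist(), document_id))
        conn.commit()
        return doc_vector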

Then I use UMAP + HDBSCAN to create a 2D projection of my dataset. HDBSCAN writes the clusters to a CSV file, which I read back in to create the topics and docs2topics join tables. Then I join each topic's documents into a mega-doc and, treating those mega-docs as the corpus, compute tf-idf using only db functions. This gives me the top 5 or so terms per topic, which serve as useful topic labels.
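The projection-and-clustering step is roughly this (a minimal sketch; the parameter values and output file name are illustrative, not tuned):

    import pandas as pd
    import umap
    import hdbscan

    def cluster_documents(doc_vectors, doc_ids, out_csv="clusters.csv"):
        # Reduce the 384-D document embeddings to 2-D.
        reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine")
        coords = reducer.fit_transform(doc_vectors)

        # HDBSCAN assigns each point a cluster id (-1 means noise / no topic).
        labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(coords)

        df = pd.DataFrame({"document_id": doc_ids, "x": coords[:, 0],
                           "y": coords[:, 1], "topic_id": labels})
        df.to_csv(out_csv, index=False)  # read back later to build topics / docs2topics
        return df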

I can do 30 to 50 docs in a couple of hours. I imported 1100 PDF files and it took all weekend on an old gaming laptop with an SSD. I have a GPU, and I think the embedding steps would go faster with it, but I'm still doing it all synchronously without any parallel processing.


I just got a project running whereby I used Python + pdfplumber to read in 1100 PDF files, most of my Humble Bundle collection. I extracted the text and dumped it into a 'documents' table in PostgreSQL. Then I used sentence_transformers to reduce each 1K chunk to a single 384-D vector, which I wrote back to the db. Then I averaged these to produce a document-level embedding as a single vector.

Then I was able to apply UMAP + HDBSCAN to this dataset, and it produced a 2D plot of all my books. Later I put the discovered topics back in the db and used them to compute tf-idf for my clusters, from which I could pick the top 5 terms to serve as a crude cluster label.
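The plot itself is just a scatter of those 2D coordinates colored by cluster, roughly like this (plotly, as in my other comment; the file and column names are placeholders):

    import pandas as pd
    import plotly.express as px

    df = pd.read_csv("clusters.csv")  # document_id, x, y, topic_id
    # Color by cluster id; hovering shows which document each point is.
    fig = px.scatter(df, x="x", y="y", color=df["topic_id"].astype(str),
                     hover_data=["document_id"],
                     title="Document clusters (UMAP projection)")
    fig.show()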

It took about 20 to 30 hours to finish all these steps, and I was very impressed with the results. I could see my cookbooks clearly separated from my programming and math books. I could drill in and see subclusters for baking, BBQ, salads, etc.

Currently I'm putting it into a two-container Docker Compose file: base PostgreSQL plus a Python container I'm working on.


This sounds like an interesting project. Do you have any plans to publish a tutorial/journal article + source code?


Everyone is talking about having LLMs write software, but what about having them delete code? That can be very hard in a legacy enterprise environment. I think dead code detection overlaps with security, and that is a good way to sell that kind of code cleanup. Having LLMs review your architecture is a fun exercise; being able to incorporate that feedback is a good measure for the dev teams.


I think the Oracle Transaction Manager is one of the best pieces of software I have had to work with in a professional setting. Lots of other stuff in an enterprise setting is very flaky and follows trends, but the Oracle internals seem very nice.


My prompt that I couldn't get the LLM to understand was the following. I was having it generate images of depressing offices with no windows and lots of depressing, grey cubicles with paper all over the floor. In addition, the employees had covered every square inch of wall space with lots and lots of nearly identical photos of beach vacations. In one of the renditions, all those beach images had blended together to form a larger beach, a kind of mosaic of a non-existent place. Since so many beach photos were similar, it was an easy effect to recreate here and there. No matter how I asked the LLM to focus on enhancing the image of the beach that was "not there", the one you kind of needed to squint to see, I could not get acceptable results. Some were very funny and entertaining, but I didn't think the model grasped what I was asking. Maybe the term 'mosaic' (which I didn't include in my initial prompts) and the ability to reason or do things in stages would allow current models to do this.

