
Are there examples of how this can be used for topic modeling, document similarity, etc.? All the examples I’ve seen (gensim) use bag-of-words, which seems to be outdated.


They don't use huggingface, but some of the modern approaches for topic modeling use variational auto-encoders, see:

Open-SESAME (2017): https://arxiv.org/abs/1706.09528 / https://github.com/swabhs/open-sesame

VAMPIRE (2019): https://arxiv.org/abs/1906.02242 / https://github.com/allenai/vampire


Thanks! I hadn’t seen VAMPIRE! So stoked to see a new approach to topic modeling. SVD etc are very much a local max


Big transformer neural networks are probably overkill for topic modeling. More traditional methods implemented in Gensim or scikit-learn, such as TF-IDF vectors followed by SVD (aka LSI), LDA, or NMF, are probably just fine for extracting topics (soft clustering).


The reason is that you do not need a fine-grained understanding of the structure of individual sentences to group documents by similar topics. Word order does not matter much for this task. Hence the success of methods that use bag-of-words representations (e.g. TF-IDF) as their input.
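To make the word-order point concrete, here is a small standard-library sketch. The example sentences and the `bag_of_words` and `cosine` helpers are assumptions for illustration, not taken from any library. A shuffled sentence gets exactly the same bag-of-words vector as the original, so any similarity computed on it is unchanged.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding word order."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


original = bag_of_words("the central bank raised interest rates")
shuffled = bag_of_words("rates interest raised bank central the")
other = bag_of_words("the cat chased the dog")

sim_shuffled = cosine(original, shuffled)  # same words, different order
sim_other = cosine(original, other)        # only "the" is shared

print(original == shuffled, round(sim_shuffled, 3), round(sim_other, 3))
```

Note that the only overlap with the unrelated sentence comes from "the", which is the kind of common word that stop lists or TF-IDF weighting are meant to discount.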


It might be that the corpus I was trying to cluster needs better preprocessing, or perhaps better n-grams. Using bigrams only, I saw a lot of common words that were meaningless, but adding them as stop words made the results worse. Hence my wondering whether some other vectorization would produce better results.

On a related note, as a newcomer just trying to get things done (i.e. applied NLP), I find the whole ecosystem great but frustrating: so many frameworks and libraries, but no clear ways to compose them together. Are there any resources out there that help make sense of things?


If I understand your problem correctly, you can use TF-IDF to reduce the weight of meaningless words.
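As a rough illustration of how that down-weighting works, here is a standard-library sketch using the textbook formula idf = ln(N / df). The toy documents and the helper names are assumptions for illustration. Libraries differ in the details: scikit-learn's default, for instance, smooths the IDF so that a word appearing in every document keeps a small non-zero weight.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); "the" appears in every document.
docs = [
    "the model is trained on the data",
    "the data is cleaned before the model is fit",
    "the tokenizer splits the text",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))


def idf(word):
    """Textbook inverse document frequency, ln(N / df); 0 for ubiquitous words."""
    return math.log(n_docs / df[word])


def tfidf(tokens):
    """Weight each word by its raw count times its IDF."""
    tf = Counter(tokens)
    return {w: tf[w] * idf(w) for w in tf}


weights = tfidf(tokenized[2])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

Even though "the" occurs twice in the last document, its weight drops to zero because it occurs in every document, while the rarer words keep their weight.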


It’s not meaningless words - it’s common English words that are overloaded and I think considering their position in sentences instead would give better results.

I haven’t tried TF-IDF yet though, so I’ll see what that will do.


With the appropriate amount of data of course.



