Are there examples on how this can be used for topic modeling, document similari...

rococode · on Jan 13, 2020

They don't use huggingface, but some of the modern approaches for topic modeling use variational auto-encoders, see:

Open-SESAME (2017): https://arxiv.org/abs/1706.09528 / https://github.com/swabhs/open-sesame

VAMPIRE (2019): https://arxiv.org/abs/1906.02242 / https://github.com/allenai/vampire

samcodes · on Jan 14, 2020

Thanks! I hadn’t seen VAMPIRE! So stoked to see a new approach to topic modeling. SVD etc. are very much a local max.

ogrisel · on Jan 13, 2020

Big transformer neural networks are probably overkill for topic modeling. More traditional methods implemented in Gensim or scikit-learn, such as TF-IDF vectors followed by SVD (aka LSI), or LDA, or NMF, are probably just fine for extracting topics (soft clustering).
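
For anyone who wants to try this, here is a minimal sketch of the TF-IDF followed by SVD (LSI) or NMF route, assuming scikit-learn is installed. The toy corpus, the choice of two topics, and the `top_terms` helper are all made up for illustration; a real corpus would need far more documents and some tuning.

```python
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "my dog chased the cat around the garden",
    "stock markets fell as investors sold shares",
    "the central bank raised interest rates again",
    "investors worry about rates and the stock market",
]

# Bag-of-words TF-IDF matrix: one row per document, one column per term.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Recover the column -> term mapping from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# LSI: truncated SVD of the TF-IDF matrix.
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_topics_lsi = lsi.fit_transform(tfidf)

# NMF: non-negative factors, which are often easier to read as topics.
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
doc_topics_nmf = nmf.fit_transform(tfidf)


def top_terms(components, terms, n=4):
    """Hypothetical helper: the n highest-weighted terms for each component."""
    return [[terms[i] for i in row.argsort()[::-1][:n]] for row in components]


for name, model in [("LSI", lsi), ("NMF", nmf)]:
    for k, words in enumerate(top_terms(model.components_, terms)):
        print(name, "topic", k, ":", ", ".join(words))
```

Each row of `doc_topics_nmf` gives a document's non-negative weight on each topic, which is the "soft clustering" reading: a document can load on more than one topic.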

ogrisel · on Jan 13, 2020

The reason is that you do not need to finely understand the structure of individual sentences to group documents by similar topics. Word order does not matter much for this task. Hence the success of methods that use Bag of Words (e.g. TF-IDF) as their input representation.
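
To make the "word order does not matter" point concrete, here is a small sketch using only the standard library. The `bag_of_words` function and the example sentences are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; order is discarded."""
    return Counter(text.lower().split())


a = "the bank raised interest rates"
b = "rates interest raised bank the"  # same words, scrambled order
c = "the dog chased the cat"

# Scrambling the words leaves the representation unchanged.
print(bag_of_words(a) == bag_of_words(b))

# A document about something else gets a different bag.
print(bag_of_words(a) == bag_of_words(c))
```

The first comparison holds because a bag of words only stores counts, so any model built on it sees `a` and `b` as the same document.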

orestis · on Jan 13, 2020

It might be that the corpus I was trying to cluster needs better preprocessing, or perhaps better n-grams. Using bigrams only, I saw a lot of common words that were meaningless, but adding them as stop words made the results worse. Hence my wondering if some other vectorization would produce better results.

On a related note, as a newcomer just trying to get things done (i.e. applied NLP), I find the whole ecosystem great but frustrating: so many frameworks and libraries, but no clear ways to compose them together. Any resources out there that help make sense of things?

nestorD · on Jan 13, 2020

If I understand your problem correctly, you can use TF-IDF to reduce the weight of meaningless words.
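
A rough sketch of why that works, in plain Python. This uses the textbook weighting tf * log(N / df); the `tfidf` function and the toy documents are made up for illustration, and libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy documents, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bank raised the rates",
]


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


weights = tfidf(docs)
for w in weights:
    print({t: round(v, 3) for t, v in sorted(w.items())})
```

A word that appears in every document ("the" here) has log(N / df) = 0 and so gets zero weight, while a word confined to one document ("bank") is weighted most heavily.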

orestis · on Jan 14, 2020

It’s not meaningless words - it’s common English words that are overloaded and I think considering their position in sentences instead would give better results.

I haven’t yet tried TF-IDF though, so I’ll see what that will do.

oddnearfuture · on Jan 14, 2020

With the appropriate amount of data of course.