
What has worked best for us is to narrow the category as much as possible before attempting to do unsupervised clustering.

We focus solely on sports, and our supervised classifiers narrow the scope first to the sport and then to the specific team before we apply any sort of clustering (k-means, LDA, etc.). That lets us reduce the vocabulary to what is mostly a list of named entities for the sport/team plus keywords such as 'injury', 'quarterback', etc. With a significantly reduced vocabulary, even algorithms such as Hierarchical LDA work surprisingly well.
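
To make the "classify first, then cluster on a reduced vocabulary" idea concrete, here is a rough sketch in plain Python. It is not our production code: the training headlines, the `FOOTBALL_VOCAB` list, the sample `DOCS`, and every function name are made up for illustration. A tiny naive Bayes classifier stands in for the supervised step, it routes only to the sport (not the team), and a small k-means stands in for the clustering step.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled headlines for the supervised routing step.
TRAIN = [
    ("quarterback throws touchdown pass", "football"),
    ("quarterback injury before draft trade", "football"),
    ("pitcher throws strikeout inning", "baseball"),
    ("shortstop homer wins inning", "baseball"),
]
# Hypothetical reduced vocabulary for the football category.
FOOTBALL_VOCAB = ["injury", "quarterback", "trade", "draft"]
# Hypothetical incoming documents.
DOCS = [
    "quarterback injury worse than feared",
    "injury keeps quarterback out",
    "draft trade rumors heat up",
    "trade talks before the draft",
    "pitcher strikeout record inning",
]

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(examples):
    """Count words per label for a multinomial naive Bayes model."""
    word_counts, label_counts = defaultdict(Counter), Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the most likely label, using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        n = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def vectorize(text, vocabulary):
    """Bag-of-words counts restricted to the reduced vocabulary."""
    counts = Counter(t for t in tokenize(text) if t in vocabulary)
    return [counts[w] for w in vocabulary]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Plain k-means; farthest-point init keeps the sketch deterministic."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def cluster_category(docs, model, category, vocabulary, k):
    """Route docs with the classifier, then cluster only the chosen category."""
    routed = [d for d in docs if classify(model, d) == category]
    vectors = [vectorize(d, vocabulary) for d in routed]
    return routed, kmeans(vectors, k)

model = train_naive_bayes(TRAIN)
routed, labels = cluster_category(DOCS, model, "football", FOOTBALL_VOCAB, k=2)
for doc, label in zip(routed, labels):
    print(label, doc)
```

On this toy data the baseball headline is filtered out before clustering, and the remaining football headlines separate into an injury group and a draft/trade group because the four-word vocabulary leaves little else to vary on. With a real corpus you would presumably swap in a proper classifier and an off-the-shelf k-means or LDA implementation; the point is only the order of the steps.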


