I tried posting this on Stack Exchange with no luck, so figured I might have better luck here:
I'm implementing a simple sentiment analysis algorithm. The authors of the paper have a word list for positive and negative words, simply count the number of occurrences of each in the analysed document, and assign the document a sentiment score with:
sentiment = (#positive_matches - #negative_matches) / (document_word_count)
This normalises the sentiment score by document length, BUT the corpus of negative words is six times larger than the positive word corpus (around 300 positive words and 1800 negative words), so by the measure above the sentiment score will likely be negatively biased, since there are more negative words to match than positive words.
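For concreteness, here is a minimal sketch of how I'm computing the score above (the word lists and tokeniser below are just placeholders, not the actual lists from the paper):

    import re

    positive_words = {"good", "great", "excellent"}   # stands in for the ~300-word positive list
    negative_words = {"bad", "poor", "terrible"}      # stands in for the ~1800-word negative list

    def sentiment_score(document):
        # Crude tokenisation: lowercase words, keep apostrophes.
        tokens = re.findall(r"[a-z']+", document.lower())
        pos = sum(1 for t in tokens if t in positive_words)
        neg = sum(1 for t in tokens if t in negative_words)
        # (#positive_matches - #negative_matches) / document_word_count
        return (pos - neg) / len(tokens) if tokens else 0.0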
How can I correct for the imbalance in the size of the positive vs. negative word lists?
When I calculate the above sentiment score, around 70% of my 2000-document set comes out with negative sentiment scores, BUT there is no a priori reason my document set should be biased towards the negative; I would expect the true 'unobserved' sentiment of the documents to be approximately symmetrical, with around half the documents positive and half negative.
I need to somehow come up with a methodology that results in representative sentiment scores to remove the bias introduced by asymmetrical word lists.
Any thoughts / ideas much appreciated :)
(2) The right way to do this is to train a probability estimator on your scores; that is, put +/- labels on some of your documents, then apply logistic regression.
http://en.wikipedia.org/wiki/Logistic_regression
A lot of machine learning people think this is harder than it is and worry about regularization, overfitting and such, but when you're turning a score into a probability estimate you are (a) fitting a small number of parameters, and (b) if you have a lot of data and make a histogram, you will ALWAYS get a logistic-shaped curve for any reasonable score; I think it has something to do with the central limit theorem.
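Here's a rough sketch of that trick, assuming you've hand-labelled a few hundred of your documents as +/- and already have the raw lexicon scores for everything (I'm using scikit-learn's LogisticRegression purely as an illustration; any logistic regression fit will do):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Raw lexicon scores for the labelled subset, shape (n_labelled, 1).
    scores = np.array([[-0.03], [0.01], [-0.10], [0.05]])   # toy values
    labels = np.array([0, 1, 0, 1])                          # 0 = negative, 1 = positive

    # Fit a one-variable logistic regression: P(positive | raw score).
    calibrator = LogisticRegression()
    calibrator.fit(scores, labels)

    # Apply it to the raw scores of the full document set.
    all_scores = np.array([[-0.02], [0.04]])                 # replace with your real scores
    p_positive = calibrator.predict_proba(all_scores)[:, 1]

    # Calling a document positive when P(positive) > 0.5 uses the learned
    # threshold instead of the biased 0 cut-off on the raw score.
    print(p_positive)

Because the logistic function is monotonic, this doesn't change the ranking of your documents; it just maps the raw score onto a probability whose threshold reflects the labelled data rather than the lopsided word lists.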
This seems to be one of the best-kept secrets in machine learning. I used to be the bagman who supplied data to people at the Cornell CS department, and we ran into a problem where there was an imbalance between the positive and negative sets. In that case the 0 threshold for the SVM is not in the right place, because it gets the wrong idea about the prior distribution, and T Joachims told us to do the logistic regression trick.
Also, if you read the papers about IBM Watson, they tried just about everything to fit probability estimators and wound up concluding that logistic regression "just works" almost all the time.