Ask HN: Sentiment Analysis – how to handle biased word list lengths?
51 points by markovbling on Feb 20, 2015 | 44 comments
I've tried posting this on Stack Exchange but no luck so figured I might have more luck here:

I'm implementing a simple sentiment analysis algorithm from a paper whose authors have a word list for positive and negative words. They simply count the number of occurrences of each in the analysed document and score the document's sentiment with:

sentiment = (#positive_matches - #negative_matches) / (document_word_count)

This normalises the sentiment score by document length, BUT the corpus of negative words is 6 times larger than the positive word corpus (around 300 positive words and 1800 negative words), so by the measure above the sentiment score will likely be negatively biased, since there are more negative words to match than positive words.

How can I correct for the imbalance in the length of the positive vs. negative corpuses?

When I calculate the above sentiment score, around 70% of my 2000-document set gets a negative sentiment score, BUT there is no a priori reason my document set should be biased towards the negative, and I would expect the true 'unobserved' sentiment of the documents to be approximately symmetrical, with around half the documents positive and half negative.

I need to somehow come up with a methodology that results in representative sentiment scores to remove the bias introduced by asymmetrical word lists.
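For concreteness, here's a minimal Python sketch of the scoring I'm doing (the word lists below are just placeholders standing in for the paper's ~300/~1800-word lists):

    # Placeholder word lists; the paper's lists have ~300 and ~1800 entries.
    POSITIVE = {"good", "strong", "gain", "excellent"}
    NEGATIVE = {"bad", "weak", "loss", "terrible"}

    def sentiment(document):
        words = document.lower().split()
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        return (pos - neg) / len(words) if words else 0.0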

Any thoughts / ideas much appreciated :)



(1) sentiment analysis is the one area where bag of words really goes to die; there is a limit to how good your results will be, and it won't be good.

(2) the right way to do this is to train a probability estimator on your scores, that is, put +/- labels on some of your documents, then apply logistic regression.

http://en.wikipedia.org/wiki/Logistic_regression

A lot of machine learning people think this is harder than it is and worry more about regularization, overfitting and such, but in the case of turning a score into a probability estimator you are (a) fitting a small number of variables, and (b) if you have a lot of data and make a histogram, you will ALWAYS get a logistic curve for any reasonable score; I think it has something to do with the central limit theorem.
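As a concrete sketch of the calibration step (with scikit-learn and made-up numbers, just to show the shape of it):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # raw counting scores for a few hand-labelled docs (1 = positive, 0 = negative)
    raw_scores = np.array([[-0.030], [-0.010], [0.005], [0.020], [0.040], [-0.020]])
    labels = np.array([0, 0, 1, 1, 1, 0])

    clf = LogisticRegression().fit(raw_scores, labels)

    # turn a new document's raw score into P(positive | score)
    p_positive = clf.predict_proba([[0.015]])[:, 1]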

This seems to be one of the best kept secrets in machine learning. I used to be the bagman who supplied data to people at the Cornell CS department, and we ran into a problem where there was an imbalance in the positive and negative sets. In that case the 0 threshold for the SVM is not in the right place, because it gets the wrong idea about the prior distribution, and T Joachims told us to do the logistic regression trick.

Also, if you read the papers about IBM Watson, they tried just about everything to fit probability estimators and wound up concluding that logistic regression "just works" almost all the time.


Expanding on the point that bag of words doesn't work that great for sentiment... Good examples of why are phrases like "not bad", "not great", "not a good idea". Scoring based on unigrams alone really doesn't capture the context well, and people use negators a lot. You could maybe try filtering these out or detecting them with some clever rules.
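Even a crude rule helps a bit; here's a sketch (simple unigram matching that flips polarity when the previous token is a negator; the lists are illustrative only):

    NEGATORS = {"not", "never", "no"}

    def polarity(words, i, positive, negative):
        """+1/-1/0 for the word at position i, flipped if the previous
        token is a simple negator, so 'not bad' counts as positive."""
        if words[i] in positive:
            score = 1
        elif words[i] in negative:
            score = -1
        else:
            return 0
        if i > 0 and words[i - 1] in NEGATORS:
            score = -score
        return score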

I did a bit of work on this using the JDPA Sentiment Corpus for my thesis about 5 years ago. It's hand-annotated for things like negators and inversions of sentiment. There's a bunch of code and examples here: https://verbs.colorado.edu/jdpacorpus/

Warning: code/corpus is academic licensed, but even reading the papers may give you some ideas.


It is worse than that because:

(i) Negators aren't always next to the word they negate; to get good accuracy you need more of a parse.

(ii) Sentiment is highly dependent on the domain. For instance, if one was looking at people's opinions on stocks, "buy" and "sell" carry sentiment, but in general these are emotionally neutral words.

(iii) There is also sarcasm, which sometimes even people can't figure out.


This sounds like a great idea!

I am familiar with logistic regression (studying to be an actuary) but the problem is that my documents are unlabeled: I literally have 2000+ unlabeled documents and a list of positive and negative words.

I'm willing to label 10% (~200) documents but how should I 'score' them? On a scale of [-1,1]? Just {-1,0,1} for negative, neutral, positive? How do I create a training set?

Also, can you point me in the direction of some of the implementation details, e.g. how do you translate text into a logistic regression model?

I would also like to implement POS tagging and 2-grams (e.g. "not bad" != "bad") - any advice on incorporating this into the system?

Thank you for your input!


We had to do something similar in our real world example where we had the labels but were unsure if the labels were truly accurate or not.

We used a technique similar to LSA.

Our first step was to build a bag of words and construct a scaled TF off of that. Then we verified the labels for about 10% of the data and used that as our training set. Using cosine similarity (which we calculated via matrix multiplication of the TFs), we found the top n labeled documents most similar to the document in question to decide the labels of the remaining 90% of documents.

Once we had this dataset we ran it against logistic regression as well, training on the same 10% and using the remainder for prediction. Interestingly, document similarity was only slightly better than logistic regression, and logistic regression was 10 times faster.

I think this approach worked for us because we had a somewhat mutually exclusive set of words for one label or the other. This may not work in sentiment analysis, where the same word can have different meanings depending on surrounding words. N-grams, and then TF on them, might help in that case.
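Roughly, in scikit-learn terms (a sketch of the idea using tf-idf rather than our scaled TF, with toy texts standing in for the labelled 10% and unlabelled 90%):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Stand-ins for the verified 10% and the remaining 90%.
    labelled_texts = ["great product, works well", "terrible waste of money", "really happy with it"]
    labels = np.array([1, 0, 1])
    unlabelled_texts = ["works great, very happy", "what a waste"]

    vec = TfidfVectorizer()
    X_lab = vec.fit_transform(labelled_texts)
    X_unlab = vec.transform(unlabelled_texts)

    # TfidfVectorizer rows are L2-normalised, so the dot product is cosine similarity.
    sims = (X_unlab @ X_lab.T).toarray()
    top_n = np.argsort(-sims, axis=1)[:, :2]                 # 2 most similar labelled docs
    predicted = [int(round(labels[row].mean())) for row in top_n]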


If you are not using bag of words, then what features are you fitting your logistic regression with?


I see you have mentioned TF-IDF as something which you are planning to try. That should be interesting.

The way I see it (and I may very well be slightly off point), you have a corpus of 2000 docs and 2 lists, [Wpos] & [Wneg], with count[Wneg] a factor larger than count[Wpos].

If you compute a [0,1]-normalized tf-idf score for each term in [Wpos] & [Wneg] and sum them over all words in each of those two sets, you get scores proportional to the counts of positive and negative words. Normalized here means using relative frequencies rather than absolute frequencies (I prefer calling the latter term counts).

This puts document_word_count-based normalization out of the picture and makes it implicit in the tf-idf step.

Now you have two numbers, Sum(positive normalized tf-idfs) and Sum(negative normalized tf-idfs), which you can individually normalize for your list sizes and then use for sentiment classification. A dirty hack, and somewhat inefficient if you don't maintain a reverse index.

A second approach could be this: use your word lists, both positive and negative, to do Okapi BM25 scoring against your docs, with each list as the query set. You would then get a BM25 score for each doc and can use that to define sentiment.

Corpus = D
Di = document in the corpus you want to classify
Query1 = {set of positive words}
Query2 = {set of negative words}

PositiveScore = BM25(Query1, Di)
NegativeScore = BM25(Query2, Di)

Then some combination to do the classification, e.g. if PositiveScore > NegativeScore, call it positive!

Just a thought. BM25 has some flexibility in tuning it for length normalization. Check the footnote [1].
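A sketch of that second approach, assuming the rank_bm25 Python package (not mentioned above, just one convenient BM25 implementation) and toy word lists:

    from rank_bm25 import BM25Okapi

    docs = ["the results were strong and the outlook is good",
            "a weak quarter with terrible guidance and heavy losses"]
    positive_words = ["good", "strong", "gain", "excellent"]
    negative_words = ["bad", "weak", "terrible", "loss"]

    bm25 = BM25Okapi([d.split() for d in docs])

    pos_scores = bm25.get_scores(positive_words)   # one BM25 score per document
    neg_scores = bm25.get_scores(negative_words)
    labels = ["positive" if p > n else "negative"
              for p, n in zip(pos_scores, neg_scores)]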

PS: There is the British National Corpus too for word frequencies :)

[1]BM25 and normalizations. http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-...


Wow, thank you so much for pointing out BM25 - hadn't heard of it but looks very cool. Implementing it ASAP.


You could use the word frequency lists at http://www.wordfrequency.info/ in order to normalize, e.g. add up the frequencies of the positive words and the negative words, and divide the number of matches by these frequencies.
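A sketch of that, with made-up background frequencies (in practice they would come from a list like wordfrequency.info or the BNC):

    # Relative background frequencies of the list words in general English
    # (placeholder numbers).
    background_freq = {"good": 9e-4, "strong": 4e-4,
                       "bad": 5e-4, "weak": 2e-4, "terrible": 1e-4}

    positive = {"good", "strong"}
    negative = {"bad", "weak", "terrible"}

    pos_mass = sum(background_freq.get(w, 0.0) for w in positive)
    neg_mass = sum(background_freq.get(w, 0.0) for w in negative)

    def normalised_sentiment(document):
        words = document.lower().split()
        pos = sum(w in positive for w in words)
        neg = sum(w in negative for w in words)
        # divide each count by the total background frequency mass of its list
        return (pos / pos_mass - neg / neg_mass) / len(words)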


Great idea! I've looked at term-weighting approaches such as TF-IDF, but I don't have a training set of positive/negative sentences, so I would have to term-weight just the occurrences of the words in each of the positive/negative lists and compute a net sentiment on that basis.

Will implement and see if this fixes the bias introduced by asymmetrical corpus sizes.

Fundamentally, I'm not sure a 'solution' to the negative word list being bigger than the positive one is available, or even necessary. So what if there are more ways of saying negative things? All that matters is how many times positive or negative things are said. My problem, however, is that even if the range of terms that can create a match is greater for negative words, you should still get representative positive/negative counts, and that doesn't explain why I'm getting far more negative sentiment scores than would be expected given my test documents.


I think the issue is whether

P(detected|positive) and P(detected|negative)

are the same, right? That is, whether they have equal coverage of all possible phrases. Having more positive or negative words, as you say, doesn't inherently bias things. Is your hypothesis that your corpus doesn't skew negative (which seems to be the basis of this question) correct? Can you do some manual sampling to get a good bound on it?


Totally agree

I suspect the bias currently giving me P(detected|negative)>P(detected|positive) is resulting from my simplification to looking only at 1-grams


If I understood correctly, you are trying to get a sentiment that is always correct for single sentences but that can extrapolate word frequencies if they don't appear in your list. I.e. for large neutral documents you want it to be neutral, although your negative words match statistically more often.

My intuition tells me that you can't really do both: either use only those words in your dictionary and get the behaviour right for single sentences, or extrapolate as if both sets were the same size (a weighted average would be the easiest). By extrapolating you may assume that for each positive match you get, you'll miss other positive matches. That means you generally underestimate positive matches compared to negative matches. This only works on large datasets.

But really, how bad is it to get a sentiment of -0.17 for a single sentence? It tells you that it was a negative sentence but that you have a high chance that there was a positive word in there that you missed, which is what you need to implement to get neutral sentiment for large neutral documents.


1. You can try using bi-grams or even tri-grams to make your word lists a little more precise.

2. Create a validation set by manually identifying each review as positive or negative. Each time you modify your algorithm, run it through your validation set and note the results in a spreadsheet. If you don't do that, you'll never know if and how you've improved the results. The bigger the validation set the better. Similarly, you can use part of your validation set as a training set for a classifier.

3. Find a scale that works to bias your score. For example, I would try to bias your negative score using a log scale: the fewer negative words you have, the more they are worth; the more you have, the less they are worth.


Definitely think I should look at using bi-grams and tri-grams

Interesting reflection on society if there are more 1-gram ways of communicating negativity than positivity e.g. I'm more inclined to say 'terrible' for something very bad while it feels more natural to say 'very good' than 'excellent'. If that makes any sense :)


I found this paper useful for a side project I worked on a few months ago, one that made use of n-grams in a naive bayesian classifier:

http://arxiv.org/pdf/1305.6143v2.pdf

and the lead author's github repos are:

https://github.com/vivekn/sentiment https://github.com/vivekn/sentiment-web

He's implemented 'negative bi-gram detection' (my phrasing, not his) with this function:

https://github.com/vivekn/sentiment/blob/master/info.py#L26-...

...which I found useful as a jumping off point. Good luck!


If you think the true sentiment is symmetric, you can just change the decision threshold so that your algorithm answers positively about half the time. Just say positive when the sentiment is greater than the mean sentiment over your training set.


This is an interesting approach to normalisation - will give it a go :)


Two simple things you could do:

1. Insert each negative example six times into your training set (or weight negative examples accordingly, i.e. use #positive_matches - 6 * #negative_matches / (2 * positive_word_count) as your score).

2. Take your distribution of sentiment scores as calculated over held out data (or the training set itself, but be warned that this will skew your results), and calculate the mean and standard deviation. Normalize your results by subtracting the mean and dividing by the standard deviation. You can then say that positive sentiment is > 0 and negative sentiment < 0, with the absolute value being the strength of the classification.
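For 2., a quick sketch (the scores array is just an example of raw counting scores on held-out documents):

    import numpy as np

    scores = np.array([-0.040, -0.020, -0.010, 0.000, 0.010, 0.030])  # raw scores, held-out docs
    z = (scores - scores.mean()) / scores.std()

    labels = np.where(z > 0, "positive", "negative")
    strength = np.abs(z)    # magnitude = strength of the classification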


I have a list of positive and negative words and a set of documents which I want to score, so I'm not sure I have a 'training set'.

I think you mean to upweight my positive list by 6 (since it is 1/6 the size of the negative list), but the problem with this is the same as in my reply to the other comment, where you just shift the bias:

Consider the sentence: 'there are strong and weak divisions in company X's Europe operations'

The only word matches in your word lists are 'strong' on your positive list and 'weak' on your negative list.

If you weight these counts as you describe, your sentiment for this sentence will be -0.44 + 1 = 0.66 even though the sentence is clearly 'neutral' and should have a score of 0.


You are right; this does just shift the bias, which is sometimes all you need (you have a simple algorithm, presumably for a reason).

I did misunderstand that you don't have a training set, just a list of positive and negative words. You could still apply a similar idea.

You could test your hypothesis that the score is biased by looking at the average number of positive and negative words per document, and slightly modify your factors. For example if you found that the average document had 6 negative words and 4 positive words, but you think that the average sentiment is neutral across your documents, you could multiply the positive word count by 1.5. It's a less brutal way to accomplish a similar outcome without increasing the complexity of your algorithm.

Otherwise, you will need to use an algorithm with more discriminative power, and this will likely mean you need a training set. You can go very deep down that rabbit hole, but I would consider starting with Naive Bayes, which essentially learns a weight per positive and negative word and combines them in a similar manner to what you're doing now. It has the advantage of being a simple algorithm.
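A sketch of that with scikit-learn (the two training documents are made up, just to show the shape of the pipeline):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # toy stand-ins for a small hand-labelled training set
    train_texts = ["strong results, very happy with the direction",
                   "weak quarter, terrible guidance"]
    train_labels = ["positive", "negative"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    print(model.predict(["the outlook is strong"]))   # ['positive']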


Reweighting sentiment by looking at the number of occurrences of positive and negative words in my assumed neutral corpus is a great idea :)

Will implement and report back

I've looked into using Naive Bayes but my understanding is you need labeled training documents and then I face the problem of scoring documents which introduces subjectivity compared to just counting the 'sentiment words'.

I understand complexity is needed to deal with negation ('not bad' != 'bad'), but I'd imagine the sentiment scoring process would be the same regardless of algorithm, which brings us back to the problem of how to correct the bias from 'word list' asymmetries.


I like your "2." suggestion more, because the initial sentiment score distribution may not be normal.

So there is the option of trying to make it normal, for example by taking the logarithm, and calculating the mean etc. after that.


I would still expect it to tend towards the normal distribution across a large set of documents. If you model positive and negative word counts as binomial distributions, you have the difference of two samples from different binomial distributions, which would still tend towards normal (I think, though I'm not 100% sure; certainly it's true within my experience). A logarithm would skew away from positive towards negative sentiment and is undefined for negative values.


It only tends to a Normal distribution if you estimate P(negative|matches in -ve list) & P(positive|matches in +ve list) with an unbiased, consistent estimator.

A simple 1-gram model like in the question does not capture many complexities of natural language, e.g. negation ("not bad" != "bad"), so you would expect your estimator to over-represent the dictionary with more words that equal their adverb-adjusted equivalents. E.g. "not bad" can be described as 'terrible' more readily than 'very good' can be described as 'excellent', since people assign a hyperbolic weighting to their own happiness (utility theory 101).

The sentiment would only tend to a normal distribution if we had perfect estimators for document sentiment which requires advanced POS tagging and models more complex than a 1-gram bag of words aggregation :)


I meant -0.167 + 1 = 0.83 > 0 therefore positive sentiment :)


Only using match counts is imo a bit simplistic (don't get me wrong, simplistic can be good). Do you have any information like "how many times does this negatively annotated word occur in a document"? Then you can use a simple calculation (like cosine similarity) to compute a measure of matching for that case.

Also, consider using bigrams (i.e. word-pairs) to do sentiment matching which will make matching more precise.


Cosine similarity is a measure of comparison between two vectors, so what would you use as the two vectors in this case?

Definitely going to look into n-grams for the production implementation, but right now I'm trying to resolve the negative bias issue.


Word counting can be pretty ropey, but there are some things that you should check. Of the 1800 negative words, how many actually occurred in the documents?

Or you could simply count negative words as 0.44 rather than 1 (800/1800 if those numbers are correct).

Is "not" a negative word? This might be causing problems with things like "This cake is not bad" which has positive sentiment even if it has 2 negative words


I tried weighting the terms by the relative sizes of the corpuses (as you suggest) but the problem is you just shift the bias instead of removing it.

Consider the sentence: 'there are strong and weak divisions in company X's Europe operations'

The only word matches in your word lists are 'strong' on your positive list and 'weak' on your negative list.

If you weight these counts as you describe, your sentiment for this sentence will be -0.44 + 1 = 0.66 even though the sentence is clearly 'neutral' and should have a score of 0.

:)


If you weight these counts as you describe, your sentiment for this sentence will be -0.44 + 1 = 0.66 even though the sentence is clearly 'neutral' and should have a score of 0.

If you want to stick to simple counting (it's a fun exercise at least ;)) and L is the large lexicon and S the small, why don't you:

- Generate L' by randomly picking |S| words from L.

- Compute the score using L' and S.

- Rinse and repeat for the same text N times.

- Compute an aggregate over the N scores, e.g. average, score with the largest number of hits from L, score with the largest number of hits from both, ...

This way, the lexicons have the same size during each scoring attempt, but you do use the extra vocabulary of the larger lexicon.
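In code, roughly (a sketch of the resampling loop; pos_lex is S and neg_lex is L above, and averaging is just one of the possible aggregates):

    import random

    def resampled_score(words, pos_lex, neg_lex, n_rounds=50, seed=0):
        """Average the simple counting score over n_rounds random subsamples
        of the larger (negative) lexicon, each the size of the smaller one."""
        rng = random.Random(seed)
        pos = sum(w in pos_lex for w in words)
        scores = []
        for _ in range(n_rounds):
            sampled = set(rng.sample(sorted(neg_lex), len(pos_lex)))
            neg = sum(w in sampled for w in words)
            scores.append((pos - neg) / len(words))
        return sum(scores) / len(scores)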

Ps.: don't stick to simple counting. It doesn't work ;).


Ahh resampling |S| words from L is a great idea! :)

I know simple counting is not the greatest approach, but I started out by trying to replicate a research paper put out by a stock broker (not the most advanced research haha!)

I would love some suggestions for a method that does work :)


On second thought, I'm not sure resampling |S| words from L is a good idea, because you want 100% coverage of your sentiment universe, and an asymmetrical corpus is not a priori incorrect. So resampling does not solve anything except reweighting the match lists, and that may not even lead to symmetrical sentiment, since the distribution of words across the language (TF-IDF helps here) is not necessarily the same in your negative and positive lists.


Yeah, I realise what I suggested was a hack. But simple word counting is so error prone that the real answer is not to use it.

I am curious whether the word 'not' was in your negative list; it is quite a difficult word to handle. "I was not disappointed" against "I will not do that again".

Have you looked to see if you are getting runs of negative terms? "bad" counts as 1, "fucking awful" counts as 2. If you are getting runs of negative terms, perhaps you might score them as 1/1 for the first term in a run, 1/2 for the second, 1/3 for the third, down to 1/n. Of course, given these are journals, such phrases are perhaps unlikely to occur :) Unlike on Twitter and in the comment sections of blogs.
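Something like this, as a sketch of the run-discounting idea:

    def run_weighted_count(words, lexicon):
        """Count matches, but score consecutive matches as 1, 1/2, 1/3, ...,
        so 'fucking awful' contributes 1.5 rather than 2."""
        total, run = 0.0, 0
        for w in words:
            if w in lexicon:
                run += 1
                total += 1.0 / run
            else:
                run = 0
        return total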

Could it be that the style of writing is producing this bias? I am reading a book on statistical inference and it is continually pointing out how not to use the techniques because they could lead to erroneous conclusions. I suspect that it would score badly with simple word counting.


Haha! :)

Totally agree - definitely need to do something about negation e.g. "Not bad" != "bad"

My understanding is that this is usually handled using a list of adverbs e.g. 'not' / 'very' ('very bad' > 'bad') etc.

Not sure of a better approach than word counting though?


Not sure of a better approach than word counting though?

There are many better approaches, assuming that you have annotations for supervised learning. E.g.:

http://www.socher.org/uploads/Main/SocherPenningtonHuangNgMa... http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf


Yes, presumably the larger corpus would tend to contain more "rare" words than the smaller. I think you may want to weight individual terms by their inverse frequency in the full corpus or a representative sample (look up tf-idf if you haven't already). I'm not sure this is a magic bullet, but it might help.


Do you have gold standard labels for your dataset? Can you ensure that the number of pos/neg labels is symmetrical?

You can heuristically tune the weights of your lexicon to fit your intuition, but evidence is necessary to progress adequately.

If you find an unbalanced number of examples, apply an imbalance-aware effectiveness score like the F-measure to obtain a fair measure of your system's performance.


This is part of my problem - I don't have a labeled dataset outside of my 'positive words' / 'negative words' lists.

I don't think asymmetrical test sets would be a problem if I had training data for documents, since you can reweight to compensate. It seems my problem is that the bigger 'negative word list' over-represents the universe of possible negative matches, which introduces bias, and I'm not sure how to solve that.

Please see my reply on reweighting in this thread (if you reweight positive words to normalize the over-represented negative word count then a neutral sentence will have a positive sentiment score)


Assuming there is no inherent bias in terms of sentiment and vocabulary, one approach would be to repeatedly randomly sample 300 negative words from the corpus and generate a vector of sentiments. You could then average the elements of the vector to get an average sentiment, or use another metric from basic stats. That could decrease the bias.


but wouldn't you miss sentiment terms in the text if you sample a subset of your negative dictionary?


You can try to find a dataset that contains an equal number of positive and negative documents (sentences, etc.) and use it as a validation set, i.e. to tune your hyperparameters on it.

In the simple case your hyperparameter can be α in

sentiment = (α * #positive_matches - #negative_matches) / (document_word_count)
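A sketch of tuning it (assuming a small balanced validation set of tokenised docs with labels +1/-1; the grid of alphas is arbitrary):

    import numpy as np

    def score(words, pos_lex, neg_lex, alpha):
        pos = sum(w in pos_lex for w in words)
        neg = sum(w in neg_lex for w in words)
        return (alpha * pos - neg) / len(words)

    def tune_alpha(val_docs, val_labels, pos_lex, neg_lex, grid=np.linspace(1, 10, 91)):
        # pick the alpha with the best sign-agreement on the balanced validation set
        def accuracy(a):
            preds = [np.sign(score(d, pos_lex, neg_lex, a)) for d in val_docs]
            return np.mean([p == y for p, y in zip(preds, val_labels)])
        return max(grid, key=accuracy)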


Out of interest, did you define some kind of measure by which you can test how well the chosen method performs?

(There are a lot of suggestions here, so it would be nice if at least you could choose the "best" one)


To be honest, I haven't thought about measuring the performance of different approaches, but I have thought about a metric that will signal poor performance, and right now I'm interested in eliminating poor performance in my simplistic methodology.

What I mean is that if my output 'sentiment scores' are skewed towards the negative and centered around a negative number (~70% of my documents are scored as 'negative sentiment' using the above scheme), then I know my model is broken, because I know my document "test set" is essentially neutral (or even positively skewed).

Mathematically, I need to ensure my 'cost function' is a reflection of reality, which means my sentiment scores at the end of the day need to be symmetrical, or slightly positively skewed, with a mean of approximately 0.

I can use regularization to trick the model into looking like this, but I don't want to overfit, and I can't think of a theoretical reason for the negative bias in this simple model, except perhaps that my positive corpus is missing 'true positive sentiment' (a la a parameter versus an estimate in frequentist statistics). That could be a byproduct of the simplistic 'bag of words' assumption (words are assumed independent), i.e. breaking my analysis down into 1-grams (single words). As per my other comment, 2-grams could be the more natural way to express positive sentiment, while 1-gram negative words are more readily available. Perhaps this is a sign of our pessimism as a species haha :)



