I tried weighting the terms by the relative sizes of the corpora (as you suggest), but the problem is that you just shift the bias instead of removing it.
Consider the sentence: 'there are strong and weak divisions in company X's Europe operations'
The only word matches in your word lists are 'strong' on your positive list and 'weak' on your negative list.
If you weight these counts as you describe, your sentiment for this sentence will be -0.44 + 1 = 0.56, even though the sentence is clearly 'neutral' and should have a score of 0.
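To make that failure mode concrete, here is a minimal sketch of the weighted-count scorer in question (the word lists and the -0.44/+1 weights are illustrative stand-ins, not your actual lexicons):

```python
# Minimal sketch of the weighted-count scorer described above.
# The word lists and weights are placeholders for illustration.
POSITIVE = {"strong", "good", "gain"}
NEGATIVE = {"weak", "bad", "loss"}
POS_WEIGHT = 1.0    # per positive hit
NEG_WEIGHT = -0.44  # per negative hit, after scaling by relative lexicon size

def weighted_score(text):
    words = text.lower().split()
    pos_hits = sum(1 for w in words if w in POSITIVE)
    neg_hits = sum(1 for w in words if w in NEGATIVE)
    return pos_hits * POS_WEIGHT + neg_hits * NEG_WEIGHT

# A sentence that should be neutral still comes out positive:
print(weighted_score("there are strong and weak divisions in company X's Europe operations"))
# -> 0.56, not 0
```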
If you want to stick to simple counting (it's a fun exercise at least ;)) and L is the large lexicon and S the small, why don't you:
- Generate L' by randomly picking |S| words from L.
- Compute the score using L' and S.
- Rinse and repeat for the same text N times.
- Compute an aggregate over the N scores, e.g. average, score with the largest number of hits from L, score with the largest number of hits from both, ...
This way, the lexicons have the same size during each scoring attempt, but you do use the extra vocabulary of the larger lexicon.
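A rough sketch of that loop, assuming placeholder lexicons and a plain average as the aggregate:

```python
import random

# Placeholder lexicons; substitute your real L (large) and S (small).
L = {"strong", "good", "gain", "robust", "improved", "growth"}
S = {"weak", "bad", "loss"}

def count_score(words, pos, neg):
    return sum(w in pos for w in words) - sum(w in neg for w in words)

def resampled_score(text, n_trials=100, seed=0):
    """Score the text n_trials times, each time downsampling L to a
    random L' of |S| words, then average over the trials."""
    rng = random.Random(seed)
    words = text.lower().split()
    scores = []
    for _ in range(n_trials):
        l_prime = set(rng.sample(sorted(L), len(S)))  # L' with |S| words from L
        scores.append(count_score(words, l_prime, S))
    return sum(scores) / len(scores)

print(resampled_score("strong growth but weak margins"))
```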
PS: don't stick to simple counting. It doesn't work ;).
Ahh, resampling |S| words from L is a great idea! :)
I know simple counting is not the greatest approach, but I started out by trying to replicate a research paper put out by a stockbroker (not the most advanced research, haha!)
I would love some suggestions for a method that does work :)
On second thought, I'm not sure resampling |S| words from L is a good idea, because you want 100% coverage of your sentiment universe, and an asymmetrical corpus is not a priori incorrect. Resampling doesn't solve anything except reweighting the match lists, and that may not even lead to symmetrical sentiment, since the distribution of words across the language (TF-IDF helps here) is not necessarily the same in your negative and positive lists.
Yeah, I realise what I suggested was a hack. But simple word counting is so error prone that the real answer is not to use it.
I am curious whether the word 'not' was in your negative list; it is quite a difficult word to handle. Compare "I was not disappointed" with "I will not do that again".
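One common quick patch, still imperfect as those two examples show, is to flip a sentiment word's polarity when a negator appears just before it. A sketch with toy word lists:

```python
POSITIVE = {"good", "great", "pleased"}
NEGATIVE = {"bad", "awful", "disappointed"}
NEGATORS = {"not", "never", "no"}

def score_with_negation(text, window=2):
    """Count sentiment words, flipping polarity when a negator
    appears within `window` tokens before the word."""
    words = text.lower().split()
    score = 0
    for i, w in enumerate(words):
        if w not in POSITIVE and w not in NEGATIVE:
            continue
        polarity = 1 if w in POSITIVE else -1
        if any(p in NEGATORS for p in words[max(0, i - window):i]):
            polarity = -polarity
        score += polarity
    return score

print(score_with_negation("I was not disappointed"))    # 1: the flip handles this one
print(score_with_negation("I will not do that again"))  # 0: negativity is pragmatic, no lexicon hit
```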
Have you looked to see whether you are getting runs of negative terms? "bad" counts as 1, "fucking awful" counts as 2. If you are getting runs, perhaps you might score them with diminishing weight: 1/1 for the first term in a run, 1/2 for the second, 1/3, and so on down to 1/n. Of course, given these are journals, such phrases are perhaps unlikely to occur :) Unlike on Twitter and in the comment sections of blogs.
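That decay scheme might look like this (toy negative list; a real version would treat runs of positive terms the same way):

```python
NEGATIVE = {"bad", "awful", "terrible", "fucking"}

def decayed_negative_score(text):
    """Weight the k-th word in a consecutive run of negative terms
    by 1/k, so "bad" scores 1 and "fucking awful" scores 1 + 1/2."""
    score = 0.0
    run = 0
    for w in text.lower().split():
        if w in NEGATIVE:
            run += 1
            score += 1.0 / run
        else:
            run = 0
    return score

print(decayed_negative_score("bad"))            # 1.0
print(decayed_negative_score("fucking awful"))  # 1.5 instead of 2
```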
Could it be that the style of writing is producing this bias? I am reading a book on statistical inference, and it continually points out how not to use the techniques because they could lead to erroneous conclusions. I suspect that book would score badly with simple word counting.
Yes, presumably the larger corpus would tend to contain more "rare" words than the smaller one. I think you may want to weight individual terms by their inverse frequency in the full corpus, or in a representative sample (look up tf-idf if you haven't already). I'm not sure this is a magic bullet, but it might help.
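A sketch of computing such inverse-document-frequency weights by hand (the corpus here is a placeholder; libraries like scikit-learn also provide tf-idf out of the box):

```python
import math
from collections import Counter

# Placeholder corpus: one string per document.
corpus = [
    "strong results across europe operations",
    "weak demand and heavy competition",
    "operations remain strong",
]

def idf_weights(docs):
    """Inverse document frequency per word: rarer words weigh more."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return {w: math.log(len(docs) / df[w]) for w in df}

weights = idf_weights(corpus)
# Score lexicon hits by rarity instead of counting each as 1:
POSITIVE = {"strong"}
text = "strong results"
print(sum(weights.get(w, 0.0) for w in text.lower().split() if w in POSITIVE))
# 'strong' appears in 2 of 3 documents -> weight log(3/2) ~= 0.405
```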
:)