I tried weighting the terms by the relative sizes of the corpora (as you suggest), but the problem is that you just shift the bias instead of removing it.
Consider the sentence: 'there are strong and weak divisions in company X's Europe operations'
The only word matches in your word lists are 'strong' on your positive list and 'weak' on your negative list.
If you weight these counts as you describe, your sentiment for this sentence will be -0.44 + 1 = 0.56, even though the sentence is clearly 'neutral' and should have a score of 0.
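To make that failure mode concrete, here is a minimal sketch of the weighted-count scorer in question (the word lists and the -0.44/+1 weights are illustrative stand-ins, not your actual lexicons):

```python
# Minimal sketch of the weighted-count scorer described above.
# The word lists and weights are placeholders for illustration.
POSITIVE = {"strong", "good", "gain"}
NEGATIVE = {"weak", "bad", "loss"}
POS_WEIGHT = 1.0    # per positive hit
NEG_WEIGHT = -0.44  # per negative hit, after scaling by relative lexicon size

def weighted_score(text):
    words = text.lower().split()
    pos_hits = sum(1 for w in words if w in POSITIVE)
    neg_hits = sum(1 for w in words if w in NEGATIVE)
    return pos_hits * POS_WEIGHT + neg_hits * NEG_WEIGHT

# A sentence that should be neutral still comes out positive:
print(weighted_score("there are strong and weak divisions in company X's Europe operations"))
# -> 0.56, not 0
```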
If you want to stick to simple counting (it's a fun exercise at least ;)) and L is the large lexicon and S the small, why don't you:
- Generate L' by randomly picking |S| words from L.
- Compute the score using L' and S.
- Rinse and repeat for the same text N times.
- Compute an aggregate over the N scores, e.g. average, score with the largest number of hits from L, score with the largest number of hits from both, ...
This way, the lexicons have the same size during each scoring attempt, but you do use the extra vocabulary of the larger lexicon.
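A rough sketch of that loop, assuming placeholder lexicons and a plain average as the aggregate:

```python
import random

# Placeholder lexicons; substitute your real L (large) and S (small).
L = {"strong", "good", "gain", "robust", "improved", "growth"}
S = {"weak", "bad", "loss"}

def count_score(words, pos, neg):
    return sum(w in pos for w in words) - sum(w in neg for w in words)

def resampled_score(text, n_trials=100, seed=0):
    """Score the text n_trials times, each time downsampling L to a
    random L' of |S| words, then average over the trials."""
    rng = random.Random(seed)
    words = text.lower().split()
    scores = []
    for _ in range(n_trials):
        l_prime = set(rng.sample(sorted(L), len(S)))  # L' with |S| words from L
        scores.append(count_score(words, l_prime, S))
    return sum(scores) / len(scores)

print(resampled_score("strong growth but weak margins"))
```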
PS: don't stick to simple counting. It doesn't work ;).
Ahh, resampling |S| words from L is a great idea! :)
I know simple counting is not the greatest approach, but I started out by trying to replicate a research paper put out by a stockbroker (not the most advanced research, haha!)
I would love some suggestions for a method that does work :)
On second thought, I'm not sure resampling |S| words from L is a good idea, because you want 100% coverage of your sentiment universe, and an asymmetrical corpus is not a priori incorrect. Resampling doesn't solve anything except reweighting the match lists, and that may not even lead to symmetrical sentiment, since the distribution of words across the language (TF-IDF helps here) is not necessarily the same in your negative and positive lists.
Yeah, I realise what I suggested was a hack. But simple word counting is so error prone that the real answer is not to use it.
I am curious whether the word 'not' was in your negative list; it is quite a difficult word to handle. Compare "I was not disappointed" with "I will not do that again".
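One common quick patch, still imperfect as those two examples show, is to flip a sentiment word's polarity when a negator appears just before it. A sketch with toy word lists:

```python
POSITIVE = {"good", "great", "pleased"}
NEGATIVE = {"bad", "awful", "disappointed"}
NEGATORS = {"not", "never", "no"}

def score_with_negation(text, window=2):
    """Count sentiment words, flipping polarity when a negator
    appears within `window` tokens before the word."""
    words = text.lower().split()
    score = 0
    for i, w in enumerate(words):
        if w not in POSITIVE and w not in NEGATIVE:
            continue
        polarity = 1 if w in POSITIVE else -1
        if any(p in NEGATORS for p in words[max(0, i - window):i]):
            polarity = -polarity
        score += polarity
    return score

print(score_with_negation("I was not disappointed"))    # 1: the flip handles this one
print(score_with_negation("I will not do that again"))  # 0: negativity is pragmatic, no lexicon hit
```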
Have you looked to see whether you are getting runs of negative terms? "bad" counts as 1, "fucking awful" counts as 2. If you are getting runs, perhaps you might score them with diminishing weight: 1/1 for the first term in a run, 1/2 for the second, 1/3, and so on down to 1/n. Of course, given these are journals, such phrases are perhaps unlikely to occur :) Unlike on Twitter and in the comment sections of blogs.
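That decay scheme might look like this (toy negative list; a real version would treat runs of positive terms the same way):

```python
NEGATIVE = {"bad", "awful", "terrible", "fucking"}

def decayed_negative_score(text):
    """Weight the k-th word in a consecutive run of negative terms
    by 1/k, so "bad" scores 1 and "fucking awful" scores 1 + 1/2."""
    score = 0.0
    run = 0
    for w in text.lower().split():
        if w in NEGATIVE:
            run += 1
            score += 1.0 / run
        else:
            run = 0
    return score

print(decayed_negative_score("bad"))            # 1.0
print(decayed_negative_score("fucking awful"))  # 1.5 instead of 2
```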
Could it be that the style of writing is producing this bias? I am reading a book on statistical inference, and it continually points out how not to use the techniques because they could lead to erroneous conclusions. I suspect that book would score badly with simple word counting.
Yes, presumably the larger corpus would tend to contain more "rare" words than the smaller one. I think you may want to weight individual terms by their inverse frequency in the full corpus, or in a representative sample (look up tf-idf if you haven't already). I'm not sure this is a magic bullet, but it might help.
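A sketch of computing such inverse-document-frequency weights by hand (the corpus here is a placeholder; libraries like scikit-learn also provide tf-idf out of the box):

```python
import math
from collections import Counter

# Placeholder corpus: one string per document.
corpus = [
    "strong results across europe operations",
    "weak demand and heavy competition",
    "operations remain strong",
]

def idf_weights(docs):
    """Inverse document frequency per word: rarer words weigh more."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return {w: math.log(len(docs) / df[w]) for w in df}

weights = idf_weights(corpus)
# Score lexicon hits by rarity instead of counting each as 1:
POSITIVE = {"strong"}
text = "strong results"
print(sum(weights.get(w, 0.0) for w in text.lower().split() if w in POSITIVE))
# 'strong' appears in 2 of 3 documents -> weight log(3/2) ~= 0.405
```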
:)