But does this work the same way with a 140-character limit since people are oft...

jey · on May 14, 2009

As long as you have an appropriate training set, I don't see why it would be significantly different. You might run into problems if you train the classifier on some radically different dataset and then try to run it against Twitter accounts.

daveying99 · on May 14, 2009

The Twitter API gives access to a large number of a user's latest tweets. So the classifier sees a whole series of 140-character messages rather than a single one, which should make the gender classifier more accurate.
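
To make the "series of tweets" point concrete, here is a rough sketch, assuming a plain multinomial naive Bayes model with add-one smoothing (my choice for illustration, not necessarily what the classifier under discussion uses). The labels "a" and "b", the toy training texts, and the function names `train` and `log_scores` are all made up. The idea is that per-tweet log-likelihoods add up across an account, so the gap between the two classes tends to widen as more tweets are pooled.

```python
import math
from collections import Counter

def train(labeled_texts):
    """labeled_texts: iterable of (label, text) pairs. Returns (word counts per label, doc counts)."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in labeled_texts:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
        doc_counts[label] += 1
    return word_counts, doc_counts

def log_scores(model, texts):
    """Sum naive Bayes log-likelihoods over every text (tweet) from one account."""
    word_counts, doc_counts = model
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Add-one smoothing; the extra +1 reserves a slot for unseen words.
        denom = sum(counts.values()) + len(vocab) + 1
        score = math.log(doc_counts[label] / total_docs)
        for text in texts:
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
        scores[label] = score
    return scores

# Hypothetical toy training data with placeholder class labels.
TOY = [
    ("a", "alpha beta alpha"),
    ("a", "alpha gamma"),
    ("b", "delta beta delta"),
    ("b", "delta gamma"),
]
model = train(TOY)

# Hypothetical account with three short "tweets".
tweets = ["alpha beta", "gamma alpha", "alpha"]

one = log_scores(model, tweets[:1])
pooled = log_scores(model, tweets)
margin_one = one["a"] - one["b"]           # evidence from a single tweet
margin_pooled = pooled["a"] - pooled["b"]  # evidence from all three tweets
print(margin_one, margin_pooled)
```

The pooled margin is larger here only because every toy tweet leans the same way. With real accounts the per-tweet evidence is noisy and the independence assumption is shaky, so treat the sum as a heuristic rather than a calibrated probability.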

kngspook · on May 14, 2009

Yeah, but the API doesn't give you access to those users' gender (since Twitter doesn't ask/store that info), which means you have no way to tell the classifier "Here are 500 males' tweets, here are 500 females' tweets."

I suppose you could manually find males and females and train on their bodies of work, but it won't be great; you'll likely run into selection bias.

But if you seeded with that approach, and then used a SpamAssassin-style auto-learner...maybe you'd have a chance?

I suppose this is a case where you don't want "Perfect" to get in the way of "Good enough", especially since it will never be perfect...
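
For what it's worth, the seed-then-auto-learn idea above might look something like the following. This is only a sketch of generic self-training, in the spirit of an auto-learner that feeds confidently scored items back into its training data; it is not how SpamAssassin is actually implemented. The naive Bayes model, the labels "a" and "b", the toy texts, the 0.85 threshold, and the names `posterior` and `self_train` are all assumptions for illustration.

```python
import math
from collections import Counter

def train(labeled):
    """labeled: iterable of (label, text). Returns (word counts per label, doc counts)."""
    counts, docs = {}, Counter()
    for label, text in labeled:
        counts.setdefault(label, Counter()).update(text.lower().split())
        docs[label] += 1
    return counts, docs

def posterior(model, text):
    """Return {label: probability} under multinomial naive Bayes with add-one smoothing."""
    counts, docs = model
    vocab = set().union(*counts.values())
    total_docs = sum(docs.values())
    logp = {}
    for label, c in counts.items():
        denom = sum(c.values()) + len(vocab) + 1  # +1 slot for unseen words
        logp[label] = math.log(docs[label] / total_docs)
        for word in text.lower().split():
            logp[label] += math.log((c[word] + 1) / denom)
    top = max(logp.values())  # subtract the max before exponentiating, for stability
    exp = {k: math.exp(v - top) for k, v in logp.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

def self_train(seed, unlabeled, threshold=0.85, max_rounds=5):
    """Repeatedly add confidently classified texts to the training set."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, remaining = [], []
        for text in pool:
            probs = posterior(model, text)
            label = max(probs, key=probs.get)
            bucket = confident if probs[label] >= threshold else remaining
            bucket.append((label, text))
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = [text for _, text in remaining]
    return train(labeled), labeled, pool

# Hypothetical hand-labeled seed and unlabeled pool.
SEED = [("a", "alpha alpha beta"), ("b", "delta delta beta")]
UNLABELED = ["alpha alpha epsilon", "delta delta zeta", "beta"]

final_model, labeled, leftover = self_train(SEED, UNLABELED)
print(len(labeled), leftover)
```

The catch is the one already raised: anything the seed model gets confidently wrong is fed straight back in, so a biased seed set tends to reinforce itself. A high threshold and occasional manual spot checks would be the obvious mitigations.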