Hacker News

It looks like you put together your codebook by hand (containing phrases like "http://" and ".com"), right? Wouldn't it be better to automatically extract it from a sample text corpus? Otherwise, you'd be introducing your own biases.


Hello, no, I built it with a Ruby script. I did insert two codes by hand, though: http:// and .com. Everything else is ranked by a weight based on probability and length, computed over the following test data: English books in .txt format and various .html pages from Wikipedia.
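
To give a rough idea of what such a script could look like (this is not the actual script), here is a minimal Ruby sketch. It assumes the weight is simply occurrence count times substring length, which is only one guess at what a probability-and-length weight might be. The name build_codebook, its parameters, and the sample strings are all made up for illustration.

```ruby
# Hypothetical sketch: build a codebook of frequent substrings from sample text.
# Assumption: weight = occurrence count * substring length. The real script's
# formula is not given in the thread.
def build_codebook(samples, size: 8, max_len: 5, forced: ["http://", ".com"])
  counts = Hash.new(0)

  # Count every substring of length 1..max_len in every sample.
  samples.each do |text|
    (1..max_len).each do |len|
      (0..text.length - len).each do |i|
        counts[text[i, len]] += 1
      end
    end
  end

  # Rank by descending weight; break ties alphabetically so the result is stable.
  ranked = counts
    .reject { |sub, _| forced.include?(sub) }
    .sort_by { |sub, n| [-(n * sub.length), sub] }
    .map(&:first)

  # Hand-picked entries go first, then the best-ranked substrings.
  (forced + ranked).first(size)
end

# Tiny stand-in corpus; the real test data was books and Wikipedia pages.
samples = ["the cat and the dog", "the end of the line"]
codebook = build_codebook(samples, size: 6, max_len: 4)
puts codebook.inspect
```

Counting all substrings like this is quadratic in memory for large inputs, so a real corpus would probably need pruning or sampling. The point here is only how the hand-picked entries and the automatically ranked ones can be combined.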



