
For 1, you mean you want to build a parser for arbitrary HTML that your crawler returns. Hard problem, as others have said. My advice:

1. Use an HTML parsing library. Beautiful Soup (Python) or Hpricot (Ruby) are good building blocks; there's a short Beautiful Soup sketch after this list.

2. Practice manually building parsers for a few sites, then see whether that leads you to any insights about how to generalize the process; the second sketch below shows one common shape.

3. Ignore everything else until you do 2. Just use wget as your crawler. Skip the visual interface for now; parsing arbitrary pages is a hard enough problem to bite off on its own.
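To make 1 and 3 concrete, here's a minimal sketch, assuming the modern Beautiful Soup package (bs4) and a page already fetched with wget. The filename and URL are placeholders, not anything specific to your crawler:

    # Fetch first, parse second, e.g.:
    #   wget -O page.html http://example.com/
    from bs4 import BeautifulSoup

    with open("page.html") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # The two things nearly every site-specific parser starts with:
    print(soup.title.string)              # page title
    for a in soup.find_all("a", href=True):
        print(a["href"])                  # every outgoing link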
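And a sketch of where 2 tends to lead: each site gets a hand-written parser behind a common signature, and the shared shape that emerges is your generalization. The site names, selectors, and fields here are hypothetical:

    def parse_site_a(soup):
        return {"title": soup.find("h1").get_text(),
                "body": soup.find("div", id="content").get_text()}

    def parse_site_b(soup):
        return {"title": soup.title.string,
                "body": soup.find("article").get_text()}

    # Dispatch on hostname; the common dict shape is the generalization.
    PARSERS = {"site-a.example": parse_site_a,
               "site-b.example": parse_site_b}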


