
Building a bot to crawl the entire Internet and save it to an index is a fairly straightforward task. As Google and PageRank proved, though, it's the algorithm you use to search that index that's valuable. Any idiot can run grep against said index and return 30,000 results, of which the one you want is on page 53. So writing the crawler to build the index isn't really a competitive advantage.
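Roughly what I mean, as a toy sketch (hypothetical documents, crude term counting instead of anything like PageRank):

    # Hypothetical mini-corpus; contents are made up for illustration.
    docs = {
        "page1": "cheap flights to paris cheap cheap",
        "page2": "paris travel guide",
        "page3": "history of paris street names",
    }

    query = "cheap paris flights"

    # The "grep" approach: every page containing any query term, in no particular order.
    grep_hits = [d for d, text in docs.items()
                 if any(t in text.split() for t in query.split())]

    # A crude ranking: sort pages by how many query-term occurrences they contain.
    ranked = sorted(docs, key=lambda d: -sum(docs[d].split().count(t)
                                             for t in query.split()))

    print(grep_hits)  # an unordered pile of matches
    print(ranked)     # page1 first, because it matches the query best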

Why then reinvent the wheel and spend untold amounts of resources re-crawling the web when Bing will let you use theirs? What secret sauce does doing your own crawl bring to the table?



You're conflating crawling with querying/ranking in a weird way. And: grep - are you serious?

(Yes, you also name-dropped PageRank for some odd reason.)

The thing is, though: you can't easily outsource the crawling and then do the querying/ranking in-house. The inverted index and the various other data structures you need are built directly from the crawler's output, which is a very large amount of data that changes often.
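To sketch what I mean (the crawl here is just a hypothetical dict of URL to extracted text):

    from collections import defaultdict

    # Hypothetical crawler output: URL -> extracted page text.
    crawl = {
        "https://example.com/a": "rust web crawler tutorial",
        "https://example.com/b": "ranking signals for web search",
    }

    # The inverted index maps each term to the set of documents containing it.
    # Its postings are built straight from the crawler's output, so the index
    # has to be rebuilt or updated whenever the crawl data changes.
    index = defaultdict(set)
    for url, text in crawl.items():
        for term in text.lower().split():
            index[term].add(url)

    print(index["web"])  # both URLs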

The outsourcing that is being done is at the "search query to results" level. That is why this is so disappointing.


You're taking his hyperbole about what an idiot would do literally.

Please, relax.


If you don’t need the very latest data, the Common Crawl dataset is a good starting point.
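For instance, you can look up captures through their public CDX index server. A rough sketch (the crawl label CC-MAIN-2024-10 is just an example; check commoncrawl.org for the current list of crawls):

    import json
    import requests

    # Query Common Crawl's public CDX index for captures of a URL.
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2024-10-index",
        params={"url": "example.com", "output": "json"},
        timeout=30,
    )

    for line in resp.text.splitlines():
        record = json.loads(line)
        # Each record points at a WARC file plus a byte offset/length that
        # can be range-requested from data.commoncrawl.org.
        print(record["filename"], record["offset"], record["length"])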



