
Building a bot to crawl the entire Internet and save it to an index is a fairly straightforward task. As Google and PageRank proved, though, it's the algorithm you use to search that index that's valuable. Any idiot can run grep against said index and return 30,000 results, of which the one you want is on page 53. So writing the crawler to build the index isn't really a competitive advantage.
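Roughly what I mean, as a toy sketch (hypothetical documents, crude term counting instead of anything like PageRank):

    # Hypothetical mini-corpus; contents are made up for illustration.
    docs = {
        "page1": "cheap flights to paris cheap cheap",
        "page2": "paris travel guide",
        "page3": "history of paris street names",
    }

    query = "cheap paris flights"

    # The "grep" approach: every page containing any query term, in no particular order.
    grep_hits = [d for d, text in docs.items()
                 if any(t in text.split() for t in query.split())]

    # A crude ranking: sort pages by how many query-term occurrences they contain.
    ranked = sorted(docs, key=lambda d: -sum(docs[d].split().count(t)
                                             for t in query.split()))

    print(grep_hits)  # an unordered pile of matches
    print(ranked)     # page1 first, because it matches the query best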

Why then reinvent the wheel and spend untold amounts of resources re-crawling the web when Bing will let you use theirs? What secret sauce does doing your own crawl bring to the table?



You're conflating crawling with querying/ranking in a weird way. And: grep - are you serious?

(Yes, you also name-dropped PageRank for some odd reason.)

The thing is, though: you can't easily outsource the crawling and then do the querying/ranking in-house. The inverted index and the various other data structures you need are built directly from the crawler's output, which is a very large amount of data that changes often.
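To sketch what I mean (the crawl here is just a hypothetical dict of URL to extracted text):

    from collections import defaultdict

    # Hypothetical crawler output: URL -> extracted page text.
    crawl = {
        "https://example.com/a": "rust web crawler tutorial",
        "https://example.com/b": "ranking signals for web search",
    }

    # The inverted index maps each term to the set of documents containing it.
    # Its postings are built straight from the crawler's output, so the index
    # has to be rebuilt or updated whenever the crawl data changes.
    index = defaultdict(set)
    for url, text in crawl.items():
        for term in text.lower().split():
            index[term].add(url)

    print(index["web"])  # both URLs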

The outsourcing that is being done is at the "search query to results" level. That is why this is so disappointing.


You're taking his hyperbole about what an idiot would do literally.

Please, relax.


If you don’t need the very latest data, the Common Crawl dataset is a good starting point.
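For instance, you can look up captures through their public CDX index server. A rough sketch (the crawl label CC-MAIN-2024-10 is just an example; check commoncrawl.org for the current list of crawls):

    import json
    import requests

    # Query Common Crawl's public CDX index for captures of a URL.
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2024-10-index",
        params={"url": "example.com", "output": "json"},
        timeout=30,
    )

    for line in resp.text.splitlines():
        record = json.loads(line)
        # Each record points at a WARC file plus a byte offset/length that
        # can be range-requested from data.commoncrawl.org.
        print(record["filename"], record["offset"], record["length"])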



