
45% of his total monthly bandwidth used. So if he's getting 50 gigs of traffic a month, the bots account for 22.5 gigs of that.


OK. So how does this speed anything up?

EDIT: To clarify, how does reducing the amount of bandwidth used speed up anything? Why am I being downvoted for this?


Google doesn't know when you will update content on your site so it has to almost constantly hit it with crawler bots. The idea would be to move towards a push-based model wherein sites could push updates to an open index instead of waiting for bots to crawl the site and use up extra bandwidth.

Of course, if this is the primary issue, I don't see why Google couldn't just implement a closed push-based index. When your site updates, you push the changes to Google. The index is still closed, but it solves the bandwidth problem without Google having to open up its resources.
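
For illustration, a minimal sketch of what "push on publish" could look like, assuming a hypothetical index endpoint that accepts a list of changed URLs (the endpoint name and payload format below are made up, not any real API):

    import json
    import urllib.request

    # Hypothetical index endpoint -- not a real service.
    INDEX_ENDPOINT = "https://open-index.example.org/push"

    def push_changed_urls(site, urls):
        """Tell the index which URLs changed, instead of waiting to be crawled."""
        payload = json.dumps({"site": site, "changed": urls}).encode("utf-8")
        req = urllib.request.Request(
            INDEX_ENDPOINT,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status  # e.g. 202 Accepted by the hypothetical index

    # A site would call this from its publishing hook:
    # push_changed_urls("example.com", ["https://example.com/blog/new-post"])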


Google wouldn't trust the sites to do this correctly. So even if they did provide some push mechanism I think they'd still send bots to your server to be sure.


That's quite possible, but in that case an open index would be even more useless since Google would just have to duplicate it internally anyway.


It seems like the web server should be able to report what has changed since a given timestamp. The first step when crawling a site would be to do something like an HTTP HEAD/status request; if that succeeds, the bot could crawl only the pages that have changed. Or, if the bot did not trust the site, it could always do the full crawl and verify the result.
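
HTTP already has a per-page version of this in conditional requests. A rough sketch of a crawler that sends If-Modified-Since and skips anything that comes back 304 Not Modified (URL and timestamp below are placeholders):

    import urllib.error
    import urllib.request

    def fetch_if_changed(url, last_crawled):
        """Re-download a page only if the server says it changed since last_crawled.

        last_crawled is an HTTP date string, e.g. "Sat, 29 Oct 2011 19:43:31 GMT".
        """
        req = urllib.request.Request(url, headers={"If-Modified-Since": last_crawled})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()   # 200: the page changed, re-index it
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None          # Not Modified: skip it, no body was transferred
            raise

    # body = fetch_if_changed("https://example.com/", "Sat, 29 Oct 2011 19:43:31 GMT")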


That sounds rather like the DNS "NOTIFY" signal, which a master sends to its slaves (secondary nameservers) to trigger a zone transfer when DNS data has been updated. Without that, secondary servers are forced to poll and re-download the zone data, e.g. every hour.
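
For comparison, a rough sketch of sending a DNS NOTIFY by hand with the dnspython library (zone name and secondary address are placeholders; in practice the primary nameserver, e.g. BIND with its notify option, sends these automatically):

    import dns.flags
    import dns.message
    import dns.opcode
    import dns.query
    import dns.rdatatype

    def send_notify(zone, secondary_ip):
        """Tell a secondary nameserver that the zone's data has changed."""
        msg = dns.message.make_query(zone, dns.rdatatype.SOA)
        msg.set_opcode(dns.opcode.NOTIFY)   # turn the query into a NOTIFY
        msg.flags &= ~dns.flags.RD          # NOTIFY is not a recursive query
        msg.flags |= dns.flags.AA           # the primary answers authoritatively
        return dns.query.udp(msg, secondary_ip, timeout=5)

    # send_notify("example.com.", "192.0.2.1")  # placeholder zone and address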


It speeds up the web because we can't keep our entire data set in cache -- "we" being anyone with a large data set, and "can't" being defined by available hardware (costs) and time (lost working on "real" problems). Google will crawl several million pages a day, which can be quite costly to keep hot in a cache -- not to mention, again, the resources, time, and money not being spent on improving general speed.

Considering there are dozens of bots that will crawl one's site in "random" order, they all begin to wreak havoc on caches, since the robots and the humans don't browse in a uniform pattern.
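
As a toy illustration of that effect (all page counts and ratios below are made up), a small simulation where humans mostly re-read a few hot pages while a crawler walks the whole site, comparing the humans' LRU hit rate with and without the crawler traffic mixed in:

    import random
    from collections import OrderedDict

    CACHE_SIZE = 500
    HOT_PAGES = 100        # pages humans actually read
    SITE_PAGES = 10_000    # pages a crawler walks through

    class LRUCache:
        def __init__(self, size):
            self.size = size
            self.store = OrderedDict()

        def get(self, key):
            """Return True on a cache hit; insert and maybe evict on a miss."""
            if key in self.store:
                self.store.move_to_end(key)
                return True
            self.store[key] = True
            if len(self.store) > self.size:
                self.store.popitem(last=False)   # evict least recently used
            return False

    def human_hit_rate(bot_requests_per_human):
        random.seed(0)
        cache, hits, total, bot_page = LRUCache(CACHE_SIZE), 0, 5_000, 0
        for _ in range(total):
            hits += cache.get(("hot", random.randrange(HOT_PAGES)))  # human request
            for _ in range(bot_requests_per_human):                  # crawler requests
                cache.get(("cold", bot_page % SITE_PAGES))
                bot_page += 1
        return hits / total

    print("human hit rate, no bots:   %.0f%%" % (100 * human_hit_rate(0)))
    print("human hit rate, with bots: %.0f%%" % (100 * human_hit_rate(10)))

The crawler requests keep pushing the humans' hot pages out of the cache, so the humans' hit rate drops sharply even though their own browsing pattern never changed.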


You are right. I guess the idea is that it would speed up the internet because less traffic would yield less congestion, or faster server responses.

People who downvoted you either didn't understand or thought your comment wasn't pertinent.


His hope is that sites would only need to be indexed once. It would only be the 4.9GB for Google and he would save the 4.8GB used by Yahoo and the 11.27GB used by the other search engines.
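
As a quick back-of-the-envelope check on those figures (the GB numbers are simply the ones quoted above):

    # Monthly crawler bandwidth figures quoted above, in GB
    google, yahoo, others = 4.9, 4.8, 11.27

    total_bots = google + yahoo + others   # ~20.97 GB of crawler traffic
    saved = yahoo + others                 # ~16.07 GB saved if one shared crawl sufficed
    print(f"bots total: {total_bots:.2f} GB, "
          f"saved with a single shared crawl: {saved:.2f} GB "
          f"({saved / total_bots:.0%} of crawler traffic)")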



