
It is ridiculous to me that people view public web pages as something that shouldn't be archived; if anything, archiving provides illuminating snapshots of the state of the web at particular dates.

The archive.org team does follow robots.txt, and I believe they remove content retroactively, meaning that if you add a robots.txt to your site, the old content will be deleted from the archive (which I think sucks).



> The archive.org team does follow robots.txt, and I believe they remove content retroactively, meaning that if you add a robots.txt to your site, the old content will be deleted from the archive (which I think sucks).

Indeed, especially since most domain parking garbage sites seem to have robots.txt files for some crazy reason.


> Indeed, especially since most domain parking garbage sites seem to have robots.txt files for some crazy reason.

Presumably to avoid being plagued (in terms of load and bandwidth costs) by the numerous crawling bots looking to update their caches of pages that no longer exist on those domains.


Serving 404s is actually super cheap.


It depends on the setup.

I've seen a CMS brought almost to its knees because the previous owner of that IP address had a site with lots of distinct pages on it. Since every page in the CMS was stored in a database, it took a DB lookup to find out whether each incoming URL existed or not. Caching/Varnish wouldn't help either: there were hundreds of thousands of different incoming URLs, and none of them would ever be in the cache because they didn't exist.
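
Something like this, in spirit (a minimal hypothetical sketch, not that CMS's actual code): every request for an unknown path still costs a database round trip, and a front cache like Varnish only helps for URLs it has already served once.

    import sqlite3

    # In-memory stand-in for the CMS database (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO pages VALUES ('/about', '<h1>About</h1>')")

    def resolve(path):
        # One DB round trip per request, whether or not the page exists;
        # the one-off dead URLs never repeat, so a reverse proxy can't cache them.
        row = conn.execute("SELECT body FROM pages WHERE url = ?", (path,)).fetchone()
        return (200, row[0]) if row else (404, "Not Found")

    print(resolve("/about"))         # (200, '<h1>About</h1>')
    print(resolve("/old-site/xyz"))  # (404, 'Not Found') -- still hit the DB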

About 20% of the hits to one site I look after are 404s because they're for URLs from the previous site hosted on that IP address. Luckily the vast majority of those URLs share a specific prefix, so a simple rule in the Apache config 404s them without having to go to disk to check for the existence of any files. It still counts against my bandwidth utilisation, though (both the incoming request and the outgoing 404).
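
For what it's worth, the rule is essentially a one-liner like the following (the prefix here is a placeholder; the real one differs):

    # mod_alias: anything under the old site's prefix gets an immediate 404,
    # with no filesystem or database check. "/old-site/" is a made-up prefix.
    RedirectMatch 404 "^/old-site/"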


> The archive.org team does follow robots.txt, and I believe they remove content retroactively, meaning that if you add a robots.txt to your site, the old content will be deleted from the archive (which I think sucks).

Every time the "Change Facebook back to the way it was!" brigade came out, I would link to the Wayback Machine's copy of facebook.com from 2005 and say "Is this what you want??" Now I can't do that anymore because of this stupid robots.txt policy.


I hope they keep a backup of this old content. This robots.txt policy is crap; robots.txt should not be applied retroactively when the site owner has changed.



