I've donated to the Internet Archive and I'm a big fan of Jason Scott, but the Internet Archive is not an archive. Any site on it can go down without warning, thanks to the fact that they apply current robots.txt rules to past archives. Once a domain squatter or regretful admin forbids archivebot (or crawlers in general), archive.org's copy goes down.
This has ruined many supposedly permanent links. The infamous "She's a Flight Risk" blog from a decade ago is down.[1] My first website is missing. Even public domain stuff like NASA's report on nuclear propulsion is gone.[2]
With just a small rule change (obey robots.txt as it stood at the time of crawling, rather than retroactively), they could eliminate the risk of already-archived pages disappearing. Instead, we're stuck with a slower version of the link rot we're used to. It doesn't stop me from supporting them, but it's incredibly frustrating.
1. https://web.archive.org/web/*/http://www.aflightrisk.blogspo...
2. http://web.archive.org/web/20121029225832/http://ntrs.nasa.g...
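To make the suggested rule concrete, here's a minimal sketch (not the Wayback Machine's actual code; the function names are made up): the robots.txt decision happens once, at crawl time, and is stored with the snapshot, so a later robots.txt change can't hide old captures.

    from datetime import datetime, timezone
    from urllib.parse import urlsplit
    import urllib.robotparser

    def crawl_snapshot(url, fetch_page):
        """Fetch `url` only if the site's *current* robots.txt allows it."""
        parts = urlsplit(url)
        rp = urllib.robotparser.RobotFileParser(
            f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        if not rp.can_fetch("archivebot", url):
            return None  # excluded at crawl time; nothing is stored
        return {
            "url": url,
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "body": fetch_page(url),
        }

    def serve_snapshot(snapshot):
        # Serving never re-reads robots.txt, so a domain squatter's new
        # rules can't retroactively take an existing capture offline.
        return snapshot["body"]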
The issue is that archive.org exists in a quasi-legal gray area. Imagine that you decided to selectively archive comic strips that are accessible online and created a "Comics Library" site. I expect you'd get some cease and desist letters.
Archive.org has mostly gotten away with what they do based on the fact that they try to be comprehensive, they don't post ads or charge for access, AND that they won't display your site if you ask them (through robots.txt). Yes it's frustrating but barring some legal ruling that cements their right to archive and offer access to copyrighted works, I don't expect it to change.
Just for the record, we are an archive, just one with some policies (that I myself have been working to have tuned) that you might not like. I understand that you don't like the policy.
Wow, the Jason Scott! While you're here, can you please answer two questions:
Does the Wayback Machine retain data excluded by new robots.txt rules? (In other words: If you change your policy in the future, can the change be retroactive?)
Why does archive.org keep this policy? It drastically limits what The Wayback Machine could be. I've searched quite a bit, but I haven't found a satisfying answer.[1]
I don't think you can claim that when second parties dictate your retention of existing data.[0]
An archive is a place to which I can confidently go to retrieve a document in line with the retainer's retention and access policy. If the retainer doesn't control the retention of existing documents then... it's not really an archive. It's just an ephemeral store which may or may not still have the document in which I'm interested.
[0] in terms of raw etymology you are correct, in that arkheia just meant 'public records'. But having to regress to the Greek origin is a bit of a stretch.
Your definition of archive is somewhat eccentric. Lots of archives limit access, or remain entirely "dark", or cull their holdings when legal or budgetary limits are hit. They're still 'archives'.
So much content has been lost, and sadly the Internet Archive is not an archive while this policy exists, evicting all historical content whenever an updated robots.txt is found.
For quite some time I have been donating 1TB-2TB/month of bandwidth in support of Jason Scott and the ArchiveTeam. This has been my way of supporting the Archive.org project, and I would consider increasing the bandwidth donation in exchange for a resolution to the robots.txt bit-rot.
I won't link to the leaderboard but for reference I'm currently working on the TwitPic project and am within the top 10; same username as HN.
As someone who once helped maintain the exclusion-mechanisms, my personal opinion is:
The retroactive-robots.txt policy made sense originally as a way of reducing risk from angry-rightsholders, while minimizing burdens on staff time, and had little downside when the history-of-the-web was short, and most domains were still under their original ownership. It was a toggle any webmaster could throw, with no support/maintenance effort required at IA.
It obviously sucks now, more than a decade after it was adopted as a quick fix, but would take some dedicated policy and technical design to gracefully replace. For example, many rightsholders may be relying on the old behavior. But, the IA hasn't yet been able to prioritize creation of a new scheme.
A new process could involve a policy where someone claiming to be the original site/content rightsholder asserts that as of some boundary date, for example when they ceded or sold the domain, later robots.txt should not affect earlier content. (Such a boundary could also be clearly-indicated in Wayback summary/calendar pages.) Then, presumption would flip to showing the earlier content, unless some other rightsholder (such as the current domain-holder) formally claims ("under penalty of perjury", etc) that it's their material and they wish the block to stay in place.
It'd be a bit in overall shape like the DMCA takedown/counter-notify procedure, as if the robots.txt was a sloppy takedown request. Squatters making a false claim to ownership of older content would be forced to go on record and take some risk. Ideally any dispute could then proceed between the other two parties, leaving the IA out of it. Again, this would be similar to how the DMCA tries to leave ISPs/caches/hosters out of takedown legal battles.
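To make the flipped presumption concrete, here's a rough sketch of the decision rule; the field names and the claim handling are illustrative only, not an actual design.

    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class Snapshot:
        captured_on: date
        blocked_by_current_robots: bool

    @dataclass
    class Claims:
        # Original rightsholder's asserted boundary (e.g. the date the domain
        # was sold); robots.txt changes after it shouldn't reach back.
        boundary: Optional[date] = None
        # Current domain holder formally claims the old content and wants the
        # block kept (like a DMCA counter-notice, "under penalty of perjury").
        current_holder_claims_old_content: bool = False

    def snapshot_visible(s, c):
        if not s.blocked_by_current_robots:
            return True                   # nothing to decide
        if c.boundary is None or s.captured_on >= c.boundary:
            return False                  # today's behaviour: robots.txt wins
        # Pre-boundary capture: presumption flips to showing it,
        # unless the current domain holder has formally contested.
        return not c.current_holder_claims_old_content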
I doubt most new domain-owners are strongly or consciously trying to hide the past; it's usually an automatic choice made for other reasons. A few might be holding the history hostage to make ownership of the domain more valuable – "buy it to re-access the past". Some others might be embarrassed by the previous, unaffiliated content – but a clearer Wayback UI indicating changes-of-management could help allay that concern. Unfortunately, for both good and bad reasons, there is no easy/reliable/canonical source of all domain-ownership info over time.
From asking around, they do retain the original data, but exclude it from public results following a robots.txt exclusion. It's something I do wish they would reconsider, though, as it limits what the archive can be used for.
I hope your claim about them keeping the data is correct, but... If a tree falls in the woods and robots.txt later excludes it, does archive.org make a sound?
While the distinction may be useful to someone with access to the data, it doesn't matter to me. From my point of view, the URL goes from available to unavailable. It's indistinguishable from typical link rot.
The Internet Archive has kept their robots.txt policy for over a decade now, despite constant requests to change it. I doubt they'll change it any time soon.
Maybe if it gets enough attention, and/or the Internet Archive people get enough e-mails about this problem (Wayback Machine obeying robots.txt) then they'll change their mind.
They need a whole $75/person, as compared to Wikipedia's $3/person, and they're not resorting to shouty, loud boxes that open modals on mobile platforms.
I don't know about mobile (I've only noticed a small banner at the bottom of the page on WP), but it appears they're using the exact same style of banner that Wikimedia uses? They even explicitly give thanks to Wikimedia in the footer of their fundraising banner. I think both services are great, so I don't really mind if either of them decides to optimize their conversion rates, just like any other startup would.
Archive.org has been an absolute godsend for historians and others who use rare books. Case in point: when I was doing PhD research in Lisbon three years ago, I had to search several rare book shops and ended up paying 80 euros for a very rare 19th century Portuguese book I needed for my research. Here it is on archive.org in multiple editions, all text-searchable: https://archive.org/search.php?query=duarte%20ribeiro%20de%2...
I hate to sound bitter, but I've been a donor for ~8 years, and the one time I asked for help (really: filed a bug report re: an archive they said they had but which didn't actually resolve & which I needed access to) they did not get back to me, ever, despite numerous support requests and additional donations.
I'm not sure whether that means they need more funding or whether they're simply unresponsive, but it definitely didn't help me solve my problems with a trademark troll.
Sounds reasonable. I did just have to use them as a "backup of last resort" to recover a web site. A little scraping turned "sorry, it's gone" into "web site is back online".
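For anyone needing to do the same, a rough sketch of that kind of scraping using the public Wayback CDX API (example.com stands in for the real site, and writing the files to disk is left out):

    import json
    import urllib.request

    SITE = "example.com"

    # List successful captures, one entry per URL, via the CDX API.
    cdx = ("http://web.archive.org/cdx/search/cdx"
           f"?url={SITE}/*&output=json&filter=statuscode:200&collapse=urlkey")
    rows = json.load(urllib.request.urlopen(cdx))[1:]   # row 0 is the header

    for row in rows:
        timestamp, original = row[1], row[2]
        # The "id_" flag returns the original bytes without the Wayback toolbar.
        snapshot_url = f"http://web.archive.org/web/{timestamp}id_/{original}"
        body = urllib.request.urlopen(snapshot_url).read()
        # ...write `body` to a local path derived from `original`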
This is something I am happy to donate for. These guys have been plugging away quietly for a long time to give us a record that would otherwise literally disappear into thin air.
You can get your site taken down, if that's your worry. But like generally on the internet, your best recourse is to control your online presence at the source. Don't post regrettable things publicly, and you should be fine. If other people do it to you, your issue really would be with them; fortunately, if something terrible is done, you've the Archive to prove the damage and back you up.
It really is a different beast from your normal panopticon internet tracking. This is more akin to showing up in the newspaper, rather than a surveillance camera.
Well, we have access to letters written by Thomas Jefferson and letters written by Emily Dickinson they probably didn't think we should have. (They were private letters, after all.) But we have a more comprehensive view of these people because of it.