I've donated to the Internet Archive and I'm a big fan of Jason Scott, but the Internet Archive is not an archive. Any site on it can go down without warning, thanks to the fact that they apply current robots.txt rules to past archives. Once a domain squatter or regretful admin forbids archivebot (or crawlers in general), archive.org's copy goes down.
This has ruined many supposedly permanent links. The infamous "She's a Flight Risk" blog from a decade ago is down.[1] My first website is missing. Even public domain stuff like NASA's report on nuclear propulsion is gone.[2]
With just a small rule change (obey robots.txt as it stood at the time of crawling, rather than retroactively), they could eliminate the risk of already-archived pages disappearing. Instead, we're stuck with a slower version of the link rot we're used to. It doesn't stop me from supporting them, but it's incredibly frustrating.
1. https://web.archive.org/web/*/http://www.aflightrisk.blogspo...
2. http://web.archive.org/web/20121029225832/http://ntrs.nasa.g...
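To make the suggested rule concrete, here's a minimal sketch (not the Wayback Machine's actual code; the function names are made up): the robots.txt decision happens once, at crawl time, and is stored with the snapshot, so a later robots.txt change can't hide old captures.

    from datetime import datetime, timezone
    from urllib.parse import urlsplit
    import urllib.robotparser

    def crawl_snapshot(url, fetch_page):
        """Fetch `url` only if the site's *current* robots.txt allows it."""
        parts = urlsplit(url)
        rp = urllib.robotparser.RobotFileParser(
            f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        if not rp.can_fetch("archivebot", url):
            return None  # excluded at crawl time; nothing is stored
        return {
            "url": url,
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "body": fetch_page(url),
        }

    def serve_snapshot(snapshot):
        # Serving never re-reads robots.txt, so a domain squatter's new
        # rules can't retroactively take an existing capture offline.
        return snapshot["body"]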
The issue is that archive.org exists in a quasi-legal gray area. Imagine that you decided to selectively archive comic strips that are accessible online and created a "Comics Library" site. I expect you'd get some cease and desist letters.
Archive.org has mostly gotten away with what they do based on the fact that they try to be comprehensive, they don't post ads or charge for access, AND that they won't display your site if you ask them (through robots.txt). Yes it's frustrating but barring some legal ruling that cements their right to archive and offer access to copyrighted works, I don't expect it to change.
Just for the record, we are an archive, just one with some policies (that I myself have been working to have tuned) that you might not like. I understand that you don't like the policy.
Wow, the Jason Scott! While you're here, can you please answer two questions:
Does the Wayback Machine retain data excluded by new robots.txt rules? (In other words: If you change your policy in the future, can the change be retroactive?)
Why does archive.org keep this policy? It drastically limits what The Wayback Machine could be. I've searched quite a bit, but I haven't found a satisfying answer.[1]
I don't think you can claim that when second parties dictate your retention of existing data.[0]
An archive is a place to which I can confidently go to retrieve a document in line with the retainer's retention and access policy. If the retainer doesn't control the retention of existing documents then... it's not really an archive. It's just an ephemeral store which may or may not still have the document in which I'm interested.
[0] in terms of raw etymology you are correct, in that arkheia just meant 'public records'. But having to regress to the Greek origin is a bit of a stretch.
Your definition of archive is somewhat eccentric. Lots of archives limit access, or remain entirely "dark", or cull their holdings when legal or budgetary limits are hit. They're still 'archives'.
So much content has been lost, and sadly the Internet Archive is not an archive while this policy exists, evicting all historical content whenever an updated robots.txt is found.
For quite some time I have been donating 1TB-2TB/month of bandwidth in support of Jason Scott and the ArchiveTeam. This has been my way of supporting the Archive.org project, and I would consider increasing the bandwidth donation in exchange for a resolution to the robots.txt bit-rot.
I won't link to the leaderboard but for reference I'm currently working on the TwitPic project and am within the top 10; same username as HN.
As someone who once helped maintain the exclusion-mechanisms, my personal opinion is:
The retroactive-robots.txt policy made sense originally as a way of reducing risk from angry-rightsholders, while minimizing burdens on staff time, and had little downside when the history-of-the-web was short, and most domains were still under their original ownership. It was a toggle any webmaster could throw, with no support/maintenance effort required at IA.
It obviously sucks now, more than a decade after it was adopted as a quick fix, but would take some dedicated policy and technical design to gracefully replace. For example, many rightsholders may be relying on the old behavior. But, the IA hasn't yet been able to prioritize creation of a new scheme.
A new process could involve a policy where someone claiming to be the original site/content rightsholder asserts that as of some boundary date, for example when they ceded or sold the domain, later robots.txt should not affect earlier content. (Such a boundary could also be clearly-indicated in Wayback summary/calendar pages.) Then, presumption would flip to showing the earlier content, unless some other rightsholder (such as the current domain-holder) formally claims ("under penalty of perjury", etc) that it's their material and they wish the block to stay in place.
It'd be a bit in overall shape like the DMCA takedown/counter-notify procedure, as if the robots.txt was a sloppy takedown request. Squatters making a false claim to ownership of older content would be forced to go on record and take some risk. Ideally any dispute could then proceed between the other two parties, leaving the IA out of it. Again, this would be similar to how the DMCA tries to leave ISPs/caches/hosters out of takedown legal battles.
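To make the flipped presumption concrete, here's a rough sketch of the decision rule; the field names and the claim handling are illustrative only, not an actual design.

    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class Snapshot:
        captured_on: date
        blocked_by_current_robots: bool

    @dataclass
    class Claims:
        # Original rightsholder's asserted boundary (e.g. the date the domain
        # was sold); robots.txt changes after it shouldn't reach back.
        boundary: Optional[date] = None
        # Current domain holder formally claims the old content and wants the
        # block kept (like a DMCA counter-notice, "under penalty of perjury").
        current_holder_claims_old_content: bool = False

    def snapshot_visible(s, c):
        if not s.blocked_by_current_robots:
            return True                   # nothing to decide
        if c.boundary is None or s.captured_on >= c.boundary:
            return False                  # today's behaviour: robots.txt wins
        # Pre-boundary capture: presumption flips to showing it,
        # unless the current domain holder has formally contested.
        return not c.current_holder_claims_old_content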
I doubt most new domain-owners are strongly or consciously trying to hide the past; it's usually an automatic choice made for other reasons. A few might be holding the history hostage to make ownership of the domain more valuable – "buy it to re-access the past". Some others might be embarrassed by the previous, unaffiliated content – but a clearer Wayback UI indicating changes-of-management could help allay that concern. Unfortunately, for both good and bad reasons, there is no easy/reliable/canonical source of all domain-ownership info over time.
From asking around, they do retain the original data, but exclude it from public results following a robots.txt exclusion. It's something I do wish they would reconsider, though, as it limits what the archive can be used for.
I hope your claim about them keeping the data is correct, but... If a tree falls in the woods and robots.txt later excludes it, does archive.org make a sound?
While the distinction may be useful to someone with access to the data, it doesn't matter to me. From my point of view, the URL goes from available to unavailable. It's indistinguishable from typical link rot.
The Internet Archive has kept their robots.txt policy for over a decade now, despite constant requests to change it. I doubt they'll change it any time soon.
Maybe if it gets enough attention, and/or the Internet Archive people get enough e-mails about this problem (Wayback Machine obeying robots.txt) then they'll change their mind.
They need a whole $75/person, as compared to Wikipedia's $3/person, and they're not resorting to shouty, loud boxes that open modals on mobile platforms.
I don't know about mobile (I've only noticed a small banner at the bottom of the page on WP), but it appears they're using the exact same style of banner that Wikimedia uses? They even explicitly give thanks to Wikimedia in the footer of their fundraising banner. I think both services are great, so I don't really mind if either of them decides to optimize their conversion rates, just like any other startup would.
Archive.org has been an absolute godsend for historians and others who use rare books. Case in point: when I was doing PhD research in Lisbon three years ago, I had to search several rare book shops and ended up paying 80 euros for a very rare 19th century Portuguese book I needed for my research. Here it is on archive.org in multiple editions, all text-searchable: https://archive.org/search.php?query=duarte%20ribeiro%20de%2...
I hate to sound bitter, but I've been a donor for ~8 years, and the one time I asked for help (really: filed a bug report re: an archive they said they had but which didn't actually resolve & which I needed access to) they did not get back to me, ever, despite numerous support requests and additional donations.
I'm not sure whether that means they need more funding or whether they're simply unresponsive, but it definitely didn't help me solve my problems with a trademark troll.
Sounds reasonable. I did just have to use them as a "backup of last resort" to recover a web site. A little scraping turned "sorry, it's gone" into "web site is back online".
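For anyone needing to do the same, a rough sketch of that kind of scraping using the public Wayback CDX API (example.com stands in for the real site, and writing the files to disk is left out):

    import json
    import urllib.request

    SITE = "example.com"

    # List successful captures, one entry per URL, via the CDX API.
    cdx = ("http://web.archive.org/cdx/search/cdx"
           f"?url={SITE}/*&output=json&filter=statuscode:200&collapse=urlkey")
    rows = json.load(urllib.request.urlopen(cdx))[1:]   # row 0 is the header

    for row in rows:
        timestamp, original = row[1], row[2]
        # The "id_" flag returns the original bytes without the Wayback toolbar.
        snapshot_url = f"http://web.archive.org/web/{timestamp}id_/{original}"
        body = urllib.request.urlopen(snapshot_url).read()
        # ...write `body` to a local path derived from `original`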
This is something I am happy to donate for. These guys have been plugging away quietly for a long time to give us a record that would otherwise literally disappear into thin air.
You can get your site taken down, if that's your worry. But like generally on the internet, your best recourse is to control your online presence at the source. Don't post regrettable things publicly, and you should be fine. If other people do it to you, your issue really would be with them; fortunately, if something terrible is done, you've the Archive to prove the damage and back you up.
It really is a different beast from your normal panopticon internet tracking. This is more akin to showing up in the newspaper, rather than a surveillance camera.
Well, we have access to letters written by Thomas Jefferson and letters written by Emily Dickinson they probably didn't think we should have. (They were private letters, after all.) But we have a more comprehensive view of these people because of it.