
it's easy to retrieve google caches with a ruby script. here's one i used in the past:

http://pastie.org/739757

edit: if you use this, add a sleep! whoops. i didn't get banned though, shrug.
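For anyone who can't reach the pastie, here is a rough sketch of the "add a sleep" idea. This is not the script linked above, just a minimal illustration: `fetch_throttled` and its `delay` parameter are made-up names, and the actual fetching is left to whatever block you pass in.

```ruby
# Hypothetical helper (not from the linked script): runs the given block
# once per URL and pauses between calls so requests aren't fired back to back.
def fetch_throttled(urls, delay: 2.0)
  results = {}
  urls.each_with_index do |url, i|
    sleep(delay) if i > 0        # pause between requests, not before the first
    results[url] = yield(url)    # the block decides how a URL is fetched
  end
  results
end
```

With `require 'net/http'`, something like `fetch_throttled(urls) { |u| Net::HTTP.get(URI(u)) }` would do the real fetching. The 2 second default is a guess, not a known safe rate.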



Warrick works better for that, at least: http://warrick.cs.odu.edu/warrick.html

It sleeps in between queries, so you don't get temporarily banned from Google.

I think it's not currently working for Yahoo or MSN/Bing. Fixing that might be easier than doing everything else manually.

Edit: I've gotten a response from Frank McCown, creator of Warrick, that he's looking into it.

Edit 2: He'll try to update it next week.


Warrick looks like exactly what he needs.


His biggest problem appears to be the images (and possibly other resources included in the pages). It's pretty much a given he'll be able to recover the text itself.


The permanent loss of the images makes it a greater tragedy since half the content in any given post of his consists of images.


There are many, many images in the pinboard archive, a couple of hundred posts' worth. I don't know if he also has other sources from which to retrieve them; he doesn't seem to have grabbed them from pinboard yet. But a good chunk of his stuff will be recovered, images and all.


He wrote a blog post (maybe more than one) about how he was hosting his images on Amazon S3. Did he not follow through, or did he switch away from that?



