I love tarsnap (I'm using it in production now). But, I wish it were a little easier to use ...
For example, each of my servers creates a backup named something of the form 'machineName-epochTime' every night. I wish there were some built-in way to delete all but the last N backups. I ended up writing a small Python script to take care of it, but it relies on text-munging and seems brittle.
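To make the "keep the last N" idea concrete, here is a minimal sketch of the selection logic only, assuming the 'machineName-epochTime' naming described above. The function name `archives_to_delete` and the default of 7 are made up for illustration; this is not the script mentioned in the comment. It only decides which names to delete and never touches a key or runs a deletion itself.

```python
import re

# Assumed naming scheme from the comment above: "<machineName>-<epochTime>".
ARCHIVE_RE = re.compile(r"^(?P<machine>.+)-(?P<epoch>\d+)$")


def archives_to_delete(names, keep=7):
    """Return archive names to delete, keeping the newest `keep` per machine."""
    if keep < 0:
        raise ValueError("keep must be non-negative")

    by_machine = {}
    for raw in names:
        name = raw.strip()
        match = ARCHIVE_RE.match(name)
        if match is None:
            # Never schedule anything for deletion that we cannot parse.
            continue
        by_machine.setdefault(match["machine"], []).append(
            (int(match["epoch"]), name)
        )

    doomed = []
    for machine in sorted(by_machine):
        # Sort by the numeric epoch, newest first, and drop everything past `keep`.
        newest_first = sorted(by_machine[machine], reverse=True)
        doomed.extend(name for _, name in newest_first[keep:])
    return doomed
```

The input could be the lines printed by `tarsnap --list-archives`, and each returned name could then be handed to `tarsnap -d -f <name>`. Printing the list and reading it over before deleting anything is a cheap safeguard against the kind of mistake the comment worries about.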
Something that would automate a 'grandfather, father, son' style rotation scheme would be appreciated too.
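A rough sketch of what such a rotation could look like, under some assumptions of mine: backups are identified by their epoch timestamps, "son/father/grandfather" map to the newest backup per day, per ISO week and per calendar month (in UTC), and the name `gfs_keep` and its default counts are hypothetical. This is one common reading of the scheme, not anything Tarsnap provides.

```python
from datetime import datetime, timezone


def gfs_keep(epochs, daily=7, weekly=4, monthly=12):
    """Return the set of epoch timestamps a grandfather-father-son scheme keeps.

    Keeps the newest backup in each of the most recent `daily` days, `weekly`
    ISO weeks and `monthly` calendar months that contain at least one backup.
    Everything not in the returned set is a candidate for deletion.
    """
    def newest_per_bucket(bucket_of, limit):
        kept = {}
        for epoch in sorted(epochs, reverse=True):
            bucket = bucket_of(datetime.fromtimestamp(epoch, tz=timezone.utc))
            if bucket not in kept:
                if len(kept) == limit:
                    break  # already have enough buckets of this kind
                kept[bucket] = epoch  # first hit is the newest in its bucket
        return set(kept.values())

    return (
        newest_per_bucket(lambda t: t.date(), daily)
        | newest_per_bucket(lambda t: tuple(t.isocalendar()[:2]), weekly)
        | newest_per_bucket(lambda t: (t.year, t.month), monthly)
    )
```

The three tiers overlap on purpose: the newest backup is usually the daily, weekly and monthly pick at once, so the number of archives kept is at most `daily + weekly + monthly` and typically a little less.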
In principle, it isn't too hard for me to script any of this functionality - but maybe I'll make a mistake in the portion of my code that handles my Tarsnap key, or the portion that deletes backups. I'd rather pay someone smarter than me (e.g., cpercival) to specialize in making my backups work. That is, after all, the basic premise behind the Tarsnap business model.
Would tarsnap be a solution for long-term archival of logfile data? I'm working on a data mining project of the "Let's store everything & figure out what we do with it later" type. My servers generate about 2GB of data (zipped) every day. We plan to store an 'analysis' dataset of the last 3 months on S3 and run a batch of Hadoop/Pig/MapReduce jobs every night on EC2.
My question: what would be the most cost-efficient long-term archival solution (I can live with slow access times) for Apache logs? Does tarsnap offer any benefit here? Are there any compression solutions specific to Apache logs? Other ideas?
First, this is not that much data (~180GB). Is there a particular reason not to just throw it on a hard disk on some machine that doesn't do too much during the night and write a trivial Perl script?
Secondly, (g)zip may not be the best solution here. A quick unscientific test on ~3MB of Apache log data (in the default Common Log Format): gzip or zip produce ~240KB of data, xz (formerly lzma) gets it down to ~80KB (using -9e) or ~96KB (using the default option).
In my quick unscientific test, xz can decompress data about half as quickly as gzip and about ten times faster than bzip2. It's very likely able to keep up with your disk.
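If you want to repeat that kind of comparison on your own logs, something like the following would do. It is only a sketch using Python's standard-library bindings rather than the command-line tools, so the absolute numbers may differ somewhat from the figures quoted above; the function name `compare_compressors` and the labels are mine.

```python
import bz2
import gzip
import lzma


def compare_compressors(data: bytes) -> dict:
    """Return {compressor label: compressed size in bytes} for `data`."""
    candidates = {
        "gzip -9": lambda d: gzip.compress(d, compresslevel=9),
        "bzip2 -9": lambda d: bz2.compress(d, compresslevel=9),
        # lzma.compress() writes the .xz container format by default.
        "xz (default)": lambda d: lzma.compress(d),
        # Preset 9 with the "extreme" flag, roughly what `xz -9e` asks for.
        "xz -9e": lambda d: lzma.compress(d, preset=9 | lzma.PRESET_EXTREME),
    }
    return {label: len(compress(data)) for label, compress in candidates.items()}
```

Feed it the raw bytes of a representative log file (for example a single day's uncompressed access log) and compare the sizes against the original length. Timing decompression separately would be needed to check the speed claims, since this sketch only measures size.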
Rsync.net has offered similar services for quite a while now. It's a little more product-ized, they have more options for service, and they guarantee you can talk to an engineer at any time.
It's a little more expensive than Dr. Percival -- or maybe not, depending on your access patterns and volume discounts. But I'm a bit leery of trusting an organization that is just one guy who already has a day job.
Oh, I'm sorry, I made a silly assumption. From the "Dr." handle, and the FreeBSD contributions, I assumed you were an academic who had a sort of side business going.
The doctorate just means that I spent years at a university in the past, not that I'm still at a university. :-)
I am still very academic-minded, and serve my alma mater in a voluntary capacity on a few committees, but Tarsnap is absolutely a real business and is what I spend the vast majority of my time on.
Actually, I went looking for this exact product, but for Windows, earlier. Sort of like R1Soft, but for Windows + encrypted with a key only I know. (I don't trust mozy/carbonite/etc.)
Yes, several people are using Tarsnap via Cygwin. It's not something I recommend to the general public, but I imagine the readership of Hacker News wouldn't have any difficulty with this.
I've been using it on Cygwin since the beginning. Frankly, it's fairly straightforward. The 'configure, make, make install' works out of the box and afterwards you do the same thing you'd do on a *nix box. I've had no trouble at all.
Pre-compiled clients are on my to-do list. Obviously I want to do this in a systematic manner so that my release process is more repeatable than "find boxes running the following operating systems: ... and borrow them for a few minutes to build a binary".
I'm not sure if it's possible to build binaries using cygwin which will then run without having cygwin installed. If not, this would turn into "port tarsnap to Windows", which is also on my to-do list, but much lower down.
With MinGW you can cross-compile from Linux (and perhaps also from FreeBSD). IMHO Cygwin is not suitable for applications like tarsnap that should be very reliable and fast.
When you finally do have a pre-compiled Windows app, sans Cygwin, I'll have more clients to refer to tarsnap.
One client I'm dealing with now has health care data needing backup. I wouldn't trust other online backup services with this, but would trust tarsnap. For now, we're using USB hard drives.
Why are citizens and residents of Canada not allowed to use tarsnap? What do you have against Canucks?
I don't have anything against Canadians — in fact, I am one. I do have something against sales tax. Dealing with federal and provincial sales taxes would not only mean dealing with extra paperwork; it would also mean figuring out whether the government considers me to be selling software (the tarsnap client code), providing a service within Canada (since I'm resident here), or providing a service outside of Canada (since the tarsnap service is provided via hardware in the US) -- not to mention questions like whether tarsnap is "data warehousing", "data processing", "telecommunications", or something else entirely. How tarsnap is classified would determine if I'd have to charge sales tax, how much, and to whom — and I'm guessing that those answers are different between federal and provincial taxes, too. From what I've read about sales taxes I'm reasonably confident about one thing, however: I don't need to charge sales tax to non-Canadians.
So for the moment, I'm taking the path of least resistance: Don't allow Canadians to use tarsnap, and spend my time writing code instead of trying to figure out how complicated sales tax laws, which were written by people who never imagined the internet or online services, apply to tarsnap.
I think Colin would do well to lead his explanation with a cleaned up version of the second paragraph "So for the moment, I'm taking the path of least resistance..."