Sunday, January 25, 2015

Backing up your data with Amazon S3 Glacier and rsnapshot. A complete guide, Part 1.


In this first part I'll tell you when Amazon Glacier is worth considering and when it isn't, compare full backups to incremental backups, and explain why you shouldn't "put all your files in the same basket".

When to consider Glacier, and when not

Glacier is a great storage solution offered by Amazon for about $0.012 per GB per month, supposing:

  • You want something cheap but reliable;
  • You understand that by "Glacier" Amazon means your files are frozen: it takes a while to get to the glacier and thaw your data so you can retrieve it ;-)
  • You almost never need to access the data from the server (doing so will cost you something, and you will have to wait about 4 hours before getting a download link);
  • You already have some primary backup storage (a second disk will do) where you can restore data immediately if needed;
  • You understand that Glacier is only meant to protect your data in case of fire or other major events, not simply to restore a file deleted by mistake on the "live" storage;
  • You don't plan to delete your files less than 90 days after uploading them (otherwise it will cost you something);
  • You are OK with the principle of storing and retrieving archives instead of single files.

With these considerations in mind, if the ~4-hour delay to retrieve your data is unacceptable, you are looking at the wrong product: try regular Amazon S3 storage. It costs about three times as much, but it's no slower than downloading this web page.
In fact, Amazon Glacier is not the right solution for plenty of use cases unless you are willing to accept its limitations.

Full backups and incremental backups explained

If you copy a folder while preserving attributes and links (cp -a src dest), you are doing a full backup. If the source folder is 100 GB and you want to keep backups for the last 7 days, you will need 700 GB of storage, and each copy will take 20 to 25 minutes. With 1 TB, we are talking about 3 to 4 hours!
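A crude full-backup rotation can be done by hand with nothing more than cp and date-stamped folders. This is only a sketch: the paths are made up, and the clean-up line assumes GNU coreutils.

    # Hypothetical source and destination; adjust to your own setup.
    SRC=/home/me/data
    DEST=/mnt/backup/full-$(date +%F)      # e.g. /mnt/backup/full-2015-01-25

    cp -a "$SRC" "$DEST"                   # full copy, preserving attributes and links

    # Keep only the 7 most recent copies (the dates sort chronologically).
    ls -d /mnt/backup/full-* | head -n -7 | xargs -r rm -rf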

The nice thing about full backups is that you can browse the backup just like you would with the "live" copy because it's a plain old regular folder! There is no need to extract archives or to use the backup solution's command-line client.

But as you can see, full backups use a lot of storage and are not particularly quick. The alternative is incremental backups. Instead of making a whole copy of the source folder, you only do that the first time; on the following runs, only the differences get saved. So if all you did was add one character to a text file, the second backup is only about 1 byte (I am simplifying, but you get the idea). The technical term for such a difference is a "delta".
A good command-line program to make incremental backups is rdiff-backup. 
One big flaw of this system is that you can't access the files directly, because the complete content of a file ends up split across several backups. You will need to rebuild it from all the small pieces.
People who rely on incremental backups usually mitigate the problem by making a fresh full backup every other week or so.
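To give you an idea, a typical rdiff-backup session looks like this (the paths are just placeholders):

    # Back up: the destination holds a mirror of the source plus reverse deltas.
    rdiff-backup /home/me/documents /mnt/backup/documents

    # Rebuild the folder as it was 7 days ago into a temporary location.
    rdiff-backup --restore-as-of 7D /mnt/backup/documents /tmp/documents-7-days-ago

    # Drop increments older than 4 weeks to reclaim space.
    rdiff-backup --remove-older-than 4W /mnt/backup/documents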

My personal preference is rsnapshot. It's probably the best of both worlds: it gives you full-backup-like folders while only storing the files that have changed. So yes, if you change one byte in a file, a complete copy of that file is made. That's the price to pay.

The little magic trick that rsnapshot uses is hard links. You see, when you list the content of a folder or you type rm somefile, you are only dealing with a name that points to a record (called an "inode") on the file system (the inode contains all sorts of metadata, but not the name). This means two things: asking to "remove a file" only removes a name (the data blocks are freed only once the last name pointing to that inode is gone), and you can have two filenames pointing at the same content on the disk. The latter is what is known as a "hard link".
A "symbolic link" on the other hand is the UN*X equivalent of shortcuts on Microsoft Windows. The "shortcut filename" points to an link-type i-node which points to the real i-node. If the real i-node marked as removed, the shortcut gets broken.
This is how rsnapshot never stores the same unchanged file more than once: each new snapshot hard-links the files that haven't changed since the previous one instead of copying them. It also explains why the very first run of rsnapshot takes much longer than, say, the exact same command run one hour later. That's why it is advised to run the first backup manually instead of letting cron do it: it takes a long time, and you get to verify it works like it should.
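Here is roughly what such a setup looks like. Treat it as a sketch: the paths and retention counts are only examples, and keep in mind that rsnapshot.conf wants tabs, not spaces, between fields.

    # /etc/rsnapshot.conf (excerpt)
    snapshot_root   /mnt/backup/snapshots/
    retain          hourly  24
    retain          daily   7
    backup          /home/me/       localhost/

    # Check the configuration, then run the very first backup by hand.
    rsnapshot configtest
    rsnapshot hourly

    # Once it works, let cron take over (crontab -e):
    # 0 *  * * *    /usr/bin/rsnapshot hourly
    # 30 3 * * *    /usr/bin/rsnapshot daily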

The dilemma

There is one problem with rsnapshot. If you make an archive of the latest snapshot, which is supposed to be only a few megabytes "bigger" than the snapshot from an hour ago, you still end up with the full 100 GB, because the archive contains the complete content of every file regardless of the hard links. You can send it to Glacier, and it will be great when the time comes: you get a full copy that requires almost no more work than extracting it.
The bad news: you will pay to store the same files again and again.
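For reference, sending the latest snapshot to Glacier boils down to something like this. The vault name and paths are made up, and the upload command assumes the AWS CLI is installed and configured.

    # Archive the most recent snapshot. Even though most of its files are shared
    # with older snapshots through hard links on disk, the archive contains their
    # full content, so this really is a complete ~100 GB copy.
    tar czf /tmp/backup-$(date +%F).tar.gz -C /mnt/backup/snapshots hourly.0

    # Upload it to a Glacier vault (hypothetical vault name). Note that a single
    # upload-archive call is limited to 4 GB; bigger archives have to go through
    # Glacier's multipart upload.
    aws glacier upload-archive --account-id - --vault-name my-backups \
        --body /tmp/backup-$(date +%F).tar.gz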

Incremental backups are much less practical to store on Glacier. First, you have to keep a log of some sort to know when you stored the first version of a file and where all the deltas needed to rebuild the version you are interested in are located. This is far too complex to manage by hand.

Not all files are born equal

I have 200 GB of data to back up. But here's the thing: you are probably like me, and 90% of it is made of files that never change and take a lot of space, typically photos and videos. Since they never change, "incremental backups" bring nothing for that kind of file.
You must be very picky when choosing the folders you back up automatically.
This way you don't make useless copies of files you know will never change, and you reduce your costs.
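With rsnapshot, being picky is just a matter of listing only the folders that actually change and excluding the heavy static stuff (the folder names below are only examples):

    # /etc/rsnapshot.conf (excerpt) -- only back up what really changes
    backup  /home/me/documents/     localhost/
    backup  /home/me/projects/      localhost/

    # exclude patterns are handed straight to rsync
    exclude Photos/
    exclude Videos/
    exclude *.iso

The photos and videos can then be archived once, by hand, instead of being copied over and over by the automatic backups.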

Stuff you are working on gets royal privileges

I've got two requirements regarding files related to projects I am currently working on: there must be at least two copies accessible immediately and they must be synchronized as often as possible.
This can be achieved with a version control system such as Git if you are working on code, or with Dropbox, Copy, Box, OwnCloud, ... for everything else.
If anything happens to my laptop, I can open a browser on another computer and access my files in less than a minute.
You think that's excessive? Imagine you are in a rush and you have only a few (dozen) minutes to print a paper, a Master's thesis, the e-ticket for your flight in 3 hours, the PowerPoint presentation that begins in 10 minutes...

There's a rule of thumb in the storage world:
The more often the data needs to be accessed, the faster the retrieval has to be, and the higher the cost.

You should still save these files in the "slow backup system", because you shouldn't trust Dropbox and the like to keep multiple copies of your files in several locations, and they usually delete old versions after a few months.

Continue to Part II
