Chapter 7. Open-Source Near-CDP

Chapter 8 discusses commercial backup software and the concept of near-continuous data protection (near-CDP) systems, which are essentially replication combined with some type of snapshot capability. Before moving to the commercial side of things, this chapter examines three open-source near-CDP systems. The first section covers rsync with snapshots and explains the concept, originally popularized by Mike Rubel. The rsnapshot section explains an open-source project based on that concept; it’s designed to automate the backup of user files and does not work very well with databases and other large files. Finally, the rdiff-backup section explains a different project with a number of features, including advanced metadata capabilities and the ability to handle large files.

Tip

This chapter was contributed by Michael Rubel, David Cantrell, and Ben Escoto. Mike is a graduate student in aeronautics at Caltech, where he keeps several backups of his thesis (which he hopes to finish soon). David doesn’t believe in quiche. Ben is currently an actuarial analyst at Aon Re.

Replication by itself isn’t enough. You can’t just have one system rsync to another system as a backup, because logical corruption (such as a virus or user error) would be replicated to the backup. You need some way of keeping multiple versions of data on the target so that you can go back to a previous version when necessary.

If you can do this, you can have a continuously incremental system that always has a full backup of any version available. This requires, of course, enough space to store one full copy plus incremental backups. It can even be employed in a variety of ways: push-to-server (as just described), pull-from-source, or locally (for example, to a removable hard disk drive).

rsync with Snapshots

The idea of using replication to create a snapshot-like system was popularized by Mike Rubel in an online article about rsync with snapshots (http://www.mikerubel.org/computers/rsync_snapshots/). The idea is both simple and powerful.

First, copy the source directory to a target directory (cp -a /home /backups/home.0). Because you will need them later, the target filesystem must support hard links. (Hard links are covered in more detail later in this chapter.)

Next you need to make another “copy” of the data you backed up; this copy serves as the previous version when you update the current one. To do this, copy the target directory to a .<n> version, where n is any digit. If you do this using hard links, the second “copy” won’t take up any space, so use the -l option of cp to tell it to use hard links when it makes the copy (cp -al /backups/home.0 /backups/home.1). Now there are two identical copies of our source directory on our backup system (/backups/home.0 and /backups/home.1) that take up the space of only one copy.

Now that you’ve copied the backup to another location, it’s time to make another backup. To do this, identify any files in the source that are new or changed, remove them from the target directory if they’re there, then copy them to the target directory. If a file is an updated version of a file already in the target directory, it must be unlinked first. You can use the rsync command to do all of this in one step (rsync --delete -av /home/. /backups/home.0/). This step is the heart of the idea. By removing a file that was already in the backup directory (/backups/home.0), you sever the hard link to the previous version (/backups/home.1). But because you used hard links, the previous version is still there (in /backups/home.1), and the newer version is in the current backup directory (in /backups/home.0).

To make another backup, move the older directory (/backups/home.1) to an older number (/backups/home.2), then repeat the hard-link copy, unlink, and copy process. You can do this as many times as you want and keep as many versions as you want. The space requirements are modest; the only increase in storage is for files that are new with that backup.

An Example

Let’s back up the directory /home. The example directory has three files:

$ echo "original myfile" >/home/myfile.txt
$ echo "original myotherfile" >/home/myotherfile.txt
$ echo "original mythirdfile" >/home/mythirdfile.txt
$ ls -l /home
total 3
-rw-r--r--  1 cpreston mkgroup-l-d 16 Jul  8 19:35 myfile.txt
-rw-r--r--  1 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  1 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

Now let’s create a copy of /home in /backups/home.0.

$ cp -a /home /backups/home.0
$ ls -l /backups/home.0
total 3
-rw-r--r--  1 cpreston mkgroup-l-d 16 Jul  8 19:35 myfile.txt
-rw-r--r--  1 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  1 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

$ du -sb /backups
58      /backups

Note that each file shows a 1 in the links column, that there is one copy of the /home directory in /backups containing the same files as /home, and that the entire /backups directory takes up 58 bytes, the same as the combined size of the three files. Now let’s create a second copy using hard links:

$ cp -al /backups/home.0 /backups/home.1
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r--  2 cpreston mkgroup-l-d 16 Jul  8 19:35 myfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

/backups/home.1:
total 3
-rw-r--r--  2 cpreston mkgroup-l-d 16 Jul  8 19:35 myfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

$ du -sb /backups
58      /backups

Now you can see that there are two copies of /home in /backups, each contains the same files as /home, and they still take up only 58 bytes because we used hard links. You should also note that the links column in the ls -l listing now contains a 2. Now let’s change a file in the source directory:

$ ls -l /home/myfile.txt
-rw-r--r--  1 cpreston mkgroup-l-d 16 Jul  8 19:35 /home/myfile.txt
$ echo "LET'S CHANGE MYFILE" >/home/myfile.txt
$ ls -l /home/myfile.txt
-rw-r--r--  1 cpreston mkgroup-l-d 20 Jul  8 19:41 /home/myfile.txt

Please note that the size and modification time of myfile.txt changed. Now it’s time to make a backup. The process we described earlier would notice that /home/myfile.txt has changed and that it should be removed from the backup directory and copied from the source. So let’s do that:

$ rm /backups/home.0/myfile.txt
$ cp -a /home/myfile.txt /backups/home.0/myfile.txt
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r--  1 cpreston mkgroup-l-d 20 Jul  8 19:41 myfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

/backups/home.1:
total 3
-rw-r--r--  1 cpreston mkgroup-l-d 16 Jul  8 19:35 myfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

$ du -sb /backups
78      /backups

Now you can see that myfile.txt in /backups/home.1 has only one link (because we removed the second link in /backups/home.0), but it still has the same size and modification date as before. You can also see that /backups/home.0 now contains the new version of myfile.txt. And, perhaps most importantly, the size of /backups is now the original size (58 bytes) plus the size of the new version of myfile.txt (20 bytes), for a total of 78 bytes. Now let’s get ready to make another backup. First, we have to create the older versions by moving directories around:

$ mv /backups/home.1 /backups/home.2
$ mv /backups/home.0 /backups/home.1

Then we need to create the new previous version using cp -al:

$ cp -al /backups/home.1 /backups/home.0
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r--  2 cpreston mkgroup-l-d 20 Jul  8 19:41 myfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

/backups/home.1:
total 3
-rw-r--r--  2 cpreston mkgroup-l-d 20 Jul  8 19:41 myfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

/backups/home.2:
total 3
-rw-r--r--  1 cpreston mkgroup-l-d 16 Jul  8 19:35 myfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

$ du -sb /backups
78      /backups

Now we have /backups/home.2, which contains the oldest version, and /backups/home.1 and /backups/home.0, which both contain the current backup. Please note that the size of /backups hasn’t changed since the last time we looked at it; it’s still 78 bytes. Let’s change another file and back it up:

$ echo "LET'S CHANGE MYOTHERFILE" >/home/myotherfile.txt
$ rm /backups/home.0/myotherfile.txt
$ cp -a /home/myotherfile.txt /backups/home.0/myotherfile.txt
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r--  2 cpreston mkgroup-l-d 20 Jul  8 19:41 myfile.txt
-rw-r--r--  1 cpreston mkgroup-l-d 25 Jul  8 19:45 myotherfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

/backups/home.1:
total 3
-rw-r--r--  2 cpreston mkgroup-l-d 20 Jul  8 19:41 myfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

/backups/home.2:
total 3
-rw-r--r--  1 cpreston mkgroup-l-d 16 Jul  8 19:35 myfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

$ du -sb /backups
103     /backups

You can see that /backups/home.0 now contains a different version of myotherfile.txt than what is in the other directories, and that the size of /backups has changed from 78 to 103, which is a difference of 25—the size of the new version of myotherfile.txt. Let’s prepare for one more backup:

$ mv /backups/home.2 /backups/home.3
$ mv /backups/home.1 /backups/home.2
$ mv /backups/home.0 /backups/home.1
$ cp -al /backups/home.1 /backups/home.0

Now we’ll change one final file and back it up:

$ echo "NOW LET'S CHANGE MYTHIRDFILE" >/home/mythirdfile.txt
$ rm /backups/home.0/mythirdfile.txt
$ cp -a /home/mythirdfile.txt /backups/home.0/mythirdfile.txt
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r--  3 cpreston mkgroup-l-d 20 Jul  8 19:41 myfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 25 Jul  8 19:45 myotherfile.txt
-rw-r--r--  1 cpreston mkgroup-l-d 29 Jul  8 19:51 mythirdfile.txt

/backups/home.1:
total 3
-rw-r--r--  3 cpreston mkgroup-l-d 20 Jul  8 19:41 myfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 25 Jul  8 19:45 myotherfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

/backups/home.2:
total 3
-rw-r--r--  3 cpreston mkgroup-l-d 20 Jul  8 19:41 myfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

/backups/home.3:
total 3
-rw-r--r--  1 cpreston mkgroup-l-d 16 Jul  8 19:35 myfile.txt
-rw-r--r--  2 cpreston mkgroup-l-d 21 Jul  8 19:35 myotherfile.txt
-rw-r--r--  3 cpreston mkgroup-l-d 21 Jul  8 19:35 mythirdfile.txt

$ du -sb /backups
132     /backups

Again, the total size of /backups has changed from 103 bytes to 132, a difference of 29 bytes, which is the size of the new version of mythirdfile.txt.

The proof is in the restore, right? Let’s look at all versions of all files by running the cat command against them:

$ cat /backups/home.3/*
original myfile
original myotherfile
original mythirdfile
$ cat /backups/home.2/*
LET'S CHANGE MYFILE
original myotherfile
original mythirdfile
$ cat /backups/home.1/*
LET'S CHANGE MYFILE
LET'S CHANGE MYOTHERFILE
original mythirdfile
$ cat /backups/home.0/*
LET'S CHANGE MYFILE
LET'S CHANGE MYOTHERFILE
NOW LET'S CHANGE MYTHIRDFILE

You can see that the oldest version (/backups/home.3) has the original version of every file, the next newest directory has the modified myfile.txt, the next newest version has that file and the modified myotherfile.txt, and the most recent version has all of the changed versions of every file. Isn’t it a thing of beauty?

Beyond the Example

The example works and is simple to understand, but it’s clunky to implement. Several of the steps are manual: creating the hard-linked directory, identifying the new files, removing the files you’re about to overwrite, and then copying over the new files. Finally, our manual example does not deal with files that have been deleted from the original; they would simply remain in the backup. What if you had a command that could do all of that in one step? You do! It’s called rsync.

The rsync utility is a very well-known piece of GPL’d software, written originally by Andrew Tridgell and Paul Mackerras. If you have a common Unix variant, you probably already have it installed; if not, you can install a precompiled package for your system or download the source code from rsync.samba.org and build it yourself. rsync’s specialty is efficiently synchronizing file trees across a network, but it works well on a single machine, too. Here is an example to illustrate basic operation.

Suppose you have a directory called <source>/ whose contents you wish to copy into another directory called <destination>/. If you have GNU cp, you can make the copy like this:

$ cp -a source/. destination/ 

The archive flag (-a) causes cp to descend recursively through the file tree and to preserve file metadata, such as ownerships, permissions, and timestamps. The preceding command first creates the destination directory if necessary.

Tip

It is important to use <source>/. and not <source>/* because the latter silently ignores top-level files and subdirectories whose names start with a period (.). Such files are considered hidden and are not normally displayed in directory listings, but they may be important to you!

However, if you make regular backups from <source>/ to <destination>/, running cp every time is not efficient because even files that have not changed (which is most of them) must be copied every time. Also, you would have to periodically delete <destination>/ and start fresh, or backups of files that have been deleted from <source>/ will begin to accumulate.

Fortunately, where cp falls short at copying mostly unchanged filesystems, rsync excels. The rsync command works similarly to cp but uses a very clever algorithm that copies only changes. The equivalent rsync command would be:

$ rsync -a source/. destination/ 

Warning

rsync is persnickety about trailing slashes on the source argument; it treats <source> and <source>/ differently. Using the trailing /. is a good way to avoid ambiguity.

You’ll probably want to add the --delete flag, which, in addition to copying new changes, also deletes any files in <destination>/ that are absent (because they have presumably been deleted) from <source>/. You can also add the verbose flag (-v) to get detailed information about the transfer. The following command is a good way to regularly synchronize <source>/ to <destination>/:

$ rsync -av --delete source/. destination/ 

rsync is good for local file synchronization, but where it really stands out is for synchronizing files over a network. rsync’s unique ability to copy only changes makes it very fast, and it can operate transparently over an ssh connection. To rsync from /<source>/ on the remote computer example.oreilly.com to the local directory /<destination>/ over ssh, you can use the command:

$ rsync -av --delete username@example.oreilly.com:/source/. /destination/ 

That was pull mode. rsync works just as well in push mode from a local /<source>/ to a remote /<destination>/:

$ rsync -av --delete /source/. username@example.oreilly.com:/destination/ 

As a final note, rsync provides a variety of --include and --exclude options that allow for fine-grained control over which parts of the source directory to copy. If you wish to exclude certain files from the backup—for example, any file ending in .bak or certain subdirectories—they may be helpful to you. For details and examples, see the rsync manpage.
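For example, the following sketch (the patterns are illustrative) skips editor backup files and a scratch subdirectory:

$ rsync -av --delete --exclude='*.bak' --exclude='tmp/' source/. destination/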

Hard links are an important Unix concept to understand if you’re going to use this technique. Every object in a Unix filesystem (every regular file, directory, symbolic link, named pipe, and device node) is identified by a unique positive integer known as an inode number. An inode keeps track of such mundane details as what kind of object it is, where its data lives, when it was last updated, and who has permission to access it. If you use ls to list files, you can see inode numbers by adding the -i flag:

$ ls -i foo 
409736 foo

What does foo consist of? It consists of an inode (specifically, inode number 409736) and some data. Also—and this is the critical part—there is now a record of the new inode in the directory where foo resides. It now lists the name foo next to the number 409736. That last part, the entry of a name and inode number in the parent directory, constitutes a hard link.

Most ordinary files are referenced by only one hard link, but it is possible for an inode to be referenced more than once. Inodes also keep a count of the number of hard links pointing to them. The ls -l command has a column showing you how many links a given file has. (The number to the left of the file owner’s user ID is the number of hard links—one in the following example.)

$ ls -l foo
-rw-r--r--  1 cpreston mkgroup-l-d 16 Jul  8 19:35 foo

To make a second link to a file within a filesystem, use the ln command. For example:

$ ln foo bar 
$ ls -i foo bar
409736 foo 
409736 bar

Tip

Hard links, like the one illustrated here, can be created only for files within the same filesystem.

Now the names foo and bar refer to the same file. If you edit foo, you are simultaneously editing bar, and vice versa. If you change the permissions or ownership on one, you’ve changed them on the other, too. There is no way to tell which name came first. They are equivalent.

When you remove a file, all you’re removing is a link to that inode; this is called unlinking. An inode is not actually released until the number of links to it drops to zero. (Even then, the inode is freed only when the last process has finished using it, which is why you can remove the executable of a running program.) For example, ls -l now tells you that foo has two links. If you remove bar, only one remains:

$ ls -l foo
-rw-r--r--  2 cpreston mkgroup-l-d 16 Jul  8 19:35 foo
$ rm bar
$ ls -l foo
-rw-r--r--  1 cpreston mkgroup-l-d 16 Jul  8 19:35 foo

The situation would have been the same if you’d removed foo and run ls -l on bar instead. If you now remove foo, the link count drops to zero, and the operating system releases inode number 409736.

Here is a summary of some of the important properties of hard links. If foo and bar are hard links to the same inode:

  • Changes to foo immediately affect bar, and vice versa.

  • Changes to the metadata of foo—the permissions, ownership, or timestamps—affect those of bar as well, and vice versa.

  • The contents of the file are stored only once. The ln command does not appreciably increase disk usage.

  • The hard links foo and bar must reside on the same filesystem. You cannot create a hard link in one filesystem to an inode in another because inode numbers are unique only within filesystems.

  • You must unlink both foo and bar (using rm) before the inode and data are released to the operating system.

In the previous section, you learned that ln foo bar creates a second hard link called bar to the inode of file foo. In many respects, bar looks like a copy of foo created at the same time. The differences become relevant only when you try to change one of them, examine inode numbers, or check disk space. In other words, as long as in-place changes are prohibited, the outcomes of cp foo bar and ln foo bar are virtually indistinguishable to users. The latter, however, does not use additional disk space.
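Here is a quick illustration; run ls -il afterward and you’ll see that bar-link shares foo’s inode number, while bar-copy has its own:

$ cp foo bar-copy     # a real copy: new inode, new data blocks
$ ln foo bar-link     # a hard link: same inode as foo, no new data
$ ls -il foo bar-copy bar-link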

Suppose you wanted to make regular backups of a directory called <source>/. You might make the first backup, a full copy, using rsync:

$ rsync -av --delete source/. backup.0/

To make the second backup, you could simply make another full copy, like so:

$ mv backup.0 backup.1 
$ rsync -av --delete source/. backup.0/ 

That would be inefficient if only a few of the files in <source>/ changed in the interim. Therefore, you create a second copy of backup.0 to use as a destination (since you’re using hard links, it won’t take up any more space). You can use two different techniques to make backups in this way. The first is a bit easier to understand, and it’s what we used in the example. The second streamlines things a bit, doing everything in one command.

GNU cp provides a flag, -l, to make hard-link copies rather than regular copies. It can even be invoked recursively on directories:

$ mv backup.0 backup.1 
$ cp -al backup.1/. backup.0/ 
$ rsync -av --delete source/. backup.0/

Putting cp -al in between the two backup commands creates a hard-linked copy of the most recent backup. You then rsync new changes from <source>/ to backup.0/. rsync ignores files that have not changed, so it leaves the links of unchanged files intact. When it needs to change a file, it unlinks the original first, so its partner is unaffected. As mentioned before, the --delete flag also deletes any files in the destination that are no longer present in the source (which only unlinks them in the current backup directory).

rsync is now taking care of a lot. It decides which files to copy, unlinks them from the destination, copies the changed files, and deletes any files it needs to. Now let’s take a look at how it can handle the hard links as well. rsync provides an option, --link-dest, that does this for you, and it even handles files for which only the metadata has changed (something the cp -al approach cannot represent, because hard-linked files share ownership, permissions, and timestamps). Rather than running separate cp -al and rsync stages, the --link-dest flag instructs rsync to do the whole job, copying changes into the new directory and making hard links where possible for unchanged files. It is significantly faster, too.

$ mv backup.0 backup.1 
$ rsync -av --delete --link-dest=../backup.1 source/. backup.0/ 

Warning

Notice the relative path of ../ for backup.1/. The path for the --link-dest argument should be relative to the target directory—in this case, backup.0/. This has confused many people.

A simple example script

The following script can be run as many times as you want. The first time it runs, it creates the first “full backup” of /home in /backups/home.inprogress and moves that directory to /backups/home.0 upon completion. On subsequent runs, rsync builds /backups/home.inprogress by hard-linking any unchanged files to their counterparts in /backups/home.0 and copying only the files that have changed from the source. The script then rotates the directories, keeping three versions of the backups.

rsync -av --delete --link-dest=../home.0 /home/. /backups/home.inprogress/
[ -d /backups/home.2 ] && rm -rf /backups/home.2                # discard the oldest version
[ -d /backups/home.1 ] && mv /backups/home.1 /backups/home.2    # age the remaining versions
[ -d /backups/home.0 ] && mv /backups/home.0 /backups/home.1
[ -d /backups/home.inprogress ] && mv /backups/home.inprogress /backups/home.0
touch /backups/home.0                                           # stamp the backup with its completion time

This is a very basic script. If you’re serious about implementing this idea, you have two options: either go to Mike Rubel’s web page at http://www.mikerubel.org/computers/rsync_snapshots or look at the section on rsnapshot later in this chapter. rsnapshot is a full implementation of this idea, complete with a user community that supports it.

Restoring from the Backup

Because backups created this way are just conventional Unix filesystems, there are as many options for restoring them as there are ways to copy files. If your backups are stored locally (as on a removable hard disk) or accessible over a network filesystem such as NFS, you can simply cp files from the backups to /home. Or better yet, rsync them back:

$ rsync -av --delete /backups/home.0/. /home/ 

Be careful with that --delete flag when restoring: make sure you really mean it!

If the backups are stored remotely on a machine you can access by ssh, you can use scp or rsync over ssh rather than cp. Other simple arrangements are also possible, such as placing the directories somewhere that is accessible to a web server.
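For example, to pull the most recent backup of /home back from a backup server over ssh (the hostname is illustrative):

$ rsync -av username@backupserver:/backups/home.0/. /home/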

Things to Consider

Here are a few other things to consider if you’re going to use the rsync method for creating backups.

How large is each backup?

One drawback of the rsync/hard-link approach is that the sharing of unchanged files makes it deceptively hard to define the size of any one backup directory. A normally reasonable question such as, “Which backup directories should I erase to free 100 MB of disk space?” cannot be answered in a straightforward way.

The space freed by removing any one backup directory is the total disk usage of all files whose only hard links reside in that directory, plus overhead. You can obtain a list of such files using the find command, here applied to the backup directory /backups/home.1/:

$ find /backups/home.1 -type f -links 1 -print 

The following command prints their total disk usage:

$ du -hc `find /backups/home.1 -type f -links 1 -print` | tail -n 1 
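The backquoted form breaks if any filenames contain spaces and can overflow the command line on large backups. If you have GNU find and du, a more robust variant (a sketch) is:

$ find /backups/home.1 -type f -links 1 -print0 | du -ch --files0-from=- | tail -n 1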

Deleting more than one backup directory usually frees more than the sum of individual disk usages because it also erases any files that were shared exclusively among them.

Tip

This command may report erroneous numbers if the source data had a lot of hard-linked files.

A brief word about mail formats

There are a number of popular mail storage formats in use today. The venerable mbox format holds all messages of a folder in one large flat file. The newer maildir format, popularized by Qmail, allows each message to be a small file. Other database mail stores are also in use.

Of these, maildirs are by far the most efficient for the rsync/hard-link technique because their structure leaves most files (older messages) unchanged. (This would be true of the original rsync/hard-link method and of rsnapshot, which is covered later in the chapter.) For mbox format mail spools, consider rdiff-backup instead.

Other useful rsync flags

If you have a slow network connection, you may wish to use rsync’s --bwlimit flag to keep the backup from saturating it. It allows you to specify a maximum bandwidth in kilobytes per second.

If you give rsync the --numeric-ids option, it ignores usernames, avoiding the need to create user accounts on the backup server.
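For example, this sketch (the hostname is illustrative) pushes a backup while capping the transfer at 100 KB/s and preserving numeric UIDs and GIDs:

$ rsync -av --delete --numeric-ids --bwlimit=100 /home/. backupserver:/backups/home.0/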

Backing up databases or other large files that keep changing

The method described in this chapter is designed for many small files that don’t change very often. When this assumption breaks down, such as when backing up large files that change regularly (such as databases, mbox-format mail spools, or UML COW files), this method is not disk-space efficient. For these situations, consider using rdiff-backup, covered later in this chapter.

Backing up Windows systems

While rsync works under cygwin, issues have been reported with timestamps, particularly when backing up FAT filesystems. Windows systems traditionally operate on local time, with daylight saving time and other local quirks, as opposed to Unix’s universal time. At least some file timestamps are kept with only two-second resolution. Consider giving rsync a --modify-window of 2 on Windows.
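For example, a sketch of a cygwin-side push (the paths are illustrative):

$ rsync -av --delete --modify-window=2 /cygdrive/c/data/. /backups/windows.0/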

Large filesystems and rsync’s memory scaling

While it has improved in recent years, rsync uses a lot of memory when synchronizing large file trees. It may be necessary to break up your backup job into pieces and run them individually. If you take this route, you may also have to manually delete pieces from the destination that have been deleted from the source.
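For example, instead of one rsync of all of /home, you might run one per user (the names are illustrative); note that a user directory removed from the source then has to be removed from the destination by hand:

$ rsync -av --delete /home/alice/. /backups/home.0/alice/
$ rsync -av --delete /home/bob/. /backups/home.0/bob/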

Atomicity and partial transfers

rsync takes a finite period of time to operate, and when a source file changes while the backup is in progress, a partial transfer error (code 23) may be generated. Only the files that changed during the transfer may be missed; rsync completes as much of the job as possible.

If you run backups only when your system is relatively static, such as in the middle of the night for an office environment, partial transfers may never be a problem for you. If they are a problem, consider making use of rsync’s --partial option.
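For example (a sketch): --partial keeps partially transferred files on the destination so that an interrupted transfer can pick up where it left off on the next run:

$ rsync -av --delete --partial source/. destination/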

rsnapshot

rsnapshot is an open-source backup and recovery tool based on Mike Rubel’s original concept. It is designed to take a great idea and make it more accessible to users and more useful for larger environments. Under the hood, rsnapshot works in the same way as Mike Rubel’s original scripts covered in the first part of the chapter. It uses hard links to conserve space and rsync to copy changes and break the hard links when necessary.

The major differences between rsnapshot and Mike Rubel’s original script are that rsnapshot: [1]

  • Is written in Perl because Perl skills are readily available, and Perl can more easily parse a configuration file.

  • Supports multiple levels of backup (daily, weekly, and monthly) from one script.

  • Backs up multiple directory trees from multiple sources, with some optional fine-tuning of what rsync parameters are used.

  • Supports locking so that two backups don’t run at once.

  • Has syslog support.

  • Supports pre- and post-backup scripts, which are useful for things like taking dumps of databases.

Platform Support

rsnapshot is most at home on a Unix-like platform, such as Linux, FreeBSD, and Solaris, but it can be used to back up data on non-Unix platforms. rsnapshot itself needs to run somewhere that rsync can deal with hard links, which means somewhere like Unix. For these purposes, Mac OS X counts as a Unix platform, although there are some issues with Mac OS X’s case-insensitive filesystem (HFS+) and resource forks. Thankfully, this is very rarely an issue. Resource forks are nowhere near as important in Mac OS X as they used to be in “classic” Mac OS. Please read the section on this topic in Chapter 3.

Tip

Classic Mac OS is not supported at all. While it may be possible to get an rsync server running on it and for rsnapshot to back it up, no changes will be made to rsnapshot to support this.

rsnapshot can back up Windows systems, but running the rsnapshot server on Windows is not supported because of the lack of hard-link support. Also, because permissions are so different in the Windows world compared with the Unix world, they will almost certainly not be backed up properly. Timestamps on files may not be backed up properly either, because Windows seems to store timestamps in the local timezone whereas Unix uses GMT. However, with these caveats, rsnapshot is being used successfully by many people to back up their Windows machines. To back up permissions, you can run a script before the backup to dump information about file permissions and users to a file that is then included in the snapshot. When you recover a file from backup, you then restore the correct permissions to files either by hand or with a script, referring to that file. Obviously, you should test this procedure before relying on it, to make sure you are recording enough data and that your script is correctly reading the file!
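For example, a minimal pre-backup sketch using GNU find records the octal permissions, owner, group, and path of every file in a list that then rides along inside the snapshot (the path and filename are illustrative):

$ find /home -printf '%m %u %g %p\n' > /home/.permissions.txt

At restore time, a script (or a patient human) can read this file and reapply ownership and permissions with chown and chmod.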

Finally, some obscure features of Unix filesystems aren’t backed up because rsync doesn’t support them: ACLs and filesystem extended attributes. These are rarely used, and you can use the same workaround described for Windows. No scripts of this nature are supplied with rsnapshot because there is so much variety from one platform to another (which is the same reason that rsync itself doesn’t support them).

Tip

rdiff-backup, covered later in this chapter, often supports backing up ACLs.

When Not to Use rsnapshot

With even the slightest change to a file, rsnapshot puts a new copy of the file into your backup. This works well if you add a file, remove a file, or make large changes to a file. However, it falls down if you make small changes to a big file or change a file constantly; in those cases, you lose all the benefits of hard links. (rdiff-backup, covered later in the chapter, handles these cases better.) Logfiles and mboxes are the most common examples of such files. Normally, they make up only a small proportion of the data on a machine, so the amount of wasted space is quite small compared to the size of a backup. However, on mail servers and syslog servers, pretty much the only changes to files are of this sort. In those cases, we recommend using rdiff-backup instead.

Setting Up rsnapshot

rsnapshot packages are available for several Linux distributions and can be easily installed using the native package manager. If your distribution or OS doesn’t have such a package available or if you want to be up to date with the most recent version, you can download a gzipped tar file from:

http://www.rsnapshot.org/downloads.html

To install that version, first uncompress the file:

$ gzip -dc rsnapshot-version.tar.gz | tar xvf -

Then change to the directory this command created:

$ cd rsnapshot-version

There is a file of instructions called INSTALL, but, in summary, for a fresh install you would normally need to issue these commands:

$ ./configure
$ su
 (you will be asked for your password)
# make install
# cp /etc/rsnapshot.conf.default /etc/rsnapshot.conf

You then edit the file /etc/rsnapshot.conf to tell it where to put your backups, what to back up, and so on. There are copious instructions in the example config. Most users need to change only these settings:

backup

What to back up

snapshot_root

Where to put your backups

interval

When to back up

include and exclude

Which files rsync pays attention to
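For example, a minimal /etc/rsnapshot.conf might look like the following sketch (the hostnames and paths are illustrative; note that rsnapshot requires tabs, not spaces, between fields, rendered here as spaces):

snapshot_root   /backups/
interval        daily   7
interval        weekly  4
backup          /home/                  localhost/
backup          root@server1:/etc/      server1/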

You can check your new configuration by typing:

$ rsnapshot configtest

If it all looks good, create cron jobs to run rsnapshot automatically. You can put them in root’s cron file like this:

MAILTO=username@example.com    # so that I get emailed any errors
00 00 * * *    /usr/local/bin/rsnapshot daily    # every day at midnight
00 22 * * 6    /usr/local/bin/rsnapshot weekly   # every Saturday at 22:00
00 20 1 * *    /usr/local/bin/rsnapshot monthly  # the first of the month at 20:00

If you want to back up data from other machines, use ssh (which must be installed on both ends of the connection) and keys. There’s a good description of how to do this securely here:

http://troy.jdmz.net/rsnapshot/

There’s also a web frontend to the CVS repository here:

http://rsnapshot.cvs.sourceforge.net/rsnapshot/rsnapshot/

The rsnapshot Community

Open-source software lives or dies by its support. Thankfully, rsnapshot is well supported by an active mailing list that users are encouraged to subscribe to. It is used for reporting problems, discussing their fixes, announcing new releases, and so on. You can join the list or search the archives here:

https://lists.sourceforge.net/lists/listinfo/rsnapshot-discuss

Many people post patches and add-ons to the list, some of which are committed to CVS by the small number of developers. If you’ve got good ideas and have posted good patches to the list, getting permission to commit to CVS is easy. Of course, everyone can look at and download the CVS repository anonymously.

rsnapshot is stable software. There are no plans to make major changes in how it works. The sort of changes you can expect to see are:

  • More add-on scripts for reporting

  • More database support scripts

  • More configuration options

  • Occasional bug fixes

rdiff-backup

rdiff-backup is a program written in Python and C that uses the same rolling-checksum algorithm that rsync does. Although rdiff-backup and rsync are similar and use the same algorithm, they do not share any code and must be installed separately.

When backing up, both rsnapshot and rdiff-backup create a mirror of the source directory. For both, the current backup is just a copy of the source, ready to be copied and verified like an ordinary directory. And both can be used over ssh in either push or pull mode. The most important conceptual differences between rsync snapshots and rdiff-backup are how they store older backups and how they store file metadata.

An rsync-snapshot system basically stores older backups as complete copies of the source. As mentioned earlier in the chapter, by being clever with hard links, these copies do not take long to create and usually do not take up nearly as much disk space as unlinked copies. However, every distinct version of every file in the backup is stored as a separate copy of that file. For instance, if you add one line to a file or change a file’s permissions, that file is stored twice in the backup archive in its entirety. This can be especially troublesome with logfiles, which grow slightly quite often.

On the other hand, rdiff-backup does not keep complete copies of older files in the backup archive. Instead, it stores only the compressed differences between current files and their older versions, called diffs or deltas. For logfiles, rdiff-backup would not keep a separate copy of the older and slightly shorter log. Instead, it would save to the archive a delta file that contains the information “the older version is the current version but without the last few lines.” These deltas are often much smaller than an entire copy of the older file. When a file has changed completely, the delta is about the same size as the older version (but is then compressed).

When an rdiff-backup archive has multiple versions of a file, the program stores a series of deltas. Each one contains instructions on how to construct an earlier version of a file from a later one. When restoring, rdiff-backup starts with the current version and applies deltas in reverse order.

Besides storing older versions as deltas instead of copies, rdiff-backup also stores the (compressed) metadata of all files in the backup archive. Metadata is data associated with a file that describes the file’s real data. Some examples of file metadata are ownership, permissions, modification time, and file length. This metadata does not take up much space because metadata is generally very compressible. Newer versions go further and store only deltas of the metadata, for even more space efficiency.

At the cost of some disk space, storing metadata separately has several advantages. First, data loss is avoided even if the destination filesystem does not support all the features of the source filesystem. For instance, ownership can be preserved even without root access, and Linux filesystems with symbolic links, device files, and ACLs can be backed up to a Windows filesystem. You don’t have to examine the details of each filesystem to know that the backup will work. Second, with metadata stored separately, rdiff-backup is less disk-intensive on the backup server. When backing up, rdiff-backup does not need to traverse the mirror’s directory structure to determine which files have changed. Third, metadata such as SHA-1 checksums can be used to verify the integrity of backups.

Advantages

Here are some advantages of using rdiff-backup instead of an rsync script or rsnapshot:

Backup size

Because rdiff-backup does not store complete copies of older files but only the compressed differences between older and current files, backups generally consume less disk space.

Easier to use

Unlike rsync, rdiff-backup was written originally for backups. It has sensible defaults (so there is no need for the -av --delete -e ssh options) and fewer quirks (for instance, there is no distinction between <destination>, <destination>/, and <destination>/.).

Preserves all information

With rsync, all information is stored in the filesystem itself. If you log in to your backup repository as a nonroot user (generally a good idea), the rsync method forgets who owns all your files! rdiff-backup keeps a copy of all metadata in a separate file, so no information is lost, even if you aren’t root or if you back up to a different kind of filesystem.

Handy backup features

rdiff-backup has several miscellaneous handy features. For example, it keeps detailed logs on what is changing and has commands to process those logs so that you know which files are using up your space and time. Also, newer versions keep SHA-1 checksums of all files so you can verify the integrity of backups. Some rsync scripts have similar features—check their documentation.
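For example, these commands list the backup sessions in an archive and, in newer versions, the total size of each increment (handy when deciding what to prune):

$ rdiff-backup --list-increments destination
$ rdiff-backup --list-increment-sizes destination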

Disadvantages

Let’s be honest. rdiff-backup has some disadvantages, too:

Speed

rdiff-backup consumes more CPU than rsync and is therefore slower than most rsync scripts. This difference is often not noticeable when the bottleneck is the network or a disk drive but can be significant for local backups.

Transparency

With rsync scripts, all past backups appear as copies and are thus easy to verify, restore, and delete. With rdiff-backup, only the current backup appears as a true copy. (Earlier backups are stored as compressed deltas.)

Requirements

rdiff-backup is written in Python and requires the librsync library. Unless you use a distribution that includes rdiff-backup (most of them include it), installation could entail downloading and installing other files.

Quick Start

Here’s a basic, but complete, example of how to use rdiff-backup to back up and restore a directory. Suppose the directory to be backed up is called <source>, and we want our archive directory to be called <destination>:

$ rdiff-backup source destination

This command backs up the <source> directory into <destination>. If you look into <destination>, you’ll see that it is just like <source> but contains a directory called <destination>/rdiff-backup-data where the metadata and deltas are stored. The rdiff-backup-data directory is laid out in a fairly straightforward way—all information is either in (possibly gzipped) text files or in deltas readable with the rdiff utility—but we don’t have the space to go into the data format here.

The first time you run this command, it creates the <destination> and <destination>/rdiff-backup-data directories. On subsequent runs, it sees that <destination> exists and makes an incremental backup instead. For daily backup usage, no special switches are necessary.
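For example, a nightly cron entry in the style shown earlier in the chapter might be (the paths and hostname are illustrative):

00 01 * * *    /usr/local/bin/rdiff-backup /home user@backuphost::/backups/home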

Suppose you accidentally delete the file <source>/foobar and want to restore it from backups. Both of these commands do that:

$ cp -a destination/foobar source
$ rdiff-backup -r now destination/foobar source

The first command works because <destination>/foobar is a mirror of <source>/foobar, so you can use cp or any other utility to restore. The second command contains the -r switch, which tells rdiff-backup to enter restore mode and restore the specified file as of the given time. In the example, now is specified, meaning restore the most recent version of the file. rdiff-backup accepts a large variety of time formats.

Now suppose you realize you deleted the important file <source>/foobar a week ago and want to restore it. You can’t use cp because the file is no longer present in <destination> in its original form (in this case, it’s gzipped in the <destination>/rdiff-backup-data directory). However, the -r syntax still works; this time you tell it 7D, for seven days:

$ rdiff-backup -r 7D destination/foobar source

Finally, suppose that the <destination> directory is getting too big, and you need to delete older backups to save disk space. This command deletes backup information more than one year old:

$ rdiff-backup --remove-older-than 1Y destination

Just like rsync, rdiff-backup allows the source or destination directory (or both) to be on a remote computer. For example, to back up the local directory <source> to the <destination> directory on the computer host.net, use the command:

$ rdiff-backup source user@host.net::destination

This works as long as rdiff-backup is installed on both computers, and host.net can receive ssh connections. The earlier commands also work if user@host.net::<destination> is substituted for <destination>.
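For example, to restore last week’s version of foobar from the remote archive:

$ rdiff-backup -r 7D user@host.net::destination/foobar source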

Windows, Mac OS X, and the Future

Although rdiff-backup was originally developed under Linux for Unix-style systems, newer versions have features that are useful to Windows and Mac users. For instance, rdiff-backup can back up case-sensitive filesystems and files whose names contain colons (:) to Windows filesystems. Also, rdiff-backup supports Mac resource forks and Finder information, and is easy to install on Mac OS X because it is included in the Fink distribution. Unfortunately, rdiff-backup is a bit trickier to install natively under Windows; currently, cygwin is probably the easiest way.

Future development of rdiff-backup may consist mostly of making sure that the newer features like full Mac OS X support are as stable as the core Unix support, and adding support for new filesystem features as they emerge. For more information on rdiff-backup, including full documentation and a pointer to the mailing list, see the rdiff-backup project home page at http://rdiff-backup.nongnu.org/.

Tip

BackupCentral.com has a wiki page for every chapter in this book. Read or contribute updated information about this chapter at http://www.backupcentral.com.



[1] Some of these differences no longer apply because Mike has continued to update his script. Specifically, the new version of Mike Rubel’s script supports multiple trees and locking, and it also has a syslog support wrapper.
