Chapter 8 discusses commercial backup software and the concept of near-continuous data protection (near-CDP) systems, which are essentially replication tied to some type of snapshot product. Before moving to the commercial side of things, this chapter examines three open-source near-CDP systems. The first section covers rsync with snapshots and explains the concept, originally popularized by Mike Rubel. The rsnapshot section explains an open-source project based on that concept; it is designed to automate the backup of user files and does not work very well with databases and other large files. Finally, the rdiff-backup section explains a different project with a number of features, including advanced metadata capabilities and the ability to handle large files.
This chapter was contributed by Michael Rubel, David Cantrell, and Ben Escoto. Mike is a graduate student in aeronautics at Caltech, where he keeps several backups of his thesis (which he hopes to finish soon). David doesn’t believe in quiche. Ben is currently an actuarial analyst at Aon Re.
Replication by itself wouldn’t work. You can’t just have one system rsync
to another system as a backup because logical corruption
(such as a virus or user error) would be replicated to the backup. You need to have some way
of keeping multiple versions of data on the target so that you can go back to a previous
version when necessary.
If you can do this, you can have a continuously incremental system that always has a full backup of any version available. This requires, of course, enough space to store one full copy plus incremental backups. It can even be employed in a variety of ways: push-to-server (as just described), pull-from-source, or locally (for example, to a removable hard disk drive).
The idea of using replication to create a snapshot-like system was popularized by Mike Rubel in an online article about rsync with snapshots (http://www.mikerubel.org/computers/rsync_snapshots/). The idea is both simple and powerful.
You first have to copy one directory to another directory. To do this, copy the source directory to a target directory (cp -a /home /backups/home.0). Because you will need them later, the target directory must support hard links. (Hard links are covered in more detail later in this chapter.)
Next you need to make another “copy” of the data you backed up; this copy is used as the previous version when you update this version. To do this, copy the target directory to a .<n> version, where n is any digit. If you do this using hard links, the second “copy” won’t take up any space, so use the -l option of cp to tell it to use hard links when it makes the copy (cp -al /backups/home.0 /backups/home.1). Now there are two identical copies of our source directory on our backup system (/backups/home.0 and /backups/home.1) that take up the size of only one copy.
Now that you’ve copied the backup to another location, it’s time to make another backup. To do this, identify any files in the source that are new or changed, remove them from the target directory if they’re there, and then copy them to the target directory. If a file is an updated version of one already in the target directory, it must be unlinked first. You can use the rsync command to do all of this in one step (rsync -av --delete /home/. /backups/home.0/). This step is the heart of the idea.
By removing a file that was already in the backup directory (/backups/home.0), you sever the hard link to the previous version (/backups/home.1). But since you used hard links, the previous
version is still there (in /backups/home.1), and the
newer version is in the current backup directory (in /backups/home.0).
To make another backup, move the older directory (/backups/home.1) to an older number (/backups/home.2), then repeat the hard-link copy, unlink, and copy process. You can do this as many times as you want and keep as many versions as you want. The space requirements are modest; the only increase in storage is for files that are new with that backup.
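In script form, one rotation cycle looks like this. This is only a sketch: it runs in a scratch directory created with mktemp so that it can be tried anywhere, but in practice you would use your real /backups tree.

```shell
#!/bin/sh
# Sketch of one rotation cycle; a scratch directory stands in for /backups.
BACKUPS=$(mktemp -d)
mkdir -p "$BACKUPS/home.0"
echo "original myfile" > "$BACKUPS/home.0/myfile.txt"

# Shift the existing snapshots down one slot...
if [ -d "$BACKUPS/home.1" ]; then
    mv "$BACKUPS/home.1" "$BACKUPS/home.2"
fi
mv "$BACKUPS/home.0" "$BACKUPS/home.1"
# ...then make a hard-linked "copy" to serve as the new current backup.
# Both directories now point at the same inodes, so this consumes almost
# no additional disk space.
cp -al "$BACKUPS/home.1" "$BACKUPS/home.0"

# Both directory entries reference one inode:
ls -i "$BACKUPS/home.0/myfile.txt" "$BACKUPS/home.1/myfile.txt"
```

The next backup pass into home.0 would then unlink any changed files there, leaving the old versions intact in home.1.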
Let’s back up the directory /home. The example directory has three files:
$ echo "original myfile" > /home/myfile.txt
$ echo "original myotherfile" > /home/myotherfile.txt
$ echo "original mythirdfile" > /home/mythirdfile.txt
$ ls -l /home
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
Now let’s create a copy of /home in /backups/home.0.
$ cp -a /home /backups/home.0
$ ls -l /backups/home.0
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
$ du -sb /backups
58 /backups
Note that each file shows a 1 in the links column, that there is one copy of the /home directory in /backups containing the same files as /home, and that the entire /backups directory takes up 58 bytes, the same as the total number of bytes in all three files. Now let’s create a second copy using hard links:
$ cp -al /backups/home.0 /backups/home.1
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

/backups/home.1:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

$ du -sb /backups
58 /backups
Now you can see that there are two copies of /home in /backups, each contains the same files as /home, and they still take up only 58 bytes because we used hard links. You should also note that the links column in the ls -l listing now contains a 2. Now let’s change a file in the source directory:
$ ls -l /home/myfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 /home/myfile.txt
$ echo "LET'S CHANGE MYFILE" > /home/myfile.txt
$ ls -l /home/myfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 20 Jul 8 19:41 /home/myfile.txt
Please note that the size and modification time of myfile.txt changed. Now it’s time to make a backup. The process we described earlier would notice that /home/myfile.txt has changed and that it should be removed from the backup directory and copied from the source. So let’s do that:
$ rm /backups/home.0/myfile.txt
$ cp -a /home/myfile.txt /backups/home.0/myfile.txt
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

/backups/home.1:
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

$ du -sb /backups
78 /backups
Now you can see that myfile.txt in /backups/home.1 has only one link (because we removed the second link in /backups/home.0), but it still has the same size and modification date as before. You can also see that /backups/home.0 now contains the new version of myfile.txt. And, perhaps most importantly, the size of /backups is now the original size (58 bytes) plus the size of the new version of myfile.txt (20 bytes), for a total of 78 bytes. Now let’s get ready to make another backup. First, we have to create the older versions by moving directories around:
$ mv /backups/home.1 /backups/home.2
$ mv /backups/home.0 /backups/home.1
Then we need to create the new previous version using cp -al:
$ cp -al /backups/home.1 /backups/home.0
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

/backups/home.1:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

/backups/home.2:
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

$ du -sb /backups
78 /backups
Now we have /backups/home.2, which contains the oldest version, and /backups/home.1 and /backups/home.0, which both contain the current backup. Please note that the size of /backups hasn’t changed since the last time we looked at it; it’s still 78 bytes. Let’s change another file and back it up:
$ echo "LET'S CHANGE MYOTHERFILE" > /home/myotherfile.txt
$ rm /backups/home.0/myotherfile.txt
$ cp -a /home/myotherfile.txt /backups/home.0/myotherfile.txt
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 25 Jul 8 19:45 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

/backups/home.1:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

/backups/home.2:
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

$ du -sb /backups
103 /backups
You can see that /backups/home.0 now contains a different version of myotherfile.txt than what is in the other directories, and that the size of /backups has changed from 78 to 103, which is a difference of 25—the size of the new version of myotherfile.txt. Let’s prepare for one more backup:
$ mv /backups/home.2 /backups/home.3
$ mv /backups/home.1 /backups/home.2
$ mv /backups/home.0 /backups/home.1
$ cp -al /backups/home.1 /backups/home.0
Now we’ll change one final file and back it up:
$ echo "NOW LET'S CHANGE MYTHIRDFILE" > /home/mythirdfile.txt
$ rm /backups/home.0/mythirdfile.txt
$ cp -a /home/mythirdfile.txt /backups/home.0/mythirdfile.txt
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r-- 3 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 25 Jul 8 19:45 myotherfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 29 Jul 8 19:51 mythirdfile.txt

/backups/home.1:
total 3
-rw-r--r-- 3 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 25 Jul 8 19:45 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

/backups/home.2:
total 3
-rw-r--r-- 3 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

/backups/home.3:
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt

$ du -sb /backups
132 /backups
Again, the total size of /backups has changed from 103 bytes to 132, a difference of 29 bytes, which is the size of the new version of mythirdfile.txt.
The proof is in the restore, right? Let’s look at all versions of all files by running the cat command against them:
$ cat /backups/home.3/*
original myfile
original myotherfile
original mythirdfile
$ cat /backups/home.2/*
LET'S CHANGE MYFILE
original myotherfile
original mythirdfile
$ cat /backups/home.1/*
LET'S CHANGE MYFILE
LET'S CHANGE MYOTHERFILE
original mythirdfile
$ cat /backups/home.0/*
LET'S CHANGE MYFILE
LET'S CHANGE MYOTHERFILE
NOW LET'S CHANGE MYTHIRDFILE
You can see that the oldest version (/backups/home.3) has the original version of every file, the next newest directory has the modified myfile.txt, the next newest version has that file and the modified myotherfile.txt, and the most recent version has all of the changed versions of every file. Isn’t it a thing of beauty?
This example is simple to understand but clunky to implement. Several of the steps are quite manual: creating the hard-linked directory, identifying the new files, removing the files you’re about to overwrite, and then sending the new files. Finally, our manual example does not deal with files that have been deleted from the original; they would remain in the backup. What if you had a command that could do all of that in one step? You do! It’s called rsync.
The rsync utility is a very well-known piece of GPL’d software, written originally by Andrew Tridgell and Paul Mackerras. If you have a common Unix variant, you probably already have it installed; if not, you can install a precompiled package for your system or download the source code from rsync.samba.org and build it yourself. rsync’s specialty is efficiently synchronizing file trees across a network, but it works well on a single machine, too. Here is an example to illustrate basic operation.
Suppose you have a directory called <source>/ whose contents you wish to copy into another directory called <destination>/. If you have GNU cp, you can make the copy like this:
$ cp -a source/. destination/
The archive flag (-a) causes cp to descend recursively through the file tree and to preserve file metadata, such as ownerships, permissions, and timestamps. The preceding command creates the destination directory first if necessary.
It is important to use <source>/. and not <source>/* because the latter silently ignores top-level files and subdirectories whose names start with a period (.). Such files are considered hidden and are not normally displayed in directory listings, but they may be important to you!
However, if you make regular backups from <source>/ to <destination>/, running cp every time is not efficient because even files that have not changed (which is most of them) must be copied every time. Also, you would have to periodically delete <destination>/ and start fresh, or backups of files that have been deleted from <source>/ will begin to accumulate.
Fortunately, where cp falls short at copying mostly unchanged filesystems, rsync excels. The rsync command works similarly to cp but uses a very clever algorithm that copies only changes. The equivalent rsync command would be:
$ rsync -a source/. destination/
rsync is persnickety about trailing slashes on the source argument; it treats <source> and <source>/ differently. Using the trailing /. is a good way to avoid ambiguity.
You’ll probably want to add the --delete flag, which, in addition to copying new changes, also deletes any files in <destination>/ that are absent (because they have presumably been deleted) from <source>/. You can also add the verbose flag (-v) to get detailed information about the transfer. The following command is a good way to regularly synchronize <source>/ to <destination>/:
$ rsync -av --delete source/. destination/
rsync is good for local file synchronization, but where it really stands out is synchronizing files over a network. rsync’s unique ability to copy only changes makes it very fast, and it can operate transparently over an ssh connection. To rsync from /<source>/ on the remote computer example.oreilly.com to the local directory /<destination>/ over ssh, you can use the command:
$ rsync -av --delete username@example.oreilly.com:/source/. /destination/
That was pull mode. rsync works just as well in push mode from a local /<source>/ to a remote /<destination>/:
$ rsync -av --delete /source/. username@example.oreilly.com:/destination/
As a final note, rsync provides a variety of --include and --exclude options that allow fine-grained control over which parts of the source directory to copy. If you wish to exclude certain files from the backup (for example, any file ending in .bak, or certain subdirectories), they may be helpful to you. For details and examples, see the rsync manpage.
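As one illustration, a backup that skips editor backup files might look like the following sketch; the pattern is illustrative, and scratch directories stand in for a real source and destination:

```shell
#!/bin/sh
# Exclude files matching *.bak from the copy.
SRC=$(mktemp -d); DST=$(mktemp -d)
echo keep > "$SRC/notes.txt"
echo skip > "$SRC/notes.txt.bak"

rsync -av --delete --exclude='*.bak' "$SRC"/. "$DST"/
ls "$DST"    # only notes.txt is copied
```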
Hard links are an important Unix concept to understand if you’re going to use this technique. Every object in a Unix filesystem (every file, directory, symbolic link, named pipe, and device node) is identified by a unique positive integer known as an inode number. An inode keeps track of such mundane details as what kind of object it is, where its data lives, when it was last updated, and who has permission to access it. If you use ls to list files, you can see inode numbers by adding the -i flag:
$ ls -i foo
409736 foo
What does foo consist of? It consists of an inode (specifically, inode number 409736) and some data. Also—and this is the critical part—there is now a record of the new inode in the directory where foo resides. It now lists the name foo next to the number 409736. That last part, the entry of a name and inode number in the parent directory, constitutes a hard link.
Most ordinary files are referenced by only one hard link, but it is possible for an inode to be referenced more than once. Inodes also keep a count of the number of hard links pointing to them. The ls -l command has a column showing how many links a given file has. (The number to the left of the file owner’s user ID is the number of hard links; it is one in the following example.)
$ ls -l foo
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 foo
To make a second link to a file within a filesystem, use the ln command. For example:
$ ln foo bar
$ ls -i foo bar
409736 foo
409736 bar
Hard links, like the one illustrated here, can be created only for files within the same filesystem.
Now the names foo and bar refer to the same file. If you edit foo, you are simultaneously editing bar, and vice versa. If you change the permissions or ownership on one, you’ve changed them on the other, too. There is no way to tell which name came first. They are equivalent.
When you remove a file, all you’re removing is a link to that inode; this is called unlinking. An inode is not actually released until the number of links to it drops to zero. (Even then, the inode is freed only when the last process has finished using it, which is why you can remove the executable of a running program.) For example, ls -l now tells you that foo has two links. If you remove bar, only one remains:
$ ls -l foo
-rw-r--r-- 2 cpreston mkgroup-l-d 16 Jul 8 19:35 foo
$ rm bar
$ ls -l foo
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 foo
The situation would have been the same if you’d removed foo and run ls -l on bar instead. If you now remove foo, the link count drops to zero, and the operating system releases inode number 409736.
Here is a summary of some of the important properties of hard links. If foo and bar are hard links to the same inode:
Changes to foo immediately affect bar, and vice versa.
Changes to the metadata of foo—the permissions, ownership, or timestamps—affect those of bar as well, and vice versa.
The contents of the file are stored only once. The ln command does not appreciably increase disk usage.
The hard links foo and bar must reside on the same filesystem. You cannot create a hard link in one filesystem to an inode in another because inode numbers are unique only within filesystems.
You must unlink both foo and bar (using rm) before the inode and data are released to the operating system.
In the previous section, you learned that ln foo bar creates a second hard link called bar to the inode of file foo. In many respects, bar looks like a copy of foo created at the same time. The differences become relevant only when you try to change one of them, examine inode numbers, or check disk space. In other words, as long as in-place changes are prohibited, the outcomes of cp foo bar and ln foo bar are virtually indistinguishable to users. The latter, however, does not use additional disk space.
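The distinction matters as soon as a file is replaced rather than changed in place, which is exactly what the unlink-then-copy backup step relies on. A quick demonstration in a scratch directory:

```shell
#!/bin/sh
# In-place changes are seen through both names; unlink-and-recreate is not.
D=$(mktemp -d)
cd "$D"
echo one > foo
ln foo bar

echo two >> foo      # append in place: foo and bar both show "one" and "two"

rm foo               # unlink foo; bar still points at the old inode
echo three > foo     # recreate foo as a brand-new inode
cat bar              # still the old contents ("one" and "two")
cat foo              # the new contents ("three")
```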
Suppose you wanted to make regular backups of a directory called <source>/. You might make the first backup, a full copy, using rsync:
$ rsync -av --delete source/. backup.0/
To make the second backup, you could simply make another full copy, like so:
$ mv backup.0 backup.1
$ rsync -av --delete source/. backup.0/
That would be inefficient if only a few of the files in <source>/ changed in the interim. Instead, you can create a second copy of backup.0 to use as a destination; since you’re using hard links, it won’t take up any more space. You can use two different techniques to make backups in this way. The first is a bit easier to understand and is what we used in the earlier example. The second streamlines things, doing everything in one command.
GNU cp provides a flag, -l, to make hard-link copies rather than regular copies. It can even be invoked recursively on directories:
$ mv backup.0 backup.1
$ cp -al backup.1/. backup.0/
$ rsync -av --delete source/. backup.0/
Putting cp -al in between the two backup commands creates a hard-linked copy of the most recent backup. You then rsync new changes from <source>/ to backup.0. rsync ignores files that have not changed, so it leaves the links of unchanged files intact. When it needs to change a file, it unlinks the original first, so its partner is unaffected. As mentioned before, the --delete flag also deletes any files in the destination that are no longer present in the source (which only unlinks them in the current backup directory).
rsync is now taking care of a lot. It decides which files to copy, unlinks them from the destination, copies the changed files, and deletes any files it needs to. Now let’s take a look at how it can handle the hard links as well. rsync now provides a new option, --link-dest, that will do this for you, even when only metadata has changed. Rather than running separate cp -al and rsync stages, the --link-dest flag instructs rsync to do the whole job, copying changes into the new directory and making hard links where possible for unchanged files. It is significantly faster, too.
$ mv /backups/home.0 /backups/home.1
$ rsync -av --delete --link-dest=../home.1 /home/. /backups/home.0/
Notice the relative path of ../ for home.1/. The path for the --link-dest argument should be relative to the target directory (in this case, /backups/home.0). This has confused many people.
The following script can be run as many times as you want. The first time it runs,
it creates the first “full backup” of /home in
/backups/home.inprogress and moves that
directory to /backups/home.0 upon completion. The
next time through, rsync
creates a hard-linked copy
of /backups/home.0 in /backups/home.inprogress, then uses that directory to synchronize to,
updating any files that have changed, after first unlinking them. This script then
keeps three versions of the backups.
rsync -av --delete --link-dest=../home.0 /home/. /backups/home.inprogress/
[ -d /backups/home.2 ] && rm -rf /backups/home.2
[ -d /backups/home.1 ] && mv /backups/home.1 /backups/home.2
[ -d /backups/home.0 ] && mv /backups/home.0 /backups/home.1
[ -d /backups/home.inprogress ] && mv /backups/home.inprogress /backups/home.0
touch /backups/home.0
This is a very basic script. If you’re serious about implementing this idea, you have two options: either go to Mike Rubel’s web page at http://www.mikerubel.org/computers/rsync_snapshots or look at the section on rsnapshot later in this chapter. rsnapshot is a full implementation of this idea, complete with a user group that supports it.
Because backups created this way are just conventional Unix filesystems, there are as many options for restoring them as there are ways to copy files. If your backups are stored locally (as on a removable hard disk) or accessible over a network filesystem such as NFS, you can simply cp files from the backups to /home. Or better yet, rsync them back:
$ rsync -av --delete /backups/home.0/. /home/
Be careful with that --delete
flag when
restoring: make sure you really mean it!
If the backups are stored remotely on a machine you can access by ssh, you can use scp or rsync over ssh. Other simple arrangements are also possible, such as placing the directories somewhere that is accessible to a web server.
Here are a few other things to consider if you’re going to use the rsync
method for creating backups.
One drawback of the rsync
/hard-link approach is
that the sharing of unchanged files makes it deceptively hard to define the size of
any one backup directory. A normally reasonable question such as, “Which backup
directories should I erase to free 100 MB of disk space?” cannot be answered in a
straightforward way.
The space freed by removing any one backup directory is the total disk usage of all files whose only hard links reside in that directory, plus overhead. You can obtain a list of such files using the find command, here applied to the backup directory /backups/home.1/:
$ find /backups/home.1 -type f -links 1 -print
The following command prints their total disk usage:
$ du -hc $(find /backups/home.1 -type f -links 1 -print) | tail -n 1
Deleting more than one backup directory usually frees more than the sum of individual disk usages because it also erases any files that were shared exclusively among them.
This command may report erroneous numbers if the source data had a lot of hard-linked files.
There are a number of popular mail storage formats in use today. The venerable mbox format holds all messages of a folder in one large flat file. The newer maildir format, popularized by Qmail, allows each message to be a small file. Other database mail stores are also in use.
Of these, maildirs are by far the most efficient for the rsync/hard-link technique because their structure leaves most files (older messages) unchanged. (This is true of the original rsync/hard-link method and of rsnapshot, which is covered later in the chapter.) For mbox-format mail spools, consider rdiff-backup instead.
If you have a slow network connection, you may wish to use rsync’s --bwlimit flag to keep the backup from saturating it. It allows you to specify a maximum bandwidth in kilobytes per second.
If you give rsync the --numeric-ids option, it ignores usernames, avoiding the need to create user accounts on the backup server.
The method described in this chapter is designed for many small files that don’t change very often. When this assumption breaks down, such as when backing up large files that change regularly (databases, mbox-format mail spools, or UML COW files), this method is not disk-space efficient. For these situations, consider using rdiff-backup, covered later in this chapter.
While rsync works under Cygwin, issues have been reported with timestamps, particularly when backing up FAT filesystems. Windows systems traditionally operate on local time, with its daylight saving shifts and local quirks, as opposed to Unix’s universal time. At least some file timestamps are kept with only two-second resolution. Consider giving rsync a --modify-window of 2 on Windows.
While it has improved in recent years, rsync uses a lot of memory when synchronizing large file trees. It may be necessary to break up your backup job into pieces and run them individually. If you take this route, it may also be necessary to manually delete files from the destination that have been deleted from the source.
rsync takes a finite period of time to operate, and when a source file changes while the backup is in progress, a partial transfer error (code 23) may be generated. Only the files that changed mid-transfer may be missed; rsync completes the rest of the job as much as possible.
If you run backups only when your system is relatively static, such as in the middle of the night for an office environment, partial transfers may never be a problem for you. If they are a problem, consider making use of rsync’s --partial option.
rsnapshot is an open-source backup and recovery tool based on Mike Rubel’s original concept. It is designed to take a great idea and make it more accessible to users and more useful for larger environments. Under the hood, rsnapshot works in the same way as Mike Rubel’s original scripts covered in the first part of the chapter. It uses hard links to conserve space and rsync to copy changes and break the hard links when necessary.
The major differences between rsnapshot and Mike Rubel’s original script are that rsnapshot:
Is written in Perl, because Perl skills are readily available and Perl can more easily parse a configuration file.
Supports multiple levels of backup (daily, weekly, and monthly) from one script.
Backs up multiple directory trees from multiple sources, with some optional fine-tuning of the rsync parameters used.
Supports locking so that two backups don’t run at once.
Has syslog support.
Supports pre- and post-backup scripts, which are useful for things like taking dumps of databases.
rsnapshot is most at home on a Unix-like platform, such as Linux, FreeBSD, or Solaris, but it can be used to back up data on non-Unix platforms. rsnapshot itself needs to run somewhere that rsync can deal with hard links, which means somewhere like Unix. For these purposes, Mac OS X counts as a Unix platform, although there are some issues with Mac OS X’s case-insensitive filesystem (HFS+) and resource forks. Thankfully, this is very rarely an issue; resource forks are nowhere near as important in Mac OS X as they used to be in “classic” Mac OS. Please read the section on this topic in Chapter 3.
Classic Mac OS is not supported at all. While it may be possible to get an rsync server running on it and for rsnapshot to back it up, no changes will be made to rsnapshot to support this.
rsnapshot can back up Windows systems, but running the rsnapshot server on Windows is not supported because of the lack of hard-link support. Also, because permissions are so different in the Windows world compared with the Unix world, they will almost certainly not be backed up properly. Timestamps on files may not be backed up properly either, because Windows seems to store timestamps in the local time zone whereas Unix uses GMT. With these caveats, however, rsnapshot is being used successfully by many people to back up their Windows machines. To back up permissions, you can run a script before the backup that dumps information about file permissions and users to a file that is then included in the snapshot. When you recover a file from backup, you then restore the correct permissions either by hand or with a script, referring to that file. Obviously, you should test this procedure before relying on it, to make sure you are recording enough data and that your script correctly reads the file!
Finally, some obscure features of Unix filesystems aren’t backed up because rsync doesn’t support them: ACLs and filesystem extended attributes. These are rarely used, and you can use the same workaround described for Windows. No scripts of this nature are supplied with rsnapshot because there is so much variety from one platform to another (which is the same reason that rsync itself doesn’t support them). Note that rdiff-backup often supports backing up ACLs.
With even the slightest change to a file, rsnapshot puts a new copy of the file into your backup. This works well if you add a file, remove a file, or make large changes to a file. However, it falls down if you make small changes to a big file or constantly change a file; in those cases, you lose all the benefits of hard links. (rdiff-backup, covered later in the chapter, handles this better.) Files where these sorts of changes are common are logfiles and mboxes. Normally, these make up only a small proportion of the data on a machine, so the amount of wasted space is actually quite small compared to the size of a backup. However, in the specific cases of mail servers and syslog servers, pretty much the only changes to files are of this sort. In those cases, we recommend using rdiff-backup instead.
rsnapshot
packages are available for several
Linux distributions and can be easily installed using the native package manager. If
your distribution or OS doesn’t have such a package available or if you want to be up to
date with the most recent version, you can download a gzipped tar
file from:
http://www.rsnapshot.org/downloads.html
To install that version, first uncompress the file:
$ gzip -dc rsnapshot-version.tar.gz | tar xvf -
Then change to the directory this command created:
$ cd rsnapshot-version
There is a file of instructions called INSTALL, but, in summary, for a fresh install you would normally need to issue these commands:
$ ./configure
$ su
(you will be asked for the root password)
# make install
# cp /etc/rsnapshot.conf.default /etc/rsnapshot.conf
You then edit the file /etc/rsnapshot.conf to tell it where to put your backups, what to back up, and so on. There are copious instructions in the example config. Most users need to change only these settings:
backup
What to back up
snapshot_root
Where to put your backups
interval
When to back up
include and exclude
Which files rsync pays attention to
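For illustration, the relevant part of a minimal /etc/rsnapshot.conf might look like this (the paths and interval counts are invented; note that rsnapshot requires fields to be separated by tabs, not spaces):

```
snapshot_root	/backup/snapshots/

interval	daily	7
interval	weekly	4
interval	monthly	6

backup	/home/	localhost/
backup	/etc/	localhost/

exclude	*.tmp
```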
You can check your new configuration by typing:
$ rsnapshot configtest
If it all looks good, create cron
jobs to run
rsnapshot
automatically. You can put them in root’s
cron
file like this:
MAILTO=[email protected]   # so that I get emailed any errors
00 00 * * * /usr/local/bin/rsnapshot daily     # every day at midnight
00 22 * * 6 /usr/local/bin/rsnapshot weekly    # every Saturday at 22:00
00 20 1 * * /usr/local/bin/rsnapshot monthly   # the first of the month at 20:00
If you want to back up data from other machines, use ssh
(which must be installed on both ends of the connection) and keys.
There’s a good description of how to do this securely here:
http://troy.jdmz.net/rsnapshot/
There’s also a web frontend to the CVS repository here:
http://rsnapshot.cvs.sourceforge.net/rsnapshot/rsnapshot/
Open-source software lives or dies by its support. Thankfully, rsnapshot
is well supported by an active mailing list that
users are encouraged to subscribe to. It is used for reporting problems, discussing
their fixes, announcing new releases, and so on. You can join the list or search the
archives here:
https://lists.sourceforge.net/lists/listinfo/rsnapshot-discuss
Many people post patches and add-ons to the list, some of which are committed to CVS by the small number of developers. If you’ve got good ideas and have posted good patches to the list, getting permission to commit to CVS is easy. Of course, everyone can look at and download the CVS repository anonymously.
rsnapshot
is stable software. There are no plans
to make major changes in how it works. The sort of changes you can expect to see
are:
More add-on scripts for reporting
More database support scripts
More configuration options
Occasional bug fixes
rdiff-backup
is a program written in Python and C that
uses the same rolling-checksum algorithm that rsync
does. Although rdiff-backup
and rsync
are similar and use the same algorithm, they do not
share any code and must be installed separately.
When backing up, both rsnapshot
and rdiff-backup
create a mirror of the source directory. For
both, the current backup is just a copy of the source, ready to be copied and verified
like an ordinary directory. And both can be used over ssh
in either push or pull mode. The most important conceptual differences
between rsync
snapshots and rdiff-backup
are how they store older backups and how they store file
metadata.
An rsync
-snapshot system basically stores older
backups as complete copies of the source. As mentioned earlier in the chapter, by being
clever with hard links, these copies do not take long to create and usually do not take up
nearly as much disk space as unlinked copies. However, every distinct version of every
file in the backup is stored as a separate copy of that file. For instance, if you add one
line to a file or change a file’s permissions, that file is stored twice in the backup
archive in its entirety. This can be especially troublesome with logfiles, which grow
by a small amount quite often.
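The hard-link trick behind such snapshots can be seen with coreutils alone (a sketch with made-up paths; a real rotation would follow the cp -al with something like rsync -a --delete source/ daily.0/ so that changed files replace their links):

```shell
#!/bin/sh
mkdir -p daily.0
echo "unchanged data" > daily.0/file.txt   # yesterday's snapshot
cp -al daily.0 daily.1                     # "copy" the snapshot using hard links
stat -c %h daily.0/file.txt                # prints 2: one inode shared by two snapshots
```

Because both snapshot directories point at the same inode, the second "full" copy costs almost no disk space or time to create.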
On the other hand, rdiff-backup
does not keep
complete copies of older files in the backup archive. Instead, it stores only the
compressed differences between current files and their older versions, called
diffs
or deltas. For logfiles, rdiff-backup
would not keep a separate copy of the older and slightly shorter
log. Instead, it would save to the archive a delta file that contains the information “the
older version is the current version but without the last few lines.” These deltas are
often much smaller than an entire copy of the older file. When a file has changed
completely, the delta is about the same size as the older version (but is then
compressed).
When an rdiff-backup
archive has multiple versions
of a file, the program stores a series of deltas. Each one contains instructions on how to
construct an earlier version of a file from a later one. When restoring, rdiff-backup
starts with the current version and applies
deltas in reverse order.
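The delta chain can be imitated with ordinary diff and patch (an illustration of the idea only; rdiff-backup actually stores compressed binary librsync deltas, and the filenames here are made up):

```shell
#!/bin/sh
# Three successive versions of a growing file.
printf 'jan 1\n' > v1
printf 'jan 1\njan 2\n' > v2
printf 'jan 1\njan 2\njan 3\n' > v3

# Store only the current version plus reverse deltas.
diff -u v3 v2 > delta.2 || true   # instructions to rebuild v2 from v3
diff -u v2 v1 > delta.1 || true   # instructions to rebuild v1 from v2
                                  # (diff exits 1 when files differ; expected)

# Restore the oldest version: start from the current copy and
# apply the deltas in reverse order, newest first.
cp v3 restored
patch restored < delta.2          # restored is now identical to v2
patch restored < delta.1          # restored is now identical to v1
```

Each delta is small when the change is small, which is exactly why this scheme beats full copies for slowly growing files.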
Besides storing older versions as deltas instead of copies, rdiff-backup
also stores the (compressed)
metadata
of all files in the backup archive. Metadata is data associated with a file that describes
the file’s real data. Some examples of file metadata are ownership, permissions,
modification time, and file length. This metadata does not take up much space because
metadata is generally very compressible. Newer versions go further and store only deltas
of the metadata, for even more space efficiency.
At the cost of some disk space, storing metadata separately has several uses: first,
data loss is avoided even if the destination filesystem does not support all the features
of the source filesystem. For instance, ownership can be preserved even without root
access, and Linux filesystems with symbolic links, device files, and ACLs can be backed up
to a Windows filesystem. You don’t have to examine the details of each filesystem to know
that the backup will work. Second, with metadata stored separately, rdiff-backup
is less disk-intensive on the backup server. When
backing up, rdiff-backup
does not need to traverse the
mirror’s directory structure to determine which files have changed. Third, metadata such
as SHA-1 checksums can be used to verify the integrity of backups.
Here are some advantages of using rdiff-backup
instead of an rsync
script or rsnapshot
:
Because rdiff-backup
does not store
complete copies of older files but only the compressed differences between older
and current files, backups generally consume less disk space.
Unlike rsync
, rdiff-backup
was written originally for backups. It has sensible
defaults (so no need for the -av --delete -e ssh
options) and fewer quirks (for instance,
there is no distinction between <destination>, <destination>/, and <destination>/.).
With rsync
, all information is stored in
the filesystem itself. If you log in to your backup repository as a nonroot user
(generally a good idea), the rsync
method
forgets who owns all your files! rdiff-backup
keeps a copy of all metadata in a separate file, so no information is lost, even
if you aren’t root or if you back up to a different kind of filesystem.
rdiff-backup
has several miscellaneous
handy features. For example, it keeps detailed logs on what is changing and has
commands to process those logs so that you know which files are using up your
space and time. Also, newer versions keep SHA-1 checksums of all files so you can
verify the integrity of backups. Some rsync
scripts have similar features—check their documentation.
Let’s be honest. rdiff-backup
has some
disadvantages, too:
rdiff-backup
consumes more CPU than
rsync
and is therefore slower than most
rsync
scripts. This difference is often not
noticeable when the bottleneck is the network or a disk drive but can be
significant for local backups.
With rsync
scripts, all past backups appear
as copies and are thus easy to verify, restore, and delete. With rdiff-backup
, only the current backup appears as a
true copy. (Earlier backups are stored as compressed deltas.)
rdiff-backup
is written in Python and
requires the librsync
library. Unless you use a
distribution that includes rdiff-backup
(most
of them include it), installation could entail downloading and installing other
files.
Here’s a basic, but complete, example of how to use rdiff-backup
to back up and restore a directory. Suppose the directory to
be backed up is called <source>, and we want
our archive directory to be called <destination>:
$ rdiff-backup source destination
This command backs up the <source>
directory into <destination>. If you look
into <destination>, you’ll see that it is
just like <source> but contains a directory
called <destination>/rdiff-backup-data where
the metadata and deltas are stored. The rdiff-backup-data directory is laid out in a fairly straightforward
way—all information is either in (possibly gzipped) text files or in deltas readable
with the rdiff
utility—but we don’t have the space to
go into the data format here.
The first time you run this command, it creates the <destination> and <destination>/rdiff-backup-data directories. On subsequent runs, it sees that <destination> exists and makes an incremental backup instead. For daily backup usage, no special switches are necessary.
Suppose you accidentally delete the file <source>/foobar and want to restore it from backups. Both of these commands do that:
$ cp -a destination/foobar source
$ rdiff-backup -r now destination/foobar source
The first command works because <destination>/foobar is a mirror of <source>/foobar, so you can use cp
or any other utility to restore. The second command contains the
-r switch, which tells rdiff-backup
to enter restore mode, and restore the specified file at the
given time. In the example, now
is specified, meaning
restore the most recent version of the file. rdiff-backup
accepts a large variety of time formats.
Now suppose you realize you deleted the important file <source>/foobar a week ago and want to restore. You can’t use
cp
to restore because the file is no longer present
in <destination> in its original form (in
this case it’s gzipped in the <destination>/rdiff-backup-data directory). However, the -r
syntax still works, except you tell it 7D
for seven days:
$ rdiff-backup -r 7D destination/foobar source
Finally, suppose that the <destination> directory is getting too big, and you need to delete older backups to save disk space. This command deletes backup information more than one year old:
$ rdiff-backup --remove-older-than 1Y destination
Just like rsync
, rdiff-backup
allows the source or destination directory (or both) to be on
a remote computer. For example, to back up the local directory <source>
to the <destination> directory on
the computer host.net, use the command:
$ rdiff-backup source user@host.net::destination
This works as long as rdiff-backup
is installed
on both computers, and host.net can receive
ssh
connections. The earlier commands also work if
user@host.net::<destination> is substituted
for <destination>.
Although rdiff-backup
was originally developed
under Linux for Unix-style systems, newer versions have features that are useful to
Windows and Mac users.
For instance, rdiff-backup
can back up case-sensitive
filesystems and files whose names contain colons (:) to Windows filesystems. Also,
rdiff-backup
supports Mac resource forks and Finder
information, and is easy to install on Mac OS X because it is included in the Fink
distribution. Unfortunately, rdiff-backup
is a bit
trickier to install natively under Windows; currently, cygwin
is probably the easiest way.
Future development of rdiff-backup
may consist
mostly of making sure that the newer features like full Mac OS X support are as stable
as the core Unix support, and adding support for new filesystem features as they emerge.
For more information on rdiff-backup
, including full documentation and a pointer to the mailing
list, see the rdiff-backup
project home page at
http://rdiff-backup.nongnu.org/.
BackupCentral.com has a wiki page for every chapter in this book. Read or contribute updated information about this chapter at http://www.backupcentral.com.
[1] Some of these differences no longer apply because Mike has continued to update
his script. Specifically, the new version of Mike Rubel’s script supports multiple
trees and locking, and it also has a syslog
support wrapper.