While the cost of storage has dropped incredibly in the past few years, disk usage is still a valid concern for administrators seeking to version large amounts of data. Every bit of version history information stored in the live repository needs to be backed up elsewhere, perhaps multiple times as part of rotating backup schedules. It is useful to know which pieces of Subversion’s repository data need to remain on the live site, which need to be backed up, and which can be safely removed.
To keep the repository small, Subversion uses deltification (or deltified storage) within the repository itself. Deltification involves encoding the representation of a chunk of data as a collection of differences against some other chunk of data. If the two pieces of data are very similar, this deltification results in storage savings for the deltified chunk—rather than taking up space equal to the size of the original data, it takes up only enough space to say, “I look just like this other piece of data over here, except for the following couple of changes.” The result is that most of the repository data that tends to be bulky—namely, the contents of versioned files—is stored at a much smaller size than the original full-text representation of that data. And for repositories created with Subversion 1.4 or later, the space savings are even better—now those full-text representations of file contents are themselves compressed.
Because all of the data that is subject to deltification in a BDB-backed repository is stored in a single Berkeley DB database file, reducing the size of the stored values will not immediately reduce the size of the database file itself. Berkeley DB will, however, keep internal records of unused areas of the database file and consume those areas first before growing the size of the database file. So, while deltification doesn’t produce immediate space savings, it can drastically slow future growth of the database.
Though they are uncommon, there are circumstances in which a Subversion commit process might fail, leaving behind in the repository the remnants of the revision-to-be that wasn’t—an uncommitted transaction and all the file and directory changes associated with it. This could happen for several reasons: perhaps the client operation was inelegantly terminated by the user, or a network failure occurred in the middle of an operation. Regardless of the reason, dead transactions can happen. They don’t do any real harm, other than consuming disk space. A fastidious administrator may nonetheless wish to remove them.
You can use the svnadmin lstxns command to list the names of the currently outstanding transactions:
$ svnadmin lstxns myrepos 19 3a1 a45 $
Each item in the resultant output can then be used with svnlook (and its --transaction
(-t
) option) to determine who created the
transaction, when it was created, what types of changes were made in
the transaction—information that is helpful in determining whether the
transaction is a safe candidate for removal! If you do indeed want to
remove a transaction, its name can be passed to svnadmin rmtxns, which will perform the
cleanup of the transaction. In fact, svnadmin
rmtxns can take its input directly from the output of
svnadmin lstxns!
$ svnadmin rmtxns myrepos 'svnadmin lstxns myrepos' $
If you use these two subcommands like this, you should consider making your repository temporarily inaccessible to clients. That way, no one can begin a legitimate transaction before you start your cleanup. Example 5-1 contains a bit of shell-scripting that can quickly generate information about each outstanding transaction in your repository.
#!/bin/sh ### Generate informational output for all outstanding transactions in ### a Subversion repository. REPOS="${1}" if [ "x$REPOS" = x ] ; then echo "usage: $0 REPOS_PATH" exit fi for TXN in 'svnadmin lstxns ${REPOS}'; do echo "---[ Transaction ${TXN} ]-------------------------------------------" svnlook info "${REPOS}" -t "${TXN}" done
The output of the script is basically a concatenation of several chunks of svnlook info output (see svnlook) and will look something like this:
$ txn-info.sh myrepos ---[ Transaction 19 ]------------------------------------------- sally 2001-09-04 11:57:19 -0500 (Tue, 04 Sep 2001) 0 ---[ Transaction 3a1 ]------------------------------------------- harry 2001-09-10 16:50:30 -0500 (Mon, 10 Sep 2001) 39 Trying to commit over a faulty network. ---[ Transaction a45 ]------------------------------------------- sally 2001-09-12 11:09:28 -0500 (Wed, 12 Sep 2001) 0 $
A long-abandoned transaction usually represents some sort of failed or interrupted commit. A transaction’s datestamp can provide interesting information—for example, how likely is it that an operation begun nine months ago is still active?
In short, transaction cleanup decisions need not be made unwisely. Various sources of information—including Apache’s error and access logs, Subversion’s operational logs, Subversion revision history, and so on—can be employed in the decision-making process. And of course, an administrator can often simply communicate with a seemingly dead transaction’s owner (e.g., via email) to verify that the transaction is, in fact, in a zombie state.
Until recently, the largest offender of disk space usage with respect to BDB-backed Subversion repositories were the logfiles in which Berkeley DB performs its prewrites before modifying the actual database files. These files capture all the actions taken along the route of changing the database from one state to another—while the database files, at any given time, reflect a particular state, the logfiles contain all of the many changes along the way between states. Thus, they can grow and accumulate quite rapidly.
Fortunately, beginning with the 4.2 release of Berkeley DB, the
database environment has the ability to remove its own unused logfiles
automatically. Any repositories created using svnadmin
when compiled against Berkeley DB version 4.2 or later
will be configured for this automatic logfile removal. If you don’t
want this feature enabled, simply pass the --bdb-log-keep
option to the
svnadmin create command. If you
forget to do this or change your mind at a later time, simply edit
the DB_CONFIG file
found in your repository’s
db directory, comment out the
line that contains the set_flags DB_LOG_AUTOREMOVE
directive,
and then run svnadmin
recover on your repository to force the configuration
changes to take effect. See Berkeley DB Configuration for more information about
database configuration.
Without some sort of automatic logfile removal in place, logfiles will accumulate as you use your repository. This is actually something of a feature of the database system—you should be able to recreate your entire database using nothing but the logfiles, so these files can be useful for catastrophic database recovery. But typically, you’ll want to archive the logfiles that are no longer in use by Berkeley DB and then remove them from disk to conserve space. Use the svnadmin list-unused-dblogs command to list the unused logfiles:
$ svnadmin list-unused-dblogs /var/svn/repos /var/svn/repos/log.0000000031 /var/svn/repos/log.0000000032 /var/svn/repos/log.0000000033 ... $ rm 'svnadmin list-unused-dblogs /var/svn/repos' ## disk space reclaimed!
BDB-backed repositories whose logfiles are used as part of a backup or disaster recovery plan should not make use of the logfile autoremoval feature. Reconstruction of a repository’s data from logfiles can only be accomplished only when all the logfiles are available. If some of the logfiles are removed from disk before the backup system has a chance to copy them elsewhere, the incomplete set of backed-up logfiles is essentially useless.