Managing Disk Space

While the cost of storage has dropped incredibly in the past few years, disk usage is still a valid concern for administrators seeking to version large amounts of data. Every bit of version history information stored in the live repository needs to be backed up elsewhere, perhaps multiple times as part of rotating backup schedules. It is useful to know which pieces of Subversion’s repository data need to remain on the live site, which need to be backed up, and which can be safely removed.

How Subversion saves disk space

To keep the repository small, Subversion uses deltification (or deltified storage) within the repository itself. Deltification involves encoding the representation of a chunk of data as a collection of differences against some other chunk of data. If the two pieces of data are very similar, this deltification results in storage savings for the deltified chunk—rather than taking up space equal to the size of the original data, it takes up only enough space to say, I look just like this other piece of data over here, except for the following couple of changes. The result is that most of the repository data that tends to be bulky—namely, the contents of versioned files—is stored at a much smaller size than the original full-text representation of that data. And for repositories created with Subversion 1.4 or later, the space savings are even better—now those full-text representations of file contents are themselves compressed.

Note

Because all of the data that is subject to deltification in a BDB-backed repository is stored in a single Berkeley DB database file, reducing the size of the stored values will not immediately reduce the size of the database file itself. Berkeley DB will, however, keep internal records of unused areas of the database file and consume those areas first before growing the size of the database file. So, while deltification doesn’t produce immediate space savings, it can drastically slow future growth of the database.

Removing dead transactions

Though they are uncommon, there are circumstances in which a Subversion commit process might fail, leaving behind in the repository the remnants of the revision-to-be that wasn’t—an uncommitted transaction and all the file and directory changes associated with it. This could happen for several reasons: perhaps the client operation was inelegantly terminated by the user, or a network failure occurred in the middle of an operation. Regardless of the reason, dead transactions can happen. They don’t do any real harm, other than consuming disk space. A fastidious administrator may nonetheless wish to remove them.

You can use the svnadmin lstxns command to list the names of the currently outstanding transactions:

$ svnadmin lstxns myrepos
19
3a1
a45
$

Each item in the resultant output can then be used with svnlook (and its --transaction (-t) option) to determine who created the transaction, when it was created, what types of changes were made in the transaction—information that is helpful in determining whether the transaction is a safe candidate for removal! If you do indeed want to remove a transaction, its name can be passed to svnadmin rmtxns, which will perform the cleanup of the transaction. In fact, svnadmin rmtxns can take its input directly from the output of svnadmin lstxns!

$ svnadmin rmtxns myrepos 'svnadmin lstxns myrepos'
$

If you use these two subcommands like this, you should consider making your repository temporarily inaccessible to clients. That way, no one can begin a legitimate transaction before you start your cleanup. Example 5-1 contains a bit of shell-scripting that can quickly generate information about each outstanding transaction in your repository.

Example 5-1. txn-info.sh (reporting outstanding transactions)
#!/bin/sh

### Generate informational output for all outstanding transactions in
### a Subversion repository.

REPOS="${1}"
if [ "x$REPOS" = x ] ; then
  echo "usage: $0 REPOS_PATH"
  exit
fi

for TXN in 'svnadmin lstxns ${REPOS}'; do 
  echo "---[ Transaction ${TXN} ]-------------------------------------------"
  svnlook info "${REPOS}" -t "${TXN}"
done

The output of the script is basically a concatenation of several chunks of svnlook info output (see svnlook) and will look something like this:

$ txn-info.sh myrepos
---[ Transaction 19 ]-------------------------------------------
sally
2001-09-04 11:57:19 -0500 (Tue, 04 Sep 2001)
0
---[ Transaction 3a1 ]-------------------------------------------
harry
2001-09-10 16:50:30 -0500 (Mon, 10 Sep 2001)
39
Trying to commit over a faulty network.
---[ Transaction a45 ]-------------------------------------------
sally
2001-09-12 11:09:28 -0500 (Wed, 12 Sep 2001)
0
$

A long-abandoned transaction usually represents some sort of failed or interrupted commit. A transaction’s datestamp can provide interesting information—for example, how likely is it that an operation begun nine months ago is still active?

In short, transaction cleanup decisions need not be made unwisely. Various sources of information—including Apache’s error and access logs, Subversion’s operational logs, Subversion revision history, and so on—can be employed in the decision-making process. And of course, an administrator can often simply communicate with a seemingly dead transaction’s owner (e.g., via email) to verify that the transaction is, in fact, in a zombie state.

Purging unused Berkeley DB logfiles

Until recently, the largest offender of disk space usage with respect to BDB-backed Subversion repositories were the logfiles in which Berkeley DB performs its prewrites before modifying the actual database files. These files capture all the actions taken along the route of changing the database from one state to another—while the database files, at any given time, reflect a particular state, the logfiles contain all of the many changes along the way between states. Thus, they can grow and accumulate quite rapidly.

Fortunately, beginning with the 4.2 release of Berkeley DB, the database environment has the ability to remove its own unused logfiles automatically. Any repositories created using svnadmin when compiled against Berkeley DB version 4.2 or later will be configured for this automatic logfile removal. If you don’t want this feature enabled, simply pass the --bdb-log-keep option to the svnadmin create command. If you forget to do this or change your mind at a later time, simply edit the DB_CONFIG file found in your repository’s db directory, comment out the line that contains the set_flags DB_LOG_AUTOREMOVE directive, and then run svnadmin recover on your repository to force the configuration changes to take effect. See Berkeley DB Configuration for more information about database configuration.

Without some sort of automatic logfile removal in place, logfiles will accumulate as you use your repository. This is actually something of a feature of the database system—you should be able to recreate your entire database using nothing but the logfiles, so these files can be useful for catastrophic database recovery. But typically, you’ll want to archive the logfiles that are no longer in use by Berkeley DB and then remove them from disk to conserve space. Use the svnadmin list-unused-dblogs command to list the unused logfiles:

$ svnadmin list-unused-dblogs /var/svn/repos
/var/svn/repos/log.0000000031
/var/svn/repos/log.0000000032
/var/svn/repos/log.0000000033
...
$ rm 'svnadmin list-unused-dblogs /var/svn/repos'
## disk space reclaimed!

Warning

BDB-backed repositories whose logfiles are used as part of a backup or disaster recovery plan should not make use of the logfile autoremoval feature. Reconstruction of a repository’s data from logfiles can only be accomplished only when all the logfiles are available. If some of the logfiles are removed from disk before the backup system has a chance to copy them elsewhere, the incomplete set of backed-up logfiles is essentially useless.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset