Corpus editing with file locking

Corpus readers and views are all read-only, but there will be times when you want to add to or edit the corpus files. However, modifying a corpus file while other processes are using it, such as through a corpus reader, can lead to dangerous undefined behavior. This is where file locking comes in handy.

Getting ready

You must install the lockfile library using sudo easy_install lockfile or sudo pip install lockfile. This library provides cross-platform file locking, and so will work on Windows, Unix/Linux, Mac OS X, and more. You can find detailed documentation on lockfile at http://packages.python.org/lockfile/.

How to do it...

Here are two file editing functions: append_line() and remove_line(). Both try to acquire an exclusive lock on the file before updating it. An exclusive lock means that each function waits until no other process holds the lock on the file. Once the lock is acquired, any other process that tries to lock the file has to wait until the lock is released. This way, as long as every reader and writer takes the lock first, modifying the file is safe and does not cause undefined behavior in other processes. These functions can be found in corpus.py, as follows:

import lockfile, tempfile, shutil

def append_line(fname, line):
  with lockfile.FileLock(fname):
    fp = open(fname, 'a+')
    fp.write(line)
    fp.write('\n')
    fp.close()

def remove_line(fname, line):

  with lockfile.FileLock(fname):
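    # text-mode temp file, so the lines read from fp can be written to it as-is
    tmp = tempfile.TemporaryFile(mode='w+')
    # 'r+' opens the existing file for both reading and writing
    fp = open(fname, 'r+')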
    # write all lines from orig file, except if matches given line
    for l in fp:
      if l.strip() != line:
        tmp.write(l)

    # reset file pointers so entire files are copied
    fp.seek(0)
    tmp.seek(0)
    # copy tmp into fp, then truncate to remove trailing line(s)
    shutil.copyfileobj(tmp, fp)
    fp.truncate()
    fp.close()
    tmp.close()

The lock acquiring and releasing happens transparently when you do with lockfile.FileLock(fname).

Note

Instead of using with lockfile.FileLock(fname), you can also get a lock by calling lock = lockfile.FileLock(fname), then call lock.acquire() to acquire the lock, and lock.release() to release the lock.
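
As a minimal sketch of that explicit form, the following variant of append_line() wraps the write in try/finally so the lock is released even if the write fails. The name append_line_explicit and the make_lock parameter are hypothetical and not part of corpus.py; make_lock exists only so that a stand-in lock can be passed in when trying the function without lockfile installed.

```python
def append_line_explicit(fname, line, make_lock=None):
    """Append `line` to `fname`, taking the lock with acquire()/release()."""
    if make_lock is None:
        # imported here so the sketch can also be tried with a stand-in lock
        import lockfile
        make_lock = lockfile.FileLock
    lock = make_lock(fname)
    lock.acquire()
    try:
        with open(fname, 'a') as fp:
            fp.write(line + '\n')
    finally:
        # release even if the write raises, so the lock is never left behind
        lock.release()
```

According to the lockfile documentation, acquire() also accepts an optional timeout argument, which is useful when waiting indefinitely is not acceptable.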

How it works...

You can use these functions as follows:

>>> from corpus import append_line, remove_line
>>> append_line('test.txt', 'foo')
>>> remove_line('test.txt', 'foo')

In append_line(), a lock is acquired, the file is opened in append mode, the text is written along with an end-of-line character, and the file is closed. The lock is then released when the with block exits.

Tip

A lock acquired by lockfile only protects the file from other processes that also use lockfile. In other words, just because your Python process has a lock with lockfile doesn't mean a non-Python process can't modify the file. For this reason, it's best to only use lockfile with files that will not be edited by any non-Python processes, or by Python processes that do not use lockfile.
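
To illustrate why the lock is only a convention, here is a toy lock built from the standard library alone. SimpleLock is a hypothetical helper written for this illustration and makes no claim to match how lockfile is implemented; it marks a file as locked by creating a sibling .lock file. A second writer that checks the lock is refused, while a writer that ignores it appends to the file unhindered.

```python
import os
import tempfile

class SimpleLock:
    """Toy advisory lock: holds `path + '.lock'` while the block runs.
    Hypothetical helper for illustration only."""
    def __init__(self, path):
        self.lock_path = path + '.lock'

    def __enter__(self):
        # O_EXCL makes creation fail if the lock file already exists
        fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return self

    def __exit__(self, *exc):
        os.remove(self.lock_path)
        return False

path = os.path.join(tempfile.mkdtemp(), 'words.txt')
with SimpleLock(path):
    # a cooperating writer checks the lock first, and is refused
    try:
        with SimpleLock(path):
            cooperating_blocked = False
    except FileExistsError:
        cooperating_blocked = True
    # a writer that never checks the lock is not stopped by it
    with open(path, 'a') as f:
        f.write('written while locked\n')

with open(path) as f:
    content = f.read()
```

After this runs, cooperating_blocked is True, yet the file still contains the line written by the code that skipped the lock.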

The remove_line() function is a bit more complicated. Because we're removing a line by its content, rather than a section at a known position in the file, we need to iterate over the file to find each instance of the line to remove. The easiest way to do this while writing the changes back to the file is to use a temporary file to hold the lines being kept, then copy that file back into the original file using shutil.copyfileobj() and truncate the original to drop any leftover trailing content.
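
The filter, copy back, and truncate steps can be tried in isolation with the sketch below, which leaves out the lock so that it needs only the standard library. The name remove_line_unlocked and the sample file contents are made up for this illustration. Note that because lines are compared after strip(), a line with surrounding whitespace is removed as well.

```python
import os
import shutil
import tempfile

def remove_line_unlocked(fname, line):
    """Same filter, copy back, truncate steps as remove_line(), minus the lock."""
    with tempfile.TemporaryFile(mode='w+') as tmp, open(fname, 'r+') as fp:
        # keep every line that does not match the given line
        for l in fp:
            if l.strip() != line:
                tmp.write(l)
        # rewind both files so the copy starts at the beginning of each
        fp.seek(0)
        tmp.seek(0)
        shutil.copyfileobj(tmp, fp)
        # the kept content is shorter than the original, so without
        # truncate() the tail of the old content would remain in the file
        fp.truncate()

path = os.path.join(tempfile.mkdtemp(), 'words.txt')
with open(path, 'w') as f:
    f.write('foo\nbar\n  foo  \nbaz\n')

remove_line_unlocked(path, 'foo')

with open(path) as f:
    remaining = f.read()
print(repr(remaining))  # 'bar\nbaz\n'
```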

Note

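For remove_line() to work, it must open the file for both reading and writing. The portable way to ask for this is the mode string 'r+'. A nonstandard mode string such as 'rw+' is handled differently across platforms: it was reported to fail on Mac OS X while working on Linux, and Python 3 rejects it on every platform.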

These functions are best suited for a wordlist corpus, or some other corpus type with presumably unique lines, that may be edited by multiple people at about the same time, such as through a web interface. Using these functions with a more document-oriented corpus such as brown, treebank, or conll2000, is probably a bad idea.
