Corpus readers and views are all read-only, but there will be times when you want to add to or edit the corpus files. However, modifying a corpus file while other processes are using it, such as through a corpus reader, can lead to dangerous undefined behavior. This is where file locking comes in handy.
You must install the lockfile library using sudo easy_install lockfile or sudo pip install lockfile. This library provides cross-platform file locking, and so will work on Windows, Unix/Linux, Mac OS X, and more. You can find detailed documentation on lockfile at http://packages.python.org/lockfile/.
Here are two file editing functions: append_line() and remove_line(). Both try to acquire an exclusive lock on the file before updating it. An exclusive lock means that these functions will wait until no other process holds a lock on the file. Once the lock is acquired, any other process that tries to acquire it will have to wait until the lock is released. This way, modifying the file will be safe and not cause any undefined behavior in other processes. These functions can be found in corpus.py, as follows:
import lockfile, tempfile, shutil

def append_line(fname, line):
    # the lock is held until the with block exits
    with lockfile.FileLock(fname):
        fp = open(fname, 'a+')
        fp.write(line)
        fp.write('\n')
        fp.close()

def remove_line(fname, line):
    with lockfile.FileLock(fname):
        # text-mode temporary file, so lines can be written as read
        tmp = tempfile.TemporaryFile('w+')
        # open for reading and writing without truncating
        fp = open(fname, 'r+')
        # write all lines from orig file, except if matches given line
        for l in fp:
            if l.strip() != line:
                tmp.write(l)
        # reset file pointers so entire files are copied
        fp.seek(0)
        tmp.seek(0)
        # copy tmp into fp, then truncate to remove trailing line(s)
        shutil.copyfileobj(tmp, fp)
        fp.truncate()
        fp.close()
        tmp.close()
The lock acquiring and releasing happens transparently when you do with lockfile.FileLock(fname).
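To make "transparently" concrete, here is a minimal sketch of how a with block can acquire a lock on entry and release it on exit, even when the body raises an exception. SimpleFileLock is a hypothetical stand-in written with the standard library only; its '.lock' naming and polling interval are assumptions for illustration, and it does not claim to match how lockfile implements its locks.

```python
import os
import tempfile
import time


class SimpleFileLock:
    """Hypothetical stand-in for a lock object; not lockfile's real code."""

    def __init__(self, path, poll=0.05):
        self.lock_path = path + '.lock'  # assumed naming convention
        self.poll = poll

    def __enter__(self):
        # O_CREAT | O_EXCL fails if the lock file already exists,
        # so only one holder can create it at a time.
        while True:
            try:
                fd = os.open(self.lock_path,
                             os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return self
            except FileExistsError:
                time.sleep(self.poll)  # someone else holds the lock; wait

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and when the block raises.
        os.remove(self.lock_path)
        return False  # do not swallow exceptions


def demo():
    path = os.path.join(tempfile.mkdtemp(), 'test.txt')
    lock = SimpleFileLock(path)
    with lock:
        held_inside = os.path.exists(lock.lock_path)
    released_after = not os.path.exists(lock.lock_path)
    try:
        with lock:
            raise RuntimeError('simulated failure while holding the lock')
    except RuntimeError:
        pass
    released_after_error = not os.path.exists(lock.lock_path)
    return held_inside, released_after, released_after_error
```

Calling demo() reports whether the lock was held inside the block, released afterwards, and released again after a simulated error. The caller never writes explicit acquire or release code, which is the convenience the with statement provides.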
You can use these functions as follows:
>>> from corpus import append_line, remove_line
>>> append_line('test.txt', 'foo')
>>> remove_line('test.txt', 'foo')
In append_line(), a lock is acquired, the file is opened in append mode, the text is written along with an end-of-line character, and then the file is closed, releasing the lock.
A lock acquired by lockfile only protects the file from other processes that also use lockfile. In other words, just because your Python process has a lock with lockfile doesn't mean a non-Python process can't modify the file. For this reason, it's best to only use lockfile with files that will not be edited by any non-Python processes, or by Python processes that do not use lockfile.
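The limitation described above can be illustrated with a small sketch that uses only the standard library. It assumes a cooperative scheme in which a separate lock file (here named by appending '.lock', a hypothetical convention) marks the data file as busy; this is an illustration of cooperative locking in general, not a description of lockfile's internals.

```python
import os
import tempfile


def demo_advisory_lock():
    workdir = tempfile.mkdtemp()
    fname = os.path.join(workdir, 'wordlist.txt')
    lock_path = fname + '.lock'  # hypothetical lock-file naming

    # A cooperating writer takes the lock by creating the lock file.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)

    # A second cooperating writer checks the same lock file and is refused.
    try:
        os.close(os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
        second_cooperating_writer_blocked = False
    except FileExistsError:
        second_cooperating_writer_blocked = True

    # A writer that ignores the convention opens the data file directly,
    # and nothing stops it.
    with open(fname, 'a') as fp:
        fp.write('written without the lock\n')
    with open(fname) as fp:
        unlocked_write_landed = fp.read() == 'written without the lock\n'

    os.remove(lock_path)
    return second_cooperating_writer_blocked, unlocked_write_landed
```

The writer that follows the convention is blocked, while the writer that opens the file directly succeeds. A lock of this kind is an agreement between programs, not something the operating system enforces on the data file.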
The remove_line() function is a bit more complicated. Because we're removing a line, and not a specific section of the file, we need to iterate over the file to find each instance of the line to remove. The easiest way to do this while writing the changes back to the file is to use a temporary file to hold the changes, then copy that file back into the original file using shutil.copyfileobj(). Because the rewritten content is shorter than the original, the file is then truncated to discard the leftover trailing data.
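The truncate() call at the end of remove_line() is easy to overlook, so here is a small sketch of what happens without it. The overwrite() and demo_truncate() helpers are hypothetical names for this illustration; they open a file in 'r+' mode, write shorter content over the start of the file, and optionally truncate.

```python
import os
import tempfile


def overwrite(path, new_text, truncate):
    # Open for reading and writing without clearing the file first.
    with open(path, 'r+', newline='') as fp:
        fp.write(new_text)   # overwrites from offset 0
        if truncate:
            fp.truncate()    # cut the file at the current position
    with open(path, newline='') as fp:
        return fp.read()


def demo_truncate():
    path = os.path.join(tempfile.mkdtemp(), 'words.txt')
    results = []
    for truncate in (False, True):
        # Start from the same three-line file each time.
        with open(path, 'w', newline='') as fp:
            fp.write('a\nfoo\nb\n')
        results.append(overwrite(path, 'a\nb\n', truncate))
    return results
```

Without truncation, the tail of the old content survives past the new data, so demo_truncate() returns 'a\nb\no\nb\n' for the first case and the intended 'a\nb\n' for the second.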
These functions are best suited for a wordlist corpus, or some other corpus type with presumably unique lines, that may be edited by multiple people at about the same time, such as through a web interface. Using these functions with a more document-oriented corpus such as brown, treebank, or conll2000 is probably a bad idea.