In this section, we create a file indexing script that can generate an XML document representing your entire filesystem or a specific portion of it. Indexing files with XML is a powerful way to keep track of information, or perform bulk operations on groups of particular files on a disk. You can create an XML-generating indexing routine easily in Python. The index.py program in Example 3-4 (which shows up a little later in the chapter) starts in any directory you specify and generates an element for each file or directory that exists beneath the starting point. Once we have the index of file information, we look at how to use SAX to search the information to filter the list of files for whatever criteria interests us at the time.
The main part of this routine works by just checking
each file in a starting directory, and then recursing into any
directories it finds beneath the starting directory. Recursion allows
it to index an entire filesystem if you choose. On Unix, the program
performs a lot of work, as it does content checking via a popen
call to the file
command for each file. (While this
could be made more efficient by calling find
less often and requiring it to operate
on more than one file at a time, that isn’t the topic of this book.)
One of the key methods of this class is indexDirectoryFiles
:
def indexDirectoryFiles(self, dir): """Index a directory structure and creates an XML output file.""" # prepare output XML file self.__fd = open(self.outputFile, "w") self.__fd.write('<?xml version="1.0" encoding="' + XML_ENC + '"?> ') self.__fd.write("<IndexedFiles> ")# do actual indexing
self.__indexDir(dir)
# close out XML file self.__fd.write("</IndexedFiles> ") self.__fd.close()
An XML file is created with the name given in outputFile
and an XML declaration and root
element are added. The indexDirectoryFiles
method calls its
internal _ _indexDir
method—this is
the real worker method. It is a recursive method that descends the
file hierarchy, indexing files along the way.
def __indexDir(self, dir): """Recursive function to do the actual indexing.""" # Create an indexFile for each regular file, # and call the function again for each directory files = listdir(dir) for file in files: fullname = os.path.join(dir, file) st_values = stat(fullname) # check if its a directory if S_ISDIR(st_values[0]): print file # create directory element self.__fd.write("<Directory ") self.__fd.write(' name="' + escape(fullname) + '"> ') self.__indexDir(fullname) self.__fd.write("</Directory> ") else: # create regular file entry print dir + file lf = IndexFile(fullname, st_values) self.__fd.write(lf.getXML())
The actual work is just determining those files that are
directories and those that are regular files. XML is created
accordingly during this process, and written to the output file. When
all of the __indexDir
calls
eventually return, the XML file is closed.
Now the program is essentially finished. A helper function named
escape
is imported from the
xml.sax.saxutils
module to perform
entity substitution against some common characters within XML
character data to ensure they do not appear to be markup in the
resulting XML.
The IndexFile
class is used
for an XML representation of file information. This information is
derived primarily from the os.stat
system call. The class copies
information from the stat
call
into its member variables in its __init
__ method, as shown here:
def __init__(self, filename, st_vals): """Extract properties from supplied stat object.""" self.filename = filename self.uid = st_vals[4] self.gid = st_vals[5] self.size = st_vals[6] self.accessed = ctime(st_vals[7]) self.modified = ctime(st_vals[8]) self.created = ctime(st_vals[9]) # try for filename extension self.extension = os.path.splitext(filename)[1]
In this method, important file information is extracted from
the tuple st_vals
. This contains
the filesystem information returned by the stat
call. The __init
__ method also tries for a filename
extension if possible by checking for the ".
" character. If you are running Unix, the
script tries to use the os.popen
function to call the file
command, which returns a human-readable description of the content
of both text and binary files. It can take much longer to generate,
but once in, the XML is valuable and does not need to be regenerated
every time we want it:
# check contents using file command on linux if os.name == "posix": # Open a process to check file contents fd = popen("file "" + filename + """) self.contents = fd.readline().rstrip( ) fd.close( ) else: # No content information self.contents = self.extension
If you’re not using Unix, the file
command is unavailable, and so the
contents information is given the file extension. For example, in a
Word file, the XML is <contents>.doc</contents>
. On
Unix, however, the call to popen
returns a file object. The output text of the command is read in
using the readline
method of the
file object. The results are then stripped and used as a description
of the files contents. The class features a single method, getXML,
which returns the file information
as a single XML element in string format:
def getXML(self): """Returns XML version of all data members.""" return ("<file name="" + escape(self.filename) + "">" + " <userID>" + str(self.uid) + "</userID>" + " <groupID>" + str(self.gid) + "</groupID>" + " <size>" + str(self.size) + "</size>" + " <lastAccessed>" + self.accessed + "</lastAccessed>" + " <lastModified>" + self.modified + "</lastModified>" + " <created>" + self.created + "</created>" + " <extension>" + self.extension + "</extension>" + " <contents>" + escape(self.contents) + "</contents>" + " </file>")
In the preceding code, the XML is thrown together as a series
of strings. Another way is to use a DOMImplementation
object to create
individual elements and insert them into the document’s structure
(illustrated in Chapter
10).
Both of these classes are used to develop a lengthy XML document representing files and metadata for any given section of your filesystem. The complete listing of index.py is shown in Example 3-4
#!/usr/bin/env python """ index.py usage: python index.py <starting-dir> <output-file> """ import os import sys from os import stat from os import listdir from os import popen from stat import S_ISDIR from time import ctime from xml.sax.saxutils import escape XML_ENC = "ISO-8859-1" """"""""""""""""""""""""""""""""""""""""""""""""""" Class: Index(startingDir, outputFile) """"""""""""""""""""""""""""""""""""""""""""""""""" class Index: """ This class indexes files and builds a resultant XML document. """ def __init__(self, startingDir, outputFile): """ init: sets output file """ self.outputFile = outputFile self.startingDir = startingDir def indexDirectoryFiles(self, dir): """Index a directory structure and creates an XML output file.""" # prepare output XML file self.__fd = open(self.outputFile, "w") self.__fd.write('<?xml version="1.0" encoding="' + XML_ENC + '"?> ') self.__fd.write("<IndexedFiles> ") # do actual indexing self.__indexDir(dir) # close out XML file self.__fd.write("</IndexedFiles> ") self.__fd.close( ) def __indexDir(self, dir): """Recursive function to do the actual indexing.""" # Create an indexFile for each regular file, # and call the function again for each directory files = listdir(dir) for file in files: fullname = os.path.join(dir, file) st_values = stat(fullname) # check if its a directory if S_ISDIR(st_values[0]): print file # create directory element self.__fd.write("<Directory ") self.__fd.write(' name="' + escape(fullname) + '"> ') self.__indexDir(fullname) self.__fd.write("</Directory> ") else: # create regular file entry print dir + file lf = IndexFile(fullname, st_values) self.__fd.write(lf.getXML( )) """"""""""""""""""""""""""""""""""""""""""""""""""" Class: IndexFile(filename, stat-tuple) """"""""""""""""""""""""""""""""""""""""""""""""""" class IndexFile: """ Simple file representation object with XML """ def __init__(self, filename, st_vals): """Extract properties from supplied stat object.""" self.filename = filename self.uid = st_vals[4] self.gid = st_vals[5] self.size = st_vals[6] self.accessed = ctime(st_vals[7]) self.modified = ctime(st_vals[8]) self.created = ctime(st_vals[9]) # try for filename extension self.extension = os.path.splitext(filename)[1] # check contents using file command on linux if os.name == "posix": # Open a process to check file # contents fd = popen("file "" + filename + """) self.contents = fd.readline().rstrip( ) fd.close( ) else: # No content information self.contents = self.extension def getXML(self): """Returns XML version of all data members.""" return ("<file name="" + escape(self.filename) + "">" + " <userID>" + str(self.uid) + "</userID>" + " <groupID>" + str(self.gid) + "</groupID>" + " <size>" + str(self.size) + "</size>" + " <lastAccessed>" + self.accessed + "</lastAccessed>" + " <lastModified>" + self.modified + "</lastModified>" + " <created>" + self.created + "</created>" + " <extension>" + self.extension + "</extension>" + " <contents>" + escape(self.contents) + "</contents>" + " </file>") """"""""""""""""""""""""""""""""""""""""""""""""""" Main """"""""""""""""""""""""""""""""""""""""""""""""""" if __name__ == "__main__": index = Index(sys.argv[1], sys.argv[2]) print "Starting Dir:", index.startingDir print "Output file:", index.outputFile index.indexDirectoryFiles(index.startingDir)
Running index.py from the command line requires supplying both a starting directory and an XML filename to use as output:
$> python index.py /usr/bin/ usrbin.xml
The script prints directory names similar to the find command, but after completion, the file usrbin.xml contains something similar to the following:
<?xml version="1.0" encoding=" ISO-8859-1"?> <IndexedFiles> <Directory name="/usr/bin/X11"> <file name="/usr/bin/X11/Magick-config"> <userID>0</userID> <groupID>0</groupID> <size>1786</size> <lastAccessed>Fri Jan 19 22:29:34 2001</lastAccessed> <lastModified>Mon Aug 30 20:49:06 1999</lastModified> <created>Mon Sep 11 17:22:01 2000</created> <extension>None</extension> <contents>/usr/bin/X11/Magick-config: Bourne shell script text</contents> </file><file name="/usr/bin/X11/animate"> <userID>0</userID> <groupID>0</groupID> <size>16720</size> <lastAccessed>Fri Jan 19 22:29:34 2001</lastAccessed> <lastModified>Mon Aug 30 20:49:09 1999</lastModified> <created>Mon Sep 11 17:22:01 2000</created> <extension>None</extension> <contents>/usr/bin/X11/animate: ELF 32-bit LSB executable, Intel 80386,version 1, dynamically linked (uses shared libs), stripped</contents> </file>
The XML file’s size depends on the particular directory it originated in. By default, the program follows symbolic links (on Unix, symbolic links allow one directory or filename to refer to another), introducing the possibility of forming infinite recursion, so beware! Indexing your home directory or indexing a directory of open source software that you’ve downloaded is probably the most effective thing to do in this case.
Now that your file data has been abstracted to XML, you can write a SAX event handler to search for items within the file list. SAX is a good choice here, because this document could easily be several megabytes in size, and interpreting it as it is being read is the least resource-intensive approach.
The saxfinder.py script takes a single argument (the search text) and parses the supplied XML file checking via its SAX handler interfaces, in order to see if any of the files are of interest to you.
The script expects to work on XML as created earlier with index.py.
If the contents
element of
your XML file contains the character data that you supplied on the
command line, the file is considered a match and the script prints a
message accordingly. If you are running Windows, your contents
tags only have the file extension,
so your searches are limited to file extensions, unless you alter the
code to watch something besides just the contents
element.
Use three methods of the SAX interface to implement your
metadata finder. First, startElement
is implemented to both capture
the name of the current file
element as well as mark when you’ve entered the character data portion
following a contents
tag:
def startElement(self, name, attrs): if name == "file": self.filename = attrs.get('name', "") elif name == "contents": self.getCont = 1
If you’re entering a content element, a flag (self.getCont
) is set so that the characters
method knows when to gobble up
character data and store it in another member variable:
def characters(self, ch): if self.getCont: self.contents += ch
When an endElement
event
rolls around, the script examines the contents that have been captured
(if any) to see if they match the original command-line parameter. If
so, the filename is printed; if not, SAX happily moves on to the next
file element within the XML document:
def endElement(self, name): if name == "contents": self.getCont = 0 if self.contents.find(self.contentType) > -1: print self.filename, "has", self.contentType, "content." self.contents = ""
In addition, the self.getCont
flag is disabled after leaving a contents element, to instruct the
characters
method not to capture
data.
SAX helps you here by allowing you to process an XML index file that represents an entire filesystem and easily takes up 20 megabytes on your disk. Parsing such a gigantic document with the DOM can be difficult and unbearably slow.
Example 3-5 shows the complete listing of saxfinder.py.
""" saxfinder.py - generates HTML from pyxml.xml """ import sys from xml.sax import make_parser from xml.sax import ContentHandler class FileHandler(ContentHandler): def __init__(self, contentType): self.getCont = 0 self.contents = "" self.filename = "" self.contentType = contentType def startElement(self, name, attrs): if name == "file": self.filename = attrs.get('name', "") elif name == "contents": self.getCont = 1 def characters(self, ch): if self.getCont: self.contents += ch def endElement(self, name): if name == "contents": self.getCont = 0 if self.contents.find(self.contentType) > -1: print self.filename, "has", self.contentType, "content." self.contents = "" # Main fh = FileHandler(sys.argv[1]) parser = make_parser( ) parser.setContentHandler(fh) parser.parse(sys.stdin)
You can run saxfinder.py from the command line on both Unix and Windows. You need to supply a search string as the first parameter, and be sure and redirect or pipe an XML document (created with index.py) into standard input:
$> python saxfinder.py "C program" < nard.xml
The result should be something like this:
/home/shm00/nard/xd/server.cpp has C program content. /home/shm00/nard/xd/shmoo.cpp has C program content. /home/shm00/nard/gl-misc/array.cpp has C program content. /home/shm00/nard/gl-misc/vertex.cpp has C program content. /home/shm00/nard/gl-misc/mecogl.cpp has C program content. /home/shm00/nard/gl-misc/drewgl/smugl.cpp has C program content. /home/shm00/nard/gl-misc/drewgl/pal.cpp has C program content. /home/shm00/nard/gl-misc/drewgl/pal.h has C program content. /home/shm00/nard/gl-misc/drewgl/gl.cpp has C program content.