This section covers techniques for working with the DOM and highlights some of the features that MSXML supports but PyXML does not. Beyond these convenience functions added by Microsoft, working with MSXML also means working with COM, so the examples here show how to handle the types MSXML returns, which may differ from standard Python lists and tuples.
The Microsoft DOM supports the same operations as the PyXML DOM, but there are differences in using them. First, MSXML is accessible only via COM, so your Python code needs to act as a COM client. Second, and related to the first, MSXML is not a native Python implementation and therefore doesn’t use Python types such as the lists and tuples you’d find in PyXML. This section shows you the basics of working with this foreign parser from within Python.
To illustrate some node and document manipulation, you need some source XML to manipulate. Structured data such as books.xml, shown in Example E-1, gives you something on which to try out your MSXML skills.
<book name="Python and XML">
  <section name="Appendix E" type="Appendix">
    <chapterTitle>Appendix E</chapterTitle>
    <bodytext>This appendix focuses on techniques for using... </bodytext>
  </section>
</book>
Using MSXML, it’s easy to take this document apart. But before you can work with MSXML, you have to import the correct library to access COM objects (win32com.client). Additionally, for the call to Dispatch, you need the ProgID of the Microsoft XML parser. If you’ve installed the latest Microsoft XML SDK, you have Version 3.0 of the MSXML parser. You may also have it if you’re running Visual Studio .NET or Internet Explorer 6. However, if you aren’t sure, you can download the XML SDK from Microsoft and install the newest version of the parser.
After importing the client package and calling Dispatch with the correct ProgID, use MSXML’s load method to actually load a document:
>>> import win32com.client
>>> msxml = win32com.client.Dispatch("MSXML2.DOMDocument.3.0")
>>> msxml.load("books.xml")
1
The returned 1 indicates success and is a true value in Python, which allows for the following syntax:
if msxml.load("books.xml"):
    pass  # success
else:
    pass  # failure
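The success check above can be folded into a small helper that raises an exception when loading fails. This is only a minimal sketch: the name load_or_raise is hypothetical, and it assumes the document object exposes MSXML's parseError object with its reason and line properties.

```python
def load_or_raise(dom, path):
    """Load path into dom, raising IOError with the parser's reason on failure.

    dom is assumed to be an MSXML DOMDocument, or any object that offers a
    compatible load() method and parseError property.
    """
    if dom.load(path):
        return dom
    err = dom.parseError
    reason = (err.reason or "").strip()
    raise IOError("could not load %s (line %s): %s" % (path, err.line, reason))

# Possible use on Windows with the win32com package installed:
#   import win32com.client
#   msxml = load_or_raise(
#       win32com.client.Dispatch("MSXML2.DOMDocument.3.0"), "books.xml")
```

Raising an exception keeps the failure from being silently ignored by the code that follows the load.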
Now that the msxml instance is ready to go, you can begin plucking out nodes and experimenting with them.
The MSXML objects will feel familiar to you if you’ve been working with the PyXML objects throughout this book. Retrieving a documentElement or getting a node’s nodeName works as you might expect:
>>> docelem = msxml.documentElement
>>> print docelem.nodeName
book
>>> print docelem.getAttribute("name")
Python and XML
MSXML throws in the occasional convenience, such as the text attribute of its Node class. This attribute returns all text content (or character data) beneath the current node:
>>> print docelem.text
Appendix E This appendix focuses on techniques for using...
This can come in handy when working with text-heavy documents.
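To make concrete what the text attribute gathers, here is a rough equivalent written against xml.dom.minidom from the Python standard library, which has no such attribute. It is a sketch for comparison only: collect_text is a hypothetical helper, the document is inlined as a string so the example is self-contained, and MSXML's own whitespace handling may differ from this plain concatenation.

```python
from xml.dom import minidom

# The sample document, inlined without the whitespace between elements.
BOOK_XML = (
    '<book name="Python and XML">'
    '<section name="Appendix E" type="Appendix">'
    '<chapterTitle>Appendix E</chapterTitle>'
    '<bodytext>This appendix focuses on techniques for using...</bodytext>'
    '</section>'
    '</book>'
)

def collect_text(node):
    """Return the character data of node and all of its descendants."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    # Any other node type contributes only what its children contain.
    return "".join([collect_text(child) for child in node.childNodes])

doc = minidom.parseString(BOOK_XML)
text = collect_text(doc.documentElement)
print(text)
```

The recursion is what MSXML saves you from writing: one attribute lookup replaces a walk over every descendant node.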
Related to the text attribute is the xml attribute. The xml attribute returns a string of XML representing the current node and its children:
>>> print docelem.xml
<book name="Python and XML">
  <section name="Appendix E" type="Appendix">
    <chapterTitle>Appendix E</chapterTitle>
    <bodytext>This appendix focuses on techniques for using... </bodytext>
  </section>
</book>
This is a definite shortcut (for your typing at least) over using the PrettyPrint function in the PyXML DOM extensions package. Of course, just as in PyXML, some MSXML methods return collections of nodes rather than single nodes. In these cases, use the MSXML NodeList interface to deal with the collections.
MSXML3.0 has great support for node lists and provides a NodeList object for manipulating them. This is slightly different from the native and robust list type provided by Python and PyXML. The NodeList object has a built-in iterator that you can take advantage of by calling the nextNode method; note that this is different from the concept of iterators as implemented in Python 2.2 and newer versions.
node = NodeList.nextNode( )
while node:
    # do something here...
    node = NodeList.nextNode( )
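If you prefer Python's own iteration style, the nextNode pattern above can be wrapped in a generator. The following is a sketch under stated assumptions: iter_nodes and ListBackedNodeList are hypothetical names, the stand-in class exists only so the example runs without COM, and it assumes that nextNode yields None once the list is exhausted.

```python
def iter_nodes(node_list):
    """Yield each node from an object that offers an MSXML-style nextNode()."""
    node = node_list.nextNode()
    while node is not None:
        yield node
        node = node_list.nextNode()


class ListBackedNodeList:
    """Stand-in for an MSXML NodeList, backed by a plain Python list."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def nextNode(self):
        # Return None once every item has been handed out.
        if self._pos >= len(self._items):
            return None
        item = self._items[self._pos]
        self._pos += 1
        return item


names = [n for n in iter_nodes(ListBackedNodeList(["Cal Ender", "Will Icare"]))]
print(names)
```

With a real NodeList in place of the stand-in, the same generator would let you write an ordinary for loop over the nodes.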
The while loop continues until the nextNode method fails to return a node. Example E-2, people.xml, shows some sample XML describing workers and their job titles.
<employees>
  <person title="Project Manager">Cal Ender</person>
  <person title="Development Lead">A. Buddy Codit</person>
  <person title="Customer Service Rep">Will Icare</person>
  <person title="Documentation Writer">E. Manual</person>
  <person title="Catering Specialist">Willy Eadit</person>
</employees>
In a structure such as this, a NodeList can be a convenient way to process all nodes of a certain type. A NodeList can be returned by a call to getElementsByTagName, or by passing a string expression to the selectNodes method of MSXML3.0 (the related selectSingleNode method takes the same kind of expression but returns only the first matching node). Example E-3 shows the NodeList in use in nodelists.py:
""" nodelists.py - using the NodeList object from MSXML3.0 """ import win32com.client # source XML strSourceDoc = "people.xml" # instantiate parser objXML = win32com.client.Dispatch("MSXML2.DOMDocument.3.0") # check for successful loading if (not objXML.load(strSourceDoc)): print "Error loading", strSourceDoc # grab all person elements peopleNodes = objXML.getElementsByTagName("person") # begin iteration of NodeList with nextNode( ) node = peopleNodes.nextNode( ) while node: # print value of text descendants print "Name: ", node.text, # print value of title attribute print " Position: ", node.getAttribute("title") # continue iteration node = peopleNodes.nextNode( )
When you run nodelists.py from the command prompt, you’ll get a textual version of its contents:
C:\appD>c:\python21\python nodelists.py
Name: Cal Ender Position: Project Manager
Name: A. Buddy Codit Position: Development Lead
Name: Will Icare Position: Customer Service Rep
Name: E. Manual Position: Documentation Writer
Name: Willy Eadit Position: Catering Specialist
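If you would rather keep working with native Python types, the contents of a node list can be copied into an ordinary list of tuples. The sketch below relies only on the standard DOM length attribute and item method, and is demonstrated with xml.dom.minidom so that it runs without COM. The name people_pairs is hypothetical, and the assumption is that an MSXML NodeList reached through win32com offers the same two members.

```python
from xml.dom import minidom

# The people.xml sample, inlined so the example is self-contained.
PEOPLE_XML = (
    '<employees>'
    '<person title="Project Manager">Cal Ender</person>'
    '<person title="Development Lead">A. Buddy Codit</person>'
    '<person title="Customer Service Rep">Will Icare</person>'
    '<person title="Documentation Writer">E. Manual</person>'
    '<person title="Catering Specialist">Willy Eadit</person>'
    '</employees>'
)

def people_pairs(node_list):
    """Copy a DOM node list of person elements into (name, title) tuples."""
    pairs = []
    for i in range(node_list.length):
        node = node_list.item(i)
        # Each person element holds a single text node with the name.
        pairs.append((node.firstChild.data, node.getAttribute("title")))
    return pairs

doc = minidom.parseString(PEOPLE_XML)
pairs = people_pairs(doc.getElementsByTagName("person"))
for name, title in pairs:
    print("%s: %s" % (name, title))
```

Once the data is in a plain list, sorting, slicing, and the other list operations are available without further calls into the parser.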