Credit: Aaron Watters, Software Consultant
There are three kinds of people in this world: those who can count and those who can’t.
However, there are only two kinds of computer programs: toy programs and programs that interact with some kind of persistent database. That is to say, most real computer programs must retrieve stored information and record information for future use. These days, this description applies to almost every computer game, which can typically save and restore the state of the game at any time. So when I refer to toy programs, I mean programs written as exercises, or for the fun of programming. Nearly all real programs (such as programs that people get paid to write) have some persistent database storage/retrieval component.
When I was a Fortran programmer in the 1980s, I noticed that although almost every program had to retrieve and store information, they almost always did it using home-grown methods. Furthermore, since the storage and retrieval parts of the program were the least interesting components from the programmer’s point of view, these parts of the program were frequently implemented very sloppily and were hideous sources of intractable bugs. This repeated observation convinced me that the study and implementation of database systems sat at the core of programming pragmatics, and that the state of the art as I saw it then required much improvement.
Later, in graduate school, I was delighted to find an impressive and sophisticated body of work relating to the implementation of database systems. The literature of database systems covered issues of concurrency, fault tolerance, distribution, query optimization, database design, and transaction semantics, among others. In typical academic fashion, many of the concepts had been elaborated to the point of absurdity (such as the silly notion of conditional multivalued dependencies), but much of the work was directly related to the practical implementation of reliable and efficient storage and retrieval systems. The starting point for much of this work was E.F. Codd’s seminal paper, “A Relational Model of Data for Large Shared Data Banks.”[1]
Among my fellow graduate students, and even among most of the faculty, the same body of knowledge was either disregarded or regarded with some scorn. Everyone recognized that knowledge of conventional relational technology could be lucrative, but they generally considered such knowledge to be on the same level as knowing how to write (or more importantly, maintain) COBOL programs. This situation was not helped by the fact that the emerging database interface standard, SQL (which is now very well established), looked like an extension of COBOL and bore little obvious relationship to any modern programming language.
More than a decade later, there is little indication that anything will soon overtake SQL-based relational technology for the majority of data-based applications. In fact, relational-database technology seems more pervasive than ever. The largest software vendors—IBM, Microsoft, and Oracle—all provide various relational-database implementations as crucial components of their core offerings. Other large software firms, such as SAP and PeopleSoft, essentially provide layers of software built on top of a relational-database core.
Generally, relational databases have been augmented rather than replaced. Enterprise software-engineering dogma frequently espouses three-tier systems, in which the bottom tier is a carefully designed relational database, the middle tier defines a view of the database as business objects, and the top tier consists of applications or transactions that manipulate the business objects, with effects that ultimately translate to changes in the underlying relational tables.
Microsoft’s Open Database Connectivity (ODBC) standard provides a common programming API for SQL-based relational databases that permits programs to interact with many different database engines with few or no changes. For example, a Python program could be first implemented using Microsoft Jet[2] as a backend database for testing and debugging purposes. Once the program is stable, it can be put into production use, remotely accessing, say, a backend DB2 database on an IBM mainframe residing on another continent, by changing (at most) one line of code.
Relational databases are not appropriate for all applications. In particular, a computer game or engineering design tool that must save and restore sessions should probably use a more direct method of persisting the logical objects of the program than the flat tabular representation encouraged in relational-database design. However, even in domains such as engineering or scientific information, a hybrid approach that uses some relational methods is often advisable. For example, I have seen a complex relational-database schema for archiving genetic-sequencing information—in which the sequences show up as binary large objects (BLOBs)—but a tremendous amount of important ancillary information can fit nicely into relational tables. But, as the reader has probably surmised by now, I fear I speak as a relational zealot.
Within the Python world there are many ways of providing persistence and database functionality. My personal favorite is Gadfly, http://gadfly.sourceforge.net/, a simple and minimal SQL implementation that works primarily with in-memory databases. It is my favorite for no other reason than because it is mine, and its biggest advantage is that, if it becomes unworkable for you, it is easy to switch over to another, industrial-strength SQL engine. Many Gadfly users have started an application with Gadfly (because it was easy to use) and switched later (because they needed more).
However, many people may prefer to start by using other SQL implementations such as MySQL, Microsoft Access, Oracle, Sybase, Microsoft SQL Server, SQLite, or others that provide the advantages of an ODBC interface (which Gadfly does not do).
Python provides a standard interface for accessing relational databases: the Python DB Application Programming Interface (Py-DBAPI), originally designed by Greg Stein. Each underlying database API requires a wrapper implementation of the Py-DBAPI, and implementations are available for just about all underlying database interfaces, notably Oracle and ODBC.
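To give the flavor of this uniformity, here is a minimal sketch of the DB API pattern that later recipes in this chapter flesh out. The module name somedbmodule and its connect parameters are placeholders of mine, since each DB API-compliant module documents its own:

import somedbmodule                 # hypothetical DB API-compliant module
conn = somedbmodule.connect(dsn='test')   # connect parameters vary by module
cursor = conn.cursor()
cursor.execute("SELECT name, qty FROM inventory")   # example table/columns
for row in cursor.fetchall():
    print row                       # each row is a tuple
conn.close()

Swapping database engines then amounts, ideally, to changing just the import and the connect call.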
When the relational approach is overkill, Python provides built-in facilities for storing and retrieving data. At the most basic level, the programmer can manipulate files directly, as covered in Chapter 2. A step up from files, the marshal module allows programs to serialize data structures constructed from simple Python types (not including, e.g., classes or class instances). marshal has the advantage of being able to retrieve large data structures with blinding speed. The pickle and cPickle modules allow general storage of objects, including classes, class instances, and circular structures. cPickle is so named because it is implemented in C and is consequently quite fast, but it remains slower than marshal. For access to structured data in a somewhat human-readable form, it is also worth considering storing and retrieving data in XML format (taking advantage of Python’s several XML parsing and generation modules), covered in Chapter 12—but this option works best for write-once, read-many applications. Serialized data or XML representations may be stored in SQL databases to create a hybrid approach as well.
While marshal and pickle provide basic serialization and deserialization of structures, the application programmer will frequently desire more functionality, such as transaction support and concurrency control. When the relational model doesn’t fit the application, a direct object database implementation such as the Z-Object Database (ZODB) might be appropriate—see http://zope.org/Products/ZODB3.2.
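For the curious, here is a minimal sketch of what storing objects in ZODB looks like; I’m assuming the ZODB 3 API of this era (FileStorage, and the get_transaction built-in that ZODB installs), so check your installed release’s documentation before relying on the details:

from ZODB import FileStorage, DB    # ZODB is a third-party package

storage = FileStorage.FileStorage('mydata.fs')
db = DB(storage)
connection = db.open()
root = connection.root()            # a persistent, dictionary-like object
root['scores'] = {'alice': 10, 'bob': 20}
get_transaction().commit()          # ZODB 3.2 idiom; newer releases use
                                    # the separate 'transaction' module
connection.close()
db.close()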
I must conclude with a plea to those who are dismissive of relational-database technology. Remember that it is successful for good reasons, and it might be worth considering. To paraphrase Churchill:
text = """ Indeed, it has been said that democracy is the worst form of government, except for all those others that have been tried from time to time. """ import string for a, b in [("democracy", "SQL"), ("government", "database")]: text = string.replace(text, a, b) print text
Credit: Luther Blissett
You want to serialize and reconstruct a Python data structure whose items are fundamental Python objects (e.g., lists, tuples, numbers, and strings, but no classes, instances, etc.) as fast as possible.
If you know that your data is composed entirely of fundamental Python objects (and you only need to support one version of Python, though possibly on several different platforms), the lowest-level, fastest approach to serializing your data (i.e., turning it into a string of bytes, and later reconstructing it from such a string) is via the marshal module. Suppose that data has only elementary Python data types as items, for example:
data = {12:'twelve', 'feep':list('ciao'), 1.23:4+5j, (1,2,3):u'wer'}
You can serialize data to a bytestring at top speed as follows:
import marshal
bytes = marshal.dumps(data)
You can now sling bytes around as you wish (e.g., send it across a network, put it as a BLOB in a database, etc.), as long as you keep its arbitrary binary bytes intact. Then you can reconstruct the data structure from the bytestring at any time:
redata = marshal.loads(bytes)
When you specifically want to write the data to a disk file (as long as the latter is open for binary—not the default text mode—input/output), you can also use the dump function of the marshal module, which lets you dump several data structures to the same file one after the other:
ouf = open('datafile.dat', 'wb')
marshal.dump(data, ouf)
marshal.dump('some string', ouf)
marshal.dump(range(19), ouf)
ouf.close()
You can later recover from datafile.dat the same data structures you dumped into it, in the same sequence:
inf = open('datafile.dat', 'rb')
a = marshal.load(inf)
b = marshal.load(inf)
c = marshal.load(inf)
inf.close()
Python offers several ways to serialize data (meaning to turn the data into a string of bytes that you can save on disk, put in a database, send across the network, etc.) and corresponding ways to reconstruct the data from such serialized forms. The lowest-level approach is to use the marshal module, which Python uses to write its bytecode files. marshal supports only elementary data types (e.g., dictionaries, lists, tuples, numbers, and strings) and combinations thereof. marshal does not guarantee compatibility from one Python release to another, so data serialized with marshal may not be readable if you upgrade your Python release. However, marshal does guarantee independence from a specific machine’s architecture, so it is guaranteed to work if you’re sending serialized data between different machines, as long as they are all running the same version of Python—similar to how you can share compiled Python bytecode files in such a distributed setting.
marshal’s dumps function accepts any suitable Python data structure and returns a bytestring representing it. You can pass that bytestring to the loads function, which will return another Python data structure that compares equal (==) to the one you originally dumped. In particular, the order of keys in dictionaries is arbitrary in both the original and reconstructed data structures, but order in any kind of sequence is meaningful and is thus preserved. In between the dumps and loads calls, you can subject the bytestring to any procedure you wish, such as sending it over the network, storing it into a database and retrieving it, or encrypting and decrypting it. As long as the string’s binary structure is correctly restored, loads will work fine on it (as stated previously, this is guaranteed only if you use loads under the same Python release with which you originally executed dumps).
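Since that guarantee is easy to violate by accident (e.g., by reading a file written under a different Python release), it can be worth guarding the load; here is a small sketch of mine, hedging on the fact that marshal raises ValueError, EOFError, or TypeError on bytes it cannot understand:

def careful_loads(bytes):
    import marshal
    try:
        return marshal.loads(bytes)
    except (ValueError, EOFError, TypeError):
        raise ValueError, "data not readable by this Python release"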
When you specifically need to save the data to a file, you can also use marshal’s dump function, which takes two arguments: the data structure you’re dumping and the open file object. Note that the file must be opened for binary I/O (not the default, which is text I/O) and can’t be a file-like object, as marshal is quite picky about it being a true file. The advantage of dump is that you can perform several calls to dump with various data structures and the same open file object: each data structure is then dumped together with information about how long the dumped bytestring is. As a consequence, when you later open the file for binary reading and then call marshal.load, passing the file as the argument, you can reload each previously dumped data structure sequentially, one after the other, at each call to load. The return value of load, like that of loads, is a new data structure that compares equal to the one you originally dumped. (Again, dump and load work within one Python release—no guarantee across releases.)
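If you don’t know in advance how many structures a file holds, you can keep calling load until the file is exhausted. Here is a small generator sketch (the name load_all is my own, not part of marshal), relying on the EOFError that load raises at end of file:

import marshal

def load_all(filename):
    # yield each object previously dumped into filename, in order
    inf = open(filename, 'rb')
    while True:
        try:
            yield marshal.load(inf)
        except EOFError:
            break
    inf.close()

# e.g., list(load_all('datafile.dat')) recovers all dumped structures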
Those accustomed to other languages and libraries offering “serialization” facilities may be wondering if marshal imposes substantial practical limits on the size of objects you can serialize or deserialize. Answer: Nope. Your machine’s memory might, but as long as everything fits comfortably in memory, marshal imposes practically no further limit.
Recipe 7.2 for cPickle, the big brother of marshal; documentation on the marshal standard library module in the Library Reference and in Python in a Nutshell.
Credit: Luther Blissett
You want to serialize and reconstruct, at a reasonable speed, a Python data structure, which may include both fundamental Python objects and classes and instances.
If you don’t want to assume that your data is composed only of fundamental Python objects, or you need portability across versions of Python, or you need to transmit the serialized form as text, the best way of serializing your data is with the cPickle module. (The pickle module is a pure-Python equivalent and totally interchangeable, but it’s slower and not worth using except if you’re missing cPickle.)
For example, say you have:
data = {12:'twelve', 'feep':list('ciao'), 1.23:4+5j, (1,2,3):u'wer'}
You can serialize data to a text string:
import cPickle
text = cPickle.dumps(data)
or to a binary string, a choice that is faster and takes up less space:
bytes = cPickle.dumps(data, 2)
You can now sling text or bytes around as you wish (e.g., send across a network, include as a BLOB in a database—see Recipe 7.10, Recipe 7.11, and Recipe 7.12) as long as you keep text or bytes intact. In the case of bytes, it means keeping the arbitrary binary bytes intact. In the case of text, it means keeping its textual structure intact, including newline characters. Then you can reconstruct the data at any time, regardless of machine architecture or Python release:
redata1 = cPickle.loads(text)
redata2 = cPickle.loads(bytes)
Either call reconstructs a data structure that compares equal to data. In particular, the order of keys in dictionaries is arbitrary in both the original and reconstructed data structures, but order in any kind of sequence is meaningful, and thus it is preserved. You don’t need to tell cPickle.loads whether the original dumps used text mode (the default, also readable by some very old versions of Python) or binary (faster and more compact)—loads figures it out by examining its argument’s contents.
When you specifically want to write the data to a file, you can also use the dump function of the cPickle module, which lets you dump several data structures to the same file one after the other:
ouf = open('datafile.txt', 'w')
cPickle.dump(data, ouf)
cPickle.dump('some string', ouf)
cPickle.dump(range(19), ouf)
ouf.close()
Once you have done this, you can recover from datafile.txt the same data structures you dumped into it, one after the other, in the same order:
inf = open('datafile.txt')
a = cPickle.load(inf)
b = cPickle.load(inf)
c = cPickle.load(inf)
inf.close()
You can also pass cPickle.dump a third argument with a value of 2 to tell cPickle.dump to serialize the data in binary form (faster and more compact), but the data file must then be opened for binary I/O, not in the default text mode, both when you originally dump to the file and when you later load from the file.
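For example, a binary-mode variant of the previous snippets might look like this (datafile.bin is just a name of my choosing; note the 'wb' and 'rb' modes):

ouf = open('datafile.bin', 'wb')
cPickle.dump(data, ouf, 2)
cPickle.dump('some string', ouf, 2)
cPickle.dump(range(19), ouf, 2)
ouf.close()

inf = open('datafile.bin', 'rb')
a = cPickle.load(inf)
b = cPickle.load(inf)
c = cPickle.load(inf)
inf.close()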
Python offers several ways to serialize data (i.e., make the data into a string of bytes that you can save on disk, save in a database, send across the network, etc.) and corresponding ways to reconstruct the data from such serialized forms. Typically, the best approach is to use the cPickle module. A pure-Python equivalent, called pickle (the cPickle module is coded in C as a Python extension), is substantially slower, and the only reason to use it is if you don’t have cPickle (e.g., with a Python port onto a mobile phone with tiny storage space, where you saved every byte you possibly could by installing only an indispensable subset of Python’s large standard library). However, in cases where you do need to use pickle, rest assured that it is completely interchangeable with cPickle: you can pickle with either module and unpickle with the other one, without any problems whatsoever.
cPickle supports most elementary data types (e.g., dictionaries, lists, tuples, numbers, strings) and combinations thereof, as well as classes and instances. Pickling classes and instances saves only the data involved, not the code. (Code objects are not even among the types that cPickle knows how to serialize, basically because there would be no way to guarantee their portability across disparate versions of Python. See Recipe 7.6 for a way to serialize code objects, as long as you don’t need the cross-version guarantee.) See Recipe 7.4 for more about pickling classes and instances.
cPickle guarantees compatibility from one Python release to another, as well as independence from a specific machine’s architecture. Data serialized with cPickle will still be readable if you upgrade your Python release, and pickling is also guaranteed to work if you’re sending serialized data between different machines.
The dumps function of cPickle accepts any Python data structure and returns a text string representing it. If you call dumps with a second argument of 2, dumps returns an arbitrary bytestring instead: the operation is faster, and the resulting string takes up less space. You can pass either the text or the bytestring to the loads function, which will return another Python data structure that compares equal (==) to the one you originally dumped. In between the dumps and loads calls, you can subject the text or bytestring to any procedure you wish, such as sending it over the network, storing it in a database and retrieving it, or encrypting and decrypting it. As long as the string’s textual or binary structure is correctly restored, loads will work fine on it (even across platforms and in future releases). If you need to produce data readable by old (pre-2.3) versions of Python, consider using 1 as the second argument: operation will be slower, and the resulting strings will not be as compact as those obtained by using 2, but the strings can be unpickled by old Python versions as well as current and future ones.
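If you want concrete numbers for your own data, the trade-off is easy to measure; a quick sketch (exact sizes vary across releases and data, so treat the output as indicative only):

import cPickle
data = {12: 'twelve', 'feep': list('ciao'), 1.23: 4+5j, (1, 2, 3): u'wer'}
for proto in 0, 1, 2:
    print 'protocol %d: %d bytes' % (proto, len(cPickle.dumps(data, proto)))
# sizes generally shrink as the protocol number grows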
When you specifically need to save the data into a file, you can also use cPickle’s dump function, which takes two arguments: the data structure you’re dumping and the open file or file-like object. If the file is opened for binary I/O, rather than the default (text I/O), then by giving dump a third argument of 2, you can ask for binary format, which is faster and takes up less space (again, you can also use 1 in this position to get a binary format that’s neither as compact nor as fast, but is understood by old, pre-2.3 Python versions too). The advantage of dump over dumps is that, with dump, you can perform several calls, one after the other, with various data structures and the same open file object. Each data structure is then dumped with information about how long the dumped string is. Consequently, when you later open the file for reading (binary reading, if you asked for binary format) and then repeatedly call cPickle.load, passing the file as the argument, each data structure previously dumped is reloaded sequentially, one after the other. The return value of load, like that of loads, is a new data structure that compares equal to the one you originally dumped.
Those accustomed to other languages and libraries offering “serialization” facilities may be wondering whether pickle imposes substantial practical limits on the size of objects you can serialize or deserialize. Answer: Nope. Your machine’s memory might, but as long as everything fits comfortably in memory, pickle imposes practically no further limit.
Recipe 7.1 and Recipe 7.4; documentation for the standard library module cPickle in the Library Reference and Python in a Nutshell.
Credit: Bill McNeill, Andrew Dalke
Standard library modules cPickle and gzip offer the needed functionality; you just need to glue them together appropriately:
import cPickle, gzip

def save(filename, *objects):
    ''' save objects into a compressed diskfile '''
    fil = gzip.open(filename, 'wb')
    for obj in objects:
        cPickle.dump(obj, fil, 2)
    fil.close()

def load(filename):
    ''' reload objects from a compressed diskfile '''
    fil = gzip.open(filename, 'rb')
    while True:
        try:
            yield cPickle.load(fil)
        except EOFError:
            break
    fil.close()
Persistence and compression, as a general rule, go well together. cPickle protocol 2 saves Python objects quite compactly, but the resulting files can still compress quite well. For example, on my Linux box, open('/usr/share/dict/words').readlines() produces a list of over 45,000 strings. Pickling that list with the default protocol 0 makes a disk file of 972 KB, while protocol 2 takes only 716 KB. However, using both gzip and protocol 2, as shown in this recipe, requires only 268 KB, saving a significant amount of space. As it happens, protocol 0 produces a more compressible file in this case, so that using gzip and protocol 0 would save even more space, taking only 252 KB on disk. However, the difference between 268 and 252 isn’t all that meaningful, and protocol 2 has other advantages, particularly when used on instances of new-style classes, so I recommend the mix I use in the functions shown in this recipe.
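If you want to reproduce this kind of measurement on your own box, a sketch along the following lines will do; I’m assuming the usual Linux wordlist location, which may differ on your system:

import cPickle, zlib
words = open('/usr/share/dict/words').readlines()
for proto in 0, 2:
    s = cPickle.dumps(words, proto)
    # zlib is the same compression engine gzip uses under the covers
    print 'protocol %d: %d bytes, %d compressed' % (
        proto, len(s), len(zlib.compress(s)))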
Whatever protocol you choose to save your data, you don’t need to worry about it when you’re reloading the data. The protocol is recorded in the file together with the data, so cPickle.load can figure out by itself all it needs. Just pass it an instance of a file or pseudo-file object with a read method, and cPickle.load returns each object that was pickled to the file, one after the other, and raises EOFError when the file’s done. In this recipe, we wrap a generator around cPickle.load, so you can simply loop over all recovered objects with a for statement, or, depending on what you need, you can use some call such as list(load('somefile.gz')) to get a list with all recovered objects as its items.
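A typical round trip, then, might look like this (test.gz is just a scratch filename of my choosing):

save('test.gz', {'a': 1}, range(5), 'hello')
for obj in load('test.gz'):
    print obj
# emits {'a': 1}, then [0, 1, 2, 3, 4], then hello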
Modules gzip and cPickle in the Library Reference.
Credit: Luther Blissett
You often need no special precautions to use cPickle on your classes and their instances. For example, the following works fine:
import cPickle

class ForExample(object):
    def __init__(self, *stuff):
        self.stuff = stuff

anInstance = ForExample('one', 2, 3)
saved = cPickle.dumps(anInstance)
reloaded = cPickle.loads(saved)
assert anInstance.stuff == reloaded.stuff
However, sometimes there are problems:
anotherInstance = ForExample(1, 2, open('three', 'w'))
wontWork = cPickle.dumps(anotherInstance)
This snippet causes a TypeError: “can’t pickle file objects” exception, because the state of anotherInstance includes a file object, and file objects cannot be pickled. You would get exactly the same exception if you tried to pickle any other container that includes a file object among its items.
However, in some cases, you may be able to do something about it:
class PrettyClever(object):
    def __init__(self, *stuff):
        self.stuff = stuff
    def __getstate__(self):
        def normalize(x):
            if isinstance(x, file):
                return 1, (x.name, x.mode, x.tell())
            return 0, x
        return [normalize(x) for x in self.stuff]
    def __setstate__(self, stuff):
        def reconstruct(x):
            if x[0] == 0:
                return x[1]
            name, mode, offs = x[1]
            openfile = open(name, mode)
            openfile.seek(offs)
            return openfile
        self.stuff = tuple([reconstruct(x) for x in stuff])
By defining the __getstate__ and __setstate__ special methods in your class, you gain fine-grained control about what, exactly, your class’ instances consider to be their state. As long as you can define such state in picklable terms, and reconstruct your instances from the unpickled state in some way that is sufficient for your application, you can make your instances themselves picklable and unpicklable in this way.
cPickle dumps class and function objects by name (i.e., through their module’s name and their name within the module). Thus, you can dump only classes defined at module level (not inside other classes and functions). Reloading such objects requires the respective modules to be available for import. Instances can be saved and reloaded only if they belong to such classes. In addition, the instance’s state must also be picklable.
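A quick sketch (the names are mine) of the by-name limitation in action:

import cPickle

class TopLevel(object):
    pass

def make_nested():
    class Nested(object):            # defined inside a function, so not
        pass                         # findable by name at module level
    return Nested()

cPickle.dumps(TopLevel())            # works fine
try:
    cPickle.dumps(make_nested())
except cPickle.PicklingError, e:
    print 'as expected:', e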
By default, an instance’s state is the contents of the instance’s __dict__, plus whatever state the instance may get from the built-in type the instance’s class inherits from, if any. For example, an instance of a new-style class that subclasses list includes the list items as part of the instance’s state. cPickle also handles instances of new-style classes that define or inherit a class attribute named __slots__ (and therefore hold some or all per-instance state in those predefined slots, rather than in a per-instance __dict__). Overall, cPickle’s default approach is often quite sufficient and satisfactory.
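For instance, here is a sketch of a __slots__-based class whose instances pickle without any custom methods; I’m using protocol 2, since the older protocols are pickier about classes that define __slots__:

import cPickle

class Point(object):
    __slots__ = ('x', 'y')           # per-instance state lives in slots
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = cPickle.loads(cPickle.dumps(Point(1.0, 2.0), 2))
print p.x, p.y                       # emits: 1.0 2.0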
Sometimes, however, you may have nonpicklable attributes or items as part of your instance’s state (as cPickle defines such state by default, as explained in the previous paragraph). In this recipe, for example, I show a class whose instances hold arbitrary stuff, which may include open file objects. To handle this case, your class can define the special method __getstate__. cPickle calls that method on your object, if your object’s class defines it or inherits it, instead of going directly for the object’s __dict__ (or possibly __slots__ and/or built-in type bases).
Normally, when you define the __getstate__ method, you define the __setstate__ method as well, as shown in this recipe’s Solution. __getstate__ can return any picklable object, and that object gets pickled, and later, at unpickling time, passed as __setstate__’s argument. In this recipe’s Solution, __getstate__ returns a list that’s similar to the instance’s default state (attribute self.stuff), except that each item is turned into a tuple of two items. The first item in the pair can be set to 0 to indicate that the second one will be taken verbatim, or 1 to indicate that the second item will be used to reconstruct an open file. (Of course, the reconstruction may fail or be unsatisfactory in several ways. There is no general way to save an open file’s state, which is why cPickle itself doesn’t even try. But in the context of our application, we can assume that the given approach will work.) When reloading the instance from pickled form, cPickle calls __setstate__ with the list of pairs, and __setstate__ can reconstruct self.stuff by processing each pair appropriately in its nested reconstruct function.
This scheme can clearly generalize to getting and restoring state that
may contain various kinds of normally unpicklable objects—just be sure
to use different numbers to tag each of the various kinds of
“nonverbatim” pairs you need to support.
In one particular case, you can define __getstate__ without defining __setstate__: __getstate__ must then return a dictionary, and reloading the instance from pickled form uses that dictionary just as the instance’s __dict__ would normally be used. Not running your own code at reloading time is a serious hindrance, but it may come in handy when you want to use __getstate__, not to save otherwise unpicklable state, but rather as an optimization. Typically, this optimization opportunity occurs when your instance caches results that it can recompute if they’re absent, and you decide it’s best not to store the cache as a part of the instance’s state. In this case, you should define __getstate__ to return a dictionary that’s the indispensable subset of the instance’s __dict__. (See Recipe 4.13 for a simple and handy way to “subset a dictionary”.)
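A sketch of the idea (class and attribute names are my own): the cache never gets pickled, and the instance lazily rebuilds it after reloading:

class Squares(object):
    def __init__(self, label):
        self.label = label
    def square(self, n):
        # fetch or rebuild the cache; we must not rely on __init__,
        # which does not run at unpickling time
        cache = self.__dict__.setdefault('_cache', {})
        if n not in cache:
            cache[n] = n * n
        return cache[n]
    def __getstate__(self):
        # the indispensable subset of __dict__: everything but the cache
        return dict([(k, v) for k, v in self.__dict__.items()
                     if k != '_cache'])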
Defining __getstate__ (and then, normally, also __setstate__) also gives you a further important bonus, besides the pickling support: if a class offers these methods but doesn’t offer special methods __copy__ or __deepcopy__, then the methods are also used for copying, both shallowly and deeply, as well as for serializing. The state data returned by __getstate__ is deep-copied if and only if the object is being deep-copied, but, other than this distinction, shallow and deep copies work very similarly when they are implemented through __getstate__. See Recipe 4.1 for more information about how a class can control the way its instances are copied, shallowly or deeply.
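For instance, with the PrettyClever class of this recipe’s Solution, a deep copy also runs through these two methods, so the file items get reconstructed (reopened and repositioned) rather than shared:

import copy
p = PrettyClever(1, 2, open('three', 'w'))
q = copy.deepcopy(p)   # __getstate__/__setstate__ drive the copy, too

Whether such reconstruction is satisfactory is, as noted earlier, application-dependent.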
With either the default pickling/unpickling approach, or your own __getstate__ and __setstate__, the instance’s special method __init__ is not called when the instance is getting unpickled. If the most convenient way for you to reconstruct an instance is to call the __init__ method with appropriate parameters, then you may want to define the special method __getinitargs__, instead of __getstate__. In this case, cPickle calls this method without arguments: the method must return a picklable tuple, and at unpickling time, cPickle calls __init__ with the arguments that are that tuple’s items. __getinitargs__, like __getstate__ and __setstate__, can also be used for copying.
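A sketch (my own example; note that pickle honors __getinitargs__ for classic, old-style classes, which is why this one does not subclass object):

import cPickle

class Point:                         # classic class: __getinitargs__ applies
    def __init__(self, x, y):
        print 'initializing', x, y   # runs again at unpickling time
        self.x = x
        self.y = y
    def __getinitargs__(self):
        return self.x, self.y        # must be a picklable tuple

p = Point(3, 4)
q = cPickle.loads(cPickle.dumps(p))  # emits: initializing 3 4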
The Library Reference for the pickle and copy_reg modules details even subtler things you can do when pickling and unpickling, as well as the thorny security issues that are likely to arise if you ever stoop to unpickling data from untrusted sources. (Executive summary: don’t do that—there is no way Python can protect you if you do.) However, the techniques I’ve discussed here should suffice in almost all practical cases, as long as the security aspects of unpickling are not a problem (and if they are, the only practical suggestion is: forget pickling!).
Recipe 7.2; documentation for the standard library module cPickle in the Library Reference and Python in a Nutshell.
Credit: Peter Cogolo
You need to pickle an object, but that object holds (as an attribute or item) a bound method of another object, and bound methods are not picklable.
Say you have the following objects:
import cPickle

class Greeter(object):
    def __init__(self, name):
        self.name = name
    def greet(self):
        print 'hello', self.name

class Repeater(object):
    def __init__(self, greeter):
        self.greeter = greeter
    def greet(self):
        self.greeter()
        self.greeter()

r = Repeater(Greeter('world').greet)
Were it not for the fact that r holds a bound method as its greeter attribute, you could pickle r very simply:
s = cPickle.dumps(r)
However, upon encountering the bound method, this call to cPickle.dumps raises a TypeError. One simple solution is to have each instance of class Repeater hold, not a bound method directly, but rather a picklable wrapper to it. For example:
class picklable_boundmethod(object):
    def __init__(self, mt):
        self.mt = mt
    def __getstate__(self):
        return self.mt.im_self, self.mt.im_func.__name__
    def __setstate__(self, (s, fn)):
        self.mt = getattr(s, fn)
    def __call__(self, *a, **kw):
        return self.mt(*a, **kw)
Now, changing Repeater.__init__’s body to self.greeter = picklable_boundmethod(greeter) makes the previous snippet work.
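Equivalently, you can wrap at the point of use, without touching Repeater at all:

r = Repeater(picklable_boundmethod(Greeter('world').greet))
s = cPickle.dumps(r)                 # now pickles without complaint
r2 = cPickle.loads(s)
r2.greet()                           # emits: hello world (twice)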
The Python Standard Library pickle module (just like its faster equivalent cousin cPickle) pickles functions and classes by name—this implies, in particular, that only functions defined at the top level of a module can be pickled (the pickling of such a function, in practice, contains just the names of the module and function).
If you have a graph of objects that hold each other, not directly, but via one another’s bound methods (which is often a good idea in Python), this limitation can make the whole graph unpicklable. One solution might be to teach pickle how to serialize bound methods, along the same lines as described in Recipe 7.6. Another possible solution is to define appropriate __getstate__ and __setstate__ methods to turn bound methods into something picklable at dump time and rebuild them at load time, along the lines described in Recipe 7.4. However, this latter possibility is not a good factorization when you have several classes whose instances hold bound methods.
This recipe pursues a simpler idea, based on holding bound methods, not directly, but via the picklable_boundmethod wrapper class. picklable_boundmethod is written under the assumption that the only thing you usually do with a bound method is to call it, so it only delegates __call__ functionality specifically. (You could, in addition, also use __getattr__, in order to delegate other attribute accesses, as sketched below.)
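Such a delegating extension might be sketched as follows (this subclass is my own addition, not part of the recipe); the guard on 'mt' avoids infinite recursion if the wrapped method is not set yet:

class picklable_boundmethod_plus(picklable_boundmethod):
    def __getattr__(self, name):
        if name == 'mt':
            raise AttributeError, name
        return getattr(self.mt, name)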
In normal operation, the fact that you’re holding an instance of picklable_boundmethod rather than holding the bound method object directly is essentially transparent. When pickling time comes, special method __getstate__ of picklable_boundmethod comes into play, as previously covered in Recipe 7.4. In the case of picklable_boundmethod, __getstate__ returns the object to which the bound method belongs and the function name of the bound method. Later, at unpickling time, __setstate__ recovers an equivalent bound method from the reconstructed object by using the getattr built-in for that name. This approach isn’t infallible because an object might hold its methods under assumed names (different from the real function names of the methods). However, assuming you’re not specifically doing something weird for the specific purpose of breaking picklable_boundmethod’s functionality, you shouldn’t ever run into this kind of obscure problem!
Library Reference and Python in a Nutshell docs for modules pickle and cPickle, bound-method objects, and the getattr built-in.
Credit: Andres Tremols, Peter Cogolo
You want to be able to pickle code objects, but this functionality is not supported by the standard library’s pickling modules.
You can extend the abilities of the pickle (or cPickle) module by using module copy_reg. Just make sure the following module has been imported before you pickle code objects, and has been imported, or is available to be imported, when you’re unpickling them:
import new, types, copy_reg

def code_ctor(*args):
    # delegate to new.code the construction of a new code object
    return new.code(*args)

def reduce_code(co):
    # a reductor function must return a tuple with two items: first, the
    # constructor function to be called to rebuild the argument object
    # at a future de-serialization time; then, the tuple of arguments
    # that will need to be passed to the constructor function.
    if co.co_freevars or co.co_cellvars:
        raise ValueError, "Sorry, cannot pickle code objects from closures"
    return code_ctor, (co.co_argcount, co.co_nlocals, co.co_stacksize,
        co.co_flags, co.co_code, co.co_consts, co.co_names,
        co.co_varnames, co.co_filename, co.co_name,
        co.co_firstlineno, co.co_lnotab)

# register the reductor to be used for pickling objects of type 'CodeType'
copy_reg.pickle(types.CodeType, reduce_code)

if __name__ == '__main__':
    # example usage of our new ability to pickle code objects
    import cPickle
    # a function (which, inside, has a code object, of course)
    def f(x):
        print 'Hello,', x
    # serialize the function's code object to a string of bytes
    pickled_code = cPickle.dumps(f.func_code)
    # recover an equal code object from the string of bytes
    recovered_code = cPickle.loads(pickled_code)
    # build a new function around the rebuilt code object
    g = new.function(recovered_code, globals())
    # check what happens when the new function gets called
    g('world')
The Python Standard Library pickle module (just like its faster equivalent cousin cPickle) pickles functions and classes by name. There is no pickling of the code objects containing the compiled bytecode that, when run, determines almost every aspect of functions’ (and methods’) behavior. In some situations, you’d rather pickle everything by value, so that all the relevant stuff can later be retrieved from the pickle, rather than having to have module files around for some of it. Sometimes you can solve such problems by using marshaling rather than pickling, since marshal does let you serialize code objects, but marshal has limitations on many other issues. For example, you cannot marshal instances of classes you have coded. (Once you’re serializing code objects, which are specific to a given version of Python, pickle will share one key limitation of marshal: no guaranteed ability to save and later reload data across different versions of Python.)
An alternative approach is to take advantage of the possibility, which the Python Standard Library allows, to extend the set of types known to pickle. Basically, you can “teach” pickle how to save and reload code objects; this, in turn, lets you pickle by value, rather than “by name”, such objects as functions and classes. (The code in this recipe’s Solution under the if __name__ == '__main__' guard essentially shows how to extend pickle for a function.)
To teach pickle about some new type, use module copy_reg, which is also part of the Python Standard Library. Through function copy_reg.pickle, you register the reduction function to use for instances of a given type. A reduction function takes as its argument an instance to be pickled and returns a tuple with two items: a constructor function, which will be called to reconstruct the instance, and a tuple of arguments, which will be passed to the constructor function. (A reduction function may also return other kinds of results, but for this recipe’s purposes a two-item tuple suffices.)
The module in this recipe defines function reduce_code, then registers it as the reduction function for objects of type types.CodeType—that is, code objects. When reduce_code gets called, it first checks whether its code object co comes from a closure (functions nested inside each other), because it just can’t deal with this eventuality—I’ve been unable to find a way that works, so in this case, reduce_code just raises an exception to let the user know about the problem.
In normal cases, reduce_code returns code_ctor as the constructor and a tuple made up of all of co’s attributes as the arguments tuple for the constructor. When a code object is reloaded from a pickle, code_ctor gets called with those arguments and simply passes the call on to the new.code callable, which is the true constructor for code objects. Unfortunately, reduce_code cannot return new.code itself as the first item in its result tuple, because new.code is a built-in (a C-coded callable) but is not available through a built-in name. So, basically, the role of code_ctor is to provide a name for the (by-name) pickling of new.code.
The if __name__ == '__main__' part of the recipe provides a typical toy usage example—it pickles a code object to a string, recovers a copy of it from the pickle string, and builds and calls a function around that code object. A more typical use case for this recipe’s functionality, of course, will do the pickling in one script and the unpickling in another. Assume that the module in this recipe has been saved as file reco.py somewhere on Python’s sys.path, so that it can be imported by Python scripts and other modules. You could then have a script that imports reco and thus becomes able to pickle code objects, such as:
import reco, pickle

def f(x):
    print 'Hello,', x

pickle.dump(f.func_code, open('saved.pickle', 'wb'))
To unpickle and use that code object, an example script might be:
import new, cPickle
c = cPickle.load(open('saved.pickle', 'rb'))
g = new.function(c, globals())
g('world')
Note that the second script does not need to import reco—the import will happen automatically when needed (part of the information that pickle saves in saved.pickle is that, in order to reconstruct the pickled object therein, it needs to call reco.code_ctor; so, it also knows it needs to import reco). I’m also showing that you can use modules pickle and cPickle interchangeably. cPickle is faster, but there are no other differences, and in particular, you can use one module to pickle objects and the other one to unpickle them, if you wish.
Modules pickle, cPickle, and copy_reg in the Library Reference and Python in a Nutshell.
Credit: Luther Blissett
You are using the standard module shelve. Some of the values you have shelved are mutable objects, and you need to mutate these objects.
The shelve module offers a kind of persistent dictionary—an important niche between the power of relational-database engines and the simplicity of marshal, pickle, dbm, and similar file formats. However, you should be aware of a typical trap you need to avoid when using shelve. Consider the following interactive Python session:
>>> import shelve
>>> # Build a simple sample shelf
>>> she = shelve.open('try.she', 'c')
>>> for c in 'spam': she[c] = {c:23}
...
>>> for c in she.keys(): print c, she[c]
...
p {'p': 23}
s {'s': 23}
a {'a': 23}
m {'m': 23}
>>> she.close()
We’ve created the shelve file, added some data to it, and closed it. Good—now we can reopen it and work with it:
>>> she = shelve.open('try.she', 'c')
>>> she['p']
{'p': 23}
>>> she['p']['p'] = 42
>>> she['p']
{'p': 23}
What’s going on here? We just set the value to 42, but our setting didn’t take in the shelve object! The problem is that we were working with a temporary object that shelve gave us, not with the “real thing”. shelve, when we open it with default options, like here, doesn’t track changes to such temporary objects. One reasonable solution is to bind a name to this temporary object, do our mutation, and then assign the mutated object back to the appropriate item of shelve:
>>> a = she['p']
>>> a['p'] = 42
>>> she['p'] = a
>>> she['p']
{'p': 42}
>>> she.close()
We can verify that the change was properly persisted:
>>> she = shelve.open('try.she', 'c')
>>> for c in she.keys(): print c, she[c]
...
p {'p': 42}
s {'s': 23}
a {'a': 23}
m {'m': 23}
A simpler solution is to open the shelve object with the writeback option set to True:
>>> she = shelve.open('try.she', 'c', writeback=True)
The writeback option instructs shelve to keep track of all the objects it gets from the file and write them all back to the file before closing it, just in case they have been modified in the meantime. While simple, this approach can be quite expensive, particularly in terms of memory consumption. Specifically, if we read many objects from a shelve object opened with writeback=True, even if we only modify a few of them, shelve is going to keep them all in memory, since it can’t tell in advance which one we may be about to modify. The previous approach, where we explicitly take responsibility to notify shelve of any changes (by assigning the changed objects back to the place they came from), requires more care on our part, but repays that care by giving us much better performance.
The standard Python module shelve can be quite convenient in many cases, but it hides a potentially nasty trap, admittedly well documented in Python’s online docs but still easy to miss. Suppose you’re shelving mutable objects, such as dictionaries or lists. Naturally, you are quite likely to want to mutate some of those objects—for example, by calling mutating methods (append on a list, update on a dictionary, etc.) or by assigning a new value to an item or attribute of the object. However, when you do this, the change doesn’t occur in the shelve object. This is because we actually mutate a temporary object that the shelve object has given us as the result of shelve’s own __getitem__ method, but the shelve object, by default, does not keep track of that temporary object, nor does it care about it once it returns it to us.
As shown in the recipe, one solution is to bind a name to the temporary object obtained by keying into the shelf, doing whatever mutations are needed to the object via the name, then assigning the newly mutated object back to the appropriate item of the shelve object. When you assign to a shelve object’s item, the shelve object’s __setitem__ method gets invoked, and it appropriately updates the shelve object itself, so that the change does occur.
Alternatively, you can add the flag writeback=True at the time you open the shelve object, and then shelve keeps track of every object it hands you, saving them all back to disk at the end. This approach may save you quite a bit of fussy and laborious coding, but take care: if you read many items of the shelve object and only modify a few of them, the writeback approach can be exceedingly costly, particularly in terms of memory consumption. When opened with writeback=True, shelve will keep in memory any item it has ever handed you, and save them all to disk at the end, since it doesn’t have a reliable way to tell which items you may be about to modify, nor, in general, even which items you have actually modified by the time you close the shelve object. The recommended approach, unless you’re going to modify just about every item you read (or unless the shelve object in question is small enough compared with your available memory that you don’t really care), is the previous one: bind a name to the items that you get from a shelve object with intent to modify them, and assign each item back into the shelve object once you’re done mutating that item.
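If you find yourself doing this a lot, you can package the discipline in a tiny helper; a sketch (the function name is my own):

def mutate_in_shelf(shelf, key, mutator, *args, **kwds):
    ''' fetch shelf[key], apply mutator to it, store it back '''
    obj = shelf[key]
    mutator(obj, *args, **kwds)
    shelf[key] = obj                 # triggers shelve's __setitem__

# e.g.: mutate_in_shelf(she, 'p', dict.update, {'q': 99})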
Recipe 7.1 and Recipe 7.2 for alternative serialization approaches; documentation for the shelve standard library module in the Library Reference and Python in a Nutshell.
You want to persist some data, exploiting the simplicity and good performance of the Berkeley DB database library.
If you have previously installed Berkeley DB on your machine, the Python Standard Library comes with package bsddb (and optionally bsddb3, to access Berkeley DB release 3.2 databases) to interface your Python code with Berkeley DB. To get either bsddb or, lacking it, bsddb3, use a try/except on import:
try:
    from bsddb import db            # first try release 4
except ImportError:
    from bsddb3 import db           # not there, try release 3 instead
print db.DB_VERSION_STRING
# emits, e.g.: Sleepycat Software: Berkeley DB 4.1.25: (December 19, 2002)
To create a database, instantiate a db.DB object, then call its method open with appropriate parameters, such as:
adb = db.DB()
adb.open('db_filename', dbtype=db.DB_HASH, flags=db.DB_CREATE)
db.DB_HASH is just one of several access methods you may choose when you create a database—a popular alternative is db.DB_BTREE, to use B+tree access (handy if you need to get records in sorted order, as sketched below). You may make an in-memory database, without an underlying file for persistence, by passing None instead of a filename as the first argument to the open method.
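For instance, here is a sketch of mine combining both options, an in-memory B+tree database; its cursor hands records back in sorted key order:

bt = db.DB()
bt.open(None, dbtype=db.DB_BTREE, flags=db.DB_CREATE)   # no file: in-memory
for word in 'the quick brown fox'.split():
    bt.put(word, '*')
curs = bt.cursor()
record = curs.first()
while record:
    print record[0],                 # emits: brown fox quick the
    record = curs.next()
bt.close()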
Once you have an open instance of db.DB, you can add records, each composed of two strings, key and data:
for i, w in enumerate('some words for example'.split()):
    adb.put(w, str(i))
You can access records via a cursor on the database:
def irecords(curs):
    record = curs.first()
    while record:
        yield record
        record = curs.next()

for key, data in irecords(adb.cursor()):
    print 'key=%r, data=%r' % (key, data)
# emits (the order may vary):
# key='some', data='0'
# key='example', data='3'
# key='words', data='1'
# key='for', data='2'
When you’re done, you close the database:
adb.close( )
At any future time, in the same or another Python program, you can reopen the database by giving just its filename as the argument to the open method of a newly created db.DB instance:
the_same_db = db.DB()
the_same_db.open('db_filename')
and work on it again in the same ways:
the_same_db.put('skidoo', '23')      # add a record
the_same_db.put('words', 'sweet')    # replace a record
for key, data in irecords(the_same_db.cursor()):
    print 'key=%r, data=%r' % (key, data)
# emits (the order may vary):
# key='some', data='0'
# key='example', data='3'
# key='words', data='sweet'
# key='for', data='2'
# key='skidoo', data='23'
Again, remember to close the database when you’re done:
the_same_db.close( )
The Berkeley DB is a popular open source database. It does not support SQL, but it’s simple to use, offers excellent performance, and gives you a lot of control over exactly what happens, if you care to exert it, through a huge array of options, flags, and methods. Berkeley DB is just as accessible from many other languages as from Python: for example, you can perform some changes or queries with a Python program, and others with a separate C program, on the same database file, using the same underlying open source library that you can freely download from Sleepycat.
The Python Standard Library shelve module can use the Berkeley DB as its underlying database engine, just as it uses cPickle for serialization. However, shelve does not let you take advantage of the ability to access a Berkeley DB database file from several different languages, exactly because the records are strings produced by pickle.dumps, and languages other than Python can’t easily deal with them. Accessing the Berkeley DB directly with bsddb also gives you access to many advanced functionalities of the database engine that shelve simply doesn’t support.
For example, creating a database with an access method of db.DB_HASH, as shown in the recipe, may give maximum performance, but, as you’ll have noticed when listing all records with the generator irecords that is also presented in the recipe, hashing puts records in apparently random, unpredictable order. If you need to access records in sorted order, you can use an access method of db.DB_BTREE instead. Berkeley DB also supports more advanced functionality, such as transactions, which you can enable through direct access but not via anydbm or shelve.
For detailed documentation about all functionality of the Python Standard Library bsddb package, see http://pybsddb.sourceforge.net/bsddb3.html. For documentation, downloads, and more of the Berkeley DB itself, see http://www.sleepycat.com/.
Library Reference and Python in a Nutshell docs for modules anydbm, shelve, and bsddb; http://pybsddb.sourceforge.net/bsddb3.html for many more details about bsddb and bsddb3; http://www.sleepycat.com/ for downloads of, and very detailed documentation on, the Berkeley DB itself.
Credit: Mark Nenadov
The MySQLdb module makes this task extremely easy:
import MySQLdb

# Create a connection object, then use it to create a cursor
con = MySQLdb.connect(host="127.0.0.1", port=3306,
                      user="joe", passwd="egf42", db="tst")
cursor = con.cursor()
# Execute an SQL string
sql = "SELECT * FROM Users"
cursor.execute(sql)
# Fetch all results from the cursor into a sequence and close the connection
results = cursor.fetchall()
con.close()
MySQLdb is at http://sourceforge.net/projects/mysql-python. It is a plain and simple implementation of the Python DB API 2.0 that is suitable for Python 2.3 (and some older versions, too) and MySQL versions 3.22 to 4.0. MySQLdb, at the time of this writing, did not yet officially support Python 2.4. However, if you have the right C compiler installation to build Python extensions (as should be the case for all Linux, Mac OS X, and other Unix users, and many Windows developers), the current version of MySQLdb does in fact build from sources, install, and work just fine, with Python 2.4. A newer version of MySQLdb is in the works, with official support for Python 2.3 or later and MySQL 4.0 or later.
As with all other Python DB API implementations (once you have downloaded and installed the needed Python extension and have the database engine it needs up and running), you start by importing the module and calling the connect function with suitable parameters. The keyword parameters you can pass when calling connect depend on the database involved: host (defaulting to the local host), user, passwd (password), and db (name of the database) are typical. In the recipe, I explicitly pass the default local host’s IP address and the default MySQL port (3306), just to show that you can specify parameters explicitly even when you’re passing their default values (e.g., to make your source code clearer and more readable and maintainable).
The connect function returns a connection object, and you can proceed to call methods on this object; when you are done, call the close method. The method you most often call on a connection object is cursor, which returns a cursor object, which is what you use to send SQL commands to the database and fetch the commands’ results. The underlying MySQL database engine does not in fact support SQL cursors, but that’s no problem—the MySQLdb module emulates them on your behalf, quite transparently, for the limited cursor needs of the Python DB API 2.0. Of course, this doesn’t mean that you can use SQL phrases like WHERE CURRENT OF CURSOR with a database that does not offer cursors! Once you have a cursor object in hand, you can call methods on it. The recipe uses the execute method to execute an SQL statement, and then the fetchall method to obtain all results as a sequence of tuples—one tuple per row in the result. You can use many refinements, but these basic elements of the Python DB API’s functionality already suffice for many tasks.
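One such refinement worth knowing about is parameter substitution: MySQLdb uses the “format” paramstyle, meaning %s placeholders, so you can pass values separately from the SQL string rather than embedding them. A sketch, assuming an open connection con as in the Solution (the table and column names are examples only):

cursor = con.cursor()
cursor.execute("SELECT * FROM Users WHERE name = %s", ("joe",))
for row in cursor.fetchall():
    print row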
The Python-MySQL interface module (http://sourceforge.net/projects/mysql-python); the Python DB API (http://www.python.org/topics/database/DatabaseAPI-2.0.html); DB API documentation in Python in a Nutshell.
Credit: Luther Blissett
The MySQLdb module does not support full-fledged placeholders, but you can make do with the module’s escape_string function:
import MySQLdb, cPickle

# Connect to a DB, e.g., the test DB on your localhost, and get a cursor
connection = MySQLdb.connect(db="test")
cursor = connection.cursor()
# Make a new table for experimentation
cursor.execute("CREATE TABLE justatest (name TEXT, ablob BLOB)")
try:
    # Prepare some BLOBs to insert in the table
    names = 'aramis', 'athos', 'porthos'
    data = {}
    for name in names:
        datum = list(name)
        datum.sort()
        data[name] = cPickle.dumps(datum, 2)
    # Perform the insertions
    sql = "INSERT INTO justatest VALUES(%s, %s)"
    for name in names:
        cursor.execute(sql, (name, MySQLdb.escape_string(data[name])))
    # Recover the data so you can check back
    sql = "SELECT name, ablob FROM justatest ORDER BY name"
    cursor.execute(sql)
    for name, blob in cursor.fetchall():
        print name, cPickle.loads(blob), cPickle.loads(data[name])
finally:
    # Done. Remove the table and close the connection.
    cursor.execute("DROP TABLE justatest")
    connection.close()
MySQL supports binary data (BLOBs and variations thereof), but you should be careful when communicating such data via SQL. Specifically, when you use a normal INSERT SQL statement and need to have binary strings among the VALUES you’re inserting, you have to escape some characters in the binary string according to MySQL’s own rules. Fortunately, you don’t have to figure out those rules for yourself: MySQL supplies a function that does the needed escaping, and MySQLdb exposes it to your Python programs as the escape_string function.
This recipe shows a typical case: the BLOBs you’re inserting come from cPickle.dumps, so they may represent almost arbitrary Python objects (although, in this case, we’re just using them for a few lists of characters). The recipe is purely demonstrative and works by creating a table and dropping it at the end (using a try/finally statement to ensure that finalization is performed even if the program should terminate because of an uncaught exception). With recent versions of MySQL and MySQLdb, you don’t even need to call the escape_string function anymore, so you can change the relevant statement to the simpler:
cursor.execute(sql, (name, data[name]))
Recipe 7.11 and Recipe 7.12 for PostgreSQL-oriented and SQLite-oriented solutions to the same problem; the MySQL home page (http://www.mysql.org); the Python/MySQL interface module (http://sourceforge.net/projects/mysql-python).
You need to store a BLOB in a PostgreSQL database.
PostgreSQL 7.2 and later supports large objects, and the
psycopg
module supplies a Binary
escaping function:
import psycopg, cPickle
# Connect to a DB, e.g., the test DB on your localhost, and get a cursor
connection = psycopg.connect("dbname=test")
cursor = connection.cursor()
# Make a new table for experimentation
cursor.execute("CREATE TABLE justatest (name TEXT, ablob BYTEA)")
try:
    # Prepare some BLOBs to insert in the table
    names = 'aramis', 'athos', 'porthos'
    data = {}
    for name in names:
        datum = list(name)
        datum.sort()
        data[name] = cPickle.dumps(datum, 2)
    # Perform the insertions
    sql = "INSERT INTO justatest VALUES(%s, %s)"
    for name in names:
        cursor.execute(sql, (name, psycopg.Binary(data[name])))
    # Recover the data so you can check back
    sql = "SELECT name, ablob FROM justatest ORDER BY name"
    cursor.execute(sql)
    for name, blob in cursor.fetchall():
        print name, cPickle.loads(blob), cPickle.loads(data[name])
finally:
    # Done.  Remove the table and close the connection.
    cursor.execute("DROP TABLE justatest")
    connection.close()
PostgreSQL supports binary data (BYTEA
and variations thereof), but you
should be careful when communicating such data via SQL. Specifically,
when you use a normal INSERT
SQL
statement and need to have binary strings among the VALUES
you’re inserting, you have to escape
some characters in the binary string according to PostgreSQL’s own
rules. Fortunately, you don’t have to figure out those rules for
yourself: PostgreSQL supplies functions that do all the needed
escaping, and psycopg
exposes such
a function to your Python programs as the Binary
function. This recipe shows a typical
case: the BYTEA
s you’re inserting
come from cPickle.dumps
, so they
may represent almost arbitrary Python objects (although, in this case,
we’re just using them for a few lists of characters). The recipe is
purely demonstrative and works by creating a table and dropping it at
the end (using a try/finally
statement to ensure finalization is performed even if the program
should terminate because of an uncaught exception).
Earlier PostgreSQL releases limited to a few kilobytes the
amount of data you could store in a normal field of the database. To
store really large objects, you had to use roundabout techniques to
load the data into the database (such as PostgreSQL’s nonstandard SQL
function LO_IMPORT
to load a data
file as an object, which requires superuser privileges and data files
that reside on the machine running the PostgreSQL server) and store a
field of type OID
in the table to
be used later for indirect recovery of the data. Fortunately, none of
these techniques are necessary anymore: since Release 7.1 (the current
release at the time of writing is 8.0), PostgreSQL embodies the
results of project TOAST, which removed the limitations on
field-storage size and therefore the need for peculiar indirection.
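Purely for historical contrast, the old roundabout technique looked something like the following sketch (the file paths are hypothetical; lo_import runs on the server and requires superuser privileges, so don’t use this on modern releases):
import psycopg
connection = psycopg.connect("dbname=test")
cursor = connection.cursor()
cursor.execute("CREATE TABLE oldblobs (name TEXT, dataoid OID)")
# lo_import reads a server-side file into a large object, returning its OID
cursor.execute(
    "INSERT INTO oldblobs VALUES ('aramis', lo_import('/tmp/aramis.dat'))")
# the stored OID is later used for indirect recovery of the data
cursor.execute(
    "SELECT lo_export(dataoid, '/tmp/aramis.out') FROM oldblobs "
    "WHERE name = 'aramis'")
connection.commit()
connection.close()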
Module psycopg
supplies the handy
Binary
function to let you escape
any binary string of bytes into a form acceptable for placeholder
substitution in INSERT
and UPDATE
SQL statements.
Recipe 7.10 and Recipe 7.12 for MySQL-oriented and SQLite-oriented solutions to the same problem; PostgreSQL’s home page (http://www.postgresql.org/); the Python/PostgreSQL module (http://initd.org/software/psycopg).
Credit: John Barham
The PySQLite Python extension offers function sqlite.encode
to let you insert binary
strings in SQLite databases. You can also build a small adapter class
based on that function:
import sqlite, cPickle

class Blob(object):
    ''' automatic converter for binary strings '''
    def __init__(self, s):
        self.s = s
    def _quote(self):
        return "'%s'" % sqlite.encode(self.s)

# make a test database in memory, get a cursor on it, and make a table
connection = sqlite.connect(':memory:')
cursor = connection.cursor()
cursor.execute("CREATE TABLE justatest (name TEXT, ablob BLOB)")
# Prepare some BLOBs to insert in the table
names = 'aramis', 'athos', 'porthos'
data = {}
for name in names:
    datum = list(name)
    datum.sort()
    data[name] = cPickle.dumps(datum, 2)
# Perform the insertions
sql = 'INSERT INTO justatest VALUES(%s, %s)'
for name in names:
    cursor.execute(sql, (name, Blob(data[name])))
# Recover the data so you can check back
sql = 'SELECT name, ablob FROM justatest ORDER BY name'
cursor.execute(sql)
for name, blob in cursor.fetchall():
    print name, cPickle.loads(blob), cPickle.loads(data[name])
# Done, close the connection (would be no big deal if you didn't, but...)
connection.close()
SQLite does not directly support binary data, but it still lets
you declare such types for fields in a CREATE
TABLE
DDL statement. The PySQLite Python extension uses the
declared types of fields to convert field values appropriately to
Python values when you fetch data after an SQL SELECT
from an SQLite database. However, you
still need to be careful when communicating binary string data via
SQL.
Specifically, when you use INSERT
or UPDATE
SQL statements, and need to have
binary strings among the VALUES
you’re passing, you need to escape some characters in the binary
string according to SQLite’s own rules. Fortunately, you don’t have to
figure out those rules for yourself: SQLite supplies the function to
do the needed escaping, and PySQLite exposes that function to your
Python programs as the sqlite.encode
function. This recipe shows a
typical case: the BLOB
s you’re
inserting come from cPickle.dumps
,
so they may represent almost arbitrary Python objects (although, in
this case, we’re just using them for a few lists of characters). The
recipe is purely demonstrative and works by creating a database in
memory, so that the database is implicitly lost at the end of the
script.
While you could perfectly well call sqlite.encode
directly on your binary
strings at the time you pass them as parameters to a cursor’s execute
method, this recipe takes a slightly
different tack, defining a Blob
class to wrap binary
strings and passing instances of that. When PySQLite receives as
arguments instances of any class, the class must define a method named
_quote
, and PySQLite calls that
method on each instance, expecting the method to return a string fully
ready for insertion into an SQL statement. When you use this approach
for more complicated classes of your own, you’ll probably want to pass
a decoders
keyword argument to the
connect
method, to associate
appropriate decoding functions to specific SQL types. By default,
however, the BLOB
SQL type is
associated with the decoding function sqlite.decode
, which is exactly the inverse
of sqlite.encode
; for the simple
Blob
class in this recipe, therefore, we do not need
to specify any custom decoder, since the default one suits us
perfectly well.
Recipe 7.10 and Recipe 7.11 for MySQL-oriented and PostgreSQL-oriented solutions to the same problem; SQLite’s home page (http://www.sqlite.org/); the PySQLite manual (http://pysqlite.sourceforge.net/manual.html); the SQLite FAQ (“Does SQLite support a BLOB type?”) at http://www.hwaci.com/sw/sqlite/faq.html#q12.
Credit: Thomas T. Jenkins
You want to access data fetched from a DB API cursor object, but you want to access the columns by field name, not by number.
Accessing columns within a set of database-fetched rows by
column index is not very readable, nor is it robust should columns
ever get reordered in a rework of the database’s schema (a rare event,
but it does occasionally happen). This recipe exploits the description
attribute of Python DB API’s
cursor
objects to build a
dictionary that maps column names to index values, so you can use
cursor_row[field_dict[fieldname]]
to get the value of a named column:
def fields(cursor):
    """ Given a DB API 2.0 cursor object that has been executed, returns
        a dictionary that maps each field name to a column index, 0 and up. """
    results = {}
    for column, desc in enumerate(cursor.description):
        results[desc[0]] = column
    return results
When you get a set of rows from a call to any of a cursor’s various
fetch... methods (fetchone, fetchmany, fetchall), it is often helpful to be able to
access a specific column in a row by field name and not by column
number. This recipe shows a function that takes a DB API 2.0 cursor
object and returns a dictionary with column numbers keyed by field
names.
Here’s a usage example (assuming you put this recipe’s code in a
module that you call dbutils.py
somewhere on your Python sys.path
).
You must start with conn
being a
connection object for any DB API 2-compliant Python module.
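For instance, a sketch of obtaining such a connection with the MySQLdb module used earlier in this chapter (any other compliant module would do just as well, although the table queried in the following session belongs to the recipe author’s own database):
import MySQLdb
conn = MySQLdb.connect(db="test")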
>>> c = conn.cursor()
>>> c.execute('''select * from country_region_goal
...              where crg_region_code is null''')
>>> import pprint
>>> pp = pprint.pprint
>>> pp(c.description)
(('CRG_ID', 4, None, None, 10, 0, 0),
 ('CRG_PROGRAM_ID', 4, None, None, 10, 0, 1),
 ('CRG_FISCAL_YEAR', 12, None, None, 4, 0, 1),
 ('CRG_REGION_CODE', 12, None, None, 3, 0, 1),
 ('CRG_COUNTRY_CODE', 12, None, None, 2, 0, 1),
 ('CRG_GOAL_CODE', 12, None, None, 2, 0, 1),
 ('CRG_FUNDING_AMOUNT', 8, None, None, 15, 0, 1))
>>> import dbutils
>>> field_dict = dbutils.fields(c)
>>> pp(field_dict)
{'CRG_COUNTRY_CODE': 4,
 'CRG_FISCAL_YEAR': 2,
 'CRG_FUNDING_AMOUNT': 6,
 'CRG_GOAL_CODE': 5,
 'CRG_ID': 0,
 'CRG_PROGRAM_ID': 1,
 'CRG_REGION_CODE': 3}
>>> row = c.fetchone()
>>> pp(row)
(45, 3, '2000', None, 'HR', '26', 48509.0)
>>> ctry_code = row[field_dict['CRG_COUNTRY_CODE']]
>>> print ctry_code
HR
>>> fund = row[field_dict['CRG_FUNDING_AMOUNT']]
>>> print fund
48509.0
If you find accesses such as row[field_dict['CRG_COUNTRY_CODE']]
to be
still inelegant, you may want to get fancier and wrap the row as well
as the dictionary of fields into an object allowing more elegant
access—a simple example might be:
class neater(object):
    def __init__(self, row, field_dict):
        self.r = row
        self.d = field_dict
    def __getattr__(self, name):
        try:
            return self.r[self.d[name]]
        except LookupError:
            raise AttributeError
If this neater class were also in your
dbutils module, you could then continue the preceding
interactive snippet with, for example:
>>> row = dbutils.neater(row, field_dict)
>>> print row.CRG_FUNDING_AMOUNT
48509.0
However, if you’re tempted by such fancier approaches, I suggest
that, rather than rolling your own, you have a look at the dtuple
module showcased in Recipe 7.14. Reusing good,
solid, proven code is a much smarter approach than writing your own
infrastructure.
Recipe 7.14 for
a slicker and more elaborate approach to a very similar task,
facilitated by reusing the third-party dtuple
module.
Credit: Steve Holden, Hamish Lawson, Kevin Jacobs
You want flexible access to sequences, such as the rows in a database query, by either name or column number.
Rather than coding your own solution, it is often smarter to
reuse a good existing one. For this recipe’s task, a good existing
solution is packaged in Greg Stein’s dtuple
module:
import dtuple
import mx.ODBC.Windows as odbc

flist = ["Name", "Num", "LinkText"]
descr = dtuple.TupleDescriptor([[n] for n in flist])
conn = odbc.connect("HoldenWebSQL")    # Connect to a database
curs = conn.cursor()                   # Create a cursor
sql = """SELECT %s FROM StdPage
         WHERE PageSet='Std' AND Num<25
         ORDER BY PageSet, Num""" % ", ".join(flist)
print sql
curs.execute(sql)
rows = curs.fetchall()
for row in rows:
    row = dtuple.DatabaseTuple(descr, row)
    print "Attribute: Name: %s Number: %d" % (row.Name, row.Num or 0)
    print "Subscript: Name: %s Number: %d" % (row[0], row[1] or 0)
    print "Mapping:   Name: %s Number: %d" % (row["Name"], row["Num"] or 0)
conn.close()
Novice Python programmers are sometimes deterred from
using databases because query results are presented by DB
API-compliant modules as a list of tuples. Since tuples can only be
numerically subscripted, code that uses the query results becomes
opaque and difficult to maintain. Greg Stein’s dtuple
module, available from http://www.lyra.org/greg/python/dtuple.py,
helps by defining two useful classes: TupleDescriptor
and DatabaseTuple
. To access an arbitrary SQL
database, this recipe uses the ODBC protocol through the mxODBC
module, http://www.egenix.com/files/python/mxODBC.html,
but nothing relevant to the recipe’s task would change if any other
standard DB API-compliant module was used instead.
The TupleDescriptor
class creates a description of tuples from a list of sequences, the
first element of each subsequence being a column name. It is often
convenient to describe data with such sequences. For example, in an
interactive forms-based application, each column name might be
followed by validation parameters such as data type and allowable
length. TupleDescriptor
’s purpose
is to allow the creation of DatabaseTuple
objects. In this particular
application, no other information about the columns is needed beyond
the names, so the required list of sequences is a list of singleton
lists (meaning lists that have just one element each), constructed
from a list of field names using a list comprehension.
Created from TupleDescriptor
and a tuple such as a database row, DatabaseTuple
is an object whose elements
can be accessed by numeric subscript (like a tuple) or column-name
subscript (like a dictionary). If column names are legal Python names,
you can also access the columns in your DatabaseTuple
as attributes. A purist might
object to this crossover between items and attributes, but it’s a
highly pragmatic choice in this case. Python is nothing if not a
highly pragmatic language, so I see nothing wrong with this
convenience.
To demonstrate the utility of DatabaseTuple
, the simple test program in
this recipe creates a TupleDescriptor
and uses it to convert each
row retrieved from an SQL query into DatabaseTuple
. Because the sample uses the
same field list to build both TupleDescriptor
and the SQL SELECT
statement, it demonstrates how
database code can be parameterized relatively easily.
Alternatively, if you wish to get all the fields (an SQL
SELECT *
query), and dynamically
get the field names from the cursor, as previously described in Recipe 7.13, you can do so.
Just remove variable flist
, which you don’t need any
more, and move the construction of variable descr
to
right after the call to the cursor’s execute
method, as follows:
curs.execute(sql)
descr = dtuple.TupleDescriptor(curs.description)
The rest of the recipe can remain unchanged.
A more sophisticated approach, with functionality
similar to dtuple
’s and even better
performance, is offered by the Python Database Row Module (also known
as db_row
) made freely available by
the OPAL Group. For downloads and information, visit http://opensource.theopalgroup.com/.
Module pysqlite
, which
handles relational databases in memory or in files by wrapping the
SQLite library, does not return real tuple
s from such methods as fetchall
: rather, it returns instances of a
convenience class that wraps tuple
and also allows field access with attribute-access syntax, much like
the approaches mentioned in this recipe.
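For example, the following sketch (assuming the pysqlite module, imported as sqlite as in the earlier recipes) shows both kinds of access on a fetched row; the attribute-access syntax is the convenience that pysqlite’s row class adds on top of plain tuple behavior:
import sqlite
conn = sqlite.connect(':memory:')
curs = conn.cursor()
curs.execute("CREATE TABLE t (name TEXT, num INT)")
curs.execute("INSERT INTO t VALUES ('alpha', 1)")
curs.execute("SELECT name, num FROM t")
row = curs.fetchone()
print row[0], row.name    # numeric subscript and attribute access
conn.close()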
Recipe 7.13 for
a simpler, less functionally rich way to convert field names to column
numbers; the dtuple
module is at
http://www.lyra.org/greg/python/dtuple.py;
OPAL’s db_row
is at http://opensource.theopalgroup.com/; SQLite, a
fast, lightweight, embedded relational database (http://www.sqlite.org/), and
its Python DB API interface module pysqlite
(http://pysqlite.sourceforge.net/).
Credit: Steve Holden, Farhad Fouladi, Rosendo Martinez, David Berry, Kevin Ryan
You want to present a query’s result with appropriate column headers (and optionally widths), but you do not want to hard-code this information in your program. Indeed, you may not even know the column headers and widths at the time you’re writing the code.
Discovering the column headers and widths dynamically is the most flexible approach, and it gives you code that’s highly reusable over many such presentation tasks:
def pp(cursor, data=None, check_row_lengths=False):
    if not data:
        data = cursor.fetchall()
    names = []
    lengths = []
    rules = []
    for col, field_description in enumerate(cursor.description):
        field_name = field_description[0]
        names.append(field_name)
        field_length = field_description[2] or 12
        field_length = max(field_length, len(field_name))
        if check_row_lengths:
            # double-check field length, if the description is unreliable
            data_length = max([ len(str(row[col])) for row in data ])
            field_length = max(field_length, data_length)
        lengths.append(field_length)
        rules.append('-' * field_length)
    format = " ".join(["%%-%ss" % l for l in lengths])
    result = [format % tuple(names), format % tuple(rules)]
    for row in data:
        result.append(format % tuple(row))
    return "\n".join(result)    # one line each for headings, rules, and rows
Relational databases are often perceived as difficult to use.
The Python DB API can make them much easier to use, but if your
programs work with several different DB engines, it’s sometimes
tedious to reconcile the implementation differences between the
various modules, and, even more, between the engines they connect to.
One of the problems of dealing with databases is presenting the result
of a query when you may not know much about the data. This recipe uses
the cursor’s description
attribute
to try to provide appropriate headings. The recipe optionally examines
each output row to ensure that column widths are adequate.
In some cases, a cursor can yield a solid description of the
data it returns, but not all database modules are kind enough to
supply cursors that do so. The pretty printer function
pp
shown in this recipe’s Solution takes as its first
argument a cursor, on which you have just executed a retrieval
operation (generally the execute
of
an SQL SELECT
statement). It also
takes an optional argument for the returned data; to use the data for
other purposes, retrieve the data from the cursor, typically with
fetchall
, and pass it in as
pp
’s data
argument. The second
optional argument tells the pretty printer to determine the column
lengths from the data, rather than trusting the cursor’s description;
checking the data for column lengths can be time-consuming, but is
helpful with some RDBMS engines and DB API module combinations, where
the widths given by the cursor’s description
attribute can be
inaccurate.
A simple test program shows the value of the second optional
argument when a Microsoft Jet database is used through the mxODBC
module:
import mx.ODBC.Windows as odbc
import dbcp    # contains the pp function
conn = odbc.connect("MyDSN")
curs = conn.cursor()
curs.execute("""SELECT Name, LinkText, Pageset FROM StdPage
                ORDER BY PageSet, Name""")
rows = curs.fetchall()
print "Without rowlens:"
print dbcp.pp(curs, rows)
print "With rowlens:"
print dbcp.pp(curs, rows, check_row_lengths=True)
conn.close()
In this case, the cursor’s description
does not include column lengths.
The first output shows that the default column length of 12 is too
short. The second output corrects this by examining the data:
Without rowlens:
Name         LinkText     Pageset
------------ ------------ ------------
ERROR        ERROR: Cannot Locate Page None
home         Home         None
consult      Consulting Activity Std
ffx          FactFaxer    Std
hardware     Hardware Platforms Std
python       Python       Std
rates        Rates        Std
technol      Technologies Std
wcb          WebCallback  Std

With rowlens:
Name         LinkText                  Pageset
------------ ------------------------- ------------
ERROR        ERROR: Cannot Locate Page None
home         Home                      None
consult      Consulting Activity       Std
ffx          FactFaxer                 Std
hardware     Hardware Platforms        Std
python       Python                    Std
rates        Rates                     Std
technol      Technologies              Std
wcb          WebCallback               Std
Module pysqlite
,
which handles relational databases in memory or in files by wrapping
the SQLite library, is another example of a DB
API module whose cursors’ descriptions do not contain reliable values
for field lengths. Moreover, pysqlite
does not return real tuple
s from such methods as fetchall
: rather, it returns instances of a
convenience class which wraps tuple
and also allows field access with attribute-access syntax, much like
the approaches presented in Recipe 7.14. To deal with
such small variations from the DB API specifications, this recipe
carefully uses tuple(row)
, not just
row
, as the right-hand operand of operator %
in the statement result.append(format % tuple(row))
. Python’s
semantics specify that if the right-hand operand is not a tuple
, then the left-hand (format string)
operand may contain only one format specifier. This recipe uses a
tuple
as the right-hand operand
because the whole point of the recipe is to build and use a format
string with many format specifiers, one per field.
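A quick interactive check illustrates the semantics this careful coding guards against: a non-tuple right-hand operand counts as one single value, so a two-specifier format string fails on it:
>>> "%-8s %-8s" % ('a', 'b')      # tuple: one item per format specifier
'a        b       '
>>> "%-8s %-8s" % ['a', 'b']      # non-tuple: counts as a single value
Traceback (most recent call last):
  ...
TypeError: not enough arguments for format string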
This recipe’s function is useful during testing, since it lets
you easily verify that you are indeed retrieving what you expect from
the database. The output is pretty enough to display ad hoc query
outputs to users. The function currently makes no attempt to represent
null values other than the None
the
DB API returns, though it could easily be modified to show a null
string or some other significant value.
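For example, one simple way to perform such a modification (a sketch; the null_marker string is an arbitrary choice) is to filter each row just before formatting:
def displayable(row, null_marker='<NULL>'):
    ''' return row as a tuple, with None fields replaced by a visible marker '''
    result = []
    for field in row:
        if field is None:
            result.append(null_marker)
        else:
            result.append(field)
    return tuple(result)
Then, inside pp, using format % displayable(row) in place of format % tuple(row) makes every null show up as the marker.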
The mxODBC
package, a DB
API-compatible interface to ODBC (http://www.egenix.com/files/python/mxODBC.html);
SQLite, a fast, lightweight embedded relational database (http://www.sqlite.org/), and
its Python DB API interface module pysqlite
(http://pysqlite.sourceforge.net/).
Credit: Denis S. Otkidach
You want to write Python code that runs under any DB API-compliant module, but such modules can use different parameter-passing styles.
We need a set of supporting functions to convert SQL queries and parameters to any of the five possible parameter-passing styles:
class Param(object):
    ''' a class to wrap any single parameter '''
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return 'Param(%r)' % (self.value,)

def to_qmark(chunks):
    ''' prepare SQL query in '?' style '''
    query_parts = []
    params = []
    for chunk in chunks:
        if isinstance(chunk, Param):
            params.append(chunk.value)
            query_parts.append('?')
        else:
            query_parts.append(chunk)
    return ''.join(query_parts), params

def to_numeric(chunks):
    ''' prepare SQL query in ':1' style '''
    query_parts = []
    params = []
    for chunk in chunks:
        if isinstance(chunk, Param):
            params.append(chunk.value)
            query_parts.append(':%d' % len(params))
        else:
            query_parts.append(chunk)
    # DCOracle2 needs, specifically, a _tuple_ of parameters:
    return ''.join(query_parts), tuple(params)

def to_named(chunks):
    ''' prepare SQL query in ':name' style '''
    query_parts = []
    params = {}
    for chunk in chunks:
        if isinstance(chunk, Param):
            name = 'p%d' % len(params)
            params[name] = chunk.value
            query_parts.append(':%s' % name)
        else:
            query_parts.append(chunk)
    return ''.join(query_parts), params

def to_format(chunks):
    ''' prepare SQL query in '%s' style '''
    query_parts = []
    params = []
    for chunk in chunks:
        if isinstance(chunk, Param):
            params.append(chunk.value)
            query_parts.append('%s')
        else:
            query_parts.append(chunk.replace('%', '%%'))
    return ''.join(query_parts), params

def to_pyformat(chunks):
    ''' prepare SQL query in '%(name)s' style '''
    query_parts = []
    params = {}
    for chunk in chunks:
        if isinstance(chunk, Param):
            name = 'p%d' % len(params)
            params[name] = chunk.value
            query_parts.append('%%(%s)s' % name)
        else:
            query_parts.append(chunk.replace('%', '%%'))
    return ''.join(query_parts), params

converter = {}
for paramstyle in ('qmark', 'numeric', 'named', 'format', 'pyformat'):
    converter[paramstyle] = globals()['to_' + paramstyle]

def execute(cursor, converter, chunked_query):
    query, params = converter(chunked_query)
    return cursor.execute(query, params)

if __name__ == '__main__':
    query = ('SELECT * FROM test WHERE field1>', Param(10),
             ' AND field2 LIKE ', Param('%value%'))
    print 'Query:', query
    for paramstyle in ('qmark', 'numeric', 'named', 'format', 'pyformat'):
        print '%s: %r' % (paramstyle, converter[paramstyle](query))
The DB API specification is quite handy, but it has one especially
annoying problem: it allows compliant modules to use any of five
parameter styles. So you cannot necessarily switch to another database
just by changing the database module: if the parameter-passing styles
of two such modules differ, you need to rewrite all SQL queries that
use parameter substitution. Using this recipe, you can improve this
situation a little. Pick the appropriate converter from the
converter
dictionary (indexing it with the paramstyle
attribute of your current DB API
module), write your queries as mixed chunks of SQL strings and
instances of the provided Param
class (as exemplified
in the if __name__=='__main__' part of the recipe), and execute your queries through the
execute
function in this recipe. Not a perfectly
satisfactory solution, by any means, but way better than
nothing!
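Concretely, usage might look like the following sketch (assuming the MySQLdb module from earlier recipes in this chapter, which declares paramstyle 'format', and the recipe’s own converter dictionary, Param class, and execute function):
import MySQLdb
connection = MySQLdb.connect(db="test")
cursor = connection.cursor()
# pick the converter matching the module's declared paramstyle
prepare = converter[MySQLdb.paramstyle]
chunked_query = ('SELECT * FROM test WHERE field1>', Param(10),
                 ' AND field2 LIKE ', Param('%value%'))
execute(cursor, prepare, chunked_query)
print cursor.fetchall()
connection.close()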
The DB API docs at http://www.python.org/peps/pep-0249.html; the list of DB API-compliant modules at http://www.python.org/topics/database/modules.html.
Credit: Souman Deb
You need to access a Microsoft Jet database via Microsoft’s ADO, for example from a Python-coded CGI script for the Apache web server.
The CGI script must live in Apache’s cgi-bin directory and can use the PyWin32 extensions to connect, via COM, to ADO and hence to Microsoft Jet. For example:
#!C:\Python23\python.exe
print "Content-type: text/html\n"
import win32com.client
db = 'C:\\Program Files\\Microsoft Office\\Office\\Samples\\Northwind.mdb'
MAX_ROWS = 2155

def connect(query):
    con = win32com.client.Dispatch('ADODB.Connection')
    con.Open("Provider=Microsoft.Jet.OLEDB.4.0; Data Source=" + db)
    result_set = con.Execute(query + ';')
    con.Close()
    return result_set

def display(columns, MAX_ROWS):
    print "<table border=1>"
    print "<th>Order ID</th>"
    print "<th>Product</th>"
    print "<th>Unit Price</th>"
    print "<th>Quantity</th>"
    print "<th>Discount</th>"
    for k in range(MAX_ROWS):
        print "<tr>"
        for field in columns:
            print "<td>", field[k], "</td>"
        print "</tr>"
    print "</table>"

result_set = connect("select * from [Order details]")
columns = result_set[0].GetRows(MAX_ROWS)
display(columns, MAX_ROWS)
result_set[0].Close()
This recipe uses the “Northwind Database” example that Microsoft
distributes with several of its products, such as Microsoft Access. To
run this recipe, you need a machine running Microsoft Windows with
working installations of other Microsoft add-ons such as OLEDB, ADO,
and the Jet database driver, which is often (though not correctly)
known as “the Access database”. (Microsoft Access is a product to
build database frontend applications, and it can work with other
database drivers, such as Microsoft SQL Server, not just with the
freely distributable and downloadable Microsoft Jet database drivers.)
Moreover, you need an installation of Mark Hammond’s PyWin32
package (formerly known as win32all
); the Python distribution known as
ActivePython, from ActiveState, comes with (among other things)
PyWin32
already
installed.
If you want to run this recipe specifically as an Apache CGI script, of course, you also need to install Apache and to place this script in the cgi-bin directory where Apache expects to find CGI scripts (the location of the cgi-bin directory depends on how you have installed Apache on your machine).
Make sure that the paths in the script are correct, depending on where, on your machine, you have installed the python.exe file you want to use, and the Northwind.mdb database you want to query. The paths indicated in the recipe correspond to default installations of Python 2.3 and the “Northwind” example database. If the script doesn’t work correctly, check the Apache error.log file, where you will find error messages that may help you find out what kind of error you’re dealing with.
To try the script, assuming that, for example, you have saved it
as cgi-bin/adoexample.py and that
your Apache server is running correctly, visit with any browser the
URL http://localhost/cgi-bin/adoexample.py. One
known limitation of the interface between Python and Jet databases
with ADO is on fields of type currency
: such fields are returned as some
strange tuples, rather than as plain numbers. This recipe does not
deal with that limitation.
Documentation for the Win32 API in PyWin32
(http://starship.python.net/crew/mhammond/win32/Downloads.html)
or ActivePython (http://www.activestate.com/ActivePython/);
Windows API documentation available from Microsoft (http://msdn.microsoft.com);
Mark Hammond and Andy Robinson, Python Programming on
Win32 (O’Reilly).
Credit: Brian Zhou
You’re writing a servlet in Jython, and you need to connect to a database server (such as Oracle, Sybase, Microsoft SQL Server, or MySQL) via JDBC.
The technique is basically the same for any kind of database, give or take a couple of statements. Here’s the code for when your database is Oracle:
import java, javax

class emp(javax.servlet.http.HttpServlet):
    def doGet(self, request, response):
        ''' a Servlet answers a Get query by writing to the response's
            output stream.  In this case we ignore the request, though
            in normal, non-toy cases that's where we get form input from.
        '''
        # we answer in plain text, so set the content type accordingly
        response.setContentType("text/plain")
        # get the output stream, use it for the query, then close it
        out = response.getOutputStream()
        self.dbQuery(out)
        out.close()
    def dbQuery(self, out):
        # connect to the Oracle driver, building an instance of it
        driver = "oracle.jdbc.driver.OracleDriver"
        java.lang.Class.forName(driver).newInstance()
        # get a connection to the Oracle driver w/given user and password
        server, db = "server", "ORCL"
        url = "jdbc:oracle:thin:@" + server + ":" + db
        usr, passwd = "scott", "tiger"
        conn = java.sql.DriverManager.getConnection(url, usr, passwd)
        # send an SQL query to the connection
        query = "SELECT EMPNO, ENAME, JOB FROM EMP"
        stmt = conn.createStatement()
        if stmt.execute(query):
            # get query results and print them to the output stream
            rs = stmt.getResultSet()
            while rs and rs.next():
                out.println(rs.getString("EMPNO"))
                out.println(rs.getString("ENAME"))
                out.println(rs.getString("JOB"))
                out.println()
        stmt.close()
        conn.close()
When your database is Sybase or Microsoft SQL Server, use the following (we won’t repeat the comments from the preceding Oracle example, since they apply identically here):
import java, javax

class titles(javax.servlet.http.HttpServlet):
    def doGet(self, request, response):
        response.setContentType("text/plain")
        out = response.getOutputStream()
        self.dbQuery(out)
        out.close()
    def dbQuery(self, out):
        driver = "sun.jdbc.odbc.JdbcOdbcDriver"
        java.lang.Class.forName(driver).newInstance()
        # Use "pubs" DB for mssql and "pubs2" for Sybase
        url = "jdbc:odbc:myDataSource"
        usr, passwd = "sa", "password"
        conn = java.sql.DriverManager.getConnection(url, usr, passwd)
        query = "select title, price, ytd_sales, pubdate from titles"
        stmt = conn.createStatement()
        if stmt.execute(query):
            rs = stmt.getResultSet()
            while rs and rs.next():
                out.println(rs.getString("title"))
                if rs.getObject("price"):
                    out.println("%2.2f" % rs.getFloat("price"))
                else:
                    out.println("null")
                if rs.getObject("ytd_sales"):
                    out.println(rs.getInt("ytd_sales"))
                else:
                    out.println("null")
                out.println(rs.getTimestamp("pubdate").toString())
                out.println()
        stmt.close()
        conn.close()
And here’s the code for when your database is MySQL:
import java, javax

class goosebumps(javax.servlet.http.HttpServlet):
    def doGet(self, request, response):
        response.setContentType("text/plain")
        out = response.getOutputStream()
        self.dbQuery(out)
        out.close()
    def dbQuery(self, out):
        driver = "org.gjt.mm.mysql.Driver"
        java.lang.Class.forName(driver).newInstance()
        server, db = "server", "test"
        usr, passwd = "root", "password"
        url = "jdbc:mysql://%s/%s?user=%s&password=%s" % (
            server, db, usr, passwd)
        conn = java.sql.DriverManager.getConnection(url)
        query = "select country, monster from goosebumps"
        stmt = conn.createStatement()
        if stmt.execute(query):
            rs = stmt.getResultSet()
            while rs and rs.next():
                out.println(rs.getString("country"))
                out.println(rs.getString("monster"))
                out.println()
        stmt.close()
        conn.close()    # close the connection, just as in the other variants
You might want to use different JDBC drivers and URLs, but you
can see that the basic technique is quite simple and straightforward.
This recipe’s code uses a content type of text/plain
because the recipe is about
accessing the database, not about formatting the data you get from it.
Obviously, you can change this content type to whichever is
appropriate for your application.
In each case, the basic technique is first to instantiate the
needed driver (whose package name, as a string, we place in variable
driver
) via the Java dynamic loading facility. The
forName
method of the java.lang.Class
class loads and provides the
relevant Java class, and that class’ newInstance
method ensures that the driver
we need is instantiated. Then, we can call the getConnection
method of java.sql.DriverManager
with the appropriate
URL (or username and password, where needed) and thus obtain a
connection object to place in the conn
variable. From
the connection object, we can create a statement object with the
createStatement
method and use it
to execute a query that we have in the query
string
variable with the execute
method.
If the query succeeds, we can obtain the results with the getResultSet
method. Finally, Oracle and
MySQL allow easy sequential navigation of the result set to present
all results, while Sybase and Microsoft SQL Server need a bit more
care. Overall, the procedure is similar in all cases.
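Condensed into a single generic skeleton, the procedure reads as follows (a sketch; the driver class, URL, credentials, and query are placeholders to replace with your engine’s specifics):
import java
driver = "some.jdbc.Driver"                     # engine-specific driver class
java.lang.Class.forName(driver).newInstance()   # load and instantiate it
conn = java.sql.DriverManager.getConnection(
    "jdbc:someprotocol://host/db", "user", "password")
stmt = conn.createStatement()
if stmt.execute("SELECT somefield FROM sometable"):
    rs = stmt.getResultSet()
    while rs and rs.next():
        print rs.getString("somefield")
stmt.close()
conn.close()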
The Jython site (http://www.jython.org); JDBC’s home page (http://java.sun.com/products/jdbc).
Credit: Zabil CM
Jython, just like Java, can access ODBC through the JDBC-ODBC Bridge, and Microsoft Excel can in turn be queried via ODBC:
from java import lang, sql
lang.Class.forName('sun.jdbc.odbc.JdbcOdbcDriver')
excel_file = 'values.xls'
connection = sql.DriverManager.getConnection(
    'jdbc:odbc:Driver={Microsoft Excel Driver (*.xls)};'
    'DBQ=%s;READONLY=true' % excel_file, '', '')
# Sheet1 is the name of the Excel workbook we want.  The field names for the
# query are implicitly set by the values for each column in the first row.
record_set = connection.createStatement().executeQuery(
    'SELECT * FROM [Sheet1$]')
# print the first-column field of every record (==row)
while record_set.next():
    print record_set.getString(1)
# we're done, close the recordset and connection
record_set.close()
connection.close()
This recipe is most easily used on Microsoft Windows, where installing and configuring ODBC, and the Microsoft Excel ODBC driver in particular, is best supported. However, with suitable commercial products, you can equally well use the recipe on an Apple Macintosh or just about any other Unix version on the planet.
Using ODBC rather than alternate ways to access Microsoft Excel
has one substantial advantage that is not displayed in this recipe:
with ODBC, you can use a broad subset of SQL. For example, you can
easily extract a subset of a workbook’s rows by adding a WHERE
clause, such as:
SELECT * FROM [Sheet1$] WHERE DEPARTMENT=9
Since all of the selection logic can be easily expressed in the
SQL string you pass to the executeQuery
method, this approach lends
itself particularly well to being encapsulated in a simple reusable
function.
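Such a function might look like the following sketch (Jython; the function name and its return convention are illustrative, not part of any library):
from java import lang, sql

def query_excel(excel_file, sql_query):
    ''' run an SQL query against an Excel file via the JDBC-ODBC Bridge,
        returning a list of rows, each a list of column values as strings '''
    lang.Class.forName('sun.jdbc.odbc.JdbcOdbcDriver')
    connection = sql.DriverManager.getConnection(
        'jdbc:odbc:Driver={Microsoft Excel Driver (*.xls)};'
        'DBQ=%s;READONLY=true' % excel_file, '', '')
    record_set = connection.createStatement().executeQuery(sql_query)
    n = record_set.getMetaData().getColumnCount()
    rows = []
    while record_set.next():
        rows.append([record_set.getString(i + 1) for i in range(n)])
    record_set.close()
    connection.close()
    return rows
For example, query_excel('values.xls', 'SELECT * FROM [Sheet1$] WHERE DEPARTMENT=9') returns only the rows of the selected department.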
If you’re coding in Classic Python (CPython) rather than Jython,
you can’t use JDBC, but you can use ODBC directly (typically in the DB
API-compliant way supported by mxODBC
, http://www.egenix.com/files/python/mxODBC.html)
to perform this recipe’s task in a similar way.
The Jython site (http://www.jython.org); JDBC’s home page (http://java.sun.com/products/jdbc); Recipe 12.7, for another way to access Excel data (by parsing the XML file that Excel can be asked to output).
[1] E.F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, 13, no. 6 (1970), pp. 377-87, http://www.acm.org/classics/nov95/toc.html.
[2] Microsoft Jet is commonly but erroneously known as the “Microsoft Access database.” Access is a product that Microsoft sells for designing and implementing database frontends; Jet is a backend that you may download for free from Microsoft’s web site.