Python directly exposes, supports, and documents many of its internal mechanisms. This may help you understand Python at an advanced level, and lets you hook your own code into such Python mechanisms, controlling them to some extent. For example, “Python built-ins” covers the way Python arranges for built-ins to be visible. This chapter covers some other advanced Python techniques; Chapter 16 covers issues specific to testing, debugging, and profiling. Other issues related to controlling execution are about using multiple threads and processes, covered in Chapter 14, and about asynchronous processing, covered in Chapter 18.
Python provides a specific “hook” to let each site customize some aspects of Python’s behavior at the start of each run. Customization by each single user is not enabled by default, but Python specifies how programs that want to run user-provided code at startup can explicitly support such customization (a rarely used facility).
Python loads the standard module site just before the main script. If Python is run with the option -S, Python does not load site. -S allows faster startup but saddles the main script with initialization chores. site's tasks are:
- Putting sys.path in standard form (absolute paths, no duplicates).
- Interpreting each .pth file found in the Python home directory, adding entries to sys.path, and/or importing modules, as each .pth file indicates.
- Adding built-ins used to print information in interactive sessions (exit, copyright, credits, license, and quit).
- In v2 only, setting the default Unicode encoding to 'ascii' (in v3, the default encoding is built-in as 'utf-8'). v2's site source code includes two blocks, each guarded by if 0:, one to set the default encoding to be locale-dependent, and the other to completely disable any default encoding between Unicode and plain strings. You may optionally edit site.py to select either block, but this is not a good idea, even though a comment in site.py says "if you're willing to experiment, you can change this."
- In v2 only, trying to import sitecustomize (should the import of sitecustomize raise an ImportError exception, site catches and ignores it). sitecustomize is the module that each site's installation can optionally use for further site-specific customization beyond site's tasks. It is best not to edit site.py itself, since any Python upgrade or reinstallation would overwrite such customizations.
- After sitecustomize is done, removing the attribute sys.setdefaultencoding from the sys module, so that the default encoding can't be changed.
Each interactive Python interpreter session starts by running the script named by the environment variable PYTHONSTARTUP. Outside of interactive interpreter sessions, there is no automatic per-user customization. To request per-user customization, a Python (v2 only) main script can explicitly import user. The v2 standard library module user, when loaded, first determines the user's home directory, as indicated by the environment variable HOME (or, failing that, HOMEPATH, possibly preceded by HOMEDRIVE, on Windows systems only). If the environment does not indicate a home directory, user uses the current directory. If the user module finds a file named .pythonrc.py in the indicated directory, user executes that file, with the built-in Python v2 function execfile, in user's own global namespace.
Scripts that don't import user do not run .pythonrc.py; no Python v3 script does, either, since the user module is not defined in v3. Of course, any given script is free to arrange other specific ways to run whatever user-specific startup module it requires. Such application-specific arrangements, even in v2, are more common than importing user. A generic .pythonrc.py, as loaded via import user, needs to be usable with any application that loads it. Specialized, application-specific startup files only need to follow whatever convention a specific application documents.
For example, your application MyApp.py could document that it looks for a file named .myapprc.py in the user's home directory, as indicated by the environment variable HOME, and loads it in the application's main script's global namespace. You could then have the following code in your main script:
import os
homedir = os.environ.get('HOME')
if homedir is not None:
    userscript = os.path.join(homedir, '.myapprc.py')
    if os.path.isfile(userscript):
        with open(userscript) as f:
            exec(f.read())
In this case, the .myapprc.py user customization script, if present, has to deal only with MyApp-specific user customization tasks. This approach is better than relying on the user module, and works just as well in v3 as it does in v2.
The atexit module lets you register termination functions (i.e., functions to be called at program termination, "last in, first out"). Termination functions are similar to clean-up handlers established by try/finally or with. However, termination functions are globally registered and get called at the end of the whole program, while clean-up handlers are established lexically and get called at the end of a specific try clause or with statement. Termination functions and clean-up handlers are called whether the program terminates normally or abnormally, but not when the program ends by calling os._exit (which is why you normally call sys.exit instead). The atexit module supplies a function called register:
register
  register(func, *a, **k) ensures that func(*a, **k) is called at program termination. Termination functions are called in reverse order of registration ("last in, first out"). register returns func, so you can also use it as a decorator.
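As a minimal sketch (the file names are made up for illustration), registering a couple of termination functions looks like this:

```python
import atexit

def close_log(name):
    # Runs automatically at normal interpreter shutdown (not after os._exit).
    print('closing', name)

atexit.register(close_log, 'app.log')    # extra arguments are passed at exit
atexit.register(close_log, 'audit.log')  # registered last, so called first
```

At normal interpreter exit, this prints closing audit.log, then closing app.log, in "last in, first out" order.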
Python's exec statement (built-in function, in v3) can execute code that you read, generate, or otherwise obtain during a program's run. exec dynamically executes a statement or a suite of statements. In v2, exec is a simple keyword statement with the following syntax:
exec code[ in globals[, locals]]
code can be a string, an open file-like object, or a code object. globals and locals are mappings. In v3, exec is a built-in function with the syntax:
exec(code, globals=None, locals=None)
code can be a string, bytes, or code object. globals is a dict; locals, any mapping.
If both globals and locals are present, they are the global and local namespaces in which code runs. If only globals is present, exec uses globals as both namespaces. If neither is present, code runs in the current scope.
Running exec in the current scope is a very bad idea: it can bind, rebind, or unbind any global name. To keep things under control, use exec, if at all, only with specific, explicit dictionaries.
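For instance, a minimal sketch of running dynamically obtained code against an explicit dictionary:

```python
namespace = {}                 # explicit dict serving as the code's globals
exec('x = 6 * 7', namespace)   # bindings land in namespace, not in your scope
print(namespace['x'])          # prints: 42
```

Note that exec also adds a __builtins__ entry to the dictionary it uses as globals.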
A frequently asked question about Python is "How do I set a variable whose name I just read or built?" Literally, for a global variable, exec allows this, but it's a bad idea. For example, if the name of the variable is in varname, you might think to use:
exec(varname + ' = 23')
Don't do this. An exec like this in the current scope makes you lose control of your namespace, leading to bugs that are extremely hard to find, and making your program unfathomably difficult to understand. Keep the "variables" that you need to set, not as variables, but as entries in a dictionary, say mydict. You could then use:
exec(varname + '=23', mydict)
While this is not quite as terrible as the previous example, it is still a bad idea. Keeping such "variables" as dictionary entries means that you don't have any need to use exec to set them. Just code:
mydict[varname] = 23
This way, your program is clearer, more direct, more elegant, and faster. There are some valid uses of exec, but they are extremely rare: just use explicit dictionaries instead.
Use exec only when it's really indispensable, which is extremely rare. Most often, it's best to avoid exec and choose more specific, well-controlled mechanisms: exec weakens your control of your code's namespace, can damage your program's performance, and exposes you to numerous hard-to-find bugs and huge security risks.
exec can execute an expression, because any expression is also a valid statement (called an expression statement). However, Python ignores the value returned by an expression statement. To evaluate an expression and obtain the expression's value, see the built-in function eval, covered in Table 7-2.
To make a code object to use with exec, call the built-in function compile with the last argument set to 'exec' (as covered in Table 7-2).
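For example, a sketch of compiling a string once and then executing the resulting code object repeatedly (the names here are illustrative):

```python
code = compile('result = a + b', '<generated>', 'exec')  # compile just once
for a, b in [(1, 2), (10, 20)]:
    ns = {'a': a, 'b': b}      # explicit dict as the namespace
    exec(code, ns)             # reuse the same code object each time
    print(ns['result'])        # prints: 3, then 30
```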
A code object c exposes many interesting read-only attributes whose names all start with 'co_', such as:
co_argcount
  Number of parameters of the function of which c is the code (0 when c is not the code object of a function, but rather is built directly by compile)
co_code
  A bytestring with c's bytecode
co_consts
  The tuple of constants used in c
co_filename
  The name of the file c was compiled from (the string that is the second argument to compile, when c was built that way)
co_firstlineno
  The initial line number (within the file named by co_filename) of the source code that was compiled to produce c, if c was built by compiling from a file
co_name
  The name of the function of which c is the code ('<module>' when c is not the code object of a function but rather is built directly by compile)
co_names
  The tuple of all identifiers used within c
co_varnames
  The tuple of local variables' identifiers in c, starting with parameter names
Most of these attributes are useful only for debugging purposes, but some may help advanced introspection, as exemplified later in this section.
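A small sketch of such introspection on an ordinary function:

```python
def average(a, b):
    total = a + b
    return total / 2

c = average.__code__
print(c.co_name)       # prints: average
print(c.co_argcount)   # prints: 2
print(c.co_varnames)   # prints: ('a', 'b', 'total') -- parameters come first
```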
If you start with a string that holds some statements, first use compile on the string, then call exec on the resulting code object; that's better than giving exec the string to compile and execute. This separation lets you check for syntax errors separately from execution-time errors. You can often arrange things so that the string is compiled once and the code object executes repeatedly, which speeds things up. eval can also benefit from such separation. Moreover, the compile step is intrinsically safe (both exec and eval are extremely risky if you execute them on code that you don't trust), and you may be able to perform some checks on the code object, before it executes, to lessen the risk (though never truly down to zero).
A code object has a read-only attribute co_names, which is the tuple of the names used in the code. For example, say that you want the user to enter an expression that contains only literal constants and operators: no function calls or other names. Before evaluating the expression, you can check that the string the user entered satisfies these constraints:
def safer_eval(s):
    code = compile(s, '<user-entered string>', 'eval')
    if code.co_names:
        raise ValueError('No names {!r} allowed in expression {!r}'
                         .format(code.co_names, s))
    return eval(code)
This function safer_eval evaluates the expression passed in as argument s only when the string is a syntactically valid expression (otherwise, compile raises SyntaxError) and contains no names at all (otherwise, safer_eval explicitly raises ValueError). (This is similar to the standard library function ast.literal_eval, covered in "Standard Input", but a bit more powerful, since it does allow the use of operators.)
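By contrast, ast.literal_eval accepts only literals and containers of literals; an operator expression is rejected:

```python
import ast

print(ast.literal_eval("[1, (2, 3), {'a': 4}]"))  # containers of literals: fine
try:
    ast.literal_eval('2 * (3 + 4)')  # general operators are not allowed
except ValueError:
    print('rejected: not a pure literal')
```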
Knowing what names the code is about to access may sometimes help you optimize the preparation of the dictionary that you need to pass to exec or eval as the namespace. Since you need to provide values only for those names, you may save work by not preparing other entries. For example, say that your application dynamically accepts code from the user, with the convention that variable names starting with data_ refer to files residing in the subdirectory data, which user-written code doesn't need to read explicitly. User-written code may in turn compute and leave results in global variables with names starting with result_, which your application writes back as files in the subdirectory data. Thanks to this convention, you may later move the data elsewhere (e.g., to BLOBs in a database instead of files in a subdirectory), and user-written code won't be affected. Here's how you might implement these conventions efficiently (in v3; in v2, use exec user_code in datadict instead of exec(user_code, datadict)):
def exec_with_data(user_code_string):
    user_code = compile(user_code_string, '<user code>', 'exec')
    datadict = {}
    for name in user_code.co_names:
        if name.startswith('data_'):
            with open('data/{}'.format(name[5:]), 'rb') as datafile:
                datadict[name] = datafile.read()
    exec(user_code, datadict)
    for name in datadict:
        if name.startswith('result_'):
            with open('data/{}'.format(name[7:]), 'wb') as datafile:
                datafile.write(datadict[name])
Old versions of Python tried to supply tools to ameliorate the risks of using exec and eval, under the heading of "restricted execution," but those tools were never entirely secure against the ingenuity of able hackers, and current versions of Python have therefore dropped them. If you need to ward against such attacks, take advantage of your operating system's protection mechanisms: run untrusted code in a separate process, with privileges as restricted as you can possibly make them (study the mechanisms that your OS supplies for the purpose, such as chroot, setuid, and jail), or run untrusted code in a separate, highly constrained virtual machine. To guard against "denial of service" attacks, have the main process monitor the separate one, and terminate the latter if and when resource consumption becomes excessive. Processes are covered in "Running Other Programs".
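One hedged sketch of the separate-process approach: a timeout alone is only a crude resource guard, not a security boundary, and real sandboxing still needs the OS-level restrictions described above.

```python
import subprocess
import sys

untrusted = 'print(sum(range(10)))'   # stands in for user-supplied code
proc = subprocess.run(
    [sys.executable, '-I', '-c', untrusted],  # -I: isolated mode
    capture_output=True, text=True,
    timeout=5,   # kill runaway code; not a substitute for OS protections
)
print(proc.stdout.strip())   # prints: 45
```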
The function exec_with_data is not at all safe against untrusted code: if you pass it, as the argument user_code_string, some string obtained in a way that you cannot entirely trust, there is essentially no limit to the amount of damage it might do. This is unfortunately true of just about any use of both exec and eval, except for those rare cases in which you can set very strict and checkable limits on the code to execute or evaluate, as was the case for the function safer_eval.
Some of the internal Python objects in this section are hard to use. Using such objects correctly and to good effect requires some study of your Python implementation's own C (or Java, or C#) sources. Such black magic is rarely needed, except to build general-purpose development tools and for similar wizardly tasks. Once you do understand things in depth, Python empowers you to exert control if and when needed. Since Python exposes many kinds of internal objects to your Python code, you can exert that control by coding in Python, even when an understanding of C (or Java, or C#) is needed to read Python's sources and understand what's going on.
The built-in type named type acts as a callable factory, returning objects that are types. Type objects don't have to support any special operations except equality comparison and representation as strings. However, most type objects are callable and return new instances of the type when called. In particular, built-in types such as int, float, list, str, tuple, set, and dict all work this way; specifically, when called without arguments, they return a new empty instance or, for numbers, one that equals 0. The attributes of the types module are the built-in types, each with one or more names. For example, in v2, types.DictType and types.DictionaryType both refer to type({}), also known as dict. In v3, types only supplies names for built-in types that don't already have a built-in name, as covered in Chapter 7. Besides being callable to generate instances, many type objects are also useful because you can inherit from them, as covered in "Classes and Instances".
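For example:

```python
# Calling a type object without arguments returns an "empty" instance
# (or, for numbers, one that equals 0):
print(int(), float(), list(), dict())   # prints: 0 0.0 [] {}

# type(x) returns x's type object, itself usable as a factory:
factory = type(42)
print(factory is int, factory('7'))     # prints: True 7
```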
Besides using the built-in function compile, you can get a code object via the __code__ attribute of a function or method object. (For the attributes of code objects, see "Compile and Code Objects".) Code objects are not callable, but you can rebind the __code__ attribute of a function object with the right number of parameters in order to wrap a code object into callable form. For example:
def g(x): print('g', x)
code_object = g.__code__
def f(x): pass
f.__code__ = code_object
f(23)                      # prints: g 23
Code objects that have no parameters can also be used with exec or eval. To create a new object, call the type object you want to instantiate. However, directly creating code objects requires many parameters; see Stack Overflow's unofficial docs on how to do it but, almost always, you're better off calling compile instead.
The function _getframe in the module sys returns a frame object from Python's call stack. A frame object has attributes giving information about the code executing in the frame and the execution state. The modules traceback and inspect help you access and display such information, particularly when an exception is being handled. Chapter 16 provides more information about frames and tracebacks, and covers the module inspect, which is the best way to perform such introspection.
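A tiny sketch of frame introspection with sys._getframe (note the leading underscore: this function is an implementation detail of CPython):

```python
import sys

def current_name():
    # _getframe(0) is the frame of this very call
    return sys._getframe(0).f_code.co_name

def caller_name():
    # _getframe(1) is the frame of whoever called us
    return sys._getframe(1).f_code.co_name

def outer():
    return current_name(), caller_name()

print(outer())   # prints: ('current_name', 'outer')
```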
Python's garbage collection normally proceeds transparently and automatically, but you can choose to exert some direct control. The general principle is that Python collects each object x at some time after x becomes unreachable: that is, when no chain of references can reach x by starting from a local variable of a function instance that is executing, nor from a global variable of a loaded module. Normally, an object x becomes unreachable when there are no references at all to x. In addition, a group of objects can be unreachable when they reference each other but no global or local variables reference any of them, even indirectly (such a situation is known as a mutual reference loop).
Classic Python keeps with each object x a count, known as a reference count, of how many references to x are outstanding. When x's reference count drops to 0, CPython immediately collects x. The function getrefcount of the module sys accepts any object and returns its reference count (at least 1, since getrefcount itself has a reference to the object it's examining). Other versions of Python, such as Jython or IronPython, rely on other garbage-collection mechanisms supplied by the platform they run on (e.g., the JVM or the MSCLR). The modules gc and weakref therefore apply only to CPython.
When Python garbage-collects x and there are no references at all to x, Python then finalizes x (i.e., calls x.__del__()) and makes the memory that x occupied available for other uses. If x held any references to other objects, Python removes the references, which in turn may make other objects collectable by leaving them unreachable.
The gc module exposes the functionality of Python's garbage collector. gc deals only with unreachable objects that are part of mutual reference loops. In such a loop, each object in the loop refers to others, keeping the reference counts of all objects positive. However, no outside references to any of the set of mutually referencing objects exist any longer. Therefore, the whole group, also known as cyclic garbage, is unreachable, and therefore garbage-collectable. Looking for such cyclic garbage loops takes time, which is why the module gc exists: to help you control whether and when your program spends that time. By default, "cyclic garbage collection" is enabled with some reasonable default parameters; however, by importing the gc module and calling its functions, you may choose to disable the functionality, change its parameters, and/or find out exactly what's going on in this respect.
gc exposes functions you can use to keep cyclic garbage-collection times under control. These functions can sometimes let you track down a memory leak (objects that are not getting collected even though there should be no more references to them) by helping you discover what other objects are in fact holding on to references to them:
collect
  Forces a full cyclic garbage collection run to happen immediately.
disable
  Suspends automatic, periodic cyclic garbage collection.
enable
  Reenables periodic cyclic garbage collection previously suspended with disable.
garbage
  A read-only attribute that lists the unreachable but uncollectable objects. This happens when any object in a cyclic garbage loop has a __del__ special method, as there may then be no demonstrably safe order for Python to finalize the objects in the loop.
get_debug
  Returns an int, the bit flags currently set for garbage-collection debugging.
get_objects
  Returns a list of all objects currently tracked by the cyclic garbage collector.
get_referrers
  Returns a list of all container objects, currently tracked by the cyclic garbage collector, that refer to any one or more of the arguments.
get_threshold
  Returns a three-item tuple of the current garbage-collection thresholds.
isenabled
  Returns True when automatic cyclic garbage collection is currently enabled; otherwise, False.
set_debug
  Sets debugging flags for garbage collection.
set_threshold
  Sets thresholds that control how often cyclic garbage-collection cycles run. A first threshold of 0 disables automatic cyclic garbage collection.
When you know there are no cyclic garbage loops in your program, or when you can't afford the delay of cyclic garbage collection at some crucial time, suspend automatic garbage collection by calling gc.disable(). You can enable collection again later by calling gc.enable(). You can test whether automatic collection is currently enabled by calling gc.isenabled(), which returns True or False. To control when time is spent collecting, you can call gc.collect() to force a full cyclic collection run to happen immediately. To wrap some time-critical code:
import gc
gc_was_enabled = gc.isenabled()
if gc_was_enabled:
    gc.collect()
    gc.disable()
# insert some time-critical code here
if gc_was_enabled:
    gc.enable()
Other functionality in the module gc is more advanced and rarely used, and can be grouped into two areas. The functions get_threshold and set_threshold and the debug flag DEBUG_STATS help you fine-tune garbage collection to optimize your program's performance. The rest of gc's functionality can help you diagnose memory leaks in your program. While gc itself can automatically fix many leaks (as long as you avoid defining __del__ in your classes, since the existence of __del__ can block cyclic garbage collection), your program runs faster if it avoids creating cyclic garbage in the first place.
Careful design can often avoid reference loops. However, at times you need objects to know about each other, and avoiding mutual references would distort and complicate your design. For example, a container has references to its items, yet it can often be useful for an object to know about a container holding it. The result is a reference loop: due to the mutual references, the container and items keep each other alive, even when all other objects forget about them. Weak references solve this problem by allowing objects to reference others without keeping them alive.
A weak reference is a special object w that refers to some other object x without incrementing x's reference count. When x's reference count goes down to 0, Python finalizes and collects x, then informs w of x's demise. The weak reference w can now either disappear or get marked as invalid in a controlled way. At any time, a given w refers either to the same object x as when w was created, or to nothing at all; a weak reference is never retargeted. Not all types of objects support being the target x of a weak reference w, but classes, instances, and functions do.
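For example, with weakref.ref, in CPython:

```python
import weakref

class Node:
    pass            # instances of ordinary classes support weak references

x = Node()
w = weakref.ref(x)
print(w() is x)     # prints: True -- the target is still alive
del x               # drop the only strong reference; CPython collects at once
print(w())          # prints: None -- w did not keep its target alive
```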
The weakref module exposes functions and types to create and manage weak references:
getweakrefcount
  Returns the number of weak references and proxies whose target is the argument.
getweakrefs
  Returns a list of all weak references and proxies whose target is the argument.
proxy
  Returns a weak proxy p with the argument x as the target. Using p is just like using x, except that, when you use p after x has been collected, Python raises ReferenceError. An optional second argument, a callable accepting one argument, becomes p's finalization callback, called when x is about to be finalized.
ref
  Returns a weak reference w with the argument x as the target. w is callable without arguments: calling w() returns x when x is still alive; otherwise, w() returns None. An optional second argument, a callable accepting one argument, becomes w's finalization callback, called when x is about to be finalized.
WeakKeyDictionary
  A mapping weakly referencing its keys: when the only references left to a key are weak ones, that item silently disappears from the mapping.
WeakValueDictionary
  A mapping weakly referencing its values: when the only references left to a value are weak ones, all items with that value silently disappear from the mapping.
WeakKeyDictionary lets you noninvasively associate additional data with some hashable objects, with no change to the objects. WeakValueDictionary lets you noninvasively record transient associations between objects, and build caches. In each case, use a weak mapping, rather than a dict, to ensure that an object that is otherwise garbage-collectable is not kept alive just by being used in a mapping.
A typical example is a class that keeps track of its instances, but does not keep them alive just in order to keep track of them:
import weakref

class Tracking(object):
    _instances_dict = weakref.WeakValueDictionary()
    def __init__(self):
        Tracking._instances_dict[id(self)] = self
    @classmethod
    def instances(cls):
        return cls._instances_dict.values()