Python’s standard library supplies several modules to simplify the use of Internet protocols, particularly on the client side (or for some simple servers). These days, the Python Package Index (PyPI) offers many more such packages. These third-party packages support a wider array of protocols, and several offer better APIs than the standard library’s equivalents. When you need to use a protocol that’s missing from the standard library, or covered by the standard library in a way you think is not satisfactory, be sure to search PyPI—you’re likely to find better solutions there.
In this chapter, we cover some standard library packages that may prove satisfactory for some uses of network protocols, especially simple ones: when you can code without requiring third-party packages, your application or library is easier to install on other machines. We also mention a few third-party packages covering important network protocols not included in the standard library. This chapter does not cover third-party packages using an asynchronous programming approach: for that kind of package, see “The asyncio Module (v3 Only)”.
For the specific, very frequent use case of HTTP1 clients and other network resources (such as anonymous FTP sites) best accessed via URLs,2 the standard library offers only complex support, incompatible between v2 and v3. For that case, therefore, we have chosen to cover and recommend the splendid third-party package requests, with its well-designed API and v2/v3 compatibility, instead of any standard library modules.
Most email today is sent via servers implementing the Simple Mail Transport Protocol (SMTP) and received via servers and clients implementing the Post Office Protocol version 3 (POP3) and/or the IMAP4 one (either in the original version, specified in RFC 1730, or the IMAP4rev1 one, specified in RFC 2060). These protocols, client-side, are supported by the Python standard library modules smtplib, poplib, and imaplib.
If you need to write a client that can connect via either POP3 or IMAP4, a standard recommendation would be to pick IMAP4, since it is definitely more powerful, and—according to Python’s own online docs—often more accurately implemented on the server side. Unfortunately, the standard library module for IMAP4, imaplib, is also inordinately complicated, and far too vast to cover in this book. If you do choose to go that route, use the online docs, inevitably complemented by the voluminous RFCs 1730 or 2060, and possibly other related RFCs, such as 5161 and 6855 for capabilities, and 2342 for namespaces. Using the RFCs, in addition to the online docs for the standard library module, can’t be avoided: many of the arguments passed in calls to imaplib functions and methods, and results from such calls, are strings with formats that are only documented in the RFCs, not in Python’s own online docs. Therefore, we don’t cover imaplib in this book; in the following sections, we only cover poplib and smtplib.
If you do want to use the rich and powerful IMAP4 protocol, we highly recommend that you do so, not directly via the standard library’s imaplib, but rather by leveraging the simpler, higher-abstraction-level third-party package IMAPClient, available with a pip install for both v2 and v3, and well documented online.
The poplib module supplies a class POP3 to access a POP mailbox. The specifications of the POP protocol are in RFC 1939.
POP3(host, port=110)
Returns an instance p of the class POP3, connected to the given host and port. The sister class POP3_SSL behaves the same way, but connects to the server (by default, on port 995) over a TLS channel, as required, for example, to connect to pop.gmail.com, the POP3 server for Gmail accounts.a

a To connect to a Gmail account in particular, you also need to configure that account to enable POP, “Allow less secure apps,” and avoid two-step verification—actions that in general we can’t recommend, since they weaken your email’s security.
Instance p supplies many methods, of which the most frequently used are the following (in each case, msgnum, the identifying number of a message, can be a string or an int):
dele(msgnum)
Marks message msgnum for deletion; the server performs the deletions only when you terminate the session by calling p.quit().

list(msgnum=None)
Returns a tuple (response, messages, octets), where response is the server’s response string; messages is a list of bytestrings, each of the form b'msgnum octets' (a message’s identifying number and its size in octets); and octets is the total size of the list. When msgnum is given, returns just the server’s response string for that one message, instead of a tuple.

pass_(password)
Sends the password to the server. Must be called after p.user(username). The trailing underscore in the name is needed because pass is a Python keyword.

quit()
Ends the session and tells the server to perform the deletions that were requested by calls to p.dele.

retr(msgnum)
Returns a three-item tuple (response, lines, bytes), where response is the server’s response string, lines is the list of all lines of message msgnum’s source as bytestrings, and bytes is the message’s total size in bytes.

set_debuglevel(debug_level)
Sets the debug level to the int debug_level: 0, the default, for no debugging; 1 for a modest amount of debugging output; 2 or more for a complete trace of all control information exchanged with the server.

stat()
Returns the pair (num_msgs, bytes), where num_msgs is the number of messages in the mailbox and bytes is the mailbox’s total size in bytes.

top(msgnum, maxlines)
Like retr, but returns only the headers of the message plus the first maxlines lines of its body.

user(username)
Sends the server the username; must be followed by a call to p.pass_.
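The methods listed above combine in a simple pattern; here is a minimal sketch of a POP3 client that logs in, checks the mailbox, and retrieves every message. The host, username, and password are hypothetical placeholders, so the connection code is kept inside a function rather than run at import time:

```python
import poplib

def fetch_all_messages(host, username, password):
    """Connect to a POP3 server and return each message's raw source as bytes."""
    p = poplib.POP3_SSL(host)       # use poplib.POP3(host) for a non-TLS server
    p.user(username)                # user() first...
    p.pass_(password)               # ...then pass_() (note the trailing underscore)
    num_msgs, total_bytes = p.stat()
    messages = []
    for msgnum in range(1, num_msgs + 1):   # POP3 numbers messages from 1
        response, lines, octets = p.retr(msgnum)
        messages.append(b'\r\n'.join(lines))
        # p.dele(msgnum)  # uncomment to mark for deletion (committed by quit)
    p.quit()                        # ends the session, commits any deletions
    return messages
```

Calling, say, fetch_all_messages('pop.example.com', 'me', 'secret') would return a list of raw RFC 2822 message bytestrings, ready to parse with the standard library email package.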
The smtplib module supplies a class SMTP to send mail to an SMTP server. The specifications of the SMTP protocol are in RFC 2821.
SMTP([host, port])
Returns an instance s of the class SMTP. When host (and, optionally, port) is given, implicitly calls s.connect(host, port). The sister class SMTP_SSL behaves the same way, but connects to the server over a TLS channel (by default, to port 465).
The instance s supplies many methods, of which the most frequently used are the following:
connect(host='127.0.0.1', port=25)
Connects to an SMTP server on the given host and port.

login(user, password)
Logs into the server with the given user and password. Needed only when the SMTP server requires authentication.

quit()
Terminates the SMTP session.

sendmail(from_addr, to_addrs, msg_string)
Sends mail message msg_string from the sender whose address is the string from_addr to each of the recipients whose addresses are the items of the list to_addrs. msg_string must be a complete message in RFC 2822 format: the headers, an empty line as separator, then the body. The from_addr and to_addrs arguments direct the mail transport, but do not add or change any headers within msg_string.
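These methods fit together in the same straightforward way; this sketch builds an RFC 2822 message with the standard library email package and sends it. The host and addresses are hypothetical, so the sending code is kept inside a function:

```python
import smtplib
from email.mime.text import MIMEText

def send_note(host, from_addr, to_addrs, subject, body):
    """Send a simple plain-text message via SMTP."""
    msg = MIMEText(body)            # builds a complete RFC 2822 message
    msg['From'] = from_addr
    msg['To'] = ', '.join(to_addrs)
    msg['Subject'] = subject
    s = smtplib.SMTP(host)          # implicitly connects to host, port 25
    try:
        s.sendmail(from_addr, to_addrs, msg.as_string())
    finally:
        s.quit()                    # always terminate the session
```

Note that the addresses passed to sendmail drive the delivery, while the From, To, and Subject headers are only what the recipient’s mail client displays.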
Most of the time, your code uses HTTP and FTP protocols through the higher-abstraction URL layer, supported by the modules and packages covered in the following sections. Python’s standard library also offers lower-level, protocol-specific modules that are less often used: for FTP clients, ftplib
; for HTTP clients, in v3, http.client
, and in v2, httplib
(we cover HTTP servers in Chapter 20). If you need to write an FTP server, look at the third-party module pyftpdlib
. Implementations of the brand-new HTTP/2 protocol are still at very early stages, but your best bet as of this writing is the third-party module hyper
. (Third-party modules, as usual, can be installed from the PyPI repository with the highly recommended tool pip
.) We do not cover any of these lower-level modules in this book, but rather focus on higher-abstraction, URL-level access throughout the following sections.
A URL is a type of URI (Uniform Resource Identifier). A URI is a string that identifies a resource (but does not necessarily locate it), while a URL locates a resource on the Internet. A URL is a string composed of several optional parts, called components: scheme, location, path, query, and fragment. (The second component is sometimes also known as a net location, or netloc for short.) A URL with all its parts looks like:
scheme://lo.ca.ti.on/pa/th?qu=ery#fragment
In https://www.python.org/community/awards/psf-awards/#october-2016, for example, the scheme is https, the location is www.python.org, the path is /community/awards/psf-awards/, there is no query, and the fragment is october-2016. Some punctuation is part of one of the components it separates; other punctuation characters are just separators, not part of any component. Omitting punctuation implies missing components. For example, in mailto:[email protected], the scheme is mailto, the path is [email protected], and there is no location, query, or fragment. The missing // means the URI has no location, the missing ? means it has no query, and the missing # means it has no fragment.
If the location ends with a colon followed by a number, this denotes a TCP port for the endpoint. Otherwise, the connection uses the “well-known port” associated with the scheme (e.g., port 80 for HTTP).
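You can verify this component breakdown directly with urlsplit, from the v3 standard library module urllib.parse (urlparse in v2), covered in the next section; the mailto: address below is a hypothetical example:

```python
from urllib.parse import urlsplit   # v2: from urlparse import urlsplit

parts = urlsplit(
    'https://www.python.org/community/awards/psf-awards/#october-2016')
print(parts.scheme)    # 'https'
print(parts.netloc)    # 'www.python.org'
print(parts.path)      # '/community/awards/psf-awards/'
print(parts.query)     # '' (no query)
print(parts.fragment)  # 'october-2016'

# A mailto: URI has a scheme and a path, but no location:
print(urlsplit('mailto:someone@example.com'))
# SplitResult(scheme='mailto', netloc='',
#             path='someone@example.com', query='', fragment='')
```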
The urllib.parse (in v3; urlparse in v2) module supplies functions for analyzing and synthesizing URL strings. The most frequently used of these functions are urljoin, urlsplit, and urlunsplit:
urljoin(base_url_string, relative_url_string)
Returns a URL string obtained by joining relative_url_string, which may be relative, with base_url_string, following the usual rules for relative URL resolution.

urlsplit(url_string, default_scheme='', allow_fragments=True)
Analyzes url_string and returns a tuple (actually, an instance of SplitResult, a named-tuple subclass) with five string items: scheme, netloc, path, query, and fragment. Missing components are '' (or default_scheme, for the scheme); when allow_fragments is False, the fragment is not split off, but remains part of the preceding component.

urlunsplit(url_tuple)
Returns the URL string corresponding to the five-item tuple url_tuple, such as one returned by urlsplit. urlunsplit(urlsplit(x)) returns a normalized form of URL string x. In this case, the normalization ensures that redundant separators, such as a trailing ? in x with no query following it, do not appear in the result.
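A few interactive-style examples may make these three functions concrete; the host names are hypothetical:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# urljoin resolves a possibly relative URL against a base URL:
print(urljoin('http://host.example.com/a/b/c', 'd/e'))
# 'http://host.example.com/a/b/d/e'
print(urljoin('http://host.example.com/a/b/c', '/d/e'))
# 'http://host.example.com/d/e'

# urlunsplit is the inverse of urlsplit; round-tripping normalizes
# away redundant separators, such as a trailing '?' with no query:
print(urlunsplit(urlsplit('http://host.example.com/path?')))
# 'http://host.example.com/path'
```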
The third-party requests package (very well documented online) supports both v2 and v3, and it’s how we recommend you access HTTP URLs. As usual for third-party packages, it’s best installed with a simple pip install requests. In this section, we summarize how best to use it for reasonably simple cases.
Natively, requests only supports the HTTP transport protocol; to access URLs using other protocols, you need to also install other third-party packages (known as protocol adapters), such as requests-ftp for FTP URLs, or others supplied as part of the rich, useful requests-toolbelt package of requests utilities.
requests’ functionality hinges mostly on three classes it supplies: Request, modeling an HTTP request to be sent to a server; Response, modeling a server’s HTTP response to a request; and Session, offering continuity across multiple requests, also known as a session. For the common use case of a single request/response interaction, you don’t need continuity, so you may often just ignore Session.
Most often, you don’t need to explicitly consider the Request class: rather, you call the utility function request, which internally prepares and sends the Request, and returns the Response instance. request has two mandatory positional arguments, both strs: method, the HTTP method to use, then url, the URL to address; and then, many optional named parameters (in the next section, we cover the most commonly used named parameters to the request function).
For further convenience, the requests module also supplies functions whose names are the HTTP methods delete, get, head, options, patch, post, and put; each takes a single mandatory positional argument, url, then the same optional named arguments as the function request.
When you want some continuity across multiple requests, call Session to make an instance s, then use s’s methods request, get, post, and so on, which are just like the functions with the same names directly supplied by the requests module (however, s’s methods merge s’s settings with the optional named parameters to prepare the request to send to the given url).
The function request (just like the functions get, post, and so on—and methods with the same names on an instance s of class Session) accepts many optional named parameters—refer to the requests package’s excellent online docs for the full set, if you need advanced functionality such as control over proxies, authentication, special treatment of redirection, streaming, cookies, and so on. The most frequently used named parameters are:
data
A dict, a sequence of key/value pairs, a bytestring, or a file-like object, to use as the body of the request

headers
A dict of HTTP headers to send with the request

json
Python data (usually a dict) to encode as JSON as the body of the request

files
A dict with names as keys, and file-like objects, or file tuples, as values, used with the POST method to specify a multipart-encoding file upload; we cover the format of values for files= in the next section

params
A dict, or a bytestring, to send as the query string with the request

timeout
A float number of seconds, the maximum time to wait for the response before raising an exception
data, json, and files are mutually incompatible ways to specify a body for the request; use only one of them, and only for HTTP methods that do use a body, namely PATCH, POST, and PUT. The one exception is that you can have both data= passing a dict, and files=, and that is very common usage: in this case, both the key/value pairs in the dict, and the files, form the body of the request as a single multipart/form-data whole, according to RFC 2388.
When you specify the request’s body with json=, or data= passing a bytestring or a file-like object (which must be open for reading, usually in binary mode), the resulting bytes are directly used as the request’s body; when you specify it with data= (passing a dict, or a sequence of key/value pairs), the body is built as a form, from the key/value pairs formatted in application/x-www-form-urlencoded format, according to the relevant web standard.
When you specify the request’s body with files=, the body is also built as a form, in this case with the format set to multipart/form-data (the only way to upload files in a PATCH, POST, or PUT HTTP request). Each file you’re uploading is formatted into its own part of the form; if, in addition, you want the form to give the server further nonfile parameters, then in addition to files= also pass a data= with a dict value (or a sequence of key/value pairs) for the further parameters—those parameters get encoded into a supplementary part of the multipart form.
To offer you a lot of flexibility, the value of the files= argument can be a dict (its items are taken as an arbitrary-order sequence of name/value pairs), or a sequence of name/value pairs (in the latter case, the sequence’s order is maintained in the resulting request body).
Either way, each value in a name/value pair can be a str (or, best,3 a bytes or bytearray) to be used directly as the uploaded file’s contents; or, a file-like object open for reading (then, requests calls .read() on it, and uses the result as the uploaded file’s contents; we strongly recommend that, in such cases, you open the file in binary mode to avoid any ambiguity regarding content-length). When any of these conditions apply, requests uses the name part of the pair (e.g., the key into the dict) as the file’s name (unless it can improve on that because the open file object is able to reveal its underlying filename), takes its best guess at a content type, and uses minimal headers for the file’s form-part.
Alternatively, the value in each name/value pair can be a tuple with two to four items: fn, fp, [ft, [fh]] (using square brackets as meta-syntax to indicate optional parts). In this case, fn is the file’s name, fp provides the contents (in just the same way as in the previous paragraph), the optional ft provides the content type (if missing, requests guesses it, as in the previous paragraph), and the optional dict fh provides extra headers for the file’s form-part.
In practical applications, you don’t usually need to consider the internal instance r of the class requests.Request, which functions like requests.post are building, preparing, and then sending on your behalf. However, to understand exactly what requests is doing, working at a lower level of abstraction (building, preparing, and examining r—no need to send it!) is instructive. For example:
import requests
r = requests.Request('GET', 'http://www.example.com',
                     data={'foo': 'bar'}, params={'fie': 'foo'})
p = r.prepare()
print(p.url)
print(p.headers)
print(p.body)
prints out (splitting the p.headers dict’s printout for readability):
http://www.example.com/?fie=foo
{'Content-Length': '7',
'Content-Type': 'application/x-www-form-urlencoded'}
foo=bar
Similarly, when files= is involved:
import requests
r = requests.Request('POST', 'http://www.example.com',
                     data={'foo': 'bar'}, files={'fie': 'foo'})
p = r.prepare()
print(p.headers)
print(p.body)
prints out (with several lines split for readability):
{'Content-Type': 'multipart/form-data;
boundary=5d1cf4890fcc4aa280304c379e62607b',
'Content-Length': '228'}
b'--5d1cf4890fcc4aa280304c379e62607b Content-Disposition: form-
data; name="foo" bar --5d1cf4890fcc4aa280304c379e62607b
Content-Disposition: form-data; name="fie"; filename="fie"
foo --5d1cf4890fcc4aa280304c379e62607b-- '
Happy interactive exploring!
The one class from the requests module that you always have to consider is Response: every request, once sent to the server (typically, that’s done implicitly by methods such as get), returns an instance r of requests.Response.
The first thing you usually want to do is to check r.status_code, an int that tells you how the request went, in typical “HTTPese”: 200 means “everything’s fine,” 404 means “not found,” and so on. If you’d rather just get an exception for status codes indicating some kind of error, call r.raise_for_status(); that does nothing if the request went fine, but raises a requests.exceptions.HTTPError otherwise. (Other exceptions, not corresponding to any specific HTTP status code, can and do get raised without requiring any such explicit call: e.g., requests.exceptions.ConnectionError for any kind of network problem, or requests.exceptions.Timeout for a timeout.)
Next, you may want to check the response’s HTTP headers: for that, use r.headers, a dict (with the special feature of having case-insensitive string-only keys, indicating the header names as listed, e.g., in Wikipedia, per HTTP specs). Most headers can be safely ignored, but sometimes you’d rather check. For example, you may check whether the response specifies which natural language its body is written in, via r.headers.get('content-language'), to offer the user different presentation choices, such as the option to use some kind of natural-language translation service to make the response more acceptable to the user.
You don’t usually need to make specific status or header checks for redirects: by default, requests automatically follows redirects for all methods except HEAD (you can explicitly pass the allow_redirects named parameter in the request to alter that behavior). If you do allow redirects, you may optionally want to check r.history, a list of all Response instances accumulated along the way, oldest to newest, up to but excluding r itself (so, in particular, r.history is empty if there have been no redirects).
In most cases, perhaps after some checks on status and headers, you’ll want to use the response’s body. In simple cases, you’ll just access the response’s body as a bytestring, r.content, or decoded via JSON (once you’ve determined that’s how it’s encoded, e.g., via r.headers.get('content-type')) by calling r.json().
Often, you’d rather access the response’s body as (Unicode) text, with the property r.text. The latter gets decoded (from the octets that actually make up the response’s body) with the codec requests thinks is best, based on the content-type header and a cursory examination of the body itself. You can check what codec has been used (or is about to be used) via the attribute r.encoding, the name of a codec registered with the codecs module, covered in “The codecs Module”. You can even override the choice of codec to use by assigning to r.encoding the name of the codec you choose.
We do not cover other advanced issues, such as streaming, in this book; if you need information on that, check requests’ online docs.
Beyond urllib.parse, covered in “The urllib.parse (v3) / urlparse (v2) modules”, the urllib package in v3 supplies the module urllib.robotparser for the specific purpose of parsing a site’s robots.txt file as documented in a well-known informal standard (in v2, use the standard library module robotparser); the module urllib.error, containing all exception types raised by other urllib modules; and, mainly, the module urllib.request, for opening and reading URLs.
The functionality supplied by v3’s urllib.request is parallel to that of v2’s urllib2 module, covered in “The urllib2 Module (v2)”, plus some functionality from v2’s urllib module, covered in “The urllib Module (v2)”. For coverage of urllib.request, check the online docs, supplemented with Michael Foord’s HOWTO.
The urllib module, in v2, supplies functions to read data from URLs. urllib supports the following protocols (schemes): http, https, ftp, gopher, and file. file indicates a local file. urllib uses file as the default scheme for URLs that lack an explicit scheme. You can find simple, typical examples of urllib use in Chapter 22, where urllib.urlopen is used to fetch HTML and XML pages that all the various examples parse and analyze.
The module urllib in v2 supplies a number of functions, described in Table 19-1, with urlopen being the simplest and most frequently used.
quote(str, safe='/')
Returns a copy of str where special characters are changed into Internet-standard %xx sequences. Never transforms alphanumeric characters, the characters _,.-, or any of the characters in string safe.

quote_plus(str, safe='/')
Like quote, but also changes spaces into plus signs.

unquote(str)
Returns a copy of str where each %xx sequence is changed back into the corresponding character.

unquote_plus(str)
Like unquote, but also changes plus signs into spaces.

urlcleanup()
Clears the cache of the function urlretrieve, covered below.

urlencode(query, doseq=False)
Returns a string with the URL-encoded form of query, which can be either a sequence of (key, value) pairs, or a mapping: the result is made of key=value pairs joined by & characters. The order of items in a dictionary is arbitrary: should you need the URL-encoded form to have key/value pairs in a specific order, use a sequence as the query argument. When doseq is true, any value in query that is a sequence is encoded as separate key=item pairs, one per item of the value; when doseq is false (the default), each value is encoded as the quoted form of its str.

urlopen(urlstring, data=None)
Accesses the given URL and returns a read-only file-like object f: f supplies file-like methods such as read and readline, plus a method f.info() returning information about the URL’s headers. When data is given, it must be a URL-encoded string, and the request is sent as a POST with data as its body (for HTTP URLs only).

urlretrieve(urlstring, filename=None, reporthook=None, data=None)
Similar to urlopen(urlstring, data), but instead of returning a file-like object, downloads the URL’s data to a local file and returns a pair (filename, headers): filename is the path to the local file, and headers is the information about the URL’s headers. When filename is None, the data goes to a temporary local file (and it’s up to you to remove it when you’re done). When reporthook is given, it must be a callable that urlretrieve calls with three arguments (blocks transferred so far, block size in bytes, total file size) so you can, e.g., display a progress indicator.
You normally use the module urllib in v2 through the functions it supplies (most often urlopen). To customize urllib’s functionality, however, you can subclass urllib’s FancyURLopener class and bind an instance of your subclass to the attribute _urlopener of the module urllib. The customizable aspects of an instance f of a subclass of FancyURLopener are the following:
prompt_user_passwd(host, realm)
Returns a pair (user, password) to use to authenticate in the given security realm on the given host. The default implementation prompts the user for this data interactively; your subclass can override this method, e.g., to obtain the data from some noninteractive source.

version
The string that f sends to servers as the User-Agent header to identify itself.
The urllib2 module is a rich, highly customizable superset of the urllib module. urllib2 lets you work directly with advanced aspects of protocols such as HTTP. For example, you can send requests with customized headers as well as URL-encoded POST bodies, and handle authentication in various realms, in both Basic and Digest forms, directly or via HTTP proxies.
In the rest of this section, we cover only the ways in which v2’s urllib2 lets your program customize these advanced aspects of URL retrieval. We do not try to impart the advanced knowledge of HTTP and other network protocols, independent of Python, that you need to make full use of urllib2’s rich functionality.
urllib2 supplies a function urlopen that is basically identical to urllib’s urlopen. To customize urllib2, install, before calling urlopen, any number of handlers, grouped into an opener object, using the build_opener and install_opener functions.
You can also optionally pass to urlopen an instance of the class Request instead of a URL string. Such an instance may include both a URL string and supplementary information about how to access it, as covered in “The Request class”.
build_opener(*handlers)
Creates and returns an instance of the class OpenerDirector (covered in “The OpenerDirector class”) with the given handlers. Each handler can be a subclass of the class BaseHandler, instantiable without arguments, or an instance of such a subclass. build_opener also adds, in front of the handlers you specify, instances of various default handler classes, to deal with proxies, unknown schemes, the basic schemes (such as http, https, ftp, and file), HTTP errors, and HTTP redirects; when any handler you pass subclasses a default handler class, the corresponding default is not added.

install_opener(opener)
Installs opener as the opener that further calls to urlopen use. opener can be an instance of OpenerDirector, such as the result of a call to build_opener, or any signature-compatible object.

urlopen(url, data=None)
Almost identical to the urlopen function in the module urllib, but uses the currently installed opener, and is thus affected by the handlers grouped into that opener.
You can optionally pass to the function urlopen an instance of the class Request instead of a URL string. Such an instance can embody both a URL and, optionally, other information about how to access the target URL.
Request(url_string, data=None, headers={})
Returns an instance r of the class Request, for the given URL. Passing the optional data argument is just like calling r.add_data(data) right after instantiation. headers must be a dict of header names and values to add to r. When r has no data, its HTTP method is 'GET'; otherwise, it’s 'POST'.

An instance r of the class Request supplies the following methods:
add_data(data)
Sets data as r’s data. Despite its name, the method does not add data to any previously set data: it replaces whatever data r previously had. Setting data also makes r’s HTTP method 'POST'.

add_header(key, value)
Adds a header with the given key and value to r’s headers. In HTTP, header keys are case-insensitive; when r’s headers already include one with the given key, its previous value is replaced.

add_unredirected_header(key, value)
Like add_header, but the header is added only to the original request, not to any further requests that may follow from HTTP redirects.

get_data()
Returns the data of r, either None or the string that will be the request’s body.

get_full_url()
Returns the URL of r, as given to the constructor.

get_host()
Returns the host component of r’s URL.

get_method()
Returns the HTTP method of r, either of the strings 'GET' or 'POST'.

get_selector()
Returns the selector component of r’s URL (the path and all following components).

get_type()
Returns the scheme component of r’s URL (e.g., the string 'http').

has_data()
Like r.get_data() is not None.

has_header(key)
Returns True when r has a header with the given key; otherwise, False.

set_proxy(host, scheme)
Sets r to use a proxy at the given host and scheme for accessing its URL.
An instance d of the class OpenerDirector collects instances of handler classes and orchestrates their use to open URLs of various schemes and to handle errors. Normally, you create d by calling the function build_opener and then install it by calling the function install_opener. For advanced uses, you may also access various attributes and methods of d, but this is a rare need and we do not cover it further in this book.
The urllib2 module supplies a class called BaseHandler to use as the superclass of any custom handler classes you write. urllib2 also supplies many concrete subclasses of BaseHandler that handle the schemes gopher, ftp, http, https, and file, as well as authentication, proxies, redirects, and errors. Writing custom urllib2 handlers is an advanced topic, and we do not cover it further in this book.
urllib2’s default opener does no authentication. To get authentication, call build_opener to build an opener with instances of HTTPBasicAuthHandler, ProxyBasicAuthHandler, HTTPDigestAuthHandler, and/or ProxyDigestAuthHandler, depending on whether you want authentication to be directly in HTTP or to a proxy, and on whether you need Basic or Digest authentication.
To instantiate each of these authentication handlers, use an instance x of the class HTTPPasswordMgrWithDefaultRealm as the only argument to the authentication handler’s constructor. You normally use the same x to instantiate all the authentication handlers you need. To record users and passwords for given authentication realms and URLs, call x.add_password one or more times.
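A minimal sketch of this recipe follows. The same class names also exist in v3’s urllib.request, so the sketch is spelled with that module so it runs under v3; under v2, substitute import urllib2 and use the identical class names from urllib2. The URL, user, and password are hypothetical placeholders:

```python
import urllib.request as ur  # v2: import urllib2, same class names

# One password manager, shared by all the authentication handlers;
# realm None means "default realm" for this URL prefix.
x = ur.HTTPPasswordMgrWithDefaultRealm()
x.add_password(None, 'http://www.example.com/private/', 'alice', 'secret')

# Build an opener with Basic authentication, both direct and via proxy,
# and install it so further urlopen calls use it.
opener = ur.build_opener(
    ur.HTTPBasicAuthHandler(x),
    ur.ProxyBasicAuthHandler(x),
)
ur.install_opener(opener)
# ur.urlopen('http://www.example.com/private/data')  # would now authenticate
```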
Many, many other network protocols are in use—a few are best supported by Python’s standard library, but, for most of them, you’ll be happier researching third-party modules on PyPI.
To connect as if you were logging into another machine (or into a separate login session on your own node), you can use the secure SSH protocol, supported by the third-party module paramiko, or the higher-abstraction-layer wrapper around it, the third-party module spur. (You can also, with some likely security risks, still use classic Telnet, supported by the standard library module telnetlib.)
Other network protocols include:
NNTP, to access the somewhat-old Usenet News servers, supported by the standard library module nntplib

XML-RPC, for a rudimentary remote procedure call functionality, supported by xmlrpc.client (xmlrpclib in v2)

gRPC, for a more modern and advanced remote procedure call functionality, supported by the third-party module grpcio

NTP, to get precise time off the network, supported by the third-party module ntplib

SNMP, for network management, supported by the third-party module pysnmp
…among many others. No single book, including this one, could possibly cover all these protocols and their supporting modules. Rather, our best suggestion in the matter is a strategic one: whenever you decide that your application needs to interact with some other system via a certain networking protocol, don’t rush to implement your own modules to support that protocol. Instead, search and ask around, and you’re likely to find excellent existing Python modules already supporting that protocol.4
Should you find some bug or missing feature in such modules, open a bug or feature request (and, ideally, supply a patch or pull request that would fix the problem and satisfy your application’s needs!). In other words, become an active member of the open source community, rather than just a passive user: you will be welcome there, meet your own needs, and help many other people in the process. “Give forward,” since you cannot “give back” to all the awesome people who contributed to give you most of the tools you’re using!
1 HTTP, the Hypertext Transfer Protocol, is the core protocol of the World Wide Web: every web server and browser uses it, and it has become the dominant application-level protocol on the Internet today.
2 Uniform Resource Locators
3 As it gives you complete, explicit control of exactly what octets are uploaded.
4 Even more important: if you think you need to invent a brand-new protocol and implement it on top of sockets, think again, and search carefully: it’s far more likely that one or more of the huge number of existing Internet protocols meets your needs just fine!