Working with URLs

Uniform Resource Locators (URLs) are fundamental to the way in which the web operates, and are formally described in RFC 3986. A URL represents a resource on a given host. URLs can point to files on the server, or the resources may be dynamically generated when a request is received.

Python's standard library provides the urllib.parse module for working with URLs. Let's use it to break a URL into its component parts:

>>> from urllib.parse import urlparse
>>> result = urlparse('https://www.packtpub.com/tech/Python')
>>> result
ParseResult(scheme='https', netloc='www.packtpub.com', path='/tech/Python',
params='', query='', fragment='')

The urllib.parse.urlparse() function interprets our URL and recognizes https as the scheme, www.packtpub.com as the network location, and /tech/Python as the path.

We can access these components as attributes of the ParseResult, for example result.scheme, result.netloc, and result.path.
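As a minimal sketch of that attribute access, the snippet below parses the same URL and reads a few of its fields. The variable names are our own choice for illustration. Because ParseResult is a named tuple, positional access works as well:

```python
from urllib.parse import urlparse

result = urlparse('https://www.packtpub.com/tech/Python')

# Each URL component is available as a named attribute.
scheme = result.scheme    # 'https'
netloc = result.netloc    # 'www.packtpub.com'
path = result.path        # '/tech/Python'

# ParseResult is a named tuple, so index 0 is also the scheme.
first = result[0]

print(scheme, netloc, path)
```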

For almost all resources on the web, we'll be using the HTTP or HTTPS schemes. In these schemes, to locate a specific resource, we need to know the host that it resides on and the TCP port that we should connect to, and we also need to know the path to the resource on the host.
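To illustrate the host and port, here is a short sketch that uses the hostname and port attributes of the parsed result. The example.com URL, the port 8080, and the DEFAULT_PORTS mapping are assumptions made for this illustration. When a URL does not name a port, port is None and a client would normally fall back to the scheme's default:

```python
from urllib.parse import urlparse

# Hypothetical URL with an explicit port, used only for illustration.
explicit = urlparse('http://example.com:8080/index.html')
# The URL from the earlier example, which does not name a port.
implicit = urlparse('https://www.packtpub.com/tech/Python')

host = explicit.hostname    # 'example.com'
port = explicit.port        # 8080, returned as an int

# Our own helper mapping of well-known default ports (an assumption,
# not something urllib.parse provides).
DEFAULT_PORTS = {'http': 80, 'https': 443}

# implicit.port is None, so we fall back to the default for the scheme.
effective_port = implicit.port or DEFAULT_PORTS[implicit.scheme]

print(host, port, effective_port)
```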

The path in a URL is anything that comes after the host and the port. Paths always start with a forward slash (/), and when a slash appears on its own, it's called the root.

RFC 3986 also defines a URL component called the query string. It can contain additional parameters in the form of key-value pairs that appear after the path, separated from it by a question mark. In this example, we can see how to get the URL parameters through the query attribute:

>>> result = urlparse('https://search.packtpub.com/?query=python')
>>> result
ParseResult(scheme='https', netloc='search.packtpub.com', path='/', params='', query='query=python', fragment='')
>>> result.query
'query=python'

Query strings supply parameters to the resource that we wish to retrieve, and this usually customizes the resource in some way. In the previous example, our query string tells the packtpub search page that we want to run a search for the term python. The urllib.parse module has a function called parse_qs() that reads the query string and converts it into a dictionary that maps each key to a list of values:

>>> from urllib.parse import parse_qs
>>> result = urlparse('https://search.packtpub.com/?query=python')
>>> parse_qs(result.query)
{'query': ['python']}
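The values are lists because a key may appear more than once in a query string. The following sketch uses a made-up URL with a repeated query key (the extra parameters are assumptions for illustration) and also shows parse_qsl(), which returns the pairs in order as a list of tuples:

```python
from urllib.parse import urlparse, parse_qs, parse_qsl

# Hypothetical URL: the repeated 'query' key and the 'page' key are
# invented for this example.
url = 'https://search.packtpub.com/?query=python&query=networking&page=2'
query = urlparse(url).query

# parse_qs() groups repeated keys into a single list of values.
as_dict = parse_qs(query)

# parse_qsl() keeps every pair, in the order it appears in the URL.
as_pairs = parse_qsl(query)

print(as_dict)
print(as_pairs)
```

Note that every value comes back as a string, so a number such as the page would need to be converted with int() before use.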

The simplest way to build a query string is to use the urllib.parse.urlencode() function, which accepts a dictionary or a list of (key, value) tuples and generates the corresponding encoded string. It percent-encodes the keys and values where needed and then joins the pairs with &, so as to construct the query string:

>>> from urllib.parse import urlencode
>>> params = urlencode({"user": "user", "password": "password"})
>>> params
'user=user&password=password'
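The values above contain no special characters, so the encoding step is not visible. As a rough sketch, the following example uses values with a space and an ampersand (both invented for illustration) and then attaches the encoded string to a base URL with urlunparse():

```python
from urllib.parse import urlencode, urlparse, urlunparse, parse_qs

# Hypothetical parameters chosen to contain characters that need encoding.
params = {'query': 'python networking', 'tag': 'a&b'}

# By default, urlencode() turns spaces into '+' and '&' into '%26'.
encoded = urlencode(params)

# Attach the query string to a base URL. _replace() returns a copy of the
# named tuple with the query field swapped in.
base = urlparse('https://search.packtpub.com/')
full_url = urlunparse(base._replace(query=encoded))

# Decoding the string again gives back the original values.
decoded = parse_qs(encoded)

print(encoded)
print(full_url)
```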