A cookie is a small piece of data that the server sends in a Set-Cookie header as a part of the response. The client stores cookies locally and includes them in any future requests that are sent to the server.
Servers use cookies in various ways. They can add a unique ID to a cookie, which enables the server to track a client as it accesses different areas of a site. They can store a login token, which will automatically log the client in, even if the client leaves the site and then accesses it later. Cookies can also be used for storing the client's user preferences or snippets of personalizing information, and so on.
Cookies are necessary because the server has no other way of tracking a client between requests. HTTP is called a stateless protocol. It doesn't contain an explicit mechanism for a server to know for sure that two requests have come from the same client. Without cookies to allow the server to add some uniquely identifying information to the requests, things such as shopping carts (which were the original problem that cookies were developed to solve) would become impossible to build, because the server would not be able to determine which basket goes with which request.
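To make the exchange described above concrete, here is a minimal sketch that uses the standard library's http.cookies module to parse the value of a Set-Cookie header and to build the Cookie header that a client would send back. The header value and the session_id name are made up for illustration; they don't come from any real site.

```python
from http.cookies import SimpleCookie

# Hypothetical value of a Set-Cookie header (illustration only).
set_cookie_value = 'session_id=abc123; Path=/; HttpOnly'

# Parse the header value into a name/value pair plus its attributes.
parsed = SimpleCookie()
parsed.load(set_cookie_value)
morsel = parsed['session_id']

# Only the name and value go back to the server. Attributes such as
# Path stay on the client and decide when the cookie is sent.
cookie_header = '; '.join(
    '{}={}'.format(name, m.value) for name, m in parsed.items()
)

print(morsel.value)    # abc123
print(morsel['path'])  # /
print(cookie_header)   # session_id=abc123
```

The server recognizes a returning client by looking for that name and value in the Cookie header of each later request.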
We may need to handle cookies in Python because without them, some sites don't behave as expected. When using Python, we may also want to access the parts of a site which require a login, and the login sessions are usually maintained through cookies.
We're going to discuss how to handle cookies with urllib. First, we need to create a place for storing the cookies that the server will send us:
>>> from http.cookiejar import CookieJar
>>> cookie_jar = CookieJar()
Next, we build something called a urllib opener. This will automatically extract the cookies from the responses that we receive and then store them in our cookie jar:
>>> from urllib.request import build_opener, HTTPCookieProcessor
>>> opener = build_opener(HTTPCookieProcessor(cookie_jar))
Then, we can use our opener to make an HTTP request:
>>> opener.open('http://www.github.com')
Lastly, we can check that the server has sent us some cookies:
>>> len(cookie_jar)
2
Whenever we use opener to make further requests, the HTTPCookieProcessor functionality will check our cookie_jar to see if it contains any cookies for that site and then it will automatically add them to our requests. It will also add any further cookies that are received to the cookie jar.
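The round trip can be observed without touching a real site. The following sketch starts a throwaway HTTP server on the local machine that sets one cookie and records the Cookie header of every request it receives. The handler class, the visitor=42 cookie, and the received list are all made up for this illustration.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener

received = []  # Cookie header seen by the test server, one entry per request


class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: records incoming cookies and sets one."""

    def do_GET(self):
        received.append(self.headers.get('Cookie'))
        self.send_response(200)
        self.send_header('Set-Cookie', 'visitor=42; Path=/')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:{}/'.format(server.server_port)

cookie_jar = CookieJar()
# ProxyHandler({}) keeps the requests away from any configured proxy,
# so that they reach the local test server directly.
opener = build_opener(ProxyHandler({}), HTTPCookieProcessor(cookie_jar))

opener.open(url).close()  # first request: nothing to send yet
opener.open(url).close()  # second request: the stored cookie goes back

server.shutdown()
server.server_close()

print(received)         # [None, 'visitor=42']
print(len(cookie_jar))  # 1
```

The first request carries no Cookie header, while the second one returns the cookie that the first response set, which is the behavior that HTTPCookieProcessor adds to the opener.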
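The http.cookiejar module also contains a FileCookieJar class, which works in the same way as CookieJar but provides additional methods for saving the cookies to a file and loading them back. This allows persistence of cookies across Python sessions. In practice, we use one of its subclasses, MozillaCookieJar or LWPCookieJar, because FileCookieJar itself leaves the file format unimplemented.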
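As a rough sketch of file persistence, the following example saves a cookie with LWPCookieJar, one of the concrete FileCookieJar subclasses in the standard library, and then loads it into a fresh jar. The cookie is built by hand so that the example runs without a network connection; the make_cookie helper, the theme cookie, and the example.com domain are assumptions made for illustration.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value, domain, expires):
    """Build a Cookie by hand; normally the opener creates these for us."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=expires, discard=False,
        comment=None, comment_url=None, rest={},
    )


cookie_file = os.path.join(tempfile.mkdtemp(), 'cookies.txt')

# Store a cookie that expires in an hour, then write the jar to disk.
# By default, save() skips session cookies and expired cookies.
jar = LWPCookieJar(cookie_file)
jar.set_cookie(make_cookie('theme', 'dark', '.example.com',
                           int(time.time()) + 3600))
jar.save()

# A later Python session would start here: load the saved cookies.
restored = LWPCookieJar(cookie_file)
restored.load()

print([(c.name, c.value) for c in restored])  # [('theme', 'dark')]
```

A jar created this way can be passed to HTTPCookieProcessor exactly like a plain CookieJar.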
It's worth looking at the properties of cookies in more detail. Let's examine the cookies that GitHub sent us in the preceding section.
To do this, we need to pull the cookies out of the cookie jar. The CookieJar class doesn't let us access them directly, but it supports the iterator protocol. So, a quick way of getting them is to create a list from it:
>>> cookies = list(cookie_jar)
>>> cookies
[Cookie(version=0, name='logged_in', value='no', ...),
 Cookie(version=0, name='_gh_sess', value='eyJzZxNzaW9uX...', ...)
]
You can see that we have two Cookie objects. Now, let's pull out some information from the first one:
>>> cookies[0].name
'logged_in'
>>> cookies[0].value
'no'
The cookie's name allows the server to quickly reference it. This cookie is clearly a part of the mechanism that GitHub uses for finding out whether we've logged in yet. Next, let's do the following:
>>> cookies[0].domain
'.github.com'
>>> cookies[0].path
'/'
The domain and the path are the areas for which this cookie is valid, so our urllib opener will include this cookie in any request that it sends to github.com and its subdomains, such as www.github.com, where the path is anywhere below the root.
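We can check the domain and path rules offline by asking a cookie jar which cookies it would attach to a given request. This is a minimal sketch: the make_cookie and cookies_sent_to helpers, the cookie names, and the example.com and example.org hosts are all made up for illustration, and no request is actually sent.

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def make_cookie(name, value, domain, path):
    """Build a Cookie by hand; normally the opener creates these for us."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookies_sent_to(jar, url):
    """Return the sorted names of the cookies the jar would send to url."""
    request = Request(url)          # built but never sent
    jar.add_cookie_header(request)  # adds a Cookie header if any match
    header = request.get_header('Cookie')
    if header is None:
        return []
    return sorted(pair.split('=', 1)[0] for pair in header.split('; '))


jar = CookieJar()
jar.set_cookie(make_cookie('logged_in', 'no', '.example.com', '/'))
jar.set_cookie(make_cookie('cart', '3', '.example.com', '/shop'))

print(cookies_sent_to(jar, 'http://www.example.com/'))             # ['logged_in']
print(cookies_sent_to(jar, 'http://www.example.com/shop/basket'))  # ['cart', 'logged_in']
print(cookies_sent_to(jar, 'http://www.example.org/'))             # []
```

A cookie is only attached when both its domain and its path match the request, which is why the cart cookie stays behind for requests outside /shop.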
Now, let's look at the cookie's lifetime:
>>> cookies[0].expires
2060882017
This is a Unix timestamp; we can convert it to datetime:
>>> import datetime
>>> datetime.datetime.fromtimestamp(cookies[0].expires)
datetime.datetime(2035, 4, 22, 20, 13, 37)
So, our cookie will expire on the 22nd of April, 2035. The expiry date tells the client how long the server would like it to hold on to the cookie. Once the expiry date has passed, the client can throw the cookie away, and the server will send a new one in response to the next request. Of course, there's nothing to stop a client from immediately throwing the cookie away, though on some sites this may break functionality that depends on the cookie.
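Note that fromtimestamp() returns local time when it is called without a time zone, so the hour shown depends on the machine that runs it. The following sketch converts the same timestamp to an unambiguous UTC datetime and checks whether it has passed. The two helper functions are hypothetical names introduced for this example, and they also allow for session cookies, whose expires attribute is None.

```python
import datetime
import time


def expiry_as_datetime(expires):
    """Convert an expires timestamp to an aware UTC datetime.

    Session cookies have expires set to None, so None is passed through.
    """
    if expires is None:
        return None
    return datetime.datetime.fromtimestamp(expires, tz=datetime.timezone.utc)


def has_expired(expires, now=None):
    """Return True if the expiry time has been reached."""
    if expires is None:
        return False  # a session cookie has no expiry time
    if now is None:
        now = time.time()
    return expires <= now


expires = 2060882017  # the timestamp from the cookie examined above
print(expiry_as_datetime(expires))  # 2035-04-22 19:13:37+00:00
```

Cookie objects provide a similar check through their is_expired() method, and CookieJar has a clear_expired_cookies() method for discarding cookies that are past their expiry time.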
Let's discuss two common cookie flags:
>>> print(cookies[0].get_nonstandard_attr('HttpOnly'))
None
Cookies that are stored on a client can be accessed in a number of ways: as part of an HTTP request or response, or by other means, such as scripts running in the client.
The HttpOnly flag indicates that the client should only allow access to a cookie when the access is part of an HTTP request or response. The other methods should be denied access. This will protect the client against cross-site scripting attacks (see Chapter 9, Applications for the Web, for more information on these). This is an important security feature, and when the server sets it, our application should behave accordingly.
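One caveat when checking for this flag: get_nonstandard_attr() returns None both when an attribute is absent and when it is present without a value, so the None result alone doesn't tell us whether the flag is set. The has_nonstandard_attr() method removes the ambiguity. The sketch below builds two cookies by hand to show the difference. It assumes that a bare HttpOnly attribute is stored with a value of None and under exactly that spelling; the make_cookie and is_http_only helpers and the cookie values are made up for illustration.

```python
from http.cookiejar import Cookie


def make_cookie(name, value, rest):
    """Build a Cookie by hand; rest holds the non-standard attributes."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain='.example.com', domain_specified=True,
        domain_initial_dot=True,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest=rest,
    )


def is_http_only(cookie):
    """Report whether the HttpOnly attribute is present on the cookie.

    The lookup is case-sensitive, so this assumes the attribute was
    stored with the spelling 'HttpOnly'.
    """
    return cookie.has_nonstandard_attr('HttpOnly')


flagged = make_cookie('session', 'abc123', {'HttpOnly': None})
plain = make_cookie('theme', 'dark', {})

# Both lookups give None, so the value alone is not conclusive.
print(flagged.get_nonstandard_attr('HttpOnly'))  # None
print(plain.get_nonstandard_attr('HttpOnly'))    # None

# Checking for presence gives the real answer.
print(is_http_only(flagged))  # True
print(is_http_only(plain))    # False
```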
There is also a secure flag:
>>> cookies[0].secure
True
If the value is true, the Secure flag indicates that the cookie should only ever be sent over a secure connection, such as HTTPS. Again, we should honor this if the flag has been set, so that when our application sends requests containing this cookie, it only sends them to HTTPS URLs.
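When we go through a cookie jar, this rule is applied for us. The following sketch stores a hand-built secure cookie and asks the jar which Cookie header it would add to an HTTP request and to an HTTPS request. The helpers, the token cookie, and the example.com host are assumptions made for illustration, and no request is actually sent.

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def make_cookie(name, value, secure):
    """Build a Cookie by hand; normally the opener creates these for us."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain='.example.com', domain_specified=True,
        domain_initial_dot=True,
        path='/', path_specified=True,
        secure=secure, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookie_header_for(jar, url):
    """Return the Cookie header the jar would add to a request for url."""
    request = Request(url)          # built but never sent
    jar.add_cookie_header(request)  # applies the jar's cookie policy
    return request.get_header('Cookie')


jar = CookieJar()
jar.set_cookie(make_cookie('token', 'abc123', secure=True))

print(cookie_header_for(jar, 'http://www.example.com/'))   # None
print(cookie_header_for(jar, 'https://www.example.com/'))  # token=abc123
```

The jar's default policy withholds the secure cookie from the plain HTTP request, so we only need to take extra care if we build Cookie headers ourselves.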
You may have spotted an inconsistency here. Our URL has requested a response over HTTP, yet the server has sent us a cookie, which it's requesting to be sent only over secure connections. Surely the site designers didn't overlook a security loophole like that? Rest assured; they didn't. The response was actually sent over HTTPS. But, how did that happen? Well, the answer lies with redirects.