Handling cookies with urllib

To work with cookies in urllib, we can use the HTTPCookieProcessor handler from the urllib.request module:

>>> import urllib.request
>>> cookie_processor = urllib.request.HTTPCookieProcessor()

If we want to access these cookies or send our own cookies, we can pass a CookieJar object from the http.cookiejar module (called cookielib in Python 2) as a parameter to the HTTPCookieProcessor initializer.

To read the cookies that the server sends us, we create a CookieJar object from the http.cookiejar module and pass it to the handler. The handler automatically extracts the cookies from the responses that we receive and stores them in our cookie jar, which we can then iterate over:

>>> from http.cookiejar import CookieJar
>>> cookie_jar = CookieJar()
>>> cookie_processor = urllib.request.HTTPCookieProcessor(cookie_jar)
>>> opener = urllib.request.build_opener(cookie_processor)  
>>> urllib.request.install_opener(opener)

We can use our opener to make an HTTP request:

>>> opener.open('http://www.github.com')
<http.client.HTTPResponse object at 0x00FFBD50>

Lastly, we can check that the server has sent us some cookies:

>>> len(cookie_jar)
3

Whenever we use the opener to make further requests, the HTTPCookieProcessor functionality will check our cookie_jar to see if it contains any cookies for that site and will then automatically add them to our requests. It will also add any further cookies that are received to the cookie jar.
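This lookup can be observed without touching the network. The following is a minimal sketch, assuming a made-up cookie named session_id on the placeholder domain example.com (neither appears in the GitHub example), and a hypothetical make_cookie helper. It places a cookie in the jar by hand and calls add_cookie_header(), the CookieJar method that HTTPCookieProcessor invokes on each outgoing request:

```python
import urllib.request
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain, secure=False):
    """Build a simple version-0 cookie (hypothetical helper for this sketch)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=secure, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


cookie_jar = CookieJar()
# session_id / abc123 / example.com are illustrative values, not real ones
cookie_jar.set_cookie(make_cookie('session_id', 'abc123', 'example.com'))

# Nothing is sent here: we only build Request objects and let the jar
# attach a Cookie header where the domain and path match.
matching = urllib.request.Request('http://example.com/profile')
cookie_jar.add_cookie_header(matching)

other = urllib.request.Request('http://example.org/')
cookie_jar.add_cookie_header(other)

print(matching.get_header('Cookie'))  # session_id=abc123
print(other.get_header('Cookie'))     # None
```

The request for example.org gets no Cookie header because the jar only returns cookies whose domain and path match the request URL.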

Next, we examine the cookies that GitHub sent us in response to the preceding request. In the listing below, we have three cookie objects with the names 'logged_in', '_gh_sess', and 'has_recent_activity'. Each cookie records the domain it applies to, and cookies such as 'logged_in' are part of the mechanism that GitHub uses for finding out whether we've logged in.

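The expires attribute represents the cookie's lifespan: the point in time, given as seconds since the Unix epoch, until which the server would like the client to hold on to the cookie. A value of None, as for '_gh_sess', marks a session cookie that should be discarded when the session ends. Once the expiry date has passed, the client can throw the cookie away, and the server may send a new one in a later response: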

>>> cookies = list(cookie_jar)
>>> cookies
 [Cookie(version=0, name='logged_in', value='no', port=None, port_specified=False, domain='.github.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=True, expires=2173978199, discard=False, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='_gh_sess', value='cDlMVjFOdHM1djhLYWRGeGpPMTIxLytnRzdHNlRQT3VkUngxLzY2UkpmcGI1KzhNYWd4TzBzb2VJK0cxcWZBeGdpOFA3ZE95RXd5Nnp0WDdDWlQ0dHpwSkYzQ0hOZ2o1R3JkOXBPZTdCL0N4dGVMZ0dHc0VKZ0RreW5raDdRNDcrS0tTTlZiY1pOcGw5NkdkNDZBTnlQNTBiTDRzRTRIeVZPNVY2RWdUZ3VvYkFsczNqd3psQ0JBSld2Rlk4d3QvQm5XRm1iSGtVeVpTdG9haVVzMFhnQT09LS1pR3NNNnN1VmRocVJ0RXpkaWhxK0JRPT0%3D--84304a9b84b7e8c2605efb9808c1b92a25fcc221', port=None, port_specified=False, domain='github.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=True, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='has_recent_activity', value='1', port=None, port_specified=False, domain='github.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1542829799, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]
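Because expires is a raw Unix timestamp, it is easier to read once converted to a date. The following is a small sketch that uses a hypothetical describe_expiry helper (not part of http.cookiejar) and values copied from the listing above:

```python
from datetime import datetime, timezone


def describe_expiry(expires):
    """Return a readable form of a cookie's expires value (hypothetical helper)."""
    if expires is None:
        # No expiry date: the cookie lasts only for the current session
        return 'session cookie (no expiry date)'
    # Interpret the value as seconds since the Unix epoch, in UTC
    return datetime.fromtimestamp(expires, tz=timezone.utc).isoformat()


# expires values of 'has_recent_activity' and '_gh_sess' from the listing
print('has_recent_activity:', describe_expiry(1542829799))
print('_gh_sess:', describe_expiry(None))
```

With a live cookie jar, the same helper could be applied as describe_expiry(cookie.expires) for each cookie. Cookie objects also provide an is_expired() method that compares expires with the current time.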

Another interesting attribute is the HttpOnly flag, which indicates that the client should only allow access to a cookie when the access is part of an HTTP request or response. Any other access, such as from client-side scripts, should be denied. This helps protect the client against cookie theft through cross-site scripting attacks. It is an important security feature, and when the server sets it, our application should behave accordingly. We can see that for the cookies with the names 'logged_in' and '_gh_sess', the HttpOnly key is present in the rest dictionary (with the value None, because the flag carries no value) and the secure flag has the value True.

When its value is True, the secure flag indicates that the cookie should only ever be sent over a secure connection, such as HTTPS. Again, we should honor this flag: when our application sends requests containing this cookie, it should only send them to HTTPS URLs.
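Both flags can be inspected on a Cookie object, and the cookie jar itself honors the secure flag when it builds the Cookie header. The following is a minimal offline sketch; the cookie is a hand-built stand-in that reuses the name and flags of 'logged_in' on the placeholder domain example.com, and cookie_header_for is a hypothetical helper:

```python
import urllib.request
from http.cookiejar import Cookie, CookieJar

# Hand-built stand-in for a cookie with the secure and HttpOnly flags set;
# example.com is a placeholder domain, not taken from the GitHub response.
flagged = Cookie(
    version=0, name='logged_in', value='no',
    port=None, port_specified=False,
    domain='example.com', domain_specified=False, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=True, expires=None, discard=True,
    comment=None, comment_url=None, rest={'HttpOnly': None},
)

print(flagged.secure)                            # True
print(flagged.has_nonstandard_attr('HttpOnly'))  # True

jar = CookieJar()
jar.set_cookie(flagged)


def cookie_header_for(url):
    """Return the Cookie header the jar would attach to a request for url."""
    request = urllib.request.Request(url)  # built only, never sent
    jar.add_cookie_header(request)
    return request.get_header('Cookie')


print(cookie_header_for('https://example.com/'))  # logged_in=no
print(cookie_header_for('http://example.com/'))   # None
```

The jar attaches the secure cookie to the HTTPS request only. HttpOnly, by contrast, is stored as a nonstandard attribute, so honoring it is left to our application.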

In the following script, we can see how to obtain cookies from a website. We use the same methods that we have reviewed: for each cookie in the jar, we print the name and the value. We also print the response headers, which we can process to obtain other cookie-related information.

You can find the following code in the extract_cookie_information.py file:

import http.cookiejar
import urllib.request

URL = 'https://github.com/'


def extract_cookie_info():
    # set up the cookie jar that will collect the cookies
    cookie_j = http.cookiejar.CookieJar()
    # create a URL opener that stores received cookies in the jar
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_j))
    # access the site without any login information
    resp = opener.open(URL)
    for cookie in cookie_j:
        print("Cookie: %s --> %s" % (cookie.name, cookie.value))
    print("Headers: %s" % resp.headers)


if __name__ == '__main__':
    extract_cookie_info()

In this screenshot, we can see the execution of the previous script:
