Extracting links from a URL with urllib

In this script, we can see how to extract links using urllib and HTMLParser. HTMLParser is a module that allows us to parse text files formatted in HTML. You can get more information at https://docs.python.org/3/library/html.parser.html.

You can find the following code in the extract_links_parser.py file:

#!/usr/bin/env python3
from html.parser import HTMLParser
import urllib.request

class myParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # We are only interested in anchor (<a>) tags
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    link = value
                    # Print only absolute links
                    if link.find('http') >= 0:
                        print(link)

url = "http://www.packtpub.com"
request = urllib.request.urlopen(url)
parser = myParser()
parser.feed(request.read().decode('utf-8'))
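Instead of printing each link as it is found, you could store the links for later processing. The following sketch is a hypothetical variant of the parser above (the LinkCollector name and the sample HTML are assumptions, not part of the original script) that accumulates matching hrefs in a list:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical variant of myParser that stores links instead of printing them."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Keep only absolute links, as the original script does
                if name == "href" and value.find('http') >= 0:
                    self.links.append(value)

# Assumed sample input for illustration
sample = '<a href="http://example.com">one</a><a href="/relative">two</a>'
collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

Collecting into a list makes the result easy to deduplicate, filter, or pass to another function, rather than being limited to console output.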

When the script is executed for the packtpub.com domain, it prints each absolute link it finds in the page.

Another way to extract links from a URL is to use the regular expression (re) module to find href attributes in the target page.

You can find the following code in the urlib_link_extractor.py file:

#!/usr/bin/env python3

from urllib.request import urlopen
import re

def download_page(url):
    # Download the page and decode the response bytes into a string
    return urlopen(url).read().decode('utf-8')

def extract_links(page):
    # Match the value of the href attribute in <a> tags
    link_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return link_regex.findall(page)

if __name__ == '__main__':
    target_url = 'http://www.packtpub.com'
    packtpub = download_page(target_url)
    links = extract_links(packtpub)
    for link in links:
        print(link)
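Note that, unlike the HTMLParser version, this regular expression also captures relative links such as /free-learning. If you want every result as a full URL, the standard library's urljoin function from urllib.parse can resolve relative links against the page's base URL. The sample link values below are assumptions for illustration, not output from the script:

```python
from urllib.parse import urljoin

# Hypothetical mix of absolute and relative links, as extract_links might return
base_url = 'http://www.packtpub.com'
raw_links = ['/free-learning', 'http://www.packtpub.com/books', 'contact']

# urljoin resolves each link against the base URL;
# links that are already absolute pass through unchanged
absolute_links = [urljoin(base_url, link) for link in raw_links]
print(absolute_links)
```

Resolving links this way is useful when the extracted URLs will be fetched in a later step, since urlopen requires absolute URLs.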