Extracting links from a URL with urllib

In this script, we can see how to extract links using urllib and HTMLParser. HTMLParser is a module that allows us to parse text files formatted in HTML. You can get more information at https://docs.python.org/3/library/html.parser.html.

You can find the following code in the extract_links_parser.py file:

#!/usr/bin/env python3
from html.parser import HTMLParser
import urllib.request

class myParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # We are only interested in anchor (<a>) tags
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    link = value
                    # Print only absolute links
                    if link.find('http') >= 0:
                        print(link)

url = "http://www.packtpub.com"
request = urllib.request.urlopen(url)
parser = myParser()
parser.feed(request.read().decode('utf-8'))
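Instead of printing each link as it is found, you could store the links for later processing. The following sketch is a hypothetical variant of the parser above (the LinkCollector name and the sample HTML are assumptions, not part of the original script) that accumulates matching hrefs in a list:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical variant of myParser that stores links instead of printing them."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Keep only absolute links, as the original script does
                if name == "href" and value.find('http') >= 0:
                    self.links.append(value)

# Assumed sample input for illustration
sample = '<a href="http://example.com">one</a><a href="/relative">two</a>'
collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

Collecting into a list makes the result easy to deduplicate, filter, or pass to another function, rather than being limited to console output.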

When the script is executed for the packtpub.com domain, it prints each absolute link it finds in the page.

Another way to extract links from a URL is to use the regular expression (re) module to find href attributes in the target page.

You can find the following code in the urlib_link_extractor.py file:

#!/usr/bin/env python3

from urllib.request import urlopen
import re

def download_page(url):
    # Download the page and decode the response bytes into a string
    return urlopen(url).read().decode('utf-8')

def extract_links(page):
    # Match the value of the href attribute in <a> tags
    link_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return link_regex.findall(page)

if __name__ == '__main__':
    target_url = 'http://www.packtpub.com'
    packtpub = download_page(target_url)
    links = extract_links(packtpub)
    for link in links:
        print(link)
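Note that, unlike the HTMLParser version, this regular expression also captures relative links such as /free-learning. If you want every result as a full URL, the standard library's urljoin function from urllib.parse can resolve relative links against the page's base URL. The sample link values below are assumptions for illustration, not output from the script:

```python
from urllib.parse import urljoin

# Hypothetical mix of absolute and relative links, as extract_links might return
base_url = 'http://www.packtpub.com'
raw_links = ['/free-learning', 'http://www.packtpub.com/books', 'contact']

# urljoin resolves each link against the base URL;
# links that are already absolute pass through unchanged
absolute_links = [urljoin(base_url, link) for link in raw_links]
print(absolute_links)
```

Resolving links this way is useful when the extracted URLs will be fetched in a later step, since urlopen requires absolute URLs.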