Implementing the scraper

The scraper will be responsible for copying content from other websites using web scraping. First, let's state a few of the things we want to accomplish:

  • Downloading a web page
  • Parsing HTML
  • Cherry-picking attributes from the HTML
  • Saving the results

For a modern way to fetch content from the web, we will avoid the standard urllib library and go directly with the nicer requests library from the Python community.
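
To get a feel for what requests gives us, here is a minimal, standalone sketch (the URL is just a placeholder) that fetches a page and prints a bit of its raw content:

#!/usr/bin/env python
import requests

# Fetch a page over HTTP; requests handles redirects and encoding for us
r = requests.get('http://example.com')

print "Status: %s" % r.status_code   # for example, 200
print r.text[:100]                   # the first 100 characters of the raw HTML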

For parsing and drilling into web pages, we'll use the almost de-facto library for this in the Python world—BeautifulSoup.
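
As a quick preview, here is a minimal sketch using a hardcoded HTML snippet; BeautifulSoup turns a string of HTML into a tree we can navigate with plain attribute access:

from BeautifulSoup import BeautifulSoup

html = "<html><head><title>Hello</title></head><body></body></html>"
soup = BeautifulSoup(html)

# Navigate the parse tree by tag name
print soup.html.head.title          # <title>Hello</title>
print soup.html.head.title.string   # Hello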

Let's fetch these via pip:

$ pip install requests beautifulsoup
Requirement already satisfied (use --upgrade to upgrade): requests in /Library/Python/2.7/site-packages/requests-2.2.1-py2.7.egg
Downloading/unpacking beautifulsoup
  Downloading BeautifulSoup-3.2.1.tar.gz
  Running setup.py (path:/private/var/folders/gw/xp4xsqt97957cc7hcgxd0w0c0000gn/T/pip_build_dotan/beautifulsoup/setup.py) egg_info for package beautifulsoup

Installing collected packages: beautifulsoup
  Running setup.py install for beautifulsoup

Successfully installed beautifulsoup
Cleaning up...

And now, let's sketch out the scraper based on our consumer skeleton:

#!/usr/bin/env python
import pika
import requests
from BeautifulSoup import BeautifulSoup

def handler(ch, method, properties, url):
    print "-> Starting: [%s]" % (url)

    # Fetch the page and parse the raw HTML into a navigable tree
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    print "-> Extracted: %s" % (soup.html.head.title)

    print "-> Done: [%s]" % (url)

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))

channel = connection.channel()

print '* Handling messages.'

channel.basic_consume(handler, queue='pages', no_ack=True)

channel.start_consuming()

We're going to go over the code, but don't worry: since every library we've used is well designed, our code is highly readable and concise.

We start again with our shebang line and import pika, requests, and BeautifulSoup. Next, we beef up our handler from the previous consumer skeleton so that, as promised, all the real logic is contained within it.

Fetching a URL is very easy with requests; the URL's content is available in the text field of the response object returned to us by the get call. This is plain text, neither parsed nor digested.
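
Our handler assumes the fetch always succeeds; in practice, you may want to check the response before parsing. A minimal sketch of such a guard inside the handler (not part of the chapter's code) might look like this:

r = requests.get(url)

# Only parse pages that came back successfully
if r.status_code != 200:
    print "-> Skipping [%s]: HTTP %s" % (url, r.status_code)
    return

soup = BeautifulSoup(r.text)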

We'll use BeautifulSoup to turn this raw text into an HTML tree so that we can drill into its meaning from code, rather than looking at an array of characters.

Accessing the title is easy with BeautifulSoup; by specifying soup.html.head.title, we get direct access to it, and all that's left to do is output it somewhere.
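
If you want just the title's text rather than the whole <title> tag, or other parts of the page, the same soup object can be queried further. A small illustrative sketch:

# The text inside the <title> tag, without the surrounding markup
title_text = soup.html.head.title.string

# All links on the page (BeautifulSoup 3 uses the camelCase findAll)
links = soup.findAll('a')
print "-> Title: %s, links found: %d" % (title_text, len(links))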

We are skipping storing the findings (such as the title) in a persistent store for the sake of brevity. As with the scheduler's persistent store, PostgreSQL or MongoDB would make sense here, but we'll leave that outside the scope of this chapter and simply write to the standard output.
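
If you do want to persist the extracted titles, a minimal sketch inside the handler using MongoDB via pymongo could look like the following; note that pymongo is not part of this chapter's setup, and the database and collection names are arbitrary:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['scraper']

# Store one document per scraped page
db.pages.insert_one({'url': url, 'title': soup.html.head.title.string})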

Running the scraper

Let's run our scheduler first. Wait a bit, and it will start pushing URLs for the scraper to bite at:

$ python scheduler.py
* Connecting to RabbitMQ broker
* Pushed: [http://ebay.to/1G163Lh]
* Pushed: [http://ebay.to/1G163Lh]
* Pushed: [http://ebay.to/1G163Lh]
* Pushed: [http://ebay.to/1G163Lh]

On the other end, we'll start our scraper. Feel free to start it in a different terminal, and position the windows so that you can watch the scheduler and the scraper side by side.

Your scraper will immediately go to work:

$ python scraper.py
* Handling messages.
-> Starting: [http://ebay.to/1G163Lh]
-> Extracted: <title> Killer Rabbit of Death w Pointy Teeth Monty Python Blinking Red Eyes | eBay </title>
-> Done: [http://ebay.to/1G163Lh]
-> Starting: [http://ebay.to/1G163Lh]
-> Extracted: <title> Killer Rabbit of Death w Pointy Teeth Monty Python Blinking Red Eyes | eBay </title>
-> Done: [http://ebay.to/1G163Lh]

We see that it has downloaded the pages, and we are actually pulling out the <title> element from each one! From here, the way to extracting product details and building a sophisticated data-driven product based on eBay's data (while, of course, adhering to eBay's terms of service) is very short.
