Data flow in Scrapy

The data flow in Scrapy is controlled by the execution engine and goes like this:

  1. The process starts with locating the chosen spider and opening the first URL from the list of start_urls.
  2. The first URL is then scheduled as a request with the scheduler. This part is internal to Scrapy.
  3. The Scrapy engine then looks for the next set of URLs to crawl.
  4. The scheduler sends the next URLs to the engine, and the engine forwards them to the downloader through the downloader middlewares. These middlewares are where we plug in proxies and user-agent settings.
  5. The downloader downloads the response from the page and passes it to the spider, where the parse method selects specific elements from the response.
  6. Then, the spider sends the processed item to the engine.
  7. The engine sends the processed response to the item pipeline, where we can add some post-processing.
  8. The same process continues for each URL until there are no remaining requests. A minimal spider that follows this flow is sketched after this list.
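
To make this flow concrete, the following minimal sketch shows where each step lands in spider code; the spider name is made up, and the parse method just prints instead of returning items:

>>>from scrapy.spider import BaseSpider
>>>class MinimalSpider(BaseSpider):
>>>    # steps 1-2: the engine opens the spider and schedules requests for start_urls
>>>    name = "minimal"
>>>    start_urls = ['https://news.google.com/']
>>>    # step 5: each downloaded response is handed to the spider's parse method
>>>    def parse(self, response):
>>>        print response.url, len(response.body)
>>>        # steps 6-8: a real spider would return Item objects here,
>>>        # which the engine passes on to the item pipeline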

The Scrapy shell

The best way to understand Scrapy is to use it through its shell and to get your hands dirty with the initial commands and tools it provides. The shell allows you to experiment with and develop the XPath expressions that you will later put into your spider code.

Tip

To experiment with the Scrapy shell, I recommend installing the developer tools in Chrome or the Firebug plugin in Mozilla Firefox. These tools will help us dig down to the very specific part of the web page that we want.

Now, let's start with a very interesting use case where we want to capture the trending topics from Google news (https://news.google.com/).

The steps to follow here are:

  1. Open https://news.google.com/ in your favorite browser.
  2. Go to the trending topics section on Google news. Then, right-click on the first topic and select Inspect Element, as shown in the following screenshot:
    (Screenshot: The Scrapy shell)
  3. The moment you do this, a side window will pop up showing the underlying HTML of the element.
  4. Search and select the div tag. For this example, we are interested in <div class="topic">.
  5. Once this is done, you will see that we have located the specific part of the web page we are interested in, as shown in the following screenshot:
    (Screenshot: The Scrapy shell)

Now, what we did manually in the preceding steps can be done in an automated way. Scrapy uses the XML Path Language, XPath, to achieve exactly this kind of functionality. So, let's see how we can implement the same example using Scrapy.

To start the Scrapy shell, put the following command in your terminal:

$ scrapy shell https://news.google.com/

The moment you hit Enter, the response of the Google news page is loaded in the Scrapy shell. Now, let's move to the most important aspect of Scrapy: understanding how to look for a specific HTML element of the page. Let's start by running the example of getting the topics from Google news shown in the preceding image:

In [1]: sel.xpath('//div[@class="topic"]').extract()

The output to this will be as follows:

Out[1]:
[<Selector xpath='//div[@class="topic"]' data=u'<div class="topic"><a href="/news/sectio'>,
<Selector xpath='//div[@class="topic"]' data=u'<div class="topic"><a href="/news/sectio'>,
<Selector xpath='//div[@class="topic"]' data=u'<div class="topic"><a href="/news/sectio'>]

Now, we need to understand some of the functions that Scrapy and XPath provide, experiment with them in the shell, and then update our spider to do more sophisticated things. Scrapy selectors are built on top of the lxml library, which means they are very similar to lxml in terms of speed and parsing accuracy.
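
Because selectors are thin wrappers around lxml, you can also build one directly from a string outside the shell, which is handy for trying out XPath expressions offline. Here is a minimal sketch; the HTML fragment is made up purely for illustration:

>>>from scrapy.selector import Selector
>>># a made-up HTML fragment, just to exercise the selector methods offline
>>>body = '<html><body><div class="topic"><a href="/news/section">India</a></div></body></html>'
>>>sel = Selector(text=body)
>>>print sel.xpath('//div[@class="topic"]/a/text()').extract()
[u'India']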

Let's have a look at some of the most frequently used methods provided for selectors:

  • xpath(): This returns a list of selectors, where each of the selectors represents the nodes selected by the XPath expression given as an argument.
  • css(): This returns a list of selectors. Here, each of the selectors represents the nodes selected by the CSS expression given as an argument.
  • extract(): This returns the selected content as a list of unicode strings.
  • re(): This returns a list of unicode strings extracted by applying the regular expression given as an argument.

I am giving you a cheat sheet of the top 10 selector patterns that can cover most of your work. For a more complex selector, a quick web search will usually turn up an easy solution that you can use. Let's start with extracting the title of the web page, which is generic for all web pages:

In [2]: sel.xpath('//title/text()')
Out[2]: [<Selector xpath='//title/text()' data=u' Google News'>]

Now, once you have selected any element, you will also want to extract its content for further processing. Let's extract the selected content. This is a generic method that works with any selector:

In [3]: sel.xpath('//title/text()').extract()
Out[3]: [u' Google News']

The other very generic requirement is to look for all the list (<ul>/<li>) elements in the given page. Let's achieve this with the following selector:

In [4]: sel.xpath('//ul/li')
Out[4]: [list of selectors for the <li> elements]

We can extract all the titles in the page with this selector:

In [5]: sel.xpath('//ul/li/a/text()').extract()
Out[5]: [u'India',
u'World',
u'Business',
u'Technology',
u'Entertainment',
u'More Top Stories']

With this selector, you can extract all the hyperlinks in the web page:

In [6]: sel.xpath('//ul/li/a/@href').extract()
Out[6]: [list of URLs]

Let's select all the <td> and <div> elements:

In [7]: sel.xpath('//td')
In [8]: divs = sel.xpath("//div")

This will select all the div elements, and then you can loop over them:

In [9]: for d in divs:
    print d.extract()

This will print the entire content of each div in the page. So, in case you are not able to get the exact div name, you can also fall back to a regex-based search.

Now, let's select all div elements that contain the attribute class="topic":

In [10]: sel.xpath('//div[@class="topic"]').extract()
In [11]:  sel.xpath("//h1").extract()         # this includes the h1 tag

This will select all the <p> elements in the page and get the class of those elements:

In [12]: for node in sel.xpath("//p"):
    print node.xpath("@class").extract()
Out[12]: prints the class attribute of each <p> element

We can also match elements whose class attribute merely contains a given string, using the XPath contains() function:

In [13]: sel.xpath("//li[contains(@class, 'topic')]")
Out[13]:
[<Selector xpath="//li[contains(@class, 'topic')]" data=u'<li class="nav-item nv-FRONTPAGE selecte'>,
<Selector xpath="//li[contains(@class, 'topic')]" data=u'<li class="nav-item nv-FRONTPAGE selecte'>]

Let's write some selector nuggets that use CSS selectors instead of XPath. If we just want to extract the title with a CSS selector, everything works the same, except that you need to modify the syntax:

In [14]: sel.css('title::text').extract()
Out[14]: [u'Google News']

Use the following command to list the sources of the images used in the page (here, the images that sit inside links whose href contains "image"):

In [15]: sel.xpath('//a[contains(@href, "image")]/img/@src').extract()
Out[15]: [list of image src URLs, if the page structures its images this way]

Let's see a regex-based selector:

In [16]: sel.xpath('//title').re('(\w+)')
Out[16]: [u'title', u'Google', u'News', u'title']

In some cases, removing the namespaces can help us get the right pattern. A selector has an in-built remove_namespaces() function that makes sure the entire document is scanned and all namespaces are removed. Before using it, make sure that none of the namespaces should actually be part of the pattern. The following is an example of the remove_namespaces() function:

In [17]: sel.remove_namespaces()
In [18]: sel.xpath("//link")

Now that we have more understanding about the selectors, let's modify the same old news spider that we built previously:

>>>from scrapy.spider import BaseSpider
>>>from scrapy.selector import Selector
>>>class NewsSpider(BaseSpider):
>>>    name = "news"
>>>    allowed_domains = ["nytimes.com"]
>>>    start_urls = [
>>>        'http://www.nytimes.com/'
>>>    ]
>>>    def parse(self, response):
>>>        sel = Selector(response)
>>>        sites = sel.xpath('//ul/li')
>>>        for site in sites:
>>>            title = site.xpath('a/text()').extract()
>>>            link = site.xpath('a/@href').extract()
>>>            desc = site.xpath('text()').extract()
>>>            print title, link, desc

Here, we mainly modified the parse method, which is the core of our spider. The spider can now crawl through the entire page, and we do a more structured parsing of the title, link, and description.

Now, let's write a more robust crawler using all the capabilities of Scrapy.

Items

Until now, we were just printing the crawled content on stdout or dumping it in a file. A better way to do this is to define items.py every time we write a crawler. The advantage of doing this is that we can consume these items in our parse method, and it also lets us output the data in any format, such as XML, JSON, or CSV. So, if you go back to your old crawler, the items.py file will have a class stub like this:

>>>from scrapy.item import Item, Field
>>>class NewsItem(Item):
>>>    # define the fields for your item here like:
>>>    # name = Field()
>>>    pass

Now, let's make it like the following by adding different fields:

>>>from scrapy.item import Item, Field
>>>class NewsItem(Item):
>>>    title = Field()
>>>    link = Field()
>>>    desc = Field()

Here, we added a Field() for title, link, and desc. Once the fields are in place, our spider's parse method can be renamed to parse_news_item, where, instead of dumping the parsed fields to a file, they are now stored in an item object.

A Rule is a way of specifying what kind of URL needs to be crawled after the current one. A Rule takes an SgmlLinkExtractor, which defines the URL pattern to be extracted from the crawled page. A Rule also takes a callback, which is typically a pointer to the parsing method the spider should use, in this case parse_news_item. In case we have different ways to parse, we can have multiple rules and parse methods. A Rule also has a Boolean parameter, follow, which specifies whether links should be followed from each response extracted with this rule. If the callback is None, follow defaults to True; otherwise, it defaults to False.

One important point to note is that a Rule must not use parse as its callback. This is because parse() is the name of the CrawlSpider's default callback, and overriding it breaks the functionality of the crawl spider. Now, let's jump to the following code to understand the preceding methods and parameters:

>>>from scrapy.contrib.spiders import CrawlSpider, Rule
>>>from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>>from scrapy.selector import Selector
>>>from news.items import NewsItem   # NewsItem lives in your project's items.py; adjust the module path to your project name
>>>class NewsSpider(CrawlSpider):
>>>    name = 'news'
>>>    allowed_domains = ['news.google.com']
>>>    start_urls = ['https://news.google.com']
>>>    rules = (
>>>        # Extract links matching cnn.com
>>>        Rule(SgmlLinkExtractor(allow=('cnn.com', ), deny=('http://edition.cnn.com/', ))),
>>>        # Extract links matching 'news.google.com'
>>>        Rule(SgmlLinkExtractor(allow=('news.google.com', )), callback='parse_news_item'),
>>>    )
>>>    def parse_news_item(self, response):
>>>        sel = Selector(response)
>>>        item = NewsItem()
>>>        item['title'] = sel.xpath('//title/text()').extract()
>>>        item['topic'] = sel.xpath('//div[@class="topic"]').extract()   # requires a topic = Field() in NewsItem
>>>        item['desc'] = sel.xpath('//td//text()').extract()
>>>        return item
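
Assuming this spider lives inside a Scrapy project, you can run it and dump the collected items using Scrapy's built-in feed exports. A typical invocation looks like the following; the output filename is just an example, and on older Scrapy versions you may also need to pass -t json to pick the format explicitly:

$ scrapy crawl news -o news_items.json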