The data flow in Scrapy is controlled by the execution engine: the engine takes the initial requests for the URLs listed in start_urls, the downloader fetches the responses, and the responses are handed back to the spider for parsing.

The best way to understand Scrapy is to use it through a shell and to get your hands dirty with some of the initial commands and tools provided by Scrapy. It lets you experiment with and develop the XPath expressions that you will later put into your spider code.
Now, let's start with a very interesting use case where we want to capture the trending topics from Google News (https://news.google.com/).
The steps to follow here are:

1. Open https://news.google.com/ in your browser.
2. Right-click one of the trending topics and inspect its HTML. Each topic sits inside a div tag; for this example, we are interested in <div class="topic">.

Now, what we actually did manually in the preceding steps can be done in an automated way. Scrapy selectors use XPath (the XML Path Language) to achieve exactly this kind of lookup. So, let's see how we can implement the same example using Scrapy.
To use Scrapy, run the following command in your terminal:
$scrapy shell https://news.google.com/
The moment you hit Enter, the response of the Google News page is loaded in the Scrapy shell. Now, let's move to the most important aspect of Scrapy: understanding how to look for a specific HTML element of the page. Let's start by running the example of getting the topics from Google News that are shown in the preceding image:
In [1]: sel.xpath('//div[@class="topic"]').extract()
The output to this will be as follows:
Out[1]: [<Selector xpath='//div[@class="topic"]' data=u'<div class="topic"><a href="/news/sectio'>, <Selector xpath='//div[@class="topic"]' data=u'<div class="topic"><a href="/news/sectio'>, <Selector xpath='//div[@class="topic"]' data=u'<div class="topic"><a href="/news/sectio'>]
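If you want to experiment with the same XPath pattern offline, the idea can be sketched with the standard library's xml.etree.ElementTree, which supports a subset of XPath. The markup below is a made-up, well-formed stand-in for the Google News page, not the real HTML:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for the Google News markup
html = """
<html><body>
  <div class="topic"><a href="/news/section1">India</a></div>
  <div class="topic"><a href="/news/section2">World</a></div>
  <div class="other"><a href="/misc">Ignore me</a></div>
</body></html>
"""

root = ET.fromstring(html)
# Equivalent of //div[@class="topic"]: attribute predicates are supported
topics = root.findall('.//div[@class="topic"]')
print(len(topics))                          # 2
print([t.find('a').text for t in topics])   # ['India', 'World']
```

Scrapy's selectors accept far richer XPath (via lxml), but the predicate syntax `[@class="topic"]` works the same way in both.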
Now, we need to understand some of the functions that Scrapy and XPath provide so that we can experiment with the shell and then update our spider to do more sophisticated stuff. Scrapy selectors are built on top of the lxml library, which means that they are very similar to lxml in terms of speed and parsing accuracy.
Let's have a look at some of the most frequently used methods provided by selectors:

xpath(): This returns a list of selectors, where each selector represents the nodes selected by the XPath expression given as an argument.
css(): This returns a list of selectors, where each selector represents the nodes selected by the CSS expression given as an argument.
extract(): This returns the selected data as a string.
re(): This returns a list of unicode strings extracted by applying the regular expression given as an argument.

I am giving you a cheat sheet of the top 10 selector patterns that can cover most of your work. For a more complex selector, if you search the Web, there should be an easy solution that you can use. Let's start with extracting the title of the web page, which is generic for all web pages:
In [2]: sel.xpath('//title/text()')
Out[2]: [<Selector xpath='//title/text()' data=u' Google News'>]
Now, once you have selected an element, you will also want to extract its content for further processing. Let's extract the selected content; this method is generic and works with any selector:
In [3]: sel.xpath('//title/text()').extract()
Out[3]: [u' Google News']
The other very generic requirement is to look for all the list (<li>) elements in the given page. Let's achieve this with this selector:

In [4]: sel.xpath('//ul/li')
Out[4]: [list of selectors for the matching <li> elements]
We can extract all the titles in the page with this selector:
In [5]: sel.xpath('//ul/li/a/text()').extract()
Out[5]: [u'India', u'World', u'Business', u'Technology', u'Entertainment', u'More Top Stories']
With this selector, you can extract all the hyperlinks in the web page:
In [6]: sel.xpath('//ul/li/a/@href').extract()
Out[6]: [list of URLs]
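Outside the Scrapy shell, the same two patterns can be sketched with the standard library. The navigation markup below is invented for the sketch:

```python
import xml.etree.ElementTree as ET

# Invented navigation markup, standing in for the real page
html = """
<ul>
  <li><a href="/news/india">India</a></li>
  <li><a href="/news/world">World</a></li>
  <li><a href="/news/business">Business</a></li>
</ul>
"""

root = ET.fromstring(html)
# Equivalent of //ul/li/a/text(): the text inside each link
texts = [a.text for a in root.findall('.//li/a')]
# Equivalent of //ul/li/a/@href: the href attribute of each link
hrefs = [a.get('href') for a in root.findall('.//li/a')]
print(texts)  # ['India', 'World', 'Business']
print(hrefs)  # ['/news/india', '/news/world', '/news/business']
```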
Let's select all the <td> and <div> elements:

In [7]: tds = sel.xpath("//td")
In [8]: divs = sel.xpath("//div")
The second selector selects all the div elements, which you can then loop over:

In [9]: for d in divs:
   ....:     print d.extract()
This will print the entire content of each div in the page. So, in case you are not able to get the exact div name, you can also fall back on a regex-based search.
Now, let's select all the div elements that contain the attribute class="topic":

In [10]: sel.xpath('//div[@class="topic"]').extract()
In [11]: sel.xpath("//h1").extract()   # this includes the h1 tag
This will select all the <p> elements in the page and get the class of each of them:

In [12]: for node in sel.xpath("//p"):
   ....:     print node.xpath("@class").extract()

We can also select elements whose class attribute contains a given string:

In [13]: sel.xpath("//li[contains(@class, 'topic')]")
Out[13]: [<Selector xpath="//li[contains(@class, 'topic')]" data=u'<li class="nav-item nv-FRONTPAGE selecte'>, <Selector xpath="//li[contains(@class, 'topic')]" data=u'<li class="nav-item nv-FRONTPAGE selecte'>]
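ElementTree's XPath subset has no contains() function, but the same idea can be sketched offline with a plain substring check on the class attribute. The markup here is invented for the sketch:

```python
import xml.etree.ElementTree as ET

# Invented markup with multi-valued class attributes
html = """
<ul>
  <li class="nav-item topic selected">Top Stories</li>
  <li class="nav-item topic">World</li>
  <li class="nav-item">Settings</li>
</ul>
"""

root = ET.fromstring(html)
# Equivalent of //li[contains(@class, 'topic')]
matches = [li for li in root.findall('.//li')
           if 'topic' in li.get('class', '')]
print([li.text for li in matches])  # ['Top Stories', 'World']
```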
Let's write some selector nuggets using CSS selectors. If we just want to extract the title with a CSS selector, typically, everything works the same, except that the syntax changes:
In [14]: sel.css('title::text').extract()
Out[14]: [u'Google News']
Use the following command to list the sources of all the images used in the page:

In [15]: sel.xpath('//a[contains(@href, "image")]/img/@src').extract()
Out[15]: [list of image sources, provided the links wrapping the <img> tags contain "image" in their href]
Let's see a regex-based selector:

In [16]: sel.xpath('//title').re(r'(\w+)')
Out[16]: [u'title', u'Google', u'News', u'title']
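The re() method applies a regular expression to the string content of the selected nodes. The same extraction can be sketched with the standard re module applied to the raw title markup as a plain string:

```python
import re

# The raw title markup as a plain string
title_html = '<title> Google News</title>'

# Equivalent of sel.xpath('//title').re(r'(\w+)'):
# every run of word characters, including the tag name itself
tokens = re.findall(r'(\w+)', title_html)
print(tokens)  # ['title', 'Google', 'News', 'title']
```

Note that the tag name `title` appears twice in the output, once for the opening tag and once for the closing tag, which is exactly what the shell session above shows.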
In some cases, removing namespaces can help us get the right pattern. A selector has an inbuilt remove_namespaces() function that makes sure the entire document is scanned and all the namespaces are removed. Before using it, make sure that we do not want some of these namespaces to be part of the pattern. The following is an example of the remove_namespaces() function:

In [17]: sel.remove_namespaces()
In [18]: sel.xpath("//link")
Now that we have a better understanding of selectors, let's modify the same old news spider that we built previously:
>>>from scrapy.spider import BaseSpider
>>>from scrapy.selector import Selector
>>>class NewsSpider(BaseSpider):
>>>    name = "news"
>>>    allowed_domains = ["nytimes.com"]
>>>    start_urls = ['http://www.nytimes.com/']
>>>    def parse(self, response):
>>>        sel = Selector(response)
>>>        sites = sel.xpath('//ul/li')
>>>        for site in sites:
>>>            title = site.xpath('a/text()').extract()
>>>            link = site.xpath('a/@href').extract()
>>>            desc = site.xpath('text()').extract()
>>>            print title, link, desc
Here, we mainly modified the parse method, which is the core of our spider. The spider can still crawl through the entire page, but now it does a more structured parsing of the titles, descriptions, and URLs.
Now, let's write a more robust crawler using all the capabilities of Scrapy.
Until now, we were just printing the crawled content on stdout or dumping it in a file. A better way is to define an item class in items.py every time we write a crawler. The advantage of doing this is that we can consume these items in our parse method, and Scrapy can then give us the output in any data format, such as XML, JSON, or CSV. So, if you go back to your old crawler, the items class will look like this:
>>>from scrapy.item import Item, Field
>>>class NewsItem(Item):
>>>    # define the fields for your item here like:
>>>    # name = Field()
>>>    pass
Now, let's make it like the following by adding different fields:
>>>from scrapy.item import Item, Field
>>>class NewsItem(Item):
>>>    title = Field()
>>>    link = Field()
>>>    desc = Field()
Here, we added a Field() for each of title, link, and desc. Once we have the fields in place, our spider's parse method can be changed to parse_news_item, where, instead of dumping the parsed fields to a file, they can now be stored in an item object.
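A Scrapy Item behaves like a dictionary whose keys are restricted to the declared fields. As a rough stdlib analogue (this is not Scrapy's actual implementation, just a sketch of the behavior), it can be pictured like this:

```python
class NewsItemSketch(dict):
    """Rough stand-in for a scrapy Item: a dict restricted to declared fields."""
    fields = ('title', 'link', 'desc')

    def __setitem__(self, key, value):
        # Reject keys that were not declared as fields, as scrapy.Item does
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        dict.__setitem__(self, key, value)

item = NewsItemSketch()
item['title'] = 'Google News'
item['link'] = 'https://news.google.com'
print(item['title'])  # Google News
```

Assigning to an undeclared key such as item['topic'] would raise a KeyError, which is why the fields must be declared up front.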
A Rule is a way of specifying what kind of URL needs to be crawled after the current one. A Rule takes an SgmlLinkExtractor, which defines the URL pattern that needs to be extracted from the crawled page. A Rule also takes a callback argument, which points the spider to the parsing method, in this case parse_news_item. In case we have different ways to parse, we can have multiple rules and parse methods. A Rule also has a Boolean parameter, follow, which specifies whether links should be followed from each response extracted with this rule. If the callback is None, follow defaults to True; otherwise, it defaults to False.
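The default for follow described above can be sketched as a tiny helper (this mirrors the documented behavior, not Scrapy's internal code):

```python
def default_follow(callback=None, follow=None):
    """Return the effective follow value for a Rule, per the documented default."""
    if follow is not None:
        return follow           # an explicit value always wins
    return callback is None     # no callback: follow links; callback set: don't

print(default_follow(callback=None))               # True
print(default_follow(callback='parse_news_item'))  # False
```

In other words, a rule that only extracts links keeps crawling by default, while a rule with a callback stops at the parsed page unless you pass follow=True explicitly.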
One important point to note is that a Rule must not use parse as its callback. This is because parse() is the name of the default callback method, and if we use it, we are actually overriding it, which breaks the crawl spider's functionality. Now, let's jump to the following code to understand the preceding methods and parameters:
>>>from scrapy.contrib.spiders import CrawlSpider, Rule
>>>from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>>from scrapy.selector import Selector
>>>from items import NewsItem   # the NewsItem class we defined earlier
>>>class NewsSpider(CrawlSpider):
>>>    name = 'news'
>>>    allowed_domains = ['news.google.com']
>>>    start_urls = ['https://news.google.com']
>>>    rules = (
>>>        # Extract links matching cnn.com
>>>        Rule(SgmlLinkExtractor(allow=('cnn.com', ), deny=('http://edition.cnn.com/', ))),
>>>        # Extract links matching 'news.google.com' and parse them with parse_news_item
>>>        Rule(SgmlLinkExtractor(allow=('news.google.com', )), callback='parse_news_item'),
>>>    )
>>>    def parse_news_item(self, response):
>>>        sel = Selector(response)
>>>        item = NewsItem()
>>>        item['title'] = sel.xpath('//title/text()').extract()
>>>        item['topic'] = sel.xpath('//div[@class="topic"]').extract()
>>>        item['desc'] = sel.xpath('//td//text()').extract()
>>>        return item