Creating our spider

This is the code for our first spider. Save it in a file named MySpider.py under the spiders directory in your project:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # Follow every extracted link (an empty allow tuple matches all URLs)
    # and hand each downloaded page to parse_item. Note the trailing comma:
    # rules must be a tuple of Rule objects.
    rules = (
        Rule(LxmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # Use hxs.select(...) here to extract data into the item.
        element = Item()
        return element

CrawlSpider provides a mechanism for following links that match a certain pattern. In addition to the attributes inherited from the BaseSpider class, this class defines a new rules attribute, through which we can tell the spider which links to follow and what to do with each page it fetches.
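The allow (and deny) arguments accepted by the link extractor are regular expressions applied to each discovered URL. As a rough sketch of the filtering idea only (the patterns and URLs below are made up for illustration, and the real extractor does considerably more), the behavior can be approximated with the standard re module:

```python
import re

# Hypothetical patterns, as might be passed to a Rule's link extractor:
# LxmlLinkExtractor(allow=(r'category\.php',), deny=(r'[?&]print=1',))
allow = [re.compile(r'category\.php')]
deny = [re.compile(r'[?&]print=1')]

def should_follow(url):
    """Keep a URL only if it matches an allow pattern and no deny pattern."""
    if allow and not any(p.search(url) for p in allow):
        return False
    if any(p.search(url) for p in deny):
        return False
    return True

links = [
    'http://www.example.com/category.php?id=2',
    'http://www.example.com/category.php?id=2&print=1',
    'http://www.example.com/about.html',
]
followed = [u for u in links if should_follow(u)]
print(followed)  # only the first URL passes both filters
```

An empty allow tuple, as used in our spider above, disables the allow filter entirely, so every link on the page is followed.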
