If the site provides sitemap.xml, then a better way to crawl the site is to use SitemapSpider instead. Here, given sitemap.xml, the spider crawls the URLs that the site itself publishes. This is a more polite way of crawling and is good practice:
>>># In Scrapy 1.0 and later, this module is scrapy.spiders
>>>from scrapy.contrib.spiders import SitemapSpider
>>>class MySpider(SitemapSpider):
>>>    sitemap_urls = ['http://www.example.com/sitemap.xml']
>>>    # each rule maps a URL pattern to the name of a callback method
>>>    sitemap_rules = [('/electronics/', 'parse_electronics'), ('/apparel/', 'parse_apparel'),]
>>>    def parse_electronics(self, response):
>>>        # you need to create an item for electronics
>>>        return
>>>    def parse_apparel(self, response):
>>>        # you need to create an item for apparel
>>>        return
In the preceding code, we wrote one parse method for each product category. This is a great fit if you want to build a price aggregator or comparator, because you will usually want to parse different attributes for different products. For example, for electronics, you might want to scrape the tech specifications, accessories, and price, while for apparel, you are more concerned about the size and color of the item. Try your hand at one of the retailer sites, and use the Scrapy shell to work out the patterns needed to scrape the size, color, and price of different items. If you do this, you should be in good shape to write your first industry-standard spider.
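To make the routing idea concrete, here is a minimal sketch that uses only the Python standard library rather than Scrapy itself. It reads the URLs from a sitemap and picks a callback name for each one by trying the rules in order and taking the first pattern that matches, which is how SitemapSpider's sitemap_rules are documented to behave. The sample sitemap content, the product URLs, and the helper names (sitemap_locs, route, plan_crawl) are assumptions made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; a real crawl would download it from the site.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/electronics/phone-1</loc></url>
  <url><loc>http://www.example.com/apparel/shirt-7</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>"""

# Namespace used by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Same shape as sitemap_rules in the spider: (URL pattern, callback name).
SITEMAP_RULES = [
    ("/electronics/", "parse_electronics"),
    ("/apparel/", "parse_apparel"),
]


def sitemap_locs(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


def route(url, rules):
    """Return the callback name of the first rule whose pattern matches."""
    for pattern, callback_name in rules:
        if re.search(pattern, url):
            return callback_name
    return None  # no rule matched, so this URL would be skipped


def plan_crawl(xml_text, rules):
    """Pair each sitemap URL with its callback, dropping unmatched URLs."""
    plan = []
    for url in sitemap_locs(xml_text):
        callback_name = route(url, rules)
        if callback_name is not None:
            plan.append((url, callback_name))
    return plan


plan = plan_crawl(SITEMAP_XML, SITEMAP_RULES)
for url, callback_name in plan:
    print(callback_name, url)
```

Because only the first matching rule is used, more specific patterns should come before more general ones in the rule list.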
In some cases, you want to crawl a website that requires you to log in before you can access some parts of it. Scrapy has a way to handle that too: FormRequest, which essentially sends a POST request to the HTTP server and gets the response. Let's take a closer look at the following spider code:
>>>from scrapy import log
>>>from scrapy.http import FormRequest
>>>from scrapy.spider import BaseSpider
>>>class LoginSpider(BaseSpider):
>>>    name = 'example.com'
>>>    start_urls = ['http://www.example.com/users/login.php']
>>>    def parse(self, response):
>>>        # fill in the login form found on the page and submit it
>>>        return [FormRequest.from_response(response, formdata={'username': 'john', 'password': 'secret'}, callback=self.after_login)]
>>>    def after_login(self, response):
>>>        # check that the login succeeded before going on
>>>        if "authentication failed" in response.body:
>>>            self.log("Login failed", level=log.ERROR)
>>>            return
For a website that requires just a username and password, without any CAPTCHA, the preceding code should work once you fill in the specific login details. The login is handled in the parse method because, in most cases, the login form is on the first page you request. Once you are logged in, you can write your own after_login callback method with items and other details.
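It can help to see roughly what a form-based login request contains. The following is a minimal standard-library sketch, not Scrapy's implementation: it reads the first form on a page, keeps the values already present in its input fields (such as a hidden token), overrides them with the supplied login details, and encodes the result as a request body. The sample login page, the csrf_token field, the do_login.php action, and the names FormFieldParser and build_login_request are all hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Hypothetical login page; a real spider would receive this as the response.
LOGIN_PAGE = """
<form action="/users/do_login.php" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username" value="">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
"""


class FormFieldParser(HTMLParser):
    """Collects the action, method, and named inputs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.done = False
        self.action = ""
        self.method = "GET"
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.in_form and not self.done:
            self.in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "GET").upper()
        elif tag == "input" and self.in_form and attrs.get("name"):
            # Keep pre-populated values, for example hidden tokens.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            self.in_form = False
            self.done = True  # ignore any later forms on the page


def build_login_request(page_url, html, formdata):
    """Merge the form's own fields with formdata and encode the request."""
    parser = FormFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)
    fields.update(formdata)  # supplied values override the page's defaults
    return {
        "url": urljoin(page_url, parser.action),
        "method": parser.method,
        "body": urlencode(fields),
    }


request = build_login_request(
    "http://www.example.com/users/login.php",
    LOGIN_PAGE,
    {"username": "john", "password": "secret"},
)
print(request)
```

Keeping the fields that the page already provides is the main reason to build the request from the response instead of posting the username and password alone, since many login forms reject submissions that lack their hidden fields.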