Scrapy item class

Scrapy provides the item class to define the output data format. Item objects are containers used to collect the extracted data and to specify metadata for the fields that characterize that data. For more details, see https://doc.scrapy.org/en/1.5/topics/items.html.

Create a file named MyItem.py and add the following code into it:

import scrapy
from scrapy.loader.processors import TakeFirst

class MyItem(scrapy.Item):
    # define the fields for your item here, like:
    name = scrapy.Field(output_processor=TakeFirst())
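
As a sketch of how this item might be populated inside a spider callback (the XPath expression is illustrative, not from the original example), an ItemLoader applies the output processor when the item is loaded, so TakeFirst keeps only the first of the collected values:

from scrapy.loader import ItemLoader

def parse_book(self, response):
    loader = ItemLoader(item=MyItem(), response=response)
    # add_xpath may collect several matches; TakeFirst keeps the first
    loader.add_xpath('name', '//h1/text()')
    return loader.load_item()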

The next step is to describe how the information can be extracted using XPath expressions, so that Scrapy can differentiate it from the rest of the HTML code on each book's page.
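
For example, inside a spider callback, expressions such as the following pick out individual fields (the XPath paths here are illustrative and depend on the actual HTML structure of the site):

# illustrative XPath expressions for a book detail page
title = response.xpath('//div[@class="book"]/h1/text()').extract_first()
price = response.xpath('//p[@class="price"]/text()').extract_first()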

To start the crawling process, we need to import the CrawlerProcess class. We instantiate it, passing in the settings that we want the crawler to apply:

# setup crawler
from scrapy.crawler import CrawlerProcess

crawler = CrawlerProcess(settings)
# register the spider class (not an instance) with the crawler
crawler.crawl(MySpider)
# start scrapy; start() blocks until crawling finishes
print("STARTING ENGINE")
crawler.start()
# printed at the end of the crawling process
print("ENGINE STOPPED")

We import the necessary modules to carry out the crawling process:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

  • Rule: Lets us define the rules on which the crawler bases its navigation through different links (see the sketch after this list).
  • LxmlLinkExtractor: Lets us define a callback function and regular expressions that tell the crawler which links to follow. With it, we define the navigation rules between the links we want to obtain.
  • HtmlXPathSelector: Lets us apply XPath expressions to the page content.
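
Putting these pieces together, a minimal sketch of a CrawlSpider that uses Rule and LxmlLinkExtractor might look like this (the spider name, domain, URL pattern, and XPath are placeholders, not taken from the original example):

class MySpider(CrawlSpider):
    name = 'books'                      # hypothetical spider name
    allowed_domains = ['example.com']   # hypothetical domain
    start_urls = ['http://example.com/catalog']

    # follow only links matching the regular expression and send each
    # matching page to the parse_book() callback
    rules = (
        Rule(LxmlLinkExtractor(allow=r'/book/\d+'), callback='parse_book'),
    )

    def parse_book(self, response):
        item = MyItem()
        # response.xpath() applies the XPath expression; with the legacy
        # API this would be HtmlXPathSelector(response).select(...)
        item['name'] = response.xpath('//h1/text()').extract_first()
        return item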