The item pipeline

Let's talk a bit more about item postprocessing. Scrapy also provides a way to define a pipeline for items, where you can specify the kind of postprocessing an item has to go through. This is a methodical and clean way to structure the program.

We need to build our own item pipeline if we want to postprocess scraped items, for example, to remove noise and convert the case, or to derive some values from the object, such as calculating the age from the DOB or the discounted price from the original price. Finally, we might want to dump the item into a separate file.
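
For example, a pipeline that derives the discounted price from the original price could look like the following sketch; the DiscountPipeline name and the price, discount_pct, and discount_price fields are illustrative assumptions and must be declared in your own Item class:
    >>>class DiscountPipeline(object):
    >>>    def process_item(self, item, spider):
    >>>        # Hypothetical fields: derive the discounted price from the original price
    >>>        if item.get('price') and item.get('discount_pct'):
    >>>            item['discount_price'] = item['price'] * (1 - item['discount_pct'] / 100.0)
    >>>        return item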

The way to achieve this is as follows:

  1. We need to define an item pipeline in settings.py. The number assigned to each pipeline determines the order in which the pipelines run (items flow from lower to higher values), and by convention these numbers lie in the 0-1000 range:
    ITEM_PIPELINES = {
        'myproject.pipeline.CleanPipeline': 300,
        'myproject.pipeline.AgePipeline': 500,
        'myproject.pipeline.DuplicatesPipeline': 700,
        'myproject.pipeline.JsonWriterPipeline': 800,
    }
    
  2. Let's write a CleanPipeline class to clean the items by removing noise and converting the case. Since the fields to clean depend on your item definition, a minimal sketch of this class is shown after this list.
    
  3. We need to derive the age from the DOB. We can use Python's date functions to achieve this:
    >>>import datetime
    >>>class AgePipeline(object):
    >>>    def process_item(self, item, spider):
    >>>        if item['DOB']:
    >>>            # Parse the date of birth and compute the age in years
    >>>            dob = datetime.datetime.strptime(item['DOB'], '%d-%m-%y').date()
    >>>            item['Age'] = (datetime.date.today() - dob).days / 365
    >>>            return item
    
  4. We also need to remove duplicates. Python has the set() data structure, which contains only unique values, so we can use it to build the DuplicatesPipeline class with Scrapy, as follows:
    >>>from scrapy.exceptions import DropItem
    >>>class DuplicatesPipeline(object):
    >>>    def __init__(self):
    >>>        self.ids_seen = set()
    >>>    def process_item(self, item, spider):
    >>>        if item['id'] in self.ids_seen:
    >>>            # Discard any item whose id has already been seen
    >>>            raise DropItem("Duplicate item found: %s" % item)
    >>>        else:
    >>>            self.ids_seen.add(item['id'])
    >>>            return item
    
  5. Finally, let's write the item to a JSON file using the JsonWriterPipeline class:
    >>>import json
    >>>class JsonWriterPipeline(object):
    >>>    def __init__(self):
    >>>        self.file = open('items.txt', 'wb')
    >>>    def process_item(self, item, spider):
    >>>        line = json.dumps(dict(item)) + "\n"
    >>>        self.file.write(line)
    >>>        return item
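
The CleanPipeline class referenced in step 2 and in ITEM_PIPELINES performs the noise removal and case conversion mentioned at the start of this section. A minimal sketch could look as follows; the desc field is only a placeholder for whichever text fields your items actually carry:
    >>>class CleanPipeline(object):
    >>>    def process_item(self, item, spider):
    >>>        # Hypothetical field name; adjust it to your Item definition
    >>>        if item.get('desc'):
    >>>            # Strip surrounding whitespace and normalize the case
    >>>            item['desc'] = item['desc'].strip().lower()
    >>>        return item

Note that every process_item() method must either return the item, so that the pipeline with the next higher number in ITEM_PIPELINES receives it, or raise DropItem to discard it, as DuplicatesPipeline does. It is also good practice to close items.txt in a close_spider() method of JsonWriterPipeline, which Scrapy calls once the spider finishes.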
    