EuroPython project

In this section, we are going to build a project with Scrapy that extracts session data from the EuroPython conference, following the URL pattern http://ep{year}.europython.eu/en/events/sessions. You can try years from 2015 to 2018; for example, you can start with the following URL: https://ep2018.europython.eu/events/sessions/.

To create a project with Scrapy, we can execute the following command:

scrapy startproject europython

Running this command generates the standard Scrapy project scaffold.
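The layout below is what scrapy startproject typically generates (the exact set of files varies slightly between Scrapy versions):

europython/
    scrapy.cfg            # deploy configuration file
    europython/           # project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spiders live
            __init__.py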

The items.py file is where we define the fields for the information that we are going to extract:

import scrapy

class EuropythonItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
    description = scrapy.Field()
    date = scrapy.Field()
    tags = scrapy.Field()
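Scrapy items behave like Python dictionaries, so fields are read and written by key. A quick sketch to illustrate (the values are made-up placeholders, not real conference data):

from europython.items import EuropythonItem

item = EuropythonItem()
item['title'] = 'Example talk'      # placeholder value
item['author'] = 'Example speaker'  # placeholder value
print(item['title'], item['author'])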

In the settings.py file, we define the name of the 'europython.spiders' module and register the item pipelines. Two of them stand out: EuropythonXmlExport, which exports the data in XML format, and EuropythonSQLitePipeline, which saves the data in an SQLite database. Pipelines run in ascending order of the numbers assigned to them, so the XML export (200) runs before the SQLite pipeline (300).

You can find the following code in the settings.py file:

# Scrapy settings for europython project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'europython'

SPIDER_MODULES = ['europython.spiders']
NEWSPIDER_MODULE = 'europython.spiders'

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'europython.pipelines.EuropythonXmlExport': 200,
    'europython.pipelines.EuropythonSQLitePipeline': 300,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    #'europython.middlewares.ProxyMiddleware': 100,
}
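The commented-out europython.middlewares.ProxyMiddleware entry refers to a custom middleware that is not shown in this project. For reference, a minimal sketch of such a downloader middleware could look like this (the class, module, and proxy URL are assumptions for illustration):

# europython/middlewares.py (hypothetical)
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every request through the configured proxy;
        # the URL below is a placeholder
        request.meta['proxy'] = 'http://localhost:8080'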

In the pipelines.py file, we define the classes that will process the results and store them in an SQLite file. For this task, we create an entity called EuropythonSession that extends the db.Entity class available in the Pony ORM package (https://ponyorm.com). You need to install the pony package with the pip install pony command.

You can find the following code in the pipelines.py file:

from pony.orm import *
from scrapy import signals

db = Database("sqlite", "europython.sqlite", create_db=True)

class EuropythonSession(db.Entity):
    """Pony ORM model of the europython session table"""
    id = PrimaryKey(int, auto=True)
    author = Required(str)
    title = Required(str)
    description = Required(str)
    date = Required(str)
    tags = Required(str)

Also, we need to define a EuropythonSQLitePipeline class that processes the author, title, description, date, and tags fields and stores the items in the database:

class EuropythonSQLitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # Create the table the first time the spider runs
        db.generate_mapping(check_tables=True, create_tables=True)

    def spider_closed(self, spider):
        db.commit()

    # Insert data in the database
    def process_item(self, item, spider):
        # Use db_session as a context manager so the insert is
        # committed automatically when the block exits
        with db_session:
            try:
                # Each field arrives as a list of extracted strings;
                # strip the surrounding [u'...'] from its repr
                strAuthor = str(item['author'])
                strAuthor = strAuthor[3:len(strAuthor)-2]
                strTitle = str(item['title'])
                strTitle = strTitle[3:len(strTitle)-2]
                strDescription = str(item['description'])
                strDescription = strDescription[3:len(strDescription)-2]
                strDate = str(item['date'])
                strDate = strDate[3:len(strDate)-2]
                strDate = strDate.replace("[u'", "").replace("']", "").replace("u'", "").replace("',", ",")
                strTags = str(item['tags'])
                strTags = strTags.replace("[u'", "").replace("']", "").replace("u'", "").replace("',", ",")
                # Instantiating the entity registers the row in the session
                europython_session = EuropythonSession(author=strAuthor, title=strTitle,
                    description=strDescription, date=strDate, tags=strTags)
            except Exception as e:
                print("Error processing the items in the DB: %s" % e)
        return item
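The settings file also registers europython.pipelines.EuropythonXmlExport, whose implementation is not listed above. A minimal sketch of what such a pipeline could look like, using Scrapy's built-in XmlItemExporter (the output file name europython.xml is an assumption):

from scrapy import signals
from scrapy.exporters import XmlItemExporter

class EuropythonXmlExport(object):
    """Sketch: export every scraped item to an XML file."""

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # 'europython.xml' is an assumed output file name
        self.file = open('europython.xml', 'wb')
        self.exporter = XmlItemExporter(self.file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item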

In the europython_spider.py file, we define the EuropythonSpyder class. This class implements the spider, which follows the links it finds from the starting URL that match the indicated pattern and, for each entry, extracts the corresponding session data (title, author, description, date, and tags).

You can find the following code in the europython_spider.py file:

import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

from europython.items import EuropythonItem

class EuropythonSpyder(CrawlSpider):

    name = "europython_spyder"
    allowed_domains = ["ep2015.europython.eu", "ep2016.europython.eu",
                       "ep2017.europython.eu", "ep2018.europython.eu"]

    # Pattern for entries that match the conference/talks format
    rules = [Rule(LxmlLinkExtractor(allow=['conference/talks']),
                  callback='process_response')]

    def __init__(self, year='', *args, **kwargs):
        super(EuropythonSpyder, self).__init__(*args, **kwargs)
        self.year = year
        self.start_urls = ['http://ep' + str(self.year) + '.europython.eu/en/events/sessions']
        print('start url: ' + str(self.start_urls[0]))

    def process_response(self, response):
        item = EuropythonItem()
        print(response)
        item['title'] = response.xpath("//div[contains(@class, 'grid-100')]//h1/text()").extract()
        item['author'] = response.xpath("//div[contains(@class, 'talk-speakers')]//a[1]/text()").extract()
        item['description'] = response.xpath("//div[contains(@class, 'cms')]//p//text()").extract()
        item['date'] = response.xpath("//section[contains(@class, 'talk when')]/strong/text()").extract()
        item['tags'] = response.xpath("//div[contains(@class, 'all-tags')]/span/text()").extract()
        return item
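To launch the spider from the project root directory, pass the year as a spider argument with the -a option:

scrapy crawl europython_spyder -a year=2018

After the crawl finishes, the items end up in the europython.sqlite file. As a quick verification, the table can be queried with Pony from a Python shell; a minimal sketch, assuming the pipeline above has already created and populated the database:

from pony.orm import db_session, select
from europython.pipelines import db, EuropythonSession

# Map the entity to the existing table without recreating it
db.generate_mapping(check_tables=True)

with db_session:
    # Print the title of every stored session
    for session in select(s for s in EuropythonSession):
        print(session.title)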