The imports

Here's how the script starts:

# scrape.py
import argparse
import base64
import json
import os
from bs4 import BeautifulSoup
import requests

Going through them from the top, you can see that we'll need to parse the arguments, which we'll feed to the script itself (argparse). We will need the base64 library to save the images within a JSON file (json), and we'll need to open files for writing (os). Finally, we'll need BeautifulSoup for scraping the web page easily, and requests to fetch its content. I assume you're familiar with requests as we have used it in previous chapters.
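To make the role of argparse a bit more concrete, here is a minimal sketch of how a script can read its arguments. The `url` argument and the `--format` option are hypothetical names chosen for illustration, and they are not necessarily the ones the actual script defines.

```python
import argparse


def build_parser():
    # Hypothetical arguments, for illustration only.
    parser = argparse.ArgumentParser(
        description="Scrape images from a web page.")
    parser.add_argument("url", help="address of the page to scrape")
    parser.add_argument(
        "--format",
        choices=["img", "json"],
        default="img",
        help="save as image files or as one JSON file")
    return parser


# Passing a list simulates what the user would type on the command line.
args = build_parser().parse_args(["http://localhost:8000", "--format", "json"])
print(args.url, args.format)
```

Calling `parse_args()` with no arguments would read from the real command line instead of the list used here.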

We will explore the HTTP protocol and the requests mechanism in Chapter 14, Web Development, so for now, let's just (simplistically) say that we perform an HTTP request to fetch the content of a web page. We can do it programmatically using a library, such as requests, and it's more or less the equivalent of typing a URL in your browser and pressing Enter (the browser then fetches the content of a web page and displays it to you).

Of all these imports, only the last two don't belong to the Python standard library, so make sure you have them installed:

$ pip freeze | grep -Ei "soup|requests"
beautifulsoup4==4.6.0
requests==2.18.4

Of course, the version numbers might be different for you. If the packages aren't installed, install them with this command:

$ pip install beautifulsoup4==4.6.0 requests==2.18.4
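If you would rather check from within Python, the following is a small sketch that does roughly the same job as the pip freeze command. It assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library. The helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report on the two third-party packages the script depends on.
for name in ("beautifulsoup4", "requests"):
    print(name, installed_version(name) or "not installed")
```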

At this point, the only thing that I reckon might confuse you is the base64/json pair, so allow me to spend a few words on that.

As we saw in the previous chapter, JSON is one of the most popular formats for data exchange between applications. It's also widely used for other purposes, for example, to save data in a file. In our script, we're going to offer the user the ability to save images either as individual image files or as a single JSON file. Within the JSON, we'll put a dictionary with the image names as keys and their content as values. The only issue is that JSON is a text format with no way to represent raw binary data, so we can't store the image bytes in it directly, and this is where the base64 library comes to the rescue: it encodes arbitrary bytes as plain ASCII text.

The base64 library is actually quite useful. For example, when you send an email with an image attached to it, the image is typically encoded with base64 before the email is sent. On the recipient side, the image is automatically decoded back into its original binary format so that the email client can display it.
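The round trip described above can be sketched in a few lines. This is only an illustration of the idea, not the script's actual code: the bytes in `raw` are a short stand-in for real image data, and the names `encode_images`, `decode_images`, and `logo.png` are invented for the example.

```python
import base64
import json

# Stand-in for the binary content of an image (assumption: a real
# script would use the bytes of a downloaded image here).
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 0x00, 0xFF])


def encode_images(images):
    """Map each image name to its content as a base64 ASCII string."""
    return {
        name: base64.b64encode(data).decode("ascii")
        for name, data in images.items()
    }


def decode_images(encoded):
    """Reverse encode_images: turn base64 strings back into bytes."""
    return {name: base64.b64decode(text) for name, text in encoded.items()}


images = {"logo.png": raw}

# json.dumps(images) would raise a TypeError, because bytes are not
# JSON serializable; the base64 strings, on the other hand, are fine.
payload = json.dumps(encode_images(images))

# Loading the JSON and decoding gives back the original bytes.
restored = decode_images(json.loads(payload))
print(restored == images)
```

The price of this convenience is size: base64 represents every 3 bytes of input with 4 ASCII characters, so the encoded data is about a third larger than the original.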
