Introduction to Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It is particularly useful in web-scraping projects, because it lets us navigate, search, and modify the parse tree of an HTML or XML document.

This library parses whatever you feed it and handles the tree traversal for you. You can ask it to find all the links whose URLs match google.com, find all the links with a particular class, or find all the table headers that contain bold text.
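
For example, such queries are short find_all calls. The HTML snippet and variable names below are invented purely for illustration; find_all is Beautiful Soup's standard search method:

import re
from bs4 import BeautifulSoup

html = """
<a href="https://www.google.com/search">Google search</a>
<a class="external" href="https://example.com/">Example</a>
<table><tr><th><b>Price</b></th><th>Item</th></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')

# all links whose URLs match google.com
google_links = soup.find_all('a', href=re.compile(r'google\.com'))

# all links with a particular class
external_links = soup.find_all('a', class_='external')

# all table headers that contain bold text
bold_headers = [th for th in soup.find_all('th') if th.find('b')]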

A few features make it particularly useful, and they are as follows:

  1. Beautiful Soup provides simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. This means we write less code for an application.
  2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. We don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect it; in that case, we only have to specify the original encoding.
  3. Beautiful Soup sits on top of popular Python parsers, such as lxml (https://lxml.de/) and html5lib (https://github.com/html5lib/), and lets you try various parsing strategies or trade speed for flexibility (see the short sketch after this list).
  4. Beautiful Soup saves you time by extracting only the information you need, which makes your job easier.

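Here is that sketch: the same markup can be parsed with different parsers. html.parser ships with Python, while lxml and html5lib are optional installs, so treat the following as an illustrative snippet rather than required setup:

from bs4 import BeautifulSoup

markup = "<p>Caf\u00e9 <b>menu</p>"

# the built-in parser needs no extra dependencies
soup = BeautifulSoup(markup, 'html.parser')

# lxml is typically faster, html5lib is the most lenient (both installed separately)
# soup = BeautifulSoup(markup, 'lxml')
# soup = BeautifulSoup(markup, 'html5lib')

# internally everything is Unicode; output is UTF-8 by default
print(soup.p.text)      # Unicode text
print(soup.encode())    # UTF-8 bytes
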
Here is a simple version of a Google Images scraper built with Beautiful Soup:

import argparse
import json
import itertools
import logging
import re
import os
import uuid
import sys
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

# the logger will be useful for your debugging needs
def configure_logging():
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter('[%(asctime)s %(levelname)s %(module)s]: %(message)s'))
    logger.addHandler(handler)
    return logger

logger = configure_logging()

Set the User-Agent header to avoid a 403 error code:


REQUEST_HEADER = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

def get_soup(url, header):
    response = urlopen(Request(url, headers=header))
    return BeautifulSoup(response, 'html.parser')

# build the Google Images search URL for the query
def get_query_url(query):
    return "https://www.google.co.in/search?q=%s&source=lnms&tbm=isch" % query

# pull out specific data by navigating the parsed source tree
def extract_images_from_soup(soup):
    image_elements = soup.find_all("div", {"class": "rg_meta"})
    metadata_dicts = (json.loads(e.text) for e in image_elements)
    link_type_records = ((d["ou"], d["ity"]) for d in metadata_dicts)
    return link_type_records

Pass the number of images you want to pull out; by default, Google returns 100 images per results page:


def extract_images(query, num_images):
    url = get_query_url(query)
    logger.info("Souping")
    soup = get_soup(url, REQUEST_HEADER)
    logger.info("Extracting image urls")
    link_type_records = extract_images_from_soup(soup)
    return itertools.islice(link_type_records, num_images)

def get_raw_image(url):
    req = Request(url, headers=REQUEST_HEADER)
    resp = urlopen(req)
    return resp.read()

Save each downloaded image along with its extension, as shown in the following code block:

def save_image(raw_image, image_type, save_directory):
    extension = image_type if image_type else 'jpg'
    file_name = str(uuid.uuid4().hex) + "." + extension
    save_path = os.path.join(save_directory, file_name)
    with open(save_path, 'wb+') as image_file:
        image_file.write(raw_image)

def download_images_to_dir(images, save_directory, num_images):
    for i, (url, image_type) in enumerate(images):
        try:
            logger.info("Making request (%d/%d): %s", i, num_images, url)
            raw_image = get_raw_image(url)
            save_image(raw_image, image_type, save_directory)
        except Exception as e:
            logger.exception(e)

def run(query, save_directory, num_images=100):
    query = '+'.join(query.split())
    logger.info("Extracting image links")
    images = extract_images(query, num_images)
    logger.info("Downloading images")
    download_images_to_dir(images, save_directory, num_images)
    logger.info("Finished")

# main method to initiate the scraper
def main():
    parser = argparse.ArgumentParser(description='Scrape Google images')
    # change the search term here
    parser.add_argument('-s', '--search', default='apple', type=str, help='search term')

Change the number-of-images parameter here; by default, it is set to 1, as shown in the following code:

    parser.add_argument('-n', '--num_images', default=1, type=int, help='num images to save')
    # change the path according to your needs
    parser.add_argument('-d', '--directory', default='/Users/karthikeyan/Downloads/', type=str, help='save directory')
    args = parser.parse_args()
    run(args.search, args.directory, args.num_images)

if __name__ == '__main__':
    main()

Save the script as a Python file and then run the code by executing the following command:

python imageScrapper.py --search "alien" --num_images 10 --directory "/Users/Karthikeyan/Downloads"

Google Images can also be scraped with a better library that offers more configurable options. We will use https://github.com/hardikvasa/google-images-download.

This is a command-line Python program that searches for keywords or key phrases on Google Images and optionally downloads the images to your computer. You can also invoke this script from another Python file.

This is a small, ready-to-run program. No dependencies need to be installed if you only want to download up to 100 images per keyword. If you want more than 100 images per keyword, you will need to install the Selenium library along with ChromeDriver; detailed instructions are provided in the library's Troubleshooting section.
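
If you do need more than 100 images, a rough sketch of the extra setup (once the library itself is installed as described below) is to install Selenium with pip, download ChromeDriver for your version of Chrome, and point the library at it. The keyword, limit, and path below are illustrative; check the library's README for the exact chromedriver argument:

$ pip install selenium
$ googleimagesdownload --keywords "apple" --limit 150 --chromedriver /path/to/chromedriver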

Compared to the simple script shown earlier, this library exposes many more useful options, such as filtering images by color and size.

If you prefer command line-based installation, use the following code:

$ git clone https://github.com/hardikvasa/google-images-download.git
$ cd google-images-download && sudo python setup.py install

Alternatively, you can install the library through pip:

$ pip install google_images_download

If installed via pip or through the command-line interface (CLI), use the following command:

$ googleimagesdownload [Arguments...]

If you downloaded the project from github.com through the UI, unzip the downloaded file, go to the google_images_download directory, and use one of the following commands:

$ python3 google_images_download.py [Arguments...]

$ python google_images_download.py [Arguments...]

If you want to use this library from another Python file, use the following code:

from google_images_download import google_images_download
response = google_images_download.googleimagesdownload()
absolute_image_paths = response.download({<Arguments...>})
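
For example, here is a minimal, illustrative call; the argument names (keywords, limit, print_urls) are the same ones used in the config file sample later in this section:

from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()
# this particular combination of arguments is illustrative
arguments = {"keywords": "apple", "limit": 5, "print_urls": True}
absolute_image_paths = response.download(arguments)
print(absolute_image_paths)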

You can either pass the arguments directly on the command line, as shown in the following examples, or pass them through a config file.
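
For instance, a direct command-line invocation might look like the following; the flags mirror the argument names used in the config sample below:

$ googleimagesdownload --keywords "apple" --limit 20 --print_urls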

A config file can contain more than one record. The following sample consists of two records; the library iterates through each record and downloads images based on the arguments that are passed.

The following is a sample of what a config file looks like:

{
    "Records": [
        {
            "keywords": "apple",
            "limit": 55,
            "color": "red",
            "print_urls": true
        },
        {
            "keywords": "oranges",
            "limit": 105,
            "size": "large",
            "print_urls": true
        }
    ]
}
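
Assuming the sample above is saved as sample_config.json (the file name is arbitrary), the records can then be processed with the library's config file option:

$ googleimagesdownload -cf sample_config.json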
