BeautifulSoup introduction

The BeautifulSoup package provides a library specialized in parsing HTML files and searching the data within them by several kinds of criteria, such as the following:

  • Searches of HTML elements by means of the structure of the DOM
  • Searches through selectors
  • Tag searches
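The three kinds of search listed above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the HTML snippet and the variable names are hypothetical, and the built-in html.parser is assumed so that the example runs without installing an extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="main">
      <p class="intro">Hello</p>
      <a href="https://example.com/a">First</a>
      <a href="https://example.com/b">Second</a>
    </div>
  </body>
</html>
"""

# Assumption: the standard-library parser is enough for this small document.
soup = BeautifulSoup(html, "html.parser")

# 1. Search by DOM structure: walk from parent to child by tag name.
title_text = soup.html.head.title.string

# 2. Search through a CSS selector.
intro = soup.select_one("div#main > p.intro")

# 3. Search by tag name.
links = soup.find_all("a")

print(title_text)
print(intro.get_text())
print([a["href"] for a in links])
```

Attribute-style navigation is convenient for fixed paths, while select_one and find_all are usually easier when the position of an element in the document is not known in advance.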

BeautifulSoup is a Python library commonly used in web scraping work. It focuses on parsing markup documents such as HTML and XML.

The library does not retrieve web pages itself. Instead, its purpose is to provide a simple interface to the content of a page that has already been obtained, which makes it well suited to extracting information from the web.

Among the main features, we can highlight the following:

  • Parses and allows the extraction of information from HTML documents
  • Supports multiple parsers for processing HTML and XML documents (for example, lxml and html5lib)
  • Generates a tree structure with all the elements of the parsed document
  • Makes it easy to search for HTML elements, such as links, forms, or any other HTML tag
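As a rough sketch of these features, the following example parses a small document, collects its links and form fields, and walks up the generated tree from one element. The HTML snippet and all variable names are hypothetical, and the built-in html.parser is assumed instead of lxml or html5lib so that no extra installation is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with a form and a list of links.
html = """
<body>
  <form action="/search" method="get">
    <input type="text" name="q">
    <input type="submit" value="Go">
  </form>
  <ul>
    <li><a href="/docs">Docs</a></li>
    <li><a href="/blog">Blog</a></li>
  </ul>
</body>
"""

# Assumption: the standard-library parser; 'lxml' or 'html5lib' could be
# passed instead if those packages are installed.
soup = BeautifulSoup(html, "html.parser")

# Search for a form and list the names of its input fields.
form = soup.find("form")
field_names = [i.get("name") for i in form.find_all("input") if i.get("name")]

# Search for every link that has an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# The parsed document is a tree: every tag knows its ancestors.
first_link = soup.find("a")
ancestors = [parent.name for parent in first_link.parents]

print(form["action"], field_names)
print(hrefs)
print(ancestors)
```

The last line shows the chain of enclosing tags for the first link, which is one way to confirm that the parser built a nested tree rather than a flat list of tags.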

To use it, you have to install the package, whose official documentation can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/, using the following command:

pip install beautifulsoup4

You can also check the latest version of the package on the Python Package Index (PyPI): https://pypi.python.org/pypi/beautifulsoup4.

Once installed, the package is imported under the name bs4. The first step in using the library is to import the BeautifulSoup class from the bs4 package:

>>> from bs4 import BeautifulSoup

To work with an HTML document, you need to create an object of the bs4.BeautifulSoup class, passing a str object that contains the HTML code as the first argument and the name of the parser to use as the second: bs4.BeautifulSoup(<object type str>, <analyzer type>).

To learn more about the parser options, you can consult the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser.

For example, assuming that contents holds the HTML document as a string, the following line creates an instance that uses the lxml parser (html5lib is another option):

>>> bs = BeautifulSoup(contents, 'lxml')

In this way, we create an instance of the BeautifulSoup class, passing the HTML content of the page and the parser to be used as parameters. The bs object holds all the information needed to navigate through the document and access each of the tags it contains. For example, if we want to access the title tag of the document, we simply evaluate bs.title.
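A minimal, self-contained sketch of creating the object and accessing the title tag might look like the following. The value of contents here is a hypothetical inline string, and the built-in html.parser is assumed in place of lxml so that the example runs without extra packages; with lxml installed, 'lxml' can be passed as the second argument instead.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; in practice this string would come from a
# file or from an HTTP response obtained elsewhere.
contents = (
    "<html><head><title>Example Domain</title></head>"
    "<body><h1>Example</h1></body></html>"
)

# Assumption: the standard-library parser instead of 'lxml'.
bs = BeautifulSoup(contents, "html.parser")

title_tag = bs.title       # the first <title> tag in the document
print(title_tag)           # the whole element, including its markup
print(title_tag.name)      # the tag name
print(title_tag.string)    # only the text inside the tag

# Accessing a tag that does not exist returns None instead of raising.
print(bs.nav)
```

Because a missing tag yields None, it is worth checking the result before reading attributes such as .string from it.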
