The beautifulsoup library

beautifulsoup is a Python library used for web scraping. It provides simple methods for searching, navigating, and modifying the parse tree of an HTML or XML document, which makes it a convenient toolkit for extracting the data you need from a web page.
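
As a rough illustration of searching, navigating, and modifying, here is a minimal sketch. It assumes the bs4 package is installed, and the HTML snippet is a made-up example rather than real page content.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = '<div id="news"><h1>Headline</h1><p class="summary">First story</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Searching: find the first <p> tag that has the class "summary".
summary = soup.find('p', class_='summary')

# Navigating: move from the <p> tag up to its parent <div> and read its id.
parent_id = summary.parent['id']

# Modifying: replace the text inside the <h1> tag.
soup.h1.string = 'Updated headline'

print(summary.get_text())
print(parent_id)
print(soup.h1)
```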

To use requests and beautifulsoup in your scripts, you must import both libraries using the import statement. We will start with an example of parsing a web page: the top news page of the IMDb website. Create a parse_web_page.py script and write the following content in it:

import requests
from bs4 import BeautifulSoup

page_result = requests.get('https://www.imdb.com/news/top?ref_=nv_nw_tp')
parse_obj = BeautifulSoup(page_result.content, 'html.parser')

print(parse_obj)

Run the script and you will get the output as follows:

student@ubuntu:~/work$ python3 parse_web_page.py
Output:
<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
if (typeof uet == 'function') {
uet("bb", "LoadTitle", {wb: 1});
}
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Top News - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
if (typeof uet == 'function') {
uet("be", "LoadTitle", {wb: 1});
}
</script>
<script>
if (typeof uex == 'function') {
uex("ld", "LoadTitle", {wb: 1});
}
</script>
<link href="https://www.imdb.com/news/top" rel="canonical"/>
<meta content="http://www.imdb.com/news/top" property="og:url">
<script>
if (typeof uet == 'function') {
uet("bb", "LoadIcons", {wb: 1});
}

In the preceding example, we fetched a page and parsed it using beautifulsoup. First, we imported the requests and beautifulsoup modules. Then, we requested the URL with a GET request and assigned the response to the page_result variable. Next, we created a beautifulsoup object, parse_obj. It takes page_result.content, the body of the response returned by requests, as its first argument and parses it using html.parser.
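
Printing the whole parsed page is rarely what you want. The following sketch shows how the parsed object can be queried once it exists. To keep it runnable without a network connection, sample_content is a hypothetical stand-in for page_result.content; only the page title is taken from the output shown earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page_result.content, so the sketch runs offline.
sample_content = (
    b'<html><head><title>Top News - IMDb</title></head>'
    b'<body><p>Example body</p></body></html>'
)

# Same call as in parse_web_page.py: raw bytes plus the parser name.
parse_obj = BeautifulSoup(sample_content, 'html.parser')

# The parsed tree can be queried directly, for example for the <title> tag.
page_title = parse_obj.title.string
print(page_title)

# prettify() returns the markup as a string, re-indented one tag per line.
print(parse_obj.prettify())
```

With a real response, it is usually worth checking page_result.status_code (or calling page_result.raise_for_status()) before parsing, so that an error page is not parsed by mistake.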

Now, we are going to extract the content from a class and a tag. To find the class name, go to your web browser, right-click on the content that you want to extract, and choose the Inspect option from the context menu. The developer tools will show the class name of that element. Use that class name in your program and run your script. Create an extract_from_class.py script and write the following content in it:

import requests
from bs4 import BeautifulSoup

page_result = requests.get('https://www.imdb.com/news/top?ref_=nv_nw_tp')
parse_obj = BeautifulSoup(page_result.content, 'html.parser')

top_news = parse_obj.find(class_='news-article__content')
print(top_news)

Run the script and you will get the following output:

student@ubuntu:~/work$ python3 extract_from_class.py
Output:
<div class="news-article__content">
<a href="/name/nm4793987/">Issa Rae</a> and <a href="/name/nm0000368/">Laura Dern</a> are teaming up to star in a limited series called “The Dolls” currently in development at <a href="/company/co0700043/">HBO</a>.<br/><br/>Inspired by true events, the series recounts the aftermath of Christmas Eve riots in two small Arkansas towns in 1983, riots which erupted over Cabbage Patch Dolls. The series explores class, race, privilege and what it takes to be a “good mother.”<br/><br/>Rae will serve as a writer and executive producer on the series in addition to starring, with Dern also executive producing. <a href="/name/nm3308450/">Laura Kittrell</a> and <a href="/name/nm4276354/">Amy Aniobi</a> will also serve as writers and co-executive producers. <a href="/name/nm0501536/">Jayme Lemons</a> of Dern’s <a href="/company/co0641481/">Jaywalker Pictures</a> and <a href="/name/nm3973260/">Deniese Davis</a> of <a href="/company/co0363033/">Issa Rae Productions</a> will also executive produce.<br/><br/>Both Rae and Dern currently star in HBO shows, with Dern appearing in the acclaimed drama “<a href="/title/tt3920596/">Big Little Lies</a>” and Rae starring in and having created the hit comedy “<a href="/title/tt5024912/">Insecure</a>.” Dern also recently starred in the film “<a href="/title/tt4015500/">The Tale</a>,
</div>

In the preceding example, we first imported the requests and beautifulsoup modules. Then, we sent a GET request to the URL and stored the response in page_result. Next, we created a beautifulsoup object, parse_obj, which takes page_result.content as its argument and parses it using html.parser. Finally, we used beautifulsoup's find() method, which returns the first matching element, to get the content of the element with the 'news-article__content' class.
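
Because find() returns None when nothing matches, it can help to check the result before using it. The sketch below assumes a shortened, hypothetical copy of the article markup instead of the live page, and uses get_text() to obtain the text without the tags.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the article markup, so the sketch runs offline.
html = '''
<div class="news-article__content">
<a href="/name/nm4793987/">Issa Rae</a> and <a href="/name/nm0000368/">Laura Dern</a> are teaming up.
</div>
'''
parse_obj = BeautifulSoup(html, 'html.parser')

# find() returns the first element with the given class, or None if there is none.
top_news = parse_obj.find(class_='news-article__content')
missing = parse_obj.find(class_='no-such-class')

if top_news is not None:
    # Join the text pieces with a space and strip surrounding whitespace.
    article_text = top_news.get_text(' ', strip=True)
    print(article_text)

print(missing)
```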

Now, we are going to see an example of extracting content from a particular tag. In this example, we are going to extract the content from the <a> tag. Create an extract_from_tag.py script and write the following content in it:

import requests
from bs4 import BeautifulSoup

page_result = requests.get('https://www.imdb.com/news/top?ref_=nv_nw_tp')
parse_obj = BeautifulSoup(page_result.content, 'html.parser')

top_news = parse_obj.find(class_='news-article__content')
top_news_a_content = top_news.find_all('a')
print(top_news_a_content)

Run the script and you will get the output as follows:

student@ubuntu:~/work$ python3 extract_from_tag.py
Output:
[<a href="/name/nm4793987/">Issa Rae</a>, <a href="/name/nm0000368/">Laura Dern</a>, <a href="/company/co0700043/">HBO</a>, <a href="/name/nm3308450/">Laura Kittrell</a>, <a href="/name/nm4276354/">Amy Aniobi</a>, <a href="/name/nm0501536/">Jayme Lemons</a>, <a href="/company/co0641481/">Jaywalker Pictures</a>, <a href="/name/nm3973260/">Deniese Davis</a>, <a href="/company/co0363033/">Issa Rae Productions</a>, <a href="/title/tt3920596/">Big Little Lies</a>, <a href="/title/tt5024912/">Insecure</a>, <a href="/title/tt4015500/">The Tale</a>]

In the preceding example, we extracted the <a> tags. We used the find_all() method to get a list of all <a> tags inside the element with the 'news-article__content' class.
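
The list returned by find_all() can be iterated to pull out the link text and the href attribute of each tag. The following is a minimal sketch under a few assumptions: the markup is a shortened, hypothetical copy of the article, and base_url is an illustrative variable used to turn the relative links into absolute ones.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the article markup, so the sketch runs offline.
html = '''<div class="news-article__content">
<a href="/name/nm4793987/">Issa Rae</a> and <a href="/name/nm0000368/">Laura Dern</a>
</div>'''

# Assumed base address used to resolve the relative href values.
base_url = 'https://www.imdb.com'

parse_obj = BeautifulSoup(html, 'html.parser')
top_news = parse_obj.find(class_='news-article__content')

# Collect (link text, absolute URL) pairs for every <a> tag in the article.
links = []
for a_tag in top_news.find_all('a'):
    links.append((a_tag.get_text(), urljoin(base_url, a_tag.get('href'))))

for text, url in links:
    print(text, url)
```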
