Consider a situation where you want to extract all the hyperlinks from a webpage. You could do this manually by viewing the page's source, but that would take some time. In this section, we will do it programmatically.
So let's get acquainted with a very useful parser called BeautifulSoup. This parser comes from a third-party source and is very easy to work with. In our code, we will use version 4 of BeautifulSoup.
The requirement is to obtain the title of the HTML page and its hyperlinks.
The code is as follows:
import urllib
from bs4 import BeautifulSoup

url = raw_input("Enter the URL ")
ht = urllib.urlopen(url)
html_page = ht.read()
b_object = BeautifulSoup(html_page)
print b_object.title
print b_object.title.text
for link in b_object.find_all('a'):
    print(link.get('href'))
The from bs4 import BeautifulSoup statement imports the BeautifulSoup library. The url variable stores the URL of the website, and urllib.urlopen(url) opens the webpage, while ht.read() reads it; the html_page = ht.read() statement assigns the downloaded page to the html_page variable, which we use for better readability. In the b_object = BeautifulSoup(html_page) statement, a BeautifulSoup object named b_object is created. The next two statements print the title, first with tags and then without tags. The b_object.find_all('a') call finds all the anchor (<a>) tags, and the last line prints only their href attributes, that is, the hyperlinks. The output of the program, shown in the following screenshot, will clear up any doubts:
Now, you have seen how to obtain the hyperlinks and the title of a page by using the BeautifulSoup parser.
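If BeautifulSoup is not installed, the same extraction can be sketched with the standard library alone. The following is a minimal Python 3 sketch using html.parser (note that the book's code is Python 2; the HTML string here is a made-up example standing in for a downloaded page):

```python
from html.parser import HTMLParser

class LinkTitleParser(HTMLParser):
    """Collects the page title and every href found in <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Made-up HTML standing in for a page fetched over the network
html_page = ('<html><head><title>Demo</title></head>'
             '<body><a href="/a">A</a><a href="/b">B</a></body></html>')
parser = LinkTitleParser()
parser.feed(html_page)
print(parser.title)   # Demo
print(parser.links)   # ['/a', '/b']
```

BeautifulSoup remains the more convenient choice; this sketch only shows what the library is doing for us under the hood.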
In the next code, we will obtain a particular field with the help of BeautifulSoup:
import urllib
from bs4 import BeautifulSoup

url = "https://www.hackthissite.org"
ht = urllib.urlopen(url)
html_page = ht.read()
b_object = BeautifulSoup(html_page)
data = b_object.find('div', id='notice')
print data
The preceding code takes https://www.hackthissite.org as the url, and in it we are interested in finding where the <div id="notice"> element is, as shown in the following screenshot:
Now let's see the output of the preceding code in the following screenshot:
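The id-based lookup that find('div', id='notice') performs can also be sketched without BeautifulSoup, again using Python 3's html.parser; this is an illustrative alternative (the sample markup below is invented, not taken from the real site):

```python
from html.parser import HTMLParser

class DivByIdParser(HTMLParser):
    """Captures the text inside the first <div> whose id matches target_id."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth once we are inside the target div
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                self.depth += 1          # nested div inside the target
            elif dict(attrs).get("id") == self.target_id:
                self.depth = 1           # found the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data

# Invented markup standing in for the downloaded page
page = '<div id="header">x</div><div id="notice">Site notice here</div>'
p = DivByIdParser("notice")
p.feed(page)
print(p.text)   # Site notice here
```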
Consider another example in which you want to gather information about a website. In the process of information gathering for a particular website, you have probably used http://smartwhois.com/. By using SmartWhois, you can obtain useful information about any website, such as the Registrant Name, Registrant Organization, Name Server, and so on.
In the following code, you will see how you can obtain this information from SmartWhois. While studying SmartWhois for information gathering, I found that its <div class="whois"> tag holds the relevant information. The following program gathers the information from this tag and saves it in a file in a readable form:
import urllib
from bs4 import BeautifulSoup
import re

domain = raw_input("Enter the domain name ")
url = "http://smartwhois.com/whois/" + str(domain)
ht = urllib.urlopen(url)
html_page = ht.read()
b_object = BeautifulSoup(html_page)
file_text = open("who.txt", 'a')
who_is = b_object.body.find('div', attrs={'class': 'whois'})
who_is1 = str(who_is)
for match in re.finditer("Domain Name:", who_is1):
    s = match.start()
    lines_raw = who_is1[s:]
lines = lines_raw.split("<br/>", 150)
i = 0
for line in lines:
    file_text.writelines(line)
    file_text.writelines("\n")
    print line
    i = i + 1
    if i == 17:
        break
file_text.writelines("-"*50)
file_text.writelines("\n")
file_text.close()
Since I hope you followed the previous code, let's analyze the new statements, starting with file_text = open("who.txt", 'a'). The file_text file object opens the who.txt file in append mode to store the results. The who_is = b_object.body.find('div', attrs={'class': 'whois'}) statement produces the desired result; however, who_is does not contain the data in string form. If we use b_object.body.find('div', attrs={'class': 'whois'}).text, it outputs all the text without the tags, but that information becomes very difficult to read. The who_is1 = str(who_is) statement therefore converts the information into string form:
for match in re.finditer("Domain Name:", who_is1):
    s = match.start()
The preceding code finds the starting point of the "Domain Name:" string, because our valuable information comes after it. The lines_raw variable contains everything from "Domain Name:" onward. The lines = lines_raw.split("<br/>",150) statement splits this text on the <br/> delimiter, so the lines variable becomes a list: wherever a break (<br/>) exists in the HTML page, the statement starts a new line, and all the lines are stored in the list named lines. The i variable is initialized to 0 and is later used to limit the output, breaking out of the loop after 17 lines have been written. The following piece of code saves the results in a file on the hard disk as well as displaying them on the screen.
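The slicing-and-splitting logic can be seen in isolation on a small made-up string (the whois markup below is invented for illustration, not real SmartWhois output); this sketch works unchanged in Python 3:

```python
import re

# Invented stand-in for the HTML string stored in who_is1
who_is1 = ('<div class="whois">Header junk<br/>'
           'Domain Name: example.com<br/>Registrar: Example Inc.<br/>'
           'Name Server: ns1.example.com</div>')

for match in re.finditer("Domain Name:", who_is1):
    s = match.start()            # index where the useful data begins

lines_raw = who_is1[s:]          # everything from "Domain Name:" onward
lines = lines_raw.split("<br/>", 150)   # one list entry per <br/> break
print(lines[0])   # Domain Name: example.com
print(lines[1])   # Registrar: Example Inc.
```

Each list entry then corresponds to one human-readable line, which is why the program can simply write them out one by one.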
The screen output is as follows:
Now, let's check out the output of the code in the who.txt file:
You have seen how to obtain the hyperlinks from a webpage and, by using the previous code, how to gather whois information about a website. Don't stop here; instead, try to read more about BeautifulSoup at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Now, let's go through an exercise that takes domain names in a list as an input and writes the results of the findings in a single file.
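One possible skeleton for this exercise is sketched below (the function name, file name, and domain list are my own choices, not from the book): loop over the domains and append each result to a single file, separated by a divider line as in the previous program.

```python
def whois_for(domain):
    # Placeholder: in the real exercise this would fetch and parse the
    # SmartWhois page for the domain, as in the previous program.
    return "Domain Name: " + domain

domains = ["example.com", "example.org"]   # made-up input list

with open("all_whois.txt", "a") as out:    # one file for all results
    for d in domains:
        out.write(whois_for(d) + "\n")
        out.write("-" * 50 + "\n")         # divider between domains
```

Replacing the whois_for() placeholder with the fetch-and-parse logic from the previous program completes the exercise.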