Another recipe in this module shows how to use the BeautifulSoup library to gather domain names programmatically. This recipe shows you how to create a local Maltego transform, which you can then use within Maltego itself to present that information in an easy-to-use, graphical way. The links gathered by this transform can also be fed into a larger spidering or crawling solution.
The following code shows how you can create a script that will output the enumerated information into the correct format for Maltego:
# Python 2 script (urllib2 and the print statement)
import urllib2
from bs4 import BeautifulSoup
import sys

# Target URL from the command line, without a trailing slash
tarurl = sys.argv[1]
if tarurl[-1] == "/":
    tarurl = tarurl[:-1]

# Opening tags of the Maltego transform response
print "<MaltegoMessage>"
print "<MaltegoTransformResponseMessage>"
print " <Entities>"

# Fetch the page and parse it
url = urllib2.urlopen(tarurl).read()
soup = BeautifulSoup(url)

for line in soup.find_all('a'):
    newline = line.get('href')
    if newline[:4] == "http":
        # Absolute link: output as is
        print "<Entity Type=\"maltego.Domain\">"
        print "<Value>"+str(newline)+"</Value>"
        print "</Entity>"
    elif newline[:1] == "/":
        # Relative link starting with /: prepend the target URL
        combline = tarurl+newline
        print "<Entity Type=\"maltego.Domain\">"
        print "<Value>"+str(combline)+"</Value>"
        print "</Entity>"

# Closing tags
print " </Entities>"
print "</MaltegoTransformResponseMessage>"
print "</MaltegoMessage>"
First we import all the necessary modules for this recipe. You may have noticed that for BeautifulSoup, we have the following line:
from bs4 import BeautifulSoup
This is so that when we use BeautifulSoup, we just have to type BeautifulSoup instead of bs4.BeautifulSoup.
We then assign the target URL supplied in the argument into a variable:
tarurl = sys.argv[1]
Once we have done that, we check whether the target URL ends in a /. If it does, we remove it by replacing the tarurl variable with all but its last character, so that relative links can later be output in full without a doubled slash:
if tarurl[-1] == "/":
    tarurl = tarurl[:-1]
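The slice-based check above raises an IndexError if the argument is an empty string. As a minimal Python 3 sketch of the same normalisation (the helper name `normalise_target` is hypothetical and not part of the recipe), `endswith` avoids that edge case:

```python
def normalise_target(url):
    """Return url without a single trailing slash (hypothetical helper)."""
    # endswith() is safe on an empty string, unlike url[-1]
    if url.endswith("/"):
        return url[:-1]
    return url


# Example values, chosen only for illustration
with_slash = normalise_target("http://example.com/")
without_slash = normalise_target("http://example.com")
```

Both calls yield the same string, so later concatenation with a link that starts with / produces a single slash.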
We then print out the tags that form part of a Maltego transform response:
print "<MaltegoMessage>"
print "<MaltegoTransformResponseMessage>"
print " <Entities>"
We then open the target URL with urllib2 and pass the returned HTML to BeautifulSoup:
url = urllib2.urlopen(tarurl).read()
soup = BeautifulSoup(url)
We now use soup to find all <a> tags and read the hypertext reference (the href attribute, that is, the link) of each one:
for line in soup.find_all('a'):
    newline = line.get('href')
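One thing to watch for in the loop above: `get('href')` returns None for an <a> tag that has no href attribute (a named anchor, for example), and slicing None raises a TypeError. The following Python 3 sketch shows one way to skip such tags by passing `href=True` to `find_all`. The helper name `extract_hrefs` and the sample HTML are assumptions made for illustration, and the parser is named explicitly as `html.parser`:

```python
from bs4 import BeautifulSoup


def extract_hrefs(html):
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only tags that carry an href attribute,
    # so the result never contains None
    return [tag["href"] for tag in soup.find_all("a", href=True)]


# Sample markup, invented for this example
sample = ('<a href="http://example.com/">one</a>'
          '<a name="top">anchor without href</a>'
          '<a href="/about">two</a>')
links = extract_hrefs(sample)
```

Here `links` holds only the two real links; the named anchor is ignored rather than crashing the script.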
If the first four characters of the link are http, we output it in the correct format as an entity for Maltego:
if newline[:4] == "http":
    print "<Entity Type=\"maltego.Domain\">"
    print "<Value>"+str(newline)+"</Value>"
    print "</Entity>"
If the first character is a /, the link is a relative link, so we prepend the target URL to it before outputting it in the same format. This recipe deals with only this one kind of relative link; it is important to note that there are other types, such as just a filename (example.php), a directory, and relative path dot notation (../../example.php). The case handled by this recipe is shown here:
elif newline[:1] == "/":
    combline = tarurl+newline
    print "<Entity Type=\"maltego.Domain\">"
    print "<Value>"+str(combline)+"</Value>"
    print "</Entity>"
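Prepending the target URL gives the right result only when the target is the site root; if the target were http://example.com/dir, a link of /about would become http://example.com/dir/about. For the other kinds of relative link mentioned above, a possible approach in Python 3 is the standard library's `urllib.parse.urljoin`, which resolves a link against the page it was found on. This is a sketch rather than part of the recipe; `resolve_link` and the example URLs are hypothetical:

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve href against the page it was found on (hypothetical helper)."""
    # urljoin handles absolute URLs, links starting with /,
    # bare filenames and dot notation
    return urljoin(page_url, href)


page = "http://example.com/a/b/page.html"  # example page URL, an assumption
resolved = [resolve_link(page, href) for href in
            ["http://other.org/", "/about", "example.php", "../../example.php"]]
```

With this page URL, the four links resolve to http://other.org/, http://example.com/about, http://example.com/a/b/example.php, and http://example.com/example.php respectively.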
After we have processed all the links on the page, we close all the tags that we opened at the start of the output:
print " </Entities>"
print "</MaltegoTransformResponseMessage>"
print "</MaltegoMessage>"
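Because the response is XML, a link that contains a character such as & (common in query strings) would make the output malformed if printed as is. The following Python 3 sketch builds the same tag structure as the recipe but escapes each value with `xml.sax.saxutils.escape` from the standard library. The function name `build_maltego_response` and its `entity_type` parameter are assumptions introduced for this example:

```python
from xml.sax.saxutils import escape


def build_maltego_response(values, entity_type="maltego.Domain"):
    """Build the transform response as one string (hypothetical helper)."""
    lines = ["<MaltegoMessage>",
             "<MaltegoTransformResponseMessage>",
             "<Entities>"]
    for value in values:
        lines.append('<Entity Type="%s">' % entity_type)
        # escape() replaces &, < and > so the value stays valid XML
        lines.append("<Value>%s</Value>" % escape(value))
        lines.append("</Entity>")
    lines += ["</Entities>",
              "</MaltegoTransformResponseMessage>",
              "</MaltegoMessage>"]
    return "\n".join(lines)


# Example link with a query string, invented for illustration
response = build_maltego_response(["http://example.com/?a=1&b=2"])
print(response)
```

Collecting the output in a list before printing also means nothing is emitted if an error occurs halfway through, which may be preferable to a partially written response.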