Extracting links from a URL to Maltego

Another recipe in this module shows how to use the BeautifulSoup library to gather domain names programmatically. This recipe shows you how to create a local Maltego transform, which you can then run from within Maltego itself to present the results in an easy-to-use, graphical way. The links gathered by this transform can also feed into a larger spidering or crawling solution.

How to do it…

The following code shows how you can create a script that will output the enumerated information into the correct format for Maltego:

import urllib2
from bs4 import BeautifulSoup
import sys

# The target URL is passed as the first command-line argument
tarurl = sys.argv[1]
# Strip a trailing slash so relative links can be appended cleanly
if tarurl[-1] == "/":
  tarurl = tarurl[:-1]
print "<MaltegoMessage>"
print "<MaltegoTransformResponseMessage>"
print "  <Entities>"

# Fetch the page and parse it
url = urllib2.urlopen(tarurl).read()
soup = BeautifulSoup(url)
# Only consider <a> tags that actually carry an href attribute
for line in soup.find_all('a', href=True):
  newline = line.get('href')
  if newline[:4] == "http":
    # Absolute link: output it as-is
    print '<Entity Type="maltego.Domain">'
    print "<Value>"+str(newline)+"</Value>"
    print "</Entity>"
  elif newline[:1] == "/":
    # Link starting with /: prepend the target URL
    combline = tarurl+newline
    print '<Entity Type="maltego.Domain">'
    print "<Value>"+str(combline)+"</Value>"
    print "</Entity>"
print "  </Entities>"
print "</MaltegoTransformResponseMessage>"
print "</MaltegoMessage>"

How it works…

First we import all the necessary modules for this recipe. You may have noticed that for BeautifulSoup, we have the following line:

from bs4 import BeautifulSoup

This is so that when we use BeautifulSoup, we just have to type BeautifulSoup instead of bs4.BeautifulSoup.

We then assign the target URL supplied in the argument into a variable:

tarurl = sys.argv[1]

Next, we check whether the target URL ends in a /. If it does, we drop the final character, so that relative links can later be appended to tarurl without producing a double slash:

if tarurl[-1] == "/":
  tarurl = tarurl[:-1]
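
As a minimal sketch of the same check, written for Python 3 (an assumption; the recipe itself targets Python 2), the helper below uses endswith so that an empty string does not raise an IndexError. The function name normalise_target is hypothetical and is not part of the recipe:

```python
def normalise_target(url):
    """Return url without a single trailing slash (hypothetical helper)."""
    # endswith() is safe on an empty string, unlike url[-1]
    if url.endswith("/"):
        return url[:-1]
    return url


print(normalise_target("http://example.com/"))
```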

We then print out the tags that form part of a Maltego transform response:

print "<MaltegoMessage>"
print "<MaltegoTransformResponseMessage>"
print "  <Entities>"

We then open the target URL with urllib2 and parse the response with BeautifulSoup:

url = urllib2.urlopen(tarurl).read()
soup = BeautifulSoup(url)

We now use soup to find all <a> tags. More specifically, we will be looking for the <a> tags with hypertext references (links):

for line in soup.find_all('a', href=True):
  newline = line.get('href')

If the first four characters of the link are http, we’ll output it into the correct format as an entity for Maltego:

  if newline[:4] == "http":
    print '<Entity Type="maltego.Domain">'
    print "<Value>"+str(newline)+"</Value>"
    print "</Entity>"

If the first character is a /, which indicates that the link is a relative link, we prepend the target URL to the link and then output it in the same format. Note that plain concatenation only produces the right address when the target URL is the root of the site. This recipe deals with just this one kind of relative link; there are other kinds, such as a bare filename (example.php), a directory, and relative path dot notation (../../example.php). The leading-slash case is handled as follows:

  elif newline[:1] == "/":
    combline = tarurl+newline
    print '<Entity Type="maltego.Domain">'
    print "<Value>"+str(combline)+"</Value>"
    print "</Entity>"
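
The other kinds of relative link mentioned above can be resolved with the standard library instead of string concatenation. The following is a minimal sketch, assuming Python 3 (in Python 2 the same function lives in the urlparse module); the helper name resolve_link and the example URLs are hypothetical:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def resolve_link(base_url, href):
    """Resolve href against the page it was found on (hypothetical helper)."""
    return urljoin(base_url, href)


base = "http://example.com/dir/page.html"  # hypothetical page URL
for href in ["/about", "example.php", "../example.php", "http://other.example/x"]:
    # Leading slash, bare filename, dot notation, and an absolute link
    print(resolve_link(base, href))
```

Because urljoin resolves against the full page URL, it also copes with target URLs that are not the site root, which simple prepending does not.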

After we have processed all the links on the page, we close all the tags that we opened at the start of the output:

print "  </Entities>"
print "</MaltegoTransformResponseMessage>"
print "</MaltegoMessage>"
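
One thing the recipe does not cover is that links containing characters such as & or < would make the output malformed XML. The sketch below, assuming Python 3, builds the same response as a string and escapes each value with the standard library. The function name maltego_response and the sample URL are hypothetical, and the entity type is simply the one used in the recipe:

```python
from xml.sax.saxutils import escape


def maltego_response(values, entity_type="maltego.Domain"):
    """Build the transform response as a string, escaping every value."""
    lines = [
        "<MaltegoMessage>",
        "<MaltegoTransformResponseMessage>",
        "  <Entities>",
    ]
    for value in values:
        lines.append('    <Entity Type="%s">' % entity_type)
        # escape() turns &, < and > into XML entities
        lines.append("      <Value>%s</Value>" % escape(value))
        lines.append("    </Entity>")
    lines.extend([
        "  </Entities>",
        "</MaltegoTransformResponseMessage>",
        "</MaltegoMessage>",
    ])
    return "\n".join(lines)


print(maltego_response(["http://example.com/search?q=a&page=2"]))
```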

There’s more…

The BeautifulSoup library contains other features that could make your code simpler. One of these is the SoupStrainer class, which lets you parse only the parts of the document that you want. We have left this as an exercise for you to explore.
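
As a starting point for that exercise, here is a minimal sketch, assuming Python 3 with the bs4 package installed. It passes a SoupStrainer to the parse_only argument so that only <a> tags are kept in the parsed tree. The HTML string is made up for illustration:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page content, used here instead of fetching a real URL
html = (
    '<html><body><p>text</p>'
    '<a href="/a">A</a>'
    '<a name="x">no href</a>'
    '<a href="http://example.com/b">B</a>'
    '</body></html>'
)

# Keep only <a> tags while parsing; everything else is discarded
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

# Collect the href of every remaining <a> tag that has one
hrefs = [tag["href"] for tag in soup.find_all("a", href=True)]
print(hrefs)
```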
