Finding comments in source code

A common security issue is caused by good programming practice. During the development phase of a web application, developers comment their code; this helps with understanding the code and serves as a useful reminder of outstanding work. However, when the web application is ready to be deployed to a production environment, it is best practice to remove all of these comments, as they may prove useful to an attacker.

This recipe uses a combination of Requests and BeautifulSoup to search a URL for comments, find the links on that page, and then search each of those linked URLs for comments as well. The technique of following links from a page and analysing those URLs is known as spidering.

How to do it…

The following script will scrape a URL for comments and links in the source code. It will then perform limited spidering, searching each linked URL for comments as well:

import re
import sys

import requests
from bs4 import BeautifulSoup

if len(sys.argv) != 2:
    print("usage: %s targeturl" % (sys.argv[0]))
    sys.exit(0)

urls = []

tarurl = sys.argv[1]
url = requests.get(tarurl)
comments = re.findall('<!--(.*?)-->', url.text, re.DOTALL)
print("Comments on page: " + tarurl)
for comment in comments:
    print(comment)

soup = BeautifulSoup(url.text, 'html.parser')
for line in soup.find_all('a'):
    newline = line.get('href')
    try:
        if newline[:4] == "http":
            if tarurl in newline:
                urls.append(str(newline))
        elif newline[:1] == "/":
            combline = tarurl + newline
            urls.append(str(combline))
    except TypeError:
        # <a> tags without an href attribute return None; skip them
        pass

for uurl in urls:
    print("Comments on page: " + uurl)
    url = requests.get(uurl)
    comments = re.findall('<!--(.*?)-->', url.text, re.DOTALL)
    for comment in comments:
        print(comment)

How it works…

After importing the necessary modules and setting up variables, the script first retrieves the source code of the target URL.

You may have noticed that for BeautifulSoup, we have the following line:

from bs4 import BeautifulSoup

This is so that when we use BeautifulSoup, we just have to type BeautifulSoup instead of bs4.BeautifulSoup. Note also that the script passes html.parser as the second argument to BeautifulSoup; without an explicit parser, recent versions of the library print a warning and guess one for you.

It then searches for all instances of HTML comments and prints them out. The non-greedy pattern <!--(.*?)--> combined with the re.DOTALL flag ensures that each comment is captured separately, even when it spans multiple lines:

url = requests.get(tarurl)
comments = re.findall('<!--(.*?)-->', url.text, re.DOTALL)
print("Comments on page: " + tarurl)
for comment in comments:
    print(comment)

The script will then use BeautifulSoup to scrape the source code for any instances of absolute links (starting with http) and relative links (starting with /):

if newline[:4] == "http":
    if tarurl in newline:
        urls.append(str(newline))
elif newline[:1] == "/":
    combline = tarurl + newline
    urls.append(str(combline))

Once the script has collated a list of URLs linked to from the page, it will then search each page for HTML comments.
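
This is done in the final loop of the script, which requests each collected URL and applies the same comment-matching regular expression to its source:

for uurl in urls:
    print("Comments on page: " + uurl)
    url = requests.get(uurl)
    comments = re.findall('<!--(.*?)-->', url.text, re.DOTALL)
    for comment in comments:
        print(comment)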

There's more…

This recipe shows a basic example of comment scraping and spidering. It is possible to add more intelligence to this recipe to suit your needs. For instance, you may want to account for relative links that start with . or .. to denote the current and parent directories.
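
The urljoin function from Python's standard urllib.parse module resolves ./ and ../ segments against a base URL, so you would not have to handle them by hand. A minimal sketch follows; the base and link values here are hypothetical, purely for illustration:

from urllib.parse import urljoin

base = "http://example.org/app/index.html"  # hypothetical page URL

# urljoin resolves relative paths, including . and .., against the base
print(urljoin(base, "./admin/login.html"))  # http://example.org/app/admin/login.html
print(urljoin(base, "../static/main.js"))   # http://example.org/static/main.js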

You can also add more control to the spidering part. You could extract the domain from the supplied target URL and create a filter that does not follow links to domains outside the target. This is especially useful for professional engagements, where you need to adhere to a scope of targets.
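
A minimal sketch of such a filter, assuming tarurl holds the target URL exactly as in the recipe (the in_scope helper is illustrative, not part of the original script):

from urllib.parse import urlparse

target_domain = urlparse(tarurl).netloc

def in_scope(link):
    # Relative links stay on the target; absolute links must match its domain
    return urlparse(link).netloc in ("", target_domain)

You would then call in_scope on each candidate link and only append it to urls when it returns True.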
