A common security issue is caused by a good programming practice. During the development phase of a web application, developers comment their code. This is very useful at that stage, as it aids understanding of the code and serves as a reminder of outstanding work. However, when the application is ready to be deployed to a production environment, it is best practice to remove all these comments, as they may prove useful to an attacker.
This recipe will use a combination of Requests and BeautifulSoup to search a URL for comments, then gather the links on the page and search each linked URL for comments as well. The technique of following links from a page and analysing those URLs is known as spidering.
The following script will scrape a URL for comments and links in the source code. It will then also perform limited spidering and search linked URLs for comments:
```python
import re
import sys

import requests
from bs4 import BeautifulSoup

if len(sys.argv) != 2:
    print("usage: %s targeturl" % (sys.argv[0]))
    sys.exit(0)

urls = []
tarurl = sys.argv[1]

# Fetch the target page and print any HTML comments in its source
url = requests.get(tarurl)
comments = re.findall('<!--(.*)-->', url.text)
print("Comments on page: " + tarurl)
for comment in comments:
    print(comment)

# Collect absolute links within the target and relative links on the page
soup = BeautifulSoup(url.text, 'html.parser')
for line in soup.find_all('a'):
    newline = line.get('href')
    try:
        if newline[:4] == "http":
            if tarurl in newline:
                urls.append(str(newline))
        elif newline[:1] == "/":
            combline = tarurl + newline
            urls.append(str(combline))
    except TypeError:
        # <a> tags without an href attribute return None; skip them
        pass

# Limited spidering: search each linked page for comments too
for uurl in urls:
    print("Comments on page: " + uurl)
    url = requests.get(uurl)
    comments = re.findall('<!--(.*)-->', url.text)
    for comment in comments:
        print(comment)
```
After the initial import of the necessary modules and setting up of variables, the script first gets the source code of the target URL.
You may have noticed that for BeautifulSoup, we have the following line:

```python
from bs4 import BeautifulSoup
```

This is so that when we use BeautifulSoup, we just have to type BeautifulSoup instead of bs4.BeautifulSoup.
It then searches for all instances of HTML comments and prints them out:
```python
url = requests.get(tarurl)
comments = re.findall('<!--(.*)-->', url.text)
print("Comments on page: " + tarurl)
for comment in comments:
    print(comment)
```
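One limitation worth noting: by default, `.` in a regular expression does not match newlines, so comments spanning multiple lines will be missed, and the greedy `(.*)` can swallow several comments that sit on the same line. A sketch of a more robust pattern, using a made-up HTML snippet:

```python
import re

# Made-up sample: one comment spans two lines, another sits on one line.
html = "<p>hi</p><!-- line one\nline two --><!-- single -->"

# Default behaviour: '.' stops at newlines, so the multi-line comment is missed.
print(re.findall('<!--(.*?)-->', html))

# With re.DOTALL, both comments are captured; the non-greedy .*? stops
# each match at the first --> rather than swallowing several comments.
print(re.findall('<!--(.*?)-->', html, re.DOTALL))
```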
The script will then use BeautifulSoup to scrape the source code for any instances of absolute links (starting with http) and relative links (starting with /):
```python
if newline[:4] == "http":
    if tarurl in newline:
        urls.append(str(newline))
elif newline[:1] == "/":
    combline = tarurl + newline
    urls.append(str(combline))
```
Once the script has collated a list of URLs linked to from the page, it will then search each page for HTML comments.
This recipe shows a basic example of comment scraping and spidering. It is possible to add more intelligence to this recipe to suit your needs. For instance, you may want to account for relative links that start with . or .. to denote the current and parent directories.
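Python's standard library can resolve such links for you: urljoin handles ./ and ../ segments the same way a browser would before a link is queued for spidering. A minimal sketch, using a hypothetical base URL and hrefs:

```python
from urllib.parse import urljoin

# Hypothetical page URL; urljoin resolves "." and ".." path segments
# against it, producing the absolute URL a browser would request.
base_url = "http://example.org/app/page.html"

for href in ["/admin/", "./notes.html", "../backup/", "docs/help.html"]:
    print(urljoin(base_url, href))
```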
You can also add more control to the spidering part. You could extract the domain from the supplied target URL and create a filter that does not follow links to domains external to the target. This is especially useful for professional engagements, where you need to adhere to an agreed scope of targets.
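One way such a scope filter might be sketched, assuming the hostname alone defines the scope (the URLs below are hypothetical):

```python
from urllib.parse import urlparse

def in_scope(link, target):
    # A link is in scope only if its hostname matches the target's hostname.
    return urlparse(link).netloc == urlparse(target).netloc

target = "http://example.org/index.html"
print(in_scope("http://example.org/about", target))       # same host: in scope
print(in_scope("http://cdn.example.net/lib.js", target))  # external: skipped
```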