Instead of generating your own e-mail list, you may find that a target organisation publishes some addresses on its web pages. These can be of higher value than addresses you have generated yourself, because an e-mail address found on the target organisation's own website is far more likely to be valid than one you have tried to guess.
For this recipe, you will need a list of pages you want to parse for e-mail addresses. You may want to visit the target organisation's website and search for a sitemap. A sitemap can then be parsed for links to pages that exist within the website.
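As a rough sketch of the sitemap approach, the following assumes the site publishes a sitemap in the standard sitemaps.org XML format and that you have already downloaded its contents. The function name extract_sitemap_urls and the sample URLs are hypothetical and are used only for illustration:

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(sitemap_xml):
    """Return the text of every <loc> element found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sitemap content; a real one would be downloaded from the target.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://target.example/contact</loc></url>
  <url><loc>http://target.example/about</loc></url>
</urlset>"""

urls = extract_sitemap_urls(sample)
for url in urls:
    print(url)
```

Each URL returned this way could then be written, one per line, to the urls.txt file that the recipe reads.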
The following code will parse through responses from a list of URLs for instances of text that match an e-mail address format and save them to a file:
import urllib2
import re
import time
from random import randint

# Matches addresses written as victim@target.com or victim at target dot com
regex = re.compile((r"([a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'"
                    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

tarurl = open("urls.txt", "r")
for line in tarurl:
    output = open("emails.txt", "a")
    # Pause for a random interval before each request
    time.sleep(randint(10, 100))
    try:
        # Strip the trailing newline from the URL before requesting it
        url = urllib2.urlopen(line.strip()).read()
        output.write(line)
        emails = re.findall(regex, url)
        for email in emails:
            output.write(email[0]+"\r\n")
            print email[0]
    except:
        print "error"
    output.close()
After importing the necessary modules, you will see the assignment of the regex variable:
regex = re.compile((r"([a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'"
                    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))
This attempts to match an e-mail address format, for example victim@target.com, or victim at target dot com. The code then opens up a file containing the URLs:
tarurl = open("urls.txt", "r")
You might notice the use of the parameter r. This opens the file in read-only mode. The code then loops through the list of URLs. Within the loop, a file is opened to save e-mail addresses to:
output = open("emails.txt", "a")
This time, the a parameter is used. This indicates that anything written to this file will be appended instead of overwriting the entire file. The script uses a sleep timer to avoid triggering any protective measures the target may have in place to prevent attacks:
time.sleep(randint(10, 100))
This timer will pause the script for a random amount of time between 10 and 100 seconds.
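The delay can also be wrapped in a small helper, sketched below. The name polite_pause and its sleep parameter are hypothetical and exist only so the pause can be swapped out while testing. Note that randint() includes both end points, so a delay of exactly 10 or exactly 100 seconds is possible:

```python
import random
import time


def polite_pause(low=10, high=100, sleep=time.sleep):
    """Sleep for a random whole number of seconds between low and high.

    Both bounds are inclusive, matching random.randint(). The chosen
    delay is returned so that the caller can log it.
    """
    delay = random.randint(low, high)
    sleep(delay)
    return delay
```

Calling polite_pause() with no arguments pauses in the same way as the time.sleep(randint(10, 100)) line in the recipe.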
The use of exception handling when using the urlopen() method is essential. If the response from urlopen() is an error such as 404 (HTTP not found), urlopen() raises an exception; without the try/except block, the script would error and exit.
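A minimal sketch of catching a failed request more selectively follows. It assumes Python 3, where urlopen() lives in urllib.request rather than in the urllib2 module used by the recipe. The function name fetch_page, the opener parameter, and the 30-second timeout are hypothetical choices made for illustration:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_page(url, opener=urlopen):
    """Return the decoded body of url, or None if the request fails."""
    try:
        response = opener(url, timeout=30)
        return response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        print("HTTP error %d for %s" % (err.code, url))
    except URLError as err:
        # The server could not be reached at all.
        print("Could not reach %s: %s" % (url, err.reason))
    return None
```

HTTPError is caught before URLError because it is a subclass of it. Returning None lets the calling loop skip the page and move on to the next URL instead of exiting.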
If there is a valid response, the script will then store all instances of e-mail addresses in the emails variable:
emails = re.findall(regex, url)
It will then loop through the emails variable and write each item in the list to the emails.txt file and also output it to the console for confirmation:
for email in emails:
    output.write(email[0]+"\r\n")
    print email[0]
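The reason for indexing with email[0] is that re.findall() returns one tuple per match when the pattern contains more than one capturing group, and the full address is the first group. The sketch below shows this on a made-up sample string. It assumes the recipe's pattern with its backslash escapes written out as raw strings, and the name EMAIL_RE is hypothetical:

```python
import re

# The recipe's pattern, assuming the escapes \. \sat\s and \sdot\s.
EMAIL_RE = re.compile(
    r"([a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'{|}~-]+)*"
    r"(@|\sat\s)"
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|\sdot\s))+"
    r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"
)

# Made-up page content containing both supported address formats.
page = "Contact victim@target.com or admin at target dot com for help."

matches = EMAIL_RE.findall(page)
for match in matches:
    # match is a tuple: (full address, "at" separator, last "dot" separator)
    print(match[0])
```

Because the pattern contains only lower-case character ranges and is compiled without re.IGNORECASE, addresses containing upper-case letters are not matched in full.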
The regular expression matching used in this recipe matches two common formats used to represent e-mail addresses on the Internet. During the course of your learning and investigations, you may come across other formats that you might like to include in your matching. For more information on regular expressions in Python, you may want to read the documentation on the Python website at https://docs.python.org/2/library/re.html.
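As one illustration of adding a format, the sketch below extends the separators to accept addresses written as victim[at]target[dot]com and rewrites every match into the usual user@domain form. The bracketed format is only an assumed example of something you might encounter, and the names EXTENDED_RE and normalise are hypothetical:

```python
import re

# Building blocks, loosely based on the recipe's pattern.
LOCAL = r"[a-z0-9!#$%&'*+/=?^_{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_{|}~-]+)*"
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
# Assumed extra separators: "[at]" and "[dot]", with optional spaces around them.
AT = r"(?:@|\sat\s|\s?\[at\]\s?)"
DOT = r"(?:\.|\sdot\s|\s?\[dot\]\s?)"

# No capturing groups are used, so findall() returns plain strings.
EXTENDED_RE = re.compile(LOCAL + AT + "(?:" + LABEL + DOT + ")+" + LABEL)


def normalise(address):
    """Rewrite an obfuscated address into the usual user@domain form."""
    address = re.sub(r"\s?\[at\]\s?|\sat\s", "@", address)
    return re.sub(r"\s?\[dot\]\s?|\sdot\s", ".", address)


sample = "victim[at]target[dot]com, admin at target dot com, root@target.com"
found = [normalise(match) for match in EXTENDED_RE.findall(sample)]
print(found)
```

Normalising the matches before saving them makes it easier to remove duplicates when the same address appears on a site in more than one format.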