One of the most popular tools for collecting email addresses and other target information during a pen test is theHarvester, written by Christian Martorella (@laramies) of the Edge-Security Group. The source code is available at https://github.com/laramies/theHarvester, and the tool also ships with Kali by default. Motivated by the rewrite of metagoofil, I decided to rewrite theHarvester and update the email collection portion. Currently, the DNS portion is not included.
theHarvester Collection Modes
The new theHarvester offers both the traditional passive and a new active email collection mode.
The existing passive mode scrapes emails from Google @"example.com" search results without visiting the target domain or sub-domains. This is quicker, but not as thorough in returning email addresses. In passive mode, the updated theHarvester also searches Google for pages using the example.com -site:example.com search criteria. Because of the - search operator in front of site:example.com, the script can passively find emails on sites, like forums, that are not necessarily affiliated with the target domain.
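As a rough illustration, the two passive-mode queries can be written out as plain strings and URL-encoded. The function names `passive_queries` and `google_search_url`, and the `q`/`start` URL layout, are assumptions made for this sketch and are not taken from the script itself.

```python
from urllib.parse import quote_plus


def passive_queries(domain):
    """Return the two passive-mode Google queries for a target domain."""
    return [
        f'@"{domain}"',               # emails quoted in search results
        f"{domain} -site:{domain}",   # mentions on pages outside the target domain
    ]


def google_search_url(query, start=0):
    """Build a search URL for a query (hypothetical helper, illustrative layout)."""
    return f"https://www.google.com/search?q={quote_plus(query)}&start={start}"


if __name__ == "__main__":
    for q in passive_queries("example.com"):
        print(google_search_url(q))
```

Neither query touches the target's own servers; only Google sees the requests.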
The active mode searches Google for pages using the site:example.com search criteria. The Python
Once a list of URLs from the Google search results is retrieved, the script visits each URL and scrapes the page for email addresses. This is considered active reconnaissance, since every page is requested from one IP address and the behavior looks like a bot.
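The scraping step can be sketched as a regular-expression pass over the text of each fetched page. The pattern below is an illustrative one written for this example and is not the regular expression the script actually uses; `extract_emails` is likewise a hypothetical name.

```python
import re


def extract_emails(page_text, domain):
    """Return email addresses at `domain` (or a sub-domain) found in page_text.

    Illustrative pattern only: local part, '@', optional sub-domain labels,
    then the target domain.
    """
    pattern = re.compile(
        r"[a-zA-Z0-9._%+-]+@(?:[a-zA-Z0-9-]+\.)*" + re.escape(domain),
        re.IGNORECASE,
    )
    return pattern.findall(page_text)


if __name__ == "__main__":
    html = '<a href="mailto:Alice@example.com">Alice</a> bob@mail.example.com eve@other.org'
    print(extract_emails(html, "example.com"))
```

Matches are returned exactly as they appear on the page, so any normalization happens in a later step.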
The remaining updates deal with the switches. The same switches were kept as in the original theHarvester to avoid confusion, with new ones also added.
-a switch specifies active mode, which scrapes the target domain and possible sub-domains for emails. This could be considered noisy and a precursor to a social engineering attack.
-b switch specifies the data source and currently supports only Google, PGP (from https://pgp.mit.edu/), and 'all', which queries both of them.
-f switch writes all the emails to a domain + date-time stamped .txt file (e.g., example.com_20151201_175822) instead of an HTML file. This allows for quick copy/paste or use as an input file for other tools.
The addition of the -e delay switch allows you to specify the time delay in seconds between searches. If you request searches too quickly, Google will think you are a script or bot and will block your IP address for a while. Experiment to see what works best for you.
-n switch specifies the number of threads to use during active URL retrieval to search for email addresses, with the default being 8.
-t switch specifies the amount of time to wait before trying to access a stale/defunct site in active mode.
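To make the -e, -n, and -t switches concrete, here is a minimal sketch of how a delay between searches and a bounded pool of fetch threads could fit together. It assumes `concurrent.futures` for the thread pool and treats -t as a per-request timeout; the script itself may do both differently. The names `paced_searches`, `fetch_all`, and `fetch_page`, and the 5-second timeout default, are placeholders for this example. Only the default of 8 threads comes from the description above.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url, timeout):
    """Return the page body, or "" if the site is dead or too slow."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError; malformed URLs raise ValueError.
        return ""


def paced_searches(queries, search, delay, sleep=time.sleep):
    """Call search(query) for each query, pausing `delay` seconds between calls (-e)."""
    results = []
    for i, query in enumerate(queries):
        if i > 0:
            sleep(delay)
        results.extend(search(query))
    return results


def fetch_all(urls, threads=8, timeout=5, fetch=fetch_page):
    """Fetch all URLs with `threads` workers (-n), each request bounded by `timeout` (-t)."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(lambda url: fetch(url, timeout), urls))
```

The `search`, `sleep`, and `fetch` parameters are injectable so the pacing and threading logic can be exercised without sending any real traffic.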
The original theHarvester had a module to clean up HTML results in order to extract emails. The email cleaning portion of that module is folded into the googlesearch module code. The email regular expression is still the same.
The emails are converted to lowercase and sorted in alphabetical order for a cleaner look.
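The lowercase-and-sort step, together with the domain + date-time file name used by the -f switch, might look roughly like the following. The helper names are hypothetical, and removing duplicates is an assumption of this sketch rather than something stated above.

```python
from datetime import datetime


def normalize_emails(emails):
    """Lowercase, de-duplicate (assumed), and alphabetically sort the emails."""
    return sorted({email.strip().lower() for email in emails})


def output_filename(domain, now=None):
    """Build a name such as example.com_20151201_175822.txt."""
    now = now or datetime.now()
    return f"{domain}_{now:%Y%m%d_%H%M%S}.txt"


def write_emails(emails, domain):
    """Write one normalized email per line and return the file name used."""
    filename = output_filename(domain)
    with open(filename, "w", encoding="utf-8") as handle:
        handle.write("\n".join(normalize_emails(emails)) + "\n")
    return filename
```

One address per line keeps the file easy to paste or feed into other tools.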
The current theHarvester passive approach is stealthier because it only extracts email addresses from the Google search results; however, the active mode is more comprehensive. The stealth vs. quantity decision is up to you as a pen tester.
Not sure if I will add the other search engine functionality (Yahoo!, Baidu), the API key dependent searching (Shodan), or the other tools (dnsenum, dnsrecon). Usually during a social engineering campaign, I really only care about the email addresses.
All of the code can be found in the Opsdisk GitHub repository here: https://github.com/opsdisk/theHarvester. Comments, suggestions, and improvements are always welcome. Be sure to follow @opsdisk on Twitter for the latest updates.