pagodo.py - Passive Google Dorking

Introduction

The goal of this project was to develop a passive Google dork script to collect potentially vulnerable web pages and applications on the Internet. There are two parts: ghdb_scraper.py retrieves the Google dorks, and pagodo.py uses the dorks gathered by ghdb_scraper.py to search Google.

To my knowledge, AeonDave beat pagodo.py to market with the similar "doork" tool (https://github.com/AeonDave/doork), so definitely check that one out too. I have been working sporadically on the scripts for a couple months and finally got them to a publishable point.

tl;dr

The code can be found here: https://github.com/opsdisk/pagodo

What are Google Dorks?

The awesome folks at Offensive Security maintain the Google Hacking Database (GHDB) found here: https://www.exploit-db.com/google-hacking-database. It is a collection of Google searches, called dorks, that can be used to find potentially vulnerable boxes or other juicy info that is picked up by Google's search bots.

[Image: GHDB homepage]

ghdb_scraper.py

To start off, pagodo.py needs a list of all the current Google dorks. Unfortunately, the entire database cannot be easily downloaded. A couple of older projects scraped it, but their code was slightly stale and not multi-threaded, so collecting ~3800 Google dorks would take a long time. ghdb_scraper.py is the resulting Python script.

The primary inspiration was taken from dustyfresh's ghdb-scrape (https://github.com/dustyfresh/ghdb-scrape). Code was also reused from aquabot (https://github.com/opsdisk/aquabot) and the rewrite of metagoofil (https://github.com/opsdisk/metagoofil).

The Google dork numbers start at 5 and go up to 4318 as of this writing, but not every number has a corresponding Google dork. An example URL with the Google dork number specified is: https://www.exploit-db.com/ghdb/11/
There really isn't any rhyme or reason to which numbers are populated, so setting a large arbitrary maximum like 5000 will cover you. The script is not smart enough to detect the end of the Google dorks.
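As a rough illustration of the numbering scheme, the sketch below generates one candidate URL per dork number. The URL pattern comes from the example above; the function name, the template constant, and the choice to make the range inclusive at both ends are assumptions for illustration and are not taken from ghdb_scraper.py.

```python
# Hypothetical helper: builds candidate GHDB URLs from a number range.
# The URL pattern follows the example URL in the text; everything else
# (names, inclusive upper bound) is an assumption.
GHDB_URL_TEMPLATE = "https://www.exploit-db.com/ghdb/{}/"


def candidate_urls(min_dork_num=5, max_dork_num=5000):
    """Yield one candidate URL per dork number, both ends included."""
    for number in range(min_dork_num, max_dork_num + 1):
        yield GHDB_URL_TEMPLATE.format(number)


urls = list(candidate_urls(5, 11))
print(urls[0])
print(urls[-1])
```

Because some numbers have no dork behind them, a scraper built on this has to tolerate pages that come back empty or missing.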

ghdb_scraper.py Execution Flow

The flow of execution is pretty simple:

  • Fill a queue with Google dork numbers to retrieve based off a range
  • Worker threads retrieve the dork number from the queue, retrieve the page using urllib2, then process the page to extract the Google dork using the BeautifulSoup HTML parsing library
  • Print the results to the screen and optionally save them to a file (to be used by pagodo.py for example)
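The steps above can be sketched roughly as follows. This is a minimal Python 3 sketch of the queue-and-workers pattern, not the actual ghdb_scraper.py code: the page retrieval and HTML parsing are replaced by a caller-supplied function (here a fake one, so the example runs without network access), and all names are hypothetical.

```python
import queue
import threading


def scrape_range(min_num, max_num, fetch_dork, num_threads=3):
    """Return {dork_number: dork_text} for every number that yields a dork.

    fetch_dork is a hypothetical callable that takes a dork number and
    returns the dork string, or None when that number has no dork.
    """
    # Step 1: fill a queue with the dork numbers in the requested range.
    work = queue.Queue()
    for number in range(min_num, max_num + 1):
        work.put(number)

    results = {}
    lock = threading.Lock()

    # Step 2: each worker pulls numbers until the queue is empty.
    def worker():
        while True:
            try:
                number = work.get_nowait()
            except queue.Empty:
                return
            dork = fetch_dork(number)
            if dork is not None:
                with lock:
                    results[number] = dork

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_fetch(number):
    # Stand-in for "retrieve the page and parse out the dork":
    # pretend only even numbers have a dork.
    return "dork-{}".format(number) if number % 2 == 0 else None


# Step 3: print the results (saving to a file is omitted here).
found = scrape_range(5, 10, fake_fetch)
for number in sorted(found):
    print(number, found[number])
```

Passing the fetch function in as a parameter keeps the threading logic testable on its own; a real version would plug in the HTTP request and HTML parsing there.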

The website didn't block requests when using 3 threads, but did with 8...your mileage may vary.

ghdb_scraper.py Switches

The script's switches are self-explanatory:

-n MINDORKNUM     Minimum Google dork number to start at (Default: 5).
-x MAXDORKNUM     Maximum Google dork number, not the total, to retrieve
                    (Default: 5000). It is currently around 3800. There is no
                    logic in this script to determine when it has reached the
                    end.
-d SAVEDIRECTORY  Directory to save downloaded files (Default: cwd, ".")
-s                Save the Google dorks to google_dorks_<TIMESTAMP>.txt file
-t NUMTHREADS     Number of search threads (Default: 3)

To run it

python ghdb_scraper.py -n 5 -x 3785 -s -t 3

pagodo.py

Now that a file with the most recent Google dorks exists, it can be fed into pagodo.py using the -g switch to start collecting potentially vulnerable public applications. pagodo.py leverages the google Python library to search Google for sites matching a Google dork, such as:

intitle:"ListMail Login" admin -demo

The -d switch can be used to specify a domain and functions as the Google search operator:

site:example.com  
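One way to picture how the domain and the dork combine into a single search string is the sketch below. The function name and the placement of the site: operator in front of the dork are assumptions for illustration; pagodo.py may assemble the query differently.

```python
def build_query(dork, domain=None):
    """Combine a Google dork with an optional site: restriction.

    Hypothetical helper: the ordering of the two parts is an assumption.
    """
    if domain:
        return "site:{} {}".format(domain, dork)
    return dork


query = build_query('intitle:"ListMail Login" admin -demo', "example.com")
print(query)
```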

Performing ~3800 search requests to Google as fast as possible will simply not work. Google will rightfully detect it as a bot and block your IP for a set period of time. In order to make the search queries appear more human, a couple of enhancements have been made. A pull request was made and accepted by the maintainer of the Python google module to allow for User-Agent randomization in the Google search queries. This feature is available in 1.9.3 (https://pypi.python.org/pypi/google) and allows you to randomize the different user agents used for each search. This emulates the different browsers used in a large corporate environment.

The second enhancement focuses on randomizing the time between search queries. A minimum delay is specified using the -e option (default is 30.0 seconds) and a jitter factor is used to add time on to the minimum delay. A list of 50 jitter times is created, and one is randomly selected and added to the minimum delay for each Google dork search.

# Create an array of jitter values to add to delay, favoring longer search times.
self.jitter = numpy.random.uniform(low=self.delay, high=jitter * self.delay, size=(50,))

Later in the script, a random time is selected from the jitter array and added to the delay.

pauseTime = self.delay + random.choice(self.jitter)
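For readers who want to experiment with the delay logic outside the script, here is a standalone sketch of the same idea. It swaps numpy.random.uniform for the standard library's random.uniform so that it runs without extra dependencies; the function names are hypothetical, and the values are the documented defaults (delay 30.0, jitter 1.5).

```python
import random


def make_jitter(delay, jitter, size=50):
    """Build a list of jitter values drawn uniformly from [delay, jitter * delay].

    Mirrors the bounds used in the snippet above, with the standard
    library in place of numpy.
    """
    return [random.uniform(delay, jitter * delay) for _ in range(size)]


def next_pause(delay, jitter_values):
    """Pick one jitter value at random and add it to the minimum delay."""
    return delay + random.choice(jitter_values)


jitter_values = make_jitter(delay=30.0, jitter=1.5)
pause_time = next_pause(30.0, jitter_values)
print(round(pause_time, 1))
```

Read literally, these bounds mean each jitter value is at least as large as the delay itself, so with the defaults every pause falls somewhere between 60 and 75 seconds.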

Experiment with the values, but the defaults successfully worked without Google blocking my IP. Note that it could take a few hours/days to run so be sure you have the time...the first successful run took over 48 hours.
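To get a feel for how long a full run might take, the back-of-the-envelope sketch below multiplies the number of dorks by the shortest and longest pause implied by the two snippets above (a jitter value between delay and jitter * delay, added to delay). It only counts the sleeping time, not the time spent on the searches themselves, and the function name is hypothetical.

```python
def estimate_runtime_hours(num_dorks, delay=30.0, jitter=1.5):
    """Return (low, high) estimates in hours for the total time spent pausing.

    Assumes one pause per dork and ignores the search requests themselves.
    """
    shortest_pause = delay + delay           # smallest jitter value equals delay
    longest_pause = delay + jitter * delay   # largest jitter value is jitter * delay
    low = num_dorks * shortest_pause / 3600.0
    high = num_dorks * longest_pause / 3600.0
    return low, high


low, high = estimate_runtime_hours(3800)
print("{:.1f} to {:.1f} hours".format(low, high))  # 63.3 to 79.2 hours
```

Treat the result as a planning aid only; the dork count and settings of any particular run will differ.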

pagodo.py Switches

The script's switches are self-explanatory:

    -d DOMAIN       Domain to search for Google dork hits.
    -g GOOGLEDORKS  File containing Google dorks, 1 per line.
    -j JITTER       jitter factor (multiplied times delay value) added to
                    randomize lookup times. Default: 1.50
    -l SEARCHMAX    Maximum results to search (default 100).
    -s              Save the html links to pagodo_results__<TIMESTAMP>.txt file.
    -e DELAY        Minimum delay (in seconds) between searches...jitter (up to
                    [jitter X delay] value) is added to this value to randomize
                    lookups. If it's too small Google may block your IP, too big
                    and your search may take a while. Default: 30.0

To run it

python pagodo.py -d example.com -g dorks.txt -l 50 -s -e 35.0 -j 1.1

Future Work

Future work includes grabbing the Google dork description to provide some context around the dork and why it is in the Google Hacking Database as seen below.

[Image: Google dork description]

Conclusion

All of the code can be found on the Opsdisk GitHub repository here: https://github.com/opsdisk/pagodo. Comments, suggestions, and improvements are always welcome. Be sure to follow @opsdisk on Twitter for the latest updates.