- In the attributes of selected tags: <a>, <link>, <script> etc. (a sketch of this extraction step follows the list)
- In plain text and comments (using a regexp)
- JavaScript files often contain links (sometimes using relative paths). These links are verified once inserted into the queue and crawled; only valid links are stored.
- robots.txt
- GET parameters in links.
- POST parameters in form input fields.
- Sometimes form entries and input fields are commented out. Look for input fields in the comments as well.
- Cookies:
- Store cookie names and example values.
- While crawling, identify known HTTP headers that sometimes contain sensitive information:
- Server, X-AspNet-Version, X-Powered-By, etc.
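
A minimal sketch of the extraction step, assuming Go with the golang.org/x/net/html tokenizer; the tag/attribute table, the URL regexp and the header list are illustrative rather than exhaustive, and the same regexp can also be run over the bodies of fetched JavaScript files:

```go
package extract

import (
	"net/http"
	"regexp"
	"strings"

	"golang.org/x/net/html"
)

// Attributes that commonly carry URLs, per tag (illustrative, not exhaustive).
var linkAttrs = map[string][]string{
	"a":      {"href"},
	"link":   {"href"},
	"script": {"src"},
	"img":    {"src"},
	"form":   {"action"},
}

// Rough pattern for absolute and relative URLs in plain text, comments and scripts.
var urlRe = regexp.MustCompile(`https?://[^\s"'<>()]+|(?:\.{0,2}/)[A-Za-z0-9._~/-]+\.[A-Za-z0-9]{1,5}`)

// Response headers whose values are worth recording alongside the page.
var interestingHeaders = []string{"Server", "X-AspNet-Version", "X-Powered-By"}

// Links pulls candidate links from tag attributes, plain text and comments.
func Links(body string) []string {
	var out []string
	z := html.NewTokenizer(strings.NewReader(body))
	for {
		switch z.Next() {
		case html.ErrorToken:
			// End of document (or parse error): return what was found so far.
			return out
		case html.StartTagToken, html.SelfClosingTagToken:
			t := z.Token()
			for _, want := range linkAttrs[t.Data] {
				for _, a := range t.Attr {
					if a.Key == want && a.Val != "" {
						out = append(out, a.Val)
					}
				}
			}
		case html.TextToken, html.CommentToken:
			// Comments may hide commented-out forms and links.
			out = append(out, urlRe.FindAllString(string(z.Text()), -1)...)
		}
	}
}

// Headers returns the values of known informative response headers.
func Headers(resp *http.Response) map[string]string {
	found := map[string]string{}
	for _, h := range interestingHeaders {
		if v := resp.Header.Get(h); v != "" {
			found[h] = v
		}
	}
	return found
}
```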
- Once a new link is located, it can simply be added to the queue.
- Every part of the crawler should eliminate duplicate links early on, preferably before they are even crawled.
- Consider the following example: the spider is multithreaded and uses a number of goroutines to crawl pages. Two goroutines identify the same link on two different pages and insert it into the queue at more or less the same time. The queue itself should be able to recognize duplicates before inserting them, as sketched below.
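
A minimal sketch of such a queue in Go (type and method names are made up for illustration): a mutex-guarded set in front of a buffered channel is enough to reject a URL that two goroutines submit at nearly the same time.

```go
package queue

import "sync"

// Queue hands out URLs to crawler goroutines and refuses duplicates,
// so a link found twice is only ever enqueued once.
type Queue struct {
	mu   sync.Mutex
	seen map[string]bool
	urls chan string
}

func New(size int) *Queue {
	return &Queue{seen: make(map[string]bool), urls: make(chan string, size)}
}

// Add enqueues the URL only if it has never been seen before.
// It reports whether the URL was accepted.
func (q *Queue) Add(url string) bool {
	q.mu.Lock()
	if q.seen[url] {
		q.mu.Unlock()
		return false
	}
	q.seen[url] = true
	// Release the lock before sending, so other goroutines can keep
	// checking for duplicates even while the channel is full.
	q.mu.Unlock()

	q.urls <- url
	return true
}

// Next blocks until a URL is available for crawling.
func (q *Queue) Next() string {
	return <-q.urls
}
```

In practice the URL should be normalized (resolved against the page's base URL, fragment stripped) before it reaches Add, so trivially different spellings of the same link are also caught.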
- It should be possible to specify a white list which limits the scope of the crawler. No page should ever be crawled unless it is within the active scope.
- The scope could be controlled using the following:
- domain name
- relative path (for a certain domain)
- file extension
- It should be possible to specify a black list, which the crawler will consult before actually crawling a page.
- For instance, "logout.php" would be reasonable to have in the black list. A combined white list/black list check is sketched below.
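
A rough sketch of such a scope check in Go; the field names and the matching rules (exact domain, path prefix, extension, path suffix for the black list) are assumptions about how the white and black lists might be expressed:

```go
package scope

import (
	"net/url"
	"path"
	"strings"
)

// Scope limits the crawler to an explicit white list and lets a black
// list veto individual pages (e.g. logout.php).
type Scope struct {
	Domains    []string // e.g. "example.com"
	PathPrefix []string // allowed relative paths on those domains
	Extensions []string // e.g. ".php", ".html"; empty means any
	Blacklist  []string // path suffixes that must never be crawled
}

// Allowed reports whether a URL is inside the white list and not black-listed.
// Nothing is allowed unless it matches the active scope.
func (s *Scope) Allowed(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	for _, b := range s.Blacklist {
		if strings.HasSuffix(u.Path, b) {
			return false // e.g. "logout.php"
		}
	}
	if !matchesAny(u.Hostname(), s.Domains, strings.EqualFold) {
		return false
	}
	if len(s.PathPrefix) > 0 && !matchesAny(u.Path, s.PathPrefix, strings.HasPrefix) {
		return false
	}
	if len(s.Extensions) > 0 && !matchesAny(path.Ext(u.Path), s.Extensions, strings.EqualFold) {
		return false
	}
	return true
}

func matchesAny(v string, list []string, match func(a, b string) bool) bool {
	for _, item := range list {
		if match(v, item) {
			return true
		}
	}
	return false
}
```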
- It should be possible to interrupt a crawl and continue it at a later date.
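
One way to support this is to periodically snapshot the pending queue and the set of visited URLs to disk and reload them on start. A sketch assuming a JSON file; the file layout and names are illustrative:

```go
package state

import (
	"encoding/json"
	"os"
)

// Snapshot is everything needed to resume an interrupted crawl.
type Snapshot struct {
	Pending []string        `json:"pending"` // URLs still in the queue
	Visited map[string]bool `json:"visited"` // URLs already crawled
}

// Save writes the snapshot to disk so crawling can continue later.
func Save(path string, s *Snapshot) error {
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// Load restores a previously saved snapshot.
func Load(path string) (*Snapshot, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var s Snapshot
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}
```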
- Results should be stored in a well-specified format, so they can be used by other applications such as a fuzzer or vulnerability scanner.
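
A possible record layout for the stored results, expressed as a Go struct with JSON tags; every field name here is an assumption about what downstream tools such as a fuzzer or vulnerability scanner would consume:

```go
package result

// Page is the stored result for one crawled URL, intended to be
// serialized as JSON and read by other tools.
type Page struct {
	URL        string            `json:"url"`
	Links      []string          `json:"links"`       // links found on the page
	GETParams  []string          `json:"get_params"`  // parameter names from links
	POSTParams []string          `json:"post_params"` // input field names from forms
	Cookies    map[string]string `json:"cookies"`     // cookie names with example values
	Headers    map[string]string `json:"headers"`     // e.g. Server, X-Powered-By
}
```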