CrawlnPeek | A Micro WebCrawler and Network Visualizer

The program crawls the given URL, following anchor tags in breadth-first order, and indexes the website's pages. It then saves the relevant crawl data as JSON and visualizes the domain's connectivity.
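At its core this is a standard breadth-first traversal over hyperlinks. As a rough sketch of the idea (not the repository's exact code; names such as crawl, maxdepth, and maxpages are illustrative, and only the standard library plus Requests is assumed):

    import requests
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        """Collects href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, maxdepth=3, maxpages=100):
        """Breadth-first crawl; returns {url: (depth, predecessor)}."""
        indexed = {start_url: (0, None)}
        queue = deque([start_url])
        while queue and len(indexed) < maxpages:
            url = queue.popleft()
            depth, _ = indexed[url]
            if depth >= maxdepth:
                continue
            try:
                resp = requests.get(url, timeout=5)
            except requests.RequestException:
                continue  # skip broken hyperlinks
            parser = LinkParser()
            parser.feed(resp.text)
            for href in parser.links:
                link = urljoin(url, href)  # resolves relative links like "/source"
                if link not in indexed and len(indexed) < maxpages:
                    indexed[link] = (depth + 1, url)
                    queue.append(link)
        return indexed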

Usage:

python main.py http://www.example.com

Features:

- Creates a list of all pages indexed on a website
- Creates a list of indexed pages with their relative depths and respective predecessors
- Creates an image of the website network
- Saves the indexed URLs to a JSON file (see the sketch after this list)
- Handles complex data parsing and broken hyperlinks robustly
- Limits crawling by maximum depth (maxdepth) and maximum pages indexed (maxpages)
- Supports relative links, e.g. href="/source"
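The JSON output step might look roughly like the following, reusing the indexed mapping from the crawl sketch above (the filename and record layout are assumptions, not the repository's exact format):

    import json

    def save_index(indexed, path="crawl_data.json"):
        """Serialize {url: (depth, predecessor)} records to a JSON file."""
        records = [
            {"url": url, "depth": depth, "predecessor": pred}
            for url, (depth, pred) in indexed.items()
        ]
        with open(path, "w") as f:
            json.dump(records, f, indent=2)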

Requirements:

- Requests (or Requests[security]) for verified SSL connections
- json (Python standard library) to serialize the crawl data
- Matplotlib to plot the graph
- NetworkX to build the graph from the list data
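The third-party dependencies can be installed with pip, for example:

python -m pip install requests[security] matplotlib networkx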

Examples:

An image of the codeacademy.com network
An image of the google.com network at 100 pages
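Graph images like those above could be produced from the depth/predecessor data with NetworkX and Matplotlib roughly as follows (a minimal sketch; the repository's actual plotting code may differ):

    import matplotlib.pyplot as plt
    import networkx as nx

    def draw_network(indexed, out="network.png"):
        """Build a directed graph from predecessor edges and save it as an image."""
        graph = nx.DiGraph()
        for url, (depth, pred) in indexed.items():
            graph.add_node(url)
            if pred is not None:
                graph.add_edge(pred, url)
        nx.draw(graph, node_size=20, with_labels=False)
        plt.savefig(out)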

NOTE: Crawling can sometimes take a really long time depending on the maxpages specified; the default is 100 pages.

Future:

- Set up a web app to perform crawling on a given user query and then present an interactive visualization.
- Use d3.js to visualize the website tree.

Author - bagarwa2
