A simple Python crawler supporting BFS and a PageRank-based priority queue. The project contains the following files:
py_crawler.py: Python script for the crawler, which can be run as
>> python py_crawler.py
Note that the program requires a Bing API key; you can obtain one at azure.com.
It takes the following three inputs (see the usage sketch below):
search_term: the query for the focused crawler
method: bfs or pagerank
num_pages: number of pages to be crawled
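
A minimal sketch of how the script might collect these inputs; the prompt strings and interactive input() handling are assumptions, not the script's actual wording:

```python
# Hypothetical input handling for py_crawler.py (actual prompts may differ).
search_term = input("search_term: ")       # e.g. "ebbets field"
method = input("method (bfs/pagerank): ")  # crawl strategy
num_pages = int(input("num_pages: "))      # e.g. 1000
```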
The Python script was used to generate the following four logs:
- ebbets field_bfs.log
- knuckle sandwich_bfs.log
- ebbets field_pagerank.log
- knuckle sandwich_pagerank.log
Each log file contains 1000 crawled URLs and other relevant information.
Description: The Python crawler has two settings (see the sketch below):
1. BFS: uses a simple FIFO queue and crawls pages in breadth-first order
2. PageRank: maintains a priority queue, re-running PageRank on the crawl graph after every 30 crawled URLs
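
A minimal sketch of the two frontier strategies, assuming networkx for the PageRank computation; the function names and data structures here are illustrative, not the actual ones in py_crawler.py:

```python
import heapq
from collections import deque

import networkx as nx  # assumed here for PageRank; the script may compute it differently


def bfs_next(frontier: deque) -> str:
    """BFS setting: plain FIFO queue, crawl URLs in discovery order."""
    return frontier.popleft()


def pagerank_next(frontier: list, graph: nx.DiGraph, crawled: int) -> str:
    """PageRank setting: re-rank the frontier after every 30 crawled URLs."""
    if crawled % 30 == 0:
        scores = nx.pagerank(graph)  # PageRank over the link graph seen so far
        # heapq is a min-heap, so store negated scores to pop the best URL first.
        frontier[:] = [(-scores.get(url, 0.0), url) for _, url in frontier]
        heapq.heapify(frontier)
    return heapq.heappop(frontier)[1]
```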
Major Functions:

| Function Name | Description | Library Used |
|---|---|---|
| get_seed | Gets the first 10 links from Bing | PyBing |
| can_fetch_url | Checks robots.txt for access allowance | Python RobotExclusion |
| save_file | Saves the HTML contents of crawled URLs; catches various HTTP error codes | Python urllib / urllib.HTTPError |
| Normalize | Normalizes the URL and adds a scheme ('http') | Python urlnorm |
| get_links | Parses the HTML file for links | BeautifulSoup |
| validate_links | Ensures that only HTML files are crawled | None |
| max_per_domain | Rate control (limits pages crawled per domain) | tldextract |
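
A minimal sketch of a few of these helpers. urllib.robotparser from the standard library stands in for the RobotExclusion parser, and the function bodies are illustrative rather than the actual implementations:

```python
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import tldextract
from bs4 import BeautifulSoup


def can_fetch_url(url: str, agent: str = "py_crawler") -> bool:
    """Check robots.txt before crawling (stdlib stand-in for RobotExclusion)."""
    rp = urllib.robotparser.RobotFileParser()
    parts = urlparse(url)
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False  # unreachable robots.txt: play it safe and skip the URL
    return rp.can_fetch(agent, url)


def get_links(html: str, base_url: str) -> list:
    """Parse an HTML page and return absolute link targets (BeautifulSoup)."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def domain_key(url: str) -> str:
    """Registered domain used for max_per_domain-style rate control (tldextract)."""
    ext = tldextract.extract(url)
    return f"{ext.domain}.{ext.suffix}"
```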
Non-Working Features:
Haven't yet catered to cases where two hostnames, e.g. cis.poly.edu and csserv2.poly.edu, belong to the same site and should be treated as one domain.
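
One possible fix, sketched here under the assumption that comparing registered domains is the intended behavior, would be to use tldextract:

```python
import tldextract


def same_site(url_a: str, url_b: str) -> bool:
    """Treat two URLs as the same site if their registered domains match."""
    a, b = tldextract.extract(url_a), tldextract.extract(url_b)
    return (a.domain, a.suffix) == (b.domain, b.suffix)


# same_site("http://cis.poly.edu/", "http://csserv2.poly.edu/") -> True (both poly.edu)
```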