- Use a Redis queue to support distributed crawling.
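A minimal sketch of the shared-queue pattern: workers on any machine push and pop tasks through Redis, so the queue becomes the coordination point. The key name `crawler:task_queue` and the JSON payload shape are assumptions, not the project's actual names; in a real deployment the `client` would be a `redis.Redis(...)` connection from redis-py.

```python
import json

TASK_QUEUE = "crawler:task_queue"  # hypothetical key name

def push_task(client, url):
    """Enqueue a URL for any worker in the cluster to pick up."""
    client.lpush(TASK_QUEUE, json.dumps({"url": url}))

def pop_task(client, timeout=5):
    """Blocking pop (FIFO: lpush + brpop); returns None on timeout."""
    item = client.brpop(TASK_QUEUE, timeout=timeout)
    if item is None:
        return None
    _key, payload = item
    return json.loads(payload)
```

Because both functions take the client as a parameter, local tests can substitute an in-memory stub while production passes a real Redis connection.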
- Print the exception traceback when an error occurs.
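One way this is commonly done is to wrap the fetch call and dump the full traceback instead of letting one bad page kill the crawler; `safe_fetch` and its signature are illustrative, not the project's actual API.

```python
import traceback

def safe_fetch(fetch, url):
    """Run a fetch function; on failure print the traceback and return None."""
    try:
        return fetch(url)
    except Exception:
        print("error while fetching %s" % url)
        traceback.print_exc()  # full stack trace to stderr, crawl continues
        return None
```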
- Use proxy IPs to get around anti-crawler restrictions.
- Separate the HTML parser into its own file.
- Implement a Bloom filter to test whether an element exists in a collection.
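A minimal Bloom filter sketch, assuming k independent hash probes derived from MD5 into an m-bit array (the project's actual hash choice and sizing may differ). It can report false positives but never false negatives, which is acceptable for URL deduplication.

```python
import hashlib

class BloomFilter:
    """k hash probes into an m-bit array: 'maybe present' or 'definitely absent'."""

    def __init__(self, size=1 << 20, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # Derive k positions by salting the item with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

With 2^20 bits and 5 probes the false-positive rate stays negligible until the filter holds tens of thousands of URLs.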
- Design the database and table schema.
- Use MySQL as the persistence layer to store user data.
- Encapsulate the Python MySQL client library behind a small data-access layer.
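A sketch of the encapsulation pattern: crawler code calls `save_user`/`get_user` and never builds SQL itself. The table and column names are assumptions based on the fields listed below, and sqlite3 stands in here so the example is self-contained; with pymysql the same DB-API shape applies, with `%s` placeholders instead of `?` and `ON DUPLICATE KEY UPDATE` instead of `INSERT OR REPLACE`.

```python
import sqlite3

USER_TABLE = """
CREATE TABLE IF NOT EXISTS user (
    username       TEXT PRIMARY KEY,
    brief_info     TEXT,
    industry       TEXT,
    university     TEXT,
    major          TEXT,
    answer_count   INTEGER DEFAULT 0,
    follower_count INTEEGER DEFAULT 0
)
"""

class UserStore:
    """Thin wrapper around a DB-API connection: all SQL lives in one place."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(USER_TABLE)

    def save_user(self, user):
        self.conn.execute(
            "INSERT OR REPLACE INTO user "
            "(username, brief_info, industry, university, major, "
            " answer_count, follower_count) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (user["username"], user.get("brief_info"), user.get("industry"),
             user.get("university"), user.get("major"),
             user.get("answer_count", 0), user.get("follower_count", 0)))
        self.conn.commit()

    def get_follower_count(self, username):
        row = self.conn.execute(
            "SELECT follower_count FROM user WHERE username = ?",
            (username,)).fetchone()
        return row[0] if row else None
```

Swapping the backing database then only means changing how `conn` is created, not touching crawler code.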
- Use an in-memory queue to support BFS crawling.
- Use a hash set to store visited URLs and avoid duplicate visits.
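The two bullets above combine into the classic BFS crawl loop: a deque gives FIFO order, and a set of visited URLs guarantees each page is fetched at most once. `get_links` is a stand-in for whatever the project's page fetcher/parser returns.

```python
import collections

def bfs_crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl from start_url, visiting each URL at most once."""
    queue = collections.deque([start_url])
    visited = {start_url}          # hash set: O(1) duplicate checks
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()      # FIFO pop -> breadth-first order
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)  # mark on enqueue, not on visit,
                queue.append(link) # so a URL is never queued twice
    return order
```

Marking URLs as visited at enqueue time (rather than at pop time) is the detail that keeps the queue free of duplicates.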
- Crawler can fetch each user's username, brief bio, current industry, university, major, and social-activity counts: answers, articles, questions asked, collections, followers, and the numbers of followed lives, topics, columns, questions, and collections.
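The fields above fit naturally into a single record type, which the parser can fill and the persistence layer can store. The field names here are illustrative mappings of that list, not the project's actual identifiers.

```python
from dataclasses import dataclass

@dataclass
class UserProfile:
    """One crawled user; counts default to 0 for fields missing from a page."""
    username: str
    brief_info: str = ""
    industry: str = ""
    university: str = ""
    major: str = ""
    answer_count: int = 0
    article_count: int = 0
    question_count: int = 0
    collection_count: int = 0
    follower_count: int = 0
    followed_live_count: int = 0
    followed_topic_count: int = 0
    followed_column_count: int = 0
    followed_question_count: int = 0
    followed_collection_count: int = 0
```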
- Print information to the console with a consistent format pattern.
- Support UTF-8 encoding.
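A sketch of the last two points together: one shared `str.format` template keeps console lines aligned, and because the template and values are unicode strings, non-ASCII usernames (e.g. Chinese) print correctly as long as the terminal uses UTF-8. The template and field names are assumptions.

```python
# Hypothetical one-line-per-user console template.
USER_LINE = "{username:<20} answers={answer_count:<6} followers={follower_count}"

def format_user(user):
    """Render one user dict as a fixed-width console line."""
    return USER_LINE.format(**user)

# Usage: print(format_user(user)) for each crawled user.
```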