Python 3 support and many more

@sibiryakov sibiryakov released this 18 Aug 09:41
· 404 commits to master since this release
  • Full Python 3 support 🍻 (#106), all thanks go to @Preetwinder.
  • canonicalize_url method removed in favor of the w3lib implementation.
  • The whole Request (incl. meta) is propagated to DB Worker, by means of scoring log (fixes #131)
  • CRC32 of the hostname is now generated the same way on both platforms: Python 2 and 3.
  • HBaseQueue now supports delayed requests. A 'crawl_at' field in meta holding a timestamp makes the request available to spiders only after that moment has passed. This is an important feature for revisiting.
  • The Request object is now persisted in HBaseQueue, which allows scheduling requests with specific meta, headers, body, and cookies parameters.
  • New MESSAGE_BUS_CODEC option allows choosing a message bus codec other than the default.
  • Strategy worker refactored to simplify its customization from subclasses.
  • Fixed a bug with extracted links distribution over spider log partitions (#129).
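
To illustrate the delayed-requests item, here is a minimal sketch of how a 'crawl_at' timestamp could be computed and checked. It uses plain dicts rather than Frontera's Request class. The helper names `make_delayed_meta` and `is_due` are hypothetical, and the timestamp is assumed to be a Unix time in seconds; the notes above state neither the unit nor how the queue performs the check.

```python
import time


def make_delayed_meta(delay_seconds, now=None):
    """Build a meta dict whose 'crawl_at' lies delay_seconds in the future.

    Hypothetical helper; assumes 'crawl_at' is a Unix timestamp in seconds.
    """
    now = time.time() if now is None else now
    return {'crawl_at': int(now + delay_seconds)}


def is_due(meta, now=None):
    """Return True once the 'crawl_at' moment has passed, or if it is absent."""
    now = time.time() if now is None else now
    return meta.get('crawl_at', 0) <= now


# Schedule a revisit one hour after a fixed reference time of 1000.
meta = make_delayed_meta(3600, now=1000)
```

A request carrying this meta would be held back while `is_due(meta)` is false, which is the behaviour the 'crawl_at' field is described as providing for revisiting.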
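
On the CRC32 item: `zlib.crc32` returns a signed integer on Python 2 and an unsigned one on Python 3, so the raw value can differ between the two. The sketch below shows one common way to get identical results on both, by masking to an unsigned 32-bit value. The function name `hostname_crc32` is hypothetical, and the notes above do not say which representation (signed or unsigned) Frontera settled on.

```python
import zlib


def hostname_crc32(hostname):
    """Return the CRC32 of a hostname as an unsigned 32-bit integer.

    Hypothetical helper, not Frontera's implementation. Text input is
    encoded as UTF-8 first so that str and bytes give the same checksum.
    """
    if not isinstance(hostname, bytes):
        hostname = hostname.encode('utf-8')
    # The mask folds Python 2's signed result into the range [0, 2**32 - 1].
    return zlib.crc32(hostname) & 0xffffffff
```

Because such a checksum typically decides which partition a host lands in, a mismatch between interpreters would send the same host to different partitions.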