Version 0.9.2: some major changes in document parsing

@AndyTheFactory AndyTheFactory released this 14 Jan 11:36
· 35 commits to master since this release
97fdcb0
  • You can now use the module as a command line interface (CLI). Usage: python -m newspaper --url https://www.test.com. More information is available in the documentation.
  • I have added an evaluation script that runs against a dataset from scrapinghub. This will help keep track of future improvements.
  • Better handling of multithreaded requests. The previous version had a bug that could lead to a deadlock. I now use ThreadPoolExecutor from the concurrent.futures module, which is more stable. The previous news_pool was replaced with a fetch_news() function.
  • Caching is now much more flexible. You can disable it completely or for one request.
  • You can now use the newspaper.article() function for convenience. It creates, downloads and parses an article in one step. It takes all the parameters of the Article class.
  • Sites protected by Cloudflare are now detected more reliably and raise an exception. The exception message states the reason.
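
To illustrate the ThreadPoolExecutor approach mentioned in the multithreading note above, here is a minimal standard-library sketch of the general pattern. It is not the library's actual fetch_news() implementation: the names fetch_all and fake_download are hypothetical, and the download step is a stub that never touches the network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(url: str) -> str:
    """Hypothetical stand-in for a real download; returns a placeholder page."""
    if "broken" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"<html>content of {url}</html>"


def fetch_all(urls, download=fake_download, threads=4):
    """Run `download` for every URL on a bounded thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    that the download raised, so one failure does not stop the others.
    """
    results = {}
    # The `with` block waits for all submitted work and shuts the pool down
    # cleanly, even if an error occurs while collecting results.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        future_to_url = {pool.submit(download, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep the error next to its URL
                results[url] = exc
    return results


if __name__ == "__main__":
    pages = fetch_all(["https://www.test.com/a", "https://www.test.com/broken"])
    for url, outcome in pages.items():
        print(url, "->", outcome)
```

Letting the executor own the worker threads means no thread has to be joined by hand, which is the kind of bookkeeping where hand-rolled pools tend to deadlock.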