Fixed unicode handling, Python 3 support, Request as network backend, better content root extraction and other awesome features #248
base: develop
…d install bs3 under py3
* STOP_WORDS are stored as unicode, so word candidates must also be kept as unicode in order to compare them against the dictionary correctly.
* In some languages, short stop words are joined to the next word in the sentence with a non-breaking space. To detect those stop words, the tokenizer must support nbsp. Russian is an example. So this fixes grangier#223
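The two points above can be illustrated with a minimal sketch. This is not the code from the commit: the `STOP_WORDS` contents and the helper names `candidate_words` and `count_stop_words` are hypothetical, chosen only to show why a non-breaking space has to count as a word separator.

```python
import re
import string

# Hypothetical stop-word set for illustration only; goose loads its own
# per-language lists.
STOP_WORDS = {"в", "и", "на", "не"}

# Split on whitespace, naming U+00A0 (nbsp) explicitly so the intent is clear.
_SEPARATORS = re.compile("[\\s\u00a0]+")


def candidate_words(text):
    """Return lower-cased unicode tokens with ASCII punctuation stripped."""
    tokens = _SEPARATORS.split(text.strip())
    return [t.strip(string.punctuation).lower() for t in tokens if t]


def count_stop_words(text):
    """Count tokens that appear in the (unicode) stop-word set."""
    return sum(1 for word in candidate_words(text) if word in STOP_WORDS)


if __name__ == "__main__":
    sample = "Он живёт в\u00a0Москве и\u00a0работает"
    print(candidate_words(sample))
    print(count_stop_words(sample))
```

If only the ASCII space were treated as a separator, "в" and "и" would stay glued to the following word and would not be found in the stop-word set.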
See https://github.com/vetal4444/python-goose/tree/python_3 grangier#220 Conflicts: goose/text.py
Python 3 support
HTML fetching is now done with requests. Using requests allows writing high-level code that encapsulates the network and HTML levels (decoding gzip, etc.).
1.0.28: * Move to requests as network library
Some special tags can be false positives, so we have to process them all to select the best top node.
Requests prefers the content encoding given in the HTTP headers, but for HTML the encoding declared in the document's tags is the better choice.
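A minimal sketch of tags-preferred decoding, assuming the raw response bytes and the header-derived encoding are already available. The names `pick_encoding` and `decode_html` are hypothetical, and the regular expression is a simplification, not the detection logic used in this PR.

```python
import re

# Matches both <meta charset="..."> and the http-equiv form
# <meta ... content="text/html; charset=...">. Simplified on purpose.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE
)


def pick_encoding(raw_html, header_encoding=None, default="utf-8"):
    """Prefer the encoding declared in the HTML tags over the header one."""
    match = _META_CHARSET.search(raw_html[:4096])
    if match:
        return match.group(1).decode("ascii")
    return header_encoding or default


def decode_html(raw_html, header_encoding=None):
    """Decode bytes using the tag-declared encoding when it is usable."""
    encoding = pick_encoding(raw_html, header_encoding)
    try:
        return raw_html.decode(encoding, errors="replace")
    except LookupError:
        # The declared encoding name is unknown to Python; fall back to UTF-8.
        return raw_html.decode("utf-8", errors="replace")


if __name__ == "__main__":
    page = '<html><head><meta charset="windows-1251"></head><body>Привет</body></html>'
    raw = page.encode("cp1251")
    print(decode_html(raw, header_encoding="ISO-8859-1"))
```

With requests, the raw bytes would come from `response.content` and the header-derived value from `response.encoding`.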
Moving to requests as the HTTP library broke the test mocks, which mocked urllib. This commit fixes the tests by using the mock_requests library for mocking instead of the urllib one.
It is not clear why it was there in the first place, as valid html does not contain such header. Again this is not connected to the test itself.
This brings automatic cookie handling, keep-alive connections, and maybe some other features.
After moving to the requests HTTP backend, cookies are handled correctly. Test URL http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp checked working.
Python 3.4, Python 3.5 added
* Requests used for images. The same HTTP session is used for all requests.
* Analyze all possible text root nodes and select the best one; do not stop at the first text root node candidate.
* Improve text selection filters.
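The "select the best one" idea above can be sketched as follows. The scoring function is a hypothetical stand-in (a plain stop-word count); the source does not describe how goose actually scores candidates. Only the control flow is the point: every candidate is scored before one is chosen.

```python
def score_node(text, stop_words):
    """Hypothetical score: number of stop words found in the node's text."""
    return sum(1 for word in text.lower().split() if word in stop_words)


def best_text_root(candidates, stop_words):
    """Score every (name, text) candidate and return the best name.

    Returns None when no candidate scores above zero.
    """
    best_name, best_score = None, 0
    for name, text in candidates:
        score = score_node(text, stop_words)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    stop_words = {"the", "on", "and"}
    candidates = [
        ("sidebar", "the latest news"),
        ("article", "the cat sat on the mat and the dog"),
    ]
    print(best_text_root(candidates, stop_words))
```

A first-match strategy would return "sidebar" here, because it appears earlier in the document.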
Config parameter is `known_context_patterns`. Default:
{
    'known_context_patterns': [
        {'attr': 'class', 'value': 'short-story'},
        {'attr': 'itemprop', 'value': 'articleBody'},
        {'attr': 'class', 'value': 'post-content'},
        {'attr': 'class', 'value': 'g-content'},
        {'tag': 'article'},
    ]
}
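To show how such patterns might be applied, here is a minimal sketch using the standard library's `xml.etree.ElementTree` on a well-formed snippet. The helpers `matches` and `find_known_nodes` are hypothetical, and matching an attribute value against space-separated tokens is an assumption; the source does not say how goose compares values.

```python
import xml.etree.ElementTree as ET

# Default patterns as listed in the commit message.
KNOWN_CONTEXT_PATTERNS = [
    {'attr': 'class', 'value': 'short-story'},
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'attr': 'class', 'value': 'g-content'},
    {'tag': 'article'},
]


def matches(pattern, element):
    """Return True if the element satisfies a tag or attr/value pattern."""
    if 'tag' in pattern:
        return element.tag == pattern['tag']
    # Assumption: an attribute matches when any of its tokens equals the value.
    return pattern['value'] in element.get(pattern['attr'], '').split()


def find_known_nodes(root, patterns=KNOWN_CONTEXT_PATTERNS):
    """Collect, in document order, all elements matching any pattern."""
    return [el for el in root.iter() if any(matches(p, el) for p in patterns)]


if __name__ == "__main__":
    doc = ET.fromstring(
        '<html><body>'
        '<div class="nav">menu</div>'
        '<div class="post-content wide">story</div>'
        '<article>body</article>'
        '</body></html>'
    )
    print([el.text for el in find_known_nodes(doc)])
```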
When performing network requests, use the request timeout provided by the goose configuration.
Swallowing errors makes it difficult to understand whether something went wrong with the network, goose, or the target resource. So strict mode (now the default) is introduced. In this mode goose will raise an exception instead of returning empty responses.
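The difference between the two modes can be sketched as below. `FetchError`, `fetch_html`, and the injected `fetcher` callable are hypothetical names for illustration, not goose's API. The sketch assumes Python 3 (`raise ... from`).

```python
class FetchError(Exception):
    """Hypothetical error type raised in strict mode."""


def fetch_html(url, fetcher, strict=True):
    """Call fetcher(url); in strict mode, surface failures to the caller.

    With strict=False the error is swallowed and an empty string is
    returned, which hides whether the network, the library, or the
    target resource was at fault.
    """
    try:
        return fetcher(url)
    except OSError as exc:
        if strict:
            raise FetchError("failed to fetch %s: %s" % (url, exc)) from exc
        return ""


if __name__ == "__main__":
    def unreachable(url):
        raise OSError("connection refused")

    print(repr(fetch_html("http://example.invalid/", unreachable, strict=False)))
    try:
        fetch_html("http://example.invalid/", unreachable)
    except FetchError as err:
        print("strict mode raised:", err)
```

Catching `OSError` is enough for this sketch because, on Python 3, the base exception class in requests derives from it.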
@grangier please merge this, Python 3 compatibility would be great to have
@grangier +1 on merging this PR. Python3 support is really needed.
@grainger Please merge, we are no longer using python2x
FYI, I've produced a pypi package. I appreciate all the work that @grangier has done, but I really needed goose to work on python3. If you'd like to fix any bugs, tests, etc. I'm more than happy to put in time to look at pull requests and merge them. Thank you.