Fixed unicode handling, Python 3 support, Request as network backend, better content root extraction and other awesome features #248
base: develop
Commits on Jun 29, 2014
- 8e2f875: Merge branch 'release/v1.0.19' (Xavier Grangier)
Commits on Jul 14, 2014
- 93e8239 (Xavier Grangier)
- a275c45: Merge branch 'master' of github.com:grangier/python-goose (Xavier Grangier)
Commits on Sep 14, 2014
- fba20fd (Xavier Grangier)
- f5dc260 (Xavier Grangier)
Commits on Jan 6, 2015
- 3b058f9
Commits on Apr 8, 2015
- 4478e69
- 94b1a20
- 6d91565
Commits on Apr 9, 2015
- 79a12dd
- 76af358
- f44c2af
- 2e18083
Commits on Apr 15, 2015
- b7884f1
- 9028761
- 74743ab
- 5fbc788: Remove install from requirements file in travis.yml, as there is no way to avoid installing bs3 under py3
- 1d02932
- 9091e38
Commits on Apr 16, 2015
- 8fa55b4
- 1ef277b
Commits on Nov 13, 2015
- 964eb48: Fix unicode processing + nbsp support
  * As STOP_WORDS are stored in unicode format, word candidates should be kept in unicode as well so that they can be compared against the dictionary correctly.
  * In some languages, short stop words are joined to the next word in the sentence by a non-breaking space. To recognize those stop words, the tokenizer should support nbsp. Russian is an example.
  This fixes grangier#223.
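The stop-word fix described in commit 964eb48 can be illustrated with a short sketch. This is not the goose tokenizer: `STOP_WORDS` is a tiny hypothetical set, and `candidate_words` and `stop_word_count` are names invented for the example. It assumes Python 3, where `str` is already unicode.

```python
NBSP = "\u00a0"  # non-breaking space

# Hypothetical sample set, not goose's real stop-word data.
STOP_WORDS = {"в", "и", "на"}


def candidate_words(text):
    """Split text into lowercase unicode tokens, treating NBSP as a separator."""
    return text.replace(NBSP, " ").lower().split()


def stop_word_count(text):
    """Count tokens that appear in the unicode stop-word set."""
    return sum(1 for word in candidate_words(text) if word in STOP_WORDS)


# Short Russian prepositions are often glued to the next word with NBSP.
sample = "В" + NBSP + "Москве и на" + NBSP + "улице"
print(candidate_words(sample))
print(stop_word_count(sample))
```

Replacing the non-breaking space explicitly keeps the intent visible even on interpreters or string types whose default `split()` would not treat it as whitespace.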
Commits on Jan 11, 2016
- dbbbb7f: Merge into python3 support from Vetal4444
  See https://github.com/vetal4444/python-goose/tree/python_3 (grangier#220).
  Conflicts: goose/text.py
Commits on Jan 12, 2016
- 40cdd84
Commits on Jan 13, 2016
- 7292933: Move to requests as network lib
  HTML fetching is now done with requests. Using requests allows writing high-level code that encapsulates the network and HTML levels (decoding gzip, etc.).
- 87808d2
- 5b4ef12: Some special tags can be false positives, so we have to process them all to select the best top node
- 15aeb23: Do not parse binary to text with requests
  Requests uses the headers-preferred content encoding, but for HTML the better choice is the tags-preferred content encoding.
- f7eee8b
- 21bc2b2
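The "headers-preferred" versus "tags-preferred" distinction in commit 15aeb23 can be sketched as follows. With requests, `response.text` is decoded using the encoding taken from the HTTP headers, while `response.content` is the raw bytes. The sketch below works on raw bytes and prefers a charset declared in a `<meta>` tag. The function names and the regular expression are assumptions made for illustration, not the code in this pull request.

```python
import re

# Matches <meta charset="..."> and the http-equiv form with "charset=..." in content.
META_CHARSET_RE = re.compile(rb'''<meta[^>]+charset=["']?([\w-]+)''', re.IGNORECASE)


def pick_encoding(raw_html, header_encoding=None, default="utf-8"):
    """Prefer the charset declared in the markup over the HTTP header value."""
    match = META_CHARSET_RE.search(raw_html[:4096])  # declarations sit near the top
    if match:
        return match.group(1).decode("ascii")
    return header_encoding or default


def decode_html(raw_html, header_encoding=None):
    """Decode raw bytes with the tag-preferred encoding."""
    encoding = pick_encoding(raw_html, header_encoding)
    try:
        return raw_html.decode(encoding, errors="replace")
    except LookupError:  # the page declared an encoding name Python does not know
        return raw_html.decode("utf-8", errors="replace")


page = '<html><head><meta charset="windows-1251"></head><body>Привет</body></html>'
raw = page.encode("windows-1251")
# The header claims ISO-8859-1, but the tag wins.
print(decode_html(raw, header_encoding="ISO-8859-1"))
```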
Commits on Jan 14, 2016
- 6849ce6
- 024760c
- a5bd141
Commits on Jan 20, 2016
- da6cc1d: Moving to requests as the http library broke the test mocks, which relied on urllib mocking. This commit fixes the tests by using the mock_requests library for mocking instead of the urllib one.
- c64b245
- c7a207d
- 320fa7d
- c0d6b69
- 4724f82
- 0398cd1: Drop xml declaration in test data
  It is not clear why it was there in the first place, as valid HTML does not contain such a header. Again, this is not connected to the test itself.
- 397465c
- 2900678
- fd0712a: Use same http session for text & images
  This enables automatic cookie handling, keep-alive connections, and possibly some other features.
- 60ab80a
- b61bd4c
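The idea behind commit fd0712a, one http session shared by text and image requests, can be sketched like this. `Fetcher`, `fetch_html`, and `fetch_image` are hypothetical names, not the classes in this pull request. To keep the example runnable offline, the session is injected and a small fake stands in for it; any object with a `requests.Session`-like `get(url, timeout=...)` would do.

```python
class Fetcher:
    """Fetch pages and images through one shared session object."""

    def __init__(self, session, timeout=30):
        # `session` is anything with a requests.Session-like get(url, timeout=...).
        self.session = session
        self.timeout = timeout

    def fetch_html(self, url):
        """Return the raw page bytes."""
        return self.session.get(url, timeout=self.timeout).content

    def fetch_image(self, url):
        """Return the raw image bytes via the same session as the page."""
        return self.session.get(url, timeout=self.timeout).content


class FakeResponse:
    """Minimal stand-in for a response object."""
    content = b"payload"


class FakeSession:
    """Offline stand-in that records calls, used only to show the sharing."""

    def __init__(self):
        self.calls = []

    def get(self, url, timeout=None):
        self.calls.append((url, timeout))
        return FakeResponse()


session = FakeSession()
fetcher = Fetcher(session, timeout=10)
fetcher.fetch_html("http://example.com/article.html")
fetcher.fetch_image("http://example.com/lead.jpg")
print(session.calls)  # both requests went through the same session
```

With the real library, passing `requests.Session()` as `session` would let cookies set by the article page be sent with the image requests and let connections be reused.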
Commits on Jan 21, 2016
- f712600: After moving to the requests http backend, cookies are handled correctly. Test URL http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp checked and working.
- abd427b
- 9632746:
  * Requests is used for images. The same http session is used for all requests.
  * Analyze all possible text root nodes and select the best one; do not stop at the first text root node candidate.
  * Improve text selection filters.
- 8644cfe
- 118d220: Make list of known DOM nodes configurable
  Config parameter is `known_context_patterns`. Default:
  {
      'known_context_patterns': [
          {'attr': 'class', 'value': 'short-story'},
          {'attr': 'itemprop', 'value': 'articleBody'},
          {'attr': 'class', 'value': 'post-content'},
          {'attr': 'class', 'value': 'g-content'},
          {'tag': 'article'},
      ]
  }
- 28de450: Use http timeout provided by configuration
  When performing network requests, use the request timeout provided by the goose configuration.
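The `known_context_patterns` setting from commit 118d220 is a list of small dicts, each naming either an attribute/value pair or a tag. A minimal sketch of how such patterns could be turned into lookups is shown below. It uses the standard library's ElementTree on well-formed markup rather than goose's own parser, and `pattern_to_path` and `find_known_nodes` are hypothetical helpers. Exact attribute equality is an assumption here; the real matching rules may differ.

```python
import xml.etree.ElementTree as ET

# Default patterns as listed in the commit message.
KNOWN_CONTEXT_PATTERNS = [
    {'attr': 'class', 'value': 'short-story'},
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'attr': 'class', 'value': 'g-content'},
    {'tag': 'article'},
]


def pattern_to_path(pattern):
    """Translate one pattern dict into an ElementTree path expression."""
    if 'tag' in pattern:
        return ".//%s" % pattern['tag']
    return ".//*[@%s='%s']" % (pattern['attr'], pattern['value'])


def find_known_nodes(root, patterns=KNOWN_CONTEXT_PATTERNS):
    """Collect every descendant matching any pattern, in pattern order."""
    nodes = []
    for pattern in patterns:
        nodes.extend(root.findall(pattern_to_path(pattern)))
    return nodes


doc = ET.fromstring(
    "<html><body>"
    "<div class='sidebar'>links</div>"
    "<div class='post-content'>story text</div>"
    "<article>more text</article>"
    "</body></html>"
)
print([node.tag for node in find_known_nodes(doc)])
```

Keeping the patterns as plain data means a caller can extend the list for a particular site without touching the matching code.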
Commits on Jan 26, 2016
- 67858ee: Raise exception on network error in strict mode
  Swallowing errors makes it difficult to tell whether something went wrong with the network, with goose, or with the target resource. Strict mode (now the default) is therefore introduced. In this mode goose raises an exception instead of returning empty responses.
- 69e5d80
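The strict-mode behaviour from commit 67858ee, together with the configurable timeout from commit 28de450, can be sketched as a small wrapper. The names `NetworkError` and `fetch`, and the `strict` and `timeout` parameters, are assumptions for this example and are not taken from the pull request. The transport is injected as a callable so the sketch runs without network access.

```python
class NetworkError(RuntimeError):
    """Hypothetical error type for failed fetches (name chosen for this sketch)."""


def fetch(url, get, timeout=30, strict=True):
    """Call get(url, timeout=...) and either raise or swallow network failures."""
    try:
        return get(url, timeout=timeout)
    except OSError as exc:  # requests' RequestException derives from IOError/OSError
        if strict:
            raise NetworkError("failed to fetch %s: %s" % (url, exc)) from exc
        return None  # non-strict: silently return an empty result


def failing_get(url, timeout=None):
    """Offline stand-in for a transport that cannot reach the host."""
    raise ConnectionError("connection refused")


print(fetch("http://example.com/", failing_get, strict=False))
try:
    fetch("http://example.com/", failing_get, strict=True)
except NetworkError as exc:
    print("strict mode raised:", exc)
```

Chaining with `from exc` keeps the original error attached, so a caller can still see whether the failure came from the network or from the target resource.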