Fixed unicode handling, Python 3 support, Request as network backend, better content root extraction and other awesome features #248

Lol4t0 · 2015-11-13T15:30:11Z

* STOP_WORDS are stored as unicode, so word candidates must also be kept as unicode in order to compare them against the dictionary correctly.
* In some languages, short stop words are joined to the next word in the sentence with a non-breaking space. To recognize those stop words, the tokenizer must treat nbsp as a separator. Russian is an example, so this fixes "Russian articles are not extracted" #223.
* Python 3 support ("Add python 3 support" #220 merged).
* Move to the requests library for the HTTP backend. This makes "Handle gzipped pages gracefully" #244, "add gzip deflation to HtmlFetcher" #237, and "adding cookies support" #64 obsolete, and fixes some issues in the tracker.
* Analyze all possible text root nodes and select the best one instead of stopping at the first candidate.
* Improve text selection filters.
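
To illustrate the stop-word point, here is a minimal sketch of nbsp-aware tokenizing with unicode candidates. It is not the code from this PR: `STOP_WORDS` below is a tiny hypothetical set, and `candidate_words` and `count_stop_words` are names invented for the example.

```python
import re

# Hypothetical, tiny stop-word set; the real lists are loaded per language.
STOP_WORDS = {"в", "на", "и", "не"}

# Split on any whitespace. U+00A0 is listed explicitly to make the intent
# obvious, although \s already matches it for str patterns in Python 3.
_SEPARATORS = re.compile(r"[\s\u00a0]+")


def candidate_words(text):
    """Return lowercase unicode word candidates from str or UTF-8 bytes."""
    if isinstance(text, bytes):
        # Decode first so candidates compare equal to unicode stop words.
        text = text.decode("utf-8")
    return [word for word in _SEPARATORS.split(text.lower()) if word]


def count_stop_words(text):
    """Count candidates that appear in the stop-word set."""
    return sum(1 for word in candidate_words(text) if word in STOP_WORDS)
```

Without the nbsp handling, `в\u00a0Москве` would stay a single token and the stop word `в` would never be counted.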


Config parameter is `known_context_patterns`. Default:

    {
        'known_context_patterns': [
            {'attr': 'class', 'value': 'short-story'},
            {'attr': 'itemprop', 'value': 'articleBody'},
            {'attr': 'class', 'value': 'post-content'},
            {'attr': 'class', 'value': 'g-content'},
            {'tag': 'article'},
        ]
    }
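
As a rough illustration of how such patterns could be applied, the sketch below walks a parsed tree and collects the nodes that match, in pattern order. It uses the standard library's `xml.etree.ElementTree` so that it runs on its own, and it assumes class values are matched per space-separated token; `matches` and `find_known_context_nodes` are hypothetical names, not goose functions.

```python
import xml.etree.ElementTree as ET

KNOWN_CONTEXT_PATTERNS = [
    {'attr': 'class', 'value': 'short-story'},
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'attr': 'class', 'value': 'g-content'},
    {'tag': 'article'},
]


def matches(element, pattern):
    """Check one element against one pattern (assumed semantics)."""
    if 'tag' in pattern:
        return element.tag == pattern['tag']
    # Attribute values such as class may hold several space-separated tokens.
    tokens = element.get(pattern['attr'], '').split()
    return pattern['value'] in tokens


def find_known_context_nodes(root, patterns=KNOWN_CONTEXT_PATTERNS):
    """Return matching elements, ordered by pattern priority."""
    found = []
    for pattern in patterns:
        for element in root.iter():
            if matches(element, pattern) and element not in found:
                found.append(element)
    return found
```

The example only works on well-formed markup; real pages need an HTML parser.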

Swallowing errors makes it difficult to tell whether something went wrong with the network, with goose, or with the target resource. Strict mode is therefore introduced and is now the default. In this mode goose raises an exception instead of returning an empty response.
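
The difference between the two modes can be sketched as follows. This is an assumption-laden illustration rather than the PR's code: `FetchError`, `fetch_html`, the `get` callable, and the `strict` argument name are all hypothetical, and the sketch targets Python 3.

```python
class FetchError(Exception):
    """Hypothetical error type raised in strict mode."""


def fetch_html(url, get, strict=True):
    """Fetch a page through the callable `get`.

    In strict mode a failure is re-raised with context; otherwise it is
    swallowed and None is returned, which is the old behaviour this
    change moves away from.
    """
    try:
        return get(url)
    except Exception as exc:
        if strict:
            raise FetchError("failed to fetch %s: %s" % (url, exc)) from exc
        return None
```

With `strict=False` the caller cannot distinguish a network failure from an empty page, which is the problem described above.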

andreis · 2016-03-15T15:47:56Z

@grangier please merge this, Python 3 compatibility would be great to have

adityarustgi · 2016-05-02T19:47:28Z

@grangier +1 on merging this PR. Python3 support is really needed.

sandeepsayone · 2016-05-30T09:09:04Z

@grainger Please merge, we are no longer using Python 2.x

lababidi · 2017-03-28T20:23:19Z

FYI, I've published a PyPI package, goose3, which can be found at https://github.com/goose3/goose3

I appreciate all the work that @grangier has done, but I really needed goose to work on python3. If you'd like to fix any bugs, tests, etc I'm more than happy to put in time to look at pull requests and merge them. Thank you.

Xavier Grangier and others added 30 commits June 29, 2014 11:33

Merge branch 'release/v1.0.19'

8e2f875

Merge branch 'release/1.0.20'

93e8239

Merge branch 'master' of github.com:grangier/python-goose

a275c45

Merge branch 'release/1.0.21'

fba20fd

Merge branch 'release/1.0.22'

f5dc260

Merge remote-tracking branch 'origin/master' into develop

3b058f9

Merge remote-tracking branch 'upstream/develop' into develop

4478e69

Add python3 support

94b1a20

Update requirements

6d91565

Add python3 to CLASSIFIERS

79a12dd

Optimize imports

76af358

Restore python 2.6 support

f44c2af

Try to fix tests in python 2.6

2e18083

Fix smart_str

b7884f1

Fix ValueError if we get document with set encoding. Add test case for this.

9028761

Add py 3.4 to travis.yml

74743ab

Remove install from requirements file in travis.yml as no way to avoid install bs3 under py3

5fbc788

Close image file after use

1d02932

Fix tests

9091e38

Fix encoding detection

8fa55b4

Fix test runner under py3

1ef277b

Merge into python3 support from Vetal4444

dbbbb7f

See https://github.com/vetal4444/python-goose/tree/python_3 (grangier#220). Conflicts: goose/text.py

Merge pull request #1 from Lol4t0/python_3

40cdd84

Python 3 support

Move to requests as network lib

7292933

HTML fetching is now done with requests. Using requests allows writing high-level code that encapsulates the network and HTML levels (decoding gzip, etc.).

Draft new release

87808d2

1.0.28: * Move to requests as network library

Allow multiple 'special tags'

5b4ef12

Some special tags can be false positives, so we have to process them all to select the best top node
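
A minimal sketch of the "score every candidate, keep the best" idea follows. The stop-word count used as the score is an assumption made for the example, and `STOP_WORDS`, `stop_word_score`, and `best_text_root` are hypothetical names rather than goose internals.

```python
# Hypothetical stop-word set used only to score the candidates below.
STOP_WORDS = {"the", "and", "on", "it", "was", "a", "of"}


def stop_word_score(text):
    """Score a candidate by how many stop words its text contains."""
    return sum(1 for word in text.lower().split() if word in STOP_WORDS)


def best_text_root(candidates, score=stop_word_score):
    """Evaluate every candidate and return the highest-scoring one.

    Returns None when there are no candidates at all.
    """
    return max(candidates, key=score, default=None)
```

Taking the first match would return whichever candidate appears earliest in the document, even when it is a false positive such as a subscription box.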

Do not parse binary to text with requests

15aeb23

Requests uses headers-preferred content encoding, but for HTML the better choice is tags-preferred content encoding
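
One way to get tags-preferred decoding is to take the raw bytes (with requests, `response.content`) and look for a charset declared in a `<meta>` tag before falling back to the header encoding (`response.encoding`). The sketch below assumes a simple regular-expression scan of the first 4096 bytes; `pick_encoding` and `decode_html` are hypothetical helpers, not the PR's implementation.

```python
import codecs
import re

# Matches both <meta charset="..."> and the http-equiv Content-Type form.
_META_CHARSET = re.compile(
    br"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)


def pick_encoding(raw, header_encoding=None, default="utf-8"):
    """Prefer the charset declared in the markup over the HTTP header."""
    candidates = []
    match = _META_CHARSET.search(raw[:4096])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    if header_encoding:
        candidates.append(header_encoding)
    for name in candidates:
        try:
            codecs.lookup(name)  # skip names Python cannot decode
            return name
        except LookupError:
            continue
    return default


def decode_html(raw, header_encoding=None):
    """Decode raw page bytes to text using the tags-preferred encoding."""
    return raw.decode(pick_encoding(raw, header_encoding), errors="replace")
```

This matters when a server sends a generic or wrong charset in `Content-Type` while the page itself declares the correct one.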

Fix `map only iterable once` in py3 issue

f7eee8b
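
For context, this class of bug comes from `map` returning a one-shot iterator in Python 3 instead of a list. The snippet below is a generic illustration, not the code touched by this commit; `normalized` is a hypothetical helper.

```python
words = ["Goose", "Python", "Requests"]

# In Python 3, map() returns a lazy iterator that can be consumed only once.
lazy = map(str.lower, words)
first_pass = list(lazy)
second_pass = list(lazy)  # the iterator is already exhausted here


def normalized(items):
    """Materialize the result so it can be iterated any number of times."""
    return list(map(str.lower, items))


stable = normalized(words)
```

Wrapping the call in `list(...)`, or using a list comprehension, behaves the same on Python 2 and Python 3.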

Additional content extraction filter

21bc2b2

Lol4t0 and others added 21 commits January 14, 2016 16:05

Fix clean issue

6849ce6

Do not stop on first found text candidate

024760c

Add all sort of disclaimers to the trash

a5bd141

Fix tests crash with requests

da6cc1d

Moving to requests as the HTTP library broke the test mocks, which mocked urllib. This commit fixes the tests by using the mock_requests library for mocking instead of the urllib one.

Fix deprecated class warning

c64b245

Use requests for image extraction

c7a207d

Fix image extraction mocker

320fa7d

Move test dependencies out of main deps

c0d6b69

Undo wrong edit

4724f82

Drop xml declaration in test data

0398cd1

It is not clear why it was there in the first place, as valid HTML does not contain such a header. Again, this is not connected to the test itself.

Move unittest2 to test dependencies

397465c

Try to build with python 3.5

2900678

Use same http session for text & images

fd0712a

This gives us automatic cookie handling, keep-alive connections, and possibly some other features
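
A shared session could look roughly like the sketch below, which assumes the requests library is installed. `HttpClient` and its method names are hypothetical and are not the classes used in goose; only `requests.Session`, `Session.get`, `raise_for_status`, and `response.content` are real requests API.

```python
import requests


class HttpClient(object):
    """Hypothetical wrapper that shares one session for pages and images."""

    def __init__(self, user_agent, timeout):
        # One Session keeps cookies and reuses connections across requests.
        self.session = requests.Session()
        self.session.headers.update({"User-Agent": user_agent})
        self.timeout = timeout

    def _get(self, url):
        response = self.session.get(url, timeout=self.timeout)
        response.raise_for_status()
        return response.content  # raw bytes; decoding is left to the caller

    def fetch_html(self, url):
        return self._get(url)

    def fetch_image(self, url):
        return self._get(url)
```

Because both fetch methods go through the same session, cookies set while loading the article are sent again when its images are requested.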

Fix function name

60ab80a

Fix API

b61bd4c

Remove obsolete known issues

f712600

After moving to the requests HTTP backend, cookies are handled correctly. The test URL http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp was checked and works.

Adjust classifiers

abd427b

Python 3.4, Python 3.5 added

Draft new release 1.0.29

9632746

* Requests used for images. Same http session is used for all requests.
* Analyze all possible text root nodes and select best one, do not stop on first text root node candidate
* Improve text selection filters

Remove unused import

8644cfe

Use http timeout provided by configuration

28de450

When performing network requests, use the request timeout provided by the goose configuration

Lol4t0 changed the title ~~Fix unicode processing +   support~~ Fixed unicode handling, Python 3 support, Request as network backend, better content root extraction and other awesome features Jan 23, 2016

Lol4t0 added 2 commits January 26, 2016 17:17

Do not remove other headers when setting User-agent

69e5d80
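
The pitfall named in this commit title can be shown with requests directly, assuming the library is installed. This is an illustration rather than the commit's diff, and `set_user_agent` is a hypothetical helper.

```python
import requests


def set_user_agent(session, user_agent):
    """Set User-Agent while keeping the session's other default headers."""
    # update() merges into the existing headers. Assigning a fresh dict to
    # session.headers would instead drop defaults such as Accept-Encoding.
    session.headers.update({"User-Agent": user_agent})
    return session


session = set_user_agent(requests.Session(), "goose-example/1.0")
```

Keeping `Accept-Encoding` matters here because it is what lets the server send gzip-compressed pages.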
