Fixed unicode handling, Python 3 support, Requests as network backend, better content root extraction and other awesome features #248

Open
wants to merge 53 commits into
base: develop
from

Commits on Jun 29, 2014

  1. Merge branch 'release/v1.0.19'

    Xavier Grangier committed Jun 29, 2014
    8e2f875

Commits on Jul 14, 2014

  1. Merge branch 'release/1.0.20'

    Xavier Grangier committed Jul 14, 2014
    93e8239
  2. Merge branch 'master' of github.com:grangier/python-goose

    Xavier Grangier committed Jul 14, 2014
    a275c45

Commits on Sep 14, 2014

  1. Merge branch 'release/1.0.21'

    Xavier Grangier committed Sep 14, 2014
    fba20fd
  2. Merge branch 'release/1.0.22'

    Xavier Grangier committed Sep 14, 2014
    f5dc260

Commits on Jan 6, 2015

  1. 3b058f9

Commits on Apr 8, 2015

  1. 4478e69
  2. Add python3 support

    vetal4444 committed Apr 8, 2015
    94b1a20
  3. Update requirements

    vetal4444 committed Apr 8, 2015
    6d91565

Commits on Apr 9, 2015

  1. Add python3 to CLASSIFIERS

    vetal4444 committed Apr 9, 2015
    79a12dd
  2. Optimize imports

    vetal4444 committed Apr 9, 2015
    76af358
  3. Restore python 2.6 support

    vetal4444 committed Apr 9, 2015
    f44c2af
  4. 2e18083

Commits on Apr 15, 2015

  1. Fix smart_str

    vetal4444 committed Apr 15, 2015
    b7884f1
  2. 9028761
  3. Add py 3.4 to travis.yml

    vetal4444 committed Apr 15, 2015
    74743ab
  4. Remove install from requirements file in travis.yml, as there is no way to avoid installing bs3 under py3

    vetal4444 committed Apr 15, 2015
    5fbc788
  5. Close image file after use

    vetal4444 committed Apr 15, 2015
    1d02932
  6. Fix tests

    vetal4444 committed Apr 15, 2015
    9091e38

Commits on Apr 16, 2015

  1. Fix encoding detection

    vetal4444 committed Apr 16, 2015
    8fa55b4
  2. Fix test runner under py3

    vetal4444 committed Apr 16, 2015
    1ef277b

Commits on Nov 13, 2015

  1. Fix unicode processing + nbsp support

    * STOP_WORDS are stored in unicode format, so candidate words should also be kept in unicode so that they can be compared against the dictionary correctly.
    * In some languages, short stop words are joined to the next word in the sentence with a non-breaking space. To detect those stop words, nbsp should be supported when tokenizing.
     Russian is an example. So this fixes grangier#223
    Lol4t0 committed Nov 13, 2015
    964eb48
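
The tokenization change described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the contents of `STOP_WORDS` and the helper names `candidate_words` and `count_stop_words` are assumptions made for the example.

```python
import re

# Hypothetical stop-word set for illustration; kept as unicode (str) values.
STOP_WORDS = {"в", "на", "и", "не"}

# In Python 3, \s already matches U+00A0 for str patterns; the non-breaking
# space is listed explicitly to make the intent visible.
_SEPARATORS = re.compile(r"[\s\u00a0]+")


def candidate_words(text):
    """Split text on whitespace, including non-breaking spaces."""
    return [word.lower() for word in _SEPARATORS.split(text) if word]


def count_stop_words(text):
    """Count candidate words that appear in the unicode stop-word set."""
    return sum(1 for word in candidate_words(text) if word in STOP_WORDS)
```

Because both the candidates and the stop words are plain unicode strings, a short word glued to its neighbour with a non-breaking space is still split off and matched.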

Commits on Jan 11, 2016

  1. dbbbb7f

Commits on Jan 12, 2016

  1. Merge pull request #1 from Lol4t0/python_3

    Python 3 support
    Lol4t0 committed Jan 12, 2016
    40cdd84

Commits on Jan 13, 2016

  1. Move to requests as network lib

    HTML fetching is now done with requests.
    
    Using requests allows writing high-level code that encapsulates the network and HTML levels (gzip decoding, etc.)
    Lol4t0 committed Jan 13, 2016
    7292933
  2. Draft new release

    1.0.28:
    
      * Move to requests as network library
    Lol4t0 committed Jan 13, 2016
    87808d2
  3. Allow multiple 'special tags'

    Some special tags can be false positives, so all of them have to be processed to select the best top node
    Lol4t0 committed Jan 13, 2016
    5b4ef12
  4. Do not parse binary to text with requests

    Requests prefers the content encoding declared in the headers, but for HTML the encoding declared in the tags is the better choice
    Lol4t0 committed Jan 13, 2016
    15aeb23
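
The idea of preferring the encoding declared in the HTML tags can be sketched as below. With requests, this roughly corresponds to reading the raw bytes from `Response.content` instead of the header-decoded `Response.text`. The helper name `decode_html`, the 4096-byte scan window, and the regular expression are assumptions for this example, not the logic used in the commit.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def decode_html(raw, header_encoding=None, default="utf-8"):
    """Decode raw HTML bytes, preferring the charset declared in a meta tag."""
    match = _META_CHARSET.search(raw[:4096])
    if match:
        encoding = match.group(1).decode("ascii")
    else:
        # Fall back to the encoding from the HTTP headers, then to a default.
        encoding = header_encoding or default
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The declared encoding name is unknown to Python.
        return raw.decode(default, errors="replace")
```

A page served with a misleading `Content-Type` header but a correct meta tag is then decoded according to the tag.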
  5. f7eee8b
  6. 21bc2b2

Commits on Jan 14, 2016

  1. Fix clean issue

    Lol4t0 committed Jan 14, 2016
    6849ce6
  2. 024760c
  3. a5bd141

Commits on Jan 20, 2016

  1. Fix tests crash with requests

    Moving to requests as the HTTP library broke the test mocks, which relied on mocking urllib.
    This commit fixes the tests by using the mock_requests library for mocking instead of the urllib one.
    Lol4t0 committed Jan 20, 2016
    da6cc1d
  2. Fix deprecated class warning

    Lol4t0 committed Jan 20, 2016
    c64b245
  3. c7a207d
  4. Fix image extraction mocker

    Lol4t0 committed Jan 20, 2016
    320fa7d
  5. c0d6b69
  6. Undo wrong edit

    Lol4t0 committed Jan 20, 2016
    4724f82
  7. Drop xml declaration in test data

    It is not clear why it was there in the first place, as valid HTML does not contain such a header.
    
    Again this is not connected to the test itself.
    Lol4t0 committed Jan 20, 2016
    0398cd1
  8. 397465c
  9. Try to build with python 3.5

    Lol4t0 committed Jan 20, 2016
    2900678
  10. Use same http session for text & images

    This enables automatic cookie handling, keep-alive connections, and possibly some other features
    Lol4t0 committed Jan 20, 2016
    fd0712a
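
Sharing one session between text and image requests might look like the sketch below. The class name `SharedSessionFetcher` and its methods are hypothetical; the `session` argument is assumed to behave like `requests.Session`, that is, to expose `get(url)` returning an object with a `content` attribute.

```python
class SharedSessionFetcher:
    """Fetch HTML and images through one shared HTTP session (sketch)."""

    def __init__(self, session):
        # Assumed to be a requests.Session-like object; cookies and
        # keep-alive connections live on the session, so both kinds of
        # request benefit from them.
        self._session = session

    def fetch_html(self, url):
        """Return the raw bytes of an HTML page."""
        return self._session.get(url).content

    def fetch_image(self, url):
        """Return the raw bytes of an image, using the same session."""
        return self._session.get(url).content
```

With the real library, an instance would be created as `SharedSessionFetcher(requests.Session())`.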
  11. Fix function name

    Lol4t0 committed Jan 20, 2016
    60ab80a
  12. Fix API

    Lol4t0 committed Jan 20, 2016
    b61bd4c

Commits on Jan 21, 2016

  1. Remove obsolete known issues

    After moving to the requests HTTP backend, cookies are handled correctly.
    
    Test URL http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp checked and working
    Lol4t0 committed Jan 21, 2016
    f712600
  2. Adjust classifiers

    Python 3.4, Python 3.5 added
    Lol4t0 committed Jan 21, 2016
    abd427b
  3. Draft new release 1.0.29

    * Requests used for images. Same http session is used for all requests.
    * Analyze all possible text root nodes and select best one, do not stop on first text root node candidate
    * Improve text selection filters
    Lol4t0 committed Jan 21, 2016
    9632746
  4. Remove unused import

    Lol4t0 committed Jan 21, 2016
    8644cfe
  5. Make list of known DOM nodes configurable

    Config parameter is `known_context_patterns'
    Default:
    
    	{
    		'known_context_patterns': [
    		    {'attr': 'class', 'value': 'short-story'},
    		    {'attr': 'itemprop', 'value': 'articleBody'},
    		    {'attr': 'class', 'value': 'post-content'},
    		    {'attr': 'class', 'value': 'g-content'},
    		    {'tag': 'article'},
    		]
    	}
    Lol4t0 committed Jan 21, 2016
    118d220
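
How such patterns could be matched against DOM nodes is sketched below. The pattern list is the default quoted in the commit message; everything else is an assumption for illustration: the helper names `matches_pattern` and `find_known_nodes`, the representation of a node as a `(tag, attrs)` pair, and the choice to match attribute values as whitespace-separated tokens.

```python
KNOWN_CONTEXT_PATTERNS = [
    {'attr': 'class', 'value': 'short-story'},
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'attr': 'class', 'value': 'g-content'},
    {'tag': 'article'},
]


def matches_pattern(tag, attrs, pattern):
    """Return True if a node (tag name + attribute dict) fits one pattern."""
    if 'tag' in pattern:
        return tag == pattern['tag']
    # Assumption: an attribute such as class may hold several tokens.
    tokens = attrs.get(pattern['attr'], '').split()
    return pattern['value'] in tokens


def find_known_nodes(elements, patterns=KNOWN_CONTEXT_PATTERNS):
    """Keep the (tag, attrs) pairs that match at least one pattern."""
    return [(tag, attrs) for tag, attrs in elements
            if any(matches_pattern(tag, attrs, p) for p in patterns)]
```

Passing a different `patterns` list stands in for overriding the configuration parameter.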
  6. Use http timeout provided by configuration

    When performing network requests, use the request timeout provided by the goose configuration
    Lol4t0 committed Jan 21, 2016
    28de450

Commits on Jan 26, 2016

  1. Raise exception on network error in strict mode

    Swallowing errors makes it difficult to understand whether something went wrong with the network, goose, or the target resource.
    
    So a strict mode (now the default) is introduced. In this mode goose raises an Exception instead of returning empty responses.
    Lol4t0 committed Jan 26, 2016
    67858ee
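
The difference between strict and non-strict behaviour can be illustrated with the sketch below. It is not the implementation from this commit: `NetworkError`, `fetch`, and the `strict` and `timeout` parameter names are hypothetical, and `session` is assumed to behave like `requests.Session`.

```python
class NetworkError(Exception):
    """Hypothetical exception type used only in this sketch."""


def fetch(session, url, timeout=30, strict=True):
    """Fetch raw bytes; raise in strict mode, return b"" otherwise."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
    except Exception as exc:
        # Real code would likely catch requests.RequestException here;
        # a broad except keeps the sketch independent of the library.
        if strict:
            raise NetworkError("failed to fetch %s: %s" % (url, exc)) from exc
        return b""
    return response.content
```

In strict mode the caller sees the failure and its cause; in non-strict mode an empty response is indistinguishable from an empty page, which is the ambiguity the commit message describes.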
  2. 69e5d80