Fixed unicode handling, Python 3 support, Requests as network backend, better content root extraction and other awesome features #248

Open
wants to merge 53 commits into base: develop

Commits (53)
8e2f875
Merge branch 'release/v1.0.19'
Jun 29, 2014
93e8239
Merge branch 'release/1.0.20'
Jul 14, 2014
a275c45
Merge branch 'master' of github.com:grangier/python-goose
Jul 14, 2014
fba20fd
Merge branch 'release/1.0.21'
Sep 14, 2014
f5dc260
Merge branch 'release/1.0.22'
Sep 14, 2014
3b058f9
Merge remote-tracking branch 'origin/master' into develop
pistolero Jan 6, 2015
4478e69
Merge remote-tracking branch 'upstream/develop' into develop
vetal4444 Apr 8, 2015
94b1a20
Add python3 support
vetal4444 Apr 8, 2015
6d91565
Update requirements
vetal4444 Apr 8, 2015
79a12dd
Add python3 to CLASSIFIERS
vetal4444 Apr 9, 2015
76af358
Optimize imports
vetal4444 Apr 9, 2015
f44c2af
Restore python 2.6 support
vetal4444 Apr 9, 2015
2e18083
Try to fix tests in python 2.6
vetal4444 Apr 9, 2015
b7884f1
Fix smart_str
vetal4444 Apr 15, 2015
9028761
Fix ValueError if we get document with set encoding. Add test case fo…
vetal4444 Apr 15, 2015
74743ab
Add py 3.4 to travis.yml
vetal4444 Apr 15, 2015
5fbc788
Remove install from requirements file in travis.yml as no way to avoi…
vetal4444 Apr 15, 2015
1d02932
Close image file after use
vetal4444 Apr 15, 2015
9091e38
Fix tests
vetal4444 Apr 15, 2015
8fa55b4
Fix encoding detection
vetal4444 Apr 16, 2015
1ef277b
Fix test runner under py3
vetal4444 Apr 16, 2015
964eb48
Fix unicode processing + ` ` support
Lol4t0 Nov 13, 2015
dbbbb7f
Merge into python3 support from Vetal4444
Lol4t0 Jan 11, 2016
40cdd84
Merge pull request #1 from Lol4t0/python_3
Lol4t0 Jan 12, 2016
7292933
Move to requests as network lib
Lol4t0 Jan 13, 2016
87808d2
Draft new release
Lol4t0 Jan 13, 2016
5b4ef12
Allow multiple 'special tags'
Lol4t0 Jan 13, 2016
15aeb23
Do not parse binary to text with requests
Lol4t0 Jan 13, 2016
f7eee8b
Fix `map only iterable once` in py3 issue
Lol4t0 Jan 13, 2016
21bc2b2
Additional content extraction filter
Lol4t0 Jan 13, 2016
6849ce6
Fix clean issue
Lol4t0 Jan 14, 2016
024760c
Do not stop on first found text candidate
Lol4t0 Jan 14, 2016
a5bd141
Add all sort of disclaimers to the trash
Lol4t0 Jan 14, 2016
da6cc1d
Fix tests crash with requests
Lol4t0 Jan 20, 2016
c64b245
Fix deprecated class warning
Lol4t0 Jan 20, 2016
c7a207d
Use requests for image extraction
Lol4t0 Jan 20, 2016
320fa7d
Fix image extraction mocker
Lol4t0 Jan 20, 2016
c0d6b69
Move test dependencies out of main deps
Lol4t0 Jan 20, 2016
4724f82
Undo wrong edit
Lol4t0 Jan 20, 2016
0398cd1
Drop xml declaration in test data
Lol4t0 Jan 20, 2016
397465c
Move `unittest2` to test dependencies
Lol4t0 Jan 20, 2016
2900678
Try to build with python 3.5
Lol4t0 Jan 20, 2016
fd0712a
Use same http session for text & images
Lol4t0 Jan 20, 2016
60ab80a
Fix function name
Lol4t0 Jan 20, 2016
b61bd4c
Fix API
Lol4t0 Jan 20, 2016
f712600
Remove obsolete known issues
Lol4t0 Jan 21, 2016
abd427b
Adjust classifiers
Lol4t0 Jan 21, 2016
9632746
Draft new release 1.0.29
Lol4t0 Jan 21, 2016
8644cfe
Remove unused import
Lol4t0 Jan 21, 2016
118d220
Make list of known DOM nodes configurable
Lol4t0 Jan 21, 2016
28de450
Use http timeout provided by configuration
Lol4t0 Jan 21, 2016
67858ee
Raise exception on network error in strict mode
Lol4t0 Jan 26, 2016
69e5d80
Do not remove other headers when setting `User-agent`
Lol4t0 Jan 26, 2016
4 changes: 3 additions & 1 deletion .travis.yml
@@ -3,9 +3,11 @@ language: python
 python:
 - 2.6
 - 2.7
+- 3.4
+- 3.5

 install:
-- pip install -r requirements.txt --use-mirrors
 - pip install jieba
+- python setup.py install

 script: python setup.py test
20 changes: 1 addition & 19 deletions README.rst
@@ -180,7 +180,7 @@ class.


Goose in Korean
-----------------
+---------------

In order to use Goose in Korean you have to use the StopWordsKorean
class.
@@ -197,24 +197,6 @@ class.
14년째 세계 각국의 통신·안전·전파 규격 시험과 인증 한 우물만 파고 있는 이 회사 박채규 대표가 만나기로 한 주인공이다.
그는 전기전자·무선통신·자동차 전장품 분야에


Known issues
------------

- There are some issues with unicode URLs.
- Cookie handling : Some websites need cookie handling. At the moment the only work around is to use the raw_html extraction. For instance:

>>> import urllib2
>>> import goose
>>> url = "http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp"
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
>>> response = opener.open(url)
>>> raw_html = response.read()
>>> g = goose.Goose()
>>> a = g.extract(raw_html=raw_html)
>>> a.cleaned_text
u'CAIRO \u2014 For a moment, at least, American and European diplomats trying to defuse the volatile standoff in Egypt thought they had a breakthrough.\n\nAs t'

TODO
----

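This whole workaround disappears because the PR moves networking to Requests, where every Session carries its own cookie jar. A small sketch of session-level cookie handling (the cookie name and value are illustrative; requires the requests package):

```python
import requests

# A Session persists cookies across requests on its own, so the
# urllib2.HTTPCookieProcessor plumbing above is unnecessary with
# the new backend.
session = requests.Session()
session.cookies.set('sessionid', 'abc123', domain='example.com')
stored = session.cookies.get('sessionid')
```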
10 changes: 6 additions & 4 deletions goose/__init__.py
@@ -21,7 +21,6 @@
limitations under the License.
"""
import os
import platform
from tempfile import mkstemp

from goose.version import version_info, __version__
@@ -64,9 +63,12 @@ def crawl(self, crawl_candiate):
         try:
             crawler = Crawler(self.config)
             article = crawler.crawl(crawl_candiate)
-        except (UnicodeDecodeError, ValueError):
-            self.config.parser_class = parsers[0]
-            return self.crawl(crawl_candiate)
+        except (UnicodeDecodeError, ValueError) as e:
+            if parsers:
+                self.config.parser_class = parsers[0]
+                return self.crawl(crawl_candiate)
+            else:
+                raise e
         return article

def initialize(self):
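The new except branch retries with the next available parser and re-raises once the list is exhausted. A standalone sketch of that control flow (parse_with and the parser names are stand-ins, not the real Crawler API):

```python
def crawl_with_fallback(parsers, parse_with, candidate):
    """Try each parser in order; fall back on decode errors, re-raise when none remain."""
    try:
        return parse_with(parsers[0], candidate)
    except (UnicodeDecodeError, ValueError):
        remaining = parsers[1:]
        if remaining:
            return crawl_with_fallback(remaining, parse_with, candidate)
        raise

def parse_with(parser, candidate):
    # toy stand-in: 'lxml' chokes on this input, 'soup' succeeds
    if parser == 'lxml':
        raise ValueError("could not parse")
    return "article via %s" % parser

result = crawl_with_fallback(['lxml', 'soup'], parse_with, "<html/>")
```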
7 changes: 4 additions & 3 deletions goose/cleaners.py
@@ -20,6 +20,8 @@
See the License for the specific language governing permissions and
limitations under the License.
"""
from __future__ import unicode_literals

from goose.utils import ReplaceSequence


@@ -48,7 +50,7 @@ def __init__(self, config, article):
"|communitypromo|runaroundLeft|subscribe|vcard|articleheadings"
"|date|^print$|popup|author-dropdown|tools|socialtools|byline"
"|konafilter|KonaFilter|breadcrumbs|^fn$|wp-caption-text"
-            "|legende|ajoutVideo|timestamp|js_replies"
+            "|legende|ajoutVideo|timestamp|js_replies|disclaim"
)
self.regexp_namespace = "http://exslt.org/regular-expressions"
self.nauthy_ids_re = "//*[re:test(@id, '%s', 'i')]" % self.remove_nodes_re
@@ -66,8 +68,7 @@ def __init__(self, config, article):
.append("\t")\
.append("^\\s+$")

-    def clean(self):
-        doc_to_clean = self.article.doc
+    def clean(self, doc_to_clean):
doc_to_clean = self.clean_body_classes(doc_to_clean)
doc_to_clean = self.clean_article_tags(doc_to_clean)
doc_to_clean = self.clean_em_tags(doc_to_clean)
23 changes: 21 additions & 2 deletions goose/configuration.py
@@ -22,6 +22,9 @@
"""
import os
import tempfile

import six

from goose.text import StopWords
from goose.parsers import Parser
from goose.parsers import ParserSoup
@@ -30,10 +33,20 @@
HTTP_DEFAULT_TIMEOUT = 30

 AVAILABLE_PARSERS = {
-    'lxml': Parser,
-    'soup': ParserSoup,
+    'lxml': Parser
 }

+if six.PY2:
+    AVAILABLE_PARSERS['soup'] = ParserSoup

KNOWN_ARTICLE_CONTENT_PATTERNS = [
{'attr': 'class', 'value': 'short-story'},
{'attr': 'itemprop', 'value': 'articleBody'},
{'attr': 'class', 'value': 'post-content'},
{'attr': 'class', 'value': 'g-content'},
{'tag': 'article'},
]


class Configuration(object):

@@ -99,6 +112,12 @@ def __init__(self):
# http timeout
self.http_timeout = HTTP_DEFAULT_TIMEOUT

# known context patterns. Goose at first will search context at dom nodes, qualifying these patterns
self.known_context_patterns = KNOWN_ARTICLE_CONTENT_PATTERNS

# Strict mode. Generate exceptions on errors instead of swallowing them
self.strict = True

def get_parser(self):
return AVAILABLE_PARSERS[self.parser_class]

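Pulling the new options together, configuring a crawl might look like the sketch below. The attribute names (known_context_patterns, strict, http_timeout) come from the diff above; the trimmed Configuration class and the 'entry-content' pattern are illustrative, not the full API:

```python
# Pattern list mirroring KNOWN_ARTICLE_CONTENT_PATTERNS from the diff
KNOWN_ARTICLE_CONTENT_PATTERNS = [
    {'attr': 'class', 'value': 'short-story'},
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'attr': 'class', 'value': 'g-content'},
    {'tag': 'article'},
]

class Configuration(object):
    def __init__(self):
        self.http_timeout = 30            # seconds, passed through to the fetcher
        self.strict = True                # raise on errors instead of swallowing them
        self.known_context_patterns = list(KNOWN_ARTICLE_CONTENT_PATTERNS)

config = Configuration()
config.strict = False                     # tolerate network errors
config.http_timeout = 10
# add a site-specific content pattern (hypothetical class name)
config.known_context_patterns.append({'attr': 'class', 'value': 'entry-content'})
```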
21 changes: 11 additions & 10 deletions goose/crawler.py
@@ -39,7 +39,7 @@
from goose.cleaners import StandardDocumentCleaner
from goose.outputformatters import StandardOutputFormatter

-from goose.network import HtmlFetcher
+from goose.network import NetworkFetcher


class CrawlCandidate(object):
@@ -99,11 +99,13 @@ def __init__(self, config):
# title extractor
self.title_extractor = self.get_title_extractor()

+        # html fetcher
+        self.fetcher = NetworkFetcher(self.config)
+
         # image extrator
         self.image_extractor = self.get_image_extractor()

-        # html fetcher
-        self.htmlfetcher = HtmlFetcher(self.config)


# TODO : log prefix
self.logPrefix = "crawler:"
@@ -161,7 +163,10 @@ def crawl(self, crawl_candidate):
self.article.doc = article_body

# before we do any calcs on the body itself let's clean up the document
-        self.article.doc = self.cleaner.clean()
+        if not isinstance(self.article.doc, list):
+            self.article.doc = [self.cleaner.clean(self.article.doc)]
+        else:
+            self.article.doc = list(map(lambda doc1: self.cleaner.clean(deepcopy(doc1)), self.article.doc))

# big stuff
self.article.top_node = self.extractor.calculate_best_node()
@@ -212,11 +217,7 @@ def get_html(self, crawl_candidate, parsing_candidate):
return crawl_candidate.raw_html

# fetch HTML
-        html = self.htmlfetcher.get_html(parsing_candidate.url)
-        self.article.additional_data.update({
-            'request': self.htmlfetcher.request,
-            'result': self.htmlfetcher.result,
-        })
+        html = self.fetcher.fetch(parsing_candidate.url)
return html

def get_metas_extractor(self):
@@ -244,7 +245,7 @@ def get_title_extractor(self):
return TitleExtractor(self.config, self.article)

def get_image_extractor(self):
-        return ImageExtractor(self.config, self.article)
+        return ImageExtractor(self.fetcher, self.config, self.article)

def get_video_extractor(self):
return VideoExtractor(self.config, self.article)
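The cleaning step above now accepts either a single document or a list of candidate documents, deep-copying each list entry so the originals survive cleaning. A minimal sketch of that normalization (clean here is a dummy standing in for StandardDocumentCleaner.clean, operating on dicts rather than lxml trees):

```python
from copy import deepcopy

def clean(doc):
    # dummy cleaner: mark the document as cleaned
    return {'cleaned': True, 'tag': doc['tag']}

def clean_all(doc_or_docs):
    """Always return a list of cleaned documents, copying list entries first."""
    if not isinstance(doc_or_docs, list):
        return [clean(doc_or_docs)]
    return [clean(deepcopy(d)) for d in doc_or_docs]

single = clean_all({'tag': 'article'})
many = clean_all([{'tag': 'article'}, {'tag': 'div'}])
```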
3 changes: 3 additions & 0 deletions goose/exceptions.py
@@ -0,0 +1,3 @@
from .network import NetworkError

__all__ = ['NetworkError']
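Together with the strict flag added in goose/configuration.py, the intended contract is: network failures raise (the NetworkError re-exported here) in strict mode and are swallowed otherwise. A sketch of that behavior, with NetworkError and do_request as local stand-ins for the real goose/network.py code:

```python
class NetworkError(Exception):
    """Stand-in for the exception defined in goose/network.py."""

def fetch(url, strict=True):
    def do_request(u):
        raise IOError("connection refused")   # simulate a network failure
    try:
        return do_request(url)
    except IOError as e:
        if strict:
            raise NetworkError(str(e))        # strict mode: surface the error
        return None                           # lenient mode: swallow it

lenient = fetch("http://example.com", strict=False)
try:
    fetch("http://example.com", strict=True)
    raised = False
except NetworkError:
    raised = True
```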
33 changes: 14 additions & 19 deletions goose/extractors/content.py
@@ -25,13 +25,6 @@
from goose.extractors import BaseExtractor


KNOWN_ARTICLE_CONTENT_TAGS = [
{'attr': 'itemprop', 'value': 'articleBody'},
{'attr': 'class', 'value': 'post-content'},
{'tag': 'article'},
]


class ContentExtractor(BaseExtractor):

def get_language(self):
@@ -47,16 +40,17 @@ def get_language(self):
return self.config.target_language

def get_known_article_tags(self):
-        for item in KNOWN_ARTICLE_CONTENT_TAGS:
-            nodes = self.parser.getElementsByTag(
-                self.article.doc,
-                **item)
-            if len(nodes):
-                return nodes[0]
+        nodes = []
+        for item in self.config.known_context_patterns:
+            nodes.extend(self.parser.getElementsByTag(
+                self.article.doc,
+                **item))
+        if len(nodes):
+            return nodes
         return None
return None

def is_articlebody(self, node):
-        for item in KNOWN_ARTICLE_CONTENT_TAGS:
+        for item in self.config.known_context_patterns:
# attribute
if "attr" in item and "value" in item:
if self.parser.getAttribute(node, item['attr']) == item['value']:
@@ -260,7 +254,7 @@ def update_score(self, node, addToScore):
if score_string:
current_score = int(score_string)

-        new_score = current_score + addToScore
+        new_score = current_score + int(addToScore)
self.parser.setAttribute(node, "gravityScore", str(new_score))

def update_node_count(self, node, add_to_count):
@@ -315,16 +309,17 @@ def get_node_gravity_score(self, node):
return None
return int(grvScoreString)

-    def nodes_to_check(self, doc):
+    def nodes_to_check(self, docs):
         """\
         returns a list of nodes we want to search
         on like paragraphs and tables
         """
         nodes_to_check = []

-        for tag in ['p', 'pre', 'td']:
-            items = self.parser.getElementsByTag(doc, tag=tag)
-            nodes_to_check += items
+        for doc in docs:
+            for tag in ['p', 'pre', 'td']:
+                items = self.parser.getElementsByTag(doc, tag=tag)
+                nodes_to_check += items
         return nodes_to_check

def is_table_and_no_para_exist(self, e):
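The reworked get_known_article_tags() collects every node matching any configured pattern instead of stopping at the first hit, which is what lets the crawler score multiple content candidates. A self-contained sketch (find_nodes is a toy stand-in for parser.getElementsByTag, working on dicts instead of lxml nodes):

```python
def find_nodes(doc, tag=None, attr=None, value=None):
    """Toy stand-in for parser.getElementsByTag over a list of dict 'nodes'."""
    def matches(node):
        if tag is not None and node.get('tag') != tag:
            return False
        if attr is not None and node.get(attr) != value:
            return False
        return True
    return [n for n in doc if matches(n)]

def get_known_article_tags(doc, patterns):
    # collect ALL matches across the configured patterns, not just the first
    nodes = []
    for item in patterns:
        nodes.extend(find_nodes(doc, **item))
    return nodes if nodes else None

doc = [
    {'tag': 'article'},
    {'tag': 'div', 'class': 'post-content'},
    {'tag': 'div', 'class': 'sidebar'},
]
patterns = [{'attr': 'class', 'value': 'post-content'}, {'tag': 'article'}]
found = get_known_article_tags(doc, patterns)
```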
9 changes: 4 additions & 5 deletions goose/extractors/images.py
@@ -23,7 +23,7 @@
import re
import os

-from urlparse import urlparse, urljoin
+from six.moves.urllib.parse import urlparse, urljoin

from goose.extractors import BaseExtractor
from goose.image import Image
@@ -48,9 +48,10 @@

class ImageExtractor(BaseExtractor):

-    def __init__(self, config, article):
+    def __init__(self, fetcher, config, article):
         super(ImageExtractor, self).__init__(config, article)

+        self.fetcher = fetcher
self.custom_site_mapping = {}

self.load_customesite_mapping()
@@ -333,9 +334,7 @@ def get_local_image(self, src):
"""\
returns the bytes of the image file on disk
"""
-        local_image = ImageUtils.store_image(None,
-            self.link_hash, src, self.config)
-        return local_image
+        return ImageUtils.store_image(self.fetcher, self.link_hash, src, self.config)

def get_clean_domain(self):
if self.article.domain:
4 changes: 2 additions & 2 deletions goose/extractors/metas.py
@@ -22,8 +22,8 @@
"""

import re
-from urlparse import urljoin
-from urlparse import urlparse
+
+from six.moves.urllib.parse import urlparse, urljoin

from goose.extractors import BaseExtractor

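six.moves gives one import path that resolves to the urlparse module on Python 2 and urllib.parse on Python 3. A quick demonstration of the two calls Goose uses (the try/except fallback is only so the snippet also runs without six installed):

```python
try:
    from six.moves.urllib.parse import urlparse, urljoin
except ImportError:  # no six available: Python 3 stdlib location
    from urllib.parse import urlparse, urljoin

base = "http://example.com/news/story.html"
absolute = urljoin(base, "/img/photo.jpg")   # resolve a relative image src
domain = urlparse(absolute).netloc           # extract the host
```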
4 changes: 2 additions & 2 deletions goose/image.py
@@ -46,7 +46,7 @@ def __init__(self):
self.extraction_type = "NA"

# stores how many bytes this image is.
-        self.bytes = long(0)
+        self.bytes = 0

def get_src(self):
return self.src
@@ -87,7 +87,7 @@ def set_mime_type(self, mime_type):
class LocallyStoredImage(object):

def __init__(self, src='', local_filename='',
-                 link_hash='', bytes=long(0), file_extension='', height=0, width=0):
+                 link_hash='', bytes=0, file_extension='', height=0, width=0):
self.src = src
self.local_filename = local_filename
self.link_hash = link_hash
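Python 3 removed the long type (int is unbounded there), so the long(0) initializers are replaced with plain 0, which behaves the same on both interpreters. When code still needs an "any integer" type check across versions, the usual pattern is the one six.integer_types provides, inlined here to stay dependency-free:

```python
import sys

# six.integer_types does exactly this check
if sys.version_info[0] >= 3:
    integer_types = (int,)
else:
    integer_types = (int, long)  # noqa: F821 -- long exists only on Python 2

nbytes = 0  # replaces long(0); ints are unbounded on Python 3
is_integer = isinstance(nbytes, integer_types)
```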