[extractors/title.py] None value for site_name in line 40 #266

Open
kbandla opened this issue Mar 11, 2017 · 0 comments

Trigger

>>> from goose import Goose
>>> url = 'https://www.alienvault.com/blogs/security-essentials/11-simple-yet-important-tips-to-secure-aws'
>>> g = Goose()
>>> article = g.extract(url=url)

Traceback

Traceback (most recent call last):
    article = g.extract(url=url)
  File "/scripts/venv/lib/python2.7/site-packages/goose/__init__.py", line 56, in extract
    return self.crawl(cc)
  File "/scripts/venv/lib/python2.7/site-packages/goose/__init__.py", line 66, in crawl
    article = crawler.crawl(crawl_candiate)
  File "/scripts/venv/lib/python2.7/site-packages/goose/crawler.py", line 154, in crawl
    self.article.title = self.title_extractor.extract()
  File "/scripts/venv/lib/python2.7/site-packages/goose/extractors/title.py", line 99, in extract
    return self.get_title()
  File "/scripts/venv/lib/python2.7/site-packages/goose/extractors/title.py", line 78, in get_title
    return self.clean_title(title)
  File "/scripts/venv/lib/python2.7/site-packages/goose/extractors/title.py", line 42, in clean_title
    title = title.replace(site_name, '').strip()
TypeError: expected a string or other character buffer object

Fix

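Check the value of site_name after it is read from self.article.opengraph. The key can be present while its value is None, and title.replace(None, '') raises the TypeError above. If site_name is None, leave the title unchanged.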

        if "site_name" in self.article.opengraph.keys():
            site_name = self.article.opengraph['site_name']
            # remove the site name from title
            if site_name:
                title = title.replace(site_name, '').strip()
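
The guard can be tried outside of goose with a minimal sketch. The function name strip_site_name and the plain dict used for the opengraph data are assumptions for illustration only; they are not part of goose's API.

```python
def strip_site_name(title, opengraph):
    """Hypothetical stand-in for the site_name handling in clean_title."""
    # dict.get returns None when the key is missing and also when the
    # key is present but maps to None (the case reported in this issue).
    site_name = opengraph.get("site_name")
    # Only strip when site_name is a non-empty string; calling
    # title.replace(None, '') would raise a TypeError.
    if site_name:
        title = title.replace(site_name, "").strip()
    return title


if __name__ == "__main__":
    # site_name present and usable: it is removed from the title.
    print(strip_site_name("Secure AWS AlienVault", {"site_name": "AlienVault"}))
    # site_name present but None: the title is returned unchanged.
    print(strip_site_name("Secure AWS", {"site_name": None}))
```

Using a truthiness check rather than "is not None" also skips an empty site_name, for which the replace would be a no-op anyway.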