Goose fails in extracting articles from The New York Times #252

Closed
manalsali opened this issue Dec 22, 2015 · 5 comments

Comments

@manalsali

I ran the following code:

import urllib2
import goose
url = "http://www.nytimes.com/reuters/2015/12/21/world/africa/21reuters-kenya-attacks-somalia.html?_r=0"
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open(url)
raw_html = response.read()
g = goose.Goose()
a = g.extract(raw_html=raw_html)
a.cleaned_text
u''
An empty string is returned.

@robmcdan

Your URL string looks malformed, but in any case I fixed a similar issue in #225 and opened a pull request, though I don't think it was ever approved. My fork has this issue fixed, and I've been parsing NYTimes happily ever since.

Good luck!

Rob

@grangier
Owner

@robmcdan the PR has not been accepted because it breaks the test suite.

@manalsali
Author

@robmcdan @grangier I just bypassed Goose and wrote a small snippet using BeautifulSoup and urllib2 that works.
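
The snippet itself was not posted in this thread. As a rough illustration of that kind of workaround, here is a minimal sketch that pulls paragraph text out of already-fetched HTML. It is an assumption-laden stand-in: it uses Python 3 and the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, and the names `ParagraphTextExtractor` and `extract_paragraph_text` are hypothetical, not part of Goose or BeautifulSoup.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Hypothetical helper: collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_p = False     # True while a <p> element is open
        self._skip = 0         # > 0 while inside <script> or <style>
        self._current = []     # text fragments of the paragraph being read
        self.paragraphs = []   # finished paragraphs, in document order

    def flush(self):
        # Join the fragments, collapse runs of whitespace, keep non-empty text.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "p":
            # An unclosed <p> is ended implicitly by the next <p>.
            if self._in_p:
                self.flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "p" and self._in_p:
            self.flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and not self._skip:
            self._current.append(data)


def extract_paragraph_text(raw_html):
    """Return the text of all <p> elements, separated by blank lines."""
    parser = ParagraphTextExtractor()
    parser.feed(raw_html)
    parser.close()
    parser.flush()  # pick up a trailing paragraph that was never closed
    return "\n\n".join(parser.paragraphs)
```

With `raw_html` fetched as in the original report, `extract_paragraph_text(raw_html)` would return every paragraph on the page, including navigation or footer text, so a real workaround would probably need to narrow the selection to the article body.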

@richardpetithory

I'm curious: was this issue closed because there is a workaround involving some external manipulation?

@manalsali
Author

@richardpetithory No, I closed this issue because @robmcdan already has an issue open for this exact problem (#225), so I don't think there is a need for two open requests.
