Goose fails in extracting articles from The New York Times #252

manalsali · 2015-12-22T13:46:31Z

I ran the following code:

import urllib2
import goose

url = "http://www.nytimes.com/reuters/2015/12/21/world/africa/21reuters-kenya-attacks-somalia.html?_r=0"

# Fetch the page with cookie handling enabled.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open(url)
raw_html = response.read()

# Hand the raw HTML to Goose and inspect the extracted text.
g = goose.Goose()
a = g.extract(raw_html=raw_html)
print(repr(a.cleaned_text))  # prints u''

An empty string is returned.

robmcdan · 2015-12-22T17:03:37Z

Your URL string looks malformed. In any case, I fixed a similar issue in #225 and opened a pull request, but I don't think it was ever approved. My fork has the fix, and I've been parsing NYTimes articles happily ever since.

Good luck!

Rob

grangier · 2015-12-22T17:05:00Z

@robmcdan The PR has not been accepted because it breaks the test suite.

manalsali · 2015-12-22T17:56:40Z

@robmcdan @grangier I just bypassed Goose and wrote a small snippet using BeautifulSoup and urllib2 that works.
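
The snippet itself was not posted in this thread. As a rough illustration of that kind of workaround, here is a minimal sketch that assumes the article body sits in plain <p> elements. Note the substitutions: it targets Python 3 and uses the standard library's html.parser in place of BeautifulSoup, and urllib.request in place of urllib2, so it runs without extra dependencies. The names ParagraphExtractor, extract_text, and fetch are hypothetical and are not part of Goose or of the original snippet.

```python
from html.parser import HTMLParser
from urllib.request import HTTPCookieProcessor, build_opener


class ParagraphExtractor(HTMLParser):
    """Collects the text of every <p> element, ignoring <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._buffer = None    # pieces of the <p> being read, None outside one
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            # A new <p> implicitly closes any paragraph still open.
            self.finish_paragraph()
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self.finish_paragraph()

    def handle_data(self, data):
        if self._buffer is not None and not self._skip_depth:
            self._buffer.append(data)

    def finish_paragraph(self):
        """Store the current paragraph with whitespace collapsed."""
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def extract_text(raw_html):
    """Return the text of all <p> elements, separated by blank lines."""
    parser = ParagraphExtractor()
    parser.feed(raw_html)
    parser.close()
    parser.finish_paragraph()  # keep a trailing paragraph with no </p>
    return "\n\n".join(parser.paragraphs)


def fetch(url):
    """Download a page with cookie handling, as in the report above."""
    opener = build_opener(HTTPCookieProcessor())
    with opener.open(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_text(fetch(url)) would then stand in for g.extract(raw_html=raw_html).cleaned_text. This grabs every paragraph on the page, including navigation or footer text, so it is much cruder than a real article extractor.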

richardpetithory · 2015-12-23T01:08:46Z

I'm curious: was this issue closed because there is a workaround involving some external manipulation?

manalsali · 2015-12-23T10:40:38Z

@richardpetithory No, I closed this issue because @robmcdan already has an issue open for this exact problem (#225), so I don't think there is a need for two open requests.

manalsali closed this as completed Dec 22, 2015
