Blank pages extracted in a crawl. #81

nehakansal · 2018-12-08T00:59:00Z

I ran a crawl on a site and a lot of pages came out blank: the screenshot was blank, and the raw_content in the output didn't contain the HTML body, only the header and scripts. Comparing those to the non-blank pages, I couldn't see any difference that would explain it, and there are no errors in the logs (neither the Undercrawler logs nor the Splash logs). Any idea why that would happen? Any pointers on what I can look into? Thanks.

lopuhin · 2018-12-10T08:12:49Z

@nehakansal blank pages with only a header and scripts are common when JavaScript execution is disabled, but in undercrawler it seems it shouldn't be... I would check a few things:

1. Is this reproducible?
2. Do the pages render fine in the browser?
3. Do the pages render with a simple default Splash script (you can use the Splash UI to test that)?

nehakansal · 2018-12-11T07:58:19Z

1. Yes, it's reproducible, but from what I have seen it's not the same pages that are blank every time. I will double check on that.
2. They render fine in a regular browser.
3. I will check on this.

Thanks for the pointers, @lopuhin.

nehakansal · 2018-12-11T23:25:52Z

I double checked, and it's not the same pages every time I run a crawl.
I tried some of the URLs in the Splash UI and I see the same behavior: some URLs work and others don't. For the ones that don't, the rendered .png image is blank and the HTML string is incomplete.

Would you please try them on a Splash UI to see if you can spot a difference between a url that works and one that doesn't? Or can you please guide me further on what to look for?
Here is a list of some of those urls, hopefully at least one of these will work if/when you test them.

Thanks!

lopuhin · 2018-12-12T07:30:35Z

@nehakansal sure, to test the page in splash UI you have to do the following:

1. Go to the Splash URL in your browser (if you don't have a readily accessible Splash, run docker run -p 8050:8050 scrapinghub/splash and go to http://localhost:8050).
2. You'll see the Splash UI with a small Splash script. Change the URL from google.com to one of the above URLs and click the "Render me!" button.

I did that for https://ada.com/conditions/hypertensive-retinopathy/. With the default 0.5 s wait I first got a good page and then a blank page, and with a 2.5 s wait I got a normal page. So the page renders fine in Splash, and the issue is either that some page elements take longer to download, or that something in the headless horsemen script used in undercrawler is causing problems.
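For anyone who wants to repeat the wait-time comparison outside the UI, here is a minimal sketch that uses Splash's render.html HTTP endpoint with its url and wait arguments. The SPLASH address assumes the local docker instance mentioned above, and the function names are made up for illustration. Note that this only exercises plain Splash rendering, not the script undercrawler runs.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a local Splash instance started with the docker command above.
SPLASH = "http://localhost:8050"


def render_url(page_url, wait, splash=SPLASH):
    """Build a Splash render.html request URL for page_url with a wait in seconds."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash}/render.html?{query}"


def fetch_html(page_url, wait, splash=SPLASH, timeout=60):
    """Fetch the rendered HTML from Splash (requires a running Splash instance)."""
    with urlopen(render_url(page_url, wait, splash), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def compare_waits(page_url, waits=(0.5, 2.5, 5.0)):
    """Return {wait: length of rendered HTML} so short (possibly blank) renders stand out."""
    return {wait: len(fetch_html(page_url, wait)) for wait in waits}
```

Calling compare_waits("https://ada.com/conditions/hypertensive-retinopathy/") a few times in a row should show whether the HTML size jumps around between runs or only with short waits.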

nehakansal · 2018-12-13T05:41:13Z

Thanks, @lopuhin. You might have misunderstood my previous message. I was able to test a few URLs in the Splash UI myself. I was asking you to run them on your end to see whether you get the same behavior and whether the Splash UI stats give you any clue.

When I tested them in the Splash UI earlier, I hadn't noticed that I could change the wait time, so I ran everything with the default 0.5 s, and with that some URLs worked and others didn't. After reading your message, I changed the wait time to 5.0 s, and still some work and others don't.
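Since the blank pages are not the same on every run, it may help to count them rather than eyeball screenshots. Below is a rough sketch of a blank-page check on the rendered HTML: it treats a page as blank when there is almost no visible text inside the body, ignoring scripts and styles. The names and the min_chars threshold are my own assumptions, not something from undercrawler or Splash.

```python
from html.parser import HTMLParser


class _BodyText(HTMLParser):
    """Collect visible text inside <body>, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return whitespace-normalized visible body text of an HTML string."""
    parser = _BodyText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def is_blank(html, min_chars=20):
    """Heuristic: a page is blank if its body has fewer than min_chars of visible text."""
    return len(visible_text(html)) < min_chars


def blank_rate(pages):
    """Fraction of HTML strings in pages that look blank (0.0 for an empty input)."""
    pages = list(pages)
    return sum(is_blank(p) for p in pages) / len(pages) if pages else 0.0
```

Running the same URL several times with a fixed wait and feeding the results to blank_rate would give a number to compare across wait times, instead of a single works/doesn't-work observation.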
