Blank pages extracted in a crawl. #81

Open
nehakansal opened this issue Dec 8, 2018 · 5 comments

Comments

@nehakansal

I ran a crawl on a site and a lot of pages came out blank: the screenshot was blank, and the raw_content in the output didn't contain the HTML body, only the header and scripts. Comparing these with the non-blank pages, I couldn't see any difference that would explain it, and there were no errors in the logs (both the Undercrawler logs and the Splash logs). Any idea why that would happen? Any pointers on what I can look into? Thanks.

@lopuhin
Contributor

lopuhin commented Dec 10, 2018

@nehakansal blank pages with only a header and scripts are common when JavaScript execution is disabled, but in Undercrawler it seems it shouldn't be... I would check a few things:

  • is this reproducible?
  • do pages render fine in the browser?
  • do pages render with a simple default splash script (you can use splash UI to test that)?

@nehakansal
Author

  • Yes, it's reproducible, but from what I have seen it's not the same pages that are blank every time. I will double check on that.
  • They render fine in a regular browser
  • I will check on this.

Thanks for the pointers, @lopuhin .

@nehakansal
Author

nehakansal commented Dec 11, 2018

  • I double checked, and it's not the same pages every time I run a crawl.
  • I tried some of the URLs in the Splash UI and I see the same behavior: some URLs work and others don't. For the ones that don't, the rendered .png image is blank and the HTML string is incomplete.
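
One way to flag such pages automatically, instead of eyeballing screenshots, is to count the visible text inside the body of the returned HTML. The following is only a rough sketch: `looks_blank` and the 50-character threshold are hypothetical choices made for illustration, not something Undercrawler or Splash provides.

```python
from html.parser import HTMLParser


class _BodyTextCounter(HTMLParser):
    """Counts visible text characters found inside <body>."""

    # Text inside these tags is not visible page content.
    IGNORED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.ignored_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.IGNORED:
            self.ignored_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.IGNORED and self.ignored_depth > 0:
            self.ignored_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.ignored_depth == 0:
            self.visible_chars += len(data.strip())


def looks_blank(html, min_chars=50):
    """Heuristic: True if the body holds fewer than min_chars of visible text.

    min_chars=50 is an arbitrary threshold (an assumption); tune it per site.
    """
    counter = _BodyTextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_chars
```

Running `looks_blank` over the raw_content of each crawled item would give a list of affected URLs to compare between crawls.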

Would you please try them in the Splash UI to see if you can spot a difference between a URL that works and one that doesn't? Or can you please guide me further on what to look for?
Here is a list of some of those urls, hopefully at least one of these will work if/when you test them.

Thanks!

@lopuhin
Contributor

lopuhin commented Dec 12, 2018

@nehakansal sure, to test the page in splash UI you have to do the following:

  • go to the splash URL with your browser (if you don't have a readily accessible splash, use docker run -p 8050:8050 scrapinghub/splash and go to http://localhost:8050)
  • you'll see something like this

screenshot 2018-12-12 at 10 10 27

  • here you have a small Splash script: change the URL from google.com to one of the URLs above and click the "Render me!" button. I did that for https://ada.com/conditions/hypertensive-retinopathy/: with the default 0.5 s wait I first got a good page and then a blank page, and with a 2.5 s wait I got a normal page. So the page does render fine in Splash, and the issue is either that some page elements take longer to download, or that something in the headless horseman script used in Undercrawler is causing problems.

screenshot 2018-12-12 at 10 30 04
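
The same comparison can be scripted against the Splash HTTP API instead of clicking through the UI. This is a minimal sketch that assumes Splash is listening on http://localhost:8050 (as with the docker command above) and uses the `render.html` endpoint with its `url` and `wait` arguments; the helper names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a local Splash instance started with the docker command above.
SPLASH = "http://localhost:8050"


def render_html_url(page_url, wait, splash=SPLASH):
    """Build a Splash render.html request URL for page_url with the given wait."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash}/render.html?{query}"


def fetch_rendered(page_url, wait, splash=SPLASH):
    """Ask Splash to render page_url and return the resulting HTML as text."""
    request_url = render_html_url(page_url, wait, splash)
    # Client-side timeout (seconds) so a stuck render does not hang forever.
    with urlopen(request_url, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_rendered(url, 0.5)` and `fetch_rendered(url, 2.5)` for the same URL and comparing the length of the returned HTML should roughly mirror what the two screenshots show.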

@nehakansal
Author

Thanks, @lopuhin. You might have misunderstood my previous message. I was able to test a few URLs in the Splash UI. I was asking you to run them on your end to see if you get the same behavior and if the Splash UI stats give you any clue.

When I tested them in the Splash UI earlier, I hadn't noticed that I could change the wait time, so I ran them all with the default 0.5, and with that some worked and others didn't. After reading your message, I changed the wait time to 5.0, and still some work and others don't.
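
Since the failures are intermittent, a single render per wait time says little; counting blank results over several attempts per wait makes the comparison less anecdotal. A small sketch, where `render` and `is_blank` are hypothetical callables supplied by the caller (for example a function that fetches HTML from Splash and a check for an empty body), not part of Splash or Undercrawler:

```python
def blank_counts(render, is_blank, url, waits, attempts=5):
    """Return {wait: number of blank renders out of `attempts`}.

    render(url, wait) must return the rendered HTML as a string;
    is_blank(html) must return True when that HTML counts as blank.
    Both are caller-supplied (hypothetical) helpers.
    """
    counts = {}
    for wait in waits:
        counts[wait] = sum(
            1 for _ in range(attempts) if is_blank(render(url, wait))
        )
    return counts
```

If the blank count stays roughly the same at 0.5 and 5.0, the wait time is probably not the deciding factor and something else (a failed resource, rate limiting on the site) would be worth checking.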
