Blank pages extracted in a crawl. #81

I ran a crawl on a site and a lot of the pages came back blank, in the sense that the screenshot was blank and the raw_content in the output had no extracted HTML body, just header and script tags. Comparing those pages to the non-blank ones, I couldn't see any difference that would explain it, and there were no errors in the logs (neither the Undercrawler logs nor the Splash logs). Any idea why that would happen? Any pointers to what I can look into? Thanks.
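One quick way to find the affected items in the crawl output is to scan for pages whose <body> carries no visible text. A minimal sketch, not part of the original report: it assumes JSON-lines output with url and raw_content fields as described above, and the items.jl filename and 50-character threshold are arbitrary:

```python
import json
import lxml.html

def looks_blank(html, min_text_chars=50):
    """True if the rendered page body carries (almost) no visible text."""
    # Treat missing or empty HTML as blank.
    if not html or not html.strip():
        return True
    doc = lxml.html.fromstring(html)
    body = doc.find('body')
    if body is None:
        return True
    # Drop script/style so their source code doesn't count as page text.
    for node in body.xpath('.//script | .//style'):
        node.drop_tree()
    return len(body.text_content().strip()) < min_text_chars

# Scan a JSON-lines crawl output file and report the blank pages.
with open('items.jl') as f:
    for line in f:
        item = json.loads(line)
        if looks_blank(item.get('raw_content') or ''):
            print('blank:', item.get('url'))
```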
Comments
@nehakansal blank pages containing only a header with scripts are common when JavaScript execution is disabled, but in Undercrawler it seems it shouldn't be... I would check a few things:
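One basic check along those lines is whether JavaScript is actually being executed for the affected pages: fetch the page once directly and once through Splash's render endpoint, and compare what comes back. A minimal sketch, assuming a Splash instance on the default localhost:8050; the URL is a placeholder:

```python
import requests

url = 'http://example.com/some-page'  # placeholder: one of the blank URLs

# Raw response, no JavaScript execution.
plain = requests.get(url, timeout=30).text

# Same page rendered by Splash, with a short wait for scripts to run.
rendered = requests.get(
    'http://localhost:8050/render.html',
    params={'url': url, 'wait': 2.0},
    timeout=60,
).text

# If JavaScript is executing, the rendered document is usually much larger
# and contains the page body, not just <head> and script tags.
print('plain   :', len(plain), 'bytes')
print('rendered:', len(rendered), 'bytes')
```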
Thanks for the pointers, @lopuhin. Would you please try them in the Splash UI to see if you can spot a difference between a URL that works and one that doesn't? Or can you guide me further on what to look for? Thanks!
@nehakansal sure, to test the page in the Splash UI you have to do the following:
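The same check the UI performs can also be scripted: Splash's /render.json endpoint can return the rendered HTML, a screenshot, and the HAR log in a single response. A sketch, again assuming a local Splash on port 8050; the URL is a placeholder:

```python
import base64
import requests

resp = requests.get(
    'http://localhost:8050/render.json',
    params={
        'url': 'http://example.com/some-page',  # placeholder
        'wait': 5.0,  # same knob as the wait time in the UI
        'html': 1,    # include rendered HTML
        'png': 1,     # include a screenshot (base64-encoded)
        'har': 1,     # include the HAR network log
    },
    timeout=90,
)
data = resp.json()

print('html length:', len(data['html']))
print('HAR entries:', len(data['har']['log']['entries']))
with open('screenshot.png', 'wb') as f:
    f.write(base64.b64decode(data['png']))
```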
Thanks, @lopuhin. You might have misunderstood my previous message. I was able to test a few URLs in the Splash UI; I was asking you to run them on your end, to see whether you get the same behavior and whether the Splash UI stats give you any clue. When I tested them in the Splash UI earlier, I hadn't noticed that the wait time could be changed, so I ran everything with the default 0.5, and with that some URLs worked and others didn't. After reading your message I changed the wait time to 5.0, and still some work and others don't.
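That experiment is easy to reproduce on either end: render the same URLs with both wait values and compare the size of the result. A sketch, assuming a local Splash on port 8050; the URLs are placeholders for a working page and a blank one from the crawl:

```python
import requests

urls = [
    'http://example.com/page-that-works',       # placeholder
    'http://example.com/page-that-comes-blank', # placeholder
]

for url in urls:
    for wait in (0.5, 5.0):
        html = requests.get(
            'http://localhost:8050/render.html',
            params={'url': url, 'wait': wait},
            timeout=120,
        ).text
        # A page that only needs a longer wait should grow noticeably
        # between wait=0.5 and wait=5.0; one that stays the same size
        # is failing for some other reason.
        print(f'{url}  wait={wait}  ->  {len(html)} bytes')
```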