How to debug spider? #40

Open
thibault opened this issue Nov 21, 2018 · 7 comments

@thibault
Contributor

Hi @JulienParis,

I'm testing my own instance of OpenScraper.

So far, despite reading the documentation, I've been unable to get any real data out of OpenScraper.

I've defined a simple data model (one field), added a simple contributor, but when I "Crawl" the spider, the dataset stays empty.

Now I'm not too sure where to go from here. I've tested and re-tested my XPath expressions, and although I might be wrong, everything seems OK there. How do I get feedback about the scraping results? How do I know what happened during the crawl and what went wrong exactly?

@JulienParis
Collaborator

For now, the only way to get feedback while scraping is to run it with a terminal open (for instance, run your local instance from the terminal and check the output, or check the log files)...
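
For example, raising Scrapy's log verbosity makes that terminal output more detailed. A minimal sketch, assuming settings_scrapy.py is where your instance's Scrapy settings live (the file is referenced later in this thread):

LOG_LEVEL = 'DEBUG'  # log every request and response instead of INFO-level summaries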

Could you share your scraper config (screenshot) so I can get an idea of how you set up your first try?

@DavidBruant

Hi @thibault, good to see you here :-)
(I don't have answers to your questions, just saying hi :-) )

@JulienParis
Collaborator

JulienParis commented Nov 21, 2018

@thibault
I'm also trying with my own instance but get no results from http://www.ademe.fr/actualites/appels-a-projets ... same as you :( ... Trying to figure out what the bug is...

I tried with this (see the Scrapy shell sketch after the list):

  • start_urls : http://www.ademe.fr/actualites/appels-a-projets
  • item_xpath : //section/ul/li
  • name (or whatever custom field) : .//div[@class="content"]//h2/a/text()
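
As a quick sanity check outside OpenScraper, the same XPaths can be tested in a Scrapy shell. A minimal sketch, assuming Scrapy is installed locally:

# run: scrapy shell "http://www.ademe.fr/actualites/appels-a-projets"
# then, in the shell, evaluate the same expressions the config uses:
for li in response.xpath('//section/ul/li'):
    print(li.xpath('.//div[@class="content"]//h2/a/text()').extract_first())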

I see nothing weird in my log and no error message, but the page is not loaded...

::: INFO log_pipeline 181121 18:58:15 ::: pipelines:80 -in- __init__() ::: 		>>> MongodbPipeline / __init__ ...
::: INFO log_pipeline 181121 18:58:15 ::: pipelines:87 -in- __init__() ::: 		--- MongodbPipeline / os.getcwd() : /Users/jpy/Dropbox/_FLASK/_CIS/_POC_EIG/CIS_scrapnado/openscraper

::: INFO scrapy.middleware 181121 18:58:15 ::: middleware:53 -in- from_settings() ::: 		Enabled item pipelines:
    ['scraper.pipelines.MongodbPipeline']
::: INFO scrapy.core.engine 181121 18:58:15 ::: engine:256 -in- open_spider() ::: 		Spider opened
::: DEBUG log_pipeline 181121 18:58:15 ::: pipelines:116 -in- open_spider() ::: 		>>> MongodbPipeline / open_spider ...

::: INFO scrapy.extensions.logstats 181121 18:58:15 ::: logstats:48 -in- log() ::: 		Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
::: INFO log_scraper 181121 18:58:15 ::: masterspider:354 -in- start_requests() ::: 		--- GenericSpider.start_requests ...
::: INFO log_scraper 181121 18:58:15 ::: masterspider:358 -in- start_requests() ::: 		--- GenericSpider.start_requests / url : http://www.ademe.fr/actualites/appels-a-projets
::: INFO log_scraper 181121 18:58:15 ::: masterspider:363 -in- start_requests() ::: 		--- GenericSpider.start_requests / starting first Scrapy request...
::: INFO scrapy.core.engine 181121 18:58:16 ::: engine:295 -in- close_spider() ::: 		Closing spider (finished)
::: DEBUG log_pipeline 181121 18:58:16 ::: pipelines:137 -in- close_spider() ::: 		>>> MongodbPipeline / close_spider ...

Very weird indeed

Meanwhile, you can start trying with this website to check whether it's the code or the target website causing the trouble:
[screenshots: example scraper configuration for a test website]

@JulienParis
Collaborator

JulienParis commented Nov 21, 2018

... I added the quotestoscrap scraper and it's working fine... It must be something related to the Ademe website (or the Scrapy settings, because plain requests work fine)...
I tried it with a plain request from a Python shell:

>>> import requests
>>> r = requests.get('http://www.ademe.fr/actualites/appels-a-projets')
>>> print(r.content)

and no problem there... So it's either Scrapy or the website.
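
To confirm it on the Scrapy side, one could fetch the same URL with a bare Scrapy spider under default settings. A minimal sketch (ProbeSpider is a hypothetical name, not part of OpenScraper):

import scrapy
from scrapy.crawler import CrawlerProcess

class ProbeSpider(scrapy.Spider):
    name = "probe"
    start_urls = ["http://www.ademe.fr/actualites/appels-a-projets"]

    def parse(self, response):
        # a non-empty body here means Scrapy itself can reach the page
        self.logger.info("fetched %s: %s bytes", response.url, len(response.body))

process = CrawlerProcess()
process.crawl(ProbeSpider)
process.start()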

@JulienParis
Collaborator

JulienParis commented Nov 21, 2018

@thibault
I think I got it!! Something is going wrong with the Scrapy settings...
I commented out line 139 in the masterspider.py file:
this one --> settings.set("RANDOMIZE_DOWNLOAD_DELAY", RANDOMIZE_DOWNLOAD_DELAY)
And then I could scrape the Ademe website.

So you could either comment out this same line on your instance, or set the RANDOMIZE_DOWNLOAD_DELAY variable to false (RANDOMIZE_DOWNLOAD_DELAY = False in your settings_scrapy.py file)... Or even better, I could add this option to the "advanced settings" as a new feature...
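
For reference, in stock Scrapy RANDOMIZE_DOWNLOAD_DELAY is a boolean (default True) that scales the wait between requests to 0.5x-1.5x of DOWNLOAD_DELAY. The two workarounds above would look roughly like this, assuming the files named in this thread:

# option 1: in masterspider.py, around line 139, skip the override entirely
# settings.set("RANDOMIZE_DOWNLOAD_DELAY", RANDOMIZE_DOWNLOAD_DELAY)

# option 2: in settings_scrapy.py, disable the randomized delay explicitly
RANDOMIZE_DOWNLOAD_DELAY = False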

@JulienParis
Collaborator

JulienParis commented Nov 21, 2018

@thibault
So I added some new features to "advanced settings" with this commit: 92d9908

This lets you override the default Scrapy settings with your own advanced settings. For instance, in your case with Ademe, these settings seem to work:

[screenshot: advanced settings used for the Ademe scraper]
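
For comparison, plain Scrapy's own mechanism for per-spider overrides (not necessarily how commit 92d9908 implements it) is the custom_settings class attribute:

import scrapy

class AdemeSpider(scrapy.Spider):
    name = "ademe"  # hypothetical spider name, for illustration only
    # per-spider values take precedence over project-wide settings
    custom_settings = {
        "RANDOMIZE_DOWNLOAD_DELAY": False,
        "DOWNLOAD_DELAY": 1,
    }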

@thibault
Contributor Author

@JulienParis Wow, it seems I gave you work for the entire afternoon :)

Thank you for taking the time to help. I will try your solution and get back to you with the results.

@DavidBruant Hi ! :)
