How to debug spider? #40

Open
thibault opened this issue Nov 21, 2018 · 7 comments

@thibault
Contributor

Hi @JulienParis,

I'm testing my own instance of OpenScraper.

So far, despite reading the documentation, I've been unable to get any real data out of OpenScraper.

I've defined a simple data model (one field), added a simple contributor, but when I "Crawl" the spider, the dataset stays empty.

Now I'm not too sure where to go from here. I've tested and re-tested my XPath expressions, and although I might be wrong, everything seems OK there. How do I get feedback about the scraping results? How do I know what happened during the crawl and what went wrong exactly?

@JulienParis
Collaborator

For now, the only way to get feedback while scraping is to run it with a terminal open (for instance, run your local instance from the terminal and check the output, or check the log files)...
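
For example, raising Scrapy's log verbosity makes that terminal output more detailed. A minimal sketch, assuming settings_scrapy.py is where your instance's Scrapy settings live (the file is referenced later in this thread):

LOG_LEVEL = 'DEBUG'  # log every request and response instead of INFO-level summaries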

Could you share your scraper config (screenshot) so I can get an idea of how you set up your first try?

@DavidBruant

Hi @thibault, good to see you here :-)
(I don't have answers to your questions, just saying hi :-) )

@JulienParis
Collaborator

JulienParis commented Nov 21, 2018

@thibault
I'm also trying with my own instance but get no results from http://www.ademe.fr/actualites/appels-a-projets ... same as you :( ... Trying to figure out what the bug is...

I tried with this (see the Scrapy shell sketch after the list):

  • start_urls : http://www.ademe.fr/actualites/appels-a-projets
  • item_xpath : //section/ul/li
  • name (or whatever custom field) : .//div[@class="content"]//h2/a/text()
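
As a quick sanity check outside OpenScraper, the same XPaths can be tested in a Scrapy shell. A minimal sketch, assuming Scrapy is installed locally:

# run: scrapy shell "http://www.ademe.fr/actualites/appels-a-projets"
# then, in the shell, evaluate the same expressions the config uses:
for li in response.xpath('//section/ul/li'):
    print(li.xpath('.//div[@class="content"]//h2/a/text()').extract_first())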

I see nothing weird in my log and no error message, but the page is not loaded...

::: INFO log_pipeline 181121 18:58:15 ::: pipelines:80 -in- __init__() ::: 		>>> MongodbPipeline / __init__ ...
::: INFO log_pipeline 181121 18:58:15 ::: pipelines:87 -in- __init__() ::: 		--- MongodbPipeline / os.getcwd() : /Users/jpy/Dropbox/_FLASK/_CIS/_POC_EIG/CIS_scrapnado/openscraper

::: INFO scrapy.middleware 181121 18:58:15 ::: middleware:53 -in- from_settings() ::: 		Enabled item pipelines:
    ['scraper.pipelines.MongodbPipeline']
::: INFO scrapy.core.engine 181121 18:58:15 ::: engine:256 -in- open_spider() ::: 		Spider opened
::: DEBUG log_pipeline 181121 18:58:15 ::: pipelines:116 -in- open_spider() ::: 		>>> MongodbPipeline / open_spider ...

::: INFO scrapy.extensions.logstats 181121 18:58:15 ::: logstats:48 -in- log() ::: 		Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
::: INFO log_scraper 181121 18:58:15 ::: masterspider:354 -in- start_requests() ::: 		--- GenericSpider.start_requests ...
::: INFO log_scraper 181121 18:58:15 ::: masterspider:358 -in- start_requests() ::: 		--- GenericSpider.start_requests / url : http://www.ademe.fr/actualites/appels-a-projets
::: INFO log_scraper 181121 18:58:15 ::: masterspider:363 -in- start_requests() ::: 		--- GenericSpider.start_requests / starting first Scrapy request...
::: INFO scrapy.core.engine 181121 18:58:16 ::: engine:295 -in- close_spider() ::: 		Closing spider (finished)
::: DEBUG log_pipeline 181121 18:58:16 ::: pipelines:137 -in- close_spider() ::: 		>>> MongodbPipeline / close_spider ...

Very weird indeed

Meanwhile, you can start trying with this website to check whether it's the code or the target website causing the trouble:
[screenshots: example scraper configuration for a test website]

@JulienParis
Collaborator

JulienParis commented Nov 21, 2018

... I added the quotestoscrap scraper and it's working fine... It must be something related to the Ademe website (or the Scrapy settings, because plain requests work fine)...
I tried it with a plain request from a Python shell:

>>> import requests
>>> r = requests.get('http://www.ademe.fr/actualites/appels-a-projets')
>>> print(r.content)

and no problem there... So it's either Scrapy or the website.
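
To confirm it on the Scrapy side, one could fetch the same URL with a bare Scrapy spider under default settings. A minimal sketch (ProbeSpider is a hypothetical name, not part of OpenScraper):

import scrapy
from scrapy.crawler import CrawlerProcess

class ProbeSpider(scrapy.Spider):
    name = "probe"
    start_urls = ["http://www.ademe.fr/actualites/appels-a-projets"]

    def parse(self, response):
        # a non-empty body here means Scrapy itself can reach the page
        self.logger.info("fetched %s: %s bytes", response.url, len(response.body))

process = CrawlerProcess()
process.crawl(ProbeSpider)
process.start()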

@JulienParis
Collaborator

JulienParis commented Nov 21, 2018

@thibault
I think I got it!! Something is going wrong with the Scrapy settings...
I commented out line 139 in the masterspider.py file:
this one --> settings.set("RANDOMIZE_DOWNLOAD_DELAY", RANDOMIZE_DOWNLOAD_DELAY)
And then I could scrape the Ademe website.

So you could either comment out this same line on your instance, or set the RANDOMIZE_DOWNLOAD_DELAY variable to false (RANDOMIZE_DOWNLOAD_DELAY = False in your settings_scrapy.py file)... Or even better, I could add this option to the "advanced settings" as a new feature...
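
For reference, in stock Scrapy RANDOMIZE_DOWNLOAD_DELAY is a boolean (default True) that scales the wait between requests to 0.5x-1.5x of DOWNLOAD_DELAY. The two workarounds above would look roughly like this, assuming the files named in this thread:

# option 1: in masterspider.py, around line 139, skip the override entirely
# settings.set("RANDOMIZE_DOWNLOAD_DELAY", RANDOMIZE_DOWNLOAD_DELAY)

# option 2: in settings_scrapy.py, disable the randomized delay explicitly
RANDOMIZE_DOWNLOAD_DELAY = False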

@JulienParis
Collaborator

JulienParis commented Nov 21, 2018

@thibault
So I added some new features to "advanced settings" with this commit: 92d9908

This lets you override the default Scrapy settings with your own advanced settings. For instance, in your case with Ademe, these settings seem to work:

[screenshot: advanced settings used for the Ademe scraper]
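
For comparison, plain Scrapy's own mechanism for per-spider overrides (not necessarily how commit 92d9908 implements it) is the custom_settings class attribute:

import scrapy

class AdemeSpider(scrapy.Spider):
    name = "ademe"  # hypothetical spider name, for illustration only
    # per-spider values take precedence over project-wide settings
    custom_settings = {
        "RANDOMIZE_DOWNLOAD_DELAY": False,
        "DOWNLOAD_DELAY": 1,
    }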

@thibault
Contributor Author

@JulienParis Wow, it seems I gave you work for the entire afternoon :)

Thank you for taking the time to help. I will try your solution and get back to you with the results.

@DavidBruant Hi ! :)
