Hi,
I have to say, amazing tool.
I am struggling to understand how to store the results in a JSON file for each start URL.
Currently I am getting binary files for each URL within the domain, and I have difficulty retrieving the information I am looking for (domain URL, sub-URL, status code, HTML content or plain text).
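To be concrete, what I am hoping to end up with is one JSON file per start URL, where every crawled page becomes a record along these lines (the field names are only illustrative, not the actual undercrawler item fields):

{"start_url": "https://www.bvrgroep.nl", "url": "https://www.bvrgroep.nl/some-page", "status": 200, "text": "<html>...</html>"}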
I am running the following command:
scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s CLOSESPIDER_PAGECOUNT=5
with
FILES_STORE = "\output_data"
This creates several files without an extension in that path, so it is hard for me to get my head around the 'UndercrawlerMediaPipeline' and how I can adjust it to store files in a readable format.
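In case it makes the question clearer, this is roughly the kind of thing I imagined writing myself: a small Scrapy item pipeline that appends each item as one JSON line to a file named after the domain it came from. The field name 'url' here is only a guess on my part; I do not know which fields the undercrawler item actually exposes.

import json
import os
from urllib.parse import urlparse


class PerDomainJsonPipeline:
    # Hypothetical sketch, not undercrawler code: write each item as one
    # JSON line into output_data/<domain>.jl.
    def open_spider(self, spider):
        self.files = {}
        os.makedirs('output_data', exist_ok=True)

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        record = dict(item)  # works for plain dicts and scrapy.Item objects
        # 'url' is an assumed field name; adjust to the real item schema.
        domain = urlparse(record.get('url', '')).netloc or 'unknown'
        if domain not in self.files:
            path = os.path.join('output_data', domain + '.jl')
            self.files[domain] = open(path, 'a', encoding='utf8')
        self.files[domain].write(json.dumps(record, default=str) + '\n')
        return item

Something like that would have to be enabled through ITEM_PIPELINES in settings.py. I also know that plain Scrapy can dump all items to a single feed with, for example, scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -o items.jl, but that gives one combined file rather than one file per start URL, which is why I am looking at the pipeline.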
Also, I cannot find IMAGES_ENABLED in the settings file to stop downloading images.
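For what it is worth, I understand that any Scrapy setting can be overridden on the command line with -s, so if the spider honors a setting with this name, something like the following should disable it; I just do not know whether undercrawler actually reads IMAGES_ENABLED:

scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s IMAGES_ENABLED=0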
PS: I have not activated Splash as I do not have access to Docker on my local laptop.
Could you please shed some light on this?