Native support for X-Crawlera-Session #27
You mean, so you don't have to include the session header in every request yourself? On the other hand, I don't think Crawlera returns a different session by itself, so I can't think of a way this could happen. Could you maybe go deeper into your example?
Here is how I do it:
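The code snippet that followed this comment did not survive extraction. Based on the surrounding discussion (a spider-held `session_id` that starts as `"create"` and is reset on 503), a plain-Python sketch of that manual bookkeeping might look like this; the request/response dicts are simplified stand-ins for Scrapy's objects, and everything except `session_id` and the `X-Crawlera-Session` header is assumed:

```python
# Sketch of manual X-Crawlera-Session bookkeeping inside a spider.
# Requests and responses are plain dicts standing in for Scrapy's
# Request/Response; in a real spider this logic lives in callbacks.

class SessionSpider:
    def __init__(self):
        # "create" asks Crawlera to open a fresh session
        self.session_id = "create"

    def make_request(self, url):
        # attach the current session id to every outgoing request
        return {"url": url,
                "headers": {"X-Crawlera-Session": self.session_id}}

    def handle_response(self, response):
        if response["status"] == 503:
            # banned by the target site: drop the session and start over
            self.session_id = "create"
        else:
            # remember the session id Crawlera assigned to us
            self.session_id = response["headers"].get(
                "X-Crawlera-Session", self.session_id)
```

The pain point the thread describes is exactly this: every spider that wants sessions has to repeat the header-setting and 503-watching by hand.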
Also, I need to watch whether a request returned 503 (which in my case means that the server banned me) and then set `self.session_id = "create"`. Rather than that, I'd like to do something like:
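The snippet the author wanted here was also lost in extraction. One hypothetical shape for the desired declarative API, where the plugin owns the session lifecycle and the spider only configures it, might be (all of these setting names are invented for illustration, they are not real scrapy-crawlera settings):

```python
# Hypothetical spider settings: let the middleware create, reuse and
# renew the Crawlera session instead of doing it in callbacks.
custom_settings = {
    "CRAWLERA_SESSION_ENABLED": True,
    # statuses after which the middleware discards the session and
    # sends "X-Crawlera-Session: create" on the next request
    "CRAWLERA_SESSION_RESET_STATUSES": [503],
}
```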
@scrapy-plugins/core any ideas on enabling this in the middleware? I am not really a big fan of using Crawlera sessions with Scrapy, especially when only dealing with one session.
@dchaplinsky, @eLRuLL: I believe this can be done at the middleware level indeed, perhaps with the same design as the revamped robotstxt middleware, i.e. returning a deferred from `process_request`. Crawlera has a sessions API that could back this.
Well, I'm also not a big fan of using sessions, but for some sites it's the only option I have.
This is a very old issue, but I would like to revisit it and see whether we should work on it. I really see two problems here:
@scrapy-plugins/core Could someone also please share some opinions with respect to this particular problem (mostly point 2)? Thanks
Re: slowness: it depends on the behaviour the scraper wants to follow. If it is to mimic user behaviour, then yes, one request should come after the other; but there are websites that run a lot of IP/cookie-sensitive XHR requests, and the scraper is naturally free to run them faster, one after the other and even concurrently, again to follow closely what happens in the real world.

Re: resetting

Re: ordering
@dchaplinsky did you find a solution along the robotstxt-middleware lines? Could you post it? I am looking to develop a solution for the same issue, but am having trouble passing `session_id` flags/variables between the middleware and the spider.
@brooj095, to be honest, I barely remember the issue I've been working on.
Going back to @eLRuLL’s comment, I don’t see how we could improve things for point 2, so I think we should instead work on point 1, which I think is made of two parts:
Should we consider #85 a fix for this?
It'd be great if the plugin could be configured to use/re-use the sessions mechanism, because managing it manually in spiders like that is a little bit ugly.