added concurrency limits to reduce number of errors #7
Conversation
Great spotting! I also like the formatting of your PR. Should we make it a template for future PRs?
Hmm... interesting!
Yes, go ahead and create a PR template. It will be good to have structured PRs in the future.
I see! There are a few invalid URLs such as https://developers.klarna.com/ and https://developers.klarna.com/documentation/kco-v3/payments-api/ which cannot be reached or simply do not exist. Let me also try to find legitimate sources of API documentation. Should I verify the links? Also, I was thinking, we need not scrape the websites at every run. Here, we can introduce "versioning" wherein the last date the URL was scraped is recorded. On a separate note, did you review the code yet, @in-c0? If everything seems okay, you can merge the PR. Thanks! 🖱️
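As a rough illustration of the versioning idea above, here is a minimal sketch only; the manifest file name, JSON format, and 24-hour threshold are assumptions for illustration, not part of this PR or repo:

```js
import fs from 'node:fs';

const MANIFEST = 'scrape-manifest.json';   // hypothetical file, not in this repo
const MAX_AGE_MS = 24 * 60 * 60 * 1000;    // assumed threshold: re-scrape after 24 hours

// Load the { url: lastScrapedISODate } map, or start fresh.
function loadManifest() {
  return fs.existsSync(MANIFEST)
    ? JSON.parse(fs.readFileSync(MANIFEST, 'utf8'))
    : {};
}

// Scrape a URL only if it has never been scraped or its record is stale.
function shouldScrape(url, manifest) {
  const last = manifest[url];
  return !last || Date.now() - new Date(last).getTime() > MAX_AGE_MS;
}

// Record the scrape date so the next run can skip this URL.
function markScraped(url, manifest) {
  manifest[url] = new Date().toISOString();
  fs.writeFileSync(MANIFEST, JSON.stringify(manifest, null, 2));
}
```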
Okay! Thank you for your guidance.
I think I'll do a whole revamp and use GitHub Actions for this automated URL table management. More on Issue #8.
Agreed that we shouldn't need to scrape every run, though I think certain pages are more time-sensitive than others (policies, TOS, ...). I think what we could do is, in addition to the proposed solution for Issue #8, scrape daily and host the result somewhere for users to download and fetch easily. If not daily, we can try to scrape as often as we practically can within our allowed resources. Introducing versioning is a great idea, but I'm wary of causing confusion with the existing versioning of the APIs. What are your thoughts on this?
Yep I have, but I have not merged yet as I am still reviewing. I tried to check out your branch and confirm that "the number now is around 68~ out of 564 URLs", but mine still stays at 40 after running the scraper.
@pradhanhitesh Could you please take a look at the log and see if there is any difference from your output?
To figure out the exact number, please refer to this. There are a few URLs which allow scraping (20~), and then there are URLs for which robots.txt is not fetched and therefore the default is to scrape them. Now, of those sites, I have seen that a few websites return errors. So, if my understanding is correct:
We can version the datasets instead, which would be preferable. Setting up cron jobs via Actions would be suitable, and we can fetch the latest dataset folder with up-to-date documents for users to scroll through and browse.
Compared to me, you have a lower number of fetch fail errors.
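For illustration, a minimal sketch of the robots.txt default-allow behaviour described a couple of comments above (this assumes the robots-parser npm package and the global fetch; it is not the project's actual implementation):

```js
import robotsParser from 'robots-parser'; // assumed dependency, not in this repo

// If robots.txt cannot be fetched, fall back to allowing the scrape,
// mirroring the default behaviour described above.
async function isAllowedToScrape(url, userAgent = '*') {
  const robotsUrl = new URL('/robots.txt', url).href;
  try {
    const res = await fetch(robotsUrl);
    if (!res.ok) return true;               // no robots.txt -> default to scraping
    const robots = robotsParser(robotsUrl, await res.text());
    return robots.isAllowed(url, userAgent) !== false;
  } catch {
    return true;                            // robots.txt fetch failed -> default to scraping
  }
}
```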
Thank you for clarifying this! And apologies for the delay in getting back to you. The branch seems to be working as expected - I'll merge it into main 👍
Thank you for taking the time to summarize this. The number of disallowed websites surprises me! I'll look into finding a more reliable way to fetch valid URLs.
Description
In this fix, I have added concurrency limits to the fast-scraper.js code using pLimit. The default number of concurrent requests sent to Crawlee is set at const limit = pLimit(10).
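For context, a minimal sketch of how a pLimit wrapper caps concurrency (illustration only, not the actual fast-scraper.js code; scrapeAll is a hypothetical helper and the plain fetch stands in for the real request handling):

```js
import pLimit from 'p-limit';

const limit = pLimit(10); // at most 10 requests in flight at once

// Hypothetical helper: wrap each request in limit() so tasks are queued
// instead of all being fired at once.
async function scrapeAll(urls) {
  const tasks = urls.map((url) =>
    limit(async () => {
      const res = await fetch(url);
      return res.text();
    })
  );
  // allSettled so a single fetch fail does not abort the whole batch
  return Promise.allSettled(tasks);
}
```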
Related Issue
Fixes #6
Type of Change
How Has This Been Tested?
Checklist
Additional Notes
Now, the URLs which are loaded from api-docs-url.csv can be classified into four categories based on their behaviour generated from Crawlee; category 4 is a fetch fail error. This fix aimed at reducing the number of errors due to category 4. Without limits, all the URLs resulted in a fetch fail error. With limits, the number now is around 68~ out of 564 URLs.