added concurrency limits to reduce number of errors #7

Merged
2 commits merged into UpdAPI:main on Jan 28, 2025

Conversation

@pradhanhitesh (Contributor) commented Jan 15, 2025

Description

In this fix, I have added concurrency limits to the fast-scraper.js code using pLimit. The default limit on concurrent requests added to Crawlee is set with const limit = pLimit(10);.
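
For context, here is a minimal sketch of the pattern (not the exact fast-scraper.js code; processUrl and processAll are illustrative names):

```js
import pLimit from 'p-limit';

const limit = pLimit(10); // at most 10 URL tasks run at the same time

// Illustrative per-URL task; in fast-scraper.js this would be the work
// performed for each URL before/while handing it to Crawlee.
async function processUrl(url) {
  const res = await fetch(url);
  return { url, status: res.status };
}

async function processAll(urls) {
  // Wrapping every task in limit() keeps at most 10 promises in flight,
  // instead of firing all of the requests at once.
  const tasks = urls.map((url) => limit(() => processUrl(url)));
  return Promise.allSettled(tasks);
}
```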

Related Issue

Fixes #6

Type of Change

  • Bug fix
  • Enhancement

How Has This Been Tested?

  • Unit Tests
  • Manual Testing

Checklist

  • I have performed a self-review of my code.
  • I have written unit tests and/or integration tests.
  • My changes generate fewer new warnings or errors.
  • My code follows the repository's coding standards.

Additional Notes

Now, the URLs loaded from api-docs-url.csv can be classified into four categories based on the behaviour observed via Crawlee:

  1. robots.txt fetched, but it does not allow scraping.
  2. robots.txt fetched, and it allows scraping.
  3. robots.txt could not be fetched; we proceed with scraping.
  4. The URL itself could not be fetched, which causes a fetch-fail error.

This fix aims to reduce the number of errors due to category 4. Without limits, all of the URLs resulted in fetch-fail errors; with limits, the number is now around ~68 out of 564 URLs.
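
A rough sketch of how a single URL could be bucketed into these categories (illustrative only - this is not the logic Crawlee applies internally, and it assumes the robots-parser package):

```js
import robotsParser from 'robots-parser';

async function classify(url) {
  const robotsUrl = new URL('/robots.txt', url).href;
  let robotsTxt = null;
  try {
    const res = await fetch(robotsUrl);
    if (res.ok) robotsTxt = await res.text();
  } catch {
    // robots.txt could not be fetched; fall through and try the URL anyway
  }
  if (robotsTxt) {
    const robots = robotsParser(robotsUrl, robotsTxt);
    if (!robots.isAllowed(url)) return 'disallowed';       // category 1
  }
  try {
    await fetch(url);
    return robotsTxt ? 'allowed' : 'no-robots-so-proceed'; // categories 2 and 3
  } catch {
    return 'fetch-failed';                                 // category 4
  }
}
```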

@in-c0 (Collaborator) commented Jan 16, 2025

Great spotting! I also like the formatting of your PR. Should we make it a template for future PRs?

@in-c0 self-requested a review on January 16, 2025, 13:07
@in-c0 (Collaborator) commented Jan 16, 2025

Hmm, interesting!
I will have to look into a more reliable way to generate URLs for public API documentation and reduce the number of invalid URLs! The method I used was asking ChatGPT with the prompt "generate the urls pointing to the appropriate page of popular public APIs based on this table", repeatedly asking it to generate "10 more", and merging the tables together - but hallucination obviously seems to be an issue here. Will create a new issue for this.

@pradhanhitesh (Contributor, Author) commented

> Great spotting! I also like the formatting of your PR. Should we make it a template for future PRs?

Yes, go ahead and create a PR template. It will be good to have structured PRs in the future.

> Hmm, interesting! I will have to look into a more reliable way to generate URLs for public API documentation and reduce the number of invalid URLs! The method I used was asking ChatGPT with the prompt "generate the urls pointing to the appropriate page of popular public APIs based on this table", repeatedly asking it to generate "10 more", and merging the tables together - but hallucination obviously seems to be an issue here. Will create a new issue for this.

I see! There are a few invalid URLs, such as https://developers.klarna.com/ and https://developers.klarna.com/documentation/kco-v3/payments-api/, which cannot be reached or simply do not exist. Let me also try to find legitimate sources of API documentation. Should I verify the links?

Also, I was thinking that we need not scrape the websites on every run. We could introduce "versioning", where the date the URL was last scraped is recorded in api-url-docs.csv. If a site would be scraped within one month of its last-scraped date, we skip it, since we do not expect the site to change within a month. Just a thought - maybe we can explore the idea. What do you think?
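
A minimal sketch of that check, assuming a hypothetical lastScraped column (ISO date string) in the CSV:

```js
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Decide whether a CSV row is due for re-scraping. The lastScraped field
// name is hypothetical; it would hold the date the URL was last scraped.
function needsRescrape(row, now = Date.now()) {
  if (!row.lastScraped) return true;       // never scraped before
  const last = Date.parse(row.lastScraped);
  if (Number.isNaN(last)) return true;     // unreadable date, so rescrape
  return now - last > THIRTY_DAYS_MS;      // older than one month
}

// e.g. const dueUrls = rows.filter((row) => needsRescrape(row)).map((row) => row.url);
```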

On a separate note, have you reviewed the code yet, @in-c0? If everything seems okay, you can merge the PR. Thanks! 🖱️

@in-c0 (Collaborator) commented Jan 16, 2025

> Yes, go ahead and create a PR template. It will be good to have structured PRs in the future.

Okay! Thank you for your guidance.

> I see! There are a few invalid URLs, such as https://developers.klarna.com/ and https://developers.klarna.com/documentation/kco-v3/payments-api/, which cannot be reached or simply do not exist. Let me also try to find legitimate sources of API documentation. Should I verify the links?

I think I'll do a whole revamp and use GitHub Actions for automated URL table management. More in Issue #8.

> we need not scrape the websites on every run. We could introduce "versioning", where the date the URL was last scraped is recorded in api-url-docs.csv. If a site would be scraped within one month of its last-scraped date, we skip it, since we do not expect the site to change within a month. Just a thought - maybe we can explore the idea. What do you think?

Agreed that we shouldn't need to scrape on every run, though I think certain pages are more time-sensitive than others (policies, TOS, ...). In addition to the proposed solution for Issue #8, what we could do is scrape daily and host the result somewhere for users to download and fetch easily. If not daily, we can scrape as often as we practically can within our available resources. Introducing versioning is a great idea, but I'm wary of causing confusion with the existing versioning of the APIs. What are your thoughts on this?

> On a separate note, have you reviewed the code yet, @in-c0? If everything seems okay, you can merge the PR. Thanks! 🖱️

Yep, I have, but I have not merged yet as I am still reviewing - I tried to check out your branch and confirm that "the number is now around ~68 out of 564 URLs", but mine still stays at 40 after running fast-scraper.js on the improve-concurrency branch. I've been trying to figure out what's going on.

@in-c0 (Collaborator) commented Jan 16, 2025

> mine still stays at 40 after running fast-scraper.js on the improve-concurrency branch

@pradhanhitesh Could you please take a look at the log (log.txt, attached) and see whether you notice any difference from your output?

@pradhanhitesh (Contributor, Author) commented

"The number now is around 68~ out of 564 URLs", but mine still stays at 40 after running fast-scraper.js on improve-concurrency branch. I've been trying to figure out what's going on.

To figure out the exact number, please refer to this .csv file here. The number of .json files that I generate after running fast-scraper.js is close to 40 as well. The number of errors due to fetch failures keeps fluctuating and is therefore not constant at 68.

There are a few URLs which allow scraping (~20), and then there are URLs for which robots.txt could not be fetched, so the default is to scrape them. Of those sites, I have seen that a few return a 404 Not Found page when scraped, which obviously contains no useful information. Therefore, with the check if (!data.apiName || !data.url || !data.title || !data.content || data.title.includes('404')), I remove most invalid files and only keep those which have a valid apiName, url, title and content, and whose title does not include 404.
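
Condensed, the pruning rule looks roughly like this (isInvalidResult is an illustrative name; in fast-scraper.js the check lives inside cleanUpAndMoveFiles):

```js
import { readFile, unlink } from 'node:fs/promises';

// Returns true if a scraped .json result should be discarded: missing key
// fields, or a title that indicates a 404 page.
async function isInvalidResult(jsonPath) {
  const data = JSON.parse(await readFile(jsonPath, 'utf8'));
  return !data.apiName || !data.url || !data.title || !data.content
      || data.title.includes('404');
}

// e.g. if (await isInvalidResult(file)) await unlink(file);
```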

So, if my understanding is correct:

  1. We scrape 20 websites (because robots.txt allows us to).
  2. We scrape ~57 more websites (because robots.txt could not be fetched, so we default to allowing scraping). Out of these, some return a 404 error page, which is captured in the title. I prune these files using the cleanUpAndMoveFiles function; I believe more than 30 .json files are deleted here because they contain a 404 error page. As you mentioned earlier, ChatGPT hallucination might have generated non-existent websites.
  3. We cannot scrape around 68-71 websites because of fetch-fail errors. This number keeps fluctuating.
  4. We cannot scrape 416 websites (because robots.txt does not allow us to).

> Introducing versioning is a great idea, but I'm wary of causing confusion with the existing versioning of the APIs. What are your thoughts on this?

We can version the datasets instead, which would be preferable. Setting up cron jobs via Actions would be suitable, and we could fetch the latest dataset folder with up-to-date documents for users to scroll through and browse.

@pradhanhitesh (Contributor, Author) commented

> mine still stays at 40 after running fast-scraper.js on the improve-concurrency branch

> @pradhanhitesh Could you please take a look at the log (log.txt, attached) and see whether you notice any difference from your output?

Compared to me, you have fewer fetch-fail errors, around ~4. Also, the number of requests via Crawlee is ~90 for you, but for me it is only 60: INFO CheerioCrawler: Finished! Total 60 requests: 51 succeeded, 9 failed. {"terminal":true}. There are some discrepancies in the numbers here. Is it because of geo-restrictions? I am not sure. Let me try executing this in Python and checking the numbers.

@in-c0 (Collaborator) commented Jan 28, 2025

> The number of .json files that I generate after running fast-scraper.js is close to 40 as well. The number of errors due to fetch failures keeps fluctuating and is therefore not constant at 68.

Thank you for clarifying this! And apologies for the delay in getting back to you. The branch seems to be working as expected - I'll merge it into main 👍

> • We scrape 20 websites (because robots.txt allows us to).
> • We scrape ~57 more websites (because robots.txt could not be fetched, so we default to allowing scraping). Out of these, some return a 404 error page, which is captured in the title. I prune these files using the cleanUpAndMoveFiles function; I believe more than 30 .json files are deleted here because they contain a 404 error page. As you mentioned earlier, ChatGPT hallucination might have generated non-existent websites.
> • We cannot scrape around 68-71 websites because of fetch-fail errors. This number keeps fluctuating.
> • We cannot scrape 416 websites (because robots.txt does not allow us to).

Thank you for taking the time to summarize this. The number of disallowed websites surprises me! I'll look into finding a more reliable way to fetch valid URLs.

@in-c0 merged commit af3dbd6 into UpdAPI:main on Jan 28, 2025

Successfully merging this pull request may close these issues.

[BUGS] Current implementation of concurrency is causing errors