added concurrency limits to reduce number of errors #7

Merged
2 commits merged into UpdAPI:main on Jan 28, 2025

Conversation

@pradhanhitesh (Contributor) commented Jan 15, 2025

Description

In this fix, I have added concurrency limits to the fast-scraper.js code using pLimit. The default limit on concurrent requests added to Crawlee is set with const limit = pLimit(10);.
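
For context, here is a minimal sketch of the pattern (not the exact fast-scraper.js code; processUrl and processAll are illustrative names):

```js
import pLimit from 'p-limit';

const limit = pLimit(10); // at most 10 URL tasks run at the same time

// Illustrative per-URL task; in fast-scraper.js this would be the work
// performed for each URL before/while handing it to Crawlee.
async function processUrl(url) {
  const res = await fetch(url);
  return { url, status: res.status };
}

async function processAll(urls) {
  // Wrapping every task in limit() keeps at most 10 promises in flight,
  // instead of firing all of the requests at once.
  const tasks = urls.map((url) => limit(() => processUrl(url)));
  return Promise.allSettled(tasks);
}
```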

Related Issue

Fixes #6

Type of Change

  • Bug fix
  • Enhancement

How Has This Been Tested?

  • Unit Tests
  • Manual Testing

Checklist

  • I have performed a self-review of my code.
  • I have written unit tests and/or integration tests.
  • My changes generate fewer new warnings or errors.
  • My code follows the repository's coding standards.

Additional Notes

Now, the URLs loaded from api-docs-url.csv can be classified into four categories based on the behaviour observed via Crawlee:

  1. robots.txt fetched, but it does not allow scraping.
  2. robots.txt fetched, and it allows scraping.
  3. robots.txt could not be fetched; we proceed with scraping.
  4. The URL itself could not be fetched, which causes a fetch-fail error.

This fix aims to reduce the number of errors due to category 4. Without limits, all of the URLs resulted in fetch-fail errors; with limits, the number is now around ~68 out of 564 URLs.
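
A rough sketch of how a single URL could be bucketed into these categories (illustrative only - this is not the logic Crawlee applies internally, and it assumes the robots-parser package):

```js
import robotsParser from 'robots-parser';

async function classify(url) {
  const robotsUrl = new URL('/robots.txt', url).href;
  let robotsTxt = null;
  try {
    const res = await fetch(robotsUrl);
    if (res.ok) robotsTxt = await res.text();
  } catch {
    // robots.txt could not be fetched; fall through and try the URL anyway
  }
  if (robotsTxt) {
    const robots = robotsParser(robotsUrl, robotsTxt);
    if (!robots.isAllowed(url)) return 'disallowed';       // category 1
  }
  try {
    await fetch(url);
    return robotsTxt ? 'allowed' : 'no-robots-so-proceed'; // categories 2 and 3
  } catch {
    return 'fetch-failed';                                 // category 4
  }
}
```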

@in-c0 (Collaborator) commented Jan 16, 2025

Great spotting! I also like the formatting of your PR. Should we make it a template for future PRs?

@in-c0 self-requested a review on January 16, 2025, 13:07
@in-c0 (Collaborator) commented Jan 16, 2025

Hmm, interesting!
I will have to look into a more reliable way to generate URLs for public API documentation and reduce the number of invalid URLs! The method I used was asking ChatGPT with the prompt "generate the urls pointing to the appropriate page of popular public APIs based on this table", repeatedly asking it to generate "10 more", and merging the tables together - but hallucination obviously seems to be an issue here. Will create a new issue for this.

@pradhanhitesh (Contributor, Author) commented

> Great spotting! I also like the formatting of your PR. Should we make it a template for future PRs?

Yes, go ahead and create a PR template. It will be good to have structured PRs in the future.

> Hmm, interesting! I will have to look into a more reliable way to generate URLs for public API documentation and reduce the number of invalid URLs! The method I used was asking ChatGPT with the prompt "generate the urls pointing to the appropriate page of popular public APIs based on this table", repeatedly asking it to generate "10 more", and merging the tables together - but hallucination obviously seems to be an issue here. Will create a new issue for this.

I see! There are a few invalid URLs, such as https://developers.klarna.com/ and https://developers.klarna.com/documentation/kco-v3/payments-api/, which cannot be reached or simply do not exist. Let me also try to find legitimate sources of API documentation. Should I verify the links?

Also, I was thinking that we need not scrape the websites on every run. We could introduce "versioning", where the date the URL was last scraped is recorded in api-url-docs.csv. If a site would be scraped within one month of its last-scraped date, we skip it, since we do not expect the site to change within a month. Just a thought - maybe we can explore the idea. What do you think?
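
A minimal sketch of that check, assuming a hypothetical lastScraped column (ISO date string) in the CSV:

```js
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Decide whether a CSV row is due for re-scraping. The lastScraped field
// name is hypothetical; it would hold the date the URL was last scraped.
function needsRescrape(row, now = Date.now()) {
  if (!row.lastScraped) return true;       // never scraped before
  const last = Date.parse(row.lastScraped);
  if (Number.isNaN(last)) return true;     // unreadable date, so rescrape
  return now - last > THIRTY_DAYS_MS;      // older than one month
}

// e.g. const dueUrls = rows.filter((row) => needsRescrape(row)).map((row) => row.url);
```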

On a separate note, have you reviewed the code yet, @in-c0? If everything seems okay, you can merge the PR. Thanks! 🖱️

@in-c0 (Collaborator) commented Jan 16, 2025

> Yes, go ahead and create a PR template. It will be good to have structured PRs in the future.

Okay! Thank you for your guidance.

> I see! There are a few invalid URLs, such as https://developers.klarna.com/ and https://developers.klarna.com/documentation/kco-v3/payments-api/, which cannot be reached or simply do not exist. Let me also try to find legitimate sources of API documentation. Should I verify the links?

I think I'll do a whole revamp and use GitHub Actions for automated URL table management. More in Issue #8.

> we need not scrape the websites on every run. We could introduce "versioning", where the date the URL was last scraped is recorded in api-url-docs.csv. If a site would be scraped within one month of its last-scraped date, we skip it, since we do not expect the site to change within a month. Just a thought - maybe we can explore the idea. What do you think?

Agreed that we shouldn't need to scrape on every run, though I think certain pages are more time-sensitive than others (policies, TOS, ...). In addition to the proposed solution for Issue #8, what we could do is scrape daily and host the result somewhere for users to download and fetch easily. If not daily, we can scrape as often as we practically can within our available resources. Introducing versioning is a great idea, but I'm wary of causing confusion with the existing versioning of the APIs. What are your thoughts on this?

> On a separate note, have you reviewed the code yet, @in-c0? If everything seems okay, you can merge the PR. Thanks! 🖱️

Yep, I have, but I have not merged yet as I am still reviewing - I tried to check out your branch and confirm that "the number is now around ~68 out of 564 URLs", but mine still stays at 40 after running fast-scraper.js on the improve-concurrency branch. I've been trying to figure out what's going on.

@in-c0 (Collaborator) commented Jan 16, 2025

> mine still stays at 40 after running fast-scraper.js on the improve-concurrency branch

@pradhanhitesh Could you please take a look at the log (log.txt, attached) and see whether you notice any difference from your output?

@pradhanhitesh (Contributor, Author) commented

"The number now is around 68~ out of 564 URLs", but mine still stays at 40 after running fast-scraper.js on improve-concurrency branch. I've been trying to figure out what's going on.

To figure out the exact number, please refer to this .csv file here. The number of .json files that I generate after running fast-scraper.js is close to 40 as well. The number of errors due to fetch failures keeps fluctuating and is therefore not constant at 68.

There are a few URLs which allow scraping (~20), and then there are URLs for which robots.txt could not be fetched, so the default is to scrape them. Of those sites, I have seen that a few return a 404 Not Found page when scraped, which obviously contains no useful information. Therefore, with the check if (!data.apiName || !data.url || !data.title || !data.content || data.title.includes('404')), I remove most invalid files and only keep those which have a valid apiName, url, title and content, and whose title does not include 404.
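
Condensed, the pruning rule looks roughly like this (isInvalidResult is an illustrative name; in fast-scraper.js the check lives inside cleanUpAndMoveFiles):

```js
import { readFile, unlink } from 'node:fs/promises';

// Returns true if a scraped .json result should be discarded: missing key
// fields, or a title that indicates a 404 page.
async function isInvalidResult(jsonPath) {
  const data = JSON.parse(await readFile(jsonPath, 'utf8'));
  return !data.apiName || !data.url || !data.title || !data.content
      || data.title.includes('404');
}

// e.g. if (await isInvalidResult(file)) await unlink(file);
```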

So, if my understanding is correct:

  1. We scrape 20 websites (because robots.txt allows us to).
  2. We scrape ~57 more websites (because robots.txt could not be fetched, so we default to allowing scraping). Out of these, some return a 404 error page, which is captured in the title. I prune these files using the cleanUpAndMoveFiles function; I believe more than 30 .json files are deleted here because they contain a 404 error page. As you mentioned earlier, ChatGPT hallucination might have generated non-existent websites.
  3. We cannot scrape around 68-71 websites because of fetch-fail errors. This number keeps fluctuating.
  4. We cannot scrape 416 websites (because robots.txt does not allow us to).

> Introducing versioning is a great idea, but I'm wary of causing confusion with the existing versioning of the APIs. What are your thoughts on this?

We can version the datasets instead, which would be preferable. Setting up cron jobs via Actions would be suitable, and we could fetch the latest dataset folder with up-to-date documents for users to scroll through and browse.

@pradhanhitesh (Contributor, Author) commented

> mine still stays at 40 after running fast-scraper.js on the improve-concurrency branch

> @pradhanhitesh Could you please take a look at the log (log.txt, attached) and see whether you notice any difference from your output?

Compared to me, you have fewer fetch-fail errors, around ~4. Also, the number of requests via Crawlee is ~90 for you, but for me it is only 60: INFO CheerioCrawler: Finished! Total 60 requests: 51 succeeded, 9 failed. {"terminal":true}. There are some discrepancies in the numbers here. Is it because of geo-restrictions? I am not sure. Let me try executing this in Python and checking the numbers.

@in-c0 (Collaborator) commented Jan 28, 2025

> The number of .json files that I generate after running fast-scraper.js is close to 40 as well. The number of errors due to fetch failures keeps fluctuating and is therefore not constant at 68.

Thank you for clarifying this! And apologies for the delay in getting back to you. The branch seems to be working as expected - I'll merge it into main 👍

> • We scrape 20 websites (because robots.txt allows us to).
> • We scrape ~57 more websites (because robots.txt could not be fetched, so we default to allowing scraping). Out of these, some return a 404 error page, which is captured in the title. I prune these files using the cleanUpAndMoveFiles function; I believe more than 30 .json files are deleted here because they contain a 404 error page. As you mentioned earlier, ChatGPT hallucination might have generated non-existent websites.
> • We cannot scrape around 68-71 websites because of fetch-fail errors. This number keeps fluctuating.
> • We cannot scrape 416 websites (because robots.txt does not allow us to).

Thank you for taking the time to summarize this. The number of disallowed websites surprises me! I'll look into finding a more reliable way to fetch valid URLs.

@in-c0 merged commit af3dbd6 into UpdAPI:main on Jan 28, 2025

Successfully merging this pull request may close these issues.

[BUGS] Current implementation of concurrency is causing errors