Use the standard Apify SDK to develop a simple web crawler that downloads all the product pages from a German supermarket called Edeka. The key outcomes should be:
- The crawler should run locally on your machine as a Node.js based CLI application
- The crawler should not use a browser (please base it on Apify's CheerioCrawler)
- The site to crawl is: https://www.edeka24.de
- The crawler should find and download every product page available on the site (here is an example of one: https://www.edeka24.de/Wein/Sekt-Prosecco/Deutschland/Kessler-Sekt-Hochgewaechs-Chardonnay-Brut-0-75L.html)
- All product pages found by your crawler should be saved locally to disk in a dedicated Apify KeyValueStore named "product-pages"
- Resources that are not product pages (e.g. the home page) may also be downloaded, but if you do so, they should be stored in a separate Apify KeyValueStore
To run the application, use the following command:

```
npm run start
```
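A minimal sketch of the entry point, assuming the pre-Crawlee Apify SDK v2 (the `apify` npm package) and a `package.json` whose `start` script runs `node main.js`; the store name `other-pages` is my own choice for the second KeyValueStore, since the brief only requires that it be separate:

```js
// main.js — entry point. When running locally, named stores are persisted
// under ./apify_storage/key_value_stores/<name>/.
const Apify = require('apify');

Apify.main(async () => {
    const productStore = await Apify.openKeyValueStore('product-pages');
    const otherStore = await Apify.openKeyValueStore('other-pages'); // assumed name

    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.edeka24.de' });

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction: async ({ request, body, $ }) => {
            // Per-page logic goes here; a sketch of it follows the step list below.
        },
    });

    await crawler.run();
});
```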
- Check if the current page is (these URL checks are sketched in code after this list):
  - on the same domain as intended
  - a valid URL (ignore tel: and other non-HTTP schemes)
  - a file type with a valid extension
  - a relative or absolute URL (if relative, prepend the domain before analyzing it)
- Get the title (which I am treating as a unique identifier), sanitize it, and get the page content
- Verify whether it is a product page or a general page
- Verify whether it has already been saved (if not, save it)
- Gather all links on the page
- Sanitize the gathered links (checking that each URL is valid, and so on)
- Add them to the requestQueue, to be analyzed later (a sketch of the full page handler follows this list)
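The URL checks above could look roughly like this; `normalizeUrl`, `BASE_URL` and the extension blacklist are my own names and assumptions, not part of the Apify SDK:

```js
const BASE_URL = 'https://www.edeka24.de';
const HOSTNAME = new URL(BASE_URL).hostname;
// Extensions we do not want to crawl (assets, downloads, ...).
const SKIPPED_EXTENSIONS = /\.(jpe?g|png|gif|svg|ico|css|js|pdf|zip)$/i;

// Returns an absolute, validated URL string, or null if the link should be ignored.
function normalizeUrl(href) {
    if (!href) return null;
    let url;
    try {
        // Relative URLs are resolved against the site's domain before analysis.
        url = new URL(href, BASE_URL);
    } catch (err) {
        return null; // not a valid URL at all
    }
    // Ignore tel:, mailto: and any other non-HTTP scheme.
    if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
    // Stay on the same domain as intended.
    if (url.hostname !== HOSTNAME) return null;
    // Skip extensions that cannot be product or navigation pages.
    if (SKIPPED_EXTENSIONS.test(url.pathname)) return null;
    return url.href;
}
```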
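And a sketch of the page handler tying the remaining steps together, reusing `productStore`, `otherStore`, `requestQueue` and `normalizeUrl` from the snippets above; key-value store keys may only contain characters from `[a-zA-Z0-9!-_.'()]`, hence the sanitization:

```js
const handlePageFunction = async ({ request, body, $ }) => {
    // The sanitized <title> serves as the (admittedly imperfect) unique identifier.
    const title = ($('title').text() || 'untitled').trim();
    const key = title.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-').slice(0, 250);

    // Product pages were the only ones observed to end in .html (see the notes below).
    const isProduct = request.url.endsWith('.html');
    const store = isProduct ? productStore : otherStore;

    // Save the raw HTML only if this key has not been stored before.
    if (await store.getValue(key) === null) {
        await store.setValue(key, body, { contentType: 'text/html' });
    }

    // Gather all links, sanitize them, and enqueue the valid ones.
    // requestQueue.addRequest() deduplicates by URL, so already-enqueued
    // pages are not added again.
    const hrefs = $('a[href]').map((i, el) => $(el).attr('href')).get();
    for (const href of hrefs) {
        const url = normalizeUrl(href);
        if (url) await requestQueue.addRequest({ url });
    }
};
```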
- Product pages were identified by URLs ending in .html (this may or may not always be the case; when analyzing the site, I only found products whose URLs end in .html)
- Using the title as an identifier might not be the most accurate method
- Only URLs not already in the requestQueue are added to it (requestQueue.addRequest() deduplicates by URL)
- Add a command-line argument to enqueue and crawl only product pages (this should be based on a URL identifier); a possible approach is sketched below
- Improve the page identification system -> the title may not be the best identifier
- Improve the logging system and add a final summary to profile the run
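For the command-line argument, one possible approach (the flag name `--products-only` is my own) is to read `process.argv` and gate the save step in the handler; note that non-product pages would still need to be crawled so that product links keep being discovered:

```js
// npm forwards extra arguments after `--`:  npm run start -- --products-only
const productsOnly = process.argv.includes('--products-only');

// Inside handlePageFunction, the save step then becomes:
if (isProduct || !productsOnly) {
    await store.setValue(key, body, { contentType: 'text/html' });
}
```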