BN0v0/web_scrapper

Goal

Use the standard Apify SDK to develop a simple web crawler that downloads all the product pages from a German supermarket called Edeka. The key outcomes should be:

  • The crawler should run locally on your machine as a Node.js-based CLI application
  • The crawler should not use a browser (please base it on Apify's CheerioCrawler)
  • The site to crawl is: https://www.edeka24.de
  • The crawler should find and download every product page available on the site (here is an example of one: https://www.edeka24.de/Wein/Sekt-Prosecco/Deutschland/Kessler-Sekt-Hochgewaechs-Chardonnay-Brut-0-75L.html)
  • All product pages found by your crawler should be saved locally to disk in a dedicated Apify KeyValueStore named "product-pages"
  • Resources that are not product pages (e.g. the home page) can be downloaded, but if you do so, these should be stored in a separate Apify KeyValueStore (a minimal crawler skeleton following these requirements is sketched below)
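
For reference, here is a minimal sketch of how these requirements map onto the Apify SDK, assuming SDK v2 (the apify package). The "other-pages" store name and the simple ".html" product check are assumptions for illustration, not necessarily what this repository does:

// main.js -- minimal crawler wiring (sketch, Apify SDK v2)
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.edeka24.de' });

    // Dedicated store for product pages, plus a separate store for everything else
    // ("other-pages" is an assumed name).
    const productStore = await Apify.openKeyValueStore('product-pages');
    const otherStore = await Apify.openKeyValueStore('other-pages');

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction: async ({ request, $, body }) => {
            // Assumption: product pages are the ones whose URL ends in ".html".
            const store = request.url.endsWith('.html') ? productStore : otherStore;

            // Key-value store keys may only contain a-zA-Z0-9!-_.'() characters.
            const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-');
            await store.setValue(key, body.toString(), { contentType: 'text/html' });

            // Follow links that stay on the same site.
            await Apify.utils.enqueueLinks({
                $,
                requestQueue,
                baseUrl: request.loadedUrl,
                pseudoUrls: ['https://www.edeka24.de/[.*]'],
            });
        },
    });

    await crawler.run();
});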

Run the application

To run the application, execute the following command:

npm run start

Thinking Process

  1. Check whether the current URL (see the helper sketch after this list):
    • is in the same domain as intended
    • is a valid URL (ignore tel: and other non-HTTP schemes)
    • has a valid extension
    • is relative or absolute (if relative, prepend the domain before analyzing it)
  2. Get the title (I am treating it as a unique identifier), sanitize it, and get the page content
  3. Check whether it is a product page or a general page
  4. Check whether it has already been saved (if not, save it)
  5. Gather all links on the page
  6. Sanitize the gathered links (check whether each URL is valid, and so on)
  7. Add them to the requestQueue (to be analyzed again later)
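
A sketch of how the URL-handling steps above could look; the helper names are illustrative and not necessarily the ones used in the code:

// Step 1: normalize a link -- resolve relative URLs against the base and keep
// only http(s) links on the intended domain (illustrative helper).
const BASE_URL = 'https://www.edeka24.de';

function normalizeUrl(href, baseUrl = BASE_URL) {
    try {
        const url = new URL(href, baseUrl);                  // resolves relative URLs
        if (url.protocol !== 'http:' && url.protocol !== 'https:') return null; // skip tel:, mailto:, ...
        if (url.hostname !== 'www.edeka24.de') return null;  // stay on the same domain
        return url.href;
    } catch (err) {
        return null;                                         // not a valid URL at all
    }
}

// Step 2: turn the page title into a key that is safe for an Apify key-value store
// (keys may only contain a-zA-Z0-9!-_.'() characters).
function sanitizeTitle(title) {
    return title.trim().replace(/[^a-zA-Z0-9!\-_.'()]/g, '-').slice(0, 250);
}

// Step 3: product pages on edeka24.de were observed to end in ".html".
function isProductPage(url) {
    return url.endsWith('.html');
}

For example, normalizeUrl('/Wein/Sekt-Prosecco.html') would resolve to the absolute URL on www.edeka24.de, while normalizeUrl('tel:+4912345') would be dropped.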

Identified problems

  1. Product pages were identified by URLs ending in .html (this may not always be the case; when analyzing the site, I only found products whose URLs end in .html)
  2. Using the title as an identifier might not be the most accurate method

What I would do to improve

  1. Only add URLs that are not already present to the requestQueue
  2. Add a command-line argument to only enqueue and crawl product pages (based on a URL identifier)
  3. Improve the page identification system, since the title may not be the best identifier (one URL-based option is sketched below)
  4. Improve the logging system and add a final summary to profile the crawl
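
For point 3, one option would be to derive the store key from the URL path rather than the page title, since the path is already unique per page. A hypothetical helper, not part of this repository:

// Hypothetical replacement for title-based identification: derive the key
// from the URL path, which is unique per page.
function keyFromUrl(url) {
    const { pathname } = new URL(url);
    // e.g. "/Wein/Sekt-Prosecco/Deutschland/Kessler-Sekt-Hochgewaechs-Chardonnay-Brut-0-75L.html"
    return pathname
        .replace(/^\/+|\/+$/g, '')                   // trim leading/trailing slashes
        .replace(/[^a-zA-Z0-9!\-_.'()]/g, '-');      // keep only characters allowed in store keys
}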
