-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update crawler.ts: Add hostname check to keep crawler on the same domain. #15
base: main
Are you sure you want to change the base?
Update crawler.ts: Add hostname check to keep crawler on the same domain. #15
Conversation
Added hostname check before adding a url to the queue to make sure we don't follow social links to the end of the earth if editing the URLs we're crawling. If the found URL has the same hostname as the starting URL, it will add that URL to the queue. Thanks for the project!
Tested and works like a charm. I'd also like to extend crawler to take an optional regex, what do you think? |
src/app/api/crawl/crawler.ts
Outdated
@@ -1,5 +1,5 @@ | |||
import cheerio from 'cheerio'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your linter shouldnt be making changes for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorting now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR now only brings in a hostname check
@@ -80,8 +102,20 @@ class Crawler { | |||
|
|||
private extractUrls(html: string, baseUrl: string): string[] { | |||
const $ = cheerio.load(html); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor observation: This will give a 'deprecated' warning. You can do import { load } from 'cheerio';
and use the load by itself instead.
Problem
The existing crawler logic adds any found URLs to the queue without checking if they belong to the same domain as the starting URL. This could result in the crawler venturing off into unrelated domains, especially social links, which may not be desirable for the scope of the crawl.
Solution
Introduced a hostname check in the addNewUrlsToQueue function. This ensures that only URLs that have the same hostname as the starting URL are added to the queue for crawling. This feature helps in restricting the crawler within the scope of the initial domain, thereby making the crawl more focused and efficient.
Type of Change
New feature (non-breaking change which adds functionality)
Test Plan
Run the crawler on a starting URL that contains both internal and external links.
Verify that only URLs belonging to the same hostname as the starting URL are added to the queue.
Optionally, you can print the queue or keep logs to verify that the URLs are indeed from the same hostname.