Fix relative URL joining #570 #581

Open
wants to merge 1 commit into
base: main

@Sparshsing Sparshsing commented Jan 29, 2025

Summary

Fixes #570
Relative URL joining is fixed. The issue was caused by applying link protection (wrapping the URL in <>) before the URL was joined with the base URL.

List of files changed and why

crawl4ai/html2text.py -
While parsing "a" tags in the handle_tag method, the URL in the "href" attribute was protected by wrapping it in <> at the opening tag, and only joined with base_url at the closing tag. To fix this, the <> protection is now applied after the URL is joined with base_url.
The same change has been made in the o() method.
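
The ordering problem can be illustrated with a minimal sketch. It uses the standard library's urljoin rather than the project's own helpers, and the function names protect_link, protect_then_resolve and resolve_then_protect are hypothetical, chosen only for illustration:

```python
from urllib.parse import urljoin


def protect_link(url: str) -> str:
    """Wrap a URL in angle brackets (hypothetical stand-in for the <> protection)."""
    return f"<{url}>"


def protect_then_resolve(href: str, base_url: str) -> str:
    """Old ordering: the brackets are treated as part of the relative path."""
    return urljoin(base_url, protect_link(href))


def resolve_then_protect(href: str, base_url: str) -> str:
    """Fixed ordering: resolve against the base first, then protect."""
    return protect_link(urljoin(base_url, href))


base = "https://docs.crawl4ai.com/core/crawler-result/"

# The fixed ordering yields a clean, protected absolute URL.
print(resolve_then_protect("../installation/", base))
# -> <https://docs.crawl4ai.com/core/installation/>

# The old ordering leaves the angle brackets embedded inside the joined URL.
print(protect_then_resolve("../installation/", base))
```

The relative path ../installation/ is only an example input, not one taken from the tested pages.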

crawl4ai/markdown_generation_strategy.py -
In the conversion of links to citations for markdown_v2, the <> symbols are now removed from the URL if it was protected, and the URL is then joined with base_url.
The fast_urljoin method was also modified, since it joined URLs incorrectly when the base URL included a path, e.g. https://docs.crawl4ai.com/core/crawler-result/.
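
As a rough reference for the intended behavior, the sketch below strips the <> protection and then resolves the URL against a base that includes a path. It uses the standard library's urljoin, not the modified fast_urljoin, and the helper names unprotect and resolve_citation_url are assumptions made for this example:

```python
from urllib.parse import urljoin


def unprotect(url: str) -> str:
    """Strip surrounding angle brackets if the URL was protected."""
    if url.startswith("<") and url.endswith(">"):
        return url[1:-1]
    return url


def resolve_citation_url(url: str, base_url: str) -> str:
    """Remove the <> protection first, then join with the base URL."""
    return urljoin(base_url, unprotect(url))


base = "https://docs.crawl4ai.com/core/crawler-result/"

# Relative to the base path
print(resolve_citation_url("<../installation/>", base))
# -> https://docs.crawl4ai.com/core/installation/

# Root-relative: replaces the whole base path
print(resolve_citation_url("/api/", base))
# -> https://docs.crawl4ai.com/api/

# Already absolute: returned unchanged
print(resolve_citation_url("https://example.com/x", base))
# -> https://example.com/x
```

Any faster custom join would presumably need to agree with these results for a base URL that has a path component.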

How Has This Been Tested?

Tested by generating markdown and markdown_v2 for a few URLs,
e.g. https://docs.crawl4ai.com/ and https://docs.crawl4ai.com/core/crawler-result/
The links in the generated Markdown worked as expected.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Sparshsing

Maybe there should be a check in the top-level code that appends / to the URL if it is missing, since a missing trailing slash can cause issues with relative URLs.
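
A small sketch of why the trailing slash matters, and of what such a check could look like. ensure_trailing_slash is a hypothetical helper, not existing project code, and it is deliberately naive: it would also append / to a URL that points at a file such as page.html, which would be wrong.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def ensure_trailing_slash(url: str) -> str:
    """Append / to the path if missing, keeping query and fragment intact."""
    parts = urlsplit(url)
    if parts.path.endswith("/"):
        return url
    return urlunsplit(parts._replace(path=parts.path + "/"))


without_slash = "https://docs.crawl4ai.com/core/crawler-result"
with_slash = ensure_trailing_slash(without_slash)

# Without the slash, the last segment is treated as a file and dropped
# before ".." is applied.
print(urljoin(without_slash, "../installation/"))
# -> https://docs.crawl4ai.com/installation/

# With the slash, the last segment is treated as a directory.
print(urljoin(with_slash, "../installation/"))
# -> https://docs.crawl4ai.com/core/installation/
```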
