Summary
Fixes #570
Relative URL joining has been fixed. The issue was caused by incorrect use of link protection, which wraps URLs in <> angle brackets.
List of files changed and why
crawl4ai/html2text.py -
While parsing "a" tags in the handle_tag method, the URL in the "href" attribute was protected by wrapping it in <> at the opening tag, and was only joined with base_url at the closing tag, so the join operated on the already-wrapped URL. To fix this, the protection with <> is now applied after the URL has been joined with base_url.
The same change has been made in the o() method.
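The ordering problem can be illustrated with a minimal sketch. This is not the code in html2text.py: `protect`, `join_then_protect`, and `protect_then_join` are hypothetical helpers, and the standard library's `urljoin` stands in for whatever joining the project performs.

```python
from urllib.parse import urljoin


def protect(url: str) -> str:
    """Wrap a URL in angle brackets (hypothetical helper mirroring link protection)."""
    return f"<{url}>"


def protect_then_join(href: str, base_url: str) -> str:
    # Old ordering (assumed): the join sees "<href>" rather than the real href,
    # so the brackets end up embedded inside the resolved URL.
    return urljoin(base_url, protect(href))


def join_then_protect(href: str, base_url: str) -> str:
    # Fixed ordering: resolve the relative href first, then wrap the result.
    return protect(urljoin(base_url, href))


base = "https://docs.crawl4ai.com/core/"
fixed = join_then_protect("crawler-result/", base)
broken = protect_then_join("crawler-result/", base)
```

With the fixed ordering the brackets surround the complete absolute URL, whereas the old ordering leaves them inside the path.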
crawl4ai/markdown_generation_strategy.py -
When converting links to citations for markdown_v2, the <> symbols are now removed from the URL if it was protected, and the URL is then joined with base_url.
The fast_urljoin method was also modified, since it joined URLs incorrectly when the base URL included a path, e.g. https://docs.crawl4ai.com/core/crawler-result/.
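A rough sketch of the intended behaviour, assuming the standard library's `urljoin` as a reference for correct joining. `resolve_citation_url` is a hypothetical name and does not reproduce the actual fast_urljoin implementation.

```python
from urllib.parse import urljoin


def resolve_citation_url(url: str, base_url: str) -> str:
    """Strip protective <> from a link, then resolve it against base_url.

    Hypothetical helper for illustration only.
    """
    # Remove the angle brackets only when the URL is wrapped on both sides.
    if url.startswith("<") and url.endswith(">"):
        url = url[1:-1]
    # urljoin keeps the base path for relative links and replaces it
    # for root-relative links, which is the behaviour the fix targets.
    return urljoin(base_url, url)


base = "https://docs.crawl4ai.com/core/crawler-result/"
relative = resolve_citation_url("<installation/>", base)
parent = resolve_citation_url("<../api/>", base)
root = resolve_citation_url("/blog/", base)
```

A base URL that carries a path is the case to check: a relative link should resolve under that path, `../` should climb one level, and a root-relative link should discard the path entirely.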
How Has This Been Tested?
It was tested by generating markdown and markdown_v2 for a few URLs,
e.g. https://docs.crawl4ai.com/ and https://docs.crawl4ai.com/core/crawler-result/
The links in the regenerated markdown worked as expected.
Checklist: