generalize hyperlink extractor #538
This all makes sense at a high level, I'm mostly just concerned with getting the CI passing.
I'm not 100% sure what still needs to be done for handling the filename stuff, but I'm happy to take another look when ready, or discuss further if you have any questions. I will say, if we need to enable some more complex HTML extraction, I think it could be reasonable to bring in Beautiful Soup. Also, it's not really a big deal if we end up using different filenames from those coming from the upstream sources, so if we can construct our own meaningful filenames without extracting them from the HTML, that's a totally fine option.
This looks good to me. @jdangerx, do you think we should hold off on merging if you're testing Beautiful Soup, or go for it and replace it with BS in the future?
Overview
Closes #XXXX.
What problem does this address?
`<a>` tags: `<a href="...">filename.zip</a>`
What did you change in this PR?
Testing
WIP
Note that the change of return type from set to dict should not be an issue downstream as long as the result of `get_hyperlinks` is only looped over, since a simple for loop over a dict yields the keys, which are the hyperlinks, and ignores the values. This should be checked.
Errors might occur if there is text between the tags but it lacks information that is present in the hyperlink, resulting in hyperlinks not being downloaded. Solved by checking both name and link for the pattern.
To-do list
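To illustrate the set-to-dict note from the Testing section, here is a minimal sketch using only the standard library's `html.parser`. It is not the code from this PR: the `HyperlinkExtractor` class and this `get_hyperlinks` signature are hypothetical stand-ins, and the sample HTML is made up. It shows two things: looping over the returned dict yields only the hyperlinks, and a link is kept when the pattern matches either the href or the text between the `<a>` tags.

```python
import re
from html.parser import HTMLParser
from typing import Dict, Optional


class HyperlinkExtractor(HTMLParser):
    """Collect {href: link text} for every <a> tag (hypothetical stand-in)."""

    def __init__(self):
        super().__init__()
        self.hyperlinks: Dict[str, str] = {}
        self._current_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current_href = href
                self.hyperlinks.setdefault(href, "")

    def handle_data(self, data):
        # Only text that appears inside an open <a> tag is recorded.
        if self._current_href is not None:
            self.hyperlinks[self._current_href] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


def get_hyperlinks(html: str, pattern: Optional[re.Pattern] = None) -> Dict[str, str]:
    """Return {href: name}, keeping links whose href OR name matches pattern."""
    parser = HyperlinkExtractor()
    parser.feed(html)
    parser.close()
    if pattern is None:
        return parser.hyperlinks
    return {
        href: name
        for href, name in parser.hyperlinks.items()
        if pattern.search(href) or pattern.search(name)
    }


# Made-up sample page: the second link only reveals ".zip" in its text.
sample_html = (
    '<a href="data/2020.zip">2020 data</a> '
    '<a href="download?id=7">report.zip</a> '
    '<a href="about.html">About</a>'
)
links = get_hyperlinks(sample_html, re.compile(r"\.zip$"))

# A plain for loop over the dict yields the keys (the hyperlinks) only,
# so callers written against the old set-based return value keep working.
for link in links:
    print(link)
```

Callers that index into the result, or use set operations such as union, would still need to be updated, which is why the downstream check mentioned above is worth doing.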
Tasks