
generalize hyperlink extractor #538

Open · wants to merge 4 commits into main
Conversation

@marianneke marianneke commented Jan 23, 2025

Overview

Closes #XXXX.

What problem does this address?

  • The hyperlink extractor does not return URLs found in tags other than <a> tags.
  • It also does not return the text inside the tags, which may contain the actual filename, e.g. <a href="...">filename.zip</a>.
  • For the new data archiver (Add USGS USWTDB data #539), that name is needed to extract properties such as the dataset upload date.

What did you change in this PR?

  • The hyperlink extractor now returns all hyperlinks, matching against a list of attributes rather than only the href attribute of <a> tags.
  • The return value is a dict instead of a set, mapping {hyperlink: name}, where name is the text between the tags if present and defaults to the hyperlink itself.
  • Pattern matching checks both the name and the hyperlink, instead of just the hyperlink.
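The changes above could be sketched with the standard library's html.parser; this is a hypothetical illustration of the described behavior, not the actual pudl_archiver implementation (the class name and the attribute list are assumptions):

```python
from html.parser import HTMLParser

# Assumed list of attributes to scan; the real PR may use a different set.
LINK_ATTRS = {"href", "src"}


class HyperlinkExtractor(HTMLParser):
    """Collect {hyperlink: name} pairs from any tag carrying a link attribute."""

    def __init__(self):
        super().__init__()
        self.hyperlinks: dict[str, str] = {}
        self._current_link = None

    def handle_starttag(self, tag, attrs):
        for attr, value in attrs:
            if attr in LINK_ATTRS and value:
                # Default the name to the hyperlink itself; any text between
                # the tags overrides it in handle_data().
                self.hyperlinks[value] = value
                self._current_link = value

    def handle_data(self, data):
        if self._current_link and data.strip():
            self.hyperlinks[self._current_link] = data.strip()

    def handle_endtag(self, tag):
        self._current_link = None


parser = HyperlinkExtractor()
parser.feed('<a href="/files/data.zip">filename.zip</a><img src="/logo.png">')
print(parser.hyperlinks)
# {'/files/data.zip': 'filename.zip', '/logo.png': '/logo.png'}
```

Note how the <a> tag's link gets its inner text as the name, while the <img> tag, which has no inner text, falls back to the hyperlink itself.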

Testing

WIP
Note that the change of return type from set to dict should not break anything downstream, as long as the result of get_hyperlinks is only looped over: a plain for loop over a dict yields its keys, which are the hyperlinks, and ignores the values. This should still be verified.
Errors could occur when there is text between the tags that lacks information present in the hyperlink itself, causing those hyperlinks not to be downloaded; this is solved by checking both the name and the link for the pattern.
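The two compatibility claims above can be demonstrated directly. This is a minimal sketch with made-up example links; the pattern check mirrors the described fix of matching against both the name and the hyperlink:

```python
import re

# Old return type (set of hyperlinks) vs. new return type ({hyperlink: name}).
as_set = {"https://example.com/a.zip", "https://example.com/b.zip"}
as_dict = {
    "https://example.com/a.zip": "a.zip",
    "https://example.com/b.zip": "b.zip",
}

# Iterating a dict yields its keys, so a downstream
# `for link in get_hyperlinks(...)` loop sees the same hyperlinks either way.
assert {link for link in as_dict} == as_set

# Checking a pattern against both the name and the hyperlink, so a link is
# kept even when its inner text lacks information the URL carries.
pattern = re.compile(r"\.zip")
matches = {
    link: name
    for link, name in as_dict.items()
    if pattern.search(name) or pattern.search(link)
}
assert matches == as_dict
```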

To-do list


@marianneke marianneke marked this pull request as draft January 23, 2025 15:39
@marianneke marianneke self-assigned this Jan 23, 2025

@zschira zschira left a comment


This all makes sense at a high level; I'm mostly just concerned with getting the CI passing.

I'm not 100% sure what still needs to be done for handling the filename stuff, but I'm happy to take another look when it's ready, or to discuss further if you have any questions. I will say that if we need to enable more complex HTML extraction, it could be reasonable to bring in Beautiful Soup. Also, it's not really a big deal if we end up using different filenames from those coming from the upstream sources, so if we can construct our own meaningful filenames without extracting them from the HTML, that's a totally fine option.

src/pudl_archiver/archivers/classes.py (outdated; resolved)
@e-belfer e-belfer mentioned this pull request Jan 28, 2025
@marianneke marianneke marked this pull request as ready for review January 29, 2025 19:46
@marianneke marianneke force-pushed the marianneke-generalize-hyperlink-extractor branch from 9cab852 to 1ef6e13 Compare January 29, 2025 19:47
@marianneke marianneke requested a review from zschira January 29, 2025 19:47
@marianneke marianneke force-pushed the marianneke-generalize-hyperlink-extractor branch from 1ef6e13 to 58be86e Compare January 30, 2025 16:46
@zschira zschira left a comment


This looks good to me. @jdangerx, do you think we should hold off on merging while you're testing Beautiful Soup, or go for it and replace it with BS in the future?

Labels: none yet
Projects: Status: In progress
3 participants