
generalize hyperlink extractor #538

Open · wants to merge 4 commits into main
Conversation

@marianneke marianneke commented Jan 23, 2025

Overview

Closes #XXXX.

What problem does this address?

  • The hyperlink extractor does not return URLs found in tags other than <a> tags.
  • It also does not return the text inside the tags, which may contain the actual filename, e.g. <a href="...">filename.zip</a>.
  • For the new data archiver (Add USGS USWTDB data #539), that name is needed to extract properties such as the dataset upload date.

What did you change in this PR?

  • The hyperlink extractor now returns all hyperlinks, matching against a list of attributes rather than only the href attribute of <a> tags.
  • The return value is a dict instead of a set, mapping {hyperlink: name}, where name is the text between the tags if present and defaults to the hyperlink itself.
  • Pattern matching checks both the name and the hyperlink, instead of just the hyperlink.
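The changes above could be sketched with the standard library's html.parser; this is a hypothetical illustration of the described behavior, not the actual pudl_archiver implementation (the class name and the attribute list are assumptions):

```python
from html.parser import HTMLParser

# Assumed list of attributes to scan; the real PR may use a different set.
LINK_ATTRS = {"href", "src"}


class HyperlinkExtractor(HTMLParser):
    """Collect {hyperlink: name} pairs from any tag carrying a link attribute."""

    def __init__(self):
        super().__init__()
        self.hyperlinks: dict[str, str] = {}
        self._current_link = None

    def handle_starttag(self, tag, attrs):
        for attr, value in attrs:
            if attr in LINK_ATTRS and value:
                # Default the name to the hyperlink itself; any text between
                # the tags overrides it in handle_data().
                self.hyperlinks[value] = value
                self._current_link = value

    def handle_data(self, data):
        if self._current_link and data.strip():
            self.hyperlinks[self._current_link] = data.strip()

    def handle_endtag(self, tag):
        self._current_link = None


parser = HyperlinkExtractor()
parser.feed('<a href="/files/data.zip">filename.zip</a><img src="/logo.png">')
print(parser.hyperlinks)
# {'/files/data.zip': 'filename.zip', '/logo.png': '/logo.png'}
```

Note how the <a> tag's link gets its inner text as the name, while the <img> tag, which has no inner text, falls back to the hyperlink itself.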

Testing

WIP
Note that the change of return type from set to dict should not break anything downstream, as long as the result of get_hyperlinks is only looped over: a plain for loop over a dict yields its keys, which are the hyperlinks, and ignores the values. This should still be verified.
Errors could occur when there is text between the tags that lacks information present in the hyperlink itself, causing those hyperlinks not to be downloaded; this is solved by checking both the name and the link for the pattern.
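The two compatibility claims above can be demonstrated directly. This is a minimal sketch with made-up example links; the pattern check mirrors the described fix of matching against both the name and the hyperlink:

```python
import re

# Old return type (set of hyperlinks) vs. new return type ({hyperlink: name}).
as_set = {"https://example.com/a.zip", "https://example.com/b.zip"}
as_dict = {
    "https://example.com/a.zip": "a.zip",
    "https://example.com/b.zip": "b.zip",
}

# Iterating a dict yields its keys, so a downstream
# `for link in get_hyperlinks(...)` loop sees the same hyperlinks either way.
assert {link for link in as_dict} == as_set

# Checking a pattern against both the name and the hyperlink, so a link is
# kept even when its inner text lacks information the URL carries.
pattern = re.compile(r"\.zip")
matches = {
    link: name
    for link, name in as_dict.items()
    if pattern.search(name) or pattern.search(link)
}
assert matches == as_dict
```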

To-do list


@marianneke marianneke marked this pull request as draft January 23, 2025 15:39
@marianneke marianneke self-assigned this Jan 23, 2025

@zschira zschira left a comment


This all makes sense at a high level; I'm mostly just concerned with getting the CI passing.

I'm not 100% sure what still needs to be done for handling the filename stuff, but I'm happy to take another look when it's ready, or to discuss further if you have any questions. I will say that if we need to enable more complex HTML extraction, it could be reasonable to bring in Beautiful Soup. Also, it's not really a big deal if we end up using different filenames from those coming from the upstream sources, so if we can construct our own meaningful filenames without extracting them from the HTML, that's a totally fine option.

src/pudl_archiver/archivers/classes.py (outdated; resolved)
@e-belfer e-belfer mentioned this pull request Jan 28, 2025
@marianneke marianneke marked this pull request as ready for review January 29, 2025 19:46
@marianneke marianneke force-pushed the marianneke-generalize-hyperlink-extractor branch from 9cab852 to 1ef6e13 Compare January 29, 2025 19:47
@marianneke marianneke requested a review from zschira January 29, 2025 19:47
@marianneke marianneke force-pushed the marianneke-generalize-hyperlink-extractor branch from 1ef6e13 to 58be86e Compare January 30, 2025 16:46
@zschira zschira left a comment


This looks good to me. @jdangerx, do you think we should hold off on merging while you're testing Beautiful Soup, or go for it and replace it with BS in the future?

Labels: none yet
Projects: Status: In progress
3 participants