Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tco48 journal data loader #61

Merged
merged 2 commits into from
Jul 19, 2024
Merged

Tco48 journal data loader #61

merged 2 commits into from
Jul 19, 2024

Conversation

JPrevost
Copy link
Member

Why are these changes being introduced:

  • Loading journals from an external source is how we'll be internally
    detecting journal name matches in this experimental detector.

Relevant ticket(s):

How does this address that need:

  • Adds data harvester from OpenAlex Sources API endpoint
  • Adds data loader from the result of the OpenAlex harvest which can
    load from local or remote file

Document any side effects to this change:

  • Technically there are more than Journals being loaded in the current
    query configuration from OpenAlex. We are pulling both Journals and
    Book Series. It felt like Journals was still a fine internal name for
    this but we could also consider Serials to be more accurate.

  • Remote file loading only handles URI hosted files. If we move the
    OpenAlex Harvester functionality into a data pipeline that exports into
    S3 directly in the future we may want to add S3 support to the loader
    rather than exposing the S3 files via https (although the data is open
    so there is no harm in exposing direclty via https so it we could choose
    either)

  • Removes filtering of SSN as it also catches ISSN (see separate commit message)

We output ISSN data regularly with this application and this filter has been annoying.

As we never intend to use SSN data in this application it feels safe enough to remove this
rather than renaming ISSN to something less meaningful to avoid the filtering.
Rails.application.config.filter_parameters += [
:passw, :secret, :token, :_key, :crypt, :salt, :certificate, :otp, :ssn
Rails.application.config.filter_parameters += %i[
passw secret token _key crypt salt certificate otp
]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the change was auto-formatted when I resaved to match our current best practices. The only practical change was removal of :ssn from the list of filters.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-61 July 19, 2024 13:48 Inactive
@jazairi jazairi self-assigned this Jul 19, 2024
Copy link
Contributor

@jazairi jazairi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed that the rake tasks work as expected. Struggling with a headache this afternoon, so this hasn't been the most thorough review. (Mostly, I was hoping to better understand what cursors do before approving, but it works and I trust your judgment.)

Both comments are non-blocking. The Detector::Journal.delete_all call gives me some pause, but I couldn't think of a reason why we couldn't clear that table each time.

lib/tasks/journals.rake Outdated Show resolved Hide resolved
end

# Delete all journals. We do this to simplify the loader process to avoid consideration of updates/deletes.
Detector::Journal.delete_all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an edge case in which we wouldn't want to delete everything first? I'm guessing not, but this does worry me a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet. If we end up loading from multiple sources for some reason then we'd likely only want to delete from this source... but we're not there yet so this should be fine for now.

@JPrevost
Copy link
Member Author

@jazairi cursors are cool. TIMDEX definitely needs them and OpenSearch already supports them so we can add them easily(ish). Basically it's just more efficient pagination and addresses the problem of how to page past 10,000ish records in a search without needing to calculate each preceding page before showing the requested page. They are temporary in nature and allow for getting "all records that matched your query" in a neat way.

Why are these changes being introduced:

* Loading journals from an external source is how we'll be internally
  detecting journal name matches in this experimental detector.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-41
* https://mitlibraries.atlassian.net/browse/TCO-48

How does this address that need:

* Adds data harvester from OpenAlex Sources API endpoint
* Adds data loader from the result of the OpenAlex harvest which can
  load from local or remote file

Document any side effects to this change:

* Technically there are more than Journals being loaded in the current
query configuration from OpenAlex. We are pulling both Journals and
Book Series. It felt like Journals was still a fine internal name for
this but we could also consider Serials to be more accurate.

* Remote file loading only handles URI hosted files. If we move the
OpenAlex Harvester functionality into a data pipeline that exports into
S3 directly in the future we may want to add S3 support to the loader
rather than exposing the S3 files via https (although the data is open
so there is no harm in exposing direclty via https so it we could choose
either)
@JPrevost JPrevost force-pushed the tco48-journal-data-loader branch from a1f77da to 9838367 Compare July 19, 2024 19:54
@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-61 July 19, 2024 19:54 Inactive
@JPrevost JPrevost merged commit cc5c517 into main Jul 19, 2024
2 checks passed
@JPrevost JPrevost deleted the tco48-journal-data-loader branch July 19, 2024 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants