Tco48 journal data loader #61

JPrevost · 2024-07-19T13:46:31Z

Why are these changes being introduced:

Loading journals from an external source is how we'll be internally
detecting journal name matches in this experimental detector.

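Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-41
* https://mitlibraries.atlassian.net/browse/TCO-48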

How does this address that need:

* Adds data harvester from OpenAlex Sources API endpoint
* Adds data loader from the result of the OpenAlex harvest which can
  load from local or remote file
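
As a rough illustration of the "local or remote file" behavior described above, a loader could branch on whether the location looks like an http(s) URI. This is only a sketch and not the code in this PR: the helper name `read_journal_file` and the assumption that the harvest output is a JSON file are both hypothetical.

```ruby
require 'json'
require 'open-uri'

# Hypothetical helper (not from this PR): read harvested data from either a
# local path or an http(s) URI and return the parsed JSON.
# Assumes the harvest output is a single JSON document.
def read_journal_file(location)
  raw = if location.match?(%r{\Ahttps?://}i)
          # Remote file: open-uri adds #open to http(s) URI objects.
          URI.parse(location).open(&:read)
        else
          # Local file: read straight from disk.
          File.read(location)
        end
  JSON.parse(raw)
end
```

Supporting S3 later would mean adding a third branch here (for example on an `s3://` prefix) rather than changing the callers.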

Document any side effects to this change:

* Technically there are more than Journals being loaded in the current
  query configuration from OpenAlex. We are pulling both Journals and
  Book Series. It felt like Journals was still a fine internal name for
  this but we could also consider Serials to be more accurate.
* Remote file loading only handles URI hosted files. If we move the
  OpenAlex Harvester functionality into a data pipeline that exports into
  S3 directly in the future we may want to add S3 support to the loader
  rather than exposing the S3 files via https (although the data is open,
  so there is no harm in exposing it directly via https; we could choose
  either).
* Removes filtering of SSN as it also catches ISSN (see separate commit message)

We output ISSN data regularly with this application and this filter has been annoying. As we never intend to use SSN data in this application it feels safe enough to remove this rather than renaming ISSN to something less meaningful to avoid the filtering.

JPrevost · 2024-07-19T13:47:32Z

config/initializers/filter_parameter_logging.rb

-Rails.application.config.filter_parameters += [
-  :passw, :secret, :token, :_key, :crypt, :salt, :certificate, :otp, :ssn
+Rails.application.config.filter_parameters += %i[
+  passw secret token _key crypt salt certificate otp
 ]


Most of the change was auto-formatted when I resaved to match our current best practices. The only practical change was removal of :ssn from the list of filters.
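
For anyone wondering why `:ssn` also caught ISSN: Rails treats symbol and string entries in `filter_parameters` as partial, case-insensitive matches against the parameter name. The snippet below is a simplified plain-Ruby model of that matching, not the Rails implementation itself; `filtered?` and the two constants are hypothetical names used only for illustration.

```ruby
# Simplified model of partial filter matching (an assumption-level sketch,
# not ActiveSupport's code): each filter entry is escaped, joined into one
# case-insensitive pattern, and tested against the parameter name.
def filtered?(param_name, filters)
  pattern = Regexp.new(
    filters.map { |f| Regexp.escape(f.to_s) }.join('|'),
    Regexp::IGNORECASE
  )
  pattern.match?(param_name.to_s)
end

OLD_FILTERS = %i[passw secret token _key crypt salt certificate otp ssn].freeze
NEW_FILTERS = %i[passw secret token _key crypt salt certificate otp].freeze

puts filtered?(:issn, OLD_FILTERS)     # true: "issn" contains "ssn"
puts filtered?(:issn, NEW_FILTERS)     # false: nothing in the list matches
puts filtered?(:password, NEW_FILTERS) # true: "passw" still matches
```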

jazairi

I confirmed that the rake tasks work as expected. Struggling with a headache this afternoon, so this hasn't been the most thorough review. (Mostly, I was hoping to better understand what cursors do before approving, but it works and I trust your judgment.)

Both comments are non-blocking. The Detector::Journal.delete_all call gives me some pause, but I couldn't think of a reason why we couldn't clear that table each time.

jazairi · 2024-07-19T17:58:06Z

lib/tasks/journals.rake

+    end
+
+    # Delete all journals. We do this to simplify the loader process to avoid consideration of updates/deletes.
+    Detector::Journal.delete_all


Is there an edge case in which we wouldn't want to delete everything first? I'm guessing not, but this does worry me a bit.

Not yet. If we end up loading from multiple sources for some reason then we'd likely only want to delete from this source... but we're not there yet so this should be fine for now.

JPrevost · 2024-07-19T19:50:07Z

@jazairi cursors are cool. TIMDEX definitely needs them and OpenSearch already supports them so we can add them easily(ish). Basically it's just more efficient pagination and addresses the problem of how to page past 10,000ish records in a search without needing to calculate each preceding page before showing the requested page. They are temporary in nature and allow for getting "all records that matched your query" in a neat way.
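
To make the cursor idea concrete, here is a minimal sketch of a cursor loop in the style OpenAlex documents: start with `cursor=*`, then keep passing back `meta.next_cursor` until it comes back empty. It is not the harvester from this PR; `harvest_all` and `OPENALEX_FETCH` are hypothetical names, and the request deliberately omits whatever filter the real harvester uses to select Journals and Book Series.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical sketch: collect every record by following cursors.
# fetch_page is any callable that takes a cursor string and returns a Hash
# shaped like { 'results' => [...], 'meta' => { 'next_cursor' => '...' } }.
def harvest_all(fetch_page)
  records = []
  cursor = '*' # '*' asks for the first page
  while cursor
    page = fetch_page.call(cursor)
    results = page.fetch('results', [])
    break if results.empty?

    records.concat(results)
    # A nil next_cursor means there are no more pages.
    cursor = page.dig('meta', 'next_cursor')
  end
  records
end

# Network-backed fetcher for the OpenAlex Sources endpoint. Defined but not
# called here; the real harvester's filter parameters are not reproduced.
OPENALEX_FETCH = lambda do |cursor|
  uri = URI('https://api.openalex.org/sources')
  uri.query = URI.encode_www_form('per-page' => 200, 'cursor' => cursor)
  JSON.parse(Net::HTTP.get(uri))
end
```

Passing the fetcher in as a callable keeps the loop testable without network access; `harvest_all(OPENALEX_FETCH)` would run it against the live API.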

Why are these changes being introduced:

* Loading journals from an external source is how we'll be internally
  detecting journal name matches in this experimental detector.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-41
* https://mitlibraries.atlassian.net/browse/TCO-48

How does this address that need:

* Adds data harvester from OpenAlex Sources API endpoint
* Adds data loader from the result of the OpenAlex harvest which can
  load from local or remote file

Document any side effects to this change:

* Technically there are more than Journals being loaded in the current
  query configuration from OpenAlex. We are pulling both Journals and
  Book Series. It felt like Journals was still a fine internal name for
  this but we could also consider Serials to be more accurate.
* Remote file loading only handles URI hosted files. If we move the
  OpenAlex Harvester functionality into a data pipeline that exports into
  S3 directly in the future we may want to add S3 support to the loader
  rather than exposing the S3 files via https (although the data is open,
  so there is no harm in exposing it directly via https; we could choose
  either).

JPrevost requested review from matt-bernhardt and jazairi July 19, 2024 13:47

mitlib temporarily deployed to tacos-api-pipeline-pr-61 July 19, 2024 13:48 Inactive

jazairi self-assigned this Jul 19, 2024

jazairi approved these changes Jul 19, 2024

JPrevost force-pushed the tco48-journal-data-loader branch from a1f77da to 9838367 Compare July 19, 2024 19:54

JPrevost temporarily deployed to tacos-api-pipeline-pr-61 July 19, 2024 19:54 Inactive

JPrevost merged commit cc5c517 into main Jul 19, 2024
2 checks passed

JPrevost deleted the tco48-journal-data-loader branch July 19, 2024 20:14
