Tco48 journal data loader #61
Conversation
We output ISSN data regularly with this application, and this filter has been annoying: the :ssn filter also matches parameters containing "issn". As we never intend to handle SSN data in this application, it feels safe enough to remove the filter rather than rename ISSN to something less meaningful just to avoid the filtering.
- Rails.application.config.filter_parameters += [
-   :passw, :secret, :token, :_key, :crypt, :salt, :certificate, :otp, :ssn
- ]
+ Rails.application.config.filter_parameters += %i[
+   passw secret token _key crypt salt certificate otp
+ ]
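For context on why :ssn caught ISSN values: Rails compiles string and symbol filter_parameters entries into one case-insensitive regexp and matches it against each parameter key as a substring. A standalone sketch of that behavior (plain Ruby that mimics ActiveSupport::ParameterFilter rather than calling it):

```ruby
# Mimics Rails parameter filtering: filter names match as substrings of
# the key, so :ssn masks both "ssn" and "issn".
def filter_params(params, filters)
  pattern = Regexp.new(filters.map(&:to_s).join("|"), Regexp::IGNORECASE)
  params.map { |key, value| [key, key.match?(pattern) ? "[FILTERED]" : value] }.to_h
end

puts filter_params({ "issn" => "0028-0836", "ssn" => "123-45-6789" }, [:ssn]).inspect
# both values print as "[FILTERED]"
```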
Most of the change was auto-formatted when I resaved to match our current best practices. The only practical change was removal of :ssn
from the list of filters.
I confirmed that the rake tasks work as expected. Struggling with a headache this afternoon, so this hasn't been the most thorough review. (Mostly, I was hoping to better understand what cursors do before approving, but it works and I trust your judgment.)
Both comments are non-blocking. The Detector::Journal.delete_all
call gives me some pause, but I couldn't think of a reason why we couldn't clear that table each time.
  end

  # Delete all journals. We do this to simplify the loader process to
  # avoid consideration of updates/deletes.
  Detector::Journal.delete_all
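The delete-then-load pattern under discussion can be sketched in plain Ruby (an array stands in for the journals table; Detector::Journal is the real model, everything else here is illustrative):

```ruby
# Truncate-and-reload: every run replaces the table wholesale, so the
# loader never reconciles updates or deletes, and repeated runs are
# idempotent.
class FakeJournalTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  def delete_all
    @rows.clear
  end

  def insert_all(records)
    @rows.concat(records)
  end
end

def reload!(table, records)
  table.delete_all # mirrors the Detector::Journal.delete_all above
  table.insert_all(records)
end

table = FakeJournalTable.new
reload!(table, [{ name: "Nature" }, { name: "Science" }])
reload!(table, [{ name: "Nature" }, { name: "Science" }]) # second run: same end state
puts table.rows.length # => 2
```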
Is there an edge case in which we wouldn't want to delete everything first? I'm guessing not, but this does worry me a bit.
Not yet. If we end up loading from multiple sources for some reason then we'd likely only want to delete from this source... but we're not there yet so this should be fine for now.
@jazairi cursors are cool. TIMDEX definitely needs them, and OpenSearch already supports them, so we can add them easily(ish). Basically it's just more efficient pagination: it addresses the problem of paging past 10,000-ish records in a search without needing to calculate each preceding page before showing the requested page. Cursors are temporary in nature and allow for getting "all records that matched your query" in a neat way.
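As a sketch of the cursor idea described above (plain Ruby; the page shape below mimics an OpenAlex-style meta.next_cursor convention and is illustrative, not TIMDEX's actual API):

```ruby
# Cursor pagination: the client sends an opaque cursor ("*" to start),
# and each response carries the cursor for the next page, so the server
# never has to count past all earlier pages to locate the requested one.
def each_page(fetch)
  cursor = "*"
  results = []
  while cursor
    page = fetch.call(cursor)
    results.concat(page["results"])
    cursor = page.dig("meta", "next_cursor") # nil when exhausted
  end
  results
end

# A fake fetcher standing in for an HTTP call, to show the control flow:
pages = {
  "*"  => { "results" => %w[a b], "meta" => { "next_cursor" => "c2" } },
  "c2" => { "results" => %w[c],   "meta" => { "next_cursor" => nil } }
}
puts each_page(->(cursor) { pages[cursor] }).inspect # => ["a", "b", "c"]
```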
Why are these changes being introduced:

* Loading journals from an external source is how we'll be internally detecting journal name matches in this experimental detector.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-41
* https://mitlibraries.atlassian.net/browse/TCO-48

How does this address that need:

* Adds a data harvester for the OpenAlex Sources API endpoint
* Adds a data loader for the result of the OpenAlex harvest, which can load from a local or remote file

Document any side effects to this change:

* Technically there are more than Journals being loaded in the current query configuration from OpenAlex: we are pulling both Journals and Book Series. Journals still feels like a fine internal name for this, but we could also consider Serials to be more accurate.
* Remote file loading only handles URI-hosted files. If we move the OpenAlex Harvester functionality into a data pipeline that exports into S3 directly in the future, we may want to add S3 support to the loader rather than exposing the S3 files via https (although the data is open, so there is no harm in exposing it directly via https and we could choose either).
a1f77da to 9838367
Removes filtering of SSN as it also catches ISSN (see separate commit message)
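The local-or-remote file loading described in the commit message might look something like this (a hypothetical read_source helper for illustration, not the PR's actual loader code):

```ruby
require "open-uri"

# Hypothetical helper: read loader input from either a local path or an
# http(s) URI; open-uri handles the remote case.
def read_source(location)
  if location.start_with?("http://", "https://")
    URI.parse(location).open(&:read)
  else
    File.read(location)
  end
end
```

An S3-backed source would need its own branch here (e.g. via the aws-sdk-s3 gem), which is the extension the commit message anticipates.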