
Tco48 journal data loader #61

Merged
merged 2 commits on Jul 19, 2024
4 changes: 2 additions & 2 deletions config/initializers/filter_parameter_logging.rb
@@ -3,6 +3,6 @@
# Configure parameters to be partially matched (e.g. passw matches password) and filtered from the log file.
# Use this to limit dissemination of sensitive information.
# See the ActiveSupport::ParameterFilter documentation for supported notations and behaviors.
-Rails.application.config.filter_parameters += [
-  :passw, :secret, :token, :_key, :crypt, :salt, :certificate, :otp, :ssn
+Rails.application.config.filter_parameters += %i[
+  passw secret token _key crypt salt certificate otp
]
Member Author
Most of the change was auto-formatted when I resaved to match our current best practices. The only practical change was removal of :ssn from the list of filters.
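
For context, the tokens are matched partially (:passw still covers password), so a minimal sketch of the before/after behavior, e.g. in a Rails console, would be:

  filter = ActiveSupport::ParameterFilter.new(%i[passw secret token _key crypt salt certificate otp])
  filter.filter(password: 'hunter2', ssn: '123-45-6789')
  # => {:password=>"[FILTERED]", :ssn=>"123-45-6789"}  # :ssn is no longer masked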

110 changes: 110 additions & 0 deletions lib/tasks/journals.rake
@@ -0,0 +1,110 @@
# frozen_string_literal: true

require 'open-uri' # URI.open in the loader task below needs this required explicitly

# Loaders can bulk load data
namespace :journals do
# openalex_harvester uses the OpenAlex Sources API endpoint to harvest records to a local JSON file
#
# @note Currently we limit the harvest to core Sources without restricting to journals. At some point we may want to
# compare filtering to journals rather than core to see which works better; that would be done by changing the
# `base_url`. As of July 2024, core contains 27722 `journal` and 2084 `book series` records, which felt like a
# good size for initial experimentation.
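#
# @note a journal-only harvest (hypothetical; not what we do here) would swap the filter in `base_url`, e.g.:
#   base_url = 'https://api.openalex.org/sources?filter=type:journal'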
#
# @note see https://docs.openalex.org/api-entities/sources for full API documentation
#
# @example
#   bin/rails journals:openalex_harvester['your_email@example.com']
#
# @param email [String] email address OpenAlex can contact with any issues or concerns about our harvester.
#   In development, use your own. If used in production, use a team Moira list.
desc 'Harvest from Open Alex'
task :openalex_harvester, %i[email] => :environment do |_task, args|
raise ArgumentError, 'Email is required' unless args.email.present?

base_url = 'https://api.openalex.org/sources?filter=is_core:true'
next_cursor = '*'
email = args.email
sleep_timer = 1 # value in seconds
per_page = 200 # max 200 per openalex api docs
filename = "tmp/openalex_core_#{DateTime.now.strftime('%Y_%m_%d')}.json"

f = File.open(filename, 'w')

# set up the initial JSON structure in the file. This feels a bit clunky but works.
f.write('{')
f.write('"core":[')
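# when harvesting finishes, the file will have the shape:
#   { "core": [ { ...record... }, { ...record... } ] }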

records_processed = 0 # purely for informational output

until next_cursor.nil?
# note: OpenAlex expects the hyphenated `per-page` query parameter
next_url = "#{base_url}&cursor=#{next_cursor}&per-page=#{per_page}&mailto=#{email}"

Rails.logger.info("Next url request #{next_url}")
resp = HTTP.headers(accept: 'application/json').get(next_url)

json = resp.parse

next_cursor = json['meta']['next_cursor']
total_records = json['meta']['count']

json['results'].each do |item|
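# each item is an OpenAlex Source record; abridged, the fields we keep look like:
#   { "display_name" => "...", "issn" => ["...."], "host_organization_name" => "...",
#     "alternate_titles" => [...], "abbreviated_title" => "...", "type" => "journal" }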
records_processed += 1
record = {
title: item['display_name'],
issns: item['issn'],
publisher: item['host_organization_name'],
alternate_titles: item['alternate_titles'],
abbreviated_title: item['abbreviated_title'],
type: item['type']
}
f.write(JSON.pretty_generate(record))
f.write(',') unless records_processed == total_records # skips final comma
end

pp "Processed #{records_processed} of #{total_records}"

sleep(sleep_timer)
end

# close the json structure in the file, then flush and close it
f.write(']')
f.write('}')
f.close
end

# openalex_loader can bulk load Journal information
#
# A file to load can be generated by running the `openalex_harvester` task
#
# @example
# bin/rails journals:openalex_loader['local_path_to_file.json']
#
# @example
# bin/rails journals:openalex_loader['https://SERVER/remote_path_to_file.json']
#
# @param path [String] local file path or URI to a JSON file to load
desc 'Load from OpenAlex harvest'
task :openalex_loader, %i[file] => :environment do |_task, args|
raise ArgumentError, 'File is required' unless args.file.present?

# does the file look like a local path or a URI?
if URI(args.file).scheme
Rails.logger.info("Loading data from remote file #{args.file}")
data = URI.open(args.file, 'rb', &:read)
else
Rails.logger.info("Loading data from local file #{args.file}")
data = File.read(args.file)
end

# Delete all journals. We do this to simplify the loader process to avoid consideration of updates/deletes.
Detector::Journal.delete_all
Contributor

Is there an edge case in which we wouldn't want to delete everything first? I'm guessing not, but this does worry me a bit.

Member Author

Not yet. If we end up loading from multiple sources for some reason then we'd likely only want to delete from this source... but we're not there yet so this should be fine for now.
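
If we do get there, the cleanup could become source-scoped along the lines of this sketch (hypothetical: it assumes a `source` column that Detector::Journal does not have today):

  Detector::Journal.where(source: 'openalex').delete_all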


# not ideal; we should consider streaming the file rather than loading it fully into memory
json = JSON.parse(data)

json['core'].each do |journal|
Detector::Journal.create(name: journal['title'],
additional_info: { issns: journal['issns'],
publisher: journal['publisher'],
alternate_titles: journal['alternate_titles'],
type: journal['type'],
abbreviated_title: journal['abbreviated_title'] })
end
end
end
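
For reference, a typical end-to-end run in development looks like this (the date portion of the filename will match the day the harvester ran):

  bin/rails journals:openalex_harvester['your_email@example.com']
  bin/rails journals:openalex_loader['tmp/openalex_core_2024_07_19.json']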