
load existing checksum data from AIP manifests or a CSV #2

Merged
3 commits merged from checksum-loaders into main on Mar 4, 2024

Conversation

barmintor (Member):

No description provided.

@barmintor requested a review from elohanlon on March 1, 2024 at 23:37.
log_io.print("xsrc,#{checksum_algorithm.name},#{checksum_value},#{transfer_source_path}\n")
return
end
unless checksum = Checksum.find_or_create_by(value: checksum_value, transfer_source: transfer_source, checksum_algorithm: checksum_algorithm)

Member:
Do you think it's worth adding a multi-column index on [:value, :transfer_source, :checksum_algorithm] to support this query?

Member:
Or at the very least, probably worth adding an index on the :value column.
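
For illustration, the suggested indexes might look roughly like the following Rails migration. This is a sketch only; the checksums table name, the checksum_algorithm_id column, and the migration version are assumptions based on the query above, not confirmed by this PR.

# Sketch of the indexes suggested above; table/column names and the
# migration version are assumptions based on the surrounding discussion.
class AddIndexesToChecksums < ActiveRecord::Migration[7.0]
  def change
    # Single-column index to support lookups by checksum value alone
    add_index :checksums, :value

    # Composite index covering the full find_or_create_by lookup
    add_index :checksums, [:value, :transfer_source_id, :checksum_algorithm_id],
              name: 'index_checksums_on_value_source_and_algorithm'
  end
end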

Member Author:
That would be a big index. Two of the fields are already indexed, so I'm hoping that (1) MySQL is sorting that out and not doing a table scan, and (2) we can let this slide for batch checksum creation? FWIW, I was able to create 270k pretty fast yesterday.
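
One way to check that assumption is to look at the query plan. A minimal sketch, assuming the Checksum model and the columns from the code above, with placeholder values:

# Sketch: print MySQL's plan for the lookup to confirm it uses an existing
# index rather than a full table scan. The values below are placeholders.
puts Checksum.where(
  value: 'abc123',
  transfer_source_id: 1,
  checksum_algorithm_id: 1
).explain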

Member Author:
Find by value would be intense. Though if that's a query we want to support...

Member:
Ah! I wasn't thinking about the index on transfer_source_id! You're right, MySQL is probably using that one automatically, so it makes sense that the 270k operation was quick!

Find by value is probably only necessary if we also want to use this table to identify duplicates, but that can always come later if needed.

Consider my suggestions here withdrawn!
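
If duplicate identification is picked up later, the value-based lookup mentioned above could be sketched like this, assuming the Checksum model from the code under review:

# Sketch: find checksum values stored more than once; returns a hash of
# value => occurrence count for potential duplicates.
duplicate_values = Checksum.group(:value).having('COUNT(*) > 1').count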

Member Author:
There's something weird about this table that might indicate an error in here.

Member Author:
NVM, stale metadata; the table just needed an ANALYZE.
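
For reference, refreshing MySQL's table statistics from a Rails console is a one-liner. This is a sketch that assumes the table is named checksums:

# Sketch: update MySQL's optimizer statistics for the (assumed) checksums
# table so stale row-count metadata doesn't mislead query planning.
ActiveRecord::Base.connection.execute('ANALYZE TABLE checksums')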

open("log/checksum-csv-#{start.to_i}#{'-dry_run' if dry_run}.log", "w") do |log|
CSV.foreach(csv_path, headers: true).each do |row|
# checksum_algorithm_name,checksum_value,transfer_source_path
checksum_algorithm_name = row['checksum_algorithm_name']

Member:
Even though our database is using case-insensitive collation for comparison, we might want to downcase any sha256 checksum hex values here for stored value consistency. (But no modifications to CRC32C base64 values of course, since those are case-sensitive.)

Although we could just make sure that our input spreadsheets always contain downcased sha256 values, it might be a little safer to have that quick conversion here, especially if we'll be gathering already-calculated sha256 values from a number of sources that may or may not have downcased them.

What do you think?
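
The conversion could be as small as the following sketch; the 'sha256' algorithm-name string is an assumption about how the loader distinguishes hex digests from base64 CRC32C values:

# Sketch: normalize hex digests (sha256) to lowercase for consistent storage,
# but leave base64-encoded CRC32C values untouched since base64 is case-sensitive.
checksum_value = checksum_value.downcase if checksum_algorithm_name == 'sha256'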

Member Author:
That doesn't seem unreasonable at all, especially if we're gathering precalculated checksums from other sources as inputs.

Member:
Great! Just pushed a commit that does that.

Member Author:
Make sure there are no loads running before you pull, please!

@elohanlon (Member):
@barmintor Heads up that I just rebased this branch on main to pick up some CI changes.

Commit: "…ays downcase sha256 checksum values before storing them"
@elohanlon merged commit 778fa01 into main on Mar 4, 2024.
1 check passed
@elohanlon deleted the checksum-loaders branch on March 4, 2024 at 17:58.