load existing checksum data from AIP manifests or a CSV #2
Conversation
Force-pushed from 03ba6eb to dbd8118
log_io.print("xsrc,#{checksum_algorithm.name},#{checksum_value},#{transfer_source_path}\n")
  return
end
unless checksum = Checksum.find_or_create_by(value: checksum_value, transfer_source: transfer_source, checksum_algorithm: checksum_algorithm)
Do you think it's worth adding a multi-column index on [:value, :transfer_source, :checksum_algorithm] to support this query?
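The suggested multi-column index could look something like this hypothetical Rails migration. The table and column names (checksums, value, transfer_source_id, checksum_algorithm_id) and the Rails version are assumptions, not taken from this PR:

```ruby
# Hypothetical sketch; table/column names and migration version are assumed.
class AddValueIndexToChecksums < ActiveRecord::Migration[6.1]
  def change
    # Composite index matching the find_or_create_by lookup columns.
    add_index :checksums,
              [:value, :transfer_source_id, :checksum_algorithm_id],
              name: 'index_checksums_on_value_and_source_and_algorithm'
  end
end
```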
Or at the very least, it's probably worth adding an index on the :value column.
That's a big index. Two of the fields are already indexed individually, so I'm hoping that (1) MySQL is sorting that out rather than doing a table scan, and (2) we can let this slide for batch checksum creation? FWIW I was able to create 270k rows pretty fast yesterday.
Find by value would be intense. Though if that's a query we want to support...
Ah! I wasn't thinking about the index on transfer_source_id! You're right, MySQL is probably using that one automatically, so it makes sense that the 270k operation was quick!
Find by value is probably only necessary if we also want to use this table to identify duplicates, but that can always come later if needed.
Consider my suggestions here withdrawn!
There's something weird about this table that might indicate an error in here.
NVM, stale metadata; the table just needed to be analyzed.
open("log/checksum-csv-#{start.to_i}#{'-dry_run' if dry_run}.log", "w") do |log|
  CSV.foreach(csv_path, headers: true).each do |row|
    # checksum_algorithm_name,checksum_value,transfer_source_path
    checksum_algorithm_name = row['checksum_algorithm_name']
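The iteration above can be sketched standalone with Ruby's stdlib CSV library. The data and file path here are hypothetical, just to show the headers: true row access pattern used in the loop:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical input matching the expected headers noted in the comment above.
csv_data = <<~CSV
  checksum_algorithm_name,checksum_value,transfer_source_path
  sha256,ABCDEF0123,path/to/file1
  crc32c,Qm9vIQ==,path/to/file2
CSV

rows = []
Tempfile.create(['checksums', '.csv']) do |f|
  f.write(csv_data)
  f.flush
  # headers: true makes each row addressable by column name.
  CSV.foreach(f.path, headers: true) do |row|
    rows << row['checksum_algorithm_name']
  end
end
rows # => ["sha256", "crc32c"]
```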
Even though our database is using case-insensitive collation for comparison, we might want to downcase any sha256 checksum hex values here for stored value consistency. (But no modifications to CRC32C base64 values of course, since those are case-sensitive.)
Although we could just make sure that our input spreadsheets always contain downcase sha256 values, it might be a little safer to have that quick conversion here, especially if we'll be gathering already-calculated sha256 values from a number of sources that may or may not have been downcased.
What do you think?
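The normalization being suggested could be a small helper along these lines. This is a sketch, not code from the PR; the method name and the list of hex-based algorithms are assumptions:

```ruby
# Hex digests (e.g. sha256) are case-insensitive, so downcase them for
# consistent storage; base64 values (e.g. crc32c) are case-sensitive and
# must be left untouched.
HEX_CHECKSUM_ALGORITHMS = %w[md5 sha256 sha512].freeze # assumed names

def normalize_checksum_value(algorithm_name, value)
  HEX_CHECKSUM_ALGORITHMS.include?(algorithm_name.downcase) ? value.downcase : value
end

normalize_checksum_value('sha256', 'ABCDEF012345') # => "abcdef012345"
normalize_checksum_value('crc32c', 'Qm9vIQ==')     # => "Qm9vIQ==" (unchanged)
```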
That doesn't seem unreasonable at all, especially if we're gathering precalculated checksums from other sources as inputs.
Great! Just pushed a commit that does that.
Make sure there are no loads running before you pull, please!
Force-pushed from a0eae8a to 9418f67
@barmintor Heads up that I just rebased this branch on main to pick up some CI changes.
…ays downcase sha256 checksum values before storing them
e99580e
to
b0ccce3
Compare