
load existing checksum data from AIP manifests or a CSV #2

Merged
3 commits merged from checksum-loaders into main on Mar 4, 2024

Conversation

barmintor (Member):

No description provided.

@barmintor requested a review from elohanlon on March 1, 2024 at 23:37.
log_io.print("xsrc,#{checksum_algorithm.name},#{checksum_value},#{transfer_source_path}\n")
return
end
unless checksum = Checksum.find_or_create_by(value: checksum_value, transfer_source: transfer_source, checksum_algorithm: checksum_algorithm)

Member:
Do you think it's worth adding a multi-column index on [:value, :transfer_source, :checksum_algorithm] to support this query?

Member:
Or at the very least, probably worth adding an index on the :value column.
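
For illustration, the suggested indexes might look roughly like the following Rails migration. This is a sketch only; the checksums table name, the checksum_algorithm_id column, and the migration version are assumptions based on the query above, not confirmed by this PR.

# Sketch of the indexes suggested above; table/column names and the
# migration version are assumptions based on the surrounding discussion.
class AddIndexesToChecksums < ActiveRecord::Migration[7.0]
  def change
    # Single-column index to support lookups by checksum value alone
    add_index :checksums, :value

    # Composite index covering the full find_or_create_by lookup
    add_index :checksums, [:value, :transfer_source_id, :checksum_algorithm_id],
              name: 'index_checksums_on_value_source_and_algorithm'
  end
end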

Member Author:
That would be a big index. Two of the fields are already indexed, so I'm hoping that (1) MySQL is sorting that out and not doing a table scan, and (2) we can let this slide for batch checksum creation? FWIW, I was able to create 270k pretty fast yesterday.
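
One way to check that assumption is to look at the query plan. A minimal sketch, assuming the Checksum model and the columns from the code above, with placeholder values:

# Sketch: print MySQL's plan for the lookup to confirm it uses an existing
# index rather than a full table scan. The values below are placeholders.
puts Checksum.where(
  value: 'abc123',
  transfer_source_id: 1,
  checksum_algorithm_id: 1
).explain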

Member Author:
Find by value would be intense. Though if that's a query we want to support...

Member:
Ah! I wasn't thinking about the index on transfer_source_id! You're right, MySQL is probably using that one automatically, so it makes sense that the 270k operation was quick!

Find by value is probably only necessary if we also want to use this table to identify duplicates, but that can always come later if needed.

Consider my suggestions here withdrawn!
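
If duplicate identification is picked up later, the value-based lookup mentioned above could be sketched like this, assuming the Checksum model from the code under review:

# Sketch: find checksum values stored more than once; returns a hash of
# value => occurrence count for potential duplicates.
duplicate_values = Checksum.group(:value).having('COUNT(*) > 1').count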

Member Author:
There's something weird about this table that might indicate an error in here.

Member Author:
NVM, stale metadata; the table just needed an ANALYZE.
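
For reference, refreshing MySQL's table statistics from a Rails console is a one-liner. This is a sketch that assumes the table is named checksums:

# Sketch: update MySQL's optimizer statistics for the (assumed) checksums
# table so stale row-count metadata doesn't mislead query planning.
ActiveRecord::Base.connection.execute('ANALYZE TABLE checksums')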

open("log/checksum-csv-#{start.to_i}#{'-dry_run' if dry_run}.log", "w") do |log|
CSV.foreach(csv_path, headers: true).each do |row|
# checksum_algorithm_name,checksum_value,transfer_source_path
checksum_algorithm_name = row['checksum_algorithm_name']

Member:
Even though our database is using case-insensitive collation for comparison, we might want to downcase any sha256 checksum hex values here for stored value consistency. (But no modifications to CRC32C base64 values of course, since those are case-sensitive.)

Although we could just make sure that our input spreadsheets always contain downcased sha256 values, it might be a little safer to have that quick conversion here, especially if we'll be gathering already-calculated sha256 values from a number of sources that may or may not have downcased them.

What do you think?
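
The conversion could be as small as the following sketch; the 'sha256' algorithm-name string is an assumption about how the loader distinguishes hex digests from base64 CRC32C values:

# Sketch: normalize hex digests (sha256) to lowercase for consistent storage,
# but leave base64-encoded CRC32C values untouched since base64 is case-sensitive.
checksum_value = checksum_value.downcase if checksum_algorithm_name == 'sha256'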

Member Author:
That doesn't seem unreasonable at all, especially if we're gathering precalculated checksums from other sources as inputs.

Member:
Great! Just pushed a commit that does that.

Member Author:
Make sure there are no loads running before you pull, please!

@elohanlon (Member):
@barmintor Heads up that I just rebased this branch on main to pick up some CI changes.

Commit: "…ays downcase sha256 checksum values before storing them"
@elohanlon merged commit 778fa01 into main on Mar 4, 2024.
1 check passed
@elohanlon deleted the checksum-loaders branch on March 4, 2024 at 17:58.