Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AN-380 Make call cache hashing strategy configurable per filesystem and backend #7683

Open
wants to merge 41 commits into
base: develop
Choose a base branch
from

Conversation

jgainerdewar
Copy link
Collaborator

@jgainerdewar jgainerdewar commented Jan 24, 2025

Description

See Configuring.md changes for a description of what this does and how it works.

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@jgainerdewar jgainerdewar marked this pull request as ready for review February 4, 2025 21:48
@jgainerdewar jgainerdewar requested a review from a team as a code owner February 4, 2025 21:48
Copy link
Collaborator

@aednichols aednichols left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matches the mental picture I developed during mob times. Nice work!

Comment on lines 532 to 539
###### Notes on GCS Hashing Strategy

For some high-throughput production use cases that run many, many copies of the same task differing by only one input file,
the collision rate of `crc32c` may be unacceptably high. To dramatically reduce the chance of collision at the cost of
reducing the collection of tasks that can be call cached, we recommend `hashing-strategy: ["md5", "identity"]`. This
will use `md5` hashes when they exist, and fall back to the very strict `identity` strategy when they do not. Because
all GCS files created by Cromwell are guaranteed to have `md5`, `identity` comes into play only for user-provided workflow
input files.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good explanation, one thing to add could be "why?" md5 does not exist, namely parallel composite upload

@@ -174,23 +176,39 @@ object GcsBatchSizeCommand {
file.objectBlobId.map(GcsBatchSizeCommand(file, _))
}

case class GcsBatchCrc32Command(override val file: GcsPath, override val blob: BlobId, setUserProject: Boolean = false)
extends IoHashCommand(file)
case class GcsBatchHashCommand(override val file: GcsPath,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extremely optional bikeshedding.

I've never been the biggest fan of this name, since this command is for a single file... it does not contain a batch.

Doubly so now that GCP Batch competes with GCS batch for headspace.

Suggested change
case class GcsBatchHashCommand(override val file: GcsPath,
case class GcsHashCommand(override val file: GcsPath,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you but would rather leave as-is and make a larger-scale change to this whole package later.

private def getFileHashForGcsPath(gcsPath: GcsPath): IO[FileHash] = delayedIoFromTry {
gcsPath.objectBlobId.map(id => FileHash(HashType.GcsCrc32c, gcsPath.cloudStorage.get(id).getCrc32c))
}
private def getFileHashForGcsPath(gcsPath: GcsPath, hashStrategy: FileHashStrategy): IO[Option[FileHash]] =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also extremely optional, I wonder if we could apply ✨ More Types ✨ to DRY out this group of functions.

Suggested change
private def getFileHashForGcsPath(gcsPath: GcsPath, hashStrategy: FileHashStrategy): IO[Option[FileHash]] =
private def getFileHashForPath[T](path: T, hashStrategy: FileHashStrategy[T]): IO[Option[FileHash]] =

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, it would just mean that function would have a big match/case statement in it... which might be nicer, I'll give it a try. I kind of like the idea of doing cool stuff with FileHashStrategy[T] but don't want to hold this up that much further.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline and decided to leave this as-is.

fsConfigs <- configurationDescriptor.backendConfig.as[Option[Config]]("filesystems").toList
fsKey <- fsConfigs.root.keySet().asScala
configKey = s"${fsKey}.caching.hashing-strategy"
// TODO this allows users to override with an empty list to prevent caching, is that desirable or a footgun?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a global call-caching.enabled flag that seems like a better choice.

I don't it makes sense to let users toggle call caching on a per-filesystem basis because one backend can use multiple filesystems in the same task (e.g. GCS + DRS).

I would make empty list an exception if call-caching.enabled is true.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will change.

Copy link
Contributor

@LizBaldo LizBaldo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I only have a few minor comments, the documentation is great and super clear to follow, especially for somebody not as familiar with call caching like me.

val defaultHashingStrategies: Map[String, FileHashStrategy] = Map.empty

// Hashing strategy to use if none is specified.
val fallbackHashingStrategy: FileHashStrategy = FileHashStrategy.Md5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be the list of md5 + identity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the generic fallback across all backends and filesystems, since identity is only implemented for GCP I'm afraid that would confuse people.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense!

}

object FileHashStrategy extends LazyLogging {
// Define some commonly-used strategies so they're easier to refer to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to also add the identity strategy here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but I don't think it will ever be referred to in code since we aren't hard-coding it anywhere - we'll just be setting it in config. I'm a little reluctant to add technically dead code, but if you think it would be useful for documenting suggested strategies I can.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh gotcha, I am a bit fuzzy around the identity matching still, that is technically not a hash I guess, leaving it out is fine, especially if it would confuse the next dev working in that area.

docs/Configuring.md Outdated Show resolved Hide resolved
}
}
}
// it should "fail if DrsPath hash doesn't match checksum" in {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh man good catch, I commented out these tests to "fix later" and then forgot about them. Later is now, will fix.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I resolved this by removing the functionality checked by these tests.

Copy link
Collaborator

@aednichols aednichols left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Info: test failures are the requester-pays enablement scream test

// blob id, which includes bucket name, blob name, and generation. This slash-delimited string is md5-hashed to
// obtain a string of a consistent length. This is used for call caching. Logic for generating this fingerprint
// is centralized here because different codepaths are dealing with different representations of the GS object.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would love to DRY the below a little more, but not sure I can without finding the magical Google library type that StorageObject and Blob both inherit from. I've poked around a little for that and haven't found anything useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants