-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AN-380 Make call cache hashing strategy configurable per filesystem and backend #7683
base: develop
Are you sure you want to change the base?
Conversation
…nto jd_AN-380_hash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matches the mental picture I developed during mob times. Nice work!
docs/Configuring.md
Outdated
###### Notes on GCS Hashing Strategy | ||
|
||
For some high-throughput production use cases that run many, many copies of the same task differing by only one input file, | ||
the collision rate of `crc32c` may be unacceptably high. To dramatically reduce the chance of collision at the cost of | ||
reducing the collection of tasks that can be call cached, we recommend `hashing-strategy: ["md5", "identity"]`. This | ||
will use `md5` hashes when they exist, and fall back to the very strict `identity` strategy when they do not. Because | ||
all GCS files created by Cromwell are guaranteed to have `md5`, `identity` comes into play only for user-provided workflow | ||
input files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good explanation, one thing to add could be "why?" md5 does not exist, namely parallel composite upload
@@ -174,23 +176,39 @@ object GcsBatchSizeCommand { | |||
file.objectBlobId.map(GcsBatchSizeCommand(file, _)) | |||
} | |||
|
|||
case class GcsBatchCrc32Command(override val file: GcsPath, override val blob: BlobId, setUserProject: Boolean = false) | |||
extends IoHashCommand(file) | |||
case class GcsBatchHashCommand(override val file: GcsPath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extremely optional bikeshedding.
I've never been the biggest fan of this name, since this command is for a single file... it does not contain a batch.
Doubly so now that GCP Batch competes with GCS batch for headspace.
case class GcsBatchHashCommand(override val file: GcsPath, | |
case class GcsHashCommand(override val file: GcsPath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you but would rather leave as-is and make a larger-scale change to this whole package later.
private def getFileHashForGcsPath(gcsPath: GcsPath): IO[FileHash] = delayedIoFromTry { | ||
gcsPath.objectBlobId.map(id => FileHash(HashType.GcsCrc32c, gcsPath.cloudStorage.get(id).getCrc32c)) | ||
} | ||
private def getFileHashForGcsPath(gcsPath: GcsPath, hashStrategy: FileHashStrategy): IO[Option[FileHash]] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also extremely optional, I wonder if we could apply ✨ More Types ✨ to DRY out this group of functions.
private def getFileHashForGcsPath(gcsPath: GcsPath, hashStrategy: FileHashStrategy): IO[Option[FileHash]] = | |
private def getFileHashForPath[T](path: T, hashStrategy: FileHashStrategy[T]): IO[Option[FileHash]] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, it would just mean that function would have a big match/case statement in it... which might be nicer, I'll give it a try. I kind of like the idea of doing cool stuff with FileHashStrategy[T]
but don't want to hold this up that much further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline and decided to leave this as-is.
fsConfigs <- configurationDescriptor.backendConfig.as[Option[Config]]("filesystems").toList | ||
fsKey <- fsConfigs.root.keySet().asScala | ||
configKey = s"${fsKey}.caching.hashing-strategy" | ||
// TODO this allows users to override with an empty list to prevent caching, is that desirable or a footgun? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a global call-caching.enabled
flag that seems like a better choice.
I don't it makes sense to let users toggle call caching on a per-filesystem basis because one backend can use multiple filesystems in the same task (e.g. GCS + DRS).
I would make empty list an exception if call-caching.enabled
is true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I only have a few minor comments, the documentation is great and super clear to follow, especially for somebody not as familiar with call caching like me.
val defaultHashingStrategies: Map[String, FileHashStrategy] = Map.empty | ||
|
||
// Hashing strategy to use if none is specified. | ||
val fallbackHashingStrategy: FileHashStrategy = FileHashStrategy.Md5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be the list of md5 + identity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the generic fallback across all backends and filesystems, since identity is only implemented for GCP I'm afraid that would confuse people.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense!
} | ||
|
||
object FileHashStrategy extends LazyLogging { | ||
// Define some commonly-used strategies so they're easier to refer to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to also add the identity strategy here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but I don't think it will ever be referred to in code since we aren't hard-coding it anywhere - we'll just be setting it in config. I'm a little reluctant to add technically dead code, but if you think it would be useful for documenting suggested strategies I can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh gotcha, I am a bit fuzzy around the identity matching still, that is technically not a hash I guess, leaving it out is fine, especially if it would confuse the next dev working in that area.
} | ||
} | ||
} | ||
// it should "fail if DrsPath hash doesn't match checksum" in { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh man good catch, I commented out these tests to "fix later" and then forgot about them. Later is now, will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I resolved this by removing the functionality checked by these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Info: test failures are the requester-pays enablement scream test
filesystems/gcs/src/main/scala/cromwell/filesystems/gcs/batch/GcsBatchIoCommand.scala
Outdated
Show resolved
Hide resolved
// blob id, which includes bucket name, blob name, and generation. This slash-delimited string is md5-hashed to | ||
// obtain a string of a consistent length. This is used for call caching. Logic for generating this fingerprint | ||
// is centralized here because different codepaths are dealing with different representations of the GS object. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would love to DRY the below a little more, but not sure I can without finding the magical Google library type that StorageObject
and Blob
both inherit from. I've poked around a little for that and haven't found anything useful.
Description
See
Configuring.md
changes for a description of what this does and how it works.Release Notes Confirmation
CHANGELOG.md
CHANGELOG.md
in this PRCHANGELOG.md
because it doesn't impact community usersTerra Release Notes