-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
disk cache: store a data integrity header for non-CAS blobs #186
base: master
Are you sure you want to change the base?
Conversation
6c1660d
to
061cfa2
Compare
061cfa2
to
314b7c2
Compare
@buchgr: any thoughts on this? In particular:
|
594719e
to
1b97393
Compare
Is there a good reason to not use protobuf for the header? Did you consider reusing the Digest message from the REAPI?
I think it's a great idea. Thanks! |
if err != nil { | ||
log.Fatal(err) | ||
|
||
casDir := filepath.Join(dir, cache.CAS.String(), subDir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also version the CAS dir? I think it's possible that we'll need to make changes to it eventually as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skipped versioning the CAS dir when I first implemented this because they are already stored in a natural representation on disk. But then I added support for different hash types, and forgot to revisit this choice.
Rethinking it now, CAS blobs should have a header to define the hash type, at the small cost of making it more difficult to verify blobs with the sha256sum command line utility. So the CAS dir should be versioned also. Maybe we should use the same header as for AC/RAW blobs, to simplify our code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A simple alternative could be to encode the hash function in the path string i.e. cas/sha256. Would that be too complex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting idea- we could then just use a Digest proto as the header, and the blob offset is determined by the hash type in the filename.
ac.v2/sha256/<key>
ac.v2/md5/<key>
cas.v2/sha256/<key>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out that using a Digest like this is not so simple, since due to the SizeBytes field it doesn't have a fixed size on disk even if the Hash field has a constant length. So this would still require a prefix to store the length of the proto.
It would work well with a hand-rolled header though.
addChecksum: addChecksum, | ||
} | ||
|
||
if scanDir.version < 2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it necessary to mix the migration detection with the loading logic? Shouldn't detecting whether a migration is necessary be as simple as checking whether a folder V-1 exists and is not empty?
if hasItems("ac/)) {
migrate();
rmdir("ac/")
}
load_items()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked through this code again, and remembered why I mixed the logic:
- So we end up with an accurate order for the LRU index (migrating old style files first would always place them at the start of the index even if they're older than new style files).
- Building the LRU index is slower when done sequentially than in parallel, and I would prefer cache upgrades to be fast (less chance that the admin might think something went wrong and ctrl-c in the middle of an upgrade).
I didn't want to use a single protobuf for the entire file (header + blob) because it requires reading the entire file into memory in order to unmarshal, and that might be undesirable for large files. Using a protobuf just for the header with the blob concatenated seems a little complex (but maybe I just know too little about protobuf). We would need to read an unknown (but bounded) number of bytes of the file into a buffer in order to unmarshal the header- I'm unsure if it is safe to unmarshal protobuf messages with arbitrary bytes appended.
The Digest message alone is insufficient if we want to allow other hash functions in the future. But this would make sense if we figure out a good way to use protobuf for the header. |
There is the delimited protobuf format that's recommended in the official docs [1] [2]. It's implemented in the Java protobuf library [3]. It simply prepends the serialized proto with the number of bytes to read. Bazel uses it for the BEP. [1] https://developers.google.com/protocol-buffers/docs/techniques#streaming |
Interesting, I'll try this out- thanks! |
As an alternative to storing data as delimited Protobufs there is also riegeli [1], which allows storing metadata in the same file and doesn't require reading the whole file into memory. The kythe project seems to have a Go implementation available [2]. The nice thing about riegeli is that it can also automatically compress the data and save some disc space (although I'm not sure whether the overhead is worth it for non-CAS blobs, even with very fast compression modes like brotli or snappy). [1] https://github.com/google/riegeli |
Thanks @Yannic. The overhead of the header is negligible and there isn't a need for compression. I'd prefer to not pull in another dependency for the header. |
I would also prefer not to pull in another dependency for what I hope can be a simple header. But before ruling it out I wonder if riegeli could also give us the ability to compress the data blobs, re #12 / bazelbuild/bazel#4575 ? |
Writing the header proto size, then the message, then the content sounds pretty reasonable to me. For inspiration, similar code is in: https://github.com/kythe/kythe/blob/master/kythe/go/platform/delimited/delimited.go |
The issue #12 talks about making compression part of the caching protocol. It was always my understanding that compression is most interesting for CAS blobs. We do not store protobufs in the CAS. For that zlib sounds more interesting for compression and I believe it is shipped as part of the Go standard library. Does any of you feel differently? |
The header is made up of three fields: 1) Little-endian int32 (4 bytes) representing the REAPIv2 DigestFunction. 2) Little-endian int64 (8 bytes) representing the number of bytes in the blob. 3) The hash bytes from the digest, length determined by the particular DigestFunction. (32 for SHA256. 20 for SHA1, 16 for MD5). Note that we currently only support SHA256, however. This header is simple to parse, and does not require buffering the entire blob in memory if you just want the data. To distinguish blobs with and without this header, we use new directories for the affected blobs: ac.v2/ instead of ac/ and similarly for raw/. We do not use this header to actually verify data yet, and we still os.File.Sync() after file writes (buchgr#67). This also includes a slightly refactored version of PR buchgr#123 (load the items from disk concurrently) by @bdittmer.
Which cache operations should do the data verification? Put? Get? Contains? Have you benchmarked how performance is affected? Especially for HTTP/1.1-without-SSL use cases, where golang’s io.Copy can use linux kernel’s sendfile(2) for transfeering data directly from file to socket, instead of copying data to and from user space. I think that is what today makes bazel-remote on par with heavily optimized http servers like NGINX. |
I haven't performed any benchmarks yet, and I haven't figured out when/where to perform the validation. But I suspect that we will either validate data on the first Get or Contains call, and cache the result until the file is rewritten (or the server restarts). |
The header is made up of three fields:
DigestFunction.
of bytes in the blob.
the particular DigestFunction.
(32 for SHA256. 20 for SHA1, 16 for MD5).
Note that we currently only support SHA256, however.
This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.
To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.
We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (#67).
This also includes a slightly refactored version of PR #123
(load the items from disk concurrently) by @bdittmer.