[Pal-SGX/LibOS] Move (expensive) creation of (shallow) merkle tree of trusted files out of critical path #1846

Open
g2flyer opened this issue Apr 16, 2024 · 3 comments

Comments

@g2flyer
Contributor

g2flyer commented Apr 16, 2024

Description of the feature

For trusted files, replace the full-file hash in the manifest with a hash over a new meta-data file, which is produced as a preprocessing step and contains a Merkle-tree-like structure. Then, at runtime, verify the integrity of that meta-data file, load the Merkle-tree-like structure into memory, and on access verify each file chunk that is read against it.
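
To make the proposed flow concrete, here is a minimal Python sketch, not Gramine code. It assumes a flat list of per-chunk hashes rather than a full tree, and it assumes chunk hashes are SHA-256 truncated to 128 bits; all function names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 16 * 1024  # chunk size mentioned in this issue
HASH_LEN = 16           # assumption: chunk hashes truncated to 128 bits


def chunk_hash(chunk: bytes) -> bytes:
    """Truncated SHA-256 of one chunk (the truncation is an assumption)."""
    return hashlib.sha256(chunk).digest()[:HASH_LEN]


def build_metadata(data: bytes) -> bytes:
    """Preprocessing step: concatenate one truncated hash per chunk."""
    return b"".join(chunk_hash(data[off:off + CHUNK_SIZE])
                    for off in range(0, len(data), CHUNK_SIZE))


def load_chunk_hashes(metadata: bytes, manifest_hex: str) -> list:
    """Startup: check the metadata against the manifest hash, then split it."""
    if hashlib.sha256(metadata).hexdigest() != manifest_hex:
        raise ValueError("metadata does not match the hash in the manifest")
    return [metadata[i:i + HASH_LEN]
            for i in range(0, len(metadata), HASH_LEN)]


def verify_chunk(hashes: list, index: int, chunk: bytes) -> bool:
    """On access: check one chunk that was just read against its stored hash."""
    return 0 <= index < len(hashes) and hashes[index] == chunk_hash(chunk)
```

In this sketch only the (small) metadata is hashed at startup; the file itself is hashed chunk by chunk, and only for chunks that are actually read.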

The corresponding meta-data files might best be generated in a manifest-relative directory tree mimicking the absolute paths from the manifest. (This would allow easy sharing of such meta-data files across different applications, yet would not require touching system directories where many of the libraries might be installed.)

This is probably best done/looked at after the move of trusted files from PAL to LibOS (RFC #578 / WIP-PR #1812), as this could also potentially enable a consolidation with the file format of protected files into a common integrity format (just without encryption in this case)? (After further reflection, that is probably more hassle than it is worth, as there is no easy way to turn GMAC/GHASH into an (unkeyed) cryptographic hash, and the file-format requirements are probably also too different.) As an initial step, though, it is probably easiest just to put the existing sgx_chunk_hash_t struct into the meta-data file and to reuse more or less the existing code from load_trusted_or_allowed_file for the preprocessing utility.

Why Gramine should implement it?

Currently, trusted files are included in the manifest via a full-file hash; at runtime, the full file is read and the hash is verified. To verify later accesses to the file without having to keep the full file in memory, a (memory-resident) array of hashes of all 16 KB chunks is built, so that any chunk which is evicted from the cache and has to be re-read can be verified separately. The creation of this integrity data structure is in the critical path at runtime and incurs an additional, potentially very significant, latency cost proportional to the file size, regardless of how much of that file is accessed. Performing the creation of the integrity structure as a preprocessing step can completely eliminate this runtime latency cost and will allow the use of large files. Also, as many trusted files will be re-used across different Gramine applications or invocations of a single Gramine application, there will be additional computational cost savings.
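
As a rough illustration of the sizes involved, the following sketch compares the bytes that must be hashed at startup today (the whole file) with the size of a flat array of chunk hashes. The 128-bit hash length and the 1 GiB example file are assumptions made for the arithmetic, not measurements.

```python
CHUNK_SIZE = 16 * 1024  # 16 KB chunks, as described above
HASH_LEN = 16           # assumption: one 128-bit hash per chunk


def num_chunks(file_size: int) -> int:
    """Number of chunks needed to cover the file (the last may be partial)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE


def metadata_size(file_size: int) -> int:
    """Bytes of chunk hashes that would be read and hashed at startup."""
    return num_chunks(file_size) * HASH_LEN


one_gib = 1 << 30
print(num_chunks(one_gib))                # 65536
print(metadata_size(one_gib))             # 1048576 (1 MiB)
print(one_gib // metadata_size(one_gib))  # 1024
```

Under these assumptions, the startup work drops from hashing the full 1 GiB file to hashing about 1 MiB of chunk hashes.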

@dimakuv

dimakuv commented Apr 17, 2024

> The corresponding meta-data files might be best generated in a manifest-relative directory tree mimicking the absolute paths from the manifest

What about the following idea:

# manifest file contains
sgx.trusted_files = [
   { uri = "file:my/file", meta_file_sha256 = "1234567890abcdef" },
]

# the files in the Gramine app dir
app.manifest.sgx
1234567890abcdef.trusted_file.meta

So the metadata file's hash becomes the name of the metadata file itself. This allows two things:

  1. If two files (e.g. symlinks) have the same contents, they are both served by the same metadata file
  2. Flat dir structure, instead of thinking about abs/rel paths (well, maybe put these meta files under a special dir to not pollute the main dir)
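
A small sketch of this content-addressed naming, assuming the `.trusted_file.meta` suffix from the example above; the helper names and the directory argument are hypothetical.

```python
import hashlib
from pathlib import Path

META_SUFFIX = ".trusted_file.meta"  # suffix taken from the example above


def store_metadata(meta_dir: Path, metadata: bytes) -> str:
    """Write metadata under its own SHA-256 and return that hash (hex)."""
    digest = hashlib.sha256(metadata).hexdigest()
    meta_dir.mkdir(parents=True, exist_ok=True)
    (meta_dir / (digest + META_SUFFIX)).write_bytes(metadata)
    return digest


def load_metadata(meta_dir: Path, digest: str) -> bytes:
    """Look the metadata file up by the manifest hash and re-check it."""
    metadata = (meta_dir / (digest + META_SUFFIX)).read_bytes()
    if hashlib.sha256(metadata).hexdigest() != digest:
        raise ValueError("metadata file does not match the manifest hash")
    return metadata
```

Two trusted files with identical contents produce identical metadata and therefore the same file name, so they end up sharing one metadata file.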

Each metadata file should be in binary form, as otherwise it will take unnecessary time to convert text to binary and take 2x as much space. I imagine a format like this:

<uint64_t file_size> <sha128_t hash_chunk0> <sha128_t hash_chunk1> ...
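
One way to read and write such a layout, as a sketch only: little-endian byte order, 16-byte hashes and the function names are assumptions, since the layout above does not specify them.

```python
import struct

CHUNK_SIZE = 16 * 1024  # chunk size discussed in this issue
HASH_LEN = 16           # assumption: each chunk hash is 128 bits


def pack_metadata(file_size: int, hashes: list) -> bytes:
    """Serialize as <uint64 file_size> followed by the chunk hashes."""
    if any(len(h) != HASH_LEN for h in hashes):
        raise ValueError("every chunk hash must be HASH_LEN bytes")
    return struct.pack("<Q", file_size) + b"".join(hashes)


def unpack_metadata(blob: bytes) -> tuple:
    """Parse the layout and check the hash count against the file size."""
    if len(blob) < 8:
        raise ValueError("metadata too short")
    (file_size,) = struct.unpack_from("<Q", blob, 0)
    expected = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    body = blob[8:]
    if len(body) != expected * HASH_LEN:
        raise ValueError("hash count does not match file size")
    hashes = [body[i:i + HASH_LEN] for i in range(0, len(body), HASH_LEN)]
    return file_size, hashes
```

Storing the file size first lets the reader reject a truncated or padded hash list before any chunk is verified.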

> Although as an initial step it is probably easiest just to put the existing sgx_chunk_hash_t struct into the meta-data file and to reuse more or less the existing code from load_trusted_or_allowed_file for the preprocessing utility.

Yes, I totally agree. I initially thought of having a proper Merkle Tree, but in reality there's no need for it (for read-only files).

Related task: think if the hard-coded 16KB size of chunks is good enough. Maybe:

  • Some other constant
  • Not a constant but a percentage of the file size (e.g., such that the file is always split into 128 chunks or something)

@g2flyer
Contributor Author

g2flyer commented Apr 17, 2024

> The corresponding meta-data files might be best generated in a manifest-relative directory tree mimicking the absolute paths from the manifest
>
> What about the following idea:
>
> # manifest file contains
> sgx.trusted_files = [
>    { uri = "file:my/file", meta_file_sha256 = "1234567890abcdef" },
> ]

Yep, it definitely makes sense to use the same structure but with a different key, so one could distinguish old from new manifests. However, I assume one can get away without backwards compatibility, i.e., without having to support either the old or the new format (or even both at the same time)? Or is such backwards compatibility for manifests a must?

> # the files in the Gramine app dir
> app.manifest.sgx
> 1234567890abcdef.trusted_file.meta
>
> So the metadata file's hash becomes the name of the metadata file itself. This allows two things:
>
> 1. If two files (e.g. symlinks) have the same contents, they are both served by the same metadata file
> 2. Flat dir structure, instead of thinking about abs/rel paths (well, maybe put these meta files under a special dir to not pollute the main dir)

It is true that my proposed directory hierarchy is messier than your flat structure, but your hashed filenames seem to make these files harder to manage (e.g., no easy reverse name lookup) and also harder to share across applications (e.g., the same file could be accessed via different paths in different apps)? Related to that, one other thing which comes to mind is that it would be nice to be able to see when a meta-data file doesn't match the target file, e.g., after a library update (at least assuming that we would like to cache and share meta-data files and create them only on demand during manifest creation). One useful convention could be to just set the last-modified time of the meta-data file at creation to be the same as that of the target file?
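
The last-modified convention could look roughly like this at manifest-creation time. This is a sketch with hypothetical helper names; it only serves as a staleness hint for the tooling, since integrity would still rest on the hashes.

```python
import os


def stamp_metadata(target_path: str, meta_path: str) -> None:
    """Give the metadata file the same modification time as its target."""
    st = os.stat(target_path)
    os.utime(meta_path, ns=(st.st_atime_ns, st.st_mtime_ns))


def metadata_is_stale(target_path: str, meta_path: str) -> bool:
    """True if the target's mtime no longer matches the stamped metadata."""
    return (os.stat(target_path).st_mtime_ns
            != os.stat(meta_path).st_mtime_ns)
```

A preprocessing tool could call `metadata_is_stale` first and regenerate the metadata only when it returns True.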

BTW: regarding symlinks, I implicitly assumed that in my scenario they would just be mapped to relative symlinks.

> Each metadata file should be in binary form, as otherwise it will take unnecessary time to convert text to binary and take 2x as much space. I imagine a format like this:
>
> <uint64_t file_size> <sha128_t hash_chunk0> <sha128_t hash_chunk1> ...

Yep, definitely was also thinking about binary as you say above, maybe with a header defining type and version?
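
A header carrying type and version might be sketched as follows. The magic value, the field widths and the reserved field are all hypothetical; nothing in this thread fixes them.

```python
import struct

MAGIC = b"GTFM"  # hypothetical 4-byte type tag
VERSION = 1      # hypothetical format version
# magic, version, reserved, file_size (little-endian, 16 bytes in total)
HEADER = struct.Struct("<4sHHQ")


def pack_header(file_size: int) -> bytes:
    """Build the fixed-size header that precedes the chunk hashes."""
    return HEADER.pack(MAGIC, VERSION, 0, file_size)


def unpack_header(blob: bytes) -> int:
    """Validate type and version, then return the recorded file size."""
    if len(blob) < HEADER.size:
        raise ValueError("metadata too short for header")
    magic, version, _reserved, file_size = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not a trusted-file metadata file")
    if version != VERSION:
        raise ValueError("unsupported metadata version")
    return file_size
```

The chunk hashes would follow directly after the header, and the reserved field leaves room for later additions without changing the header size.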

> Related task: think if the hard-coded 16KB size of chunks is good enough. Maybe:
>
>   • Some other constant
>   • Not a constant but a percentage of the file size (e.g., such that the file is always split into 128 chunks or something)

Percentage of course keeps the in-memory representation constant, but for large files it could result in really large chunks, which could be very expensive when doing somewhat random access? I would probably start with 16 KB for the easiest transition, but then maybe parameterize that to a power-of-two multiple of the page size (with the size parameter being part of the meta-data file header)? Or maybe just use the Linux normal, large and huge page sizes? I think either way could give enough flexibility to tune applications without being too complicated. That said, this is just my gut feeling and not backed by deeper thoughts or investigations of applications with large files.
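
To illustrate the trade-off, the sketch below checks the "power-of-two multiple of the page size" rule and estimates how much data must be re-hashed to verify a small read. The 4 KiB page size, the 64 GiB example file and the function names are assumptions.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base page size


def is_valid_chunk_size(chunk_size: int) -> bool:
    """Accept only power-of-two multiples of the page size."""
    if chunk_size < PAGE_SIZE or chunk_size % PAGE_SIZE != 0:
        return False
    multiple = chunk_size // PAGE_SIZE
    return (multiple & (multiple - 1)) == 0


def chunk_size_for_fixed_count(file_size: int, count: int = 128) -> int:
    """Chunk size if the file is always split into `count` chunks."""
    return -(-file_size // count)  # ceiling division


def bytes_hashed_for_read(offset: int, length: int, chunk_size: int) -> int:
    """Upper bound on bytes hashed to verify one read of `length` bytes."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return (last - first + 1) * chunk_size


big_file = 64 << 30  # hypothetical 64 GiB file
print(chunk_size_for_fixed_count(big_file))     # 536870912 (512 MiB)
print(bytes_hashed_for_read(0, 4096, 16 * 1024))  # 16384
```

With a fixed count of 128 chunks, a single 4 KiB read from this hypothetical file would require hashing a 512 MiB chunk, versus 16 KiB with the current fixed chunk size.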

@mkow
Member

mkow commented Apr 27, 2024

Actually, with symlinks, maybe we could finally properly support and measure them? If we have a metadata file, then we can put the file type there, so there are no security issues with having them anymore.
I assume that in the future we'll also put some more metadata there, like creation time or permissions, which we now either pass through blindly or fake with dummy values.
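
As a sketch of what recording the file type at preprocessing time could look like: the record layout and the function name are hypothetical, and which fields would actually be measured is left open in this thread.

```python
import os
import stat


def file_type_record(path: str) -> dict:
    """Describe a path without following symlinks (uses lstat)."""
    st = os.lstat(path)
    if stat.S_ISLNK(st.st_mode):
        # Record the link target so it can be measured with the metadata.
        return {"type": "symlink", "target": os.readlink(path)}
    if stat.S_ISDIR(st.st_mode):
        return {"type": "directory"}
    if stat.S_ISREG(st.st_mode):
        return {"type": "regular", "size": st.st_size,
                "mode": stat.S_IMODE(st.st_mode)}
    return {"type": "other"}
```

Because the record would be part of the hashed metadata, the file type and link target would be covered by the same manifest hash as the chunk hashes.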

> Related task: think if the hard-coded 16KB size of chunks is good enough. Maybe:
>
>   • Some other constant
>   • Not a constant but a percentage of the file size (e.g., such that the file is always split into 128 chunks or something)

As @g2flyer already said, I think percentage doesn't make sense in this scenario. I'd keep 16kB, but allow easy tuning via a #define if someone would want to experiment in the future.
