
Let writers specify the chunk size they want #26

Open
wants to merge 5 commits into base: pkhuong/podman

Conversation

@pkhuong (Collaborator) commented Jan 22, 2024

The first two commits are small cleanups.

The next commit adds an optional base_chunk_size field in the ManifestV1 proto. When this new field is populated, it holds the base chunk size for the chunks referenced by that manifest: all but the last chunk must have that size (the final chunk may be smaller). This new proto field uses the previously unused protobuf field index 10.
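
To make that invariant concrete, here is a minimal reader-side sketch; the function and field names are illustrative, not the actual Verneuil API:

```rust
/// Sketch (hypothetical names) of the check the new field implies: when
/// `base_chunk_size` is populated, every chunk except the last must be
/// exactly that size, and the last chunk may be shorter.
fn chunk_sizes_match_manifest(chunk_sizes: &[u64], base_chunk_size: Option<u64>) -> bool {
    let base = match base_chunk_size {
        // Old manifests leave the field unset; readers keep assuming the
        // historical 64 KiB chunk size.
        None => return true,
        Some(base) => base,
    };

    match chunk_sizes.split_last() {
        None => true, // An empty manifest is trivially consistent.
        Some((last, rest)) => rest.iter().all(|&size| size == base) && *last <= base,
    }
}
```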

Given this infrastructure, it's easy to make the writer's chunk size configurable (with an environment variable for now), and to pipe that to the manifest.

Tested with t/test.sh (i.e., large chunks of the sqlite test suite with validation of the replicated data).

@pkhuong force-pushed the pkhuong/podman branch 2 times, most recently from 16d1f9a to 5dea95c on January 23, 2024 00:54
…nitial

Manifest has grown pretty big, as flagged by clippy.

TESTED=it builds.
The code so far assumes that readers recognize the fingerprint for
a full-sized chunk that's all zero-filled.  The chunk size is currently
constant, 64 KiB.  Decouple that constant from the read and write side
logic, since it's really become part of the replication protocol.

TESTED=partial test.sh
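
A rough sketch of what the decoupling amounts to; the read-side constant name is invented for the example, and the actual code layout may differ:

```rust
// Illustrative only: the single shared 64 KiB value becomes one constant per
// side of the protocol, so the write side can later be made configurable
// without changing what readers assume about already-replicated data.

/// Write side: granularity at which the tracker snapshots the database.
pub const SNAPSHOT_GRANULARITY: u64 = 64 * 1024;

/// Read side: chunk size assumed for manifests that don't state one;
/// this is effectively part of the replication protocol.
pub const DEFAULT_READ_CHUNK_SIZE: u64 = 64 * 1024;
```
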
…buf index 10)

Still with a default of 64 KiB, except with a differently named constant.

The (write) tracker's SNAPSHOT_GRANULARITY constant is now WRITE_SNAPSHOT_GRANULARITY,
and the patch shows that we only had to update write-side code or
constants used for performance heuristics.

On the write side, we stick in the default `None`: this lets us
(temporarily) test what would happen if a new reader accessed old
manifests without this new field.

TESTED=partial test.sh
Hide WRITE_SNAPSHOT_GRANULARITY behind a function, and rename the
constant to `DEFAULT_WRITE_SNAPSHOT_GRANULARITY`, to enforce using
the function.

The Tracker object now keeps track of its snapshot granularity (i.e.,
chunk size, except for the last chunk which may be shorter) as a field...
and pipes it all the way to the serialised ManifestV1.

Manual inspection shows that the remaining references (many of which
also transitively became functions) are performance heuristics
(copier.rs, unzstd.rs); those also apply the current default as a
minimum value, to avoid behaving oddly when the actual chunk size is
eventually configured much smaller than expected.

TESTED=it builds
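
A sketch of the shape this takes; the constant, env-var default, and manifest field follow the commit text, while the function, field, and method names are hypothetical:

```rust
/// Default write-side chunk size; callers go through the accessor below
/// rather than reading the constant directly.
const DEFAULT_WRITE_SNAPSHOT_GRANULARITY: u64 = 64 * 1024;

/// Hypothetical accessor: later commits make this consult configuration
/// instead of always returning the default.
fn write_snapshot_granularity() -> u64 {
    DEFAULT_WRITE_SNAPSHOT_GRANULARITY
}

/// Sketch of the Tracker carrying its own granularity and piping it into
/// the manifest it serialises.
struct Tracker {
    snapshot_granularity: u64,
}

impl Tracker {
    fn new() -> Tracker {
        Tracker { snapshot_granularity: write_snapshot_granularity() }
    }

    fn manifest_base_chunk_size(&self) -> Option<u64> {
        Some(self.snapshot_granularity)
    }
}

/// Performance heuristics (e.g., buffer sizing in copier.rs / unzstd.rs)
/// use at least the default, so a much smaller configured chunk size
/// doesn't shrink them pathologically.
fn heuristic_buffer_size(granularity: u64) -> u64 {
    granularity.max(DEFAULT_WRITE_SNAPSHOT_GRANULARITY)
}
```
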
with the VERNEUIL_WRITE_SNAPSHOT_GRANULARITY environment variable.

The value must be strictly positive, and ideally not too small;
the code currently imposes a min value of 64 bytes.

This is currently an environment variable because there's no clear need to
make it configurable via JSON, but that shouldn't be hard to change:
the Tracker object keeps the current value in an instance field.

TESTED=test.sh, w/o size override and w/ `export VERNEUIL_WRITE_SNAPSHOT_GRANULARITY={16384,196608}`
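
A minimal sketch of the environment-variable handling described above, assuming values below the floor are clamped; the crate's real parsing and error handling may differ:

```rust
use std::env;

/// Per the commit text: the configured value must be positive and is
/// currently floored at 64 bytes.
const MIN_WRITE_SNAPSHOT_GRANULARITY: u64 = 64;
const DEFAULT_WRITE_SNAPSHOT_GRANULARITY: u64 = 64 * 1024;

/// Hypothetical lookup: read VERNEUIL_WRITE_SNAPSHOT_GRANULARITY, fall back
/// to the default when unset or invalid, and clamp small values to the floor.
fn write_snapshot_granularity() -> u64 {
    env::var("VERNEUIL_WRITE_SNAPSHOT_GRANULARITY")
        .ok()
        .and_then(|raw| raw.parse::<u64>().ok())
        .filter(|&size| size > 0)
        .map(|size| size.max(MIN_WRITE_SNAPSHOT_GRANULARITY))
        .unwrap_or(DEFAULT_WRITE_SNAPSHOT_GRANULARITY)
}
```

For example, `export VERNEUIL_WRITE_SNAPSHOT_GRANULARITY=196608` before running test.sh corresponds to the larger of the two values in the TESTED line (192 KiB chunks).
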
@ahicks92 (Contributor) commented Feb 6, 2024

I'll try to get back to this if @jpihlaja-bt doesn't beat me to it. What is the purpose, though? I don't understand what we gain unless it's that you want to change the SQLite page size and are hoping that chunks match it.

@pkhuong (Collaborator, Author) commented Feb 8, 2024

What is the purpose, though?

There's already only a weak relationship between SQLite's page size and Verneuil's chunk size: we must do the right thing for SQLite page sizes smaller than 64 KiB (e.g., the default value). Making the chunk size tweakable (usually larger, I expect) lets us improve the aggregate cost of operations when IOPS are more expensive than storage; that's particularly interesting for read-heavy workloads on mostly static data, where we really want megabyte-sized chunks.
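
As a back-of-the-envelope illustration (the numbers are made up for the example, not measured): fetching a 1 GiB snapshot takes 16,384 object fetches with 64 KiB chunks but only 1,024 with 1 MiB chunks, so when per-request cost dominates storage cost, larger chunks win.

```rust
// Illustration only: request count versus chunk size for a fixed snapshot.
fn fetches_needed(snapshot_bytes: u64, chunk_size: u64) -> u64 {
    snapshot_bytes.div_ceil(chunk_size)
}

fn main() {
    let snapshot = 1u64 << 30; // a 1 GiB, mostly static database
    for chunk_size in [64u64 << 10, 1 << 20] {
        println!(
            "{:>7}-byte chunks -> {:>5} fetches",
            chunk_size,
            fetches_needed(snapshot, chunk_size)
        );
    }
}
```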

I think the final state might look like a special case for the root page, and smaller pages for B-tree leaves (e.g., after the first ~ sqrt(n) bytes of data), but a single tunable is already useful.
