add chunk size experiment #341

Merged: 2 commits merged into sigstore:main on Jan 16, 2025

Conversation

@spencerschrock (Contributor) commented on Jan 16, 2025

Summary

Adds a simple experiment to vary chunk size, built on top of #306

So far, the experiment has shown similar results on two different machines (ARM64 macOS and x86 Linux), and on two different models.

Two factors may limit which models can be benchmarked (see the sketch after this list):

  1. The special size of 0, which attempts to read whole files into memory.
  2. `timeit.timeit` disables garbage collection, so even if smaller buffers are used, older ones may stay around in memory.
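
For context, a minimal sketch of the kind of timing loop such an experiment runs. The `read_and_hash` helper, the sizes, and the file path below are illustrative placeholders, not the actual code in benchmarks/exp_chunk.py:

```python
import hashlib
import timeit


def read_and_hash(path: str, chunk_size: int) -> bytes:
    """Hash one file with SHA-256, reading it in chunks of chunk_size bytes.

    A chunk_size of 0 reads the whole file into memory with a single read().
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        if chunk_size == 0:
            digest.update(f.read())
        else:
            while chunk := f.read(chunk_size):
                digest.update(chunk)
    return digest.digest()


# timeit.timeit disables garbage collection during the measurement, so
# buffers from earlier iterations can linger in memory until it finishes.
for size in (0, 1024, 65536, 1048576):
    elapsed = timeit.timeit(
        lambda: read_and_hash("model.bin", size),  # placeholder file path
        number=10,
    )
    print(f"{size}:\t{elapsed:.4f}")
```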

The hashing algorithm was left as SHA256 based on the results from the hashing experiment.

hatch run +py=3.11 bench:chunk /tmp/falcon-7b
0:              8.7508
1024:          76.3885
2048:         127.3363
4096:          75.9053
8192:          46.9426
16384:         20.7537
32768:         11.0133
65536:          8.3529
131072:         7.9244
262144:         7.7939
524288:         7.7006
1048576:        7.6512
2097152:        7.6781
4194304:        7.8679
8388608:        7.8370
16777216:       7.8448
33554432:       8.5892
67108864:       8.5765
134217728:      8.6458
268435456:      8.6698
536870912:      8.6905
1073741824:     8.6779

The benchmark results suggest increasing the chunk size to at least 128 KB (131072), with a 1 MB (1048576) read size producing the best results in this benchmark. Happy to do that in a follow-up PR, or in this one.
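
If the default is bumped in a follow-up, the hashing loop would look roughly like the sketch below. The `DEFAULT_CHUNK_SIZE` constant and `sha256_file` function are hypothetical names for illustration, not the library's actual API:

```python
import hashlib

# Hypothetical constant reflecting the suggestion above: a 1 MB read size,
# which produced the best results in this benchmark.
DEFAULT_CHUNK_SIZE = 1048576


def sha256_file(path: str, chunk_size: int = DEFAULT_CHUNK_SIZE) -> str:
    """Return the hex SHA-256 digest of a file, read in chunk_size-byte chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```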

Release Note

NONE

Documentation

NONE

@spencerschrock requested review from a team as code owners on January 16, 2025 17:44
@mihaimaruseac previously approved these changes on Jan 16, 2025
benchmarks/exp_chunk.py: review comment (resolved)
benchmarks/exp_chunk.py: review comment (outdated, resolved)
@mihaimaruseac merged commit f0a6e96 into sigstore:main on Jan 16, 2025
33 checks passed
@spencerschrock deleted the chunk branch on January 16, 2025 19:13