-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize RLP serialization #765
base: master
Are you sure you want to change the base?
Conversation
bc0089e
to
cc36a5c
Compare
when nestedListsDepth > 0: | ||
var tracker = StaticRlpLengthTracker[nestedListsDepth]() | ||
elif nestedListsDepth == 0: | ||
var tracker = DynamicRlpLengthTracker() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of types need DynamicLengthTracker
? a string/int? Can we use StaticRlpLengthTracker[1]()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Types that are recursive and only fully defined during runtime. By a full definition I mean, these types assume the full structure of the object they hold only at runtime. At compile time they are empty. For example, JsonNode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
references too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
I'm curious why block hashing using hash writer is slower than default writer + hash. And can you add block header hashing measurement too? Because in nimbus-eth1 we only use transaction hashing and block header hashing only. Almost no need to hash a full block. |
I have been wondering the same thing for almost two weeks now. I made a lot of measurements using the time profiler on MacOS Instruments but couldn't corner it down. Then yesterday, I switched the tests, i.e. I ran the default writer + hash benchmark first and then the hash writer. This time hash writer was faster in the same order of magnitude. I think it has something to do with caching or garbage collection, I'm not sure though but I'm trying to corner this down. what is weird is that, the benchmark is calculated by collecting 100 measurements and then averaging them out. So caching shouldn't be an issue. Additionally, the encoding benchmarks before hashing benchmarks use the same data which means it should already be cached
Yes, will add this |
Nim generic instantiation cache(in the compiler) is known to have a bug. And by looking into the bug tracker, I believe it still there. |
refc GC
orc GC
|
So the benchmarks collected before are biased because of the CPU cache (I verified this). So I collected new benchmarks and ran each benchmark as a different process to avoid cache hits causing skewed/biased data. The code for the benchmarks is attached after the benchmarks. EDIT: The collected benchmarks for rlp hashing a block with varying number of transactions
nim compile --mm:orc -d:blk -d:release --out:bench_blk test.nim
nim compile --mm:orc -d:blk80 -d:release --out:bench_blk80 test.nim
nim compile --mm:orc -d:blk320 -d:release --out:bench_blk320 test.nim
nim compile --mm:orc -d:blk640 -d:release --out:bench_blk640 test.nim
nim compile --mm:orc -d:blk1280 -d:release --out:bench_blk1280 test.nim
nim compile --mm:orc -d:blk -d:opt -d:release --out:opt_bench_blk test.nim
nim compile --mm:orc -d:blk80 -d:opt -d:release --out:opt_bench_blk80 test.nim
nim compile --mm:orc -d:blk320 -d:opt -d:release --out:opt_bench_blk320 test.nim
nim compile --mm:orc -d:blk640 -d:opt -d:release --out:opt_bench_blk640 test.nim
nim compile --mm:orc -d:blk1280 -d:opt -d:release --out:opt_bench_blk1280 test.nim |
Also want to point out that the timing improvements for encoding blocks (not hashing) using the new implementation is way more significant than for hashing with new implementations. This is because the keccak implementation adds a huge bias. The next step to improving the rlp hash algo should be optimizing the keccak implementation before optimizing the writers again. |
tests/rlp/test_rlp_profiler.nim
Outdated
let bytes13 = encodeAndHash(myTx) | ||
benchmark "Block Sequence hashing using default writer and then hash": | ||
let bytes14 = encodeAndHash(blkSeq) | ||
benchmark "Block hashing using defualt writer and the hash": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defualt -> default
tests/rlp/test_rlp_profiler.nim
Outdated
let bytes14 = encodeAndHash(blkSeq) | ||
benchmark "Block hashing using defualt writer and the hash": | ||
let bytes15 = encodeAndHash(blkSeq) | ||
benchmark "Block header hashing using hash writer": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Block header hashing using hash writer -> Block header hashing using default writer and the hash
As a general note, It's not to benchmark both, but the operative benchmarks in a short/medium-term way are refc ones. |
here are refc benchmarks:
|
No copyright linter will enforce this in this repository, but still good to check all the modified files in the PR and update their copyright years to 2025. |
b5cdf8d
to
4fc0632
Compare
something to note about benchmarking garbage-collected languages: the cost of allocation and deallocation is often not charged on the spot, but happens "randomly" - specially in a real-world application where you can't really micro-benchmark it in isolation (due to fragmentation etc). Instead of benchmarking, the other way to reason about a change such as this is to measure/reason about the number of allocations that happen during an operation because this will ultimately determine how RLP contributes to the bigger picture in an application - the aim of the optimization is to reduce this number while not compromising on other things - for the ideal case where you're hashing an Another way to look at it is that the objective of this PR can be reduced to " |
Noted
Makes sense, there are two edge cases where dynamic allocation is unavoidable. First, are lists(static or dynamic) within dynamic sized lists. In the RLP world this would include sequences/arrays within dynamic sized sequences and objects within dynamic sized sequences. An example is the transactions field within the block object. The second is where the type of an object doesn't fully assume the shape of the object at compile time. Examples are recurring objects for binary trees and parsed data types like JSON. Apart from this all other cases achieve static memory allocation |
For Second approach, and this has been tested is replacing I prefer to merge this first, and use it in nimbus-eth1 because we have achieve improvements compared to older implementation. If no more objections, I will merge this soon. |
Description
This PR improves upon the memory and time complexity of the RLP serialisation algorithm(
RlpWriter
). The existing implementation of RLP uses theRlpWriter
that extends a buffer and moves memory whenever required. This wastes a lot of time for large payloads. This PR, firstly, implements aRlpLengthTracker
that dry runs the serialisation to calculate the full length of the serialised data and the prefixes of lists. Secondly, theRlpTwoPassWriter
uses the pre-calculated values to assign memory in one shot and avoid moving memory to make space for prefixes.Additionally, the PR also implements a
RlpHashWriter
that directly updates the sponge construction of keccak256 hash to avoid allocating any memory (few exceptions which are negligible).Rationale
Naive implementations of RLP serialisation require moving memory because prefixes of lists can only be calculated after serialising the items within the list. Since complex types such as objects are represented as lists the serialisation of any data structure in Ethereum requires moving the encoded/serialised list items to make space for the prefix. In cases where the encoded buffer is hashed, the encoded list items cannot be absorbed until the prefix is calculated.
However, the encoded lengths of base types can be calculated without actually serialising them (or allocating memory). This is exploited to split the computation into two. First to calculate lengths of the base types, and then to actually serialise the data.
By calculating the lengths of the base types we can find the length of encoded list items and thereby also the length of the prefix (without actually serialising the prefix). Additionally, we also derive the total length of the serialised data from this. This logic is implemented as the
RlpLengthTracker
which recordslistLengths
,prefixLengths
and thetotalLength
.In the second part of the computation, we actually serialise the data (and the prefixes) without skipping over memory or moving memory positions. This is implemented as the
RlpTwoPassWriter
, "two-pass" because we iterate through the data twice. Because the data is iterated through twice theRlpTwoPassWriter
can only be used within theencode
function. In cases where users manually callappend
, theRlpTwoPaasWriter
would require users to manually run theRlpLengthTracker
for the first pass and then the second pass.RlpHashWriter
also uses two passes but for the second pass it absorbs the data directly into the sponge construction of keccak instead of allocating memory thereby saving memory (amortized constant usage). And for similar reasons as stated above, a new api method,encodeHash
is exposed.Changelog
encodeHash
for constant memory RLP encoding and hashing.encode
method to serialise data in "two passes".writer.nim
handling only the dependency injection architecture of theappend
methodtest_hashes.nim
andtest_rlp_profiler.nim
. The first, does a sanity check of the hash writer. The second one, time profiles the new writer implementation to the default implementation.Benchmarks
using
--mm=refc
(refc GC)using
--mm=orc
(orc GC)