Add fast, layer-merged forward and inverse NTT code. #610

Open

wants to merge 2 commits into base: main

Conversation

rod-chapman
Contributor

Fast NTT and Inverse NTT in C

This PR introduces an optimized C implementation of the forward and inverse NTT for ML-KEM.

Significant changes

  1. Complete re-write of ntt.c - see the comments at the top of that file
    for an overview of the optimization approach. (A sketch of the general
    idea is given after this list.)
  2. New per-layer Zeta tables are now auto-generated (by the updated autogenerate_file.py script)
    into mlkem/zetas.i. Note that zetas.i is NOT a standalone translation unit in C; it is
    intended to be included by cpp into the body of ntt.c, so that the literal constant
    values in the tables are available to the compiler and optimizer for auto-vectorization.
  3. Small change to poly.c/poly_mulcache_compute() to use these new Zeta tables.
  4. The function basemul_cached() has moved from ntt.c to poly.c. It is only
    ever called in the latter unit, so really has no place in ntt.c. This also means
    that ntt.c implements the NTT and its inverse and nothing else, giving that unit
    greater cohesion. basemul_cached() is now declared "static" in poly.c and is
    available for inlining and auto-vectorization at the compiler's discretion.
    With GCC 14.2.0 -O3 on Apple M1, this saves approx. 5000 cycles in all 3 top-level operations.
  5. All CBMC proofs updated accordingly.
  6. Contracts in ntt.h have NOT changed, so the new implementation meets the
    same proof obligations as before.
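
For concreteness, here is a minimal sketch of the layer-merging idea mentioned in item 1 (illustrative only, not the PR's exact code): two NTT layers are applied to each 32-coefficient block in a single pass, using fixed-trip-count inner loops whose zeta constants come from the tables that zetas.i makes visible to the optimizer as literals. The table names layer4_zetas/layer5_zetas and the helper ct_butterfly() are placeholder names; fqmul() (Montgomery multiplication) and MLKEM_N are assumed from the existing reference code.

```c
#include <stdint.h>
/* zetas.i is textually included so the zeta values are compile-time literals.
 * The table names below are assumed for this sketch. */
#include "zetas.i" /* assumed to define layer4_zetas[8] and layer5_zetas[16] */

/* Cooley-Tukey butterfly on coefficients r[lo], r[hi] with twiddle zeta.
 * fqmul() is the reference Montgomery multiplication; coefficient bounds
 * are assumed to be tracked elsewhere (e.g. by the CBMC contracts). */
static void ct_butterfly(int16_t r[MLKEM_N], unsigned lo, unsigned hi,
                         int16_t zeta)
{
  const int16_t t = fqmul(r[hi], zeta);
  r[hi] = (int16_t)(r[lo] - t);
  r[lo] = (int16_t)(r[lo] + t);
}

/* Layers 4 and 5 merged: one pass over each 32-coefficient block. */
static void ntt_layer45(int16_t r[MLKEM_N])
{
  unsigned block, j;
  for (block = 0; block < 8; block++)
  {
    const unsigned start = block * 32;
    const int16_t z4 = layer4_zetas[block];
    const int16_t z5a = layer5_zetas[2 * block];
    const int16_t z5b = layer5_zetas[2 * block + 1];
    /* Fixed trip count and contiguous accesses: friendly to auto-vectorization */
    for (j = 0; j < 8; j++)
    {
      /* layer 4: distance-16 butterflies */
      ct_butterfly(r, start + j, start + j + 16, z4);
      ct_butterfly(r, start + j + 8, start + j + 24, z4);
      /* layer 5: distance-8 butterflies on the coefficients just updated */
      ct_butterfly(r, start + j, start + j + 8, z5a);
      ct_butterfly(r, start + j + 16, start + j + 24, z5b);
    }
  }
}
```

Merging layers this way keeps each group of coefficients in registers across two butterfly stages and exposes straight-line, stride-friendly loops to the auto-vectorizer.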

Minor changes

  1. cbmc/Makefile.common has been updated so that the final 4 lines
    of logs/result.txt are shown following make result

Verification

  1. Lint OK
  2. All functional tests OK, both with and without MLKEM_DEBUG
  3. All CBMC proofs OK with MLKEM_K=[2,3,4]

Performance

Results on top-level operations will appear from CI runs.

Results for low-level poly_ntt() and poly_invntt_tomont() using the
"--components" switch of "tests bench" on various platforms:

Graviton 3 (c7g instance), GCC 13.2.0 -O3

Clock cycles, lower is better

| Operation | Before | After |
| --- | --- | --- |
| NTT | 1853 | 705 |
| Inv NTT | 4340 | 908 |

x86_64 (c7i instance), GCC 13.2.0 -O3

Clock cycles, lower is better

| Operation | Before | After |
| --- | --- | --- |
| NTT | 2367 | 1172 |
| Inv NTT | 5420 | 1901 |

Apple M1 (MacBook Pro), GCC 14.2.0 -O3

Clock cycles, lower is better

| Operation | Before | After |
| --- | --- | --- |
| NTT | 1368 | 366 |
| Inv NTT | 1619 | 439 |

To come

More results to come:

  1. CPUs that can vectorize, but not AArch64 or Intel (e.g. PowerPC)
  2. CPUs that cannot vectorize (e.g. low-end embedded platforms, such as 32-bit RISC-V)

Code Size

On Apple M1 with GCC 14.2.0 -O3

Size of text segment of ntt.o:

Before: 2212 bytes
After: 6924 bytes

@rod-chapman requested a review from a team as a code owner on January 2, 2025 09:11
@hanno-becker added the "benchmark this PR should be benchmarked in CI" label on Jan 2, 2025
@oqs-bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 29055 cycles | 29046 cycles | 1.00 |
| ML-KEM-512 encaps | 35416 cycles | 35405 cycles | 1.00 |
| ML-KEM-512 decaps | 45889 cycles | 45880 cycles | 1.00 |
| ML-KEM-768 keypair | 49360 cycles | 49365 cycles | 1.00 |
| ML-KEM-768 encaps | 55621 cycles | 55567 cycles | 1.00 |
| ML-KEM-768 decaps | 70405 cycles | 70328 cycles | 1.00 |
| ML-KEM-1024 keypair | 72065 cycles | 72056 cycles | 1.00 |
| ML-KEM-1024 encaps | 80792 cycles | 80837 cycles | 1.00 |
| ML-KEM-1024 decaps | 100660 cycles | 100700 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Intel Xeon 4th gen (c7i)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 13532 cycles | 13956 cycles | 0.97 |
| ML-KEM-512 encaps | 17333 cycles | 17291 cycles | 1.00 |
| ML-KEM-512 decaps | 22900 cycles | 23181 cycles | 0.99 |
| ML-KEM-768 keypair | 22536 cycles | 22560 cycles | 1.00 |
| ML-KEM-768 encaps | 24566 cycles | 24504 cycles | 1.00 |
| ML-KEM-768 decaps | 32594 cycles | 32450 cycles | 1.00 |
| ML-KEM-1024 keypair | 31394 cycles | 31395 cycles | 1.00 |
| ML-KEM-1024 encaps | 34952 cycles | 34968 cycles | 1.00 |
| ML-KEM-1024 decaps | 45836 cycles | 45831 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 18106 cycles | 18094 cycles | 1.00 |
| ML-KEM-512 encaps | 23033 cycles | 22911 cycles | 1.01 |
| ML-KEM-512 decaps | 30221 cycles | 30205 cycles | 1.00 |
| ML-KEM-768 keypair | 31107 cycles | 31104 cycles | 1.00 |
| ML-KEM-768 encaps | 33901 cycles | 33887 cycles | 1.00 |
| ML-KEM-768 decaps | 44545 cycles | 44494 cycles | 1.00 |
| ML-KEM-1024 keypair | 44753 cycles | 44686 cycles | 1.00 |
| ML-KEM-1024 encaps | 49982 cycles | 49954 cycles | 1.00 |
| ML-KEM-1024 decaps | 64430 cycles | 64342 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

AMD EPYC 4th gen (c7a)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 14914 cycles | 14920 cycles | 1.00 |
| ML-KEM-512 encaps | 19657 cycles | 19650 cycles | 1.00 |
| ML-KEM-512 decaps | 26301 cycles | 26315 cycles | 1.00 |
| ML-KEM-768 keypair | 25597 cycles | 25586 cycles | 1.00 |
| ML-KEM-768 encaps | 28072 cycles | 28096 cycles | 1.00 |
| ML-KEM-768 decaps | 37818 cycles | 37908 cycles | 1.00 |
| ML-KEM-1024 keypair | 35932 cycles | 35215 cycles | 1.02 |
| ML-KEM-1024 encaps | 40942 cycles | 40027 cycles | 1.02 |
| ML-KEM-1024 decaps | 54413 cycles | 53472 cycles | 1.02 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 20350 cycles | 20355 cycles | 1.00 |
| ML-KEM-512 encaps | 26953 cycles | 26954 cycles | 1.00 |
| ML-KEM-512 decaps | 35737 cycles | 35779 cycles | 1.00 |
| ML-KEM-768 keypair | 34908 cycles | 34905 cycles | 1.00 |
| ML-KEM-768 encaps | 38182 cycles | 38169 cycles | 1.00 |
| ML-KEM-768 decaps | 50974 cycles | 50951 cycles | 1.00 |
| ML-KEM-1024 keypair | 47974 cycles | 47981 cycles | 1.00 |
| ML-KEM-1024 encaps | 54125 cycles | 54176 cycles | 1.00 |
| ML-KEM-1024 decaps | 71597 cycles | 71674 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 29699 cycles | 33077 cycles | 0.90 |
| ML-KEM-512 encaps | 33533 cycles | 38822 cycles | 0.86 |
| ML-KEM-512 decaps | 41387 cycles | 50806 cycles | 0.81 |
| ML-KEM-768 keypair | 51518 cycles | 54897 cycles | 0.94 |
| ML-KEM-768 encaps | 55748 cycles | 60738 cycles | 0.92 |
| ML-KEM-768 decaps | 66723 cycles | 75882 cycles | 0.88 |
| ML-KEM-1024 keypair | 76880 cycles | 81820 cycles | 0.94 |
| ML-KEM-1024 encaps | 84202 cycles | 91788 cycles | 0.92 |
| ML-KEM-1024 decaps | 98084 cycles | 111489 cycles | 0.88 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton3

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 18954 cycles | 18967 cycles | 1.00 |
| ML-KEM-512 encaps | 23561 cycles | 23560 cycles | 1.00 |
| ML-KEM-512 decaps | 30706 cycles | 30694 cycles | 1.00 |
| ML-KEM-768 keypair | 32310 cycles | 32313 cycles | 1.00 |
| ML-KEM-768 encaps | 35877 cycles | 35896 cycles | 1.00 |
| ML-KEM-768 decaps | 46025 cycles | 46038 cycles | 1.00 |
| ML-KEM-1024 keypair | 46541 cycles | 46621 cycles | 1.00 |
| ML-KEM-1024 encaps | 52423 cycles | 52443 cycles | 1.00 |
| ML-KEM-1024 decaps | 66242 cycles | 66278 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 37626 cycles | 43325 cycles | 0.87 |
| ML-KEM-512 encaps | 42482 cycles | 51813 cycles | 0.82 |
| ML-KEM-512 decaps | 52707 cycles | 67018 cycles | 0.79 |
| ML-KEM-768 keypair | 63281 cycles | 71608 cycles | 0.88 |
| ML-KEM-768 encaps | 69996 cycles | 82680 cycles | 0.85 |
| ML-KEM-768 decaps | 83834 cycles | 103114 cycles | 0.81 |
| ML-KEM-1024 keypair | 95928 cycles | 106742 cycles | 0.90 |
| ML-KEM-1024 encaps | 105214 cycles | 121052 cycles | 0.87 |
| ML-KEM-1024 decaps | 123282 cycles | 147326 cycles | 0.84 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 33602 cycles | 39577 cycles | 0.85 |
| ML-KEM-512 encaps | 37594 cycles | 45619 cycles | 0.82 |
| ML-KEM-512 decaps | 46401 cycles | 59080 cycles | 0.79 |
| ML-KEM-768 keypair | 56468 cycles | 64519 cycles | 0.88 |
| ML-KEM-768 encaps | 62163 cycles | 72950 cycles | 0.85 |
| ML-KEM-768 decaps | 74127 cycles | 91111 cycles | 0.81 |
| ML-KEM-1024 keypair | 85447 cycles | 96073 cycles | 0.89 |
| ML-KEM-1024 encaps | 93431 cycles | 107206 cycles | 0.87 |
| ML-KEM-1024 decaps | 109380 cycles | 130548 cycles | 0.84 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton2

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 29053 cycles | 29041 cycles | 1.00 |
| ML-KEM-512 encaps | 35403 cycles | 35393 cycles | 1.00 |
| ML-KEM-512 decaps | 45901 cycles | 45891 cycles | 1.00 |
| ML-KEM-768 keypair | 49386 cycles | 49383 cycles | 1.00 |
| ML-KEM-768 encaps | 55648 cycles | 55591 cycles | 1.00 |
| ML-KEM-768 decaps | 70466 cycles | 70353 cycles | 1.00 |
| ML-KEM-1024 keypair | 72125 cycles | 72079 cycles | 1.00 |
| ML-KEM-1024 encaps | 80883 cycles | 80862 cycles | 1.00 |
| ML-KEM-1024 decaps | 100777 cycles | 100733 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton4

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 18125 cycles | 18117 cycles | 1.00 |
| ML-KEM-512 encaps | 22176 cycles | 22174 cycles | 1.00 |
| ML-KEM-512 decaps | 28835 cycles | 28829 cycles | 1.00 |
| ML-KEM-768 keypair | 30558 cycles | 30564 cycles | 1.00 |
| ML-KEM-768 encaps | 33621 cycles | 33626 cycles | 1.00 |
| ML-KEM-768 decaps | 43169 cycles | 43166 cycles | 1.00 |
| ML-KEM-1024 keypair | 44171 cycles | 44176 cycles | 1.00 |
| ML-KEM-1024 encaps | 49642 cycles | 49658 cycles | 1.00 |
| ML-KEM-1024 decaps | 62616 cycles | 62640 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 45619 cycles | 51405 cycles | 0.89 |
| ML-KEM-512 encaps | 50810 cycles | 59550 cycles | 0.85 |
| ML-KEM-512 decaps | 62915 cycles | 76537 cycles | 0.82 |
| ML-KEM-768 keypair | 76152 cycles | 84282 cycles | 0.90 |
| ML-KEM-768 encaps | 83475 cycles | 94962 cycles | 0.88 |
| ML-KEM-768 decaps | 99733 cycles | 117171 cycles | 0.85 |
| ML-KEM-1024 keypair | 113603 cycles | 124526 cycles | 0.91 |
| ML-KEM-1024 encaps | 123655 cycles | 138736 cycles | 0.89 |
| ML-KEM-1024 decaps | 144785 cycles | 167382 cycles | 0.86 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton3 (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 35706 cycles | 39348 cycles | 0.91 |
| ML-KEM-512 encaps | 39349 cycles | 45455 cycles | 0.87 |
| ML-KEM-512 decaps | 47936 cycles | 57377 cycles | 0.84 |
| ML-KEM-768 keypair | 60354 cycles | 65850 cycles | 0.92 |
| ML-KEM-768 encaps | 65388 cycles | 73802 cycles | 0.89 |
| ML-KEM-768 decaps | 77205 cycles | 89876 cycles | 0.86 |
| ML-KEM-1024 keypair | 91698 cycles | 98993 cycles | 0.93 |
| ML-KEM-1024 encaps | 99226 cycles | 110056 cycles | 0.90 |
| ML-KEM-1024 decaps | 114902 cycles | 130883 cycles | 0.88 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton4 (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 34744 cycles | 37951 cycles | 0.92 |
| ML-KEM-512 encaps | 38105 cycles | 43322 cycles | 0.88 |
| ML-KEM-512 decaps | 47520 cycles | 55510 cycles | 0.86 |
| ML-KEM-768 keypair | 58216 cycles | 62975 cycles | 0.92 |
| ML-KEM-768 encaps | 63185 cycles | 70419 cycles | 0.90 |
| ML-KEM-768 decaps | 76063 cycles | 86890 cycles | 0.88 |
| ML-KEM-1024 keypair | 88106 cycles | 94481 cycles | 0.93 |
| ML-KEM-1024 encaps | 95974 cycles | 105197 cycles | 0.91 |
| ML-KEM-1024 decaps | 112900 cycles | 126506 cycles | 0.89 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton2 (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 54566 cycles | 60765 cycles | 0.90 |
| ML-KEM-512 encaps | 60731 cycles | 69723 cycles | 0.87 |
| ML-KEM-512 decaps | 74738 cycles | 88788 cycles | 0.84 |
| ML-KEM-768 keypair | 92711 cycles | 102028 cycles | 0.91 |
| ML-KEM-768 encaps | 101596 cycles | 114142 cycles | 0.89 |
| ML-KEM-768 decaps | 120218 cycles | 139370 cycles | 0.86 |
| ML-KEM-1024 keypair | 141471 cycles | 153975 cycles | 0.92 |
| ML-KEM-1024 encaps | 153738 cycles | 169718 cycles | 0.91 |
| ML-KEM-1024 decaps | 178067 cycles | 202204 cycles | 0.88 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Bananapi bpi-f3 benchmarks

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 280848 cycles | 334938 cycles | 0.84 |
| ML-KEM-512 encaps | 316448 cycles | 443581 cycles | 0.71 |
| ML-KEM-512 decaps | 404008 cycles | 591786 cycles | 0.68 |
| ML-KEM-768 keypair | 478553 cycles | 559178 cycles | 0.86 |
| ML-KEM-768 encaps | 524327 cycles | 697637 cycles | 0.75 |
| ML-KEM-768 decaps | 641776 cycles | 889082 cycles | 0.72 |
| ML-KEM-1024 keypair | 721477 cycles | 828079 cycles | 0.87 |
| ML-KEM-1024 encaps | 781700 cycles | 1000643 cycles | 0.78 |
| ML-KEM-1024 decaps | 925584 cycles | 1231971 cycles | 0.75 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 52008 cycles | 51776 cycles | 1.00 |
| ML-KEM-512 encaps | 58668 cycles | 58359 cycles | 1.01 |
| ML-KEM-512 decaps | 75031 cycles | 74228 cycles | 1.01 |
| ML-KEM-768 keypair | 87951 cycles | 87850 cycles | 1.00 |
| ML-KEM-768 encaps | 95804 cycles | 96467 cycles | 0.99 |
| ML-KEM-768 decaps | 119619 cycles | 119522 cycles | 1.00 |
| ML-KEM-1024 keypair | 130753 cycles | 131631 cycles | 0.99 |
| ML-KEM-1024 encaps | 144002 cycles | 145213 cycles | 0.99 |
| ML-KEM-1024 decaps | 175925 cycles | 177906 cycles | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 58341 cycles | 58342 cycles | 1.00 |
| ML-KEM-512 encaps | 65751 cycles | 65810 cycles | 1.00 |
| ML-KEM-512 decaps | 84550 cycles | 84601 cycles | 1.00 |
| ML-KEM-768 keypair | 99016 cycles | 98963 cycles | 1.00 |
| ML-KEM-768 encaps | 110536 cycles | 110566 cycles | 1.00 |
| ML-KEM-768 decaps | 136898 cycles | 136500 cycles | 1.00 |
| ML-KEM-1024 keypair | 150306 cycles | 150087 cycles | 1.00 |
| ML-KEM-1024 encaps | 166814 cycles | 166550 cycles | 1.00 |
| ML-KEM-1024 decaps | 202887 cycles | 202797 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@hanno-becker added and then removed the "benchmark this PR should be benchmarked in CI" label on Jan 2, 2025
@hanno-becker
Contributor

hanno-becker commented Jan 2, 2025

Thanks a lot @rod-chapman, also for preparing this in a clean way that already passes CI. This is very impressive performance at the C level!

My concerns, as discussed offline before, are:

  1. This is a large deviation from the reference implementation. There are two issues with this:
    • We inherit trust from the reference implementation. Are we risking losing a potential customer because of this? Is the performance uplift worth the deviation? I'm not sure yet.
    • Code size: some customers care more about code size than performance, even in the application domain. The reference implementation and our current version of it are quite compact, and may be an option for those customers. If we put performance first even in the C backend, we may lose those customers because we no longer provide a compact base implementation.
  2. Which customer will actually benefit from this? mlkem-native targets server, mobile and application cores, which I'd say are almost exclusively AArch64 or x86_64+AVX2, and hence covered by the existing native backends. I agree that your code is likely faster than the reference [inv]NTT on other architectures as well, but arguing this way feels like putting the cart before the horse.

On the other hand, the code fits well with the theme of simultaneous assurance + performance, and will make mlkem-native very competitive even in C vs. C comparisons. And... you have already done the work.

@mkannwischer Please join in and share your thoughts.

mlkem/ntt.c: review thread resolved (outdated)
cbmc/Makefile.common: review thread resolved (outdated)
@hanno-becker force-pushed the fastntt2 branch 2 times, most recently from a5b999e to f71ebbf on January 2, 2025 15:16
@hanno-becker
Contributor

hanno-becker commented Jan 2, 2025

> The function basemul_cached() has moved from ntt.c to poly.c. It is only
> ever called in the latter unit, so really has no place in ntt.c. This also means
> that ntt.c implements the NTT and its inverse and nothing else, so greater
> cohesion of this unit. basemul_cached() is now declared "static" in poly.c and is
> available for inlining and auto-vectorization at the compiler's discretion.
> With GCC 14.2.0 -O3 on Apple M1, this saves approx 5000 cycles in all 3 top-level operations.

NB: If we go down the route of further optimizing the C code at the cost of deviating from the reference implementation, the vector-vector base multiplication should match the structure of the native AArch64 implementation, computing one scalar product of vectors of polynomials of degree 2 at a time. This reduces the number of modular reductions by a factor of MLKEM_K.
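
To illustrate the idea (a rough sketch under assumed names and bounds, not the native AArch64 code and not necessarily what this PR would adopt): the MLKEM_K per-polynomial products are accumulated in 32-bit integers and a single Montgomery reduction is applied at the end, instead of reducing once per polynomial. montgomery_reduce() and the mulcache convention (b_cache holding b1 * zeta) are assumed from the reference code.

```c
#include <stdint.h>

/* Accumulate-then-reduce basemul sketch: for each pair of output
 * coefficients, sum the MLKEM_K degree-2 products in 32-bit accumulators
 * and reduce once. Names, parameter shapes and bounds are assumptions. */
static void basemul_acc_cached(int16_t r[MLKEM_N],
                               const int16_t a[MLKEM_K][MLKEM_N],
                               const int16_t b[MLKEM_K][MLKEM_N],
                               const int16_t b_cache[MLKEM_K][MLKEM_N / 2])
{
  unsigned i, k;
  for (i = 0; i < MLKEM_N / 2; i++)
  {
    int32_t acc0 = 0, acc1 = 0;
    for (k = 0; k < MLKEM_K; k++)
    {
      const int16_t a0 = a[k][2 * i], a1 = a[k][2 * i + 1];
      const int16_t b0 = b[k][2 * i], b1 = b[k][2 * i + 1];
      acc0 += (int32_t)a0 * b0 + (int32_t)a1 * b_cache[k][i]; /* b_cache = b1 * zeta */
      acc1 += (int32_t)a0 * b1 + (int32_t)a1 * b0;
    }
    /* One Montgomery reduction per accumulator instead of MLKEM_K of them;
     * with suitably bounded inputs the 32-bit sums do not overflow for K <= 4. */
    r[2 * i] = montgomery_reduce(acc0);
    r[2 * i + 1] = montgomery_reduce(acc1);
  }
}
```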

But: @rod-chapman this is already a huge change -- if, as it seems, the code motion of basemul_cached() is independent from the change to the [inv]NTT, can we please not do it in this PR?

@rod-chapman force-pushed the fastntt2 branch 9 times, most recently from d5e3b7d to c99b1fd on January 14, 2025 09:35
@rod-chapman
Contributor Author

Thanks for your thoughts on this PR, and apologies for the latency
in replying.

I sympathize with your main point - why did I do this?!?
(And is it compatible with the main goals of mlkem-native anyway?)

I had a bit of a think about the first question. My own goals for
doing this are probably something like:

  1. To provide a demonstration of proof-driven optimization and that
    "formal is fast" is achievable using C and CBMC.

  2. To force improvement of CBMC.

  3. To provide a "faster, slightly bigger" C backend for mlkem-native
    (more on which below...).

  4. To provide a fast C backend for targets other than AArch64/NEON and x86-64/AVX2,
    for example:
    • AArch64/SVE2 (if and when compilers can do it)
    • x86_64/AVX512 (if and when compilers can do it)
    • PowerPC/AltiVec
    • RISC-V64 with the Vector extension
    • any machine which can't vectorize at all, e.g. lots of 32-bit devices.

  5. To provide a vehicle for teaching CBMC and type-safe programming in C.

  6. To provide some interesting test cases to see how well SLOTHY
    performs on compiler-generated NEON code.

While I realize that most of these are not goals of mlkem-native,
I don't think any of them are obviously contradictory or incompatible
with mlkem-native.

On point 3, I kinda like the idea of keeping one C backend, but offering
a build option for "small code size, slow" (i.e. the current reference NTT code)
or "larger code size, faster" (i.e. my new code). Many other crypto libraries
(e.g. WolfCrypt) have such build-time options for the code-size vs. performance
trade-off.
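
Purely as an illustration of what such a switch could look like (the option name MLKEM_USE_FAST_NTT and the file names are hypothetical, not something this PR defines):

```c
/* Hypothetical build-time selection between a compact reference-style NTT
 * and the layer-merged fast NTT; all names here are placeholders. */
#if defined(MLKEM_USE_FAST_NTT)
#include "ntt_fast.inc.c"    /* layer-merged, auto-vectorizable, larger .text */
#else
#include "ntt_compact.inc.c" /* reference-style loops, smallest code size */
#endif
```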

Thinking about "Who would use or want this?" I can see a few use cases:

  1. Embedded systems people who want reasonable performance on all targets,
    but aren't worried about power analysis attacks.
  2. LibOQS - does LibOQS adoption of mlkem-native change things? Do they
    have more targets in mind other than AArch64 and Intel? Would a fast reference
    C backend be useful for them?
  3. I will reach out to AWS Devices team to see what targets they have in play.

@rod-chapman force-pushed the fastntt2 branch 3 times, most recently from bf84430 to b075527 on January 15, 2025 13:54
@rod-chapman
Contributor Author

I realize that AArch64/SVE2 and x86_64/AVX512 are something of a red herring, since anyone with a CPU implementing those things will also have NEON and AVX2, so they'd naturally use the existing back-ends for those.

@rod-chapman force-pushed the fastntt2 branch 8 times, most recently from b5bde83 to 16cdf94 on January 17, 2025 13:32
@hanno-becker added and then removed the "benchmark this PR should be benchmarked in CI" label on Jan 18, 2025
@hanno-becker
Contributor

hanno-becker commented Jan 18, 2025

The performance impact has gone down, since the switch to unsigned variables has miraculously improved performance by 10-30%. It is still in the range of 10-15%, though.

@rod-chapman force-pushed the fastntt2 branch 2 times, most recently from b25f6e3 to cb650de on January 21, 2025 12:36
Adjust and move proof files accordingly.

Signed-off-by: Rod Chapman <[email protected]>

Add support for NO_INLINE attribute and its
use with CBMC.

Signed-off-by: Rod Chapman <[email protected]>
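
As an illustration of the kind of definition meant here (a common pattern, not the PR's exact code; how it is guarded for CBMC builds is not shown):

```c
/* Common shape of a NO_INLINE attribute macro; illustrative only. */
#if defined(__GNUC__) || defined(__clang__)
#define NO_INLINE __attribute__((noinline))
#else
#define NO_INLINE /* no-op on compilers without the attribute */
#endif
```

The next commit then applies it to the top-level ntt_layer*() functions so their auto-vectorized object code is easy to locate and review.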

Make top-level ntt_layer*() functions NO_INLINE
to improve comprehension and review of auto-vectorization
of this code.

Signed-off-by: Rod Chapman <[email protected]>

Add first two fast Invert NTT functions

Adds
  invntt_layer7_invert_inner()
  invntt_layer7_invert()

and their proof files.

Signed-off-by: Rod Chapman <[email protected]>

Remove dummy call to invntt_layer7_invert()

Signed-off-by: Rod Chapman <[email protected]>

Add Invert NTT Layer 6 functions.

Adds
  invntt_layer6_inner()
  invntt_layer6()
functions and their proof files.

Signed-off-by: Rod Chapman <[email protected]>

Simplify Zeta tables for layers 4 and 5.

Signed-off-by: Rod Chapman <[email protected]>

Adds Inverse NTT Layer54 (merged)

Adds functions
  invntt_layer54_inner()
  invntt_layer54()

and their proof files.

Signed-off-by: Rod Chapman <[email protected]>

Use new zeta constants in ntt_layer123()

Signed-off-by: Rod Chapman <[email protected]>

Adds first full implementation of Inverse NTT with layer merging.

Adds function invntt_layer321() and its proof.

Updates top-level poly_invntt_tomont() to call new layer-merged implementation.

Renames existing implementation as poly_invntt_tomont_ref() for now.

Signed-off-by: Rod Chapman <[email protected]>

Update proof of poly_invntt_tomont()

Signed-off-by: Rod Chapman <[email protected]>

Remove reference implementation of poly_invntt_tomont()

Also removes local function invntt_layer()
and its proof.

Signed-off-by: Rod Chapman <[email protected]>

Switch to Z3 for these proofs, which is much faster than Bitwuzla.

Signed-off-by: Rod Chapman <[email protected]>

Switch back to Z3 for proof of ntt_layer123()

Signed-off-by: Rod Chapman <[email protected]>

Rename ntt_layer45_slice() to ntt_inner45_inner()

Signed-off-by: Rod Chapman <[email protected]>

Rename proof files

Signed-off-by: Rod Chapman <[email protected]>

Further renaming of ntt_layer45_slice() to ntt_layer45_inner()

Signed-off-by: Rod Chapman <[email protected]>

Rename *_slice() to *_inner() functions - phase 1

Signed-off-by: Rod Chapman <[email protected]>

Rename *_slice() functions to *_inner() - phase 2

Signed-off-by: Rod Chapman <[email protected]>

Renaming *slice() to *inner() - phase 3

Signed-off-by: Rod Chapman <[email protected]>

Display final 4 lines of logs/result.txt after make result

Signed-off-by: Rod Chapman <[email protected]>

Rename inner to butterfly for all internal ntt functions

Signed-off-by: Rod Chapman <[email protected]>

Rename harness function source files

Signed-off-by: Rod Chapman <[email protected]>

Adjust proof Makefiles and harnesses for renaming inner to butterfly

Signed-off-by: Rod Chapman <[email protected]>

Update Makefiles for function renaming

Signed-off-by: Rod Chapman <[email protected]>

Rename all inner functions to butterfly

Signed-off-by: Rod Chapman <[email protected]>

Replace literal 255 with (MLKEM_N - 1) throughout

Signed-off-by: Rod Chapman <[email protected]>

Move declaration of Zeta tables to be as close to their
point of first use as possible.

Add and adjust comments.

Signed-off-by: Rod Chapman <[email protected]>

Align and rename Zetas tables.

Signed-off-by: Rod Chapman <[email protected]>

Zetas tables are all declared with static scope

Signed-off-by: Rod Chapman <[email protected]>

Auto-generate Zeta tables for new layer-merged NTT

Signed-off-by: Rod Chapman <[email protected]>

This proof no longer needs zetas.c as a source file

Signed-off-by: Rod Chapman <[email protected]>

Move basemul_cached() function from ntt.[hc] to poly.c

It is now local, and static within poly.c, so is amenable
to inlining and auto-vectorization within that unit.

Signed-off-by: Rod Chapman <[email protected]>

Make basemul_cached() inline-able

Signed-off-by: Rod Chapman <[email protected]>

Add top-level explanatory comments

Signed-off-by: Rod Chapman <[email protected]>

Clarify and correct 1 typo in comment only.

Signed-off-by: Rod Chapman <[email protected]>

Move symlink from zetas.c to zetas.i

Signed-off-by: Rod Chapman <[email protected]>

Correct INVNTT_BOUND_REF for new implementation of Inverse NTT

Signed-off-by: Rod Chapman <[email protected]>

Add MLKEM_NAMESPACE to mlkem_layer7_zetas, since it is a global symbol

Signed-off-by: Rod Chapman <[email protected]>

Remove boilerplate comments from proof harness sources.

Signed-off-by: Rod Chapman <[email protected]>

Updates for namespacing of all static functions.

Signed-off-by: Rod Chapman <[email protected]>

Update this file following changes to other files

Signed-off-by: Rod Chapman <[email protected]>

Remove pc typedef and use int16_t r[MLKEM_N] instead throughout

Signed-off-by: Rod Chapman <[email protected]>

Adds namespacing to all static constant lookup tables.

In support of the monolithic build.

Signed-off-by: Rod Chapman <[email protected]>

Update auto-generated files following name-spacing of Zeta tables

Signed-off-by: Rod Chapman <[email protected]>

Remove declaration of basemul_cached() from ntt.h following rebase against PR 623

Signed-off-by: Rod Chapman <[email protected]>

Correct declaration of basemul_cached() to be static

Signed-off-by: Rod Chapman <[email protected]>

basemul_cached() is explicitly static, so does not also need MLKEM_NATIVE_INTERNAL_API

Signed-off-by: Rod Chapman <[email protected]>

remove sym link to zetas.c

Signed-off-by: Rod Chapman <[email protected]>

Add new symlink for zetas.i

Signed-off-by: Rod Chapman <[email protected]>

Add internal ct_butterfly() function and use it in ntt_layer123()

Same for other NTT layers and InvNTT is TBD.

Signed-off-by: Rod Chapman <[email protected]>

Update autogenerated file

Signed-off-by: Rod Chapman <[email protected]>

Update invariant and constants for exclusive array bounds checks.

Updates only ntt_layer123() for now to confirm effectiveness
of updates. Other functions will be updated once all is well.

Signed-off-by: Rod Chapman <[email protected]>

Update proofs for exclusive upper bound.

1. Upper bound of quantified ranges is now exclusive
2. Upper bound of array element range constraint is now exclusive.

Signed-off-by: Rod Chapman <[email protected]>
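
For concreteness, this is roughly what an exclusive-bound loop contract looks like in raw CBMC syntax (illustrative only; the repository wraps these annotations in its own macros, and barrett_reduce()/MLKEM_Q are assumed from the existing code):

```c
/* Exclusive bounds throughout: "j <= MLKEM_N" at the loop head and
 * "k < j" for the elements already processed. Illustrative only. */
for (unsigned j = 0; j < MLKEM_N; j++)
__CPROVER_assigns(j, __CPROVER_object_whole(r))
__CPROVER_loop_invariant(j <= MLKEM_N)
__CPROVER_loop_invariant(__CPROVER_forall {
  unsigned k;
  (k < j) ==> (-MLKEM_Q < r[k] && r[k] < MLKEM_Q)
})
{
  r[j] = barrett_reduce(r[j]); /* any update that establishes the bound */
}
```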

Re-generate all auto-generated files after rebase

Signed-off-by: Rod Chapman <[email protected]>

Use inner ct_butterfly() function in all forward NTT functions.

Signed-off-by: Rod Chapman <[email protected]>

Updates for readability following review.

Introduces local gs_butterfly_reduce() and
gs_butterfly_defer() functions, which are inlined
for both compilation and proof, but significantly
simplify and improve readability of the calling
functions.

Signed-off-by: Rod Chapman <[email protected]>
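
For readers following along, the two helpers plausibly look something like this (a reconstruction for illustration, not the PR's exact code; fqmul() and barrett_reduce() are assumed from the reference implementation, and the bounds reasoning that justifies deferring the reduction is tracked by the CBMC contracts rather than shown here):

```c
/* Gentleman-Sande (inverse NTT) butterfly that reduces the sum immediately. */
static void gs_butterfly_reduce(int16_t r[MLKEM_N], unsigned lo, unsigned hi,
                                int16_t zeta)
{
  const int16_t t = r[lo];
  r[lo] = barrett_reduce((int16_t)(t + r[hi]));
  r[hi] = fqmul((int16_t)(r[hi] - t), zeta);
}

/* Variant that defers the reduction of the sum, used where the bounds
 * analysis shows it is not yet needed. */
static void gs_butterfly_defer(int16_t r[MLKEM_N], unsigned lo, unsigned hi,
                               int16_t zeta)
{
  const int16_t t = r[lo];
  r[lo] = (int16_t)(t + r[hi]);
  r[hi] = fqmul((int16_t)(r[hi] - t), zeta);
}
```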

Update auto-generated files after rebase

Signed-off-by: Rod Chapman <[email protected]>

Remove final STATIC_ASSERT() macro use

Signed-off-by: Rod Chapman <[email protected]>

Switch to unsigned type for loop index variables.

Signed-off-by: Rod Chapman <[email protected]>

Clarify comment on vectorization strategy

Signed-off-by: Rod Chapman <[email protected]>

Remove redundant proof files no longer needed on this branch.

Signed-off-by: Rod Chapman <[email protected]>

Correct use of BOUND() macro to debug_assert_abs_bound()

Signed-off-by: Rod Chapman <[email protected]>

Use "unsigned" where possible for zeta_index, start formal
parameters and loop index variables.

Drop redundant lower-bound for these objects in pre-conditions
and loop invariants.

Signed-off-by: Rod Chapman <[email protected]>

Correct bound pre-condition on basemul_cached()

Signed-off-by: Rod Chapman <[email protected]>

Update auto-generated files after rebase

Signed-off-by: Rod Chapman <[email protected]>

Correct new CPP directives following rebase

Signed-off-by: Rod Chapman <[email protected]>

Restore basemul_cached() in poly.c following rebase.

Signed-off-by: Rod Chapman <[email protected]>

Switches to use "unsigned" not "int" for all coefficient indexes.

Consistent with a similar change to the reference code on main branch.

Signed-off-by: Rod Chapman <[email protected]>
…sole caller.

ntt_layer7_butterfly()
ntt_layer6_butterfly()
invntt_layer7_invert_butterfly()
invntt_layer6_butterfly()

Resulting code is easier to read, and has no effect on performance or
proof. Removes now-redundant proof files for these functions.

Signed-off-by: Rod Chapman <[email protected]>
Labels: benchmark this PR should be benchmarked in CI

4 participants