Add fast, layer-merged forward and inverse NTT code. #610

Open

wants to merge 2 commits into base: main

Conversation

rod-chapman
Contributor

Fast NTT and Inverse NTT in C

This PR introduces an optimized C implementation of the forward and inverse NTT for ML-KEM.

Significant changes

  1. Complete re-write of ntt.c - see the comments at the top of that file
    for an overview of the optimization approach. (A sketch of the general
    idea is given after this list.)
  2. New per-layer Zeta tables are now auto-generated (by the updated autogenerate_file.py script)
    into mlkem/zetas.i. Note that zetas.i is NOT a standalone translation unit in C; it is
    intended to be included by cpp into the body of ntt.c, so that the literal constant
    values in the tables are available to the compiler and optimizer for auto-vectorization.
  3. Small change to poly.c/poly_mulcache_compute() to use these new Zeta tables.
  4. The function basemul_cached() has moved from ntt.c to poly.c. It is only
    ever called in the latter unit, so really has no place in ntt.c. This also means
    that ntt.c implements the NTT and its inverse and nothing else, giving that unit
    greater cohesion. basemul_cached() is now declared "static" in poly.c and is
    available for inlining and auto-vectorization at the compiler's discretion.
    With GCC 14.2.0 -O3 on Apple M1, this saves approx. 5000 cycles in all 3 top-level operations.
  5. All CBMC proofs updated accordingly.
  6. Contracts in ntt.h have NOT changed, so the new implementation meets the
    same proof obligations as before.
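
For concreteness, here is a minimal sketch of the layer-merging idea mentioned in item 1 (illustrative only, not the PR's exact code): two NTT layers are applied to each 32-coefficient block in a single pass, using fixed-trip-count inner loops whose zeta constants come from the tables that zetas.i makes visible to the optimizer as literals. The table names layer4_zetas/layer5_zetas and the helper ct_butterfly() are placeholder names; fqmul() (Montgomery multiplication) and MLKEM_N are assumed from the existing reference code.

```c
#include <stdint.h>
/* zetas.i is textually included so the zeta values are compile-time literals.
 * The table names below are assumed for this sketch. */
#include "zetas.i" /* assumed to define layer4_zetas[8] and layer5_zetas[16] */

/* Cooley-Tukey butterfly on coefficients r[lo], r[hi] with twiddle zeta.
 * fqmul() is the reference Montgomery multiplication; coefficient bounds
 * are assumed to be tracked elsewhere (e.g. by the CBMC contracts). */
static void ct_butterfly(int16_t r[MLKEM_N], unsigned lo, unsigned hi,
                         int16_t zeta)
{
  const int16_t t = fqmul(r[hi], zeta);
  r[hi] = (int16_t)(r[lo] - t);
  r[lo] = (int16_t)(r[lo] + t);
}

/* Layers 4 and 5 merged: one pass over each 32-coefficient block. */
static void ntt_layer45(int16_t r[MLKEM_N])
{
  unsigned block, j;
  for (block = 0; block < 8; block++)
  {
    const unsigned start = block * 32;
    const int16_t z4 = layer4_zetas[block];
    const int16_t z5a = layer5_zetas[2 * block];
    const int16_t z5b = layer5_zetas[2 * block + 1];
    /* Fixed trip count and contiguous accesses: friendly to auto-vectorization */
    for (j = 0; j < 8; j++)
    {
      /* layer 4: distance-16 butterflies */
      ct_butterfly(r, start + j, start + j + 16, z4);
      ct_butterfly(r, start + j + 8, start + j + 24, z4);
      /* layer 5: distance-8 butterflies on the coefficients just updated */
      ct_butterfly(r, start + j, start + j + 8, z5a);
      ct_butterfly(r, start + j + 16, start + j + 24, z5b);
    }
  }
}
```

Merging layers this way keeps each group of coefficients in registers across two butterfly stages and exposes straight-line, stride-friendly loops to the auto-vectorizer.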

Minor changes

  1. cbmc/Makefile.common has been updated so that the final 4 lines
    of logs/result.txt are shown following make result

Verification

  1. Lint OK
  2. All functional tests OK, both with and without MLKEM_DEBUG
  3. All CBMC proofs OK with MLKEM_K=[2,3,4]

Performance

Results on top-level operations will appear from CI runs.

Results for low-level poly_ntt() and poly_invntt_tomont() using the
"--components" switch of "tests bench" on various platforms:

Graviton 3 (c7g instance), GCC 13.2.0 -O3

Clock cycles, lower is better

| Operation | Before | After |
| --- | --- | --- |
| NTT | 1853 | 705 |
| Inv NTT | 4340 | 908 |

x86_64 (c7i instance), GCC 13.2.0 -O3

Clock cycles, lower is better

| Operation | Before | After |
| --- | --- | --- |
| NTT | 2367 | 1172 |
| Inv NTT | 5420 | 1901 |

Apple M1 (MacBook Pro), GCC 14.2.0 -O3

Clock cycles, lower is better

| Operation | Before | After |
| --- | --- | --- |
| NTT | 1368 | 366 |
| Inv NTT | 1619 | 439 |

To come

More results to come:

  1. CPUs that can vectorize, but not AArch64 or Intel (e.g. PowerPC)
  2. CPUs that cannot vectorize (e.g. low-end embedded platforms, such as 32-bit RISC-V)

Code Size

On Apple M1 with GCC 14.2.0 -O3

Size of text segment of ntt.o:

Before: 2212 bytes
After: 6924 bytes

@rod-chapman requested a review from a team as a code owner on January 2, 2025 09:11
@hanno-becker added the "benchmark this PR should be benchmarked in CI" label on Jan 2, 2025
@oqs-bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 29055 cycles | 29046 cycles | 1.00 |
| ML-KEM-512 encaps | 35416 cycles | 35405 cycles | 1.00 |
| ML-KEM-512 decaps | 45889 cycles | 45880 cycles | 1.00 |
| ML-KEM-768 keypair | 49360 cycles | 49365 cycles | 1.00 |
| ML-KEM-768 encaps | 55621 cycles | 55567 cycles | 1.00 |
| ML-KEM-768 decaps | 70405 cycles | 70328 cycles | 1.00 |
| ML-KEM-1024 keypair | 72065 cycles | 72056 cycles | 1.00 |
| ML-KEM-1024 encaps | 80792 cycles | 80837 cycles | 1.00 |
| ML-KEM-1024 decaps | 100660 cycles | 100700 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Intel Xeon 4th gen (c7i)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 13532 cycles | 13956 cycles | 0.97 |
| ML-KEM-512 encaps | 17333 cycles | 17291 cycles | 1.00 |
| ML-KEM-512 decaps | 22900 cycles | 23181 cycles | 0.99 |
| ML-KEM-768 keypair | 22536 cycles | 22560 cycles | 1.00 |
| ML-KEM-768 encaps | 24566 cycles | 24504 cycles | 1.00 |
| ML-KEM-768 decaps | 32594 cycles | 32450 cycles | 1.00 |
| ML-KEM-1024 keypair | 31394 cycles | 31395 cycles | 1.00 |
| ML-KEM-1024 encaps | 34952 cycles | 34968 cycles | 1.00 |
| ML-KEM-1024 decaps | 45836 cycles | 45831 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 18106 cycles | 18094 cycles | 1.00 |
| ML-KEM-512 encaps | 23033 cycles | 22911 cycles | 1.01 |
| ML-KEM-512 decaps | 30221 cycles | 30205 cycles | 1.00 |
| ML-KEM-768 keypair | 31107 cycles | 31104 cycles | 1.00 |
| ML-KEM-768 encaps | 33901 cycles | 33887 cycles | 1.00 |
| ML-KEM-768 decaps | 44545 cycles | 44494 cycles | 1.00 |
| ML-KEM-1024 keypair | 44753 cycles | 44686 cycles | 1.00 |
| ML-KEM-1024 encaps | 49982 cycles | 49954 cycles | 1.00 |
| ML-KEM-1024 decaps | 64430 cycles | 64342 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

AMD EPYC 4th gen (c7a)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 14914 cycles | 14920 cycles | 1.00 |
| ML-KEM-512 encaps | 19657 cycles | 19650 cycles | 1.00 |
| ML-KEM-512 decaps | 26301 cycles | 26315 cycles | 1.00 |
| ML-KEM-768 keypair | 25597 cycles | 25586 cycles | 1.00 |
| ML-KEM-768 encaps | 28072 cycles | 28096 cycles | 1.00 |
| ML-KEM-768 decaps | 37818 cycles | 37908 cycles | 1.00 |
| ML-KEM-1024 keypair | 35932 cycles | 35215 cycles | 1.02 |
| ML-KEM-1024 encaps | 40942 cycles | 40027 cycles | 1.02 |
| ML-KEM-1024 decaps | 54413 cycles | 53472 cycles | 1.02 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 20350 cycles | 20355 cycles | 1.00 |
| ML-KEM-512 encaps | 26953 cycles | 26954 cycles | 1.00 |
| ML-KEM-512 decaps | 35737 cycles | 35779 cycles | 1.00 |
| ML-KEM-768 keypair | 34908 cycles | 34905 cycles | 1.00 |
| ML-KEM-768 encaps | 38182 cycles | 38169 cycles | 1.00 |
| ML-KEM-768 decaps | 50974 cycles | 50951 cycles | 1.00 |
| ML-KEM-1024 keypair | 47974 cycles | 47981 cycles | 1.00 |
| ML-KEM-1024 encaps | 54125 cycles | 54176 cycles | 1.00 |
| ML-KEM-1024 decaps | 71597 cycles | 71674 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 29699 cycles | 33077 cycles | 0.90 |
| ML-KEM-512 encaps | 33533 cycles | 38822 cycles | 0.86 |
| ML-KEM-512 decaps | 41387 cycles | 50806 cycles | 0.81 |
| ML-KEM-768 keypair | 51518 cycles | 54897 cycles | 0.94 |
| ML-KEM-768 encaps | 55748 cycles | 60738 cycles | 0.92 |
| ML-KEM-768 decaps | 66723 cycles | 75882 cycles | 0.88 |
| ML-KEM-1024 keypair | 76880 cycles | 81820 cycles | 0.94 |
| ML-KEM-1024 encaps | 84202 cycles | 91788 cycles | 0.92 |
| ML-KEM-1024 decaps | 98084 cycles | 111489 cycles | 0.88 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton3

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 18954 cycles | 18967 cycles | 1.00 |
| ML-KEM-512 encaps | 23561 cycles | 23560 cycles | 1.00 |
| ML-KEM-512 decaps | 30706 cycles | 30694 cycles | 1.00 |
| ML-KEM-768 keypair | 32310 cycles | 32313 cycles | 1.00 |
| ML-KEM-768 encaps | 35877 cycles | 35896 cycles | 1.00 |
| ML-KEM-768 decaps | 46025 cycles | 46038 cycles | 1.00 |
| ML-KEM-1024 keypair | 46541 cycles | 46621 cycles | 1.00 |
| ML-KEM-1024 encaps | 52423 cycles | 52443 cycles | 1.00 |
| ML-KEM-1024 decaps | 66242 cycles | 66278 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 37626 cycles | 43325 cycles | 0.87 |
| ML-KEM-512 encaps | 42482 cycles | 51813 cycles | 0.82 |
| ML-KEM-512 decaps | 52707 cycles | 67018 cycles | 0.79 |
| ML-KEM-768 keypair | 63281 cycles | 71608 cycles | 0.88 |
| ML-KEM-768 encaps | 69996 cycles | 82680 cycles | 0.85 |
| ML-KEM-768 decaps | 83834 cycles | 103114 cycles | 0.81 |
| ML-KEM-1024 keypair | 95928 cycles | 106742 cycles | 0.90 |
| ML-KEM-1024 encaps | 105214 cycles | 121052 cycles | 0.87 |
| ML-KEM-1024 decaps | 123282 cycles | 147326 cycles | 0.84 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 33602 cycles | 39577 cycles | 0.85 |
| ML-KEM-512 encaps | 37594 cycles | 45619 cycles | 0.82 |
| ML-KEM-512 decaps | 46401 cycles | 59080 cycles | 0.79 |
| ML-KEM-768 keypair | 56468 cycles | 64519 cycles | 0.88 |
| ML-KEM-768 encaps | 62163 cycles | 72950 cycles | 0.85 |
| ML-KEM-768 decaps | 74127 cycles | 91111 cycles | 0.81 |
| ML-KEM-1024 keypair | 85447 cycles | 96073 cycles | 0.89 |
| ML-KEM-1024 encaps | 93431 cycles | 107206 cycles | 0.87 |
| ML-KEM-1024 decaps | 109380 cycles | 130548 cycles | 0.84 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton2

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 29053 cycles | 29041 cycles | 1.00 |
| ML-KEM-512 encaps | 35403 cycles | 35393 cycles | 1.00 |
| ML-KEM-512 decaps | 45901 cycles | 45891 cycles | 1.00 |
| ML-KEM-768 keypair | 49386 cycles | 49383 cycles | 1.00 |
| ML-KEM-768 encaps | 55648 cycles | 55591 cycles | 1.00 |
| ML-KEM-768 decaps | 70466 cycles | 70353 cycles | 1.00 |
| ML-KEM-1024 keypair | 72125 cycles | 72079 cycles | 1.00 |
| ML-KEM-1024 encaps | 80883 cycles | 80862 cycles | 1.00 |
| ML-KEM-1024 decaps | 100777 cycles | 100733 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton4

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 18125 cycles | 18117 cycles | 1.00 |
| ML-KEM-512 encaps | 22176 cycles | 22174 cycles | 1.00 |
| ML-KEM-512 decaps | 28835 cycles | 28829 cycles | 1.00 |
| ML-KEM-768 keypair | 30558 cycles | 30564 cycles | 1.00 |
| ML-KEM-768 encaps | 33621 cycles | 33626 cycles | 1.00 |
| ML-KEM-768 decaps | 43169 cycles | 43166 cycles | 1.00 |
| ML-KEM-1024 keypair | 44171 cycles | 44176 cycles | 1.00 |
| ML-KEM-1024 encaps | 49642 cycles | 49658 cycles | 1.00 |
| ML-KEM-1024 decaps | 62616 cycles | 62640 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 45619 cycles | 51405 cycles | 0.89 |
| ML-KEM-512 encaps | 50810 cycles | 59550 cycles | 0.85 |
| ML-KEM-512 decaps | 62915 cycles | 76537 cycles | 0.82 |
| ML-KEM-768 keypair | 76152 cycles | 84282 cycles | 0.90 |
| ML-KEM-768 encaps | 83475 cycles | 94962 cycles | 0.88 |
| ML-KEM-768 decaps | 99733 cycles | 117171 cycles | 0.85 |
| ML-KEM-1024 keypair | 113603 cycles | 124526 cycles | 0.91 |
| ML-KEM-1024 encaps | 123655 cycles | 138736 cycles | 0.89 |
| ML-KEM-1024 decaps | 144785 cycles | 167382 cycles | 0.86 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton3 (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 35706 cycles | 39348 cycles | 0.91 |
| ML-KEM-512 encaps | 39349 cycles | 45455 cycles | 0.87 |
| ML-KEM-512 decaps | 47936 cycles | 57377 cycles | 0.84 |
| ML-KEM-768 keypair | 60354 cycles | 65850 cycles | 0.92 |
| ML-KEM-768 encaps | 65388 cycles | 73802 cycles | 0.89 |
| ML-KEM-768 decaps | 77205 cycles | 89876 cycles | 0.86 |
| ML-KEM-1024 keypair | 91698 cycles | 98993 cycles | 0.93 |
| ML-KEM-1024 encaps | 99226 cycles | 110056 cycles | 0.90 |
| ML-KEM-1024 decaps | 114902 cycles | 130883 cycles | 0.88 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton4 (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 34744 cycles | 37951 cycles | 0.92 |
| ML-KEM-512 encaps | 38105 cycles | 43322 cycles | 0.88 |
| ML-KEM-512 decaps | 47520 cycles | 55510 cycles | 0.86 |
| ML-KEM-768 keypair | 58216 cycles | 62975 cycles | 0.92 |
| ML-KEM-768 encaps | 63185 cycles | 70419 cycles | 0.90 |
| ML-KEM-768 decaps | 76063 cycles | 86890 cycles | 0.88 |
| ML-KEM-1024 keypair | 88106 cycles | 94481 cycles | 0.93 |
| ML-KEM-1024 encaps | 95974 cycles | 105197 cycles | 0.91 |
| ML-KEM-1024 decaps | 112900 cycles | 126506 cycles | 0.89 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton2 (no-opt)

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 54566 cycles | 60765 cycles | 0.90 |
| ML-KEM-512 encaps | 60731 cycles | 69723 cycles | 0.87 |
| ML-KEM-512 decaps | 74738 cycles | 88788 cycles | 0.84 |
| ML-KEM-768 keypair | 92711 cycles | 102028 cycles | 0.91 |
| ML-KEM-768 encaps | 101596 cycles | 114142 cycles | 0.89 |
| ML-KEM-768 decaps | 120218 cycles | 139370 cycles | 0.86 |
| ML-KEM-1024 keypair | 141471 cycles | 153975 cycles | 0.92 |
| ML-KEM-1024 encaps | 153738 cycles | 169718 cycles | 0.91 |
| ML-KEM-1024 decaps | 178067 cycles | 202204 cycles | 0.88 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Bananapi bpi-f3 benchmarks

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 280848 cycles | 334938 cycles | 0.84 |
| ML-KEM-512 encaps | 316448 cycles | 443581 cycles | 0.71 |
| ML-KEM-512 decaps | 404008 cycles | 591786 cycles | 0.68 |
| ML-KEM-768 keypair | 478553 cycles | 559178 cycles | 0.86 |
| ML-KEM-768 encaps | 524327 cycles | 697637 cycles | 0.75 |
| ML-KEM-768 decaps | 641776 cycles | 889082 cycles | 0.72 |
| ML-KEM-1024 keypair | 721477 cycles | 828079 cycles | 0.87 |
| ML-KEM-1024 encaps | 781700 cycles | 1000643 cycles | 0.78 |
| ML-KEM-1024 decaps | 925584 cycles | 1231971 cycles | 0.75 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 52008 cycles | 51776 cycles | 1.00 |
| ML-KEM-512 encaps | 58668 cycles | 58359 cycles | 1.01 |
| ML-KEM-512 decaps | 75031 cycles | 74228 cycles | 1.01 |
| ML-KEM-768 keypair | 87951 cycles | 87850 cycles | 1.00 |
| ML-KEM-768 encaps | 95804 cycles | 96467 cycles | 0.99 |
| ML-KEM-768 decaps | 119619 cycles | 119522 cycles | 1.00 |
| ML-KEM-1024 keypair | 130753 cycles | 131631 cycles | 0.99 |
| ML-KEM-1024 encaps | 144002 cycles | 145213 cycles | 0.99 |
| ML-KEM-1024 decaps | 175925 cycles | 177906 cycles | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks

| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 58341 cycles | 58342 cycles | 1.00 |
| ML-KEM-512 encaps | 65751 cycles | 65810 cycles | 1.00 |
| ML-KEM-512 decaps | 84550 cycles | 84601 cycles | 1.00 |
| ML-KEM-768 keypair | 99016 cycles | 98963 cycles | 1.00 |
| ML-KEM-768 encaps | 110536 cycles | 110566 cycles | 1.00 |
| ML-KEM-768 decaps | 136898 cycles | 136500 cycles | 1.00 |
| ML-KEM-1024 keypair | 150306 cycles | 150087 cycles | 1.00 |
| ML-KEM-1024 encaps | 166814 cycles | 166550 cycles | 1.00 |
| ML-KEM-1024 decaps | 202887 cycles | 202797 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@hanno-becker added and then removed the "benchmark this PR should be benchmarked in CI" label on Jan 2, 2025
@hanno-becker
Contributor

hanno-becker commented Jan 2, 2025

Thanks a lot @rod-chapman, also for preparing this in a clean way that already passes CI. This is very impressive performance at the C level!

My concerns, as discussed offline before, are:

  1. This is a large deviation from the reference implementation. There are two issues with this:
    • We inherit trust from the reference implementation. Are we risking losing a potential customer because of this? Is the performance uplift worth the deviation? I'm not sure yet.
    • Code size: some customers care more about code size than performance, even in the application domain. The reference implementation and our current version of it are quite compact, and may be an option for those customers. If we put performance first even in the C backend, we may lose those customers because we no longer provide a compact base implementation.
  2. Which customer will actually benefit from this? mlkem-native targets server, mobile and application cores, which I'd say are almost exclusively AArch64 or x86_64+AVX2, and hence covered by the existing native backends. I agree that your code is likely faster than the reference [inv]NTT on other architectures as well, but arguing this way feels like putting the cart before the horse.

On the other hand, the code fits well with the theme of simultaneous assurance + performance, and will make mlkem-native very competitive even in C vs. C comparisons. And... you have already done the work.

@mkannwischer Please join in and share your thoughts.

mlkem/ntt.c: review thread resolved (outdated)
cbmc/Makefile.common: review thread resolved (outdated)
@hanno-becker force-pushed the fastntt2 branch 2 times, most recently from a5b999e to f71ebbf on January 2, 2025 15:16
@hanno-becker
Contributor

hanno-becker commented Jan 2, 2025

> The function basemul_cached() has moved from ntt.c to poly.c. It is only
> ever called in the latter unit, so really has no place in ntt.c. This also means
> that ntt.c implements the NTT and its inverse and nothing else, so greater
> cohesion of this unit. basemul_cached() is now declared "static" in poly.c and is
> available for inlining and auto-vectorization at the compiler's discretion.
> With GCC 14.2.0 -O3 on Apple M1, this saves approx 5000 cycles in all 3 top-level operations.

NB: If we go down the route of further optimizing the C code at the cost of deviating from the reference implementation, the vector-vector base multiplication should match the structure of the native AArch64 implementation, computing one scalar product of vectors of polynomials of degree 2 at a time. This reduces the number of modular reductions by a factor of MLKEM_K.
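
To illustrate the idea (a rough sketch under assumed names and bounds, not the native AArch64 code and not necessarily what this PR would adopt): the MLKEM_K per-polynomial products are accumulated in 32-bit integers and a single Montgomery reduction is applied at the end, instead of reducing once per polynomial. montgomery_reduce() and the mulcache convention (b_cache holding b1 * zeta) are assumed from the reference code.

```c
#include <stdint.h>

/* Accumulate-then-reduce basemul sketch: for each pair of output
 * coefficients, sum the MLKEM_K degree-2 products in 32-bit accumulators
 * and reduce once. Names, parameter shapes and bounds are assumptions. */
static void basemul_acc_cached(int16_t r[MLKEM_N],
                               const int16_t a[MLKEM_K][MLKEM_N],
                               const int16_t b[MLKEM_K][MLKEM_N],
                               const int16_t b_cache[MLKEM_K][MLKEM_N / 2])
{
  unsigned i, k;
  for (i = 0; i < MLKEM_N / 2; i++)
  {
    int32_t acc0 = 0, acc1 = 0;
    for (k = 0; k < MLKEM_K; k++)
    {
      const int16_t a0 = a[k][2 * i], a1 = a[k][2 * i + 1];
      const int16_t b0 = b[k][2 * i], b1 = b[k][2 * i + 1];
      acc0 += (int32_t)a0 * b0 + (int32_t)a1 * b_cache[k][i]; /* b_cache = b1 * zeta */
      acc1 += (int32_t)a0 * b1 + (int32_t)a1 * b0;
    }
    /* One Montgomery reduction per accumulator instead of MLKEM_K of them;
     * with suitably bounded inputs the 32-bit sums do not overflow for K <= 4. */
    r[2 * i] = montgomery_reduce(acc0);
    r[2 * i + 1] = montgomery_reduce(acc1);
  }
}
```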

But: @rod-chapman this is already a huge change -- if, as it seems, the code motion of basemul_cached() is independent from the change to the [inv]NTT, can we please not do it in this PR?

@rod-chapman force-pushed the fastntt2 branch 9 times, most recently from d5e3b7d to c99b1fd on January 14, 2025 09:35
@rod-chapman
Contributor Author

Thanks for your thoughts on this PR, and apologies for the latency
in replying.

I sympathize with your main point - why did I do this?!?
(And is it compatible with the main goals of mlkem-native anyway?)

I had a bit of a think about the first question. My own goals for
doing this are probably something like:

  1. To provide a demonstration of proof-driven optimization and that
    "formal is fast" is achievable using C and CBMC.

  2. To force improvement of CBMC.

  3. To provide a "faster, slightly bigger" C backend for mlkem-native
    (more on which below...).

  4. To provide a fast C backend for targets other than AArch64/NEON and x86-64/AVX2,
    for example:
    • AArch64/SVE2 (if and when compilers can do it)
    • x86_64/AVX512 (if and when compilers can do it)
    • PowerPC/AltiVec
    • RISC-V64 with the Vector extension
    • any machine which can't vectorize at all, e.g. lots of 32-bit devices.

  5. To provide a vehicle for teaching CBMC and type-safe programming in C.

  6. To provide some interesting test cases to see how well SLOTHY
    performs on compiler-generated NEON code.

While I realize that most of these are not goals of mlkem-native,
I don't think any of them are obviously contradictory or incompatible
with mlkem-native.

On point 3, I kinda like the idea of keeping one C backend, but offering
a build option for "small code size, slow" (i.e. the current reference NTT code)
or "larger code size, faster" (i.e. my new code). Many other crypto libraries
(e.g. WolfCrypt) have such build-time options for the code-size vs. performance
trade-off.
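
Purely as an illustration of what such a switch could look like (the option name MLKEM_USE_FAST_NTT and the file names are hypothetical, not something this PR defines):

```c
/* Hypothetical build-time selection between a compact reference-style NTT
 * and the layer-merged fast NTT; all names here are placeholders. */
#if defined(MLKEM_USE_FAST_NTT)
#include "ntt_fast.inc.c"    /* layer-merged, auto-vectorizable, larger .text */
#else
#include "ntt_compact.inc.c" /* reference-style loops, smallest code size */
#endif
```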

Thinking about "Who would use or want this?" I can see a few use cases:

  1. Embedded systems people who want reasonable performance on all targets,
    but aren't worried about power analysis attacks.
  2. LibOQS - does LibOQS adoption of mlkem-native change things? Do they
    have more targets in mind other than AArch64 and Intel? Would a fast reference
    C backend be useful for them?
  3. I will reach out to AWS Devices team to see what targets they have in play.

@rod-chapman force-pushed the fastntt2 branch 3 times, most recently from bf84430 to b075527 on January 15, 2025 13:54
@rod-chapman
Contributor Author

I realize that AArch64/SVE2 and x86_64/AVX512 are something of a red herring, since anyone with a CPU implementing those things will also have NEON and AVX2, so they'd naturally use the existing back-ends for those.

@rod-chapman force-pushed the fastntt2 branch 8 times, most recently from b5bde83 to 16cdf94 on January 17, 2025 13:32
@hanno-becker added and then removed the "benchmark this PR should be benchmarked in CI" label on Jan 18, 2025
@hanno-becker
Contributor

hanno-becker commented Jan 18, 2025

The performance impact has gone down, since the switch to unsigned variables has miraculously improved performance by 10-30%. It is still in the range of 10-15%, though.

@rod-chapman force-pushed the fastntt2 branch 2 times, most recently from b25f6e3 to cb650de on January 21, 2025 12:36
Adjust and move proof files accordingly.

Signed-off-by: Rod Chapman <[email protected]>

Add support for NO_INLINE attribute and its
use with CBMC.

Signed-off-by: Rod Chapman <[email protected]>
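
As an illustration of the kind of definition meant here (a common pattern, not the PR's exact code; how it is guarded for CBMC builds is not shown):

```c
/* Common shape of a NO_INLINE attribute macro; illustrative only. */
#if defined(__GNUC__) || defined(__clang__)
#define NO_INLINE __attribute__((noinline))
#else
#define NO_INLINE /* no-op on compilers without the attribute */
#endif
```

The next commit then applies it to the top-level ntt_layer*() functions so their auto-vectorized object code is easy to locate and review.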

Make top-level ntt_layer*() functions NO_INLINE
to improve comprehension and review of auto-vectorization
of this code.

Signed-off-by: Rod Chapman <[email protected]>

Add first two fast Invert NTT functions

Adds
  invntt_layer7_invert_inner()
  invntt_layer7_invert()

and their proof files.

Signed-off-by: Rod Chapman <[email protected]>

Remove dummy call to invntt_layer7_invert()

Signed-off-by: Rod Chapman <[email protected]>

Add Invert NTT Layer 6 functions.

Adds
  invntt_layer6_inner()
  invntt_layer6()
functions and their proof files.

Signed-off-by: Rod Chapman <[email protected]>

Simplify Zeta tables for layers 4 and 5.

Signed-off-by: Rod Chapman <[email protected]>

Adds Inverse NTT Layer54 (merged)

Adds functions
  invntt_layer54_inner()
  invntt_layer54()

and their proof files.

Signed-off-by: Rod Chapman <[email protected]>

Use new zeta constants in ntt_layer123()

Signed-off-by: Rod Chapman <[email protected]>

Adds first full implementation of Inverse NTT with layer merging.

Adds function invntt_layer321() and its proof.

Updates top-level poly_invntt_tomont() to call new layer-merged implementation.

Renames existing implementation as poly_invntt_tomont_ref() for now.

Signed-off-by: Rod Chapman <[email protected]>

Update proof of poly_invntt_tomont()

Signed-off-by: Rod Chapman <[email protected]>

Remove reference implementation of poly_invntt_tomont()

Also removes local function invntt_layer()
and its proof.

Signed-off-by: Rod Chapman <[email protected]>

Switch to Z3 for these proofs, which is much faster than Bitwuzla.

Signed-off-by: Rod Chapman <[email protected]>

Switch back to Z3 for proof of ntt_layer123()

Signed-off-by: Rod Chapman <[email protected]>

Rename ntt_layer45_slice() to ntt_inner45_inner()

Signed-off-by: Rod Chapman <[email protected]>

Rename proof files

Signed-off-by: Rod Chapman <[email protected]>

Further renaming of ntt_layer45_slice() to ntt_layer45_inner()

Signed-off-by: Rod Chapman <[email protected]>

Rename *_slice() to *_inner() functions - phase 1

Signed-off-by: Rod Chapman <[email protected]>

Rename *_slice() functions to *_inner() - phase 2

Signed-off-by: Rod Chapman <[email protected]>

Renaming *slice() to *inner() - phase 3

Signed-off-by: Rod Chapman <[email protected]>

Display final 4 lines of logs/result.txt after make result

Signed-off-by: Rod Chapman <[email protected]>

Rename inner to butterfly for all internal ntt functions

Signed-off-by: Rod Chapman <[email protected]>

Rename harness function source files

Signed-off-by: Rod Chapman <[email protected]>

Adjust proof Makefiles and harnesses for renaming inner to butterfly

Signed-off-by: Rod Chapman <[email protected]>

Update Makefiles for function renaming

Signed-off-by: Rod Chapman <[email protected]>

Rename all inner functions to butterfly

Signed-off-by: Rod Chapman <[email protected]>

Replace literal 255 with (MLKEM_N - 1) throughout

Signed-off-by: Rod Chapman <[email protected]>

Move declaration of Zeta tables to be as close to their
point of first use as possible.

Add and adjust comments.

Signed-off-by: Rod Chapman <[email protected]>

Align and rename Zetas tables.

Signed-off-by: Rod Chapman <[email protected]>

Zetas tables are all declared with static scope

Signed-off-by: Rod Chapman <[email protected]>

Auto-generate Zeta tables for new layer-merged NTT

Signed-off-by: Rod Chapman <[email protected]>

This proof no longer needs zetas.c as a source file

Signed-off-by: Rod Chapman <[email protected]>

Move basemul_cached() function from ntt.[hc] to poly.c

It is now local, and static within poly.c, so is amenable
to inlining and auto-vectorization within that unit.

Signed-off-by: Rod Chapman <[email protected]>

Make basemul_cached() inline-able

Signed-off-by: Rod Chapman <[email protected]>

Add top-level explanatory comments

Signed-off-by: Rod Chapman <[email protected]>

Clarify and correct 1 typo in comment only.

Signed-off-by: Rod Chapman <[email protected]>

Move symlink from zetas.c to zetas.i

Signed-off-by: Rod Chapman <[email protected]>

Correct INVNTT_BOUND_REF for new implementation of Inverse NTT

Signed-off-by: Rod Chapman <[email protected]>

Add MLKEM_NAMESPACE to mlkem_layer7_zetas, since it is a global symbol

Signed-off-by: Rod Chapman <[email protected]>

Remove boilerplate comments from proof harness sources.

Signed-off-by: Rod Chapman <[email protected]>

Updates for namespacing of all static functions.

Signed-off-by: Rod Chapman <[email protected]>

Update this file following changes to other files

Signed-off-by: Rod Chapman <[email protected]>

Remove pc typedef and use int16_t r[MLKEM_N] instead throughout

Signed-off-by: Rod Chapman <[email protected]>

Adds namespacing to all static constant lookup tables.

In support of the monolithic build.

Signed-off-by: Rod Chapman <[email protected]>

Update auto-generated files following name-spacing of Zeta tables

Signed-off-by: Rod Chapman <[email protected]>

Remove declaration of basemul_cached() from ntt.h following rebase against PR 623

Signed-off-by: Rod Chapman <[email protected]>

Correct declaration of basemul_cached() to be static

Signed-off-by: Rod Chapman <[email protected]>

basemul_cached() is explicitly static, so does not also need MLKEM_NATIVE_INTERNAL_API

Signed-off-by: Rod Chapman <[email protected]>

remove sym link to zetas.c

Signed-off-by: Rod Chapman <[email protected]>

Add new symlink for zetas.i

Signed-off-by: Rod Chapman <[email protected]>

Add internal ct_butterfly() function and use it in ntt_layer123()

Same for other NTT layers and InvNTT is TBD.

Signed-off-by: Rod Chapman <[email protected]>

Update autogenerated file

Signed-off-by: Rod Chapman <[email protected]>

Update invariant and constants for exclusive array bounds checks.

Updates only ntt_layer123() for now to confirm effectiveness
of updates. Other functions will be updated once all is well.

Signed-off-by: Rod Chapman <[email protected]>

Update proofs for exclusive upper bound.

1. Upper bound of quantified ranges is now exclusive
2. Upper bound of array element range constraint is now exclusive.

Signed-off-by: Rod Chapman <[email protected]>
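
For concreteness, this is roughly what an exclusive-bound loop contract looks like in raw CBMC syntax (illustrative only; the repository wraps these annotations in its own macros, and barrett_reduce()/MLKEM_Q are assumed from the existing code):

```c
/* Exclusive bounds throughout: "j <= MLKEM_N" at the loop head and
 * "k < j" for the elements already processed. Illustrative only. */
for (unsigned j = 0; j < MLKEM_N; j++)
__CPROVER_assigns(j, __CPROVER_object_whole(r))
__CPROVER_loop_invariant(j <= MLKEM_N)
__CPROVER_loop_invariant(__CPROVER_forall {
  unsigned k;
  (k < j) ==> (-MLKEM_Q < r[k] && r[k] < MLKEM_Q)
})
{
  r[j] = barrett_reduce(r[j]); /* any update that establishes the bound */
}
```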

Re-generate all auto-generated files after rebase

Signed-off-by: Rod Chapman <[email protected]>

Use inner ct_butterfly() function in all forward NTT functions.

Signed-off-by: Rod Chapman <[email protected]>

Updates for readability following review.

Introduces local gs_butterfly_reduce() and
gs_butterfly_defer() functions, which are inlined
for both compilation and proof, but significantly
simplify and improve readability of the calling
functions.

Signed-off-by: Rod Chapman <[email protected]>
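
For readers following along, the two helpers plausibly look something like this (a reconstruction for illustration, not the PR's exact code; fqmul() and barrett_reduce() are assumed from the reference implementation, and the bounds reasoning that justifies deferring the reduction is tracked by the CBMC contracts rather than shown here):

```c
/* Gentleman-Sande (inverse NTT) butterfly that reduces the sum immediately. */
static void gs_butterfly_reduce(int16_t r[MLKEM_N], unsigned lo, unsigned hi,
                                int16_t zeta)
{
  const int16_t t = r[lo];
  r[lo] = barrett_reduce((int16_t)(t + r[hi]));
  r[hi] = fqmul((int16_t)(r[hi] - t), zeta);
}

/* Variant that defers the reduction of the sum, used where the bounds
 * analysis shows it is not yet needed. */
static void gs_butterfly_defer(int16_t r[MLKEM_N], unsigned lo, unsigned hi,
                               int16_t zeta)
{
  const int16_t t = r[lo];
  r[lo] = (int16_t)(t + r[hi]);
  r[hi] = fqmul((int16_t)(r[hi] - t), zeta);
}
```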

Update auto-generated files after rebase

Signed-off-by: Rod Chapman <[email protected]>

Remove final STATIC_ASSERT() macro use

Signed-off-by: Rod Chapman <[email protected]>

Switch to unsigned type for loop index variables.

Signed-off-by: Rod Chapman <[email protected]>

Clarify comment on vectorization strategy

Signed-off-by: Rod Chapman <[email protected]>

Remove redundant proof files no longer needed on this branch.

Signed-off-by: Rod Chapman <[email protected]>

Correct use of BOUND() macro to debug_assert_abs_bound()

Signed-off-by: Rod Chapman <[email protected]>

Use "unsigned" where possible for zeta_index, start formal
parameters and loop index variables.

Drop redundant lower-bound for these objects in pre-conditions
and loop invariants.

Signed-off-by: Rod Chapman <[email protected]>

Correct bound pre-condition on basemul_cached()

Signed-off-by: Rod Chapman <[email protected]>

Update auto-generated files after rebase

Signed-off-by: Rod Chapman <[email protected]>

Correct new CPP directives following rebase

Signed-off-by: Rod Chapman <[email protected]>

Restore basemul_cached() in poly.c following rebase.

Signed-off-by: Rod Chapman <[email protected]>

Switches to use "unsigned" not "int" for all coefficient indexes.

Consistent with a similar change to the reference code on main branch.

Signed-off-by: Rod Chapman <[email protected]>
…sole caller.

ntt_layer7_butterfly()
ntt_layer6_butterfly()
invntt_layer7_invert_butterfly()
invntt_layer6_butterfly()

Resulting code is easier to read, and has no effect on performance or
proof. Removes now-redundant proof files for these functions.

Signed-off-by: Rod Chapman <[email protected]>
Labels: benchmark this PR should be benchmarked in CI

4 participants