Add fast, layer-merged forward and inverse NTT code. #610
Arm Cortex-A76 (Raspberry Pi 5) benchmarks
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 29055 cycles | 29046 cycles | 1.00 |
| ML-KEM-512 encaps | 35416 cycles | 35405 cycles | 1.00 |
| ML-KEM-512 decaps | 45889 cycles | 45880 cycles | 1.00 |
| ML-KEM-768 keypair | 49360 cycles | 49365 cycles | 1.00 |
| ML-KEM-768 encaps | 55621 cycles | 55567 cycles | 1.00 |
| ML-KEM-768 decaps | 70405 cycles | 70328 cycles | 1.00 |
| ML-KEM-1024 keypair | 72065 cycles | 72056 cycles | 1.00 |
| ML-KEM-1024 encaps | 80792 cycles | 80837 cycles | 1.00 |
| ML-KEM-1024 decaps | 100660 cycles | 100700 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Intel Xeon 4th gen (c7i)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 13532 cycles | 13956 cycles | 0.97 |
| ML-KEM-512 encaps | 17333 cycles | 17291 cycles | 1.00 |
| ML-KEM-512 decaps | 22900 cycles | 23181 cycles | 0.99 |
| ML-KEM-768 keypair | 22536 cycles | 22560 cycles | 1.00 |
| ML-KEM-768 encaps | 24566 cycles | 24504 cycles | 1.00 |
| ML-KEM-768 decaps | 32594 cycles | 32450 cycles | 1.00 |
| ML-KEM-1024 keypair | 31394 cycles | 31395 cycles | 1.00 |
| ML-KEM-1024 encaps | 34952 cycles | 34968 cycles | 1.00 |
| ML-KEM-1024 decaps | 45836 cycles | 45831 cycles | 1.00 |
AMD EPYC 3rd gen (c6a)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 18106 cycles | 18094 cycles | 1.00 |
| ML-KEM-512 encaps | 23033 cycles | 22911 cycles | 1.01 |
| ML-KEM-512 decaps | 30221 cycles | 30205 cycles | 1.00 |
| ML-KEM-768 keypair | 31107 cycles | 31104 cycles | 1.00 |
| ML-KEM-768 encaps | 33901 cycles | 33887 cycles | 1.00 |
| ML-KEM-768 decaps | 44545 cycles | 44494 cycles | 1.00 |
| ML-KEM-1024 keypair | 44753 cycles | 44686 cycles | 1.00 |
| ML-KEM-1024 encaps | 49982 cycles | 49954 cycles | 1.00 |
| ML-KEM-1024 decaps | 64430 cycles | 64342 cycles | 1.00 |
AMD EPYC 4th gen (c7a)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 14914 cycles | 14920 cycles | 1.00 |
| ML-KEM-512 encaps | 19657 cycles | 19650 cycles | 1.00 |
| ML-KEM-512 decaps | 26301 cycles | 26315 cycles | 1.00 |
| ML-KEM-768 keypair | 25597 cycles | 25586 cycles | 1.00 |
| ML-KEM-768 encaps | 28072 cycles | 28096 cycles | 1.00 |
| ML-KEM-768 decaps | 37818 cycles | 37908 cycles | 1.00 |
| ML-KEM-1024 keypair | 35932 cycles | 35215 cycles | 1.02 |
| ML-KEM-1024 encaps | 40942 cycles | 40027 cycles | 1.02 |
| ML-KEM-1024 decaps | 54413 cycles | 53472 cycles | 1.02 |
Intel Xeon 3rd gen (c6i)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 20350 cycles | 20355 cycles | 1.00 |
| ML-KEM-512 encaps | 26953 cycles | 26954 cycles | 1.00 |
| ML-KEM-512 decaps | 35737 cycles | 35779 cycles | 1.00 |
| ML-KEM-768 keypair | 34908 cycles | 34905 cycles | 1.00 |
| ML-KEM-768 encaps | 38182 cycles | 38169 cycles | 1.00 |
| ML-KEM-768 decaps | 50974 cycles | 50951 cycles | 1.00 |
| ML-KEM-1024 keypair | 47974 cycles | 47981 cycles | 1.00 |
| ML-KEM-1024 encaps | 54125 cycles | 54176 cycles | 1.00 |
| ML-KEM-1024 decaps | 71597 cycles | 71674 cycles | 1.00 |
Intel Xeon 4th gen (c7i) (no-opt)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 29699 cycles | 33077 cycles | 0.90 |
| ML-KEM-512 encaps | 33533 cycles | 38822 cycles | 0.86 |
| ML-KEM-512 decaps | 41387 cycles | 50806 cycles | 0.81 |
| ML-KEM-768 keypair | 51518 cycles | 54897 cycles | 0.94 |
| ML-KEM-768 encaps | 55748 cycles | 60738 cycles | 0.92 |
| ML-KEM-768 decaps | 66723 cycles | 75882 cycles | 0.88 |
| ML-KEM-1024 keypair | 76880 cycles | 81820 cycles | 0.94 |
| ML-KEM-1024 encaps | 84202 cycles | 91788 cycles | 0.92 |
| ML-KEM-1024 decaps | 98084 cycles | 111489 cycles | 0.88 |
Graviton3
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 18954 cycles | 18967 cycles | 1.00 |
| ML-KEM-512 encaps | 23561 cycles | 23560 cycles | 1.00 |
| ML-KEM-512 decaps | 30706 cycles | 30694 cycles | 1.00 |
| ML-KEM-768 keypair | 32310 cycles | 32313 cycles | 1.00 |
| ML-KEM-768 encaps | 35877 cycles | 35896 cycles | 1.00 |
| ML-KEM-768 decaps | 46025 cycles | 46038 cycles | 1.00 |
| ML-KEM-1024 keypair | 46541 cycles | 46621 cycles | 1.00 |
| ML-KEM-1024 encaps | 52423 cycles | 52443 cycles | 1.00 |
| ML-KEM-1024 decaps | 66242 cycles | 66278 cycles | 1.00 |
AMD EPYC 3rd gen (c6a) (no-opt)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 37626 cycles | 43325 cycles | 0.87 |
| ML-KEM-512 encaps | 42482 cycles | 51813 cycles | 0.82 |
| ML-KEM-512 decaps | 52707 cycles | 67018 cycles | 0.79 |
| ML-KEM-768 keypair | 63281 cycles | 71608 cycles | 0.88 |
| ML-KEM-768 encaps | 69996 cycles | 82680 cycles | 0.85 |
| ML-KEM-768 decaps | 83834 cycles | 103114 cycles | 0.81 |
| ML-KEM-1024 keypair | 95928 cycles | 106742 cycles | 0.90 |
| ML-KEM-1024 encaps | 105214 cycles | 121052 cycles | 0.87 |
| ML-KEM-1024 decaps | 123282 cycles | 147326 cycles | 0.84 |
AMD EPYC 4th gen (c7a) (no-opt)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 33602 cycles | 39577 cycles | 0.85 |
| ML-KEM-512 encaps | 37594 cycles | 45619 cycles | 0.82 |
| ML-KEM-512 decaps | 46401 cycles | 59080 cycles | 0.79 |
| ML-KEM-768 keypair | 56468 cycles | 64519 cycles | 0.88 |
| ML-KEM-768 encaps | 62163 cycles | 72950 cycles | 0.85 |
| ML-KEM-768 decaps | 74127 cycles | 91111 cycles | 0.81 |
| ML-KEM-1024 keypair | 85447 cycles | 96073 cycles | 0.89 |
| ML-KEM-1024 encaps | 93431 cycles | 107206 cycles | 0.87 |
| ML-KEM-1024 decaps | 109380 cycles | 130548 cycles | 0.84 |
Graviton2
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 29053 cycles | 29041 cycles | 1.00 |
| ML-KEM-512 encaps | 35403 cycles | 35393 cycles | 1.00 |
| ML-KEM-512 decaps | 45901 cycles | 45891 cycles | 1.00 |
| ML-KEM-768 keypair | 49386 cycles | 49383 cycles | 1.00 |
| ML-KEM-768 encaps | 55648 cycles | 55591 cycles | 1.00 |
| ML-KEM-768 decaps | 70466 cycles | 70353 cycles | 1.00 |
| ML-KEM-1024 keypair | 72125 cycles | 72079 cycles | 1.00 |
| ML-KEM-1024 encaps | 80883 cycles | 80862 cycles | 1.00 |
| ML-KEM-1024 decaps | 100777 cycles | 100733 cycles | 1.00 |
Graviton4
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 18125 cycles | 18117 cycles | 1.00 |
| ML-KEM-512 encaps | 22176 cycles | 22174 cycles | 1.00 |
| ML-KEM-512 decaps | 28835 cycles | 28829 cycles | 1.00 |
| ML-KEM-768 keypair | 30558 cycles | 30564 cycles | 1.00 |
| ML-KEM-768 encaps | 33621 cycles | 33626 cycles | 1.00 |
| ML-KEM-768 decaps | 43169 cycles | 43166 cycles | 1.00 |
| ML-KEM-1024 keypair | 44171 cycles | 44176 cycles | 1.00 |
| ML-KEM-1024 encaps | 49642 cycles | 49658 cycles | 1.00 |
| ML-KEM-1024 decaps | 62616 cycles | 62640 cycles | 1.00 |
Intel Xeon 3rd gen (c6i) (no-opt)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 45619 cycles | 51405 cycles | 0.89 |
| ML-KEM-512 encaps | 50810 cycles | 59550 cycles | 0.85 |
| ML-KEM-512 decaps | 62915 cycles | 76537 cycles | 0.82 |
| ML-KEM-768 keypair | 76152 cycles | 84282 cycles | 0.90 |
| ML-KEM-768 encaps | 83475 cycles | 94962 cycles | 0.88 |
| ML-KEM-768 decaps | 99733 cycles | 117171 cycles | 0.85 |
| ML-KEM-1024 keypair | 113603 cycles | 124526 cycles | 0.91 |
| ML-KEM-1024 encaps | 123655 cycles | 138736 cycles | 0.89 |
| ML-KEM-1024 decaps | 144785 cycles | 167382 cycles | 0.86 |
Graviton3 (no-opt)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 35706 cycles | 39348 cycles | 0.91 |
| ML-KEM-512 encaps | 39349 cycles | 45455 cycles | 0.87 |
| ML-KEM-512 decaps | 47936 cycles | 57377 cycles | 0.84 |
| ML-KEM-768 keypair | 60354 cycles | 65850 cycles | 0.92 |
| ML-KEM-768 encaps | 65388 cycles | 73802 cycles | 0.89 |
| ML-KEM-768 decaps | 77205 cycles | 89876 cycles | 0.86 |
| ML-KEM-1024 keypair | 91698 cycles | 98993 cycles | 0.93 |
| ML-KEM-1024 encaps | 99226 cycles | 110056 cycles | 0.90 |
| ML-KEM-1024 decaps | 114902 cycles | 130883 cycles | 0.88 |
Graviton4 (no-opt)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 34744 cycles | 37951 cycles | 0.92 |
| ML-KEM-512 encaps | 38105 cycles | 43322 cycles | 0.88 |
| ML-KEM-512 decaps | 47520 cycles | 55510 cycles | 0.86 |
| ML-KEM-768 keypair | 58216 cycles | 62975 cycles | 0.92 |
| ML-KEM-768 encaps | 63185 cycles | 70419 cycles | 0.90 |
| ML-KEM-768 decaps | 76063 cycles | 86890 cycles | 0.88 |
| ML-KEM-1024 keypair | 88106 cycles | 94481 cycles | 0.93 |
| ML-KEM-1024 encaps | 95974 cycles | 105197 cycles | 0.91 |
| ML-KEM-1024 decaps | 112900 cycles | 126506 cycles | 0.89 |
Graviton2 (no-opt)
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 54566 cycles | 60765 cycles | 0.90 |
| ML-KEM-512 encaps | 60731 cycles | 69723 cycles | 0.87 |
| ML-KEM-512 decaps | 74738 cycles | 88788 cycles | 0.84 |
| ML-KEM-768 keypair | 92711 cycles | 102028 cycles | 0.91 |
| ML-KEM-768 encaps | 101596 cycles | 114142 cycles | 0.89 |
| ML-KEM-768 decaps | 120218 cycles | 139370 cycles | 0.86 |
| ML-KEM-1024 keypair | 141471 cycles | 153975 cycles | 0.92 |
| ML-KEM-1024 encaps | 153738 cycles | 169718 cycles | 0.91 |
| ML-KEM-1024 decaps | 178067 cycles | 202204 cycles | 0.88 |
Bananapi bpi-f3 benchmarks
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 280848 cycles | 334938 cycles | 0.84 |
| ML-KEM-512 encaps | 316448 cycles | 443581 cycles | 0.71 |
| ML-KEM-512 decaps | 404008 cycles | 591786 cycles | 0.68 |
| ML-KEM-768 keypair | 478553 cycles | 559178 cycles | 0.86 |
| ML-KEM-768 encaps | 524327 cycles | 697637 cycles | 0.75 |
| ML-KEM-768 decaps | 641776 cycles | 889082 cycles | 0.72 |
| ML-KEM-1024 keypair | 721477 cycles | 828079 cycles | 0.87 |
| ML-KEM-1024 encaps | 781700 cycles | 1000643 cycles | 0.78 |
| ML-KEM-1024 decaps | 925584 cycles | 1231971 cycles | 0.75 |
Arm Cortex-A72 (Raspberry Pi 4) benchmarks
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 52008 cycles | 51776 cycles | 1.00 |
| ML-KEM-512 encaps | 58668 cycles | 58359 cycles | 1.01 |
| ML-KEM-512 decaps | 75031 cycles | 74228 cycles | 1.01 |
| ML-KEM-768 keypair | 87951 cycles | 87850 cycles | 1.00 |
| ML-KEM-768 encaps | 95804 cycles | 96467 cycles | 0.99 |
| ML-KEM-768 decaps | 119619 cycles | 119522 cycles | 1.00 |
| ML-KEM-1024 keypair | 130753 cycles | 131631 cycles | 0.99 |
| ML-KEM-1024 encaps | 144002 cycles | 145213 cycles | 0.99 |
| ML-KEM-1024 decaps | 175925 cycles | 177906 cycles | 0.99 |
Arm Cortex-A55 (Snapdragon 888) benchmarks
| Benchmark suite | Current: 755c479 | Previous: 5e3033e | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 58341 cycles | 58342 cycles | 1.00 |
| ML-KEM-512 encaps | 65751 cycles | 65810 cycles | 1.00 |
| ML-KEM-512 decaps | 84550 cycles | 84601 cycles | 1.00 |
| ML-KEM-768 keypair | 99016 cycles | 98963 cycles | 1.00 |
| ML-KEM-768 encaps | 110536 cycles | 110566 cycles | 1.00 |
| ML-KEM-768 decaps | 136898 cycles | 136500 cycles | 1.00 |
| ML-KEM-1024 keypair | 150306 cycles | 150087 cycles | 1.00 |
| ML-KEM-1024 encaps | 166814 cycles | 166550 cycles | 1.00 |
| ML-KEM-1024 decaps | 202887 cycles | 202797 cycles | 1.00 |
Thanks a lot @rod-chapman, also for preparing this in a clean way that already passes CI. This is a very impressive performance on the C-level! My concerns, as discussed offline before, are:
On the other hand, the code fits well with the theme of simultaneous assurance + performance, and will make mlkem-native very competitive even in C vs. C comparisons. And... you have already done the work. @mkannwischer Please join in and share your thoughts.
NB: If we go down the route of further optimizing the C code at the cost of deviating from the reference implementation, the vector-vector base multiplication should match the structure of the native AArch64 implementation, computing one scalar product of vectors of polynomials of degree 2 at a time. This reduces the number of modular reductions by a factor of MLKEM_K. But: @rod-chapman this is already a huge change -- if, as it seems, the code motion of
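To make the reduction-count argument concrete, here is a rough sketch, not code from this PR. It accumulates the MLKEM_K coefficient-pair products in 32 bits and applies a single Montgomery reduction per output coefficient, next to a per-term variant for comparison. The function names, the flat array layout, and the assumption that every input is strictly bounded by q in absolute value are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MLKEM_Q 3329
#define MLKEM_K 4
#define QINV 62209u /* q^-1 mod 2^16 */

/* Canonical representative in [0, q); only used to compare results. */
static int32_t mod_q(int32_t x)
{
  int32_t r = x % MLKEM_Q;
  return r < 0 ? r + MLKEM_Q : r;
}

/* Returns a value congruent to a * 2^-16 mod q, assuming |a| < 2^15 * q. */
static int16_t montgomery_reduce(int32_t a)
{
  uint16_t u = (uint16_t)((uint32_t)a * QINV);
  int32_t t = (u < 32768u) ? (int32_t)u : (int32_t)u - 65536;
  /* a - t*q is an exact multiple of 2^16, so this division is exact. */
  return (int16_t)((a - t * MLKEM_Q) / 65536);
}

/* Hypothetical bound check: |x| < q. */
static int in_range(int32_t x) { return x > -MLKEM_Q && x < MLKEM_Q; }

/* Per-term variant: 2 * MLKEM_K reductions for one output pair.
 * a and b hold MLKEM_K coefficient pairs; b1z[k] stands in for the
 * cached product of b[2k+1] with the relevant zeta. */
static void basemul_acc_per_term(int16_t r[2], const int16_t a[2 * MLKEM_K],
                                 const int16_t b[2 * MLKEM_K],
                                 const int16_t b1z[MLKEM_K])
{
  unsigned k;
  int32_t r0 = 0, r1 = 0;
  for (k = 0; k < MLKEM_K; k++)
  {
    int32_t a0 = a[2 * k], a1 = a[2 * k + 1];
    int32_t b0 = b[2 * k], b1 = b[2 * k + 1];
    r0 += montgomery_reduce(a0 * b0 + a1 * b1z[k]);
    r1 += montgomery_reduce(a0 * b1 + a1 * b0);
  }
  r[0] = (int16_t)r0;
  r[1] = (int16_t)r1;
}

/* Deferred variant: accumulate in 32 bits, then 2 reductions in total. */
static void basemul_acc_deferred(int16_t r[2], const int16_t a[2 * MLKEM_K],
                                 const int16_t b[2 * MLKEM_K],
                                 const int16_t b1z[MLKEM_K])
{
  unsigned k;
  int32_t t0 = 0, t1 = 0;
  for (k = 0; k < MLKEM_K; k++)
  {
    int32_t a0 = a[2 * k], a1 = a[2 * k + 1];
    int32_t b0 = b[2 * k], b1 = b[2 * k + 1];
    assert(in_range(a0) && in_range(a1) && in_range(b0) && in_range(b1));
    assert(in_range(b1z[k]));
    t0 += a0 * b0 + a1 * b1z[k];
    t1 += a0 * b1 + a1 * b0;
  }
  /* |t| <= 2 * MLKEM_K * (q - 1)^2 < 2^15 * q holds for MLKEM_K <= 4. */
  r[0] = montgomery_reduce(t0);
  r[1] = montgomery_reduce(t1);
}
```

Both variants agree modulo q because Montgomery reduction is linear modulo q. The deferred one only works because the accumulated sum stays inside the reduction's input range, which is the kind of bound a CBMC contract would need to carry.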
Thanks for your thoughts on this PR, and apologies for the latency. I sympathize with your main point - why did I do this?!? I had a bit of a think about the first question. My own goal for
While I realize that most of these are not goals of mlkem-native, On point 3, I kinda like the idea of keeping 1 C backend, but offering Thinking about "Who would use or want this?" I can see a few use cases:
I realize that AArch64/SVE2 and x86_64/AVX512 are something of a red herring, since anyone with a CPU implementing those things will also have NEON and AVX2, so they'd naturally use the existing back-ends for those.
The performance impact has gone down since the switch to unsigned variables has miraculously improved performance by 10-30%. It is still in the range of 10-15% though.
- Adjust and move proof files accordingly.
- Add support for NO_INLINE attribute and its use with CBMC.
- Make top-level ntt_layer*() functions NO_INLINE to improve comprehension and review of auto-vectorization of this code.
- Add first two fast Invert NTT functions. Adds invntt_layer7_invert_inner(), invntt_layer7_invert() and their proof files.
- Remove dummy call to invntt_layer7_invert()
- Add Invert NTT Layer 6 functions. Adds invntt_layer6_inner(), invntt_layer6() functions and their proof files.
- Simplify Zeta tables for layers 4 and 5.
- Adds Inverse NTT Layer54 (merged). Adds functions invntt_layer54_inner(), invntt_layer54() and their proof files.
- Use new zeta constants in ntt_layer123()
- Adds first full implementation of Inverse NTT with layer merging. Adds function invntt_layer321() and its proof. Updates top-level poly_invntt_tomont() to call new layer-merged implementation. Renames existing implementation as poly_invntt_tomont_ref() for now.
- Update proof of poly_invntt_tomont()
- Remove reference implementation of poly_invntt_tomont(). Also removes local function invntt_layer() and its proof.
- Switch to Z3 for these proofs, which is much faster than Bitwuzla.
- Switch back to Z3 for proof of ntt_layer123()
- Rename ntt_layer45_slice() to ntt_inner45_inner()
- Rename proof files
- Further renaming of ntt_layer45_slice() to ntt_layer45_inner()
- Rename *_slice() to *_inner() functions - phase 1
- Rename *_slice() functions to *_inner() - phase 2
- Renaming *slice() to *inner() - phase 3
- Display final 4 lines of logs/result.txt after make result
- Rename inner to butterfly for all internal ntt functions
- Rename harness function source files
- Adjust proof Makefiles and harnesses for renaming inner to butterfly
- Update Makefiles for function renaming
- Rename all inner functions to butterfly
- Replace literal 255 with (MLKEM_N - 1) throughout
- Move declaration of Zeta tables to be as close to their point of first use as possible. Add and adjust comments.
- Align and rename Zetas tables.
- Zetas tables are all declared with static scope
- Auto-generate Zeta tables for new layer-merged NTT
- This proof no longer needs zetas.c as a source file
- Move basemul_cached() function from ntt.[hc] to poly.c. It is now local, and static within poly.c, so is amenable to inlining and auto-vectorization within that unit.
- Make basemul_cached() inline-able
- Add top-level explanatory comments
- Clarify and correct 1 typo in comment only.
- Move symlink from zetas.c to zetas.i
- Correct INVNTT_BOUND_REF for new implementation of Inverse NTT
- Add MLKEM_NAMESPACE to mlkem_layer7_zetas, since it is a global symbol
- Remove boilerplate comments from proof harness sources.
- Updates for namespacing of all static functions.
- Update this file following changes to other files
- Remove pc typedef and use int16_t r[MLKEM_N] instead throughout
- Adds namespacing to all static constant lookup tables. In support of the monolithic build.
- Update auto-generated files following name-spacing of Zeta tables
- Remove declaration of basemul_cached() from ntt.h following rebase against PR 623
- Correct declaration of basemul_cached() to be static
- basemul_cached() is explicitly static, so does not also need MLKEM_NATIVE_INTERNAL_API
- Remove symlink to zetas.c
- Add new symlink for zetas.i
- Add internal ct_butterfly() function and use it in ntt_layer123(). Same for other NTT layers and InvNTT is TBD.
- Update autogenerated file
- Update invariant and constants for exclusive array bounds checks. Updates only ntt_layer123() for now to confirm effectiveness of updates. Other functions will be updated once all is well.
- Update proofs for exclusive upper bound. 1. Upper bound of quantified ranges is now exclusive. 2. Upper bound of array element range constraint is now exclusive.
- Re-generate all auto-generated files after rebase
- Use inner ct_butterfly() function in all forward NTT functions.
- Updates for readability following review. Introduces local gs_butterfly_reduce() and gs_butterfly_defer() functions, which are inlined for both compilation and proof, but significantly simplify and improve readability of the calling functions.
- Update auto-generated files after rebase
- Remove final STATIC_ASSERT() macro use
- Switch to unsigned type for loop index variables.
- Clarify comment on vectorization strategy
- Remove redundant proof files no longer needed on this branch.
- Correct use of BOUND() macro to debug_assert_abs_bound()
- Use "unsigned" where possible for zeta_index, start formal parameters and loop index variables. Drop redundant lower-bound for these objects in pre-conditions and loop invariants.
- Correct bound pre-condition on basemul_cached()
- Update auto-generated files after rebase
- Correct new CPP directives following rebase
- Restore basemul_cached() in poly.c following rebase.
- Switches to use "unsigned" not "int" for all coefficient indexes. Consistent with a similar change to the reference code on main branch.
- …sole caller. ntt_layer7_butterfly(), ntt_layer6_butterfly(), invntt_layer7_invert_butterfly(), invntt_layer6_butterfly(). Resulting code is easier to read, and has no effect on performance or proof. Removes now-redundant proof files for these functions.

All commits carry: Signed-off-by: Rod Chapman <[email protected]>
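The commit messages name ct_butterfly(), gs_butterfly_reduce() and gs_butterfly_defer() without showing them. The sketch below is one plausible shape for such helpers, assuming pointer arguments, zetas stored in Montgomery form, and the usual Montgomery and Barrett reductions for q = 3329. The signatures and bodies are illustrative guesses, not the code in this PR.

```c
#include <assert.h>
#include <stdint.h>

#define MLKEM_Q 3329
#define QINV 62209u /* q^-1 mod 2^16 */

/* Canonical representative in [0, q); only used to check results. */
static int32_t mod_q(int32_t x)
{
  int32_t r = x % MLKEM_Q;
  return r < 0 ? r + MLKEM_Q : r;
}

/* Returns a value congruent to a * 2^-16 mod q, assuming |a| < 2^15 * q. */
static int16_t montgomery_reduce(int32_t a)
{
  uint16_t u = (uint16_t)((uint32_t)a * QINV);
  int32_t t = (u < 32768u) ? (int32_t)u : (int32_t)u - 65536;
  /* a - t*q is an exact multiple of 2^16, so this division is exact. */
  return (int16_t)((a - t * MLKEM_Q) / 65536);
}

/* Barrett reduction to a small centered representative.
 * Assumes arithmetic right shift of negative values (GCC, Clang). */
static int16_t barrett_reduce(int16_t a)
{
  const int32_t v = 20159; /* round(2^26 / q) */
  int32_t t = (v * a + (1 << 25)) >> 26;
  return (int16_t)(a - t * MLKEM_Q);
}

/* Cooley-Tukey butterfly: (a, b) -> (a + zeta*b, a - zeta*b). */
static void ct_butterfly(int16_t *a, int16_t *b, int16_t zeta)
{
  int16_t t = montgomery_reduce((int32_t)*b * zeta);
  *b = (int16_t)(*a - t);
  *a = (int16_t)(*a + t);
}

/* Gentleman-Sande butterfly that reduces the sum immediately. */
static void gs_butterfly_reduce(int16_t *a, int16_t *b, int16_t zeta)
{
  int16_t t = *a;
  *a = barrett_reduce((int16_t)(t + *b));
  *b = montgomery_reduce((int32_t)(*b - t) * zeta);
}

/* Gentleman-Sande butterfly that leaves the sum unreduced. The caller
 * must know from its bounds that the sum still fits in int16_t. */
static void gs_butterfly_defer(int16_t *a, int16_t *b, int16_t zeta)
{
  int16_t t = *a;
  int32_t sum = (int32_t)t + *b;
  assert(sum > INT16_MIN && sum < INT16_MAX);
  *a = (int16_t)sum;
  *b = montgomery_reduce((int32_t)(*b - t) * zeta);
}
```

Under these assumptions the only difference between the two Gentleman-Sande variants is whether the Barrett reduction is paid now or pushed to a later layer, which is why the deferred form needs a tighter bound in its precondition.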
Fast NTT and Inverse NTT in C
This PR introduces an optimized C implementation of the forward and inverse NTT for ML-KEM.
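As a toy illustration of what "layer-merged" in the PR title means, the sketch below runs two butterfly layers on an 8-element array twice: once as two separate passes over the array, and once as a single pass that loads four coefficients, applies both layers, and stores them back. It uses plain `%` arithmetic rather than Montgomery multiplication to stay short. The array size, function names and zeta values are made up for the example and are not taken from this PR.

```c
#include <assert.h>
#include <stdint.h>

#define MLKEM_Q 3329
#define TOY_N 8 /* toy size, not MLKEM_N */

static int32_t mod_q(int32_t x)
{
  int32_t r = x % MLKEM_Q;
  return r < 0 ? r + MLKEM_Q : r;
}

/* (a, b) -> (a + zeta*b, a - zeta*b) mod q, on canonical values. */
static void butterfly(int32_t *a, int32_t *b, int32_t zeta)
{
  int32_t t;
  assert(zeta >= 0 && zeta < MLKEM_Q);
  t = mod_q(*b * zeta);
  *b = mod_q(*a - t);
  *a = mod_q(*a + t);
}

/* One layer: butterflies between r[j] and r[j + len]. */
static void layer(int32_t r[TOY_N], unsigned len, const int32_t *zetas)
{
  unsigned start, j, k = 0;
  for (start = 0; start < TOY_N; start += 2 * len)
  {
    int32_t zeta = zetas[k++];
    for (j = start; j < start + len; j++)
    {
      butterfly(&r[j], &r[j + len], zeta);
    }
  }
}

/* Two passes over memory: len = 4, then len = 2. */
static void two_layers_separate(int32_t r[TOY_N], const int32_t z_outer[1],
                                const int32_t z_inner[2])
{
  layer(r, 4, z_outer);
  layer(r, 2, z_inner);
}

/* One pass: each group of four coefficients goes through both layers
 * while held in local variables. */
static void two_layers_merged(int32_t r[TOY_N], const int32_t z_outer[1],
                              const int32_t z_inner[2])
{
  unsigned j;
  for (j = 0; j < 2; j++)
  {
    int32_t c0 = r[j], c1 = r[j + 2], c2 = r[j + 4], c3 = r[j + 6];
    butterfly(&c0, &c2, z_outer[0]); /* layer with len = 4 */
    butterfly(&c1, &c3, z_outer[0]);
    butterfly(&c0, &c1, z_inner[0]); /* layer with len = 2 */
    butterfly(&c2, &c3, z_inner[1]);
    r[j] = c0;
    r[j + 2] = c1;
    r[j + 4] = c2;
    r[j + 6] = c3;
  }
}
```

The merged form does the same arithmetic but touches each coefficient in memory once per pair of layers instead of once per layer. The same idea extends to three merged layers with eight coefficients held in locals.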
Significant changes
for an overview of the optimization approach.
into mlkem/zetas.i. Note that zetas.i is NOT a standalone translation unit in C, but is
intended to be included by cpp in the body of ntt.c, so that the literal constant
values in the tables are available to the compiler and optimizer for auto-vectorization.
ever called in the latter unit, so really has no place in ntt.c. This also means
that ntt.c implements the NTT and its inverse and nothing else, which improves the
cohesion of this unit. basemul_cached() is now declared "static" in poly.c and is
available for inlining and auto-vectorization at the compiler's discretion.
With GCC 14.2.0 -O3 on Apple M1, this saves approximately 5000 cycles in all 3 top-level operations.
same proof obligations as before.
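The visibility points in the list above can be pictured with a small single-file sketch. It puts a `static const` table and a `static inline` helper in the same translation unit as the loop that uses them, and marks the top-level function as non-inlined so its generated code is easy to find when reviewing auto-vectorization. Every name here is hypothetical. The `DEMO_NO_INLINE` definition is an assumption about how a NO_INLINE macro could be spelled for GCC and Clang, and the table holds the first four powers of 17 mod 3329 as placeholder values. In the layout described above, the initializer would come from `#include "zetas.i"` rather than being written inline.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_Q 3329
#define DEMO_N 8

/* Assumed spelling of a no-inline attribute for GCC and Clang. */
#if defined(__GNUC__) || defined(__clang__)
#define DEMO_NO_INLINE __attribute__((noinline))
#else
#define DEMO_NO_INLINE
#endif

/* Placeholder table: 17^1 .. 17^4 mod 3329. Because it is static const
 * in this translation unit, the compiler can see the literal values. */
static const int16_t demo_zetas[DEMO_N / 2] = {17, 289, 1584, 296};

/* static inline: a candidate for inlining at the compiler's discretion. */
static inline int16_t demo_mul_mod(int16_t a, int16_t zeta)
{
  assert(a >= 0 && a < DEMO_Q);
  return (int16_t)(((int32_t)a * zeta) % DEMO_Q);
}

/* Top-level function kept out of line so it appears as one symbol. */
DEMO_NO_INLINE void demo_layer(int16_t r[DEMO_N])
{
  unsigned i;
  for (i = 0; i < DEMO_N / 2; i++)
  {
    r[2 * i + 1] = demo_mul_mod(r[2 * i + 1], demo_zetas[i]);
  }
}
```

Whether a given compiler actually vectorizes such a loop depends on the compiler version and flags, so the generated code still needs to be inspected per platform.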
Minor changes
of logs/result.txt are shown following
make result
Verification
Performance
Results on top-level operations will appear from CI runs.
Results for low-level poly_ntt() and poly_invntt_tomont() using the
"--components" switch of "tests bench" on various platforms:
Graviton 3 (c7g instance), GCC 13.2.0 -O3
Clock cycles, lower is better
x86_64 (c7i instance), GCC 13.2.0 -O3
Clock cycles, lower is better
Apple M1 (MacBook Pro), GCC 14.2.0 -O3
Clock cycles, lower is better
To come
More results to come:
Code Size
On Apple M1 with GCC 14.2.0 -O3
Size of text segment of ntt.o:
Before: 2212 bytes
After: 6924 bytes