-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use concrete-ntt
for NTT operations
#243
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #243 +/- ##
=======================================
Coverage 93.71% 93.71%
=======================================
Files 42 42
Lines 8194 8194
=======================================
Hits 7679 7679
Misses 515 515 ☔ View full report in Codecov by Sentry. |
Apple M2 on battery: $ cargo criterion ntt --features concrete-ntt
Compiling fhe-math v0.1.0-beta.8 (/Users/tancrede/git/fhe.rs/crates/fhe-math)
Compiling fhe v0.1.0-beta.8 (/Users/tancrede/git/fhe.rs/crates/fhe)
Finished bench [optimized] target(s) in 8.12s
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out; finished in 0.00s
ntt/forward/1024/62 time: [4.2295 µs 4.2310 µs 4.2335 µs]
change: [+31.699% +32.158% +32.656%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/forward_vt/1024/62 time: [4.2292 µs 4.2308 µs 4.2333 µs]
change: [+30.739% +31.144% +31.576%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/backward/1024/62 time: [4.9529 µs 4.9564 µs 4.9596 µs]
change: [+42.015% +42.382% +42.770%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/backward_vt/1024/62 time: [4.9462 µs 4.9509 µs 4.9564 µs]
change: [+41.606% +41.982% +42.352%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/forward/1024/16 time: [4.2616 µs 4.2904 µs 4.3288 µs]
change: [+33.286% +34.570% +36.245%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/forward_vt/1024/16 time: [4.4043 µs 4.5029 µs 4.6164 µs]
change: [+35.090% +38.045% +43.124%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/backward/1024/16 time: [4.9739 µs 4.9961 µs 5.0263 µs]
change: [+43.092% +43.963% +44.946%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/backward_vt/1024/16 time: [4.9532 µs 4.9581 µs 4.9631 µs]
change: [+41.480% +42.032% +42.502%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/forward/4096/62 time: [19.723 µs 19.757 µs 19.802 µs]
change: [+35.091% +36.112% +37.347%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/forward_vt/4096/62 time: [19.873 µs 19.990 µs 20.126 µs]
change: [+34.851% +35.493% +36.214%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/backward/4096/62 time: [22.864 µs 22.901 µs 22.951 µs]
change: [+44.177% +44.813% +45.415%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/backward_vt/4096/62 time: [22.846 µs 22.856 µs 22.872 µs]
change: [+44.354% +44.766% +45.210%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/forward/4096/16 time: [19.775 µs 19.871 µs 19.986 µs]
change: [+35.310% +36.557% +38.172%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/forward_vt/4096/16 time: [19.678 µs 19.689 µs 19.702 µs]
change: [+33.242% +33.690% +34.139%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/backward/4096/16 time: [22.840 µs 22.849 µs 22.861 µs]
change: [+43.660% +44.123% +44.560%] (p = 0.00 < 0.05)
Performance has regressed.
ntt/backward_vt/4096/16 time: [22.892 µs 22.960 µs 23.032 µs]
change: [+44.630% +45.081% +45.544%] (p = 0.00 < 0.05)
Performance has regressed. |
Actually, I want to modify the concrete-ntt a little. |
Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz cargo criterion ntt --features concrete-ntt
Compiling fhe-math v0.1.0-beta.8 (/data/dev/[fhe.rs/crates/fhe-math](http://fhe.rs/crates/fhe-math))
Compiling fhe v0.1.0-beta.8 (/data/dev/[fhe.rs/crates/fhe](http://fhe.rs/crates/fhe))
Finished bench [optimized] target(s) in 4.60s
Gnuplot not found, using plotters backend
Gnuplot not found, using plotters backend
Gnuplot not found, using plotters backend
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out; finished in 0.00s
Gnuplot not found, using plotters backend
ntt/forward/1024/62 time: [7.9211 µs 7.9222 µs 7.9231 µs]
change: [-52.194% -52.185% -52.178%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward_vt/1024/62 time: [7.8962 µs 7.9078 µs 7.9179 µs]
change: [-48.513% -48.461% -48.413%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward/1024/62 time: [8.9256 µs 8.9387 µs 8.9510 µs]
change: [-22.332% -22.265% -22.208%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward_vt/1024/62 time: [8.8733 µs 8.8781 µs 8.8836 µs]
change: [-18.803% -18.765% -18.725%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward/1024/16 time: [7.4251 µs 7.5024 µs 7.6175 µs]
change: [-54.565% -54.342% -53.987%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward_vt/1024/16 time: [7.5126 µs 7.5287 µs 7.5420 µs]
change: [-49.884% -49.787% -49.698%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward/1024/16 time: [8.3443 µs 8.3462 µs 8.3478 µs]
change: [-27.461% -27.448% -27.436%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward_vt/1024/16 time: [8.3488 µs 8.3499 µs 8.3512 µs]
change: [-23.645% -23.601% -23.574%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward/4096/62 time: [33.557 µs 33.562 µs 33.566 µs]
change: [-55.338% -54.399% -53.716%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward_vt/4096/62 time: [34.956 µs 35.053 µs 35.149 µs]
change: [-48.215% -48.066% -47.919%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward/4096/62 time: [40.084 µs 40.110 µs 40.129 µs]
change: [-25.304% -25.241% -25.178%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward_vt/4096/62 time: [39.784 µs 39.793 µs 39.804 µs]
change: [-22.524% -22.509% -22.493%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward/4096/16 time: [35.116 µs 35.187 µs 35.243 µs]
change: [-50.887% -50.801% -50.726%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward_vt/4096/16 time: [33.574 µs 33.584 µs 33.598 µs]
change: [-49.150% -49.135% -49.119%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward/4096/16 time: [38.478 µs 38.485 µs 38.492 µs]
change: [-31.876% -29.802% -28.464%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward_vt/4096/16 time: [38.505 µs 38.513 µs 38.521 µs]
change: [-25.007% -24.990% -24.972%] (p = 0.00 < 0.05)
Performance has improved. $ cargo +nightly criterion ntt --features concrete-ntt-nightly
ntt/forward/1024/62 time: [4.0195 µs 4.0297 µs 4.0412 µs]
change: [-49.195% -49.092% -48.961%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward_vt/1024/62 time: [3.8602 µs 3.8623 µs 3.8644 µs]
change: [-51.222% -51.168% -51.109%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward/1024/62 time: [4.6452 µs 4.6559 µs 4.6658 µs]
change: [-48.186% -48.072% -47.969%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward_vt/1024/62 time: [4.4434 µs 4.4449 µs 4.4475 µs]
change: [-49.954% -49.869% -49.726%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward/1024/16 time: [4.0979 µs 4.0988 µs 4.0997 µs]
change: [-45.527% -45.114% -44.823%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward_vt/1024/16 time: [3.8746 µs 3.8767 µs 3.8787 µs]
change: [-48.519% -48.427% -48.325%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward/1024/16 time: [4.6727 µs 4.6841 µs 4.6956 µs]
change: [-43.972% -43.862% -43.766%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward_vt/1024/16 time: [4.4763 µs 4.4767 µs 4.4772 µs]
change: [-46.397% -46.388% -46.378%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward/4096/62 time: [21.465 µs 21.498 µs 21.522 µs]
change: [-35.987% -35.904% -35.851%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward_vt/4096/62 time: [20.704 µs 20.710 µs 20.717 µs]
change: [-41.233% -41.067% -40.901%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward/4096/62 time: [23.186 µs 23.228 µs 23.253 µs]
change: [-42.044% -41.976% -41.908%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward_vt/4096/62 time: [23.653 µs 23.666 µs 23.679 µs]
change: [-40.539% -40.501% -40.463%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward/4096/16 time: [18.732 µs 18.736 µs 18.741 µs]
change: [-46.796% -46.713% -46.621%] (p = 0.00 < 0.05)
Performance has improved.
ntt/forward_vt/4096/16 time: [18.746 µs 18.755 µs 18.764 µs]
change: [-44.181% -44.154% -44.126%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward/4096/16 time: [20.750 µs 20.759 µs 20.772 µs]
change: [-46.058% -45.980% -45.892%] (p = 0.00 < 0.05)
Performance has improved.
ntt/backward_vt/4096/16 time: [21.054 µs 21.056 µs 21.058 µs]
change: [-45.338% -45.325% -45.312%] (p = 0.00 < 0.05)
Performance has improved. |
@fionser I was going to ask you to review. I fully support what you propose, it would be great to merge the normalization in |
@tlepoint Does the benchmark means the concrete-ntt is slower ? I see, the nightly-concrete-ntt can support AVX512 which should be faster than the non-nightly one |
It's slower on Apple M1/M2, but it's faster on Intel with AVX2/AVX512. |
It suprised me that the concret-ntt impl is 30% slower on Apple M1/M2. Given that the concrete-ntt is quit similiar to the pointer-base implemetation in fhe.rs when n <= 1024. |
This would need to be fully investigated; but a possible reason is that this library is using "Shoup" representation for the NTT precomputed values, and limiting moduli to 62-bits maximum, which allows to speed up the base arithmetic operations. |
Acutally, the concrete-ntt also uses Shoup's mulmod trick and Harvey’s butterfly (when the prime modulus is less than 62-bit) which is what fhe.rs doing. |
I find my optimization expereience is not working on Rust/Apple :( // The shoup's mul that uses 2 double-width mul (i.e., fhe.rs)
pub const fn lazy_mul_shoup(&self, a: u64, b: u64, b_shoup: u64) -> u64 {
let q = ((a as u128) * (b_shoup as u128)) >> 64;
let r = ((a as u128) * (b as u128) - q * (self.p as u128)) as u64;
r
}
ntt/forward/4096/62 time: [14.730 µs 14.774 µs 14.845 µs]
change: [-26.838% -26.447% -26.049%] (p = 0.00 < 0.05) // The shoup's mul that uses only one double-width mul which is expected to be faster (i.e,. concrete-ntt)
// However, this code show slower in M2
pub const fn lazy_mul_shoup(&self, a: u64, b: u64, b_shoup: u64) -> u64 {
let q = (((a as u128) * (b_shoup as u128)) >> 64) as u64;
let q = q.wrapping_mul(self.p);
let r = a.wrapping_mul(b).wrapping_sub(q);
r
}
ntt/forward/4096/62 time: [20.120 µs 20.150 µs 20.187 µs]
This also match the benchmark on concrete-ntt. Thus, I would say the performance downgrade from fhe.rs to concrete-ntt is due to this mulmod function. It will be nice @sarah-ek can give us a hint why a such thing happens (I am a newbee to Rust) |
I am not sure what happens either (but I can reproduce the same slowdown when using your code). It is supposed to be compiled the same way: Is concrete-ntt faster if you work with |
Yes, the concrete-ntt is faster when doing both multiplication in u128 for Apple M2. |
We allow the use of
concrete-ntt
infhe-math
crate via the feature flagsconcrete-ntt
concrete-ntt-nightly
Benchmarks to be added to this PR.