Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD Vectorization - Use Integer Fused Multiply-Add (AVX512) #427

Open
mratsim opened this issue Jul 12, 2024 · 2 comments
Open

SIMD Vectorization - Use Integer Fused Multiply-Add (AVX512) #427

mratsim opened this issue Jul 12, 2024 · 2 comments

Comments

@mratsim
Copy link
Owner

mratsim commented Jul 12, 2024

It might be quite interesting to explore SIMD vectorization for elliptic curves and MSMs. This might significantly speed-up:

  • Verkle Trees
  • KZG
  • MSM

without needing a GPU. Ideally the same optimizations are written in "portable" intrinsics so that can be used with ARM Neon as well.
Speedup can be done either horizontally, i.e. processing 8 points in parallel (8*64 = 512) or within some subroutines try to use as much parallelism as possible, i.e. on Fp2 or at the Elliptic curve level.

This probably requires full support for signed unsaturated arithmetic which is partially supported here:

CPU support:

While recent Intel CPUs (AlderLake and later) don't support AVX512, it might be that they support the 256-bit version of IFMA

Papers:

@mratsim
Copy link
Owner Author

mratsim commented Aug 14, 2024

Update following discussion with @Vindaar

Given that our numbers are actually small in size, just 381-bit in production, AVX512-IFMA is actually unnecessary:

Instead we should just use addcarry/subborrow/extended_multiplication and process 4, 8 or 16 field elements at once.

@mratsim
Copy link
Owner Author

mratsim commented Aug 14, 2024

Actually there is no addcarry so the sequence will be similar to the following

proc addcarry*(bld: BuilderRef, a, b, carryIn: ValueRef): tuple[carryOut, r: ValueRef] =
## (cOut, result) <- a+b+cIn
let ty = a.getTypeOf()
let add = bld.add(a, b, name = "adc01_")
let carry0 = bld.icmp(kULT, add, b, name = "adc01c_")
let cIn = bld.zext(carryIn, ty, name = "adc2_")
let adc = bld.add(cIn, add, name = "adc_")
let carry1 = bld.icmp(kULT, adc, add, name = "adc012c_")
let carryOut = bld.`or`(carry0, carry1, name = "cOut_")
return (carryOut, adc)

So 2 addition, 2 comparison, 1 or per addcarry, this is too costly and unsaturated arithmetic will be necessary, so IFMA is back on the table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant