High-order derivatives in Enzyme are slower than finite differences #2252

Open
Arpit-Babbar opened this issue Jan 6, 2025 · 3 comments

@Arpit-Babbar

Arpit-Babbar commented Jan 6, 2025

Thank you for adding the capability to compute high-order derivatives with Enzyme in #2161!

I benchmarked the performance of Enzyme against a finite difference method. For orders greater than 2, finite differences are faster: 1.7 times for order 3 and 2.4 times for order 4 (mean time of 8 ns for the third-order finite difference versus 14 ns for Enzyme, and 11 ns versus 27 ns for the fourth order). I am sharing my benchmarking code for orders 3 and 4 in case it leads to any ideas for improvement.

Benchmarking code for third order
using BenchmarkTools: @benchmark
using StaticArrays
using Enzyme

# Flux function to be differentiated

@inline function flux(u)
    rho, rho_v1, rho_v2, rho_e = u
    gamma = 1.4
    v1 = rho_v1 / rho
    v2 = rho_v2 / rho
    p = (gamma - 1) * (rho_e - 0.5f0 * (rho_v1 * v1 + rho_v2 * v2))
    f1 = rho_v1
    f2 = rho_v1 * v1 + p
    f3 = rho_v1 * v2
    f4 = (rho_e + p) * v1
    return SVector(f1, f2, f3, f4)
end

# Third order finite difference derivative

function third_derivative_fd(u, du, ddu, dddu)
    factor = 0.5
    df = factor * (flux(u + 2.0 * du + 2.0 * ddu + 4.0/3.0 * dddu)
                   - 2.0 * flux(u + du + 0.5 * ddu + (1.0/6.0) * dddu)
                   + 2.0 * flux(u - du + 0.5 * ddu - (1.0/6.0) * dddu)
                   - flux(u - 2.0 * du + 2.0 * ddu - 4.0/3.0 * dddu))
    return df
end
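
# Note: the stencil above is the standard central-difference formula for the third
# derivative of t -> flux(u + t*du + t^2/2*ddu + t^3/6*dddu) at t = 0 with step h = 1,
# i.e. (f(2h) - 2f(h) + 2f(-h) - f(-2h)) / (2h^3).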

# AD to compute derivatives

dg_ad(x, dx) = autodiff(Forward, flux, DuplicatedNoNeed(x, dx))[1]
ddg_ad(x, dx, ddx) = autodiff(Forward, dg_ad, DuplicatedNoNeed(x, dx),
                              DuplicatedNoNeed(dx, ddx))[1]
dddg_ad(x, dx, ddx, dddx) = autodiff(Forward, ddg_ad, DuplicatedNoNeed(x, dx),
                                    DuplicatedNoNeed(dx, ddx), DuplicatedNoNeed(ddx, dddx))[1]
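
# Note: dg_ad(x, dx) is the directional derivative flux'(x)[dx]; differentiating it
# again with tangents (dx, ddx) gives flux''(x)[dx, dx] + flux'(x)[ddx], and one more
# level yields flux'''[dx, dx, dx] + 3 flux''[dx, ddx] + flux'[dddx], i.e. the same
# third derivative of t -> flux(u(t)) that the stencil above approximates.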

# Random inputs

u = SVector(1.0, -0.1, 0.2, 2.0)
du, ddu, dddu, ddddu = (1e-3*SVector(rand(4)...) for _ in 1:4)

@info "Third derivative"
@info "FD"
display(@benchmark third_derivative_fd($u, $du, $ddu, $dddu))

@info "Enzyme"

display(@benchmark dddg_ad($u, $du, $ddu, $dddu))

Benchmarking code for fourth order
using BenchmarkTools: @benchmark
using StaticArrays
using Enzyme

# Flux function to be differentiated

@inline function flux(u)
    rho, rho_v1, rho_v2, rho_e = u
    gamma = 1.4
    v1 = rho_v1 / rho
    v2 = rho_v2 / rho
    p = (gamma - 1) * (rho_e - 0.5f0 * (rho_v1 * v1 + rho_v2 * v2))
    f1 = rho_v1
    f2 = rho_v1 * v1 + p
    f3 = rho_v1 * v2
    f4 = (rho_e + p) * v1
    return SVector(f1, f2, f3, f4)
end

# Fourth order finite difference derivative

function fourth_derivative_fd(u, du, ddu, dddu, ddddu)
    df = (
          flux(u + 2.0 * du + 2.0 * ddu + 4.0/3.0 * dddu + 2.0/3.0 * ddddu)
         - 4.0 * flux(u + du + 0.5 * ddu + 1.0/6.0 * dddu + 1.0/24.0 * ddddu)
         + 6.0 * flux(u)
         - 4.0 * flux(u - du + 0.5 * ddu - 1.0/6.0 * dddu + 1.0/24.0 * ddddu)
         + flux(u - 2.0 * du + 2.0 * ddu - 4.0/3.0 * dddu + 2.0/3.0 * ddddu)
         )
    return df
end
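
# Note: this is the analogous central stencil for the fourth derivative of
# t -> flux(u + t*du + t^2/2*ddu + t^3/6*dddu + t^4/24*ddddu) at t = 0 with h = 1,
# i.e. (f(2h) - 4f(h) + 6f(0) - 4f(-h) + f(-2h)) / h^4.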


# AD to compute derivatives

dg_ad(x, dx) = autodiff(Forward, flux, DuplicatedNoNeed(x, dx))[1]
ddg_ad(x, dx, ddx) = autodiff(Forward, dg_ad, DuplicatedNoNeed(x, dx),
                              DuplicatedNoNeed(dx, ddx))[1]
dddg_ad(x, dx, ddx, dddx) = autodiff(Forward, ddg_ad, DuplicatedNoNeed(x, dx),
                                    DuplicatedNoNeed(dx, ddx), DuplicatedNoNeed(ddx, dddx))[1]
ddddg_ad(x, dx, ddx, dddx, ddddx) = autodiff(Forward, dddg_ad, DuplicatedNoNeed(x, dx),
                                             DuplicatedNoNeed(dx, ddx),
                                             DuplicatedNoNeed(ddx, dddx),
                                             DuplicatedNoNeed(dddx, ddddx))[1]

# Random inputs

u = SVector(1.0, -0.1, 0.2, 2.0)
du, ddu, dddu, ddddu = (1e-3*SVector(rand(4)...) for _ in 1:4)

@info "Fourth derivative"
@info "FD"
display(@benchmark fourth_derivative_fd($u, $du, $ddu, $dddu, $ddddu))

@info "Enzyme"

display(@benchmark ddddg_ad($u, $du, $ddu, $dddu, $ddddu))

Benchmarking results for third order
[ Info: Third derivative
[ Info: FD
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.924 ns … 20.604 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.008 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.140 ns ±  0.710 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▅▃▃▂     ▃                                            ▁▁ ▂
  ██████▇▆▆▅▆█▆▆▆▅▃█▇▄▅▅▄▅▄▅▄▃▅▅▁▅▄▄▆▄▃▅▄▃▃▁▄▅▅▃▄▃▄▄▄▅▆▅▆▇██ █
  7.92 ns      Histogram: log(frequency) by time     11.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Enzyme
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  13.639 ns … 26.652 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.722 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.037 ns ±  0.980 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▃ ▂     ▁▅▂▁                                              ▁
  ███▇█▇▆▆▃▆████▇▅▄▅▇▅▅▆▆▅▄▅▄▁▄▄▅▅▅▄▆▄▆▃▅▃▄▅▄▅▅▆▅▆▄▅▅▄▅▄▅▅▄▆█ █
  13.6 ns      Histogram: log(frequency) by time      19.1 ns <

Benchmarking results for fourth order
[ Info: Fourth derivative
[ Info: FD
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  10.927 ns … 83.625 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.304 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.644 ns ±  2.999 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▃█▂    █▆▃▃▂▁▁▁ ▁▁▁▂▁▁▁▂▁▂▄▂▁▁                             ▂
  ███████████████████████████████▅▅▄▄▅▄▅▅▄▅▄▄▄▄▄▄▅▃▄▁▄▁▄▆▅▅▆▅ █
  10.9 ns      Histogram: log(frequency) by time      20.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Enzyme
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  26.230 ns … 43.257 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.397 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.774 ns ±  1.646 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆  ▁                                                       ▁
  ███▆█▇▆▅▅▆▆▆▆▆▅▅▄▅▆▇▇▆▅▆▅▅▅▅▆▅▅▄▄▄▄▅▅▅▄▅▅▄▆▅▅▆▅▄▄▅▅▅▄▅▄▅▃▄▅ █
  26.2 ns      Histogram: log(frequency) by time      36.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were generated with Enzyme v0.13.24 on an Apple M3 Pro with Julia 1.11.2.

versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 12 × Apple M3 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 6 virtual cores)

Here is a gist that checks that the third- and fourth-order derivative computations above are correct using a polynomial test case. The gist also contains the computation and benchmarking of all derivatives up to order four.
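
For illustration, here is a minimal sketch of such a check (an illustrative example, not the code from the gist, reusing the nested forward-mode pattern from above): for the quadratic map g(u) = u .* u, the third derivative along u(t) = u + t*du + t^2/2*ddu + t^3/6*dddu is 2 .* (u .* dddu .+ 3 .* du .* ddu) in closed form, and the nested AD should reproduce it to machine precision.

using StaticArrays, Enzyme

# Quadratic test map whose derivatives are known in closed form
g(u) = u .* u
dg(x, dx) = autodiff(Forward, g, DuplicatedNoNeed(x, dx))[1]
ddg(x, dx, ddx) = autodiff(Forward, dg, DuplicatedNoNeed(x, dx), DuplicatedNoNeed(dx, ddx))[1]
dddg(x, dx, ddx, dddx) = autodiff(Forward, ddg, DuplicatedNoNeed(x, dx),
                                  DuplicatedNoNeed(dx, ddx), DuplicatedNoNeed(ddx, dddx))[1]

u0 = SVector(1.0, -0.1, 0.2, 2.0)
du0, ddu0, dddu0 = (1e-3 * SVector(rand(4)...) for _ in 1:3)

# Exact value: g'''[a,b,c] = 0, g''[a,b] = 2*a.*b, g'[a] = 2*u.*a
exact = 2 .* (u0 .* dddu0 .+ 3 .* du0 .* ddu0)
@assert dddg(u0, du0, ddu0, dddu0) ≈ exact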

@vchuravy
Member

vchuravy commented Jan 6, 2025

Are you on an Apple M3/M4?

On my system there is an overhead, but it is much smaller.

[ Info: FD

BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  25.681 ns … 155.876 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.882 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.268 ns ±   4.361 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▃                                                          ▁
  ███▇▅▃▄▅▅▅▄▄▁▃▃▄▃▄▄▄▁▁▃▃▄▃▄▄▃▁▃▄▄▅▄▅▃▄▄▄▅▅▄▃▅▅▄▅▁▃▄▃▃▄▅▅▆▆▇▇ █
  25.7 ns       Histogram: log(frequency) by time      34.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

[ Info: Enzyme

BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  29.251 ns … 213.054 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.805 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.373 ns ±   6.794 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▃▃▄▃▃▃▃▁ ▁▃▂▂▂▂▂▂▂▂▂▁        ▁▁        ▁                   ▁
  ███████████████████████▇█▆▇▇▆▅████▆▇▇▇▄▇███▇▄█▄▇▅▂▂▃▅▄▃▄▂▄▄▄ █
  29.3 ns       Histogram: log(frequency) by time      49.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@vchuravy
Member

vchuravy commented Jan 6, 2025

For the fourth derivative, I start to see the overhead grow:

[ Info: Fourth derivative

[ Info: FD

BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  28.631 ns … 120.559 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.551 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.569 ns ±   2.216 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂  ▃▃█▅ ▃▃▁▃▃▁▃▁▃▃ ▃▂▁▃▁▂▂ ▂▂▂ ▃▂▁▁                          ▂
  ██████████████████████████████▇████▆▅▅▅▅▅▅▃▄▅▅▅▅▅▄▄▁▅▃▅▃▅▄▅█ █
  28.6 ns       Histogram: log(frequency) by time      38.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

[ Info: Enzyme

BenchmarkTools.Trial: 10000 samples with 982 evaluations.
 Range (min … max):  53.267 ns … 92.200 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     54.267 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   54.501 ns ±  1.301 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        █▄▂▃                                                   
  ▂▂▂▂▃▆████▅▄▃▂▂▃▄▂▂▃▁▁▁▂▁▂▂▂▂▂▂▂▁▁▁▁▁▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  53.3 ns         Histogram: frequency by time        60.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@Arpit-Babbar
Author

> Are you on an Apple M3/M4?
>
> On my system there is an overhead, but it is much smaller.

I am using an Apple M3 Pro. I have updated my first post with that information, along with the code and results for the fourth-order derivative.
