
consider whether the big array fwbatch should be malloced at plan or at execute. #480

Open
ahbarnett opened this issue Jul 11, 2024 · 9 comments


@ahbarnett
Collaborator

ahbarnett commented Jul 11, 2024

Martin's HVOX users want a small plan object (not sure why, since it's going to get malloc'ed anyway)...
discuss whether to malloc in each execute call instead, as Martin was trying to do in the DUCC-FFT PR #463 (negligible overhead).

@mreineck
Collaborator

To set the record straight: I misinterpreted the remarks I heard from the HVOX authors (sorry about that, and thanks again for the correction, @SepandKashani!). In the most recent version of their particular application the memory overhead doesn't seem to matter any more.

The situation I imagine is the following:

Assume you want to perform an approximate inverse NUFFT from a large set of nonuniform points to a large 3D grid, using CG iteration. Then (as far as I know, but I may be wrong) two finufft plans are needed, one type 1 and one type 2, which are executed alternately in every CG iteration. Both plans allocate memory for a copy of the fine grid, but at any point in time only one of them is needed, which, depending on the problem size, can become a huge overhead.
When FFTW is used, this is a bit tricky to avoid, since an FFTW plan typically stores a pointer to its input and output, so that memory has to have the same lifetime as the FFTW plan. For ducc's FFT this is not the case, so the fine grid could be allocated "on demand" while a particular FINUFFT plan is executed.
I'd consider this a "low-hanging fruit" quality-of-implementation issue, but if the goal is absolute maximum performance, I agree it is better to allocate once during plan generation and avoid the overhead of allocation/deallocation during every plan execution.
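To make the memory picture concrete, here is a minimal sketch of that two-plan pattern using the FINUFFT guru interface (the CG body is omitted; all sizes, names, and iflag choices are illustrative). Both plans, and hence two copies of the upsampled fine grid, stay alive for the entire solve:

```cpp
#include <complex>
#include <cstdint>
#include <vector>
#include <finufft.h>

// Sketch of the two-plan pattern: an iterative (CG-like) solve that keeps a
// type-1 and a type-2 plan alive for the whole solve, so two copies of the
// upsampled fine grid are allocated at once even though only one is in use
// at any moment.
void cg_sketch(int64_t M, double* x, double* y, double* z,
               int64_t N1, int64_t N2, int64_t N3, double tol) {
  int64_t nmodes[3] = {N1, N2, N3};
  finufft_plan p1, p2;
  finufft_makeplan(1, 3, nmodes, +1, 1, tol, &p1, nullptr);  // type 1: NU pts -> modes
  finufft_makeplan(2, 3, nmodes, -1, 1, tol, &p2, nullptr);  // type 2: modes -> NU pts
  finufft_setpts(p1, M, x, y, z, 0, nullptr, nullptr, nullptr);
  finufft_setpts(p2, M, x, y, z, 0, nullptr, nullptr, nullptr);

  std::vector<std::complex<double>> c(M), f(N1 * N2 * N3);
  for (int iter = 0; iter < 50; ++iter) {      // CG body omitted
    finufft_execute(p2, c.data(), f.data());   // apply A   (type 2)
    finufft_execute(p1, c.data(), f.data());   // apply A^H (type 1)
    // ... CG updates on c and f ...
  }
  finufft_destroy(p1);
  finufft_destroy(p2);
}
```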

@SepandKashani
Contributor

Actually, if you want to solve an optimization problem of the form $x^{*} = \arg\min_{x} \lVert y - A x \rVert^{2}$ (or similar) using first-order methods where $A$ is an NUFFT (of either type 1/2/3), then a single NUFFT plan suffices to evaluate $A x$ and $A^{H} y$ needed by these methods. CG falls squarely in this category.
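For concreteness (taking $A$ to be a type-2 NUFFT; the same argument works for the other types), the adjoint is simply the opposite-sign type-1 transform, so no separately planned transform is needed:

$$(A f)_j = \sum_{k} f_k \, e^{+i k \cdot x_j} \quad\text{(type 2)}, \qquad (A^{H} c)_k = \sum_{j} c_j \, e^{-i k \cdot x_j} \quad\text{(type 1, opposite sign)}.$$

Both directions use the same nonuniform points, the same spreading kernel, and the same fine-grid scratch, which is why one plan can serve for $A x$ and $A^{H} y$.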

I don't have this written up yet, but you can monitor this repo over the next two weeks; it will show how to achieve this.

@ahbarnett
Collaborator Author

ahbarnett commented Jul 16, 2024 via email

@SepandKashani
Contributor

My apologies for the delay: putting everything together took longer than expected.

> It is interesting that you can reuse one plan for t1 and t2... this definitely counts as a hack that we wouldn't expect others to know about :) (how do you change the plan type after planning, given the plan is a blind pointer?)

How to concretely re-use a T1 plan to perform a T2 (or a T3 to perform $T3^{*}$) depends on how the NUFFT is implemented.
FINUFFT doesn't currently allow you to do this natively via the API, but adding it is quite simple. [Not sure about the blind pointer issue you mentioned above.]
Moreover, it is useful when performing NUFFTs as part of a large optimization problem, since it effectively halves the memory use by storing only one plan.
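Purely to illustrate what "adding this is quite simple" could look like, a hypothetical sketch (the function finufft_execute_adjoint below is not part of the API discussed here; names and shapes are placeholders):

```cpp
#include <complex>
#include <cstdint>
#include <vector>
#include <finufft.h>

// Hypothetical only: finufft_execute_adjoint is NOT part of the API discussed
// in this thread. It would apply the adjoint of the planned transform (e.g.
// the opposite-sign type 1 for a type-2 plan), reusing the plan's sorted
// points, kernel data and fine-grid scratch.
int finufft_execute_adjoint(finufft_plan plan,
                            std::complex<double>* cj,  // values at NU points
                            std::complex<double>* fk); // uniform modes

// With such an entry point, a gradient-based solver needs only one plan
// (and hence only one plan-sized allocation) to apply both A and A^H:
void solve_sketch(finufft_plan p, int64_t M, int64_t Ntot) {
  std::vector<std::complex<double>> c(M), f(Ntot);
  for (int iter = 0; iter < 50; ++iter) {
    finufft_execute(p, c.data(), f.data());          // apply A
    finufft_execute_adjoint(p, c.data(), f.data());  // apply A^H
    // ... solver updates ...
  }
}
```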

The implementations in the repo above use DUCC since the number of transforms may vary at runtime, but the same can be done with FFTW via up-front planning if desired.

Note that this version does not use FINUFFT, since the hierarchical plan trick requires setting ($\gamma, S_{i}, X_{i}$), in FINUFFT's terminology, a bit differently; nevertheless I have a good idea of how to add this support to FINUFFT too if there is interest.

In any case, I'm happy to contribute to FINUFFT once the tide settles a bit on my side.

@ahbarnett
Collaborator Author

ahbarnett commented Aug 13, 2024 via email

@ahbarnett
Collaborator Author

Martin Reinecke brought this up again by email. Some of that discussion:

Your other comment:

> Independently, I would also reconsider whether “fwBatch” should be kept
> allocated outside of plan construction (with FFTW only) and execution.
> This also saves a lot of memory.

I didn't understand. Are you proposing that fwBatch not be allocated at
the plan stage, but rather at the execute stage?

Martin's reply:

> Yes, I'd argue for allocating scratch space (which this is) only during
> the time it is really needed. When using FFTW, it is needed in the
> constructor as well, to allow planning, but it doesn't hold meaningful
> values between user calls. We would need to adjust the FFTW invocations a
> little bit, but that's easy to do.
> I'm aware that repeated allocation costs time, but on the other side of
> the balance there is a "waste" of potentially gigabytes of RAM.
> Assuming that a user uses N different libraries in their code: if these
> libraries all go for the "use more RAM" strategy, this will lead to
> trouble as N increases.
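On "we would need to adjust the FFTW invocations a little bit": FFTW's new-array execute interface (fftw_execute_dft) lets a plan created against a temporary buffer be run later on a freshly allocated, equally aligned buffer, so the fine-grid scratch need not outlive the execute call. A rough sketch of the idea, not FINUFFT's actual internals (a 1D grid of nf points, illustrative flags):

```cpp
#include <fftw3.h>

// Sketch of allocate-at-execute with FFTW (1D fine grid of nf points,
// illustrative flags). The plan is created once against a temporary buffer;
// the real scratch is allocated and freed inside each execute call via
// FFTW's new-array execute function fftw_execute_dft.
struct FftStage {
  fftw_plan plan;
  int nf;
  explicit FftStage(int nf_) : nf(nf_) {
    fftw_complex* tmp = fftw_alloc_complex(nf);  // needed only for planning
    plan = fftw_plan_dft_1d(nf, tmp, tmp, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_free(tmp);                              // scratch is not kept
  }
  void execute_once() {
    fftw_complex* fw = fftw_alloc_complex(nf);   // just-in-time fine grid
    // ... spread nonuniform data into fw ...
    fftw_execute_dft(plan, fw, fw);              // new-array execute, in place
    // ... interpolate / copy results out of fw ...
    fftw_free(fw);                               // freed between user calls
  }
  ~FftStage() { fftw_destroy_plan(plan); }
};
```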

@mreineck
Collaborator

One more argument for just-in-time allocation: it would allow one and the same Finufft plan to be called concurrently from different threads. More technically: a Finufft plan would be perfectly immutable between construction and destruction. FFTW also gives this kind of guarantee for its own plans.
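To illustrate what this would enable, a sketch of concurrent execution, assuming finufft_execute were strictly read-only on the plan (exactly the property under discussion; it does not hold if the plan owns a single fwBatch buffer):

```cpp
#include <complex>
#include <cstdint>
#include <thread>
#include <vector>
#include <finufft.h>

// Sketch: several threads drive the same plan on different data, each with
// its own input/output arrays. This is safe only if finufft_execute is
// strictly read-only on the plan, i.e. if scratch such as fwBatch is
// allocated per call rather than stored in the plan.
void run_concurrently(finufft_plan p, int64_t M, int64_t Ntot, int nthreads) {
  std::vector<std::thread> workers;
  for (int t = 0; t < nthreads; ++t) {
    workers.emplace_back([&] {
      std::vector<std::complex<double>> c(M), f(Ntot);
      // ... fill c (type 1) or f (type 2) with this thread's data ...
      finufft_execute(p, c.data(), f.data());
    });
  }
  for (auto& w : workers) w.join();
}
```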

@ahbarnett
Collaborator Author

ahbarnett commented Feb 17, 2025 via email

@mreineck
Collaborator

Hi Alex, thanks for the response!

You are right, I screwed up my remark about immutability... in my head setpts will always be part of the "planning" stage, and plans with different point locations are different plans to me. Of course that doesn't match finufft's implementation and terminology, so my statement was simply wrong.
Finufft plans are immutable as long as setpts isn't called. I think (but my intuition might be wrong) that setpts will be called only once in a plan's lifetime in most applications, so we get immutability as soon as that call is finished. I can try to demonstrate parallel invocation in this context; I will work on a demo over the next few days!
