Synergies with LiberTEM, UDFs vs. map_blocks #737
A simple example with `_process_dask_array`:

```python
import hyperspy.api as hs
from scipy.ndimage import gaussian_filter
from pyxem.utils.dask_tools import _process_dask_array

s = hs.load("test_data.hspy", lazy=True)  # 4-dimensional dataset
# Get a dask array as output, retaining the chunk sizes
data_gaussian = _process_dask_array(s.data, gaussian_filter, sigma=20)
```

It can easily be extended to a […]. One advantage is that the result plugs straight back into the lazy signal workflow:

```python
import pyxem as pxm

s_gaussian = pxm.signals.Diffraction2D(data_gaussian).as_lazy()
s_gaussian.unit = "2th_deg"
s_gaussian.set_ai()
s_ai = s_gaussian.get_azimuthal_integral1d(npt=100)  # Still a lazy signal
s_ai.save("test_azimuthal.hspy")
```
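For comparison, here is a minimal sketch of the same frame-wise filtering written directly against `dask.array.map_blocks` (the array shape, chunking and sigma value here are made up for illustration; the key assumption is that each chunk contains whole detector frames, i.e. the signal axes are not split across chunks):

```python
import dask.array as da
import numpy as np
from scipy.ndimage import gaussian_filter


def _filter_block(block, sigma):
    # Apply the 2D filter to every frame in a 4D block independently.
    out = np.empty_like(block, dtype=np.float64)
    for idx in np.ndindex(block.shape[:-2]):
        out[idx] = gaussian_filter(block[idx], sigma=sigma)
    return out


# Toy stand-in for a lazy 4D-STEM dataset; each chunk holds whole frames.
data = da.from_array(
    np.random.default_rng(0).random((4, 4, 64, 64)),
    chunks=(2, 2, 64, 64),
)
filtered = data.map_blocks(_filter_block, sigma=2, dtype=np.float64)
result = filtered.compute()
```

This is the "function per frame, then a loop over frames, then a wrapper" layering discussed below; `_process_dask_array` exists precisely to hide that boilerplate.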
There isn't really a short answer to these questions, but the last time Dieter and I spoke, the conclusion we reached was that there are a lot of unexploited synergies between the work being done with LiberTEM and with hyperspy and pyxem. I think in the short-to-medium term they (LiberTEM) were hoping to improve hyperspy's lazy signal handling. As I've recently discovered, @CSSFrancis also has pull requests open with hyperspy to a similar effect (upstreaming some of our dask processing code), and that seems to be the direction things are going. In terms of the use case @din14970 is describing: if the functionality were rewritten as a LiberTEM UDF, I would be very happy to merge it into pyxem. I have no objections to depending on LiberTEM (but similarly, there is little enthusiasm from either us or LiberTEM to "merge" the two packages into one), and it may well be faster than using dask for most users (although I've not personally run experiments to confirm this).
@magnunor Actually wasn't aware of
As far as I understood, LiberTEM is seeing themselves less and less as a 4D-STEM analysis package, as they are trying to focus on the speedy calculation aspect and trying to push the application specific stuff out to Hyperspy/Pyxem.
Yes, that's certainly my impression, and it seems like an excellent way of allocating the skills we have at present in the community(ies).
We have plans to try out LiberTEM's UDFs in kikuchipy as well (pyxem/kikuchipy#321), so I'm interested in seeing how it goes, @din14970! As Magnus mentions, the "interactivity" of dask array signals is really nice and practical. I'm thinking that using a combination of dask and LiberTEM might be good, in that LiberTEM can do some specialized computations, while dask does all the rest.
With regards to
Give it a shot! I'm very curious to see how it handles it. One thing to keep in mind when benchmarking very big datasets is to do the "full" processing, from the hard drive to the end result. It is sometimes possible to over-optimize one part of the processing while the real bottleneck is somewhere else.
I did a little work on this earlier, with […]
It's been a little while since I've looked at LiberTEM, but it does seem like they have a much better handle on dealing with large datasets than hyperspy does (partially because I don't think hyperspy has really had to deal with them until recently). For operating using multiple cores on a cluster, I've recently been playing around with dask-jobqueue, which might be of interest if you are trying to schedule jobs on a cluster through dask. I wouldn't say that hyperspy/hyperspy#2617 is going to completely fix the map function in hyperspy, but I think it is a good start. I'll try to benchmark its ability to run in parallel, hopefully later today.
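For reference, a minimal dask-jobqueue configuration sketch for a SLURM cluster (the partition name, core/memory figures and walltime are placeholders; substitute whatever your cluster actually provides):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each SLURM job contributes one dask worker with these resources.
cluster = SLURMCluster(
    queue="normal",       # placeholder partition name
    cores=8,
    memory="32GB",
    walltime="01:00:00",
)
cluster.scale(jobs=4)     # ask the scheduler for 4 such jobs

client = Client(cluster)
# Any dask collection computed from here on (e.g. hyperspy lazy signals
# backed by dask arrays) is scheduled across the cluster workers.
```

The nice part of this design is that the same lazy workflow runs unchanged on a laptop (with a `LocalCluster`) or on the batch system; only the cluster object differs.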
On this topic, dask seems to be playing very well with RAPIDS and the UCX framework: https://medium.com/rapids-ai/high-performance-python-communication-with-ucx-py-221ac9623a6a Also, NEP 35 will make it very convenient to write code which works for cupy, numpy, dask/numpy or dask/cupy arrays. I have started experimenting with a branch of hyperspy which supports cupy arrays, and it works well without changing much of the code.
As @CSSFrancis said, this shouldn't be necessary if the hyperspy map function were improved.
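To make the NEP 35 point concrete, here is a small sketch (my own illustrative helper, not hyperspy code): with the `like=` keyword on array-creation functions (NumPy >= 1.20), one function body can return whichever array type it was given, so the same code serves NumPy and CuPy inputs without an explicit module switch:

```python
import numpy as np


def radial_grid(shape, like):
    """Distance-from-centre map matching the array type of ``like``."""
    # NEP 35: ``like=`` dispatches array creation to the module of
    # ``like``, so these return CuPy arrays when ``like`` is a CuPy array.
    y = np.arange(shape[0], like=like)[:, None]
    x = np.arange(shape[1], like=like)[None, :]
    cy, cx = (shape[0] - 1) / 2, (shape[1] - 1) / 2
    return np.sqrt((y - cy) ** 2 + (x - cx) ** 2)


frame = np.ones((5, 5))
r = radial_grid(frame.shape, like=frame)
```

The same `radial_grid` called with a `cupy` array as `like` would build the grid on the GPU, which is exactly the "write once, run on numpy or cupy" convenience mentioned above.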
Although this was never resolved, I am closing as old+discussion.
I'm posting this mainly as a point of discussion.
In my experience, indexing real and large datasets (whichever method is used) is still quite slow, even with what I implemented in #673. These things should not take more than a few minutes on a cluster if the functions could make use of GPUs, which is something I'd like to look into, e.g. with CuPy. More than that, however, I feel the functionality we write here to perform operations on datasets using `dask.array.map_blocks` requires a lot of boilerplate: first write a function to operate on a frame, then a function that loops over all the frames, then wrap that in another thing. After a discussion with Dieter Weber I recently started looking into LiberTEM's UDF API, which seems like a much more natural, granular and scalable approach to operating on large datasets. I'm in the process of experimenting with this to see if I can implement "fast" template matching as a LiberTEM UDF, hopefully also with CUDA support, so we could run it on our GPU cluster. I'm wondering if there are any longer-term ideas about using the LiberTEM API inside pyxem or porting the pyxem functionality to LiberTEM. Hence opening the issue here for discussion.
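To give a flavour of the contrast, here is a hedged sketch of what a simple reduction looks like as a LiberTEM UDF (an illustrative maximum-over-frames reduction, not the template-matching code under discussion; assumes LiberTEM is installed). Instead of the frame-function / loop / wrapper layering, a single class declares its result buffers, processes one frame at a time, and merges partial results from workers:

```python
import numpy as np
from libertem.udf import UDF


class MaxFrameUDF(UDF):
    def get_result_buffers(self):
        # One detector-shaped ("sig") buffer holding the running maximum.
        # Buffers start zeroed, which is fine for non-negative counts.
        return {"maximum": self.buffer(kind="sig", dtype=np.float32)}

    def process_frame(self, frame):
        # Called once per diffraction frame; LiberTEM handles the tiling.
        np.maximum(self.results.maximum, frame, out=self.results.maximum)

    def merge(self, dest, src):
        # Combine partial results computed on different partitions/workers.
        np.maximum(dest.maximum, src.maximum, out=dest.maximum)
```

This would then be run with something like `ctx = libertem.api.Context()` followed by `ctx.run_udf(dataset=ds, udf=MaxFrameUDF())`, with LiberTEM taking care of scheduling and, with a suitable backend, GPU execution.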