Possible performance increase for Variogram._calc_groups() #145

Open · MartinSchobben opened this issue May 11, 2023 · 5 comments

@MartinSchobben

I ran into a performance bottleneck while dealing with a large amount of data. I wonder whether replacing the for loop in `Variogram._calc_groups()` with `numpy.digitize()` might be an option. I am not sure whether the output is in the desired format, but it would surely be faster. See below for more details and some benchmark output:

import numpy as np
import timeit

# distances
d = np.arange(0, 1e6)
# bin edges spanning the full distance range
bin_edges = np.linspace(0, d.max(), int(1e5))
 
# number of benchmark repetitions
n = 5

# for loop
def for_loop_solution():
    # -1 is the group for distances outside of maxlag
    groups = np.ones(len(d), dtype=int) * -1

    for i, bounds in enumerate(zip([0] + list(bin_edges), bin_edges)):
        groups[np.where((d >= bounds[0]) & (d < bounds[1]))] = i

    print("Object `groups`: ", groups, " of size: ", len(groups))

result = timeit.timeit(stmt='for_loop_solution()', globals=globals(), number=n)

# report the average execution time
print(f"Execution time is {result / n} seconds")

# numpy.digitize solution
def numpy_solution():
    groups = np.digitize(d, bin_edges)
    print("Object `groups`: ", groups, " of size: ", len(groups))

result = timeit.timeit(stmt='numpy_solution()', globals=globals(), number=n)

# report the average execution time
print(f"Execution time is {result / n} seconds")

Output:

Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Execution time is 103.20237362137996 seconds
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Execution time is 0.01940285782329738 seconds
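One caveat: the two solutions label out-of-range distances differently. The for loop assigns -1 to distances at or beyond the last bin edge, while numpy.digitize() puts them into the overflow bin len(bin_edges) (here 100000, visible in the last element of the outputs above). If the -1 convention should be kept, a small post-processing step restores it; a minimal sketch against the benchmark arrays above:

groups = np.digitize(d, bin_edges)
# distances at or beyond the last bin edge land in the
# overflow bin len(bin_edges); map them back to group -1
groups[groups == len(bin_edges)] = -1

With that in place, both approaches produce identical group labels for these arrays.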
@mmaelicke (Owner)

Hi Martin,

That looks more than promising. Yeah, I was always aware of that bottleneck but never found the time to solve it, so I am more than thrilled to implement your solution. I hope to do that soon. If the implementation is urgent for you, you can also file a PR, or let me know and I will try to hurry up a little bit.

Best,
Mirko

@MartinSchobben (Author)

Hi Mirko,

There is no real hurry. I was peeling apart your code as an example for an application meant for internal use at TU Wien. I would like to have variograms for ASCAT soil moisture data, but for that I need more performance, so I was profiling the code and looking to build in some memoization for, e.g., methods that contain expensive calculations (see the sketch below). That is why I came across this, and I thought I'd let you know.
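
For illustration, here is a minimal sketch of the kind of memoization I mean, using functools.cached_property; the class and attribute names are hypothetical, not scikit-gstat API:

import numpy as np
from functools import cached_property

class ExpensiveModel:
    # toy stand-in for a class with costly derived values;
    # the name and attributes are hypothetical placeholders
    def __init__(self, values):
        self._values = np.asarray(values)

    @cached_property
    def pairwise_distances(self):
        # computed once on first access, then cached on the instance
        return np.abs(self._values[:, None] - self._values[None, :])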

Perhaps I can show you this code some time, also to check whether credit is due somewhere in case it goes public.

I would also be happy to try and turn the above into a PR.

Kind regards,
Martin

@mmaelicke (Owner)

Yeah, nice. Then I guess the best way to proceed is that you open a PR with your code, which also makes you a collaborator (and co-author of the Zenodo publication). That way credit is attributed clearly and transparently. I can also assist with the PR, if you feel like I can be helpful.

@MartinSchobben (Author)

Yes, and I should probably list you as a contributor (or cite your work) on my ASCAT application if it goes public, as I am relying heavily on your knowledge of spatial autocorrelation.

@mmaelicke (Owner)


Just to let you know: there is also a model description paper in GMD, which you can cite if you use the package.
DOI: 10.5194/gmd-15-2505-2022
URL: https://gmd.copernicus.org/articles/15/2505/2022/gmd-15-2505-2022.html

MartinSchobben added a commit to MartinSchobben/scikit-gstat that referenced this issue Jul 6, 2023