Possible performance increase for Variogram._calc_groups() #145

Open · MartinSchobben opened this issue May 11, 2023 · 5 comments

@MartinSchobben

I ran into a performance bottleneck while dealing with a large amount of data. I wonder whether replacing the for loop in `Variogram._calc_groups()` with `numpy.digitize()` might be an option. I am not sure whether the output is in the desired format, but it would surely be faster. See below for more details and some benchmark output:

import numpy as np
import timeit

# distances
d = np.arange(0, 1e6)
# bin edges spanning the full distance range
bin_edges = np.linspace(0, d.max(), int(1e5))
 
# number of benchmark repetitions
n = 5

# for loop
def for_loop_solution():
    # -1 is the group for distances outside of maxlag
    groups = np.ones(len(d), dtype=int) * -1

    for i, bounds in enumerate(zip([0] + list(bin_edges), bin_edges)):
        groups[np.where((d >= bounds[0]) & (d < bounds[1]))] = i

    print("Object `groups`: ", groups, " of size: ", len(groups))

result = timeit.timeit(stmt='for_loop_solution()', globals=globals(), number=n)

# report the average execution time
print(f"Execution time is {result / n} seconds")

# numpy.digitize solution
def numpy_solution():
    groups = np.digitize(d, bin_edges)
    print("Object `groups`: ", groups, " of size: ", len(groups))

result = timeit.timeit(stmt='numpy_solution()', globals=globals(), number=n)

# report the average execution time
print(f"Execution time is {result / n} seconds")

Output:

Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Object `groups`:  [    1     1     1 ... 99999 99999    -1]  of size:  1000000
Execution time is 103.20237362137996 seconds
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Object `groups`:  [     1      1      1 ...  99999  99999 100000]  of size:  1000000
Execution time is 0.01940285782329738 seconds
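One caveat: the two solutions label out-of-range distances differently. The for loop assigns -1 to distances at or beyond the last bin edge, while numpy.digitize() puts them into the overflow bin len(bin_edges) (here 100000, visible in the last element of the outputs above). If the -1 convention should be kept, a small post-processing step restores it; a minimal sketch against the benchmark arrays above:

groups = np.digitize(d, bin_edges)
# distances at or beyond the last bin edge land in the
# overflow bin len(bin_edges); map them back to group -1
groups[groups == len(bin_edges)] = -1

With that in place, both approaches produce identical group labels for these arrays.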
@mmaelicke (Owner)

Hi Martin,

That looks more than promising. Yeah, I was always aware of that bottleneck but never found the time to solve it, so I am more than thrilled to implement your solution. I hope to do that soon. If the implementation is urgent for you, you can also file a PR, or let me know and I will try to hurry up a little bit.

Best,
Mirko

@MartinSchobben (Author)

Hi Mirko,

There is no real hurry. I was peeling apart your code as an example for an application meant for internal use at TU Wien. I would like to have variograms for ASCAT soil moisture data, but for that I need more performance, so I was profiling the code and looking to build in some memoization for, e.g., methods that contain expensive calculations (see the sketch below). That is why I came across this, and I thought I'd let you know.
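
For illustration, here is a minimal sketch of the kind of memoization I mean, using functools.cached_property; the class and attribute names are hypothetical, not scikit-gstat API:

import numpy as np
from functools import cached_property

class ExpensiveModel:
    # toy stand-in for a class with costly derived values;
    # the name and attributes are hypothetical placeholders
    def __init__(self, values):
        self._values = np.asarray(values)

    @cached_property
    def pairwise_distances(self):
        # computed once on first access, then cached on the instance
        return np.abs(self._values[:, None] - self._values[None, :])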

Perhaps I can show you this code some time, also to check whether credit is due somewhere in case it goes public.

I would also be happy to try and turn the above into a PR.

Kind regards,
Martin

@mmaelicke (Owner)

Yeah, nice. Then I guess the best way to proceed is that you open a PR with your code, which also makes you a collaborator (and co-author of the Zenodo publication). That way credit is attributed clearly and transparently. I can also assist with the PR, if you feel like I can be helpful.

@MartinSchobben (Author)

Yes, and I should probably list you as a contributor (or cite your work) on my ASCAT application if it goes public, as I am relying heavily on your knowledge of spatial autocorrelation.

@mmaelicke (Owner)


Just to let you know: there is also a model description paper in GMD, which you can cite if you use the package.
DOI: 10.5194/gmd-15-2505-2022
URL: https://gmd.copernicus.org/articles/15/2505/2022/gmd-15-2505-2022.html

MartinSchobben added a commit to MartinSchobben/scikit-gstat that referenced this issue Jul 6, 2023