
[ENH] functional n_jobs parameter for knn classifier #2478

Open
baraline opened this issue Jan 1, 2025 · 10 comments
Labels
classification Classification package enhancement New feature, improvement request or other non-bug code enhancement

Comments

@baraline
Member

baraline commented Jan 1, 2025

Describe the feature or idea you want to propose

The current n_jobs parameter does not do anything in the KNN classifier.

Describe your proposed solution

Make use of it! Exactly how is TBD. If that is not possible, a warning should at least be raised.

Describe alternatives you've considered, if relevant

No response

Additional context

I was curious, after looking at the sequentia benchmark, why we were so slow...

@baraline baraline added enhancement New feature, improvement request or other non-bug code enhancement classification Classification package labels Jan 1, 2025
@Ramana-Raja

Ramana-Raja commented Jan 6, 2025

I think parallelization could potentially be applied in the _kneighbors method, but, as shown below, it actually worsens execution time. This is likely because the task itself is too lightweight to benefit from parallel processing, so I'd say it's safe to conclude that parallelization isn't really useful here.

As for the warning, I'm not entirely sure what you mean, but maybe we could just remove the n_jobs parameter and turn off multithreading?
[screenshots omitted: benchmark timings and test data]

@baraline
Member Author

baraline commented Jan 7, 2025

Thanks for the benchmarking job! Could you try switching the joblib backend to threading and see if it makes a difference? Since the distance functions are njit-compiled numba functions, we shouldn't need the loky backend. Otherwise, defining the kneighbors function as a numba function with parallel=True might work better.
Additionally, using a more compute-intensive distance function such as dtw instead of euclidean might change the results.
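The threading-backend idea above could be sketched roughly like this. This is an illustrative sketch, not aeon's actual code: `row_distances` is a hypothetical stand-in for an njit-compiled distance kernel (such kernels release the GIL, which is what makes the threading backend worthwhile compared to loky's process spawning and pickling).

```python
import numpy as np
from joblib import Parallel, delayed


def row_distances(x, X_train):
    # Stand-in for an njit-compiled distance kernel; plain numpy here.
    return np.sqrt(((X_train - x) ** 2).sum(axis=1))


def kneighbors(X_query, X_train, k=1, n_jobs=2):
    # threading backend: no process creation, no pickling of the data
    rows = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(row_distances)(x, X_train) for x in X_query
    )
    dist = np.stack(rows)
    idx = np.argsort(dist, axis=1)[:, :k]
    return idx, np.take_along_axis(dist, idx, axis=1)
```

With a pure-Python kernel like this the GIL would still serialize the work; the benefit only materializes once the kernel is a compiled, GIL-releasing function.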

@aadya940
Contributor

aadya940 commented Jan 7, 2025

@baraline @Ramana-Raja How about parallelizing the loop in _predict in addition to what @Ramana-Raja has already done? It doesn't seem to have any loop-carried dependencies, so it could be worth a try.

@baraline
Member Author

baraline commented Jan 7, 2025

It could work as well, yes, but then you need to balance the number of threads (for kneighbors) per process (for each sample or group of samples to predict) to find the right trade-off.

@aadya940
Contributor

aadya940 commented Jan 7, 2025

Maybe we can limit the number of threads per process to the maximum CPU count divided by n_jobs?
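That heuristic might look like the following helper (`threads_per_process` is an invented name for illustration, not an existing API):

```python
import os


def threads_per_process(n_jobs):
    # Split the available cores between n_jobs worker processes,
    # guaranteeing each process at least one thread.
    cpus = os.cpu_count() or 1
    return max(1, cpus // max(1, n_jobs))
```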

@Ramana-Raja

Ramana-Raja commented Jan 7, 2025

> @baraline @Ramana-Raja How about parallelizing the loop in _predict in addition to what @Ramana-Raja has already done, it doesn't seem to have any looping dependencies. So could try?

This approach also doesn't seem to work as intended when the data is small; it actually makes the execution time worse, as you can see below.
[screenshots omitted: timings on small data, tested with dtw]
But if the data is large it performs better:
[screenshots omitted: timings and test data for the large case]

@aadya940
Contributor

aadya940 commented Jan 7, 2025

@Ramana-Raja That behavior is expected. It occurs because the time required for process creation and context switching exceeds the compute time. To address this, analyze how the problem scales and plot a graph comparing execution times with and without parallelization. The intersection point will indicate the optimal input size where parallelism becomes beneficial. Ideally, this function should allow for dynamically switching between single-threaded and multithreaded execution.
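The dynamic switching described above could be sketched as a simple size-based dispatch. The threshold value here is purely hypothetical and machine-dependent; in practice it would come from the benchmark crossover point:

```python
def predict_dispatch(X, serial_fn, parallel_fn, threshold=500):
    # Hypothetical sketch: fall back to the serial path when the input
    # is too small for parallel overhead to pay off.
    runner = parallel_fn if len(X) >= threshold else serial_fn
    return runner(X)
```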

@Ramana-Raja

> @Ramana-Raja That behavior is expected. It occurs because the time required for process creation and context switching exceeds the compute time. To address this, analyze how the problem scales and plot a graph comparing execution times with and without parallelization. The intersection point will indicate the optimal input size where parallelism becomes beneficial. Ideally, this function should allow for dynamically switching between single-threaded and multithreaded execution.

This seems to depend on the CPU, right? The optimal data size may vary across users with different CPUs. I think the best approach is to leave it configurable as a hyperparameter.

@aadya940
Contributor

aadya940 commented Jan 7, 2025

Makes sense, @baraline wdyt?

@baraline
Member Author

baraline commented Jan 7, 2025

I think the simplest approach would be to offload that hassle to the numba compiler. We could use the existing functions of the distance module (e.g. euclidean_pairwise_distance) to compute the distance matrix in parallel by adding an optional n_jobs=1 parameter, using numba's set_num_threads function, and making _euclidean_pairwise_distance and _euclidean_from_multiple_to_multiple_distance use parallel=True with prange on both loops. This would not change the default behaviour but would let us make use of n_jobs without much work.

This would require changing these functions for all distances, though. @chrisholder, what do you think?

@TonyBagnall TonyBagnall changed the title [ENH] functionnal n_jobs parameter for knn classifier [ENH] functional n_jobs parameter for knn classifier Jan 10, 2025
Development

No branches or pull requests

4 participants