The road towards faster decision trees #517
-
Sounds good to me (and totally doable in the long run). As I understand it, the flattened implementation (our main target) would not be easily applicable to all cases, right? We might need d-ary heaps. A somewhat tricky workaround, and it's just a crazy idea I had yesterday (I really didn't give it much thought to evaluate its real effectiveness), would be to use dictionaries instead (?). Yeap, we could assign identifiers to the path to a node (leaf or non-leaf) and use that as the key for the node. Crazy, huh? The problem is that I'm not sure how cache-friendly that approach is. So we might get something similar to the pointer-based strategy, although hash-table lookups are usually fast.

Regardless of performance, getting a common base class is already a good improvement! #488 is an effort towards making our decision trees more modular and could facilitate the usage of a shared base class.

Concerning (2.B), I am more than happy to hear that! I have been thinking about Mondrian Forests for quite some time, but I never really studied this family of algorithms.

Finally, I agree with (3). I expect our codebase to evolve and improve in the long run. We need to define the foundations that will pave the way to the future of the trees.
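To make the "dictionaries keyed by path" idea concrete, here is a minimal sketch. All names (`predict_path`, the `"L"`/`"R"` path encoding, the node dict layout) are illustrative assumptions, not anything in River:

```python
# Hypothetical sketch: store every node in a flat dict keyed by the path of
# turns taken to reach it ("" = root, "L" = left child, "LR" = left then right).
tree = {
    "": {"feature": 0, "threshold": 0.5},  # root (internal node)
    "L": {"value": 1.0},                   # left leaf
    "R": {"value": 2.0},                   # right leaf
}

def predict_path(tree, x):
    """Walk from the root by extending the path key until a leaf is reached."""
    path = ""
    node = tree[path]
    while "value" not in node:  # internal nodes carry a split, leaves a value
        path += "L" if x[node["feature"]] <= node["threshold"] else "R"
        node = tree[path]
    return node["value"]
```

The upside is that inserting or removing a subtree is just dict surgery; the downside, as noted above, is that nodes live wherever the hash table puts them, so the memory layout is no more cache-friendly than pointer chasing.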
-
Just a quick addition:
That's a must! Online learning + interpretability are, IMHO, important concerns to keep in mind! The recent refactoring of the pipeline visualization is the first step towards improving the inspection of tree-based models. @MaxHalford, how feasible would it be to have dynamic visualizations? I mean, to be able to monitor the trees growing intermittently.
-
scikit-multiflow contributed to River their implementations and expertise regarding decision trees. We've recently been having some discussions about improving the speed of decision trees. We also want to build a single base tree class from which each tree can inherit (this would have many benefits in terms of code organisation).
In an ideal world, all the trees would be implemented in Cython. They would store the nodes in a list. One could walk down the tree by using integer arithmetic (see here for more info). This is the "flattened tree method" described by Andrew Tulloch. It's also the method used by onelearn. It's really blazing fast!
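As a rough illustration of the flattened layout (the array names and leaf encoding below are my own assumptions, not River's or onelearn's actual code): nodes live in flat arrays, and the children of node `i` in a complete binary tree sit at `2 * i + 1` and `2 * i + 2`, so walking down is pure integer arithmetic with no pointer chasing.

```python
import numpy as np

def predict_flat(features, thresholds, leaf_values, x):
    """Walk a flat complete binary tree; a negative feature index marks a leaf."""
    i = 0
    while features[i] >= 0:                  # internal node: apply its split
        if x[features[i]] <= thresholds[i]:
            i = 2 * i + 1                    # left child
        else:
            i = 2 * i + 2                    # right child
    return leaf_values[i]

# A depth-1 stump that splits on feature 0 at 0.5:
features = np.array([0, -1, -1])             # node 0 splits, nodes 1 and 2 are leaves
thresholds = np.array([0.5, 0.0, 0.0])
leaf_values = np.array([0.0, 1.0, 2.0])      # values at indices 1 (left) and 2 (right)

predict_flat(features, thresholds, leaf_values, np.array([0.2]))  # → 1.0
```

Because everything is contiguous NumPy storage, this loop is trivial to port to Cython with typed memoryviews, which is where the speed comes from.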
The thing is that we have many different kinds of trees, and building a unified base class that would cover all use cases seems like too daunting a task. I believe we need to proceed incrementally (pun intended) in a series of steps we all agree on. Here's my proposal:
A. Build a `Node` interface that each new tree model has to implement. Each node has a `children` attribute, which is a list of `Node`s. It would be very easy to support more than 2 children and child removal. `Node` would have a bunch of methods for walking and displaying. Our codebase should be much clearer once we're done with this. @smastelini and @jacobmontiel are free to cythonize the split searchers, which are the main bottlenecks.

B. Integrate `onelearn` into River. We have approval from @stephanegaiffas to do this. We could start by simply porting the code into River without having it inherit from our base class. There are a couple of ways to do this. We could either copy/paste the code or use `git subtree` as we did to merge scikit-multiflow into the creme repository. Having `onelearn` would be great for everyone: more exposure for such an awesome model and blazing fast speed for users. Now that @JovanVeljanoski has proven that River can integrate well with Vaex, decision trees that process millions of rows per second don't seem far away...

For step 1, I will take charge of implementing this base class. I will adapt the code of `anomaly.HalfSpaceTrees` so that it uses this new base class. This may serve as an example so that @smastelini can port the code in the `tree` module.

Feel free to share any thoughts and/or challenge my proposal.
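To give a feel for what the proposed `Node` interface could look like, here is a minimal sketch. The class shape, method names (`iter_dfs`, `n_nodes`), and properties are my own assumptions; the actual base class is yet to be designed:

```python
class Node:
    """Sketch of a shared tree node: children is a plain list, so n-ary
    trees and child removal come for free."""

    def __init__(self, children=None):
        self.children = children if children is not None else []

    @property
    def is_leaf(self):
        return not self.children

    def iter_dfs(self, depth=0):
        """Walk the subtree depth-first, yielding (node, depth) pairs."""
        yield self, depth
        for child in self.children:
            yield from child.iter_dfs(depth + 1)

    @property
    def n_nodes(self):
        return sum(1 for _ in self.iter_dfs())

# A root with three children — supporting more than 2 children is trivial:
root = Node([Node(), Node(), Node()])
```

Display helpers (e.g. pretty-printing the tree) would then be written once against `iter_dfs` and inherited by every tree model.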