The road towards faster decision trees #517
-
Sounds good to me (and totally doable in the long run). As I understand it, the flattened implementation (our main target) would not be easily applicable to all cases, right? We might need d-ary heaps. A somewhat tricky workaround, and it's just a crazy idea I had yesterday (I really didn't give it much thought to evaluate its real effectiveness), would be to use dictionaries instead (?). Yeap, we could assign identifiers to the path to a node (leaf or non-leaf) and use that as the key for the node. Crazy, huh? The problem is that I'm not sure how cache-friendly that approach is. So we might get something similar to the pointer-based strategy, although hash-table lookups are usually fast.

Regardless of performance, getting a common base class is already a good improvement! #488 is an effort towards making our decision trees more modular and could facilitate the usage of a shared base class.

Concerning (2.B), I am more than happy to hear that! I have been thinking about Mondrian Forests for quite some time, but I never really studied this family of algorithms.

Finally, I agree with (3). I expect our codebase to evolve and improve in the long run. We need to define the foundations that will pave the way to the future of the trees.
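To make the "dictionaries keyed by path" idea concrete, here is a minimal sketch. All names (`predict_path`, the `"L"`/`"R"` path encoding, the node dict layout) are illustrative assumptions, not anything in River:

```python
# Hypothetical sketch: store every node in a flat dict keyed by the path of
# turns taken to reach it ("" = root, "L" = left child, "LR" = left then right).
tree = {
    "": {"feature": 0, "threshold": 0.5},  # root (internal node)
    "L": {"value": 1.0},                   # left leaf
    "R": {"value": 2.0},                   # right leaf
}

def predict_path(tree, x):
    """Walk from the root by extending the path key until a leaf is reached."""
    path = ""
    node = tree[path]
    while "value" not in node:  # internal nodes carry a split, leaves a value
        path += "L" if x[node["feature"]] <= node["threshold"] else "R"
        node = tree[path]
    return node["value"]
```

The upside is that inserting or removing a subtree is just dict surgery; the downside, as noted above, is that nodes live wherever the hash table puts them, so the memory layout is no more cache-friendly than pointer chasing.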
-
Just a quick addition:
That's a must! Online learning + interpretability are, IMHO, important concerns to keep in mind! The recent refactoring of the pipeline visualization is the first step towards improving the inspection of tree-based models. @MaxHalford, how feasible would it be to have dynamic visualizations? I mean, to be able to monitor the trees growing intermittently.
-
scikit-multiflow contributed to River their implementations and expertise regarding decision trees. We've recently been having some discussions about improving the speed of decision trees. We also want to build a single base tree class from which each tree can inherit (this would have many benefits in terms of code organisation).
In an ideal world, all the trees would be implemented in Cython. They would store the nodes in a list. One could walk down the tree by using integer arithmetic (see here for more info). This is the "flattened tree method" described by Andrew Tulloch. It's also the method used by onelearn. It's really blazing fast!
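As a rough illustration of the flattened layout (the array names and leaf encoding below are my own assumptions, not River's or onelearn's actual code): nodes live in flat arrays, and the children of node `i` in a complete binary tree sit at `2 * i + 1` and `2 * i + 2`, so walking down is pure integer arithmetic with no pointer chasing.

```python
import numpy as np

def predict_flat(features, thresholds, leaf_values, x):
    """Walk a flat complete binary tree; a negative feature index marks a leaf."""
    i = 0
    while features[i] >= 0:                  # internal node: apply its split
        if x[features[i]] <= thresholds[i]:
            i = 2 * i + 1                    # left child
        else:
            i = 2 * i + 2                    # right child
    return leaf_values[i]

# A depth-1 stump that splits on feature 0 at 0.5:
features = np.array([0, -1, -1])             # node 0 splits, nodes 1 and 2 are leaves
thresholds = np.array([0.5, 0.0, 0.0])
leaf_values = np.array([0.0, 1.0, 2.0])      # values at indices 1 (left) and 2 (right)

predict_flat(features, thresholds, leaf_values, np.array([0.2]))  # → 1.0
```

Because everything is contiguous NumPy storage, this loop is trivial to port to Cython with typed memoryviews, which is where the speed comes from.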
The thing is that we have many different kinds of trees, and building a unified base class that would cover all use cases seems like too daunting a task. I believe we need to proceed incrementally (pun intended) in a series of steps we all agree on. Here's my proposal:
A. Build a `Node` interface that each new tree model has to implement. Each node has a `children` attribute, which is a list of `Node`s. It would be very easy to support more than 2 children and child removal. `Node` would have a bunch of methods for walking and displaying. Our codebase should be much clearer once we're done with this. @smastelini and @jacobmontiel are free to cythonize the split searchers, which are the main bottlenecks.

B. Integrate `onelearn` into River. We have approval from @stephanegaiffas to do this. We could start by simply porting the code into River without having it inherit from our base class. There are a couple of ways to do this. We could either copy/paste the code or use `git subtree` as we did to merge scikit-multiflow into the creme repository. Having `onelearn` would be great for everyone: more exposure for such an awesome model and blazing fast speed for users. Now that @JovanVeljanoski has proven that River can integrate well with Vaex, decision trees that process millions of rows per second don't seem far away...

For step 1, I will take charge of implementing this base class. I will adapt the code of `anomaly.HalfSpaceTrees` so that it uses this new base class. This may serve as an example so that @smastelini can port the code in the `tree` module.

Feel free to share any thoughts and/or challenge my proposal.
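To give a feel for what the proposed `Node` interface could look like, here is a minimal sketch. The class shape, method names (`iter_dfs`, `n_nodes`), and properties are my own assumptions; the actual base class is yet to be designed:

```python
class Node:
    """Sketch of a shared tree node: children is a plain list, so n-ary
    trees and child removal come for free."""

    def __init__(self, children=None):
        self.children = children if children is not None else []

    @property
    def is_leaf(self):
        return not self.children

    def iter_dfs(self, depth=0):
        """Walk the subtree depth-first, yielding (node, depth) pairs."""
        yield self, depth
        for child in self.children:
            yield from child.iter_dfs(depth + 1)

    @property
    def n_nodes(self):
        return sum(1 for _ in self.iter_dfs())

# A root with three children — supporting more than 2 children is trivial:
root = Node([Node(), Node(), Node()])
```

Display helpers (e.g. pretty-printing the tree) would then be written once against `iter_dfs` and inherited by every tree model.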