Decision Tree: Efficient splitting of nodes, minimize number of gini evaluations

I have a dataset specific problem where i need to use a splitting function other than gini_index. This requires me to re-write a decision tree from scratch. I have a working model, but itis highly inefficient.

To make a split i currently iterate though each feature and then through each unique datapoint in that dataset for each node (total of nodes x features x unique levels gini evaluations). Cause of this my DT on a 300k X 145 dataset has been running for 2 days.

How can I cut down on the number of splitting evaluations, or speed up the program. I read Fisher Yates algorithm in Sklean's code, but I don't understand the logic. Any help would be appreciated.

Topic decision-trees scikit-learn

Category Data Science


In general, to reduce the amount of time needed to run your dataset through the See4.5 (C4.5) algorithm, you'll want to reduce the number of nodes in the tree needing to be processed.

This can be done utilizing pruning, optimal operator selection, and incorporating a heuristic into your decision tree search.

Alpha-beta pruning, bidirectional search, and the Minmax algorithm for operator selection are a good choice when it comes to decision-tree time reduction.

I'm not going to write an entire book here, however look into artificial intelligence and see what they've accomplished so far. Alot of it needs tweaking because they keep switching (but that's another story), however, if you come across any books saying bidirectional search is in any way non-optimal, just ignore that because it's inherent that researchers can't code that well.

A good implementation of the Gini algorithm in practical use is available through Ross Quinlan's website. If you look in to and understand the C5.0 source code, you should be on decision tree research level as there isn't a clear explanation available online as far as I can tell detailing the new algorithm's additions.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.