Is it possible to use a hard-coded decision tree on some variables and a random forest (or something similar) on the remaining ones?

The situation is that for some variables it is possible to make strong empirical assumptions, while for others the relative importance seems more random.

For example:

The researcher is certain that splitting X1 at 5 and X2 at 3 gives the best information, since these are empirically sound splits, e.g. based on stakeholder views. X1 and X2 are also more important than X3, X4, X5, since X3, X4, X5 are redundant if X1 or X2 don't exist.

Thus the model could essentially be based on X1 and X2 only, but X3, X4, X5 should add explanatory power, even though their relative importances are not known. Fitting a single decision tree to them might be inaccurate, whereas a random forest or something similar would perhaps reduce overfitting better.


It is possible to create a hybrid system that combines human-selected and machine-learned rules in a decision tree.

Hybrid systems have fallen out of favor because they are more difficult to create, use, and maintain. Often the human domain experts who are most capable of creating useful rules are not able to express those rules in a form the machine learning system can use.


The important point here is the distinction between rule-based and data-driven:

  • A rule-based predicting system is an algorithm which calculates the target variable based on rules which have been predetermined and implemented by a human expert.
  • A data-driven predicting system is an algorithm which is first trained on some labelled training data in order to automatically determine the "rules".

Both types of systems can be used to predict the target variable for some new instance. Both types of systems can (and should) be evaluated using some labeled test set.
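As a minimal sketch of this distinction, the snippet below scores a hand-written rule and a threshold learned from data on the same labelled test set. The toy data, the expert threshold of 5, and the names `rule_based`, `fit_threshold`, and `accuracy` are all made up for illustration.

```python
def rule_based(x):
    # Rule fixed in advance by a (hypothetical) expert: predict 1 when x exceeds 5.
    return 1 if x > 5 else 0

def fit_threshold(xs, ys):
    # Data-driven: pick the threshold that misclassifies the fewest training points.
    def errors(t):
        return sum((1 if x > t else 0) != y for x, y in zip(xs, ys))
    best = min(sorted(set(xs)), key=errors)
    return lambda x: 1 if x > best else 0

def accuracy(predict, xs, ys):
    # Fraction of instances whose prediction matches the label.
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)

# Toy labelled data (illustrative only).
train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 0, 1, 1, 1, 1, 1]
test_x = [2, 4, 5, 8]
test_y = [0, 1, 1, 1]

learned = fit_threshold(train_x, train_y)
print("rule-based :", accuracy(rule_based, test_x, test_y))
print("data-driven:", accuracy(learned, test_x, test_y))
```

Both predictors expose the same interface (a function from an instance to a label), which is what lets the same test set and metric be applied to either.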

So in general yes, one can hard-code a decision tree. If it's entirely hard-coded then it's a rule-based system.

In theory one can use a hybrid method to build a tree using some human-determined rules and the rest based on data. However, to the best of my knowledge there is no established algorithm that says how the two types of "rules" should be combined. For example, one could create the top of the tree manually and then let the learning algorithm determine the rest of the tree. Or one could create multiple trees, some rule-based and some data-driven, then use an ensemble method to combine their predictions.
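The first option (manual top of the tree, learned remainder) could be sketched roughly as follows. This is only an illustration under assumptions: the thresholds 5 and 3 come from the question, but the direction of the comparisons, the toy data, and the helper names `segment`, `fit_stump`, and `fit_hybrid` are hypothetical. The learned part here is a one-split stump to keep the example dependency-free; a random forest fitted per segment (for instance scikit-learn's `RandomForestClassifier`) could take its place.

```python
from collections import Counter

def segment(row):
    # Hard-coded top of the tree: the expert's splits on X1 and X2.
    # The ">" direction is an assumption made for this sketch.
    return (row["X1"] > 5, row["X2"] > 3)

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    # Data-driven part: the single (feature, threshold) split with fewest errors.
    default = majority(labels)
    best_err = sum(y != default for y in labels)  # baseline: no split at all
    best_split = None
    for feat in features:
        for thr in sorted({r[feat] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[feat] <= thr]
            right = [y for r, y in zip(rows, labels) if r[feat] > thr]
            if not left or not right:
                continue
            lo, hi = majority(left), majority(right)
            err = sum(y != lo for y in left) + sum(y != hi for y in right)
            if err < best_err:
                best_err, best_split = err, (feat, thr, lo, hi)
    if best_split is None:
        return lambda r: default
    feat, thr, lo, hi = best_split
    return lambda r: lo if r[feat] <= thr else hi

def fit_hybrid(rows, labels, features=("X3", "X4", "X5")):
    # Route each training row through the fixed splits, then learn one model per leaf.
    groups = {}
    for r, y in zip(rows, labels):
        rs, ys = groups.setdefault(segment(r), ([], []))
        rs.append(r)
        ys.append(y)
    models = {key: fit_stump(rs, ys, features) for key, (rs, ys) in groups.items()}
    fallback = majority(labels)  # used for segments with no training data
    def predict(r):
        model = models.get(segment(r))
        return model(r) if model else fallback
    return predict

# Toy data (illustrative only).
rows = [
    {"X1": 7, "X2": 4, "X3": 1, "X4": 0, "X5": 0},
    {"X1": 8, "X2": 5, "X3": 2, "X4": 1, "X5": 0},
    {"X1": 6, "X2": 4, "X3": 3, "X4": 0, "X5": 0},
    {"X1": 9, "X2": 6, "X3": 4, "X4": 1, "X5": 0},
    {"X1": 2, "X2": 1, "X3": 4, "X4": 0, "X5": 0},
    {"X1": 3, "X2": 2, "X3": 1, "X4": 0, "X5": 0},
    {"X1": 1, "X2": 1, "X3": 1, "X4": 2, "X5": 0},
    {"X1": 4, "X2": 3, "X3": 4, "X4": 3, "X5": 0},
]
labels = [0, 0, 1, 1, 0, 0, 1, 1]

predict = fit_hybrid(rows, labels)
print(predict({"X1": 7, "X2": 5, "X3": 3, "X4": 0, "X5": 0}))
print(predict({"X1": 2, "X2": 2, "X3": 4, "X4": 0, "X5": 0}))
```

One consequence of this design is that each learned model only sees the training rows that fall into its segment, so segments with few rows give the data-driven part little to work with.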

But it's important to realize that this kind of hybrid method could lead to inconsistencies: if the expert decides to give priority to a rule which is not supported by the data (or which has low importance in it), then the resulting model is likely to perform poorly.
