Are there rules of thumb for XGBoost's hyperparameter selection?

There are multiple parameters that need to be specified in the XGBClassifier. Certainly GridSearchCV could give some insight into optimal hyperparameters, but I would imagine there are some rules of thumb for reasonable hyperparameter selection. For example, for a training set of ~200,000 examples with ~1,000 features, is it possible to specify reasonable values for n_estimators, learning_rate, and max_depth from this information alone?

Topic xgboost classification machine-learning

Category Data Science


Using a callback and early stopping, you can set the number of boosting rounds to some "high" number and wait until early stopping takes effect, so there is no need for much tuning here.
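For illustration, a minimal sketch of that pattern (the synthetic data and all numbers are placeholders for the ~200,000 x 1,000 dataset in the question; depending on your xgboost version, early_stopping_rounds belongs in the constructor or in fit()):

```python
# Minimal sketch: pick a deliberately high number of rounds and let early
# stopping decide when to stop. Synthetic data stands in for the real set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Note: in recent xgboost versions early_stopping_rounds and eval_metric are
# constructor arguments; in older versions they are passed to fit() instead.
model = XGBClassifier(
    n_estimators=5000,          # "high" on purpose; early stopping cuts it short
    early_stopping_rounds=50,   # stop after 50 rounds without validation improvement
    eval_metric="logloss",
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)     # the effective number of boosting rounds
```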

You may keep the standard learning rate for a start (and probably lower it later). A lower learning rate leads to slower learning progress (it requires more rounds). So if you can wait, you can set a somewhat lower learning rate than the default in the first place.

I would leave max_depth at its default (6). Deeper trees tend to overfit quickly, while very shallow trees may not learn enough.

With 1,000 features and 200,000 rows, you may benefit from using colsample_bytree and/or subsample. The former randomly samples columns for each tree (similar to random forest), the latter randomly samples rows (this is what makes it stochastic gradient boosting). Both introduce some randomness and help to avoid overfitting and improve the generalisation of the model.
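As a hedged starting point that combines the learning-rate and depth advice above with row and column subsampling (the values are illustrative only, not tuned for any dataset):

```python
# Illustrative starting values only -- combine with early stopping as above.
from xgboost import XGBClassifier

model = XGBClassifier(
    learning_rate=0.05,       # below the 0.3 default, so expect more boosting rounds
    max_depth=6,              # the default depth
    subsample=0.8,            # each tree is fit on a random 80% of the rows
    colsample_bytree=0.8,     # each tree sees a random 80% of the columns
    n_estimators=2000,        # deliberately generous; early stopping trims it
)
```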

In case you suspect that some features are not very predictive, you can use alpha and lambda for regularisation (L1 and L2 penalties, exposed as reg_alpha and reg_lambda in the sklearn wrapper).
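If you want to explore those two penalties, a small sketch of a grid search over them (the grid values are arbitrary examples, not recommendations):

```python
# Sketch: small grid over L1 (alpha) and L2 (lambda) penalty strengths.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "reg_alpha": [0.0, 0.1, 1.0],     # L1 regularization strength
    "reg_lambda": [1.0, 5.0, 10.0],   # L2 regularization strength (default is 1)
}
search = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=3)
# search.fit(X_train, y_train)  # reuse the split from the early-stopping sketch
```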

Also have a look at feature importance after a reasonable model run to remove features with zero or very low predictive power.
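For example, something along these lines (the cut-off and feature_names are placeholders you would replace with your own values):

```python
# Sketch: rank features by importance after a fitted run and flag weak ones.
# `model` is a fitted XGBClassifier; `feature_names` is your own list of column names.
import pandas as pd

importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(20))

weak = importances[importances < 1e-4].index   # arbitrary cut-off
# X_reduced = X.drop(columns=weak)             # retrain on the reduced feature set
```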

See the docs for details.
