Predict a continuous variable based on mostly categorical columns

I have a large dataset (40 million rows, 50 columns) with mostly categorical columns (a few are numerical), and I am working in Python/pandas. The categorical columns have up to 3,000 unique labels.

I am looking for best practices on how to approach this. Plain one-hot encoding (OHE) is obviously out of the question at this cardinality. I tried grouping labels into a smaller number of categories and then applying OHE, but the model performed very poorly; too much information is lost that way. Memory is also an issue, and everything takes a long time.
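As a first step against the memory pressure, high-cardinality string columns can be stored as pandas' `category` dtype, which keeps each label once and per-row integer codes instead of full Python strings. A minimal sketch on a toy frame (the column names here are invented):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the real data; "region" and "product" are made-up columns.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "region": rng.choice([f"r{i}" for i in range(3000)], size=n),
    "product": rng.choice([f"p{i}" for i in range(500)], size=n),
    "price": rng.random(n),
})

before = df.memory_usage(deep=True).sum()

# category dtype stores the label list once plus small integer codes per row
for col in ["region", "product"]:
    df[col] = df[col].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"memory: {before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```

With thousands of distinct labels the codes fit in 2 bytes per row, so the saving over `object` columns is usually an order of magnitude.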

Should I sample the data in this case? If so, how? The categorical columns depend on each other; they are nested. Label encoding and other encoders also didn't give good results. I have tried CatBoostRegressor and other tree-based models. How would you approach this problem, from data visualisation and feature engineering to sampling and modelling?
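For reference, one encoder family the question doesn't mention is target (mean) encoding, which replaces each label with a smoothed mean of the target and so keeps high-cardinality columns as a single numeric feature. A sketch in plain pandas, on invented column names ("city", "target") and an assumed smoothing strength:

```python
import pandas as pd
import numpy as np

# Toy data; "city" and "target" are hypothetical names.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["a", "b", "c"], size=1000),
    "target": rng.normal(size=1000),
})

# Smoothed target encoding: blend each category's target mean with the
# global mean, weighted by category frequency, so rare labels shrink
# toward the global mean instead of memorising noise.
global_mean = df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])
alpha = 10  # smoothing strength -- a tunable assumption, not a fixed rule
smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)
df["city_enc"] = df["city"].map(smoothed)
```

In practice the encoding should be computed on the training folds only (or with out-of-fold estimates) to avoid target leakage.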

Topic python-3.x pandas

Category Data Science


With such a big dataset, I would start by using a random sample of the data as a smaller training set, until you have identified an algorithm that is suitable.
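A sketch of that sampling step in pandas, on a toy frame with an invented "segment" column; since the question says the categoricals are nested and likely imbalanced, a stratified variant via `groupby(...).sample` (pandas ≥ 1.1) is shown alongside the plain uniform one:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
# Toy stand-in for the large frame; "segment" is a hypothetical key column
# with a deliberately skewed label distribution.
df = pd.DataFrame({
    "segment": rng.choice(["a", "b", "c"], size=100_000, p=[0.8, 0.15, 0.05]),
    "y": rng.normal(size=100_000),
})

# Plain uniform sample; fix random_state so experiments are repeatable.
uniform = df.sample(n=10_000, random_state=0)

# Stratified sample that keeps each segment's share intact,
# so rare categories are not lost from the training subset.
stratified = df.groupby("segment", group_keys=False).sample(frac=0.1, random_state=0)
```

Once a promising model is found on the sample, it can be refitted on progressively larger fractions to check that the conclusions hold.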

With so many descriptors, I would start with a Random Forest regressor (the target is continuous, so a classifier does not fit here). Although not necessarily the best final model, it is a good way to explore feature importance.
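A minimal sketch of that exploration with scikit-learn, on synthetic data where only the first feature actually drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic regression data: y depends (almost) only on feature 0.
X = rng.normal(size=(2000, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=2000)

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X, y)

# Impurity-based importances, normalised to sum to 1; a quick first pass
# for deciding which of the 50 columns are worth keeping.
importances = rf.feature_importances_
print(importances)
```

Impurity-based importances can be biased toward high-cardinality features, so it is worth cross-checking the ranking with permutation importance on a held-out sample.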

Happily, these choices fit together well: you can train each iteration of the forest building on a different subsample of the dataset.
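One way to realise this with scikit-learn is the `warm_start` option, which keeps the trees already built and only fits the newly requested ones, so each batch of trees can see a fresh subsample. A sketch on toy data with invented column names:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Toy stand-in for the big table; "x0", "x1", "y" are hypothetical columns.
df = pd.DataFrame({"x0": rng.normal(size=20_000), "x1": rng.normal(size=20_000)})
df["y"] = 2 * df["x0"] + rng.normal(scale=0.1, size=20_000)

# warm_start=True: each call to fit() adds only the extra trees implied by
# the raised n_estimators, trained on whatever data that call receives.
rf = RandomForestRegressor(warm_start=True, n_jobs=-1, random_state=0)
for batch in range(3):
    rf.n_estimators = 25 * (batch + 1)  # grow the forest by 25 trees per batch
    sub = df.sample(n=5_000, random_state=batch)
    rf.fit(sub[["x0", "x1"]], sub["y"])
```

After the loop the forest holds 75 trees, each batch of 25 having been grown on a different 5,000-row subsample.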
