Predicting a continuous variable from mostly categorical columns
I have a large dataset (40 million rows, 50 columns) in which most columns are categorical and a few are numerical. I am working in Python with pandas. Some categorical columns have up to 3,000 unique labels.
I am looking for best practices for this kind of problem. Plain one-hot encoding (OHE) is out of the question at this cardinality. I tried merging labels into a smaller number of categories before applying OHE, but the resulting model performed very poorly because too much information was lost. Memory is also an issue, and everything takes a long time to run.
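To make the memory concern concrete, here is a minimal sketch (for illustration only, not the actual pipeline) of how storing label columns with the pandas `category` dtype and downcasting numeric columns can shrink a frame. The column names `cat_a`, `cat_b`, `num_x` and the synthetic data are assumptions; the real savings depend on the actual label lengths and cardinalities.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical columns and sizes).
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "cat_a": rng.integers(0, 3000, size=n_rows).astype(str),  # high cardinality
    "cat_b": rng.integers(0, 50, size=n_rows).astype(str),    # low cardinality
    "num_x": rng.normal(size=n_rows),
})

mem_before = df.memory_usage(deep=True).sum()

# Store each label column as integer codes plus a lookup table of labels.
for col in ["cat_a", "cat_b"]:
    df[col] = df[col].astype("category")

# Halve the numeric column's footprint, assuming float32 precision is enough.
df["num_x"] = df["num_x"].astype("float32")

mem_after = df.memory_usage(deep=True).sum()
print(f"before: {mem_before / 1e6:.1f} MB, after: {mem_after / 1e6:.1f} MB")
```

The same dtypes can also be passed to `pd.read_csv` through its `dtype` argument so that the large object columns are never materialised.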
Should I sample the data in this case? If so, how? The categorical columns depend on each other: they are nested. Label encoding and other encoders did not give good results either. I have tried CatBoost Regressor and other tree-based models. How would you approach this problem end to end: data visualisation, feature engineering, sampling, and modelling?
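As an illustration of the sampling question, the sketch below draws the same fraction of rows from every combination of two nested categorical columns, so that no combination disappears from the sample. The column names `parent`, `child`, `target`, the 10% fraction, and the synthetic data are all assumptions; this is one possible scheme, not a recommendation confirmed for this dataset.

```python
import numpy as np
import pandas as pd

# Synthetic nested categories: each child label belongs to exactly one parent.
rng = np.random.default_rng(0)
n_rows = 10_000
parent = rng.integers(0, 5, size=n_rows)
child = parent * 10 + rng.integers(0, 4, size=n_rows)
df = pd.DataFrame({
    "parent": parent,
    "child": child,
    "target": rng.normal(size=n_rows),
})

# Sample 10% of the rows inside every (parent, child) group.
sample = df.groupby(["parent", "child"]).sample(frac=0.1, random_state=0)

n_groups_full = len(df[["parent", "child"]].drop_duplicates())
n_groups_sample = len(sample[["parent", "child"]].drop_duplicates())
print(len(df), len(sample), n_groups_full, n_groups_sample)
```

With very rare combinations, a fixed fraction can round down to zero rows for a group, so a minimum number of rows per group (the `n` argument instead of `frac`) may be a better fit there.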
Topic python-3.x pandas
Category Data Science