How to deal with columns that have a different value in only 1 or 2 rows?

I have very high-dimensional data. Almost 20% of the columns have a different value in less than 1% of rows. All of these are binary columns, and many of them contain 0 in roughly 98% or more of the rows.

Some more info: the target variable is an imbalanced (91.9% : 8.1%) binary variable.

Every variable I have, except 3, is binary.

I would like some ideas on how to deal with columns like this. Should I drop them, or use SMOTE to get more data?

Thanks in advance.



In other words, you have sparse binary features: the vast majority of the values are zeros, and the few remaining values are ones.

One option is to transform the features into a denser representation. This can be done with dimensionality reduction or feature hashing.
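To make the feature hashing idea concrete, here is a minimal sketch of the hashing trick using only the Python standard library. It assumes each row is stored as the set of column names that are 1 in that row; the column names, the bucket count of 8, and the helper name `hash_row` are all made up for illustration. In practice you would more likely use a library implementation and a much larger number of buckets.

```python
import hashlib

def hash_row(active_columns, n_buckets=8):
    """Fold the names of a row's non-zero columns into a fixed-length vector."""
    vec = [0] * n_buckets
    for name in active_columns:
        # md5 is used only as a stable, deterministic hash (not for security).
        digest = hashlib.md5(name.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_buckets
        # A separate byte picks the sign, so colliding columns tend to cancel
        # rather than always pile up in the same direction.
        sign = 1 if digest[8] % 2 == 0 else -1
        vec[bucket] += sign
    return vec

# Hypothetical rows: each set lists the binary columns that are 1 in that row.
rows = [{"col_0007", "col_4521"}, set(), {"col_0007"}]
hashed = [hash_row(r) for r in rows]
for r, h in zip(rows, hashed):
    print(sorted(r), "->", h)
```

Every row ends up with the same short length no matter how many original columns exist. The price is that different columns can collide in one bucket, so very rare columns are no longer individually identifiable.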

Another option is to pick an algorithm that is robust to sparse features.
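The answer does not name a specific algorithm, so the following is only one possible illustration: a small Bernoulli naive Bayes classifier written in plain Python that works directly on the sparse representation (each row is the set of column indices that are 1). The class name `SparseBernoulliNB`, the toy data, and the smoothing value are assumptions for the sake of the example, not a recommendation for your dataset.

```python
import math
from collections import defaultdict

class SparseBernoulliNB:
    """Bernoulli naive Bayes over rows given as sets of active column indices."""

    def __init__(self, n_features, alpha=1.0):
        self.n_features = n_features
        self.alpha = alpha  # Laplace smoothing; must be > 0

    def fit(self, rows, labels):
        class_count = defaultdict(int)
        feat_count = defaultdict(lambda: defaultdict(int))
        for active, y in zip(rows, labels):
            class_count[y] += 1
            for j in active:
                feat_count[y][j] += 1
        self.base, self.delta, self.delta_unseen = {}, {}, {}
        for c, n_c in class_count.items():
            denom = n_c + 2 * self.alpha
            # Smoothed P(x_j = 1 | c); columns never seen in class c share one value.
            p_unseen = self.alpha / denom
            p_seen = {j: (k + self.alpha) / denom for j, k in feat_count[c].items()}
            # Score of an all-zero row: log prior plus log P(x_j = 0 | c) for every column.
            score = math.log(n_c / len(labels))
            score += (self.n_features - len(p_seen)) * math.log(1 - p_unseen)
            score += sum(math.log(1 - p) for p in p_seen.values())
            self.base[c] = score
            # Correction applied for each column that is actually 1 in a row.
            self.delta[c] = {j: math.log(p) - math.log(1 - p) for j, p in p_seen.items()}
            self.delta_unseen[c] = math.log(p_unseen) - math.log(1 - p_unseen)
        return self

    def predict(self, active):
        def score(c):
            return self.base[c] + sum(
                self.delta[c].get(j, self.delta_unseen[c]) for j in active
            )
        return max(self.base, key=score)

# Toy data: 9 negatives (mostly all-zero rows) and 3 positives that share column 0.
rows = [set()] * 8 + [{2}] + [{0}, {0, 1}, {0}]
labels = [0] * 9 + [1] * 3
model = SparseBernoulliNB(n_features=5).fit(rows, labels)
print(model.predict({0}), model.predict(set()))
```

Prediction cost depends only on the number of ones in a row, not on the total number of columns, and the smoothing keeps a column that is 1 in just one or two rows from producing an extreme probability. Note that the class prior here simply mirrors the class frequencies in the training data, so with an imbalanced target you may still need to adjust the decision threshold.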
