How to deal with columns that have a different value in only 1 or 2 rows?

Question

Naveen Reddy Marthala

April 24, 2022, 22:04

I have very high-dimensional data. Almost 20% of the columns have a different value in fewer than 1% of rows. All of these are binary columns, and in many of them more than about 98% of the rows are 0.

Some more info: the target variable is an imbalanced (91.9% : 8.1%) binary variable.

All of my variables except 3 are binary.

I would like some ideas on how to deal with columns like this. Should I drop them, or use SMOTE to get more data?

Thanks in advance.

Topic data-cleaning data-mining machine-learning

Category Data Science

Brian Spiering · Accepted Answer · April 24, 2022, 22:04

In other words, you have sparse binary features: the vast majority of the values are zeros, and the rest are ones.

One option is to transform the features into a denser representation. This can be done with dimensionality reduction or feature hashing.
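
As a rough illustration of the feature hashing option, here is a minimal sketch in plain Python. The function name hash_features, the column names, and the bucket count are hypothetical choices for this example, not part of the original answer. It assumes each row is stored as the list of column names whose value is 1. In practice a library implementation such as scikit-learn's FeatureHasher would probably be a better fit.

```python
import hashlib


def hash_features(active_columns, n_buckets=8):
    """Fold the columns that are 1 in a row into a fixed-length vector.

    active_columns: names of the columns whose value is 1 in this row
                    (hypothetical representation of a sparse binary row).
    n_buckets:      length of the output vector (an assumption; tune it).
    """
    vec = [0] * n_buckets
    for name in active_columns:
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        # The first four bytes choose the bucket the column is folded into.
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        # A further byte chooses a sign, so that columns colliding in the
        # same bucket tend to cancel rather than always pile up.
        sign = 1 if digest[4] % 2 == 0 else -1
        vec[bucket] += sign
    return vec


# Hypothetical row in which only three of many thousands of columns are 1.
row = ["col_17", "col_4821", "col_90210"]
dense_row = hash_features(row, n_buckets=8)
```

Every row maps to a vector of the same length no matter how many original columns exist, so rarely active columns share buckets instead of each needing a column of its own. The price is that colliding columns can no longer be told apart.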

Another option is to pick an algorithm that is robust to sparse features.
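
The answer does not name a specific algorithm. One commonly used candidate for sparse binary inputs is Bernoulli naive Bayes, sketched below in plain Python as an assumption rather than as the answerer's recommendation. The class name SparseBernoulliNB, the toy data, and the row representation (a set of active column indices) are all hypothetical. In practice a library implementation such as scikit-learn's BernoulliNB would normally be used, and this sketch does nothing about the class imbalance mentioned in the question.

```python
import math
from collections import Counter


class SparseBernoulliNB:
    """Bernoulli naive Bayes over rows given as sets of active column ids."""

    def __init__(self, n_features, alpha=1.0):
        self.n_features = n_features
        # Laplace smoothing: a column never seen as 1 in a class still gets
        # a small non-zero probability, so no logarithm of zero occurs.
        self.alpha = alpha

    def fit(self, rows, labels):
        self.classes_ = sorted(set(labels))
        class_count = Counter(labels)
        ones = {c: Counter() for c in self.classes_}
        for active, y in zip(rows, labels):
            ones[y].update(active)

        n = len(labels)
        self.base_ = {}   # log-score of an all-zero row, per class
        self.delta_ = {}  # log-score change when column j is 1, per class
        for c in self.classes_:
            denom = class_count[c] + 2 * self.alpha
            base = math.log(class_count[c] / n)
            delta = []
            for j in range(self.n_features):
                p1 = (ones[c][j] + self.alpha) / denom
                base += math.log(1.0 - p1)
                delta.append(math.log(p1) - math.log(1.0 - p1))
            self.base_[c] = base
            self.delta_[c] = delta
        return self

    def predict(self, active):
        # Start from the all-zero score and adjust only for the few columns
        # that are 1, so the cost scales with the number of ones in the row.
        scores = {
            c: self.base_[c] + sum(self.delta_[c][j] for j in active)
            for c in self.classes_
        }
        return max(scores, key=scores.get)


# Toy data: each row lists the indices of the columns that are 1.
rows = [{0}, {0, 1}, set(), {2}, set(), {2, 3}]
labels = [1, 1, 0, 0, 0, 0]
model = SparseBernoulliNB(n_features=4).fit(rows, labels)
```

Calling model.predict({0}) scores a new row in which only column 0 is set. The smoothing term matters here because a column that is 1 in only one or two training rows would otherwise get a probability of exactly zero in the other class.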
