Unable to understand which features to choose

I am new here. I am working with a dataset in which each row describes a footballer's attempt at goal, and I want to predict one of two outcomes: whether or not the attempt resulted in a goal.

I have done some basic cleaning, but I still get only 60% accuracy with every sklearn classifier I try.

I have removed a few features that I thought would not contribute to the 'y' value and encoded a few others, but the accuracy still does not improve.

I am not sure how to attach a CSV, but this is what the data looks like: [image: data]

For the rows where 'is_goal' has no value, the aim is to predict whether a goal was scored.

Should I attach the notebook as well to show the work I've done so far?

Topic dataset feature-selection

Category Data Science


Create a correlation plot and drop one column from each pair of highly correlated columns. Also remove columns that are irrelevant to the prediction, such as home/away.
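The correlation check can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names (`distance_m`, `distance_ft`, `angle`), the synthetic data, and the 0.9 threshold are made up, since the real dataset is not available here. The helper `drop_highly_correlated` is a hypothetical name, not a library function.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9):
    """Drop one column from every pair of numeric columns with |corr| > threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic stand-in data; these column names are assumptions for illustration.
rng = np.random.default_rng(0)
distance = rng.uniform(5, 40, size=200)
demo = pd.DataFrame({
    "distance_m": distance,
    "distance_ft": distance * 3.281,        # same information in another unit
    "angle": rng.uniform(0, 90, size=200),  # independent of distance
})

reduced, dropped = drop_highly_correlated(demo)
print(dropped)
```

Pass only the feature columns to a helper like this, not the target ('is_goal'), so the target is never a candidate for removal.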

If you're interested in feature selection, you can use the methods in the mlxtend library. That said, selecting the right features is unlikely to improve your results drastically; other changes are likely to help more.
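As a rough idea of what wrapper-style feature selection looks like, here is a minimal sketch. Note that it uses scikit-learn's `SequentialFeatureSelector` rather than mlxtend (mlxtend provides a class with the same name in `mlxtend.feature_selection`), simply because the question already relies on sklearn. The data is synthetic, and the choice of logistic regression and of three features to keep are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data standing in for the real dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Forward selection: start with no features and greedily add the one that
# improves the cross-validated score the most, until 3 are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    cv=5,
)
selector.fit(X, y)

selected = selector.get_support(indices=True)  # column indices that were kept
X_reduced = selector.transform(X)
print(selected)
```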

Here are some suggestions to improve your results:

  1. Make sure you treat continuous and categorical columns appropriately, encoding only the categorical ones
  2. Standardize all continuous columns, for example with StandardScaler
  3. Experiment with feature extraction techniques such as PCA
  4. Try gradient boosting (LightGBM or CatBoost)
  5. There seems to be missing data; make sure to handle it correctly. If only a few rows are affected, you can drop them.
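Several of these suggestions can be combined in a single scikit-learn pipeline. The sketch below is an illustration under assumptions, not a drop-in solution: the columns `distance`, `angle`, and `shot_type` and the synthetic data are invented, and sklearn's `GradientBoostingClassifier` stands in for LightGBM or CatBoost so that the example needs no extra packages. Only 'is_goal' comes from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; feature names are assumptions for illustration.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "distance": rng.uniform(5, 40, n),
    "angle": rng.uniform(0, 90, n),
    "shot_type": rng.choice(["header", "left_foot", "right_foot"], n),
})
df["is_goal"] = (df["distance"] + rng.normal(0, 8, n) < 18).astype(float)
df.loc[rng.random(n) < 0.05, "distance"] = np.nan  # some missing feature values
df.loc[rng.random(n) < 0.10, "is_goal"] = np.nan   # rows whose outcome is unknown

numeric = ["distance", "angle"]
categorical = ["shot_type"]
features = numeric + categorical

preprocess = ColumnTransformer(
    [
        # Continuous columns: fill gaps with the median, then standardize.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric),
        # Categorical columns: one-hot encode only these.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,  # always produce a dense array
)
model = Pipeline([
    ("prep", preprocess),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# Train on rows with a known outcome; predict the rows where it is missing.
labelled = df[df["is_goal"].notna()]
unlabelled = df[df["is_goal"].isna()]
X, y = labelled[features], labelled["is_goal"].astype(int)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
model.fit(X, y)
predictions = model.predict(unlabelled[features])
print(scores.mean())
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are fitted only on the training folds during cross-validation, which avoids leaking information from the validation rows. Tree-based models do not strictly need scaling, but it does no harm and keeps the pipeline reusable with other classifiers.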
