Unable to understand which features to choose

Question


Karan Singh

May 2, 2022, 02:02

I am a newbie here. I am working with a dataset of attempts on goal by a footballer, and I want to predict one of two possible outcomes: whether or not the attempt resulted in a goal.

I have done some basic cleaning, but I am still getting only 60% accuracy with every sklearn classifier I try.

I have removed a few features that I thought would not contribute to the 'y' value and encoded a few others, but the accuracy still does not improve.

I am not sure how to attach a CSV, but this is what the data looks like: [data]

For the rows where 'is_goal' has no value, the aim is to predict whether or not a goal was scored.

Should I attach the notebook as well to show the work I've done so far?

Topic dataset feature-selection

Category Data Science

Sid · Accepted Answer · July 21, 2019, 05:47

Create a correlation plot and, for each pair of highly correlated columns, remove one of the two. Also remove columns that are irrelevant to the prediction, such as home/away.
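As a rough sketch of the correlation step, assuming the data is in a pandas DataFrame: the column names below (distance_m, distance_ft, angle) and the 0.9 threshold are made up for illustration and are not taken from the dataset in the question.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.9):
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is looked at once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Hypothetical toy data: distance_ft carries the same information as distance_m.
rng = np.random.default_rng(0)
distance = rng.uniform(5, 40, size=200)
toy = pd.DataFrame({
    "distance_m": distance,
    "distance_ft": distance * 3.281,
    "angle": rng.uniform(0, 90, size=200),
})

reduced, dropped = drop_highly_correlated(toy)
print(dropped)
```

To inspect the correlations visually, the same matrix can be passed to a heatmap function such as seaborn's heatmap.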

If you're interested in feature selection, you can use the methods in the mlxtend library. However, selecting the right features is unlikely to improve your results drastically; other changes may help more.
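For a feel of what sequential feature selection looks like, here is a minimal sketch. Note that it uses scikit-learn's own SequentialFeatureSelector rather than mlxtend, simply to avoid an extra dependency; mlxtend offers a selector with the same name but a somewhat different interface. The synthetic data, the logistic regression estimator, and the choice of 3 features are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data: 8 features, only 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Forward selection: start with no features and greedily add the one that
# improves cross-validated accuracy the most, until 3 features are chosen.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    scoring="accuracy",
    cv=5,
)
selector.fit(X, y)

selected = selector.get_support(indices=True)  # indices of the kept columns
X_selected = selector.transform(X)
print(selected, X_selected.shape)
```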

Here are some suggestions to improve your results:

Make sure you treat continuous and categorical columns appropriately: encode only the categorical columns.
Normalize all continuous columns, for example with StandardScaler.
Experiment with feature extraction techniques such as PCA.
Try gradient boosting (LightGBM or CatBoost).
There seems to be missing data. Make sure to handle it correctly; if only a few rows are affected, you can drop them.
