What's the purpose of statistical analysis (statistically significant features) vs. feature elimination in machine learning?

I am developing a classification model for COVID-19 symptoms (after being ill), and I don't understand the importance of some parts of statistical analysis.

1 Firstly: Basically, we perform statistical analysis to learn about the data. However, what is the purpose of computing the mean and standard deviation, as shown here:

https://www.sciencedirect.com/science/article/pii/S0010482522000762#bib27

What insight will it give me?

2 Moreover: They perform statistical tests such as chi-square to find the statistically significant features. Suppose they have around 15 blood parameters and the tests say that only 10 of them are statistically significant. Does that mean the other 5 won't be used in training and can be removed?

3 If they can be removed: Would feature elimination show the same thing? Suppose we used Recursive Feature Elimination or a random forest to pick the 10 best features. Would the results be the same?

Topic classification statistics machine-learning

Category Data Science


Though the paper does not go into detail, it looks like they took some of the continuous variables, ranked them, and then used chi-square to determine the feature set. No explanation is given as to why they did that. As for the features not found significant: you can certainly use them in a model. Chi-square is a weak test, and there may be interactions found in the model that are meaningful.
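To illustrate the point about interactions, here is a minimal sketch on made-up data (not the paper's data). The features `a` and `b`, the label `y`, and the helper `chi_square_2x2` are all hypothetical names introduced for this example. Each feature on its own looks completely unrelated to the label under a chi-square test, yet the two together determine it exactly, so dropping them based on the univariate test would throw away all the signal.

```python
import math

def chi_square_2x2(x, y):
    """Pearson chi-square statistic and p-value (1 degree of freedom,
    no continuity correction) for two binary 0/1 sequences."""
    n = len(x)
    table = [[0, 0], [0, 0]]
    for xi, yi in zip(x, y):
        table[xi][yi] += 1
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical, perfectly balanced data: the label is the XOR of a and b.
a = [0, 0, 1, 1] * 25
b = [0, 1, 0, 1] * 25
y = [ai ^ bi for ai, bi in zip(a, b)]

stat_a, p_a = chi_square_2x2(a, y)      # a alone vs. label
stat_b, p_b = chi_square_2x2(b, y)      # b alone vs. label
interaction = [ai ^ bi for ai, bi in zip(a, b)]
stat_ab, p_ab = chi_square_2x2(interaction, y)  # combined feature vs. label

print(f"a alone:     chi2={stat_a:.1f}, p={p_a:.3g}")
print(f"b alone:     chi2={stat_b:.1f}, p={p_b:.3g}")
print(f"a XOR b:     chi2={stat_ab:.1f}, p={p_ab:.3g}")
```

This is an extreme, constructed case. Real blood parameters will rarely behave this cleanly, but a model-based method such as a tree ensemble or recursive feature elimination can pick up this kind of joint effect, which is one reason its feature ranking need not match a univariate test.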

In any case, the statistical tests were exploratory; they were not used directly for inference. It is always good practice to compute basic descriptive statistics before approaching any ML. For example, they could not have performed the missing-value imputation without first seeing how many values were missing. Also note that the MVC variable has overlapping confidence intervals between the COVID and non-COVID groups, which is sometimes a sign that the groups do not differ significantly on that variable.
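As a rough sketch of what such a descriptive pass can look like, the snippet below summarizes one variable per group: count, number of missing values, mean, standard deviation, and an approximate 95% confidence interval for the mean. The numbers are invented for illustration and are not taken from the paper; the names `wbc`, `describe`, and `intervals_overlap` are hypothetical, and the interval uses a simple normal approximation.

```python
import math
import statistics

# Hypothetical white blood cell counts; None marks a missing measurement.
wbc = {
    "covid": [4.1, 5.0, None, 3.8, 4.6, 5.2, 4.4, None, 4.9, 4.0],
    "non_covid": [7.9, 8.4, 6.8, None, 9.1, 7.5, 8.8, 7.2, 8.0, 8.3],
}

def describe(values):
    """Summarize one group, ignoring missing values."""
    observed = [v for v in values if v is not None]
    n = len(observed)
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)  # sample standard deviation
    half_width = 1.96 * sd / math.sqrt(n)  # normal approximation
    return {
        "n": n,
        "missing": len(values) - n,
        "mean": mean,
        "sd": sd,
        "ci": (mean - half_width, mean + half_width),
    }

def intervals_overlap(first, second):
    """True if two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]

summary = {group: describe(values) for group, values in wbc.items()}
overlap = intervals_overlap(summary["covid"]["ci"], summary["non_covid"]["ci"])

for group, stats in summary.items():
    low, high = stats["ci"]
    print(f"{group}: n={stats['n']}, missing={stats['missing']}, "
          f"mean={stats['mean']:.2f}, sd={stats['sd']:.2f}, "
          f"95% CI=({low:.2f}, {high:.2f})")
print("confidence intervals overlap:", overlap)
```

The missing counts tell you whether imputation is needed and how much of the data it would affect; the means and intervals give a first impression of which variables separate the groups. Overlapping intervals are only a hint, not a formal test.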

They selected four features: white blood cell count (WBC), monocyte count (MOT), age, and lymphocyte count (LYT). They ran these through 8 machine learning algorithms for classification and used a stacked ML model.
