Comparing the performance of different models using hypothesis testing
A common workflow in applied ML is to train several models on some data and evaluate them on a particular test set. I have often seen people simply pick an ML metric based on their requirements and choose between models on that basis alone.
But is this process right? Shouldn't we ideally be doing hypothesis testing and establishing statistical and practical significance before claiming that model A is better than model B, rather than relying simply on an ML metric computed on a common test set?
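For concreteness, here is a minimal sketch of what I mean by testing the difference rather than just comparing point estimates: a paired bootstrap of the accuracy gap between two models on a shared test set. The arrays and accuracy levels below are hypothetical stand-ins for real labels and predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test-set labels and predictions from two models
# (in practice these would come from your trained models).
n = 1000
y_true = rng.integers(0, 2, size=n)
pred_a = np.where(rng.random(n) < 0.85, y_true, 1 - y_true)  # ~85% accurate
pred_b = np.where(rng.random(n) < 0.83, y_true, 1 - y_true)  # ~83% accurate

observed_diff = (pred_a == y_true).mean() - (pred_b == y_true).mean()

# Bootstrap the accuracy difference by resampling test examples with
# replacement, keeping the pairing between the two models' predictions.
boot_diffs = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)
    diff = (pred_a[idx] == y_true[idx]).mean() - (pred_b[idx] == y_true[idx]).mean()
    boot_diffs.append(diff)

ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"accuracy difference: {observed_diff:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
# If the interval excludes 0, the difference is statistically significant;
# whether it is *practically* significant is a separate judgment.
```

Is something along these lines (or a dedicated test such as McNemar's) what people should be doing before declaring one model better than another?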
Topic: statistics, machine-learning
Category: Data Science