How to apply a model to training data to identify mislabeled observations?

I have a list of people, attributes about those people (height, weight, blood pressure, etc.), and a binary target variable called has_heart_issues. This data represents the full population of data, and I am trying to determine whether anyone who is listed as "No" for has_heart_issues is similar to the people who are listed as "Yes".

To answer this question, I split the data into training (70%) and testing (30%) sets. I trained a random forest model on the training set and evaluated it on the testing set. The results are good, but I don't know how to apply the model to the population, since I used most of it for training. Is there any way to apply the model to the full dataset (including the training portion), given that I had labels for the full dataset to start with? Essentially, I am trying to determine whether any of the people were mislabeled.

Is it okay to apply the model to the training data to find the "mislabeled" records?

Topic random-forest classification machine-learning



There is exactly one thing you can check by examining predictions on your training data: the numerical convergence of your model's training routine. Any validation of model accuracy must use held-out or test data; that is the entire point of cross-validation. Once the model architecture and hyperparameters have been tuned through n-fold cross-validation, the standard procedure is to train a single production model on the entire dataset. At that point, you have extracted all the information the training set can give you.
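One way to score every record without ever scoring it with a model that trained on it is to use out-of-fold predictions from n-fold cross-validation. The sketch below (a hypothetical illustration using scikit-learn's `cross_val_predict` on synthetic stand-in data, not your actual dataset) flags records whose out-of-fold prediction disagrees with the recorded label; these are candidates for review, not proof of mislabeling:

```python
# Hedged sketch: out-of-fold predictions via 5-fold cross-validation,
# so each observation is predicted by a model that never saw it in training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the people/attributes data described in the question.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Each observation's prediction comes from the fold in which it was held out.
oof_pred = cross_val_predict(clf, X, y, cv=5)

# Records whose out-of-fold prediction disagrees with the recorded label
# are worth a closer manual look.
suspects = np.where(oof_pred != y)[0]
print(len(suspects), "records flagged for review")
```

This way every observation, including those you would otherwise have "used up" in training, receives a prediction from a model that held it out, which is the only setting in which a disagreement between prediction and label is informative.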
