What should I do to test the confidence of my deep learning model?

I've recently fine-tuned a deep learning model (BERT) for a sentiment classification task, using an 80/10/10 train/validation/test split. After running several experiments, I have a decent model that I'd eventually like to productionize. However, before putting it into production I'd like to design an experiment to test the model's robustness/reliability/confidence. What are some ways or experiments to assess the robustness, reliability, or confidence of this model or its predictions?

For example, are there statistically sound principles behind calculating the standard error for binary predictions on a new datapoint?

Topic deep-learning classification statistics machine-learning

Category Data Science


For binary predictions, it is standard to evaluate models using their ROC and precision-recall (PR) curves. Scalar metrics are also useful, notably the Matthews correlation coefficient (MCC), which is arguably the most holistic single-number summary for a binary classifier.
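For instance, scikit-learn can compute all three from held-out predictions. A minimal sketch, assuming `y_true` holds the 0/1 test labels and `y_prob` holds your model's positive-class probabilities (the toy arrays below are placeholders for your BERT outputs):

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,  # area under the PR curve
    matthews_corrcoef,        # MCC
    roc_auc_score,            # area under the ROC curve
)

# Placeholder test-set outputs; substitute your model's labels and probabilities.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.10, 0.80, 0.65, 0.30, 0.90, 0.40, 0.55, 0.70])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 threshold

print("ROC AUC:", roc_auc_score(y_true, y_prob))            # threshold-free ranking quality
print("PR AUC :", average_precision_score(y_true, y_prob))  # more informative under class imbalance
print("MCC    :", matthews_corrcoef(y_true, y_pred))         # balanced summary of the confusion matrix
```

Note that the ROC/PR scores use the raw probabilities, while MCC needs hard predictions, so the decision threshold itself is something worth evaluating.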

Using these metrics, you should evaluate the model via cross-validation. For deep models that take significant time to train, plain k-fold cross-validation is often sufficient; if time permits, you can also run repeated k-fold cross-validation for a more stable estimate.
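A sketch of what the k-fold loop might look like. To keep it self-contained and quick to run, a TF-IDF + logistic-regression pipeline and a few toy reviews stand in for the BERT fine-tuning step; in practice you would re-fine-tune BERT from scratch inside each fold:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

# Toy corpus; replace with your full labelled dataset.
texts = ["great movie", "terrible plot", "loved it", "waste of time",
         "fantastic acting", "boring and slow", "would watch again", "awful"]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
    # Train a fresh model on each fold's training split.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit([texts[i] for i in train_idx], labels[train_idx])
    preds = model.predict([texts[i] for i in val_idx])
    scores.append(matthews_corrcoef(labels[val_idx], preds))
    print(f"fold {fold}: MCC = {scores[-1]:.3f}")

# Mean +/- standard deviation across folds gives a rough error bar for the metric,
# which also speaks to the standard-error question in the post.
print(f"MCC = {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```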

Lastly, while not always possible, many consider evaluation on a completely different dataset to be one of the best indicators of reliability. Splitting a single dataset still risks "leaking" a common bias into both the training and test sets. With two or more datasets, biases the model picks up from the training dataset are unlikely to reappear in an independently collected dataset, which gives a more objective evaluation that better mimics a production environment. Such biases include, among other things, data acquisition methods and preprocessing/data cleaning choices.
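As a sketch of what that looks like in practice, the snippet below trains on one toy corpus and scores on a second, independently written one; the external reviews are placeholders for data gathered from a different source (different site, time period, or annotation process) than your training data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.pipeline import make_pipeline

# In-domain data used for training (placeholder for your original dataset).
train_texts = ["great movie", "terrible plot", "loved it", "waste of time",
               "fantastic acting", "boring and slow"]
train_labels = np.array([1, 0, 1, 0, 1, 0])

# External data from a different source, never touched during training.
external_texts = ["best film of the year", "I want my money back",
                  "surprisingly touching", "dull from start to finish"]
external_labels = np.array([1, 0, 1, 0])

# Stand-in classifier; swap in your fine-tuned BERT model here.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

print("external MCC:", matthews_corrcoef(external_labels, model.predict(external_texts)))
```

A large gap between the in-domain score and the external score is a red flag that the model has latched onto dataset-specific artifacts rather than sentiment itself.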
