What should I do to test the confidence of my deep learning model?
I've recently fine-tuned a BERT model for a sentiment classification task, using an 80/10/10 train/validation/test split. After running several experiments, I have a decent model that I'd eventually like to put into production. Before doing that, though, I want to design an experiment to test the robustness and reliability of the model and the confidence of its predictions. What are some experiments that could be conducted to evaluate these properties of the model or its predictions?
For example, are there statistically sound principles for calculating a standard error on a binary prediction for a new data point?
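To make the question concrete, here is a rough sketch of the kind of per-prediction experiment I have in mind: Monte Carlo dropout, i.e. keeping dropout active at inference time and averaging over several stochastic forward passes, so that the spread across passes acts as a crude standard error. The checkpoint path is a placeholder for my fine-tuned model, and I'm assuming label index 1 is the positive class:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "path/to/your-finetuned-bert"  # placeholder for the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

model.train()  # keep dropout layers active so forward passes are stochastic

text = "The movie was surprisingly good."
inputs = tokenizer(text, return_tensors="pt")

T = 50  # number of stochastic forward passes
probs = []
with torch.no_grad():
    for _ in range(T):
        logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1))

probs = torch.cat(probs, dim=0)  # shape: (T, num_labels)
mean_prob = probs.mean(dim=0)    # average predicted probability per label
std_prob = probs.std(dim=0)      # spread across passes ~ prediction uncertainty

print(f"mean P(positive) = {mean_prob[1].item():.3f} +/- {std_prob[1].item():.3f}")
```

In particular, I'd like to know whether a spread across stochastic passes like this is statistically defensible as a standard error, or whether there are better-grounded alternatives.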
Tags: deep-learning, classification, statistics, machine-learning
Category: Data Science