How to build a regression model on residuals

Let's say you have a regression model that performs well, but the results are not quite where you want them yet.

One idea I came across for improving performance is to build a second model on top of the first, trained on the first model's residuals (y - y_pred). This second model is supposed to learn for which observations the first model overpredicts or underpredicts. Its output would then be combined with the first model's prediction to improve the final result.
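
For concreteness, here is a minimal sketch of what I mean (the choice of models and the synthetic data are just placeholders to make it runnable, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder data so the sketch runs end to end
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stage 1: the "good-performing" base model
base = LinearRegression().fit(X_train, y_train)

# Stage 2: fit a second model to the residuals y - y_pred,
# using the same features as inputs
residuals = y_train - base.predict(X_train)
corrector = RandomForestRegressor(random_state=0).fit(X_train, residuals)

# Final prediction = base prediction + predicted residual correction
y_pred = base.predict(X_test) + corrector.predict(X_test)
```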

Has anyone tried this on a regression task? Are there any resources I have missed that implement this successfully?

Here are some of the questions I have:

  • Should the second model be trained on the residuals only, or on the residuals plus the original features?
  • What would be the overall best practice for implementing this in a sklearn pipeline? (A rough sketch of what I have in mind follows this list.)
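
For the pipeline question, something like the following custom estimator is what I have in mind. `ResidualRegressor` is a name I made up, and this is an untested sketch, not an established best practice:

```python
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

class ResidualRegressor(BaseEstimator, RegressorMixin):
    """Fit `second` on the residuals of `first`; predict their sum."""

    def __init__(self, first, second):
        self.first = first
        self.second = second

    def fit(self, X, y):
        self.first_ = clone(self.first).fit(X, y)
        residuals = y - self.first_.predict(X)          # y - y_pred
        self.second_ = clone(self.second).fit(X, residuals)
        return self

    def predict(self, X):
        return self.first_.predict(X) + self.second_.predict(X)

# Because it implements fit/predict, it composes with a Pipeline:
model = make_pipeline(StandardScaler(),
                      ResidualRegressor(LinearRegression(),
                                        RandomForestRegressor(random_state=0)))
```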

Tags: regression, scikit-learn, machine-learning



I think what you are describing is essentially gradient boosting for regression. The concept there is similar to what you describe: you iteratively fit a simple base model to your regression task, then fit the next model to the errors of the current ensemble, so that each new model explains part of what the previous ones missed. This is usually done with decision trees, so it is not directly comparable to a linear regression, but you might find it useful to look deeper into the concept, especially since there are already a lot of packages implementing it in Python and R.
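
To make the connection concrete, here is a toy version of that boosting loop for squared loss. This is a simplified sketch of the idea, not how the library implementations actually work internally, and the function name and defaults are made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_predict(X, y, X_new, n_rounds=100, learning_rate=0.1):
    """Toy gradient boosting for squared loss: each tree fits the
    current residuals, and the corrections are accumulated."""
    # Start from the mean, the best constant predictor for squared loss
    prediction = np.full(len(y), y.mean())
    new_prediction = np.full(len(X_new), y.mean())
    for _ in range(n_rounds):
        residuals = y - prediction                      # errors of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)   # shrink each correction
        new_prediction += learning_rate * tree.predict(X_new)
    return new_prediction
```

Ready-made implementations such as sklearn.ensemble.GradientBoostingRegressor do essentially this, with many refinements on top.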
