A New Approach to Handle Data Shift Based on Feature Importance Measurement
Data shift, in which the data distribution in the deployment context differs from the training distribution, poses a significant challenge to the reliability and predictive performance of machine learning models. In this study, we propose a method for handling data shift called Random Forest with Biased Splitting (RFBS). RFBS first trains a standard random forest on the source domain, then measures the importance of each feature of that model on the target data, and finally uses those feature importances to learn a new random forest adapted to the target domain. In experiments on 15 benchmark tabular datasets, RFBS outperformed a standard random forest on the target domain and was very competitive with many established and state-of-the-art methods for handling data shift.
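To make the second and third steps more concrete, the following is a minimal sketch in plain Python. It is not the RFBS implementation. It assumes that feature importance is measured as permutation importance (the drop in accuracy on labeled target data when one feature column is shuffled) and that "biased splitting" means drawing the candidate features for a split with probability proportional to those importances. The abstract confirms neither choice, and the names `permutation_importance`, `biased_feature_sample`, `n_repeats`, and `eps` are hypothetical.

```python
import random


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop on (X, y) when each feature column is shuffled.

    predict: callable mapping one row (a list) to a label; stands in for
             the source-domain forest (assumption).
    X, y:    labeled target-domain rows (lists) and labels (assumption).
    """
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(shuffled))
        # Clip negative values so the result can serve as a sampling weight.
        importances.append(max(0.0, sum(drops) / n_repeats))
    return importances


def biased_feature_sample(importances, k, rng, eps=1e-3):
    """Draw k distinct feature indices, favouring important features.

    Uses Efraimidis-Spirakis weighted sampling without replacement:
    each feature gets the key u ** (1 / w) and the k largest keys win.
    eps keeps zero-importance features selectable (assumption).
    """
    weights = [w + eps for w in importances]
    keys = [(rng.random() ** (1.0 / w), j) for j, w in enumerate(weights)]
    return sorted(j for _, j in sorted(keys, reverse=True)[:k])
```

In a tree learner, `biased_feature_sample` would replace the uniform draw of candidate features at each split, so that features the source model relies on under the target distribution are considered more often. Whether RFBS favours or penalises such features, and how it weights them, is specified in the body of the paper rather than in this sketch.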
