A New Approach To Handle Data Shift Based On Feature Importance Measurement

Data shift, in which the distribution of data in the deployment context differs from the training data distribution, poses a significant challenge to the reliability and predictive performance of machine learning models. In this study, we propose a method to handle data shift, called Random Forest with Biased Splitting (RFBS). RFBS first trains a standard random forest on the source domain, then measures the importance of each feature in that model on the target data, and finally uses those feature importances to learn a new random forest model adapted to the target domain. We conducted experiments on 15 benchmark tabular datasets and observed that RFBS outperformed a standard random forest on the target domain and performed very competitively against many established and state-of-the-art data-shift-handling methods.

Ewerton Costadelle
Universidade Federal Fluminense (UFF) / Instituto Federal de Educação, Ciência e Tecnologia de Rondônia (IFRO)
Brazil

Maia Marcelo
Universidade Federal Fluminense (UFF) / Instituto Brasileiro de Geografia e Estatística (IBGE)
Brazil

Alexandre Plastino
Universidade Federal Fluminense (UFF)
Brazil

Alex Freitas
University of Kent
United Kingdom
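
The abstract describes RFBS only at a high level: measure feature importance on the target data, then use it to bias how the new forest is learned. The sketch below illustrates one possible reading of the second and third steps and should not be taken as the authors' implementation. It assumes permutation importance as the importance measure and importance-weighted sampling of candidate split features as the bias; neither choice is stated in the abstract. All function names (`permutation_importance`, `sampling_weights`, `sample_candidate_features`) are hypothetical, and a simple threshold rule stands in for the source-domain random forest.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop on (X, y) when each feature column is shuffled.

    Assumption: the abstract does not name the importance measure;
    permutation importance is used here purely for illustration.
    """
    rng = random.Random(seed)
    base = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(base - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def sampling_weights(importances, floor=0.05):
    """Turn importances into feature-sampling probabilities.

    The floor (a hypothetical parameter) keeps every feature selectable.
    """
    clipped = [max(imp, 0.0) + floor for imp in importances]
    total = sum(clipped)
    return [c / total for c in clipped]


def sample_candidate_features(weights, k, rng):
    """Draw k distinct feature indices, favouring high-weight features.

    Assumption: this is one way a tree split could be "biased"; a standard
    random forest would sample candidate features uniformly instead.
    """
    remaining = list(range(len(weights)))
    w = list(weights)
    chosen = []
    for _ in range(min(k, len(remaining))):
        idx = rng.choices(range(len(remaining)), weights=w)[0]
        chosen.append(remaining.pop(idx))
        w.pop(idx)
    return chosen


# Toy target-domain data: feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
X_target = [[i / 40, data_rng.random()] for i in range(40)]
y_target = [int(row[0] > 0.5) for row in X_target]


def source_model(row):
    """Stand-in for a random forest trained on the source domain."""
    return int(row[0] > 0.5)


importances = permutation_importance(source_model, X_target, y_target)
weights = sampling_weights(importances)
candidates = sample_candidate_features(weights, k=1, rng=random.Random(1))
```

In this toy setting the stand-in model ignores feature 1, so shuffling it never changes a prediction and its importance on the target data is zero. Feature 0 therefore receives the larger sampling weight and would be proposed more often as a split candidate when the adapted forest is grown.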