Integrating Heterogeneous Legacy Soil Data for Smart Agriculture
The heterogeneity and fragmentation of legacy soil analysis datasets limit their reuse in digital agriculture systems. This paper addresses the challenge of systematically integrating semantically inconsistent and structurally heterogeneous soil data into a unified and interoperable infrastructure. We present a reproducible extract--transform--load (ETL) pipeline that combines schema alignment, unit normalization, and supervised textual similarity to harmonize free-text attributes, followed by domain-expert validation. The workflow was applied to publicly available soil datasets comprising 32,148 records and 216 attributes. After cleaning, validation, and georeferencing checks, 16,181 consistent entries were consolidated into a relational database. The supervised similarity-based mapping substantially reduced manual effort while preserving semantic coherence, achieving an expert agreement rate of 85%. The resulting database provides a structured, georeferenced, and traceable resource for downstream applications in digital agriculture, supporting data interoperability and alignment with sustainable agricultural practices.
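
To make the harmonization step concrete, the following is a minimal sketch of similarity-based attribute mapping combined with unit normalization. It is an illustration under stated assumptions, not the pipeline described above: the canonical attribute names, the conversion table, the 0.6 threshold, and the function names are all hypothetical, and the character-level ratio from Python's standard `difflib` module stands in for the supervised textual similarity used in this work.

```python
from difflib import SequenceMatcher

# Hypothetical canonical attribute names; the actual target schema is not given here.
CANONICAL = ["organic matter", "phosphorus", "potassium", "soil ph", "clay content"]

# Hypothetical conversion factors to a common unit (g/kg).
# 1 % = 1 g per 100 g = 10 g/kg; 1 mg/kg = 0.001 g/kg.
TO_G_PER_KG = {"g/kg": 1.0, "%": 10.0, "mg/kg": 0.001}


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (difflib ratio, a stand-in measure)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def map_attribute(raw: str, threshold: float = 0.6):
    """Return (canonical name, score), or (None, score) to flag the label for expert review."""
    best = max(CANONICAL, key=lambda c: similarity(raw, c))
    score = similarity(raw, best)
    return (best if score >= threshold else None), score


def normalize(value: float, unit: str) -> float:
    """Convert a concentration to g/kg using the hypothetical factor table."""
    return value * TO_G_PER_KG[unit]


if __name__ == "__main__":
    # A misspelled legacy label is mapped; an unrelated one is left for review.
    print(map_attribute("Organic Mater"))
    print(map_attribute("xyz"))
    print(normalize(2.5, "%"))
```

In this sketch, labels that score below the threshold are returned unmapped rather than forced onto the nearest canonical name, which mirrors the role of domain-expert validation after the automated mapping.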
