Automated Bird Species Identification: Evaluating Transformers and CNNs in Bioacoustic Soundscapes
Automatic identification of bird vocalizations from field recordings is a key tool for biodiversity monitoring and conservation. In this study, we evaluated and compared strategies for classifying bird sounds using a balanced dataset of 20 species and stratified 5-fold cross-validation. We compared traditional machine-learning approaches (Random Forest, Support Vector Machine [SVM], and Multi-Layer Perceptron [MLP]) trained on an extended set of acoustic features (MFCCs, delta coefficients, pitch, and spectral features) with specialized bioacoustic deep-learning frameworks based on CNN and Transformer architectures. Of the classical models, optimized with SMOTE and hyperparameter tuning, the SVM achieved a mean macro F1 score of 0.614 ± 0.012, whereas an OpenSoundscape CNN based on ResNet50 achieved 0.841 ± 0.005. Among the transfer-learning approaches with pre-trained embeddings, BirdNET combined with a lightweight MLP classifier reached 0.770 ± 0.020, and the best result overall was obtained using embeddings from a pre-trained Transformer model (Audio Spectrogram Transformer), with a macro F1 score of 0.910 ± 0.005, outperforming all other evaluated approaches. These results show that Transformer-based deep representations specialized for bioacoustics substantially outperform both traditional methods and CNNs, and that the stratified cross-validation scheme provides reliable estimates of real-world performance. Finally, we discuss the main findings and give recommendations for deploying these technologies in passive acoustic monitoring projects.
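
To make the reported metric concrete, the following is a minimal sketch of how a macro F1 score can be computed per fold and summarized as mean ± spread across stratified folds. It is an illustration under stated assumptions, not the evaluation code used in this study: the helper names (macro_f1, stratified_folds, summarize) are hypothetical, the fold assignment is a simple deterministic round-robin without shuffling, and the "±" value is assumed here to be the sample standard deviation across folds.

```python
from collections import defaultdict
from statistics import mean, stdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def stratified_folds(labels, n_splits=5):
    """Spread the samples of each class evenly over n_splits folds (round-robin)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def summarize(fold_scores):
    """Return (mean, sample standard deviation) of the per-fold scores.

    Assumption: the spread is the sample standard deviation across folds.
    """
    return mean(fold_scores), stdev(fold_scores)
```

In use, each fold in turn serves as the test set while a classifier is trained on the remaining folds; macro_f1 is evaluated on the held-out predictions, and summarize is applied to the five resulting scores. Because macro F1 weights every species equally, it is a natural choice for a balanced 20-class dataset.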
