Preliminary Findings From An Ensemble Classifying Study of Spanish-Speaking Health Discourses On Social Media
Healthcare surveys capture what healthcare systems want to know, while online social networks (OSNs) capture what patients actually say. While institutional surveys are susceptible to question framing bias, online social networks contain naturalistic, unfiltered health discourse. This study presents a ensemble approach that uses keyword, traditional machine learning and deep learning to detect and classify disease-related discussions across Spanish-speaking Reddit communities in Mexico and Argentina, targeting oncology, depression, and diabetes. A corpus of 32,339,750 comments was processed through a thorough filter process, which culminates with disease-focused ensemble and BERT-based neural classification. Three binary classifiers fine-tuned on dccuchile/bert-base-spanish-wwm-cased achieved F1 scores of 0.977, 0.993, and 0.995 for oncology, depression, and di-abetes, respectively. The classification of depression resulted in a significant number of false positives. The BERT model's inability to reliably differentiate the usage of the word 'crisis' in Spanish, which appears in both psychological and economic contexts, was resolved through the application of rule-based post-processing. This study provides a replicable baseline for Spanish-language health classification on Reddit, with direct relevance to Mexico and Argentina, two countries whose Reddit health discourse remains largely unexplored at scale.
