Content Embeddings and Graphs for Modeling Similarity Among Scientific Articles
The accelerated growth of scientific production makes it harder to identify relevant works during literature searches, screening, and review. This paper proposes a hybrid methodology that integrates text embeddings and similarity graphs to model and visualize semantic relationships among scientific articles. The approach adopts a dual semantic encoding strategy: two complementary textual views of each document are embedded independently, and the pairwise similarities derived from each encoding serve distinct roles. (i) The similarity between title–abstract embeddings defines the graph structure (edges and weights), while (ii) the similarity between embeddings of keywords and indexing descriptors qualifies those connections visually, as an explicit thematic affinity layer encoded through color bands. In the embedding stage, the SPECTER model is applied to both textual views, and pairwise similarities are computed with the cosine measure. Using a configurable similarity threshold, an undirected weighted graph is constructed and visualized with a force-directed layout (Fruchterman–Reingold), with node sizes proportional to citation counts. An exploratory evaluation was conducted on a real-world corpus of scientific articles, executing the complete pipeline from text extraction to graph visualization. The results indicate that the pipeline is operationally feasible and produces semantically structured graphs with visually interpretable patterns of global semantic proximity and thematic affinity. This version focuses on demonstrating applicability and interpretability, and it lays a foundation for future quantitative validation against reference methods and expert-based evaluation.
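
As an illustration of the dual-encoding idea, the following minimal Python sketch builds an edge list from two sets of embeddings. It is not the authors' implementation: the toy 2-D vectors stand in for SPECTER embeddings, and the function names (`build_similarity_graph`, `affinity_band`), the threshold value of 0.8, and the band cutoffs are all hypothetical choices made only for this example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_similarity_graph(text_emb, keyword_emb, threshold=0.8):
    """Return {(i, j): {"weight": ..., "affinity": ...}} for an undirected graph.

    text_emb    : dict id -> title-abstract embedding (defines edges and weights)
    keyword_emb : dict id -> keyword/descriptor embedding (qualifies each edge)
    threshold   : hypothetical cutoff; the abstract only says it is configurable
    """
    edges = {}
    for i, j in combinations(sorted(text_emb), 2):
        weight = cosine(text_emb[i], text_emb[j])
        if weight >= threshold:
            edges[(i, j)] = {
                "weight": weight,
                "affinity": cosine(keyword_emb[i], keyword_emb[j]),
            }
    return edges


def affinity_band(affinity, cutoffs=(0.5, 0.8)):
    """Map a keyword similarity to a color band (names and cutoffs are assumptions)."""
    low, high = cutoffs
    if affinity >= high:
        return "high"
    if affinity >= low:
        return "medium"
    return "low"


# Toy 2-D vectors standing in for SPECTER embeddings (assumption for illustration).
text_emb = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
keyword_emb = {"A": [1.0, 0.0], "B": [0.6, 0.8], "C": [0.0, 1.0]}

graph = build_similarity_graph(text_emb, keyword_emb, threshold=0.8)
for (i, j), attrs in graph.items():
    print(i, j, round(attrs["weight"], 3), affinity_band(attrs["affinity"]))
```

The edge set depends only on the title–abstract similarities, while the keyword similarity is stored as a separate edge attribute, mirroring the two roles described above. For the layout step, one option would be networkx's `spring_layout`, which implements the Fruchterman–Reingold force-directed algorithm; node sizes could then be scaled by citation count at drawing time.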
