An Uncertainty-Gated Neuro-Symbolic Framework for High-Coverage Topic Modeling and Trend Analysis in Scholarly Corpora with LLM Assistance
| dc.contributor.author | Demir, Onur | |
| dc.contributor.author | Saran, Murat | |
| dc.date.accessioned | 2026-06-05T08:49:35Z | |
| dc.date.available | 2026-06-05T08:49:35Z | |
| dc.date.issued | 2026 | |
| dc.description.abstract | The rapid growth of scientific literature demands scalable methods that can track research evolution, yet density-based topic models such as BERTopic systematically exclude low-density documents as outliers, obscuring emerging and niche research areas. We propose a Neuro-Symbolic, Uncertainty-Gated Framework that recovers these outliers through geometric centroid reassignment and an ontological entropy gate derived from the Computer Science Ontology (CSO), routing only genuinely ambiguous cases to a local Large Language Model (Qwen2.5-14B via Ollama). A controlled ablation study demonstrates that centroid reassignment provides the largest coverage gain (+22.9 percentage points (pp)), the CSO entropy gate preserves niche-topic integrity, and selective LLM routing adds a further +5.9 pp. On 12,535 Turkish computer engineering theses (TR-CS; 2001-2025), the full pipeline raises coverage from 75.5% ± 1.2% (Bare BERTopic) to 95.7% ± 0.4% (five-seed means) while maintaining competitive coherence (NPMI = 0.112 ± 0.006) and cross-seed stability (AMI = 0.832 ± 0.015), at approximately 15x fewer LLM calls than a fully generative Pure-LLM baseline. Mann-Kendall trend tests on the high-coverage series identify 69 statistically significant trends (FDR q < 0.05), and cross-corpus validation on approximately 200K arXiv CS abstracts confirms that the architecture generalizes beyond the primary dataset. The framework offers a reproducible, cost-effective solution for monitoring scientific developments in rapidly evolving fields. | |
| dc.identifier.doi | 10.1109/ACCESS.2026.3687277 | |
| dc.identifier.issn | 2169-3536 | |
| dc.identifier.scopus | 2-s2.0-105037802188 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12416/16136 | |
| dc.identifier.uri | https://doi.org/10.1109/ACCESS.2026.3687277 | |
| dc.language.iso | en | |
| dc.publisher | IEEE-Inst Electrical Electronics Engineers Inc | |
| dc.relation.ispartof | IEEE Access | |
| dc.rights | info:eu-repo/semantics/openAccess | |
| dc.subject | Computer Science Ontology (CSO) | |
| dc.subject | Large Language Models (LLMs) | |
| dc.subject | Scientometrics | |
| dc.subject | Neuro-Symbolic AI | |
| dc.subject | Topic Modeling | |
| dc.subject | Outlier Detection | |
| dc.subject | Trend Analysis | |
| dc.title | An Uncertainty-Gated Neuro-Symbolic Framework for High-Coverage Topic Modeling and Trend Analysis in Scholarly Corpora with LLM Assistance | en_US |
| dc.type | Article | |
| dspace.entity.type | Publication | |
| gdc.author.scopusid | 60615606700 | |
| gdc.author.scopusid | 24722292900 | |
| gdc.author.wosid | Saran, Murat/U-5382-2018 | |
| gdc.coar.access | open access | |
| gdc.coar.type | text::journal::journal article | |
| gdc.description.department | Çankaya University | |
| gdc.description.departmenttemp | [Demir, Onur; Saran, Murat] Cankaya Univ, Dept Comp Engn, Ankara, Turkiye | |
| gdc.description.endpage | 66464 | |
| gdc.description.publicationcategory | Article - International Refereed Journal - Institutional Faculty Member | |
| gdc.description.startpage | 66445 | |
| gdc.description.volume | 14 | |
| gdc.description.woscitationindex | Science Citation Index Expanded | |
| gdc.identifier.wos | WOS:001763003100037 | |
| gdc.index.type | Scopus | |
| gdc.index.type | WoS | |
| relation.isAuthorOfPublication.latestForDiscovery | f92fb8be-a1b3-4888-abaf-40ad03004780 | |
| relation.isOrgUnitOfPublication.latestForDiscovery | 0b9123e4-4136-493b-9ffd-be856af2cdb1 |
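
The abstract describes a three-way routing of outlier documents: reassignment to the nearest topic centroid, an entropy gate over ontology labels, and escalation of ambiguous cases to an LLM. The record does not give the actual decision rule, so the following is only a minimal sketch of how such a gate could look. The function names (`normalized_entropy`, `route_outlier`), the thresholds, the order of the checks, and the use of normalized Shannon entropy over a document's matched ontology labels are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def normalized_entropy(labels):
    """Shannon entropy of the label distribution, scaled to [0, 1].

    Returns 0.0 when there are fewer than two distinct labels.
    """
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))


def route_outlier(embedding, centroids, ontology_labels,
                  sim_threshold=0.5, entropy_threshold=0.8):
    """Decide what to do with one outlier document (hypothetical rule).

    embedding       -- document vector
    centroids       -- dict mapping topic id -> centroid vector
    ontology_labels -- ontology areas matched in the document (assumed input)
    Thresholds are illustrative values, not taken from the paper.
    """
    best_topic, best_sim = max(
        ((topic, cosine(embedding, c)) for topic, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    entropy = normalized_entropy(ontology_labels)

    # Close to a centroid and ontologically unambiguous: reassign directly.
    if best_sim >= sim_threshold and entropy <= entropy_threshold:
        return ("assign", best_topic)
    # Ontologically ambiguous: escalate to the LLM.
    if entropy > entropy_threshold:
        return ("llm", None)
    # Far from every centroid but unambiguous: leave as a niche outlier.
    return ("keep_outlier", None)


centroids = {0: [1.0, 0.0], 1: [0.0, 1.0]}
print(route_outlier([0.9, 0.1], centroids, ["ml", "ml", "ml"]))
print(route_outlier([0.7, 0.7], centroids, ["ml", "db"]))
print(route_outlier([-1.0, -1.0], centroids, ["ml"]))
```

In this sketch only documents whose ontology labels are spread across several areas reach the LLM branch, which is one way a gate of this kind could keep the number of LLM calls low relative to sending every outlier to the model.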
