An Adaptive and Context-Aware Text Segmentation Method for Information Retrieval

Şirin, Burçe

Date

2026

Abstract

With the rapid growth of digital applications, the volume of textual data has increased significantly and has become more diverse and complex in language, structure, content, and length. In such a broad information environment, it has become harder for users to access the information they need accurately and efficiently, which makes information retrieval systems indispensable. However, processing long, content-rich documents as a single unit may cause semantic blurring and increased computational cost. To address these issues, various text segmentation approaches have been proposed that divide texts into smaller units while preserving semantic coherence. Nevertheless, existing studies show that the effectiveness of segmentation depends on dataset characteristics and task requirements, and that no single approach performs best in all scenarios. This highlights the need for more adaptable methods. In response, this thesis proposes an adaptive, context-aware segmentation method and evaluates its efficiency and effectiveness in information retrieval. In the proposed method, documents are first split into sentences, and a dense vector representation is generated for each sentence. Semantic relationships between adjacent sentences are modeled with a cost function based on these vectors, and dynamic programming is used to determine the segment boundaries that minimize the global cost. The proposed method is compared with a baseline method within a retrieval pipeline, using different datasets and embedding strategies. The evaluation considers average segment length, segmentation time, and the retrieval metrics MRR, DCG, and nDCG. The experimental findings show that the proposed method performs well in terms of efficiency. Although the improvements are modest, consistent gains in retrieval effectiveness are also observed.
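The segmentation step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual cost function, so everything specific here is an assumption made for illustration: the cost of a segment is taken to be the sum of cosine dissimilarities between adjacent sentences inside it plus a fixed per-segment `penalty`, and `max_len` is a hypothetical cap on segment length. Sentence vectors are passed in as plain lists of floats; how they are produced (the embedding strategy) is outside the sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment(vectors, penalty=0.5, max_len=8):
    """Split sentences into contiguous segments that minimize a global cost.

    Assumed cost (illustrative, not taken from the thesis): each segment pays
    the summed dissimilarity (1 - cosine) of the adjacent sentence pairs it
    contains, plus a fixed `penalty`. Returns half-open (start, end) pairs.
    """
    n = len(vectors)
    if n == 0:
        return []

    # gap[k] is the dissimilarity between sentence k and sentence k + 1.
    gap = [1.0 - cosine(vectors[k], vectors[k + 1]) for k in range(n - 1)]

    # prefix[k] is the sum of gap[0..k-1], so any range sum is O(1).
    prefix = [0.0]
    for g in gap:
        prefix.append(prefix[-1] + g)

    def cost(i, j):
        # Segment covers sentences i..j-1; its internal gaps are gap[i..j-2].
        return (prefix[j - 1] - prefix[i]) + penalty

    # best[j] is the minimum total cost of segmenting the first j sentences.
    best = [0.0] + [math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            candidate = best[i] + cost(i, j)
            if candidate < best[j]:
                best[j] = candidate
                back[j] = i

    # Follow the back-pointers from the end to recover the boundaries.
    segments = []
    j = n
    while j > 0:
        segments.append((back[j], j))
        j = back[j]
    return segments[::-1]
```

Under these assumptions, a boundary is placed wherever splitting costs less than keeping two dissimilar sentences together, and the dynamic program guarantees that the chosen set of boundaries is optimal for the whole document rather than for each pair in isolation. For example, calling `segment` on four toy vectors where the first two and the last two are identical yields two segments of two sentences each.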

Keywords

Computer Engineering and Computer Science and Control

End Page

78

URI

https://hdl.handle.net/20.500.12416/16075
https://tez.yok.gov.tr/UlusalTezMerkezi/TezGoster?key=5T1_CZ5-UGb9QCmoURec4LCTlXM8hC22XipDt7vgzVyxHBvw4IRN_Iu3Jgsrzw94

Collections

Master's Theses
