Crawling the web using Apache Nutch and Lucene

Abdulwahid, Nibras

dc.contributor.author	Abdulwahid, Nibras
dc.date.accessioned	2016-08-11T08:09:37Z
dc.date.available	2016-08-11T08:09:37Z
dc.date.issued	2014
dc.description.abstract	The large quantity of information available on the Web makes it difficult for users to select the resources that meet their information needs. The search engine is the link between internet users and this information. A search engine is a kind of Information Retrieval (IR) system. It collects data from the Web by means of a software program called a crawler, bot, or spider. Most search engine users do not know how a search engine operates: how it works, how it captures information on the Web, and how it ranks the results shown to users. For this reason, this thesis examines an open-source search engine in detail. In this study, we used Apache Nutch and Apache Lucene to explain how open-source web crawling works. Both are released by the Apache Software Foundation. Nutch is a web search engine that searches and indexes web pages from the World Wide Web (WWW). Nutch is built on top of Lucene and uses information retrieval technology. It provides software libraries for indexing large volumes of data. Lucene is indifferent to the format of the information found on the Web, such as PDF, plain text, or MS Word; it indexes these documents and converts them into data that can be used. The benefit of using both Nutch and Lucene in this study is that they are free and open to further development. Both Nutch and Lucene are written in the Java programming language. Furthermore, we used Tag Cloud technology to analyze and view the Lucene content and its index.	en_US
dc.description.abstract	The large amount of information on the web makes it difficult for users to select the information they need. Search engines are the link between this information and internet users. Search engines work on collections of data from the web through software called crawlers, bots, or spiders. Many search engine users do not know how search engines operate: for example, how a search engine works, how it captures information on the web, or how it ranks that information. This study examines in detail how open-source search engines work. Apache Nutch and Lucene were each used to explain open-source web crawler programs. Both are released by the Apache Software Foundation. Nutch is a web crawler that can index the World Wide Web. Nutch is built on the Lucene architecture and uses information retrieval technologies. Many software libraries are available for indexing large volumes of data. Lucene is not concerned with the format of information on the web, such as PDF, TEXT, or MS Word. It indexes these documents and converts them into a usable form. The benefit of using Nutch and Lucene together in this study is that, besides being independent of each other, both are developed in Java. In addition, Tag Cloud technology was used to view and analyze the Lucene content or its index.	en_US
dc.identifier.citation	ABDULWAHID, N. (2014). Crawling the web using Apache Nutch and Lucene. Unpublished master's thesis. Ankara: Çankaya Üniversitesi Fen Bilimleri Enstitüsü	en_US
dc.identifier.uri	https://hdl.handle.net/20.500.12416/1220
dc.language.iso	en	en_US
dc.publisher	Çankaya Üniversitesi	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Open Source Web Search Engine	en_US
dc.subject	Tag Cloud	en_US
dc.subject	Web Crawling	en_US
dc.subject	Açık Kaynak Kodlu Web Arama Motoru	en_US
dc.subject	Apache Nutch	en_US
dc.subject	Apache Lucene	en_US
dc.title	Crawling the web using Apache Nutch and Lucene	tr_TR
dc.title	Crawling the Web Using Apache Nutch and Lucene	en_US
dc.title.alternative	Apache Nutch ve Lucene kullanarak web tarama	en_US
dc.type	Master Thesis	en_US
dspace.entity.type	Publication
gdc.coar.access	open access
gdc.coar.type	text::thesis::master thesis
gdc.description.department	Çankaya Üniversitesi, Fen Bilimleri Enstitüsü, Matematik Bilgisayar Bölümü	en_US
gdc.publishedmonth	7
relation.isOrgUnitOfPublication.latestForDiscovery	0b9123e4-4136-493b-9ffd-be856af2cdb1

Files

Original bundle

Name: Abdulwahid, Nibras.pdf
Size: 5.53 MB
Format: Adobe Portable Document Format
Description: Watermarked PDF

Collections

Yüksek Lisans Tezleri