One of my university coursework assignments was to develop a working search engine consisting of modules such as a text analyzer, stemmer, tokenizer, web crawler and so on. I'm not 100% satisfied with the result: I learned a great deal while building it, but was too lazy to go back and correct or improve my design. Anyway, you can find the documentation, executables and full source in this article.
EagleEye was developed for the ENG554 course at the University of Portsmouth by Raivis Strogonovs. The task of the course was to develop a working search engine with five basic modules: tokenizer, stemmer, inverted file builder, web crawler and GUI. The search engine more or less follows the book "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.
It incorporates the following features:
- Web crawler (multi-threaded)
- Inverted index builder (multi-threaded)
- Searching with custom VSM scoring
- Text Analyzer - tokenizer and stemmer
- Spell Checker
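To make the feature list a bit more concrete, here is a minimal Python sketch of how a tokenizer, an inverted index and vector space model (VSM) scoring fit together. It is not EagleEye's actual code: the function names (tokenize, build_index, search) are hypothetical, stemming is left out, and the weighting is the textbook log tf-idf with document length normalisation rather than the custom scoring EagleEye uses.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, n_docs):
    """Rank documents by tf-idf dot product, normalised by document length.

    The query vector length is ignored because it is the same for every
    document and therefore does not change the ranking.
    """
    def idf(term):
        return math.log(n_docs / len(index[term]))

    def weight(tf, term):
        return (1 + math.log(tf)) * idf(term)

    # Squared length of every document vector.
    norms = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            norms[doc_id] += weight(tf, term) ** 2

    # Accumulate the dot product between the query and each document.
    scores = defaultdict(float)
    for term, query_tf in Counter(tokenize(query)).items():
        if term not in index:
            continue
        for doc_id, tf in index[term].items():
            scores[doc_id] += weight(query_tf, term) * weight(tf, term)

    ranked = [
        (doc_id, score / math.sqrt(norms[doc_id]))
        for doc_id, score in scores.items()
        if norms[doc_id] > 0
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Calling build_index on a dictionary of page texts and then search("eagles fly", index, len(docs)) returns (doc_id, score) pairs with the best match first. Documents that share no term with the query are not returned at all.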
Most people should be able to use EagleEye without reading the operation manual. To begin using the search engine, first press the "Start Eagling" button, which starts the web crawler and indexer threads.
If you want to configure the indexer and web crawler, go to "Tools/Options" before you press "Start Eagling". There you can set the following:
- Indexer Threads
- Crawler Threads
- Crawler Timeout in ms - basically the delay between crawling the same host
- Crawler depth - limit how many child links can be accessed
- Seeds - the initial websites from which the crawlers start crawling
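As a rough illustration of how these options could drive a crawler, here is a small single-threaded Python sketch. Everything in it is an assumption made for the example: the CrawlerOptions class, the crawl function and the fetch_links callback are hypothetical names, "depth" is interpreted as the number of link hops from a seed, and the thread counts are only stored, not used.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlerOptions:
    indexer_threads: int = 2   # stored only; this sketch is single-threaded
    crawler_threads: int = 4   # stored only; this sketch is single-threaded
    timeout_ms: int = 1000     # delay between two requests to the same host
    depth: int = 2             # assumed meaning: link hops allowed from a seed
    seeds: list = field(default_factory=list)


def crawl(options, fetch_links, clock=time.monotonic, sleep=time.sleep):
    """Breadth-first crawl that honours the depth limit and per-host delay.

    fetch_links(url) is a caller-supplied function returning the links
    found on a page, so the sketch itself never touches the network.
    """
    frontier = deque((url, 0) for url in options.seeds)
    seen = set(options.seeds)
    last_hit = {}   # host -> time of the most recent request
    visited = []

    while frontier:
        url, depth = frontier.popleft()
        host = urlparse(url).netloc

        # Wait until the per-host delay has passed since the last request.
        earliest = last_hit.get(host, float("-inf")) + options.timeout_ms / 1000
        wait = earliest - clock()
        if wait > 0:
            sleep(wait)
        last_hit[host] = clock()
        visited.append(url)

        # Pages at the depth limit are visited, but their links are not followed.
        if depth >= options.depth:
            continue
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited
```

Passing fetch_links in as a parameter keeps the example testable with a fake "web" held in a dictionary. A real crawler would download and parse each page there, and would spread the frontier across the configured number of threads.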
After you have configured the options and crawled a small portion of the World Wide Web, press the "Learn English" button so that EagleEye can learn English spelling. Once it has finished learning, it notifies you in the search bar.
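The article does not describe how the spell checker learns, so the following Python sketch is only one plausible approach, not EagleEye's implementation: count word frequencies in a body of text, then correct a misspelled word to the most frequent known word within one edit. The names learn, edits1 and correct are hypothetical.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def learn(texts):
    """Count how often each word occurs in the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def edits1(word):
    """Return every string that is one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + letter + right[1:]
        for left, right in splits
        if right
        for letter in LETTERS
    ]
    inserts = [left + letter + right for left, right in splits for letter in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return word if it is known, else the most frequent known word one edit away."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing better is known, so leave the word unchanged
    return max(candidates, key=counts.__getitem__)
```

With counts = learn(pages), a query word such as "eagel" would be corrected to "eagle", provided "eagle" appeared in the learned text. The quality of the suggestions depends entirely on how much text was learned from.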
At this stage you are ready to do some searching: just enter your query in the search bar and press Enter or the "Fly!" button. This opens a new panel that displays the search results, if there are any. If there are more than 10 results, they are split into pages. Once you have done your first search you can't return to the home panel, but you can carry on searching from the results panel. It works the same way as the home panel: enter your query in the search bar and press Enter or "Fly!".
If for some reason you want to recrawl the World Wide Web, you can do so as many times as you wish: just press the "Start Eagling" button.
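Splitting results into pages of 10 comes down to simple slicing. A minimal Python sketch, with hypothetical names (PAGE_SIZE, page_count, get_page) and 1-based page numbers as an assumption:

```python
import math

PAGE_SIZE = 10  # results shown per page, as described above


def page_count(n_results, page_size=PAGE_SIZE):
    """Return how many pages are needed for n_results results."""
    return math.ceil(n_results / page_size)


def get_page(results, page, page_size=PAGE_SIZE):
    """Return the results for the given 1-based page number."""
    start = (page - 1) * page_size
    return results[start:start + page_size]
```

For 23 results this gives three pages, the last of which holds the remaining three results.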
- The web documents are kept in RAM, and there is no check on how much memory the search engine uses. Windows has a limit of 1 GB per application; if EagleEye exceeds that, it will crash. On Linux and Mac the limit may vary.