
Search-Engine-For-Travel

The chosen problem is to build a search engine for the travel domain. The project was split into five phases: crawling, indexing, UI, clustering, and query expansion.

The crawler was written in Python using the ‘urllib’ and BeautifulSoup libraries. We also respected politeness while crawling by enforcing a 5-second delay whenever the crawler hit the same web server again. The first page the crawler fetched for each parent URL was ‘robots.txt’, which told us which of that site's pages we were permitted to crawl. We were able to crawl 113,839 documents from our list of seed URLs.
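A minimal sketch of the crawler's robots.txt check and politeness delay, assuming a simple seed-list loop; the helper names and structure here are illustrative, not the project's actual code.

```python
# Hedged sketch: robots.txt check plus a 5-second per-host politeness delay.
import time
import urllib.robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup

CRAWL_DELAY = 5  # seconds to wait before hitting the same web server again


def allowed_by_robots(url, user_agent="*"):
    """Fetch the host's robots.txt and check whether this URL may be crawled."""
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)


def crawl(seed_urls):
    last_hit = {}  # host -> timestamp of the last request to that host
    for url in seed_urls:
        if not allowed_by_robots(url):
            continue
        host = urlparse(url).netloc
        wait = CRAWL_DELAY - (time.time() - last_hit.get(host, 0))
        if wait > 0:
            time.sleep(wait)  # politeness: delay repeat requests to the same server
        html = urlopen(url).read()
        last_hit[host] = time.time()
        soup = BeautifulSoup(html, "html.parser")
        links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
        yield url, soup.get_text(), links
```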

The index has been built with Whoosh, a fast, pure-Python search engine library. The indexing of text, the level of information stored for each term in each field, the parsing of search queries, the types of queries allowed, the scoring algorithms, etc. are all customizable, replaceable, and extensible.

The crawler's output file (urllistDict.csv) serves as the web graph for building the inlink and outlink maps, and its data, together with the list of crawled documents, is passed to Whoosh's schema to create the index. urllistDict.csv contains the URL, docID, and parent ID list for each page; these are loaded into a URL map dictionary with the docID as key and the URL and parent ID list as values, which represents the web graph structure. The inlink map (docID as key, parent docIDs as value) and the outlink map (docID as key, child docIDs as value) are populated from this URL map.

Raw term frequency and Okapi BM25F are used as the weighting schemes when searching the index. Because HITS is a query-time algorithm, hub and authority scores are computed at run time; since the inlink and outlink maps were already created for PageRank, they are ready to use for HITS. Once the search returns the top documents from the index, those results form the root set for HITS, and the highest-ranking documents are sent as output to the UI.

The query entered in the user interface is sent as a request to the Search Index API, which searches the pre-built index, applies PageRank and the HITS algorithm, and returns the result as a ranked list of URLs; the API then passes this data to the UI. Multiple sets of results are sent to the UI, one per relevance model. After a query is submitted, the Search Index API also returns the array of ranked results to the clustering module.
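A minimal sketch of the Whoosh schema, indexing, and BM25F search described above; the field names, index directory, and the shape of the documents iterable are assumptions, not the project's actual code.

```python
# Hedged sketch: build a Whoosh index over crawled pages and search it with BM25F.
import os

from whoosh import index, scoring
from whoosh.fields import ID, TEXT, Schema
from whoosh.qparser import QueryParser

schema = Schema(
    docid=ID(stored=True, unique=True),
    url=ID(stored=True),
    content=TEXT(stored=True),
)


def build_index(documents, index_dir="travel_index"):
    """documents: iterable of (docid, url, text) tuples from the crawl."""
    os.makedirs(index_dir, exist_ok=True)
    ix = index.create_in(index_dir, schema)
    writer = ix.writer()
    for docid, url, text in documents:
        writer.add_document(docid=docid, url=url, content=text)
    writer.commit()
    return ix


def search(ix, query_text, limit=20):
    """Return the top (docid, url) pairs using Okapi BM25F scoring."""
    parser = QueryParser("content", schema=ix.schema)
    query = parser.parse(query_text)
    with ix.searcher(weighting=scoring.BM25F()) as searcher:
        hits = searcher.search(query, limit=limit)
        return [(hit["docid"], hit["url"]) for hit in hits]
```

Swapping `scoring.BM25F()` for `scoring.Frequency()` gives the raw-frequency weighting mentioned above. The query-time HITS step on the inlink/outlink maps could look like the following sketch, where the top search results act as the root set; variable names are illustrative.

```python
# Hedged sketch: iterative HITS over the inlink/outlink maps built from urllistDict.csv.
def hits(root_set, inlinks, outlinks, iterations=20):
    """inlinks/outlinks: dicts mapping docID -> list of docIDs."""
    # Expand the root set into a base set containing its in- and out-neighbours.
    base = set(root_set)
    for d in root_set:
        base.update(inlinks.get(d, []))
        base.update(outlinks.get(d, []))

    auth = {d: 1.0 for d in base}
    hub = {d: 1.0 for d in base}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages that link to d.
        auth = {d: sum(hub[p] for p in inlinks.get(d, []) if p in base) for d in base}
        # Hub score: sum of authority scores of pages that d links to.
        hub = {d: sum(auth[c] for c in outlinks.get(d, []) if c in base) for d in base}
        # Normalise so the scores do not grow without bound.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {d: v / a_norm for d, v in auth.items()}
        hub = {d: v / h_norm for d, v in hub.items()}
    return auth, hub
```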

Clustering takes the dynamically fetched ranked results, applies several clustering methods, and returns the relevant groups. In our observation, the clusters are sometimes the same as the given top results, but this differs from query to query. Flat clustering uses the KMeans algorithm from Python's scikit-learn package. We repeatedly ran the clustering model while varying the number of clusters K and continued until the WSS (within-cluster sum of squares) stabilized; this way of fixing the number of clusters is called the elbow method. Doing so, we fixed K at 15.
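A minimal sketch of the flat clustering and elbow-method steps, assuming the ranked results have already been turned into text snippets; the vectorizer and parameter choices are illustrative, not the project's exact settings.

```python
# Hedged sketch: KMeans flat clustering of ranked results, plus the WSS values
# used by the elbow method to pick K.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_results(result_texts, k=15):
    """Cluster the ranked result snippets into k groups (K=15 per the elbow method)."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(result_texts)
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    return model.labels_


def elbow_curve(result_texts, k_range=range(2, 30)):
    """WSS (inertia) for each candidate K; the point where it stabilizes fixes K."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(result_texts)
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors).inertia_
        for k in k_range
    }
```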

The user interface prompts the user to enter a search query. The entire application is built on the Model View Controller (MVC) architecture. The front end is styled with Bootstrap CSS drawn from publicly available style sheets. The application is a single-page application with dynamic content, all of which is served by a single Django (WSGI) server. A controller handles the search query term and invokes the appropriate services; the API calls are responsible for sending queries to the backend and retrieving the search results.
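A hedged sketch of how such a Django controller could hand the query to the search backend; the module path, function name, and JSON shape are assumptions for illustration only.

```python
# Hedged sketch: a Django view that forwards the UI query to the search backend.
from django.http import JsonResponse

from searchindex.api import search_index  # hypothetical backend search function


def search_view(request):
    """Controller: take the query term from the UI and return ranked URLs as JSON."""
    query = request.GET.get("q", "")
    if not query:
        return JsonResponse({"results": []})
    ranked_urls = search_index(query)  # indexing, PageRank, and HITS run server-side
    return JsonResponse({"query": query, "results": ranked_urls})
```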

The query expansion techniques on which relevance feedback was tested are listed below; a sketch of the Rocchio update follows the list.

  1. Rocchio Algorithm
  2. Metric Cluster
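A minimal sketch of the standard Rocchio relevance-feedback update on TF-IDF query vectors; the alpha/beta/gamma weights are conventional defaults, not the project's tuned values.

```python
# Hedged sketch: Rocchio update moving the query toward relevant documents and
# away from non-relevant ones.
import numpy as np


def rocchio(query_vec, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """query_vec: 1-D term-weight vector; *_docs: 2-D arrays of document vectors."""
    q_new = alpha * query_vec
    if len(relevant_docs):
        q_new += beta * np.mean(relevant_docs, axis=0)
    if len(nonrelevant_docs):
        q_new -= gamma * np.mean(nonrelevant_docs, axis=0)
    return np.maximum(q_new, 0)  # keep expanded term weights non-negative
```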
