GitHub - FOUZANKHAN/SEARCHENGINE-FINAL: Project Milestone 3

FOUZANKHAN / SEARCHENGINE-FINAL Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Project Milestone 3 - works for relevancefiles only

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
MobyDick		MobyDick
PICTURES(DEF vs DEF_v)		PICTURES(DEF vs DEF_v)
cranfieldfiles		cranfieldfiles
documents		documents
indexing		indexing
jsonfiles		jsonfiles
jsonsample		jsonsample
queries		queries
relevancefiles		relevancefiles
text		text
Projectmilestone2v2.py		Projectmilestone2v2.py
Projectmilestone3.py		Projectmilestone3.py
README		README
Search Engine Milestone 3-1.pdf		Search Engine Milestone 3-1.pdf
bytepositions.db		bytepositions.db
diskindexwriter.py		diskindexwriter.py
instructor.txt		instructor.txt
invertedindexer.py		invertedindexer.py
merger_postings.py		merger_postings.py
milestone2.py		milestone2.py
positionalinvind.py		positionalinvind.py
randomfile.py		randomfile.py
ranked_strategy.py		ranked_strategy.py
termdocumentindexer.py		termdocumentindexer.py
variant_prec.py		variant_prec.py
variant_rel.py		variant_rel.py

Repository files navigation

Developed an advanced search engine tailored for U.S. National Parks, leveraging SQLite3 database and sophisticated retrieval strategies.

Indexed and stored 36,800 documents with optimized retrieval time of O(1) using SQLite3 database, ensuring efficient access to park information.

Implemented support for boolean and phrase queries to provide precise search results, enhancing user experience and search accuracy.

Evaluated the search engine's efficiency, achieving a Mean Average Precision (MAP) of 0.258, indicating the relevance of search results, and a Mean Response Time of 132.79 milliseconds, ensuring fast response times for user queries.

Project Milestone 1
what it does?
Performs positional inverted indexing on 36000 documents of JSON files.
Positional inverted index is a datastructure to store each and every doc content in an efficient way along with it's positions
Takes about 3 ~4 mins to index all the documents.
What else can it perform?
It can perform query searches /Boolean style
1. If you want to find a document with a combine term such as "national park"
> It will give all the document with all the documents that have national park in that order
2. If the document contains term x and term y
> the query would be Termx<space>TermY
3. If the document contains termx or termy
> TermX<+>termy
4. If the document cpntains termx at a distance of two words
for eg "New York is where I shouldn't go"
New is at a distance of 1 word from is
> [New NEAR/2 is] //2 here means at a position +1

Project Milestone 2
Disk Indexing
Main file is Projectmilestone2v2
We performed disk indexing i.e saving the index to the disk.
stored it in gaps.
Format of the disk stored index is
<dft,docid,tftd,p1,p2,p3,----pn>
Dft = Documents that contain the term
docid = number of the doc or doc id
tftd = number of times that term is in the document
p1 = first position and so on.
secondly, the documents and positions are saved as gaps i.e doc 3, doc 5 have the term then it is saved as, 3,5-3,x-(5-3) so on, same for positions.
It also does ranked retrivals for both phrase and boolean queries
default, tf-idf, okapi bm25, wacky strategies.

Project Milestone 3
Calculates MEAN Average Precision, Mean Response Time and Throughput for all the different ranking strategies. It also performs vocabulary elimination and elimates
the documents with low wqt(wieght of query terms) designated a threshold.

DIFFERENT THRESHOLDS COMPARISON FOR DEFAULT
------------------------------------------------------
Threshold = 0.8
MAP = 0.203
Throughput = 63.109
MRT = 0.015
------------------------------------------------------
Threshold = 1.0
MAP = 0.181
Throughput = 113.715
MRT = 0.00879
------------------------------------------------------
Threshold = 1.4(SELECTED THIS FOR HIGHEST MAP and Throughput)
MAP = 0.258
Throughput = 132.79
MRT = 0.0075