- Scraping
  - Scraped ~7000 documents with `BeautifulSoup`, using `https://en.wikipedia.org/wiki/Science_fiction_film` as the seed
  - Customizable depth
  - Duplicate detection
  - Saved in `.json` format with `paragraphs`, `table of contents`, `url` and `title` as fields
- Tokenization
  - Standard tokenizer
  - Token filters: `stop`, `lowercase`, `snowball stemmer`
- Support for `BM25` and the `Jelinek-Mercer` language model
- Retrieval of the top `k` relevant documents in ranked order
- Support for `conjunctive` and `disjunctive` queries
- User interface with the following features
  - Dropdown keyword suggestions based on Levenshtein distance using fuzzy search
  - Snippets that display the most relevant fragments, built using the `unified highlighter`
  - Interface to switch between the models and modes as per the user's requirements
  - Results displayed as clickable links for easier access
python3 run.py