Elasticsearch-based Search Engine

Salient features

Scraping
- Scraped ~7000 documents with BeautifulSoup, starting from the seed page https://en.wikipedia.org/wiki/Science_fiction_film
- Customizable depth
- Duplicate detection
- Saved in .json format with paragraphs, table of contents, url and title as fields
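
The crawl described above can be sketched roughly as follows. This is an illustration, not the code in web_scraping: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra dependencies, the `fetch` callable and the `crawl` and `LinkParser` names are hypothetical, and hashing the page content is only one possible way to detect duplicates. Extraction of paragraphs, table of contents and title is left out.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, fetch, max_depth=1):
    """Breadth-first crawl from `seed`, following links up to `max_depth` hops.

    `fetch` is a hypothetical callable that maps a URL to its HTML text.
    Pages are skipped if their URL or their exact content was already seen.
    """
    seen_urls, seen_hashes, pages = {seed}, set(), []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # same content reached through a different URL
        seen_hashes.add(digest)
        pages.append({"url": url, "html": html})
        if depth == max_depth:
            continue  # depth limit reached, do not follow links further
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts
            link = urldefrag(urljoin(url, href)).url
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append((link, depth + 1))
    return pages
```

Passing `fetch` in as a parameter keeps the traversal testable offline; a real run would pass a function that downloads the page over HTTP.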
Tokenization
- Standard tokenizer
- Token filters: stop, lowercase, snowball stemmer
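
An index body for an analyzer like the one above might look like the sketch below. The names `sf_analyzer` and `english_snowball`, the choice of English for the stemmer, and the filter order (lowercase before stop) are assumptions; the field names come from the scraped .json fields.

```python
# Analysis settings: standard tokenizer plus lowercase, stop and snowball filters.
# "sf_analyzer" and "english_snowball" are made-up names for this sketch.
ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            "english_snowball": {"type": "snowball", "language": "English"}
        },
        "analyzer": {
            "sf_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                # Assumed order: lowercase first so stop words match regardless of case
                "filter": ["lowercase", "stop", "english_snowball"],
            }
        },
    }
}


def text_field(analyzer="sf_analyzer"):
    """Mapping for a full-text field that uses the custom analyzer."""
    return {"type": "text", "analyzer": analyzer}


# Body that could be sent when creating the index
INDEX_BODY = {
    "settings": ANALYSIS_SETTINGS,
    "mappings": {
        "properties": {
            "title": text_field(),
            "paragraphs": text_field(),
            "url": {"type": "keyword"},
        }
    },
}
```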
Support for BM25 and Jelinek-Mercer Language Model
Retrieval of the top k relevant documents in ranked order
Support for conjunctive and disjunctive queries
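
As a rough sketch of how the model, the query mode and k could map onto Elasticsearch request bodies: the function names, the parameter values (k1, b, lambda) and the choice of the paragraphs field are assumptions, not taken from this repository.

```python
# Similarity definitions for the two ranking models; parameter values are
# illustrative defaults, not values used by this project.
SIMILARITIES = {
    "bm25": {"type": "BM25", "k1": 1.2, "b": 0.75},
    "lmjm": {"type": "LMJelinekMercer", "lambda": 0.1},
}


def similarity_settings(model):
    """Index settings that make `model` the default similarity for the index."""
    return {"index": {"similarity": {"default": SIMILARITIES[model]}}}


def search_body(text, mode="disjunctive", k=10, field="paragraphs"):
    """Search body returning the top `k` hits, ranked by score.

    "conjunctive" requires every query term to match (operator "and");
    "disjunctive" accepts any matching term (operator "or").
    """
    operator = {"conjunctive": "and", "disjunctive": "or"}[mode]
    return {
        "size": k,
        "query": {"match": {field: {"query": text, "operator": operator}}},
    }
```

Because similarity is part of the index settings, one way to switch models at query time would be to keep one index per model and pick the index name per request; that design is an assumption of this sketch.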
User interface with the following features
- Dropdown keyword suggestions based on Levenshtein distance, using fuzzy search
- Snippets that display the most relevant fragments, built using the unified highlighter
- Interface to switch between the models and modes as required by the user
- Displaying results as clickable links for better access
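
The request bodies behind these features could look something like the sketch below. The function names, the fields queried, and the fragment settings are assumptions; only the use of fuzzy matching and the unified highlighter comes from the list above.

```python
def suggestion_body(word, field="title", size=5):
    """Fuzzy match on `field`, tolerating small edit distances in `word`."""
    return {
        "size": size,
        "_source": [field],
        "query": {"match": {field: {"query": word, "fuzziness": "AUTO"}}},
    }


def highlighted_search_body(text, field="paragraphs", k=10):
    """Search body that also asks for snippet fragments from the unified highlighter."""
    return {
        "size": k,
        "query": {"match": {field: text}},
        "highlight": {
            "fields": {
                field: {
                    "type": "unified",
                    "fragment_size": 150,       # assumed snippet length
                    "number_of_fragments": 3,   # assumed snippets per hit
                }
            }
        },
    }


def to_links(response, field="paragraphs"):
    """Turn a search response into (title, url, snippets) tuples for rendering."""
    results = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        snippets = hit.get("highlight", {}).get(field, [])
        results.append((source["title"], source["url"], snippets))
    return results
```

A template can then render each tuple as a link whose text is the title and whose target is the url, with the snippets shown underneath.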

python3 run.py
