Semantic indexing tool for PDF repositories
This project is motivated by existing text search tools such as Agent Ransack (FileLocator), which can quickly run keyword searches across thousands of files.
- Use PyMuPDF for faster file reads
- Add support for text file reading
- Implement custom chunking algorithm
- Switch to approximate NN search (pynndescent) for fast queries
- Add BM25 search as a complement to semantic search (in progress)
- Test multiprocessing for faster PDF reads (in progress)
- Add a GUI
- Create Windows executable
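
For readers unfamiliar with the BM25 item in the list above, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is an illustration only, not this project's implementation: the function names `tokenize` and `bm25_scores`, the whitespace tokenizer, and the defaults `k1=1.5` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


if __name__ == "__main__":
    chunks = [
        "semantic search over pdf files",
        "keyword search with bm25 ranking",
        "windows executable packaging",
    ]
    for chunk, score in zip(chunks, bm25_scores("bm25 keyword", chunks)):
        print(f"{score:.3f}  {chunk}")
```

Because BM25 rewards exact term matches, it can surface chunks that an embedding-based search ranks low, such as rare identifiers or part numbers. How the two rankings are merged is a separate design choice that this sketch does not cover.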