A high-performance text indexing engine for searching large documents or corpora, implemented in Rust and inspired by the C++ PISA text search engine.
-
The following workflow is inspired by the PISA index-building pipeline (Mallia et al., 2019):
-
Collection Processing
- Load documents
- Extract contents
- Tokenize
- Filter (Stemming + Stopword removal)
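
The tokenize and filter steps above could look roughly like the sketch below. The function names (`tokenize`, `toy_stem`, `filter_tokens`) are hypothetical and are not taken from this project or from PISA. The "stemmer" only strips a trailing "s" and is a placeholder for a real stemming algorithm such as Porter's.

```rust
use std::collections::HashSet;

/// Splits text on non-alphanumeric characters and lowercases each token.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Placeholder stemmer (assumption): drops a single trailing "s" from
/// tokens longer than three bytes, leaving words ending in "ss" alone.
fn toy_stem(token: &str) -> String {
    if token.len() > 3 && token.ends_with('s') && !token.ends_with("ss") {
        token[..token.len() - 1].to_string()
    } else {
        token.to_string()
    }
}

/// Removes stopwords first, then stems whatever is left.
fn filter_tokens(tokens: Vec<String>, stopwords: &HashSet<&str>) -> Vec<String> {
    tokens
        .into_iter()
        .filter(|t| !stopwords.contains(t.as_str()))
        .map(|t| toy_stem(&t))
        .collect()
}
```

With the stopword set `{"the", "a"}`, the input `"The cats chase a mouse!"` is reduced to `cat`, `chase`, `mouse`.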
-
Forward Index
- Term Lexicon
- Document Lexicon
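
As an illustration of how a forward index ties the two lexicons together, here is a minimal in-memory sketch. The `ForwardIndex` type and its fields are assumptions made for this example. In particular, term IDs are handed out in first-seen order here, which may differ from how this project or PISA assigns them.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory forward index.
#[derive(Default)]
struct ForwardIndex {
    /// Term lexicon: term -> term ID.
    term_lexicon: HashMap<String, u32>,
    /// Reverse mapping: term ID -> term.
    terms: Vec<String>,
    /// Document lexicon: document ID -> document title.
    doc_lexicon: Vec<String>,
    /// Document ID -> sequence of term IDs in document order.
    docs: Vec<Vec<u32>>,
}

impl ForwardIndex {
    /// Adds one tokenized document and returns its document ID.
    fn add_document(&mut self, title: &str, tokens: &[&str]) -> u32 {
        let doc_id = self.docs.len() as u32;
        let mut ids = Vec::with_capacity(tokens.len());
        for &tok in tokens {
            let id = match self.term_lexicon.get(tok) {
                Some(&id) => id,
                None => {
                    // Unseen term: assign the next free term ID.
                    let id = self.terms.len() as u32;
                    self.term_lexicon.insert(tok.to_string(), id);
                    self.terms.push(tok.to_string());
                    id
                }
            };
            ids.push(id);
        }
        self.doc_lexicon.push(title.to_string());
        self.docs.push(ids);
        doc_id
    }
}
```

Storing documents as sequences of integer term IDs keeps the later inversion step independent of the original strings.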
-
Inverted Index
- Document reordering
- Compression
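
The core of this stage is inverting the forward index into per-term posting lists. The sketch below shows only that inversion, under the assumption that a posting is a `(document ID, term frequency)` pair. The `invert` function is a hypothetical name, and document reordering and compression are left out.

```rust
use std::collections::BTreeMap;

/// Inverts a forward index (document ID -> term IDs) into
/// term ID -> list of (document ID, term frequency) postings.
/// Documents are visited in ID order, so each posting list comes out
/// sorted by document ID.
fn invert(forward: &[Vec<u32>]) -> BTreeMap<u32, Vec<(u32, u32)>> {
    let mut inverted: BTreeMap<u32, Vec<(u32, u32)>> = BTreeMap::new();
    for (doc_id, term_ids) in forward.iter().enumerate() {
        let doc_id = doc_id as u32;
        for &term in term_ids {
            let postings = inverted.entry(term).or_default();
            // Repeated term in the same document: bump its frequency.
            if let Some(last) = postings.last_mut() {
                if last.0 == doc_id {
                    last.1 += 1;
                    continue;
                }
            }
            postings.push((doc_id, 1));
        }
    }
    inverted
}
```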
-
Index Compression
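
One common way to compress a sorted posting list is to store the gaps between consecutive document IDs using a variable-byte code. The sketch below illustrates that general technique; it is not necessarily the codec this project uses. The function names are hypothetical, and the code assumes the document IDs are sorted in increasing order and that the decoder receives well-formed input.

```rust
/// Encodes sorted document IDs as variable-byte gaps.
/// Each byte carries 7 payload bits, low bits first; a set high bit
/// means that more bytes follow for the same gap.
fn encode_postings(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u32;
    for &id in doc_ids {
        let mut gap = id - prev; // assumes increasing IDs
        prev = id;
        while gap >= 0x80 {
            out.push(((gap & 0x7F) as u8) | 0x80);
            gap >>= 7;
        }
        out.push(gap as u8);
    }
    out
}

/// Decodes the output of `encode_postings` back into document IDs.
fn decode_postings(bytes: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    let (mut prev, mut gap, mut shift) = (0u32, 0u32, 0u32);
    for &b in bytes {
        gap |= ((b & 0x7F) as u32) << shift;
        if b & 0x80 == 0 {
            // Last byte of this gap: rebuild the absolute ID.
            prev += gap;
            out.push(prev);
            gap = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    out
}
```

For example, the IDs `[3, 7, 300, 301]` become the gaps `3, 4, 293, 1`, which take five bytes instead of the sixteen needed for four raw `u32` values.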
-
Query Processing
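
As one example of query processing, a conjunctive (AND) query can be answered by intersecting the posting lists of its terms. The sketch below assumes each posting list is a slice of document IDs sorted in increasing order. `intersect` and `and_query` are hypothetical names, and ranking is not covered.

```rust
use std::cmp::Ordering;

/// Intersects two sorted lists of document IDs with a two-pointer merge.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

/// Returns the documents that appear in every posting list.
/// Lists are processed shortest first so intermediate results stay small.
/// An empty query returns no documents.
fn and_query(lists: &[&[u32]]) -> Vec<u32> {
    let mut sorted: Vec<&[u32]> = lists.to_vec();
    sorted.sort_by_key(|l| l.len());
    let mut iter = sorted.into_iter();
    let mut result = match iter.next() {
        Some(first) => first.to_vec(),
        None => return Vec::new(),
    };
    for list in iter {
        result = intersect(&result, list);
    }
    result
}
```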