PISA v1 Proposal

Introduction

The purpose of this document is to propose a high-level architecture and format of PISA index and related structures.

Goals

This section describes the desired outcomes of the proposed design, in no particular order.

General

Performance

We aim to preserve the high-performance nature of the project. Although some concessions might be made to facilitate new functionality, the performance when not using those should generally not decline. However, in certain cases, we may still decide to sacrifice some performance if it improves maintainability or user experience, though these would have to be consider on a case-by-case basis, and should be avoided if possible.

Functionality

We want to preserve the existing functionality unless we deem some of it unnecessary. For example, we want to maintain the support for a range of different retrieval algorithms, index compression techniques, etc. It is acceptable that some new functionality excludes some of these or otherwise compromises performance for them, e.g., if we introduce index updates, it may be less efficient to do it for quantized indexes, but the current functionality of the quantized indexes should be maintained.

Easy by default, powerful if needed

We want to make it easier to use by default, but still powerful and flexible if needed. For example, we want to introduce abstractions to manage an index without having to manually keep track of all individual files (index, lexicon, metadata, etc.), but we still want to maintain low-level access to those if needed for a power user.

Composability

Our code should facilitate building custom systems. A single component, e.g., inverted index or lexicon, should be decoupled enough to use on its own.

Extensibility

The project should be easy to extend with new functionality, which is especially important for academic purposes.

Safety

We currently suffer from many segfaults whenever someone makes a mistake, e.g., passes the wrong encoding value when running queries. This should be avoided by encoding these parameters in the index itself, so that checks can be made and we can gracefully shut down if two components are not compatible.

Stability

The v1 API should be stable and backwards-compatible. Thus, we should also clearly delineate which parts of the code are part of the public API, e.g., by moving any "private" template code to a separate "private" namespace (such as pisa::details). This should be clearly documented.

Documentation

We should clearly document all the components and public-facing APIs and commands.

New Functionality

This is a list of new functionality we would like to support in the future. Note that this does not mean these have to be implemented for v1, but rather we should have them in mind when designing.

Index updates

We want a facility to update an existing index. This will likely mean we want to implement two functionality: index merging and querying multiple indexes at a time. Additionally, we may want to implement an in-memory dynamic index for new documents before they are written to the disk in the usual format. Similarly, we can think of document deletion, which most likely involves marking documents as deleted in the existing index, and "committing" the deletes by rebuilding the index structure without the marked documents. Furthermore, document modifications can be implemented by adding a new document and marking the old one for deletion.

Building index from collection

This may be at least partially covered by index updates, or at least is related to it. A user that is not interested in index details should be able to feed documents to PISA and get all necessary structures instead of manually going through multiple stages.

Unification of compressed and uncompressed index formats

We should be able to index documents straight to the compressed format, and use the compressed format in place of the uncompressed one when doing things like computing max-scores. We do not want to take away the ability to first invert and then use uncompressed index for operations like reordering documents, but rather make it an optional step.

Server mode

We want the ability to run PISA as a server and expose APIs for index operations. We can aim at compatibility between REST API and CLI, but the server would allow us to do some additional in-memory caching, and delay writing some things to disk when doing many updates.

Faceted search

Searching across multiple indexed "fields". This a down-the-road nice-to-have, but we should at least make our architecture facilitate it for the future or 3rd party implementations.

Disabling features at compile time

Although we want to cover many compression and retrieval algorithms, this comes at a high cost during compilation. For many (most?) use cases, however, a user wants a single encoding. Thus, it would be very beneficial if we could disable some of them at compile time, reducing compilation time and binary size.

Reranking

Multi-stage ranking is something we want to address in the future. Therefore, we should make sure that we facilitate easy and efficient feature calculation.

Dense retrieval

Dense retrieval will require separate type of index and therefore will have little impact on the design of the index itself, but we should consider it when designing high-level functionality.

Definitions

This is a work-in-progress list of definitions, subject to change

Collection

Collection represents any data created from a given corpus. This includes any fields (at first, only one), shards, lexicons, and any other related data structures.

Forward Index

Stores document data, including non-indexed payload.

❓ Should forward index be sharded together with the inverted index? This makes little difference when all resides on the same machine, but impacts distributed systems where shards are on different machines. Note that some of the document-indexed data must be available at query time for scoring, so does it make sense to make forward index part of the overall "Index" structure?

Index

Index stores searchable text data. A single index is parameterized by its encoding format and other properties like whethe it is quantized or not.

Shard (Index segment)

A single static inverted index structure, along with auxiliary data, such as document lengths, max scores (or frequencies), etc.

❓ Should we follow the existing naming convention and call it "index segment" as opposed to "shard"? This may be a good idea as "shard" could be regarded as a set of multiple index segments for purposes of distributing over multiple machines.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly