PISA v1 Proposal
The purpose of this document is to propose a high-level architecture and format for the PISA index and related structures.
This section describes the desired outcomes of the proposed design, in no particular order.
We aim to preserve the high-performance nature of the project. Although some concessions might be made to facilitate new functionality, the performance when not using those features should generally not decline. However, in certain cases, we may still decide to sacrifice some performance if it improves maintainability or user experience, though these would have to be considered on a case-by-case basis, and should be avoided if possible.
We want to preserve the existing functionality unless we deem some of it unnecessary. For example, we want to maintain the support for a range of different retrieval algorithms, index compression techniques, etc. It is acceptable that some new functionality excludes some of these or otherwise compromises performance for them, e.g., if we introduce index updates, it may be less efficient to do it for quantized indexes, but the current functionality of the quantized indexes should be maintained.
We want to make it easier to use by default, but still powerful and flexible if needed. For example, we want to introduce abstractions to manage an index without having to manually keep track of all individual files (index, lexicon, metadata, etc.), but we still want to maintain low-level access to those if needed for a power user.
Our code should facilitate building custom systems. A single component, e.g., inverted index or lexicon, should be decoupled enough to use on its own.
The project should be easy to extend with new functionality, which is especially important for academic purposes.
We currently suffer from many segfaults whenever someone makes a mistake, e.g., passes the wrong encoding value when running queries. This should be avoided by encoding these parameters in the index itself, so that compatibility checks can be performed and we can fail gracefully if two components are not compatible.
The v1 API should be stable and backwards-compatible.
Thus, we should also clearly delineate which parts of the code are part of the public API, e.g., by moving any "private" template code to a separate "private" namespace (such as pisa::details). This should be clearly documented.
We should clearly document all the components and public-facing APIs and commands.
This is a list of new functionality we would like to support in the future. Note that this does not mean these have to be implemented for v1, but rather we should have them in mind when designing.
We want a facility to update an existing index. This will likely mean implementing two pieces of functionality: index merging and querying multiple indexes at a time. Additionally, we may want to implement an in-memory dynamic index for new documents before they are written to disk in the usual format. Similarly, we can think of document deletion, which most likely involves marking documents as deleted in the existing index, and "committing" the deletes by rebuilding the index structure without the marked documents. Furthermore, document modifications can be implemented by adding a new document and marking the old one for deletion.
This may be at least partially covered by index updates, or is at least closely related to them. A user that is not interested in index details should be able to feed documents to PISA and get all necessary structures, instead of manually going through multiple stages.
We should be able to index documents straight to the compressed format, and use the compressed format in place of the uncompressed one when doing things like computing max-scores. We do not want to take away the ability to first invert and then use uncompressed index for operations like reordering documents, but rather make it an optional step.
We want the ability to run PISA as a server and expose APIs for index operations. We can aim at compatibility between REST API and CLI, but the server would allow us to do some additional in-memory caching, and delay writing some things to disk when doing many updates.
Searching across multiple indexed "fields". This is a down-the-road nice-to-have, but we should at least make our architecture facilitate it for the future or for 3rd-party implementations.
Although we want to cover many compression and retrieval algorithms, this comes at a high cost during compilation. For many (most?) use cases, however, a user wants a single encoding. Thus, it would be very beneficial if we could disable some of them at compile time, reducing compilation time and binary size.
Multi-stage ranking is something we want to address in the future. Therefore, we should make sure that we facilitate easy and efficient feature calculation.
Dense retrieval will require a separate type of index and therefore will have little impact on the design of the index itself, but we should consider it when designing high-level functionality.
This is a work-in-progress list of definitions, subject to change.
Collection represents any data created from a given corpus. This includes any fields (at first, only one), shards, lexicons, and any other related data structures.
Stores document data, including non-indexed payload.
❓ Should forward index be sharded together with the inverted index? This makes little difference when all resides on the same machine, but impacts distributed systems where shards are on different machines. Note that some of the document-indexed data must be available at query time for scoring, so does it make sense to make forward index part of the overall "Index" structure?
Index stores searchable text data. A single index is parameterized by its encoding format and other properties, such as whether it is quantized or not.
A single static inverted index structure, along with auxiliary data, such as document lengths, max scores (or frequencies), etc.
❓ Should we follow the existing naming convention and call it "index segment" as opposed to "shard"? This may be a good idea as "shard" could be regarded as a set of multiple index segments for purposes of distributing over multiple machines.
A static inverted index containing all of the shard's postings.
Maps document IDs to lengths.
Maps terms to internal term IDs.
Maps terms to global statistics used in scoring, e.g., the number of term occurrences or the number of postings.
An inverted index in the context of this document is a static inverted index structure over a single text field. It does not support adding or modifying documents -- this is something that would be done on a higher level. However, we may need a way of marking documents for deletion on a lower level, because we can't simply filter them out at the end (we would end up with fewer documents than requested). That said, it is probably best to encode deletions in a separate structure and handle them when aggregating top-k documents, as opposed to having them encoded directly in the index. This way, we may optimize retrieval in cases when no deletions are detected.
An inverted index is parameterized by the type of posting list encoding. Posting lists contain pairs of document IDs and an integer payload (frequency or quantized score).
The index interface must provide the following functionality:
- return size (number of terms),
- return number of documents,
- access a posting list at a given position (equivalent to the term ID).
Some additional functionality may be provided, such as iterating over all posting lists.
Note that we do not mention any merging capability here, as merging must be done at the segment level (which includes lexicons and document lengths).
We can express the index interface as a C++ concept:
template <typename T, typename Cursor>
concept InvertedIndex = PostingCursor<Cursor, std::uint32_t>
    && requires(T const i, std::uint32_t termid) {
        { i.operator[](termid) } -> std::same_as<Cursor>;
        { i.num_docs() } -> std::convertible_to<std::size_t>;
        { i.size() } -> std::convertible_to<std::size_t>;
    };
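As an illustration, a generic function can be constrained by this concept. The sketch below assumes a hypothetical member alias Index::cursor_type naming the cursor type an index produces; neither the alias nor the function is part of the proposed interface itself.

#include <cstddef>
#include <cstdint>

// Sketch: sum the lengths of all posting lists in any index satisfying
// InvertedIndex. The cursor type is obtained through a hypothetical
// member alias `cursor_type`.
template <typename Index>
    requires InvertedIndex<Index, typename Index::cursor_type>
auto total_postings(Index const& index) -> std::size_t
{
    std::size_t total = 0;
    for (std::uint32_t termid = 0; termid < index.size(); ++termid) {
        total += index[termid].size();  // posting list length via its cursor
    }
    return total;
}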
When accessing a posting list, we get a posting cursor, expressed here by the PostingCursor concept, which will be explained below.
Notice that the cursor is parameterized by another type. This is because this concept is broader than just reading from an index structure. For example, we can apply a transformation for posting scoring, which produces a cursor of type T that satisfies PostingCursor<T, float>.
As mentioned above, a cursor is parameterized by its payload, which is always an integer for an inverted index representation, but can be different if we apply a transformation.
Any cursor implements the following functionality:
- return the number of postings in the list,
- return the document ID at the current position,
- return the payload value at the current position,
- move to the next position.
Here is a C++ concept representation:
template <typename C, typename Payload>
concept PostingCursor = requires(C const& cursor, Payload payload) {
    { cursor.docid() } -> std::convertible_to<std::uint32_t>;
    { cursor.size() } noexcept -> std::convertible_to<std::size_t>;
} && requires(C cursor) {
    { cursor.value() } -> std::convertible_to<Payload>;
    cursor.next();
};
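To make this concrete, here is a minimal in-memory cursor over two parallel vectors that satisfies PostingCursor<VectorCursor, std::uint32_t>. This is only a sketch: the name VectorCursor is hypothetical, and end-of-list handling (e.g., a sentinel document ID) is omitted for brevity.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical in-memory cursor over parallel docid/frequency vectors;
// for illustration only (no end-of-list handling).
class VectorCursor {
  public:
    VectorCursor(std::vector<std::uint32_t> docids, std::vector<std::uint32_t> freqs)
        : m_docids(std::move(docids)), m_freqs(std::move(freqs))
    {}

    [[nodiscard]] auto docid() const -> std::uint32_t { return m_docids[m_pos]; }
    [[nodiscard]] auto value() const -> std::uint32_t { return m_freqs[m_pos]; }
    [[nodiscard]] auto size() const noexcept -> std::size_t { return m_docids.size(); }
    void next() { m_pos += 1; }

  private:
    std::vector<std::uint32_t> m_docids;
    std::vector<std::uint32_t> m_freqs;
    std::size_t m_pos = 0;
};

static_assert(PostingCursor<VectorCursor, std::uint32_t>);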
We can build upon this definition to define some specialized cursors. For example, for a posting list in which the postings are sorted in increasing order of document IDs, we define an additional function next_geq(docid) that moves the cursor to the next position at which the document ID is at least docid (this operation doesn't make sense if the ordering is different). Thus, we can define SortedPostingCursor as follows:
template <typename C, typename Payload>
concept SortedPostingCursor = PostingCursor<C, Payload>
    && requires(C cursor, std::uint32_t docid) {
        cursor.next_geq(docid);
    };
Any document-at-a-time (DAAT) retrieval algorithm will require this concept to be satisfied by the cursors, while term-at-a-time (TAAT) algorithms will have a relaxed requirement.
Similarly, we can define concepts for other cursors, e.g., for posting lists stored in score-at-a-time (SAAT) order.
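To illustrate how a DAAT algorithm uses next_geq, below is a sketch of a conjunctive (AND) match over two sorted cursors. It assumes the common convention that an exhausted cursor positions itself at a sentinel document ID no smaller than any real one; the function name and the sentinel parameter are hypothetical, not part of the proposal.

#include <cstdint>
#include <vector>

// Sketch: leapfrog intersection of two sorted posting cursors.
// `sentinel` is a hypothetical document ID that both cursors report
// once exhausted (e.g., the number of documents in the index).
template <typename Cursor>
    requires SortedPostingCursor<Cursor, std::uint32_t>
auto intersect(Cursor lhs, Cursor rhs, std::uint32_t sentinel) -> std::vector<std::uint32_t>
{
    std::vector<std::uint32_t> matches;
    while (lhs.docid() < sentinel) {
        rhs.next_geq(lhs.docid());           // catch up the second cursor
        if (rhs.docid() == lhs.docid()) {
            matches.push_back(lhs.docid());  // both lists contain the document
            lhs.next();
        } else {
            lhs.next_geq(rhs.docid());       // leapfrog over the gap
        }
    }
    return matches;
}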
Two major operations on posting cursors are transformations and aggregations.
A transformation takes a cursor, or multiple cursors, and produces another cursor. An example of a transformation is a scoring cursor: given a cursor and a scoring function (with some additional data used for scoring the particular term/list), it produces another cursor with a floating-point payload. Other kinds of transformations are unions and intersections. These can be used for more complex retrieval algorithms[^1].
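As an example of the scoring transformation, the wrapper below turns an integer-payload cursor into a float-payload one, satisfying PostingCursor<ScoredCursor<Cursor>, float>. The names are hypothetical, and a real implementation would carry the term-specific data (e.g., IDF) inside the scoring function.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical scoring transformation: wraps a frequency cursor and a
// per-term scoring function, exposing scores as the payload.
template <typename Cursor>
class ScoredCursor {
  public:
    using Scorer = std::function<float(std::uint32_t docid, std::uint32_t freq)>;

    ScoredCursor(Cursor cursor, Scorer scorer)
        : m_cursor(std::move(cursor)), m_scorer(std::move(scorer))
    {}

    [[nodiscard]] auto docid() const -> std::uint32_t { return m_cursor.docid(); }
    [[nodiscard]] auto size() const noexcept -> std::size_t { return m_cursor.size(); }
    // The payload is now a score rather than a frequency.
    auto value() -> float { return m_scorer(m_cursor.docid(), m_cursor.value()); }
    void next() { m_cursor.next(); }

  private:
    Cursor m_cursor;
    Scorer m_scorer;
};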
To retrieve the final set of documents, we employ aggregations. They take one or more cursors and return a set of retrieved documents. For each algorithm, we define a collector. For example, if we want to retrieve all of the matching documents, we can define a collector that takes a list of cursors, transforms them into a union (or intersection), and collects the results into a vector. This collector takes any type of cursor. Other collectors may constrain the type of cursors they accept. For example, the MaxScore or BlockMaxWand collectors require special max-score and block-max-score (sorted) cursors, respectively.
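As a minimal sketch of an aggregation, the collector below exhausts a single cursor into a vector of (document ID, payload) pairs. As before, the sentinel convention and the names are assumptions for illustration; real collectors would operate over many cursors at once.

#include <cstdint>
#include <utility>
#include <vector>

// Sketch: collect every posting of a cursor into a vector. `sentinel`
// is again the hypothetical end-of-list document ID.
template <typename Payload, typename Cursor>
    requires PostingCursor<Cursor, Payload>
auto collect(Cursor cursor, std::uint32_t sentinel)
    -> std::vector<std::pair<std::uint32_t, Payload>>
{
    std::vector<std::pair<std::uint32_t, Payload>> results;
    while (cursor.docid() < sentinel) {
        results.emplace_back(cursor.docid(), cursor.value());
        cursor.next();
    }
    return results;
}

A MaxScore or BlockMaxWand collector would replace the plain vector with a top-k heap and exploit the extra guarantees of its specialized cursors to skip postings.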
[^1]: For example: https://dl.acm.org/doi/abs/10.1145/3488560.3498489