PISA v1 Proposal
The purpose of this document is to propose a high-level architecture and format for the PISA index and related structures.
This section describes the desired outcomes of the proposed design, in no particular order.
We aim to preserve the high-performance nature of the project. Although some concessions might be made to facilitate new functionality, the performance when not using those features should generally not decline. However, in certain cases, we may still decide to sacrifice some performance if it improves maintainability or user experience, though these would have to be considered on a case-by-case basis, and should be avoided if possible.
We want to preserve the existing functionality unless we deem some of it unnecessary. For example, we want to maintain the support for a range of different retrieval algorithms, index compression techniques, etc. It is acceptable that some new functionality excludes some of these or otherwise compromises performance for them, e.g., if we introduce index updates, it may be less efficient to do it for quantized indexes, but the current functionality of the quantized indexes should be maintained.
We want to make it easier to use by default, but still powerful and flexible if needed. For example, we want to introduce abstractions to manage an index without having to manually keep track of all individual files (index, lexicon, metadata, etc.), but we still want to maintain low-level access to those if needed for a power user.
Our code should facilitate building custom systems. A single component, e.g., inverted index or lexicon, should be decoupled enough to use on its own.
The project should be easy to extend with new functionality, which is especially important for academic purposes.
We currently suffer from many segfaults whenever someone makes a mistake, e.g., passes the wrong encoding value when running queries. This should be avoided by encoding these parameters in the index itself, so that compatibility checks can be performed and we can fail gracefully if two components are not compatible.
The v1 API should be stable and backwards-compatible.
Thus, we should also clearly delineate which parts of the code are part of the public API, e.g., by moving any "private" template code to a separate "private" namespace (such as pisa::details). This should be clearly documented.
We should clearly document all the components and public-facing APIs and commands.
This is a list of new functionality we would like to support in the future. Note that this does not mean these have to be implemented for v1, but rather we should have them in mind when designing.
We want a facility to update an existing index. This will likely mean implementing two pieces of functionality: index merging and querying multiple indexes at a time. Additionally, we may want to implement an in-memory dynamic index for new documents before they are written to disk in the usual format. Similarly, we can think of document deletion, which most likely involves marking documents as deleted in the existing index, and "committing" the deletes by rebuilding the index structure without the marked documents. Furthermore, document modifications can be implemented by adding a new document and marking the old one for deletion.
This may be at least partially covered by index updates, or is at least closely related to them. A user that is not interested in index details should be able to feed documents to PISA and get all necessary structures, instead of manually going through multiple stages.
We should be able to index documents straight to the compressed format, and use the compressed format in place of the uncompressed one when doing things like computing max-scores. We do not want to take away the ability to first invert and then use uncompressed index for operations like reordering documents, but rather make it an optional step.
We want the ability to run PISA as a server and expose APIs for index operations. We can aim at compatibility between REST API and CLI, but the server would allow us to do some additional in-memory caching, and delay writing some things to disk when doing many updates.
Searching across multiple indexed "fields". This is a down-the-road nice-to-have, but we should at least make our architecture facilitate it for the future or for 3rd-party implementations.
Although we want to cover many compression and retrieval algorithms, this comes at a high cost during compilation. For many (most?) use cases, however, a user wants a single encoding. Thus, it would be very beneficial if we could disable some of them at compile time, reducing compilation time and binary size.
Multi-stage ranking is something we want to address in the future. Therefore, we should make sure that we facilitate easy and efficient feature calculation.
Dense retrieval will require a separate type of index and therefore will have little impact on the design of the index itself, but we should consider it when designing high-level functionality.
This is a work-in-progress list of definitions, subject to change.
Collection represents any data created from a given corpus. This includes any fields (at first, only one), shards, lexicons, and any other related data structures.
Stores document data, including non-indexed payload.
❓ Should forward index be sharded together with the inverted index? This makes little difference when all resides on the same machine, but impacts distributed systems where shards are on different machines. Note that some of the document-indexed data must be available at query time for scoring, so does it make sense to make forward index part of the overall "Index" structure?
Index stores searchable text data. A single index is parameterized by its encoding format and other properties, such as whether it is quantized or not.
A single static inverted index structure, along with auxiliary data, such as document lengths, max scores (or frequencies), etc.
❓ Should we follow the existing naming convention and call it "index segment" as opposed to "shard"? This may be a good idea as "shard" could be regarded as a set of multiple index segments for purposes of distributing over multiple machines.
A static inverted index containing all of the shard's postings.
Maps document IDs to lengths.
Maps terms to internal term IDs.
Maps terms to global statistics used in scoring, e.g., the number of term occurrences or the number of postings.
An inverted index in the context of this document is a static inverted index structure over a single text field. It does not support adding or modifying documents -- this is something that would be done on a higher level. However, we may need a way of marking documents for deletion on a lower level, because we can't simply filter them out at the end (we would end up with fewer documents than requested). That said, it is probably best to encode deletions in a separate structure and handle them when aggregating top-k documents, as opposed to having them encoded directly in the index. This way, we may optimize retrieval in cases when no deletions are detected.
An inverted index is parameterized by the type of posting list encoding. Posting lists contain pairs of document IDs and an integer payload (frequency or quantized score).
The index interface must provide the following functionality:
- return size (number of terms),
- return number of documents,
- access a posting list at a given position (equivalent to the term ID).
Some additional functionality may be provided, such as iterating over all posting lists.
Note that we do not mention any merging capability here, as merging must be done at the segment level (which includes lexicons and document lengths).
We can express the index interface as a C++ concept:
template <typename T, typename Cursor>
concept InvertedIndex = PostingCursor<Cursor, std::uint32_t>
    && requires(T const i, std::uint32_t termid) {
        { i.operator[](termid) } -> std::same_as<Cursor>;
        { i.num_docs() } -> std::convertible_to<std::size_t>;
        { i.size() } -> std::convertible_to<std::size_t>;
    };
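As an illustration, a generic function can be constrained by this concept. The sketch below assumes a hypothetical member alias Index::cursor_type naming the cursor type an index produces; neither the alias nor the function is part of the proposed interface itself.

#include <cstddef>
#include <cstdint>

// Sketch: sum the lengths of all posting lists in any index satisfying
// InvertedIndex. The cursor type is obtained through a hypothetical
// member alias `cursor_type`.
template <typename Index>
    requires InvertedIndex<Index, typename Index::cursor_type>
auto total_postings(Index const& index) -> std::size_t
{
    std::size_t total = 0;
    for (std::uint32_t termid = 0; termid < index.size(); ++termid) {
        total += index[termid].size();  // posting list length via its cursor
    }
    return total;
}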
When accessing a posting list, we get a posting cursor, expressed here by the PostingCursor concept, which will be explained below.
Notice that the cursor is parameterized by another type. This is because this concept is broader than just reading from an index structure. For example, we can apply a transformation for posting scoring, which produces a cursor of type T that satisfies PostingCursor<T, float>.
As mentioned above, a cursor is parameterized by its payload, which is always an integer for an inverted index representation, but can be different if we apply a transformation.
Any cursor implements the following functionality:
- return the number of postings in the list,
- return the document ID at the current position,
- return the payload value at the current position,
- move to the next position.
Here is a C++ concept representation:
template <typename C, typename Payload>
concept PostingCursor = requires(C const& cursor, Payload payload) {
    { cursor.docid() } -> std::convertible_to<std::uint32_t>;
    { cursor.size() } noexcept -> std::convertible_to<std::size_t>;
} && requires(C cursor) {
    { cursor.value() } -> std::convertible_to<Payload>;
    cursor.next();
};
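To make this concrete, here is a minimal in-memory cursor over two parallel vectors that satisfies PostingCursor<VectorCursor, std::uint32_t>. This is only a sketch: the name VectorCursor is hypothetical, and end-of-list handling (e.g., a sentinel document ID) is omitted for brevity.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical in-memory cursor over parallel docid/frequency vectors;
// for illustration only (no end-of-list handling).
class VectorCursor {
  public:
    VectorCursor(std::vector<std::uint32_t> docids, std::vector<std::uint32_t> freqs)
        : m_docids(std::move(docids)), m_freqs(std::move(freqs))
    {}

    [[nodiscard]] auto docid() const -> std::uint32_t { return m_docids[m_pos]; }
    [[nodiscard]] auto value() const -> std::uint32_t { return m_freqs[m_pos]; }
    [[nodiscard]] auto size() const noexcept -> std::size_t { return m_docids.size(); }
    void next() { m_pos += 1; }

  private:
    std::vector<std::uint32_t> m_docids;
    std::vector<std::uint32_t> m_freqs;
    std::size_t m_pos = 0;
};

static_assert(PostingCursor<VectorCursor, std::uint32_t>);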
We can build upon this definition to define some specialized cursors. For example, for a posting list in which the postings are sorted in increasing order of document IDs, we define an additional function next_geq(docid) that moves the cursor to the next position at which the document ID is at least docid (this operation doesn't make sense if the ordering is different). Thus, we can define SortedPostingCursor as follows:
template <typename C, typename Payload>
concept SortedPostingCursor = PostingCursor<C, Payload>
    && requires(C cursor, std::uint32_t docid) {
        cursor.next_geq(docid);
    };
Any document-at-a-time (DAAT) retrieval algorithm will require this concept to be satisfied by the cursors, while term-at-a-time (TAAT) algorithms will have a relaxed requirement.
Similarly, we can define concepts for other cursors, e.g., for posting lists stored in score-at-a-time (SAAT) order.
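To illustrate how a DAAT algorithm uses next_geq, below is a sketch of a conjunctive (AND) match over two sorted cursors. It assumes the common convention that an exhausted cursor positions itself at a sentinel document ID no smaller than any real one; the function name and the sentinel parameter are hypothetical, not part of the proposal.

#include <cstdint>
#include <vector>

// Sketch: leapfrog intersection of two sorted posting cursors.
// `sentinel` is a hypothetical document ID that both cursors report
// once exhausted (e.g., the number of documents in the index).
template <typename Cursor>
    requires SortedPostingCursor<Cursor, std::uint32_t>
auto intersect(Cursor lhs, Cursor rhs, std::uint32_t sentinel) -> std::vector<std::uint32_t>
{
    std::vector<std::uint32_t> matches;
    while (lhs.docid() < sentinel) {
        rhs.next_geq(lhs.docid());           // catch up the second cursor
        if (rhs.docid() == lhs.docid()) {
            matches.push_back(lhs.docid());  // both lists contain the document
            lhs.next();
        } else {
            lhs.next_geq(rhs.docid());       // leapfrog over the gap
        }
    }
    return matches;
}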
Two major operations on posting cursors are transformations and aggregations.
A transformation takes a cursor, or multiple cursors, and produces another cursor. An example of a transformation is a scoring cursor: given a cursor and a scoring function (with some additional data used for scoring the particular term/list), it produces another cursor with a floating-point payload. Other kinds of transformations are unions and intersections. These can be used for more complex retrieval algorithms[^1].
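As an example of the scoring transformation, the wrapper below turns an integer-payload cursor into a float-payload one, satisfying PostingCursor<ScoredCursor<Cursor>, float>. The names are hypothetical, and a real implementation would carry the term-specific data (e.g., IDF) inside the scoring function.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical scoring transformation: wraps a frequency cursor and a
// per-term scoring function, exposing scores as the payload.
template <typename Cursor>
class ScoredCursor {
  public:
    using Scorer = std::function<float(std::uint32_t docid, std::uint32_t freq)>;

    ScoredCursor(Cursor cursor, Scorer scorer)
        : m_cursor(std::move(cursor)), m_scorer(std::move(scorer))
    {}

    [[nodiscard]] auto docid() const -> std::uint32_t { return m_cursor.docid(); }
    [[nodiscard]] auto size() const noexcept -> std::size_t { return m_cursor.size(); }
    // The payload is now a score rather than a frequency.
    auto value() -> float { return m_scorer(m_cursor.docid(), m_cursor.value()); }
    void next() { m_cursor.next(); }

  private:
    Cursor m_cursor;
    Scorer m_scorer;
};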
To retrieve the final set of documents, we employ aggregations. They take one or more cursors and return a set of retrieved documents. For each algorithm, we define a collector. For example, if we want to retrieve all of the matching documents, we can define a collector that takes a list of cursors, transforms them into a union (or intersection), and collects the results into a vector. This collector takes any type of cursor. Other collectors may constrain the type of cursors they accept. For example, the MaxScore or BlockMaxWand collectors require special max-score and block-max-score (sorted) cursors, respectively.
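As a minimal sketch of an aggregation, the collector below exhausts a single cursor into a vector of (document ID, payload) pairs. As before, the sentinel convention and the names are assumptions for illustration; real collectors would operate over many cursors at once.

#include <cstdint>
#include <utility>
#include <vector>

// Sketch: collect every posting of a cursor into a vector. `sentinel`
// is again the hypothetical end-of-list document ID.
template <typename Payload, typename Cursor>
    requires PostingCursor<Cursor, Payload>
auto collect(Cursor cursor, std::uint32_t sentinel)
    -> std::vector<std::pair<std::uint32_t, Payload>>
{
    std::vector<std::pair<std::uint32_t, Payload>> results;
    while (cursor.docid() < sentinel) {
        results.emplace_back(cursor.docid(), cursor.value());
        cursor.next();
    }
    return results;
}

A MaxScore or BlockMaxWand collector would replace the plain vector with a top-k heap and exploit the extra guarantees of its specialized cursors to skip postings.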
[^1]: For example: https://dl.acm.org/doi/abs/10.1145/3488560.3498489