Remote Vector Index Build Component -- Client Low Level Design #2518
Overview
The Remote Vector Service Client will handle all communication with the Remote Vector Index Build Service. Its main responsibilities, as outlined below, are to call the remote build service to begin index construction once the vector files have been uploaded, periodically monitor the build requests, and pass the completed index build path along to the index writer after construction is complete, handling any errors along the way.
Architecture and Implementation
Core Components
Http Client
The HttpClient we will start with is the ApacheHttpClient. Holistically speaking, the Apache client is well-supported, well-documented, and widely used. Apache offers a synchronous client, which matches the overall logic of segment merge/flush, where the thread is dedicated to the task until completion. An asynchronous API in this case would mean the API returns a future/promise of the result and doesn’t block the calling thread. The server-side implementation for an asynchronous API adds a lot more complexity.
ApacheHttpClient is already a dependency from OpenSearch Core through the AWS SDK, so we can transitively use the dependency from OpenSearch Core instead of adding it to k-NN, leaving versioning logic to Core. The included HttpClient implementations are already used by the opensearch/ml-commons package, and in Core OpenSearch by the S3 Service, AWS KMS, and more. The ml-connectors use case is to create connectors to outside services (e.g. OpenAI). By following the same patterns and utilizing its established dependencies and protocols, we stay consistent with current standards and are aligned on proven security practices.
Future iterations can easily decide to switch to a different synchronous client from the SDK (AWS Common Runtime (CRT) Http Client or Java’s UrlConnectionHTTPClient). For more info: HTTP clients - AWS SDK for Java 2.x. By utilizing the HttpClient through the SDK instead of directly, we gain any future updates/new client support for free.
We will start with Basic Authentication, which is very simple to set up with Apache, but the HttpClients also support other auth methods such as Digest and NTLM. API keys, environment variables, etc. can easily be added to the request header.
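As a rough illustration of what Basic Authentication puts on the wire, the sketch below builds the Authorization header value by hand. The class and method names are hypothetical and not part of k-NN; in practice the HTTP client's own credentials support would likely be used instead. Note that Basic credentials are only Base64-encoded, not encrypted, so the endpoint should be served over TLS.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper (not part of k-NN): builds the value of the HTTP
// Authorization header for Basic Authentication.
final class BasicAuthHeader {
    private BasicAuthHeader() {}

    static String of(String username, String password) {
        // Basic auth is the literal "Basic " followed by base64("username:password").
        String credentials = username + ":" + password;
        String encoded = Base64.getEncoder()
            .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }
}
```

The resulting string would be set as the value of the Authorization request header on each call to the build service.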
Future improvements to the client may support other protocols, such as gRPC and HTTP/2.
Class Creation
The RemoteIndexClient will be created inside KNNPlugin and follow a singleton instance pattern, accessible with .getInstance(). The Apache HttpClient docs suggest to always re-use client instances: “They are expensive to create, but they are also fully thread safe, so multiple threads can use the same instance of CloseableHttpClient to execute multiple requests concurrently taking full advantage of persistent connection re-use and connection pooling.”
The Singleton client’s outgoing requests will always be bounded by the number of merge threads working at once on the node, which is fixed by thread pool size. This pattern also allows for clean shutdown of the HTTP client and streamlined metrics collection across all build service interactions.
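A minimal sketch of the singleton pattern described above, using the initialization-on-demand holder idiom. The class name RemoteIndexClientSketch is a stand-in chosen for illustration; the real RemoteIndexClient would wrap the shared HTTP client and its configuration.

```java
// Illustrative stand-in for the real client. Only the singleton mechanics
// are shown; no HTTP client is actually created here.
final class RemoteIndexClientSketch {

    // Initialization-on-demand holder: the JVM initializes Holder (and so
    // creates INSTANCE) exactly once, on first access, in a thread-safe way.
    private static final class Holder {
        static final RemoteIndexClientSketch INSTANCE = new RemoteIndexClientSketch();
    }

    private RemoteIndexClientSketch() {
        // Expensive setup (connection pool, auth, timeouts) would happen here, once.
    }

    static RemoteIndexClientSketch getInstance() {
        return Holder.INSTANCE;
    }
}
```

Every merge thread calling getInstance() receives the same object, so connection pooling is shared. This idiom fixes the instance for the life of the JVM; a variant that supports replacing the client would need to hold the instance in a mutable, safely published field instead.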
The Client may need to be rebuilt to change certain settings, so we can use something like .addSettingsUpdateConsumer in k-NN settings to rebuild the client on any setting changes that cannot be updated dynamically.
Code Flow
writeFiles finishes and triggerBuild is called.
The client then polls the build status while (status == RUNNING_INDEX_BUILD), handling each status as follows:
RUNNING_INDEX_BUILD: Continue polling if the timeout is not exceeded. If it is, break the polling loop, call the /_cancel API, and fall back to local CPU build.
FAILED_INDEX_BUILD: This indicates a terminal error from the build service. The client will break the polling loop and fall back to local CPU build.
COMPLETED_INDEX_BUILD: Break the polling loop. The index file path is in the API response; use it to signal to IndexBuildService that the index build is done.
COMPLETED status.
Error Handling
Timeouts
With the challenges that come with working with a remote service, there needs to be some concept of a timeout on the client side. We can start with 1 hour; most builds should be done within minutes. Having a timeout on the full process (defined as how long we will wait for COMPLETED_INDEX_BUILD after sending the initial build request) allows us to fall back to the CPU build when we estimate it is worth it.
More benchmarking and testing will help us establish an SLA from the server side, and we can time out at 2x this SLA, as an example. Even if the GPU worker is busy, it may still be faster to wait than to fall back to a CPU build; the timeout threshold determines where that trade-off is made.
The server should also implement a timeout of its own that triggers when it has received no polling requests for some time, in case the cluster has shut down.
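The polling and client-side timeout behaviour can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the BuildService interface, the Outcome enum, and the injected clock and pause hooks are all hypothetical names introduced here so the loop can be exercised without a real server. Only the three status names come from this design.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

final class BuildPoller {
    // Status names as used in this design.
    enum Status { RUNNING_INDEX_BUILD, COMPLETED_INDEX_BUILD, FAILED_INDEX_BUILD }

    // Hypothetical result type: what the caller should do next.
    enum Outcome { COMPLETED, FALL_BACK_AFTER_FAILURE, FALL_BACK_AFTER_TIMEOUT }

    // Hypothetical view of the remote build service (status and cancel calls).
    interface BuildService {
        Status getStatus(String jobId);
        void cancel(String jobId);
    }

    /**
     * Polls until the build reaches a terminal status or the timeout elapses.
     * nanoClock supplies the current time in nanoseconds and pause waits one
     * poll interval; both are injected so they can be replaced in tests.
     */
    static Outcome awaitBuild(BuildService service, String jobId, Duration timeout,
                              LongSupplier nanoClock, Runnable pause) {
        long deadline = nanoClock.getAsLong() + timeout.toNanos();
        while (true) {
            Status status = service.getStatus(jobId);
            switch (status) {
                case COMPLETED_INDEX_BUILD:
                    return Outcome.COMPLETED;
                case FAILED_INDEX_BUILD:
                    // Server-determined failure: the server cleans up, no cancel is sent.
                    return Outcome.FALL_BACK_AFTER_FAILURE;
                default:
                    break; // still running
            }
            if (nanoClock.getAsLong() - deadline >= 0) {
                // Client-side timeout: tell the server we are abandoning the build.
                service.cancel(jobId);
                return Outcome.FALL_BACK_AFTER_TIMEOUT;
            }
            pause.run();
        }
    }
}
```

In production the hooks could be System::nanoTime and a pause that blocks for the configured poll interval. Both fall-back outcomes lead to the local CPU build; they are kept separate here only so the timeout case can be counted on its own.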
Cancel Job
The client will use /_cancel exclusively after a timeout, as defined above, to signal to the server that it has timed out and is abandoning the build. This allows the build service to clean up and cancel the job to avoid wasting GPU resources. Client-side timeout is the only build failure that is determined by the client. Cancel requests do not need to be sent by the client in other failure cases (like FAILED_INDEX_BUILD) since they represent failures that have been determined by the server, so the server will take care of the cleanup / cancel process.
The user may use this API outside of their OpenSearch process if they decide they need to cancel an ongoing build for any reason.
Retriable Request Errors
Some HTTP requests may receive a response with an error status code. The client will resend an HTTP request that receives any of the following errors:
The client comes with recommended default retry logic. For our first iteration, most of the defaults provided should be kept in place. In practice, nodes will not be sending extreme amounts of requests to the same build endpoint, so the default values will keep load in check while also not being too restrictive. Future iterations may tweak these settings based on observed behavior and benchmarking results.
We will tweak the max attempts down to 2 (1 retry) to avoid overloading the server.
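To make the "max attempts of 2" behaviour concrete, here is a hand-written sketch of the retry rule. It is an assumption-laden illustration only: the real client is expected to rely on the HTTP client's built-in retry configuration rather than a loop like this, the names are hypothetical, and the set of retriable status codes is passed in by the caller rather than fixed here. Backoff between attempts is omitted.

```java
import java.util.Set;
import java.util.function.IntSupplier;

final class RetrySketch {
    // One initial attempt plus one retry.
    static final int MAX_ATTEMPTS = 2;

    private RetrySketch() {}

    /**
     * Sends the request via sendRequest (which returns the HTTP status code)
     * and resends it while the status is retriable and attempts remain.
     * Returns the status code of the last attempt.
     */
    static int sendWithRetry(IntSupplier sendRequest, Set<Integer> retriableCodes) {
        int status = sendRequest.getAsInt();
        for (int attempt = 1; attempt < MAX_ATTEMPTS && retriableCodes.contains(status); attempt++) {
            status = sendRequest.getAsInt();
        }
        return status;
    }
}
```

If the returned status is still an error after the final attempt, the caller treats the request as failed and does not try again.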
Non-Retriable Request Errors
The following failure codes signify that the build request should be abandoned. A response with one of these status codes will cause the client to throw an IOException with a descriptive error, and the node will fall back to CPU build. Most will require intervention (bad auth, bad endpoint, unhealthy build service).
Aside from Retriable and Non-Retriable Requests as defined above, which are specifically for HTTP errors, there is no other concept of a retry on the client side. The client will never be responsible for manually resubmitting a build request after a failure (except if the initial build request fails with a Retriable Request error code).
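The fall-back rule for request failures might look like the sketch below: any IOException raised by the remote path results in the local CPU build, with no resubmission. The FallbackSketch class, the RemoteBuild interface, and the generic result type are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.util.function.Supplier;

final class FallbackSketch {
    // Hypothetical remote build step that may fail with an IOException,
    // for example on a non-retriable HTTP status code.
    interface RemoteBuild<T> {
        T run() throws IOException;
    }

    private FallbackSketch() {}

    static <T> T buildWithFallback(RemoteBuild<T> remoteBuild, Supplier<T> localCpuBuild) {
        try {
            return remoteBuild.run();
        } catch (IOException e) {
            // The remote build is abandoned; it is not resubmitted.
            return localCpuBuild.get();
        }
    }
}
```

A real implementation would presumably also log the descriptive error and record a failure metric before falling back.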
API Contract
Trigger Build
Index Parameters note:
The remote build service will need certain required parameters to accurately read the vectors (data_type, dimensions, etc). It also needs to support per-method (HNSW, IVF) parameters. We need to be able to pass these parameters while supporting the ability to add or update index parameters in the future.
An additional “task_status” field was considered for the /_build response, in case the build was already in progress from a different request, but we ultimately decided it was unnecessary.
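One way to keep the index parameters extensible is to validate only the required fields and pass per-method parameters through as an open-ended map. The sketch below illustrates that idea and should not be read as the actual request schema: data_type and dimensions are the fields named in the note above, while the "method", "name", and "parameters" keys and the example parameter names in the usage note are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BuildRequestParamsSketch {
    private BuildRequestParamsSketch() {}

    static Map<String, Object> build(String dataType, int dimensions,
                                     String method, Map<String, Object> methodParams) {
        // Required fields are checked up front.
        if (dataType == null || dataType.isEmpty()) {
            throw new IllegalArgumentException("data_type is required");
        }
        if (dimensions <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        // Per-method parameters are copied as-is, so new parameters can be
        // added later without changing this class.
        Map<String, Object> parameters = new LinkedHashMap<>(methodParams);
        Map<String, Object> methodSection = new LinkedHashMap<>();
        methodSection.put("name", method);             // hypothetical key
        methodSection.put("parameters", parameters);   // hypothetical key

        Map<String, Object> body = new LinkedHashMap<>();
        body.put("data_type", dataType);
        body.put("dimensions", dimensions);
        body.put("method", methodSection);             // hypothetical key
        return body;
    }
}
```

For example, an HNSW request could pass a map with entries such as "m" and "ef_construction", and an IVF request a different set, without either being known to the builder.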
Get Status
Cancel Job
The API contract guarantees that the client can assume a build no longer needs to be tracked after a successful call to /_cancel, and subsequent status calls should return that the build does not exist. This can be used to differentiate between a canceled and a failed build, the latter returning the FAILED_INDEX_BUILD status.
Authentication
If authentication is configured:
Operations
Settings
All the first iteration settings will be cluster level.
knn.remote_build_service.endpoint (List of strings)
- User specified endpoint(s) for index build server
Authentication can be optionally set by the user in the OpenSearch Keystore
knn.remote_build_service.password
./bin/opensearch-keystore add knn.remote_build_service.password
The following settings will have strong defaults supported by benchmarking, and can be left unadvertised to the user to avoid too many knobs (while remaining available to tweak for advanced use cases):
knn.remote_build_service.poll_interval
knn.remote_build_service.timeout
Security
The build service will start by supporting Basic Authentication (user:pass), with the credentials read from the OpenSearch Keystore. In the future, we can add different credentials providers to our HTTP client.
Metrics
The client will collect the following metrics for KNNStats which can be listed under a remote_build JSON level that is only published if the feature flag is enabled.
status_request_success_count
status_request_failure_count
index_build_success_count
index_build_failure_count
index_build_timeout_count
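A minimal sketch of how the counters listed above could be collected and exposed under a remote_build level. The RemoteBuildStatsSketch class and its methods are hypothetical and do not reflect the actual KNNStats API; only the metric names and the remote_build key come from this design.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

final class RemoteBuildStatsSketch {
    // One constant per metric; the published name is the lower-cased constant.
    enum Counter {
        STATUS_REQUEST_SUCCESS_COUNT,
        STATUS_REQUEST_FAILURE_COUNT,
        INDEX_BUILD_SUCCESS_COUNT,
        INDEX_BUILD_FAILURE_COUNT,
        INDEX_BUILD_TIMEOUT_COUNT
    }

    // The map is filled in the constructor and never modified afterwards;
    // LongAdder handles concurrent increments from merge threads.
    private final Map<Counter, LongAdder> counters = new EnumMap<>(Counter.class);

    RemoteBuildStatsSketch() {
        for (Counter c : Counter.values()) {
            counters.put(c, new LongAdder());
        }
    }

    void increment(Counter counter) {
        counters.get(counter).increment();
    }

    // Returns an empty map when the feature flag is off, so nothing is published.
    Map<String, Object> toStats(boolean featureEnabled) {
        Map<String, Object> stats = new LinkedHashMap<>();
        if (!featureEnabled) {
            return stats;
        }
        Map<String, Long> remoteBuild = new LinkedHashMap<>();
        for (Map.Entry<Counter, LongAdder> e : counters.entrySet()) {
            remoteBuild.put(e.getKey().name().toLowerCase(Locale.ROOT), e.getValue().sum());
        }
        stats.put("remote_build", remoteBuild);
        return stats;
    }
}
```

Because the client is a singleton, a single stats object like this could be shared across all build service interactions on the node.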
From Jay’s LLD:
Future Improvements / Below The Line
Sources
REST API error codes and their descriptions were taken from https://www.restapitutorial.com/advanced/responses/retries.