
Remote Vector Index Build Component -- Client Low Level Design #2518

owenhalpert opened this issue Feb 11, 2025 · 1 comment

Overview

The Remote Vector Service Client will handle all communication with the Remote Vector Index Build Service. Its main responsibilities, as outlined below, are to call the remote build service to begin index construction once the vector files have been uploaded, to periodically monitor in-progress builds, and to pass the completed index path along to the index writer, handling any errors along the way.


Architecture and Implementation

Core Components

Http Client

We will start with the ApacheHttpClient. The Apache client is well-supported, well-documented, and widely used, and it offers a synchronous client, which matches the overall logic of segment merge/flush, where the thread is dedicated to the task until completion. An asynchronous API would instead return a future/promise of the result without blocking the calling thread, and the server-side implementation for an asynchronous API adds significantly more complexity.

ApacheHttpClient is already a dependency of OpenSearch Core through the AWS SDK, so we can use the dependency transitively from OpenSearch Core instead of adding it to k-NN, leaving versioning logic to Core. The included HttpClient implementations are already used by the opensearch/ml-commons package, and in OpenSearch Core by the S3 Service, AWS KMS, and more. The ml-connectors use case is to create connectors to outside services (e.g. OpenAI). By following the same patterns and utilizing its established dependencies and protocols, we stay consistent with current standards and aligned with proven security practices.

Future iterations can easily switch to a different synchronous client from the SDK (the AWS Common Runtime (CRT) HTTP client or Java’s UrlConnectionHttpClient). For more info: HTTP clients - AWS SDK for Java 2.x. By using the HttpClient through the SDK rather than directly, we get future updates and new client support for free.

We will start with Basic Authentication, which is simple to set up with Apache, but the HttpClients support other auth methods such as Digest and NTLM. API keys, environment variables, etc. can easily be added to the request header.
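As a small sketch (the class and method names here are illustrative, not part of the design), the Basic Authentication header value the client would attach to each request can be built as follows; in practice the credentials would come from the OpenSearch Keystore settings described later in this document:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: construct the Basic Authentication header value for outgoing
// requests. Credentials are assumed to be read from the OpenSearch Keystore.
final class BasicAuthHeader {
    static String value(String username, String password) {
        String encoded = Base64.getEncoder().encodeToString(
                (username + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }
}
```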

Future improvements to the client may support other protocols, such as gRPC and HTTP/2.

Class Creation

The RemoteIndexClient will be created inside KNNPlugin and follow a singleton instance pattern, accessible with .getInstance(). The Apache HttpClient docs suggest always re-using client instances: “They are expensive to create, but they are also fully thread safe, so multiple threads can use the same instance of CloseableHttpClient to execute multiple requests concurrently taking full advantage of persistent connection re-use and connection pooling.”

The Singleton client’s outgoing requests will always be bounded by the number of merge threads working at once on the node, which is fixed by thread pool size. This pattern also allows for clean shutdown of the HTTP client and streamlined metrics collection across all build service interactions.

The Client may need to be rebuilt to change certain settings, so we can use something like .addSettingsUpdateConsumer in k-NN settings to rebuild the client on any setting changes that cannot be updated dynamically.
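A minimal sketch of the singleton-plus-rebuild pattern described above; the real RemoteIndexClient would wrap a CloseableHttpClient, and the endpoint value and method names here are hypothetical:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: singleton client that can be atomically rebuilt when a
// non-dynamic setting changes (via a settings-update consumer).
// The initial endpoint is a placeholder, not a real default.
final class RemoteIndexClient {
    private static final AtomicReference<RemoteIndexClient> INSTANCE =
            new AtomicReference<>(new RemoteIndexClient("http://localhost:8080"));

    private final String endpoint;

    private RemoteIndexClient(String endpoint) {
        this.endpoint = endpoint;
    }

    static RemoteIndexClient getInstance() {
        return INSTANCE.get();
    }

    // Invoked from something like .addSettingsUpdateConsumer; the real
    // implementation would also close the old underlying HTTP client.
    static void rebuild(String newEndpoint) {
        INSTANCE.set(new RemoteIndexClient(newEndpoint));
    }

    String endpoint() {
        return endpoint;
    }
}
```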


Code Flow

(Image: code flow diagram)
  1. In KNN’s internal Remote Index Build flow, writeFiles finishes and triggerBuild is called.
  2. In RemoteVectorServiceClient, we construct the build request and submit API call to trigger a remote index build.
    1. Assumption: vectors are successfully uploaded at remote store URL. Any errors in this process will have been handled by the k-NN index builder component before interacting with the client.
    2. Assumption: build service is up and running. (As a future improvement, we may add a health check endpoint)
  3. Initialize timeout tracking (set startTime to current time)
  4. Kick off the polling loop with an initial delay
  5. While status == RUNNING_INDEX_BUILD:
    1. Poll the server every 30 seconds, checking the build status.
    2. Check statusCode of HTTP response, handle accordingly by either resending the HTTP request up to a configurable max retry count or abandoning the build.
    3. If statusCode is 2xx, handleBuildStatus(response)
      1. RUNNING_INDEX_BUILD: Continue polling if timeout is not exceeded. If it is, break the polling loop, call the /_cancel API, and fallback to local CPU build.
      2. FAILED_INDEX_BUILD: This indicates a terminal error from the build service. Client will break polling loop and fallback to local CPU build.
      3. COMPLETED_INDEX_BUILD: Break polling loop. Index file path is in API response, use it to signal to IndexBuildService that index build is done.
        1. Assumption: Server verifies the index_build is present at the remote store URL before sending COMPLETED status.
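The polling loop above can be sketched as follows, with the HTTP status call and the sleep abstracted behind an iterator of status responses so the control flow is visible; the status names match the API contract, while the class, method, and path values are illustrative only:

```java
import java.util.Iterator;

// Sketch of the client polling loop: poll until a terminal status or until
// the timeout budget is exhausted, then fall back to the local CPU build.
final class BuildPoller {
    enum Status { RUNNING_INDEX_BUILD, FAILED_INDEX_BUILD, COMPLETED_INDEX_BUILD }

    // Returns the remote index path on success, or null to signal CPU fallback.
    static String awaitBuild(Iterator<Status> statusResponses,
                             int pollIntervalMillis,
                             int timeoutMillis) {
        int elapsed = 0;
        while (statusResponses.hasNext()) {
            Status status = statusResponses.next();
            switch (status) {
                case COMPLETED_INDEX_BUILD:
                    return "/remote/index/path"; // real path comes from the status response
                case FAILED_INDEX_BUILD:
                    return null; // terminal server-side failure: fall back to CPU build
                case RUNNING_INDEX_BUILD:
                    elapsed += pollIntervalMillis; // real code would sleep here
                    if (elapsed >= timeoutMillis) {
                        return null; // would call /_cancel, then fall back to CPU build
                    }
            }
        }
        return null;
    }
}
```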

Error Handling

Timeouts

Given the challenges of working with a remote service, there needs to be some concept of a timeout on the client side. We can start with 1 hour; most builds should finish within minutes. Having a timeout on the full process (defined as how long we will wait for COMPLETED_INDEX_BUILD after sending the initial build request) allows us to fall back to the CPU build when we estimate it is worth it.

More benchmarking and testing will help us establish an SLA from the server side, and we can time out at, for example, 2x this SLA. Even if the GPU worker is busy, waiting may still be faster than falling back to a CPU build; the timeout threshold will be crucial in making this determination.

The server should also implement a timeout of its own if it receives no polling requests for some time, in case the cluster has shut down.

Cancel Job

The client will use /_cancel exclusively after a timeout, as defined above, to signal to the server that it has timed out and is abandoning the build. This allows the build service to clean up and cancel the job to avoid wasting GPU resources. A client-side timeout is the only build failure determined by the client. Cancel requests do not need to be sent by the client in other failure cases (like FAILED_INDEX_BUILD) since those represent failures determined by the server, so the server will take care of the cleanup/cancel process.

The user may use this API outside of their OpenSearch process if they decide they need to cancel an ongoing build for any reason.

Retriable Request Errors

Some HTTP requests may receive a response with an error status code. The client will resend an HTTP request that receives any of the following errors:

408 (Request Timeout)
429 (Too Many Requests)
500 (Internal Server Error)
502 (Bad Gateway): Suggests a temporary network issue or service stack disruption that may self-correct.
503 (Service Unavailable): May be due to temporary service outages or in-progress deployments.
504 (Gateway Timeout): A downstream server (e.g., DNS) didn’t respond in time.
509 (Bandwidth Limit Exceeded)

The client comes with recommended default retry logic. For our first iteration, most of the defaults provided should be kept in place. In practice, nodes will not be sending extreme amounts of requests to the same build endpoint, so the default values will keep load in check while also not being too restrictive. Future iterations may tweak these settings based on observed behavior and benchmarking results.

We will tweak the max attempts down to 2 (1 retry) to avoid overloading the server.

Retries 2 times for a total of 3 attempts. You can adjust this with StandardRetryStrategy.Builder#maxAttempts(int).

For non-throttling errors, it uses the BackoffStrategy#exponentialDelay backoff strategy, with a base delay of 100 milliseconds and a max delay of 20 seconds. You can adjust this with StandardRetryStrategy.Builder#backoffStrategy.

For throttling errors (429, 502, 503, 509), it uses the BackoffStrategy#exponentialDelay backoff strategy, with a base delay of 1 second and a max delay of 20 seconds. You can adjust this with StandardRetryStrategy.Builder#throttlingBackoffStrategy.

Performs circuit breaking (disabling retries) in the event of high downstream failures. The first attempt is always executed, only retries are disabled. Adjust with StandardRetryStrategy.Builder#circuitBreakerEnabled.
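The exponential-delay policy quoted above reduces to min(base * 2^attempt, max). A minimal sketch, assuming a zero-based retry index (the class and method names are illustrative):

```java
// Sketch of exponential backoff: min(base * 2^attempt, max).
// With base = 100 ms and max = 20 s, this mirrors the non-throttling
// defaults described above.
final class Backoff {
    static long delayMillis(long baseMillis, long maxMillis, int retryAttempt) {
        if (retryAttempt >= 30) {
            return maxMillis; // guard against shift overflow for large attempt counts
        }
        return Math.min(baseMillis << retryAttempt, maxMillis);
    }
}
```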

Non-Retriable Request Errors

The following failure codes have been designated to signify that the build request should be abandoned. These status codes will throw an IOException with a descriptive error, and the node will fall back to the CPU build. Most will require intervention (bad auth, bad endpoint, unhealthy build service).

400 (Bad Request): Fix the issue in the request before trying again.
401 (Unauthorized): Fix the authentication issue before trying again.
403 (Forbidden): The client lacks permission to access the resource. Retrying will not change the server’s response.
404 (Not Found): The requested resource does not exist. Retrying will not succeed unless the resource becomes available later due to some background processing.
405 (Method Not Allowed): The HTTP method used is not supported. Retrying with the same method will not resolve the issue.
409 (Conflict): Indicates a conflict in the request such as unique constraints for referential integrity. Retrying without addressing the conflict will continue to fail.
422 (Unprocessable Entity): The server understands the request but cannot process it due to semantic errors. Fix the issue in the request before trying again.

Aside from Retriable and Non-Retriable Requests as defined above, which are specifically for HTTP errors, there is no other concept of a retry on the client side. The client will never be responsible for manually resubmitting a build request after a failure (except if the initial build request fails with a Retriable Request error code).
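The retriable/non-retriable decision tables above can be sketched as a simple classifier; the Outcome names are illustrative (a RETRY outcome resends the request up to the max attempt count, while FALLBACK abandons the remote build in favor of the local CPU build):

```java
import java.util.Set;

// Sketch: map an HTTP status code to a retry decision, following the
// retriable (408, 429, 500, 502, 503, 504, 509) and non-retriable
// code lists in this design.
final class RetryPolicy {
    enum Outcome { SUCCESS, RETRY, FALLBACK }

    private static final Set<Integer> RETRIABLE = Set.of(408, 429, 500, 502, 503, 504, 509);

    static Outcome classify(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return Outcome.SUCCESS;
        }
        if (RETRIABLE.contains(statusCode)) {
            return Outcome.RETRY;
        }
        return Outcome.FALLBACK; // e.g. 400, 401, 403, 404, 405, 409, 422
    }
}
```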


API Contract

Trigger Build

POST /_build

Request Parameters
{
    "repository_type" : "String", // Derived from repository settings
    "repository_name" : "String", // ⬆️
    "object_path" : "String", // Points to the directory all vector/doc id blobs are written to
    "tenant_id" : "String", // Unique identifier for the cluster making the request
    "index_parameters" : {
        ...
        "method": {
            "name": "hnsw",
            ...
        }
        ...
    }
}

Request Response:
{
    "job_id" : "String" // Unique identifier per build
}
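Assembling the /_build request body from the contract above might look like the following sketch; it uses plain string formatting to stay dependency-free (real code would likely use OpenSearch's XContentBuilder), and all parameter values in the test below are made up:

```java
// Sketch: build the /_build request body described in the API contract.
// Only a minimal "method.name" index parameter is shown; the real payload
// would carry the full set of index parameters.
final class BuildRequest {
    static String body(String repositoryType, String repositoryName,
                       String objectPath, String tenantId, String methodName) {
        return String.format(
            "{\"repository_type\":\"%s\",\"repository_name\":\"%s\","
                + "\"object_path\":\"%s\",\"tenant_id\":\"%s\","
                + "\"index_parameters\":{\"method\":{\"name\":\"%s\"}}}",
            repositoryType, repositoryName, objectPath, tenantId, methodName);
    }
}
```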

Index Parameters note:

The remote build service will need certain required parameters to accurately read the vectors (data_type, dimensions, etc). It also needs to support per-method (HNSW, IVF) parameters. We need to be able to pass these parameters while supporting the ability to add or update index parameters in the future.

  • To start, we will find a minimal set of parameters needed to construct the index builds. This may just be the full internal field mapping and we allow the server to use whatever parameters it needs.
  • Implementation will focus on making the parameters easily passable.

An additional “task_status” field was considered for the /_build response, in case the build was already in progress from a different request, but was ultimately decided to be unnecessary.

  • Duplicate build requests will return the same job ID
  • As of now there is no scenario in which a client would re-submit an identical build, but in that case, the returned job ID would then still be polled and any errors handled as normal.

Get Status

GET /_status/{job_id}

Request Response:
{
    "task_status" : "String", // RUNNING_INDEX_BUILD, FAILED_INDEX_BUILD, COMPLETED_INDEX_BUILD
    "index_path" : "String" // If completed
}

Cancel Job

POST /_cancel/{job_id}

Request Response:
{
    "status" : "String" // Success/failure
}

The API contract guarantees that the client can assume a build no longer needs to be tracked after a successful call to /_cancel, and subsequent status calls should return that the build does not exist. This can be used to differentiate between a canceled and a failed build, the latter returning the FAILED_INDEX_BUILD status.

Authentication

If authentication is configured:

Authentication:
- Endpoints must support Basic Authentication
- Username and password must be provided in the Authorization header
- 401 Unauthorized response if credentials are missing or invalid

Operations

Settings

All first-iteration settings will be cluster-level.

  • knn.remote_build_service.endpoint (List of strings) - User specified endpoint(s) for index build server
    • This parameter is null-checked (and potentially health checked as a future improvement) by the index builder before vectors are even uploaded to avoid wasteful upload.
    • Setting is implemented as a list of strings to allow round robin task assignment in a multi-server setup.
    • In future iterations we can add this as an index setting to allow users to utilize different GPU fleets per index.

Authentication can be optionally set by the user in the OpenSearch Keystore

  • knn.remote_build_service.password
    • Example: ./bin/opensearch-keystore add knn.remote_build_service.password
  • If username/password are not configured, the RemoteIndexClient will continue without authentication. Note that this configuration has to be consistent across any/all build services — if these settings are set, all requests will be sent with these authentication parameters.

The following settings will have strong defaults supported by benchmarking, but can be left unadvertised to avoid exposing too many knobs (still available to tweak for advanced use cases):

  • knn.remote_build_service.poll_interval
    • We can also dynamically set this per build instead of using the cluster setting if we find in benchmarking that there is an efficient build size/polling interval relationship.
  • knn.remote_build_service.timeout
    • Same as above.

Security

The build service will start by supporting Basic Authentication (user:pass) that is sent from the OpenSearch Keystore. In the future, we can add different credentials providers to our HTTP client.


Metrics

The client will collect the following metrics for KNNStats which can be listed under a remote_build JSON level that is only published if the feature flag is enabled.

  • Status Request Success/Failure Count (using the HTTP client’s metric collection)
    • status_request_success_count
    • status_request_failure_count
  • Overall Index Build Success/Failure Count (tracked by client)
    • index_build_success_count
    • index_build_failure_count
  • Index Build Timeout Count (tracked by client)
    • index_build_timeout_count
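A minimal sketch of the client-tracked counters; the field names mirror the metric names above, and the wiring into KNNStats is omitted:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: thread-safe counters for the remote_build stats group.
// A real implementation would register these with KNNStats and only
// publish them when the feature flag is enabled.
final class RemoteBuildMetrics {
    final AtomicLong statusRequestSuccessCount = new AtomicLong();
    final AtomicLong statusRequestFailureCount = new AtomicLong();
    final AtomicLong indexBuildSuccessCount = new AtomicLong();
    final AtomicLong indexBuildFailureCount = new AtomicLong();
    final AtomicLong indexBuildTimeoutCount = new AtomicLong();
}
```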

From Jay’s LLD:

since k-nn stats today only supports node level stats, in the first version we will provide only node level stats and users will need to reference the node level stats to understand if the remote vector index build feature is being successfully used. Shard/index level metrics will be explored as a part of Future Improvements / Below The Line


Future Improvements / Below The Line

  • We will write an interface that is API-agnostic to allow future support for gRPC or other protocols
  • Improved polling mechanisms
    • Use historical index build data to better set initial delay and polling interval
    • Index/shard level intervals (instead of cluster setting)
    • Use 3rd party library instead of implementing polling logic from scratch
  • Expanded Auth support
    • Implementers can make a new AuthenticatedBuildRequest class to include an API key or any other custom auth headers
  • Health check endpoint
    • To add more context to failures and to avoid wasteful data uploads, check the health of the build service endpoint before uploading/sending build requests
  • Expose a retry remote index build API to user
    • If the endpoint is wrong, the user should be able to update their cluster setting and retry the remote index build using the job id from logs
  • Round robin task assignment over list of build service endpoints
  • Client library module
    • Extract client code into a separate library module to separate client concerns from plugin

Sources

REST API error codes and their descriptions were taken from https://www.restapitutorial.com/advanced/responses/retries.

Follow-up comment from owenhalpert (Contributor, Author):

In implementation, I found that using the Apache HTTP client through the AWS SDK is too restrictive — the SDK doesn't expose all the configurations of the client. Furthermore, the mentioned retry and metrics benefits of the SDK are tied to the service clients (like S3Client), whereas we'd just be using the underlying Apache client without these benefits. For this reason, I am still choosing to use the Apache client, but through a direct dependency on the library, specifically the 5.x version in line with OpenSearch Core. Without this, I would've had to implement a retry strategy from scratch, whereas now we can take advantage of Apache's DefaultHttpRequestRetryStrategy as one example.
