Inference Protocols and APIs

Clients can communicate with Triton using an HTTP/REST protocol, a GRPC protocol, or an in-process C API or its C++ wrapper.

HTTP/REST and GRPC Protocols

Triton exposes both HTTP/REST and GRPC endpoints based on the standard inference protocols proposed by the KServe project. To fully enable all of its capabilities, Triton also implements HTTP/REST and GRPC extensions to the KServe inference protocol.

The GRPC protocol also provides a bi-directional streaming version of the inference RPC, which allows a sequence of inference requests and responses to be sent over a single GRPC stream. We typically recommend the unary version for inference requests; use the streaming version only when the situation demands it. Such use cases include:

  • Assume a system with multiple Triton server instances running behind a load balancer. If a sequence of inference requests must reach the same Triton server instance, a GRPC stream holds a single connection for its whole lifetime and so ensures that the requests are delivered to the same Triton instance.
  • If the order of requests/responses needs to be preserved over the network, a GRPC stream will ensure that the server receives the requests in the same order as they were sent from the client.

The HTTP/REST and GRPC protocols also provide endpoints to check server and model health, metadata, and statistics. Additional endpoints allow model loading and unloading, and inference. See the KServe and extension documentation for details.
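As a rough illustration of the health, metadata, statistics, and inference endpoints mentioned above, the sketch below assembles HTTP URLs from the path templates used by the KServe v2 protocol and Triton's statistics extension. The base URL (8000 is Triton's default HTTP port), the helper names, and the model name are assumptions for illustration only; the KServe and extension documentation remain the authoritative reference for the paths.

```python
from typing import Optional

# Assumed base URL; adjust host and port for your deployment.
BASE_URL = "http://localhost:8000"


def model_path(model: str, version: Optional[str] = None) -> str:
    """Return the v2 path prefix for a model, optionally pinned to a version."""
    path = f"/v2/models/{model}"
    if version is not None:
        path += f"/versions/{version}"
    return path


def endpoint_urls(model: str, version: Optional[str] = None) -> dict:
    """Collect health, metadata, statistics, and inference URLs for one model."""
    prefix = BASE_URL + model_path(model, version)
    return {
        "server_live": f"{BASE_URL}/v2/health/live",
        "server_ready": f"{BASE_URL}/v2/health/ready",
        "server_metadata": f"{BASE_URL}/v2",
        "model_ready": f"{prefix}/ready",
        "model_metadata": prefix,
        "model_stats": f"{prefix}/stats",
        "infer": f"{prefix}/infer",
    }


# "my_model" is a hypothetical model name.
for name, url in endpoint_urls("my_model").items():
    print(f"{name}: {url}")
```

The health and metadata URLs are intended for GET requests, while the inference URL expects a POST with a request body.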

HTTP Options

Triton provides the following configuration options for server-client network transactions over the HTTP protocol.

Compression

Triton allows on-wire compression of HTTP requests and responses through its clients. See HTTP Compression for more details.
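To make "on-wire compression" concrete, the following sketch shows what it amounts to at the HTTP level: the request body is gzip-compressed and standard HTTP headers describe the encoding. Triton's own clients handle this for you; the helper name, the tensor name `INPUT0`, and the payload here are hypothetical, and the compression algorithms actually supported are listed in the HTTP Compression documentation.

```python
import gzip
import json


def compress_request(body: dict) -> tuple:
    """Serialize a JSON request body and gzip it, returning (headers, payload)."""
    raw = json.dumps(body).encode("utf-8")
    payload = gzip.compress(raw)
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",  # the request body is gzip-compressed
        "Accept-Encoding": "gzip",   # ask for a gzip-compressed response
    }
    return headers, payload


# Hypothetical inference request in the KServe v2 JSON layout.
request_body = {
    "inputs": [
        {"name": "INPUT0", "shape": [1, 4], "datatype": "FP32",
         "data": [1.0, 2.0, 3.0, 4.0]}
    ]
}
headers, payload = compress_request(request_body)

# The receiver reverses the step before parsing the JSON.
restored = json.loads(gzip.decompress(payload))
```

For small payloads like this one the gzip framing can outweigh the savings, so compression tends to pay off mainly for large tensors.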

Mapping Triton Server Error Codes to HTTP Status Codes

This table maps various Triton Server error codes to their corresponding HTTP status codes. It can be used as a reference guide for understanding how Triton Server errors are handled in HTTP responses.

| Triton Server Error Code | HTTP Status Code | Description |
| --- | --- | --- |
| TRITONSERVER_ERROR_INTERNAL | 500 | Internal Server Error |
| TRITONSERVER_ERROR_NOT_FOUND | 404 | Not Found |
| TRITONSERVER_ERROR_UNAVAILABLE | 503 | Service Unavailable |
| TRITONSERVER_ERROR_UNSUPPORTED | 501 | Not Implemented |
| TRITONSERVER_ERROR_UNKNOWN, TRITONSERVER_ERROR_INVALID_ARG, TRITONSERVER_ERROR_ALREADY_EXISTS, TRITONSERVER_ERROR_CANCELLED | 400 | Bad Request (default for other errors) |
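The mapping can be expressed as a small lookup with a default. This is only a client-side sketch of the table above, not Triton's implementation; `http_status_for` is a hypothetical helper name.

```python
from http import HTTPStatus

# Error codes with a dedicated HTTP status, taken from the table above.
ERROR_TO_HTTP_STATUS = {
    "TRITONSERVER_ERROR_INTERNAL": HTTPStatus.INTERNAL_SERVER_ERROR,
    "TRITONSERVER_ERROR_NOT_FOUND": HTTPStatus.NOT_FOUND,
    "TRITONSERVER_ERROR_UNAVAILABLE": HTTPStatus.SERVICE_UNAVAILABLE,
    "TRITONSERVER_ERROR_UNSUPPORTED": HTTPStatus.NOT_IMPLEMENTED,
}


def http_status_for(error_code: str) -> HTTPStatus:
    """Return the HTTP status for a Triton error code, defaulting to 400."""
    return ERROR_TO_HTTP_STATUS.get(error_code, HTTPStatus.BAD_REQUEST)


for code in ("TRITONSERVER_ERROR_NOT_FOUND", "TRITONSERVER_ERROR_INVALID_ARG"):
    status = http_status_for(code)
    print(f"{code} -> {status.value} {status.phrase}")
```

A monitoring script could use such a lookup in reverse, for example to treat 503 responses as retryable and 400 responses as client errors.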

GRPC Options

Triton exposes various GRPC parameters for configuring the server-client network transactions. For usage of these options, refer to the output from tritonserver --help.

SSL/TLS

These options can be used to configure a secured channel for communication. The server-side options include:

  • --grpc-use-ssl
  • --grpc-use-ssl-mutual
  • --grpc-server-cert
  • --grpc-server-key
  • --grpc-root-cert

For client-side documentation, see Client-Side GRPC SSL/TLS.

For an overview of authentication in gRPC, see the gRPC authentication documentation.
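A launcher script might assemble the server-side SSL/TLS options as in the sketch below. The helper name and certificate paths are hypothetical, and the `=true` value form as well as which options must be combined are assumptions to confirm against the output of `tritonserver --help`.

```python
from typing import List, Optional


def grpc_ssl_args(server_cert: str, server_key: str,
                  root_cert: Optional[str] = None,
                  mutual: bool = False) -> List[str]:
    """Build tritonserver arguments that enable SSL/TLS on the GRPC endpoint."""
    args = [
        "--grpc-use-ssl=true",  # assumed boolean value form
        f"--grpc-server-cert={server_cert}",
        f"--grpc-server-key={server_key}",
    ]
    if root_cert is not None:
        args.append(f"--grpc-root-cert={root_cert}")
    if mutual:
        args.append("--grpc-use-ssl-mutual=true")
    return args


# Hypothetical certificate locations.
command = ["tritonserver"] + grpc_ssl_args(
    "/certs/server.crt", "/certs/server.key", root_cert="/certs/ca.crt"
)
print(" ".join(command))
```

Returning a list rather than a single string lets the arguments be passed to a process launcher without shell quoting concerns.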

Compression

Triton allows on-wire compression of request and response messages by exposing the following server-side option:

  • --grpc-infer-response-compression-level

For client-side documentation, see Client-Side GRPC Compression.

Compression can be used to reduce the amount of bandwidth used in server-client communication. For more details, see gRPC Compression.

GRPC KeepAlive

Triton exposes GRPC KeepAlive parameters, with default values for both client and server as described in the gRPC keepalive documentation.

These options can be used to configure the KeepAlive settings:

  • --grpc-keepalive-time
  • --grpc-keepalive-timeout
  • --grpc-keepalive-permit-without-calls
  • --grpc-http2-max-pings-without-data
  • --grpc-http2-min-recv-ping-interval-without-data
  • --grpc-http2-max-ping-strikes

For client-side documentation, see Client-Side GRPC KeepAlive.

GRPC Status Codes

Triton implements GRPC error handling for streaming requests when a specific flag is enabled through headers. Upon encountering an error, Triton returns the appropriate GRPC error code and subsequently closes the stream.

  • triton_grpc_error : Set this header to true when starting the stream.

GRPC status codes can be used for better visibility and monitoring. For more details, see gRPC Status Codes.

For client-side documentation, see Client-Side GRPC Status Codes.
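The header described above could be prepared on the client as in the following sketch. The helper names are hypothetical; the commented call assumes the `tritonclient` Python package, whose streaming API accepts a `headers` argument, and `on_response` stands for a user-supplied callback.

```python
def stream_error_headers(enabled: bool = True) -> dict:
    """Return the stream headers that switch on GRPC error-code mode."""
    return {"triton_grpc_error": "true"} if enabled else {}


def as_grpc_metadata(headers: dict) -> tuple:
    """Convert a header dict into the key/value tuple form used for GRPC metadata."""
    return tuple(headers.items())


headers = stream_error_headers()
metadata = as_grpc_metadata(headers)

# With tritonclient installed (not imported here), the headers would be
# supplied when the stream is opened, for example:
#   client.start_stream(callback=on_response, headers=headers)
```

Because Triton closes the stream after returning an error in this mode, a client would typically open a new stream before sending further requests.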

Limit Endpoint Access (BETA)

Triton users may want to restrict access to protocols or APIs that are provided by the GRPC or HTTP endpoints of a server. For example, users can provide one set of access credentials for inference APIs and another for model control APIs such as model loading and unloading.

The following options can be specified to declare a restricted protocol group (GRPC) or restricted API group (HTTP):

--grpc-restricted-protocol=<protocol_1>,<protocol_2>,...:<restricted-key>=<restricted-value>
--http-restricted-api=<API_1>,<API_2>,...:<restricted-key>=<restricted-value>

Each option can be specified multiple times to declare multiple groups of protocols or APIs with different restriction settings.

  • protocols / APIs : A comma-separated list of protocols / APIs to be included in this group. Note that currently a given protocol / API is not allowed to be included in multiple groups. The following protocols / APIs are recognized:

  • restricted-key : The GRPC / HTTP request header to be checked when a request is received. For GRPC, the complete header takes the form triton-grpc-protocol-<restricted-key>. For HTTP, the complete header is simply <restricted-key>.

  • restricted-value : The header value required to access the specified protocols.
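The option format and the one-group-per-protocol rule described above can be sketched as a small builder that rejects a protocol or API listed in more than one group. The function names, keys, and values are hypothetical; only the option syntax comes from the text above.

```python
from typing import Iterable, List, Tuple


def restricted_option(flag: str, names: Iterable[str], key: str, value: str) -> str:
    """Format one option as --<flag>=<name_1>,<name_2>,...:<key>=<value>."""
    return f"--{flag}={','.join(names)}:{key}={value}"


def build_restricted_options(flag: str,
                             groups: List[Tuple[List[str], str, str]]) -> List[str]:
    """Format several groups, refusing a protocol / API that appears twice."""
    seen = set()
    options = []
    for names, key, value in groups:
        duplicates = seen.intersection(names)
        if duplicates:
            raise ValueError(f"listed in multiple groups: {sorted(duplicates)}")
        seen.update(names)
        options.append(restricted_option(flag, names, key, value))
    return options


# Hypothetical keys and values for two separate groups.
options = build_restricted_options(
    "grpc-restricted-protocol",
    [
        (["model-repository", "model-config"], "admin-key", "admin-value"),
        (["statistics"], "stats-key", "stats-value"),
    ],
)
print("\n".join(options))
```

Checking for duplicates before launch surfaces a misconfiguration in the launcher script rather than at server startup.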

Example

To start the server with a set of protocols and APIs restricted to admin usage and the rest left unrestricted, use the following command line arguments:

tritonserver --grpc-restricted-protocol=shared-memory,model-config,model-repository,statistics,trace:<admin-key>=<admin-value> \
             --http-restricted-api=shared-memory,model-config,model-repository,statistics,trace:<admin-key>=<admin-value> ...

GRPC requests to admin protocols require an additional header triton-grpc-protocol-<admin-key> with the value <admin-value>. HTTP requests to admin APIs require an additional header <admin-key> with the value <admin-value>.
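On the client side, the header naming rule above might be captured as follows. This is a minimal sketch: `restricted_headers` is a hypothetical helper, and the key and value shown stand in for whatever was configured on the server.

```python
GRPC_HEADER_PREFIX = "triton-grpc-protocol-"


def restricted_headers(transport: str, key: str, value: str) -> dict:
    """Return the extra request header needed to reach a restricted group."""
    if transport == "grpc":
        # GRPC prefixes the configured key.
        return {GRPC_HEADER_PREFIX + key: value}
    if transport == "http":
        # HTTP uses the configured key unchanged.
        return {key: value}
    raise ValueError(f"unknown transport: {transport!r}")


# Hypothetical key and value matching the server configuration.
print(restricted_headers("grpc", "admin-key", "admin-value"))
print(restricted_headers("http", "admin-key", "admin-value"))
```

The resulting dictionary can be merged into the headers a client already sends with each request to a restricted protocol or API.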