Metric stream optimizations #152
Comments
IMO it makes more sense to set a limit on the maximum number of metric definitions on the Agent, and drop additional ones over the limit.
Agree. So instead of referencing an already-sent metric by name, maybe it's better to return a unique ID to the client. The client can choose to send the ID instead of the full metric definition in an open stream. The ID should be unique in terms of (node, resource, metric definition), so that even if two metrics have the same name on two nodes, their IDs are guaranteed to be different.
Are the cached metrics kept after a stream is closed? Since the metrics are cached per stream, there is already separation to prevent name clashes, so I wouldn't expect a need for a separate ID.
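To make this concrete, here is a minimal Go sketch of a per-stream definition cache that combines both ideas above, a hard cap and stream-local IDs. All names (`MetricDefinition`, `defCache`, `intern`) are hypothetical, not part of any actual protocol or implementation:

```go
package sketch

// Hypothetical per-stream cache of metric definitions with a size cap.
// The receiver assigns a compact stream-local ID on first sight and
// refuses to cache definitions beyond the configured limit.

type MetricDefinition struct {
	Name      string
	LabelKeys []string
	// ... further metadata (unit, description, type)
}

type defCache struct {
	limit  int
	byName map[string]uint32 // definition name -> stream-local ID
	defs   []MetricDefinition
}

func newDefCache(limit int) *defCache {
	return &defCache{limit: limit, byName: make(map[string]uint32)}
}

// intern returns the stream-local ID for def, assigning a fresh one on first
// sight. ok is false once the cap is reached, i.e. the definition is dropped
// from caching and must keep being sent in full.
func (c *defCache) intern(def MetricDefinition) (id uint32, ok bool) {
	if id, seen := c.byName[def.Name]; seen {
		return id, true
	}
	if len(c.defs) >= c.limit {
		return 0, false
	}
	id = uint32(len(c.defs))
	c.defs = append(c.defs, def)
	c.byName[def.Name] = id
	return id, true
}
```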
Protobuf is not a particularly cheap format for sending bulk data in terms of bandwidth, CPU and allocations. For metric data, the overhead is particularly high since the actual datapoint is often merely 16 bytes in size.
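To put that in perspective: a plain numeric sample is essentially a 64-bit timestamp plus a 64-bit value, i.e. exactly those 16 bytes, while the enclosing message with metric name, label keys and values, and resource attributes can easily be an order of magnitude larger on the wire.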
Currently there are several optimizations in the agent protocol to mitigate that:

1. Node and Resource information is only sent on an open stream when it changes, rather than with every message.
2. Metric definitions are cached per stream and subsequently referenced by name instead of being re-sent in full.
For 1., things are relatively simple, since per open stream the receiver merely has to hold the most recent item in memory. Each change in Node or Resource is likely followed by a significant batch of metrics, which makes the cost amortize well.
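As an illustration, the receiver-side state for 1. could be as small as this; `Node`, `Resource`, and `ExportRequest` are stand-in types for this sketch, not the real protocol messages:

```go
package sketch

// Stand-in types for illustration only; the real message shapes live in the
// agent protocol.
type Node struct{ /* identity of the reporting process/host */ }
type Resource struct{ /* resource attributes */ }

type ExportRequest struct {
	Node     *Node     // set only on the first message or when it changed
	Resource *Resource // likewise
	// Metrics elided
}

// streamState is all the receiver must hold per open stream: the most recent
// Node and Resource, against which every incoming batch is interpreted.
type streamState struct {
	node     *Node
	resource *Resource
}

func (s *streamState) apply(msg *ExportRequest) {
	if msg.Node != nil {
		s.node = msg.Node
	}
	if msg.Resource != nil {
		s.resource = msg.Resource
	}
	// All metrics in msg are now attributed to (s.node, s.resource).
}
```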
For 2., I see a few concerns/questions.
All this still leaves each time series entry in the protocol specifying the label values for its metric. While those are not nearly as expensive, they are still a notable overhead.
Are we generally open to potentially optimizing those as well via stream state?
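If so, one possible shape: label value tuples interned per stream in the same fashion as metric definitions. A hypothetical sender-side sketch, with all names illustrative:

```go
package sketch

import "strings"

// labelValueCache interns distinct tuples of label values per stream, so a
// time series entry can carry a small stream-local ID instead of the full
// strings after the first occurrence. Purely illustrative.
type labelValueCache struct {
	ids map[string]uint32
}

func newLabelValueCache() *labelValueCache {
	return &labelValueCache{ids: make(map[string]uint32)}
}

// idFor returns (id, true) if the tuple was already sent on this stream;
// otherwise it assigns a fresh ID and returns (id, false), telling the caller
// to send the full values alongside the ID this one time.
func (c *labelValueCache) idFor(values []string) (uint32, bool) {
	key := strings.Join(values, "\xff") // separator assumed absent from values
	if id, ok := c.ids[key]; ok {
		return id, true
	}
	id := uint32(len(c.ids))
	c.ids[key] = id
	return id, false
}
```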
All bandwidth savings aside, if the backend ultimately receives the full set of
(node, resource, metric definition, label values)
again for each sample, the underlying TSDB has to do a full index lookup on all those properties to find the right series to write the sample to, which is extremely expensive. In an ideal world, it would be great to have a unique identifier per series within a stream that can be propagated all the way to the exporter, which may then write samples directly by primary key (after the first one).
For reference, this ultimately brought down Prometheus's CPU usage by up to 80% after implementing a bunch of custom decoding hacks. But clean support at the protocol level would of course be greatly preferable.
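Sketched below is roughly what such a primary-key fast path could look like at the exporter/TSDB boundary; `SeriesRef`, `LookupOrCreate`, and `Append` are hypothetical names (loosely inspired by how Prometheus's appender caches series references), not an existing API:

```go
package sketch

// SeriesRef stands in for a TSDB-internal primary key of a series.
type SeriesRef uint64

type tsdb interface {
	// LookupOrCreate resolves the full (node, resource, metric definition,
	// label values) key via the index - the expensive path.
	LookupOrCreate(fullKey string) SeriesRef
	// Append writes a sample directly by primary key - the cheap path.
	Append(ref SeriesRef, ts int64, value float64)
}

// streamWriter remembers, per stream-local series ID, the resolved primary
// key, so only the first sample of each series pays the index lookup.
type streamWriter struct {
	db   tsdb
	refs map[uint32]SeriesRef
}

func (w *streamWriter) write(seriesID uint32, fullKey string, ts int64, v float64) {
	ref, ok := w.refs[seriesID]
	if !ok {
		ref = w.db.LookupOrCreate(fullKey) // full index lookup, once per series
		w.refs[seriesID] = ref
	}
	w.db.Append(ref, ts, v) // subsequent samples skip the index entirely
}
```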
Understandably, there's a complexity/benefit ratio to consider here. But is there general interest in investigating this further?