Metric stream optimizations #152
Comments
IMO it makes more sense to set a limit on the maximum number of metric definitions on the Agent, and drop additional ones over the limit.
Agree. So instead of referencing an already-sent metric by name, maybe it's better to return a unique ID to the client. The client can choose to send the ID instead of the full metric definition in an open stream. The ID should be unique in terms of (node, resource, metric definition), so that even if two metrics have the same name on two nodes, their IDs are guaranteed to be different.
Are the cached metrics kept after a stream is closed? Since the metrics are cached per stream, there is already separation to prevent name clashes, so I wouldn't expect a need for a separate ID.
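To make this concrete, here is a minimal Go sketch of a per-stream definition cache that combines both ideas above, a hard cap and stream-local IDs. All names (`MetricDefinition`, `defCache`, `intern`) are hypothetical, not part of any actual protocol or implementation:

```go
package sketch

// Hypothetical per-stream cache of metric definitions with a size cap.
// The receiver assigns a compact stream-local ID on first sight and
// refuses to cache definitions beyond the configured limit.

type MetricDefinition struct {
	Name      string
	LabelKeys []string
	// ... further metadata (unit, description, type)
}

type defCache struct {
	limit  int
	byName map[string]uint32 // definition name -> stream-local ID
	defs   []MetricDefinition
}

func newDefCache(limit int) *defCache {
	return &defCache{limit: limit, byName: make(map[string]uint32)}
}

// intern returns the stream-local ID for def, assigning a fresh one on first
// sight. ok is false once the cap is reached, i.e. the definition is dropped
// from caching and must keep being sent in full.
func (c *defCache) intern(def MetricDefinition) (id uint32, ok bool) {
	if id, seen := c.byName[def.Name]; seen {
		return id, true
	}
	if len(c.defs) >= c.limit {
		return 0, false
	}
	id = uint32(len(c.defs))
	c.defs = append(c.defs, def)
	c.byName[def.Name] = id
	return id, true
}
```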
Protobuf is not a particularly cheap format for sending bulk data in terms of bandwidth, CPU and allocations. For metric data, the overhead is particularly high since the actual datapoint is often merely 16 bytes in size.
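To put that in perspective: a plain numeric sample is essentially a 64-bit timestamp plus a 64-bit value, i.e. exactly those 16 bytes, while the enclosing message with metric name, label keys and values, and resource attributes can easily be an order of magnitude larger on the wire.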
Currently there are several optimizations in the agent protocol to mitigate that:

1. Node and Resource information is only sent on an open stream when it changes, rather than with every message.
2. Metric definitions are cached per stream and subsequently referenced by name instead of being re-sent in full.
For 1., things are relatively simple, since per open stream the receiver merely has to hold the most recent item in memory. Each change in Node or Resource is likely followed by a significant batch of metrics, which makes the cost amortize well.
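As an illustration, the receiver-side state for 1. could be as small as this; `Node`, `Resource`, and `ExportRequest` are stand-in types for this sketch, not the real protocol messages:

```go
package sketch

// Stand-in types for illustration only; the real message shapes live in the
// agent protocol.
type Node struct{ /* identity of the reporting process/host */ }
type Resource struct{ /* resource attributes */ }

type ExportRequest struct {
	Node     *Node     // set only on the first message or when it changed
	Resource *Resource // likewise
	// Metrics elided
}

// streamState is all the receiver must hold per open stream: the most recent
// Node and Resource, against which every incoming batch is interpreted.
type streamState struct {
	node     *Node
	resource *Resource
}

func (s *streamState) apply(msg *ExportRequest) {
	if msg.Node != nil {
		s.node = msg.Node
	}
	if msg.Resource != nil {
		s.resource = msg.Resource
	}
	// All metrics in msg are now attributed to (s.node, s.resource).
}
```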
For 2., I see a few concerns/questions.
All this still leaves each time series entry in the protocol specifying the label values for its metric. While those are not nearly as expensive, they are still a notable overhead.
Are we generally open to potentially optimizing those as well via stream state?
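If so, one possible shape: label value tuples interned per stream in the same fashion as metric definitions. A hypothetical sender-side sketch, with all names illustrative:

```go
package sketch

import "strings"

// labelValueCache interns distinct tuples of label values per stream, so a
// time series entry can carry a small stream-local ID instead of the full
// strings after the first occurrence. Purely illustrative.
type labelValueCache struct {
	ids map[string]uint32
}

func newLabelValueCache() *labelValueCache {
	return &labelValueCache{ids: make(map[string]uint32)}
}

// idFor returns (id, true) if the tuple was already sent on this stream;
// otherwise it assigns a fresh ID and returns (id, false), telling the caller
// to send the full values alongside the ID this one time.
func (c *labelValueCache) idFor(values []string) (uint32, bool) {
	key := strings.Join(values, "\xff") // separator assumed absent from values
	if id, ok := c.ids[key]; ok {
		return id, true
	}
	id := uint32(len(c.ids))
	c.ids[key] = id
	return id, false
}
```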
All bandwidth savings aside, if the backend ultimately receives the full set of
(node, resource, metric definition, label values)
again for each sample, the underlying TSDB has to do a full index lookup on all those properties to find the right series to write the sample to, which is extremely expensive. In an ideal world, it would be great to have a unique identifier per series within a stream that can be propagated all the way to the exporter, which may then write samples directly by primary key (after the first one).
For reference, this ultimately brought down Prometheus's CPU usage by up to 80% after implementing a bunch of custom decoding hacks. But clean support at the protocol level would of course be greatly preferable.
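Sketched below is roughly what such a primary-key fast path could look like at the exporter/TSDB boundary; `SeriesRef`, `LookupOrCreate`, and `Append` are hypothetical names (loosely inspired by how Prometheus's appender caches series references), not an existing API:

```go
package sketch

// SeriesRef stands in for a TSDB-internal primary key of a series.
type SeriesRef uint64

type tsdb interface {
	// LookupOrCreate resolves the full (node, resource, metric definition,
	// label values) key via the index - the expensive path.
	LookupOrCreate(fullKey string) SeriesRef
	// Append writes a sample directly by primary key - the cheap path.
	Append(ref SeriesRef, ts int64, value float64)
}

// streamWriter remembers, per stream-local series ID, the resolved primary
// key, so only the first sample of each series pays the index lookup.
type streamWriter struct {
	db   tsdb
	refs map[uint32]SeriesRef
}

func (w *streamWriter) write(seriesID uint32, fullKey string, ts int64, v float64) {
	ref, ok := w.refs[seriesID]
	if !ok {
		ref = w.db.LookupOrCreate(fullKey) // full index lookup, once per series
		w.refs[seriesID] = ref
	}
	w.db.Append(ref, ts, v) // subsequent samples skip the index entirely
}
```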
Understandably, there's a complexity/benefit ratio to consider here. But is there general interest in investigating this further?