Enhance the instrumentation events to support metrics collection #1706

Open
ilia-beliaev-miro opened this issue Jul 17, 2024 · 1 comment

@ilia-beliaev-miro

Is your feature request related to a problem? Please describe.
We use different languages in different services to connect to Kafka for consuming and producing messages. For all of these clients we want a single observability dashboard showing the same metrics from every service.

The Java Kafka client library exposes many metrics, and we were able to collect some of them in kafkajs thanks to the instrumentation events. However, in the current state of kafkajs some of them are very difficult (requiring runtime overrides of many different functions or classes) or impossible to collect, because the information they need is not exposed via instrumentation events. In the Java client, metrics collection is part of the library itself.
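For context, this is roughly how we collect metrics from the events that do exist today. The Prometheus client (prom-client) and the metric name are our own choices, not part of kafkajs; the sketch only relies on the documented REQUEST consumer event.

```js
const { Kafka } = require('kafkajs')
const { Histogram } = require('prom-client')

// Our own histogram; the metric name is arbitrary and not defined by kafkajs.
const requestDuration = new Histogram({
  name: 'kafka_request_duration_ms',
  help: 'Duration of Kafka protocol requests reported by the REQUEST event',
  labelNames: ['apiName'],
})

const kafka = new Kafka({ clientId: 'metrics-example', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'example-group' })

// REQUEST fires for every protocol request and carries apiName and duration.
consumer.on(consumer.events.REQUEST, ({ payload }) => {
  requestDuration.labels(payload.apiName).observe(payload.duration)
})
```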

Describe the solution you'd like
To collect these metrics, either existing instrumentation events need to be enhanced or new events need to be added (a hypothetical sketch follows the list below).
Metrics that we can't collect at the moment:

  • failed_authentication_rate - should be collected after a failed authentication attempt
  • failed_reauthentication_rate - should be collected when re-authentication takes place
  • fetch_manager_fetch_throttle_time_avg - should be collected by observing the value throttleTimeMs or clientSideThrottleTime, but they are not exposed at the moment
  • fetch_manager_fetch_throttle_time_max - same as the previous one
  • coordinator_failed_rebalance_total - ??
  • fetch_manager_records_consumed_rate - probably should be collected in decodeRecord function
  • fetch_manager_bytes_consumed_rate - same as the above
  • coordinator_commit_latency_avg - ??
  • fetch_manager_records_lag - ??
  • time_between_poll_avg - ??
  • poll_idle_ratio_avg - ??
  • coordinator_assigned_partitions - ??
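To make the kind of enhancement we are asking for concrete, here is a purely hypothetical sketch for the throttle-time metrics, continuing the consumer from the sketch above: if the existing FETCH event exposed a throttleTimeMs field (it does not today), fetch_manager_fetch_throttle_time_avg/max could be derived from it without any runtime overrides.

```js
const { Summary } = require('prom-client')

// Hypothetical: throttleTimeMs is NOT part of the current FETCH payload.
// This only illustrates what an enhanced event would let us do.
const fetchThrottleTime = new Summary({
  name: 'kafka_fetch_throttle_time_ms',
  help: 'Fetch throttle time reported by the broker',
})

consumer.on(consumer.events.FETCH, ({ payload }) => {
  if (typeof payload.throttleTimeMs === 'number') {
    fetchThrottleTime.observe(payload.throttleTimeMs)
  }
})
```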

Additional context

@2m

2m commented Sep 24, 2024

We are also in a similar situation (microservices written in Java and Node) and are working on unified dashboards. Our dashboards only have three graphs for Kafka: message consume/produce rate and consumer lag.

Currently we show the batch (as opposed to message) consume/produce rate, which is a bit different from message rates but is still fine.
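For reference, this is roughly how the batch consume rate can be derived from the documented END_BATCH_PROCESS event; the Prometheus counter and its name are our own choices, not part of kafkajs.

```js
const { Kafka } = require('kafkajs')
const { Counter } = require('prom-client')

const kafka = new Kafka({ clientId: 'dashboard-metrics', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'example-group' })

// Our own counter, incremented once per processed batch.
const batchesConsumed = new Counter({
  name: 'kafka_consumer_batches_total',
  help: 'Batches processed by the consumer',
  labelNames: ['topic', 'partition'],
})

consumer.on(consumer.events.END_BATCH_PROCESS, ({ payload }) => {
  batchesConsumed.labels(payload.topic, String(payload.partition)).inc()
})
```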

The one metric we miss the most is fetch_manager_records_lag, which shows whether there are any outstanding records after the last fetch. It is especially useful for high-throughput services: in that case the broker always reports consumer lag, because there are always messages in flight, which makes the broker-side metric cumbersome to use for scaling decisions.

Tracking fetch_manager_records_lag on the client side is therefore needed to notice when the consumer cannot keep up with the incoming messages.
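A rough client-side approximation is possible today from the offsetLag field of END_BATCH_PROCESS, but it only updates when a batch is actually processed, so it is not a substitute for a real fetch_manager_records_lag. A minimal sketch, reusing the consumer from the previous snippet and with our own gauge name:

```js
const { Gauge } = require('prom-client')

// Our own gauge. offsetLag is only observed when a batch finishes processing,
// so an idle or stalled consumer never refreshes it - hence only an approximation.
const recordsLag = new Gauge({
  name: 'kafka_consumer_records_lag',
  help: 'Approximate per-partition lag after the last processed batch',
  labelNames: ['topic', 'partition'],
})

consumer.on(consumer.events.END_BATCH_PROCESS, ({ payload }) => {
  recordsLag.labels(payload.topic, String(payload.partition)).set(Number(payload.offsetLag))
})
```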
