Enhance the instrumentation events to support metrics collection #1706

Open
ilia-beliaev-miro opened this issue Jul 17, 2024 · 1 comment

@ilia-beliaev-miro

Is your feature request related to a problem? Please describe.
We use different languages in different services to connect to Kafka for consuming and producing messages. For all of these clients we want a single observability dashboard showing the same metrics from every service.

The Java Kafka client library exposes many metrics, and we were able to collect some of them in kafkajs thanks to the instrumentation events. However, in the current state of kafkajs some of them are very difficult (requiring runtime overrides of many different functions or classes) or impossible to collect, because the information they need is not exposed via instrumentation events. In the Java client, metrics collection is part of the library itself.
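For context, this is roughly how we collect metrics from the events that do exist today. The Prometheus client (prom-client) and the metric name are our own choices, not part of kafkajs; the sketch only relies on the documented REQUEST consumer event.

```js
const { Kafka } = require('kafkajs')
const { Histogram } = require('prom-client')

// Our own histogram; the metric name is arbitrary and not defined by kafkajs.
const requestDuration = new Histogram({
  name: 'kafka_request_duration_ms',
  help: 'Duration of Kafka protocol requests reported by the REQUEST event',
  labelNames: ['apiName'],
})

const kafka = new Kafka({ clientId: 'metrics-example', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'example-group' })

// REQUEST fires for every protocol request and carries apiName and duration.
consumer.on(consumer.events.REQUEST, ({ payload }) => {
  requestDuration.labels(payload.apiName).observe(payload.duration)
})
```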

Describe the solution you'd like
To collect these metrics, either existing instrumentation events need to be enhanced or new events need to be added (a hypothetical sketch follows the list below).
Metrics that we can't collect at the moment:

  • failed_authentication_rate - should be collected after a failed authentication attempt
  • failed_reauthentication_rate - should be collected when re-authentication takes place
  • fetch_manager_fetch_throttle_time_avg - should be collected by observing the value throttleTimeMs or clientSideThrottleTime, but they are not exposed at the moment
  • fetch_manager_fetch_throttle_time_max - same as the previous one
  • coordinator_failed_rebalance_total - ??
  • fetch_manager_records_consumed_rate - probably should be collected in decodeRecord function
  • fetch_manager_bytes_consumed_rate - same as the above
  • coordinator_commit_latency_avg - ??
  • fetch_manager_records_lag - ??
  • time_between_poll_avg - ??
  • poll_idle_ratio_avg - ??
  • coordinator_assigned_partitions - ??
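To make the kind of enhancement we are asking for concrete, here is a purely hypothetical sketch for the throttle-time metrics, continuing the consumer from the sketch above: if the existing FETCH event exposed a throttleTimeMs field (it does not today), fetch_manager_fetch_throttle_time_avg/max could be derived from it without any runtime overrides.

```js
const { Summary } = require('prom-client')

// Hypothetical: throttleTimeMs is NOT part of the current FETCH payload.
// This only illustrates what an enhanced event would let us do.
const fetchThrottleTime = new Summary({
  name: 'kafka_fetch_throttle_time_ms',
  help: 'Fetch throttle time reported by the broker',
})

consumer.on(consumer.events.FETCH, ({ payload }) => {
  if (typeof payload.throttleTimeMs === 'number') {
    fetchThrottleTime.observe(payload.throttleTimeMs)
  }
})
```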

Additional context

@2m

2m commented Sep 24, 2024

We are also in a similar situation (microservices written in Java and Node) and are working on unified dashboards. Our dashboards only have three graphs for Kafka: message consume/produce rate and consumer lag.

Currently we show the batch (as opposed to message) consume/produce rate, which is a bit different from message rates but is still fine.
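For reference, this is roughly how the batch consume rate can be derived from the documented END_BATCH_PROCESS event; the Prometheus counter and its name are our own choices, not part of kafkajs.

```js
const { Kafka } = require('kafkajs')
const { Counter } = require('prom-client')

const kafka = new Kafka({ clientId: 'dashboard-metrics', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'example-group' })

// Our own counter, incremented once per processed batch.
const batchesConsumed = new Counter({
  name: 'kafka_consumer_batches_total',
  help: 'Batches processed by the consumer',
  labelNames: ['topic', 'partition'],
})

consumer.on(consumer.events.END_BATCH_PROCESS, ({ payload }) => {
  batchesConsumed.labels(payload.topic, String(payload.partition)).inc()
})
```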

The one metric we miss the most is fetch_manager_records_lag, which shows whether there are any outstanding records after the last fetch. It is especially useful for high-throughput services: in that case the broker always reports consumer lag, because there are always messages in flight, which makes the broker-side metric cumbersome to use for scaling decisions.

Tracking fetch_manager_records_lag on the client side is therefore needed to notice when the consumer cannot keep up with the incoming messages.
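A rough client-side approximation is possible today from the offsetLag field of END_BATCH_PROCESS, but it only updates when a batch is actually processed, so it is not a substitute for a real fetch_manager_records_lag. A minimal sketch, reusing the consumer from the previous snippet and with our own gauge name:

```js
const { Gauge } = require('prom-client')

// Our own gauge. offsetLag is only observed when a batch finishes processing,
// so an idle or stalled consumer never refreshes it - hence only an approximation.
const recordsLag = new Gauge({
  name: 'kafka_consumer_records_lag',
  help: 'Approximate per-partition lag after the last processed batch',
  labelNames: ['topic', 'partition'],
})

consumer.on(consumer.events.END_BATCH_PROCESS, ({ payload }) => {
  recordsLag.labels(payload.topic, String(payload.partition)).set(Number(payload.offsetLag))
})
```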
