Differentiate back-off exceptions from 'real' application errors in Listener Micrometer timer metrics for retry topics #2237
Comments
Hi @theopigott, thanks for bringing this up. It should be easy enough to simply ignore … I'm wondering, though, if it'd be interesting to have separate metrics of the … What do you think? Thanks.
Thanks @tomazfernandes, yes, that makes sense as an easy fix. It could indeed also be interesting to record metrics based on the …
Makes sense, @theopigott, thanks. Is that something you'd be interested in contributing a PR for? If that's OK with @garyrussell, of course.
Given that it's a behavior change, it would have to be optional (e.g. …). Adding highly variable tags to meters (e.g. the actual back-off time) is not recommended; some metrics back-ends don't handle such variability well.
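If the separate back-off outcome were opt-in, the switch could be as small as one flag in whatever computes the `result` tag. This is a plain-Java sketch with hypothetical names: `BackOffSignal` stands in for `KafkaBackoffException`, and `ResultTagger` is not part of spring-kafka or Micrometer. It also keeps the tag to a fixed set of three values, so opting in adds exactly one new tag value rather than anything high-cardinality.

```java
import java.util.Set;

/** Hypothetical stand-in for spring-kafka's KafkaBackoffException. */
class BackOffSignal extends RuntimeException {
    BackOffSignal(String message) {
        super(message);
    }
}

/** Hypothetical opt-in classifier for the "result" tag (not a real API). */
class ResultTagger {
    // The complete, bounded set of values this tag can ever take.
    static final Set<String> ALLOWED = Set.of("success", "failure", "backoff");

    // Off by default, so existing dashboards keep seeing back-offs as "failure".
    private final boolean separateBackOffs;

    ResultTagger(boolean separateBackOffs) {
        this.separateBackOffs = separateBackOffs;
    }

    /** Returns the tag value for one listener invocation; error is null on success. */
    String resultTag(Throwable error) {
        if (error == null) {
            return "success";
        }
        if (separateBackOffs && isBackOff(error)) {
            return "backoff";
        }
        return "failure";
    }

    // Walk the cause chain, since the framework may wrap the back-off exception.
    private static boolean isBackOff(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof BackOffSignal) {
                return true;
            }
        }
        return false;
    }
}
```

With the flag off, a back-off still reports `failure`, which preserves the current behavior for anyone relying on it.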
I think that the back-off time would be the value of a metric rather than a tag. Then the 'count' would be the number of back-offs that happened, and the 'sum' (for example) would be the total time spent sleeping due to back-offs. But would it make sense to first skip treating …
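To make the count/sum idea concrete: each back-off contributes one observation whose value is its duration, so the meter's count is the number of back-offs and its total is the time spent backing off, while the tags stay fixed. This is a dependency-free sketch using a hypothetical `BackOffRecorder` class; in Micrometer a `Timer` plays this role (its `count()` and `totalTime(TimeUnit)` expose the same two numbers), but how spring-kafka would wire such a meter up is not settled in this thread.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical recorder: the back-off duration is the measured value,
 * never a tag, so the number of time series does not grow with it.
 */
class BackOffRecorder {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();
    // Longest single back-off observed so far.
    private final AtomicLong maxNanos = new AtomicLong();

    void record(Duration backOff) {
        long nanos = backOff.toNanos();
        count.increment();
        totalNanos.add(nanos);
        maxNanos.accumulateAndGet(nanos, Math::max);
    }

    /** Number of back-offs that happened (the "count"). */
    long count() {
        return count.sum();
    }

    /** Total time spent backing off (the "sum"). */
    Duration total() {
        return Duration.ofNanos(totalNanos.sum());
    }

    Duration max() {
        return Duration.ofNanos(maxNanos.get());
    }
}
```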
Makes sense; thanks.
Expected Behavior
As per the docs on 'Monitoring Listener Performance', there are Micrometer timers called `spring.kafka.listener` which are tagged with a `result` (`success` or `failure`) and `exception`. I would expect the metrics generated with the `failure` tag to capture true failures (e.g. an `IOException` from some resource that is used to process records). Any back-off exceptions, which are expected to occur for topics with a delay configured, should be treated separately, e.g. with a different tag value for `result` or `exception`.
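One way to get a meaningful `exception` tag is to look past the framework's wrapper to its cause. This sketch assumes the wrapper always carries the real exception as its cause; `WrappedListenerException` is a hypothetical stand-in for `ListenerExecutionFailedException`, and `ExceptionTagger` is illustrative only. Tagging by exception class keeps cardinality bounded by the number of distinct exception types, not by individual messages or values.

```java
/** Hypothetical stand-in for spring-kafka's ListenerExecutionFailedException. */
class WrappedListenerException extends RuntimeException {
    WrappedListenerException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical helper that derives the "exception" tag value. */
class ExceptionTagger {
    /**
     * Returns the simple class name of the first throwable in the cause chain
     * that is not the wrapper, so the tag reflects what actually went wrong.
     */
    static String exceptionTag(Throwable error) {
        if (error == null) {
            return "none"; // placeholder for "no exception", chosen for this sketch
        }
        Throwable t = error;
        // Unwrap (possibly nested) wrappers, stopping if a wrapper has no cause.
        while (t instanceof WrappedListenerException && t.getCause() != null) {
            t = t.getCause();
        }
        return t.getClass().getSimpleName();
    }
}
```

With this, an `IOException` from a resource surfaces as `IOException` instead of the wrapper's name, and a back-off exception would surface under its own class name.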
Current Behavior
A `failure` timer is recorded whenever a `RuntimeException` occurs while processing a record. When dealing with retry topics, this includes a `KafkaBackoffException`, which may be thrown inside `invokeOnMessage` (or the batch equivalent) when the listener determines that the timestamp of the latest record is not ready to be processed yet. The `exception` is always recorded as `ListenerExecutionFailedException`, so there is no way to differentiate back-off exceptions from other exceptions.

Context
I would like to analyze the listener metrics to gain insight into failures (how often they happen, the performance impact, etc.), but I'm interested in application logic failures (e.g. the database is unavailable) rather than expected framework-level failures (back-off exceptions). I was surprised to see my metrics indicating many failures despite the application logs showing that all records were successfully processed, until I realized that the failures must actually be due to these `KafkaBackoffException`s.

I could implement my own timers/metrics inside my `KafkaListener`, but I would prefer to be able to use the existing timers that are provided by the framework.
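For comparison, the do-it-yourself alternative could look roughly like this. It assumes that back-offs are raised by the framework before the listener method body runs (the issue places the `KafkaBackoffException` in `invokeOnMessage`), so a timer inside the body only ever observes application outcomes. `ListenerBodyTimer` and its methods are hypothetical and dependency-free; a real application would more likely back these counters with Micrometer meters.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical timer wrapped around the application part of a listener. */
class ListenerBodyTimer {
    private final LongAdder successCount = new LongAdder();
    private final LongAdder successNanos = new LongAdder();
    // Failure counts keyed by exception simple class name, e.g. "IOException".
    private final Map<String, LongAdder> failureCounts = new ConcurrentHashMap<>();

    <T> T time(Callable<T> body) throws Exception {
        long start = System.nanoTime();
        try {
            T result = body.call();
            successCount.increment();
            successNanos.add(System.nanoTime() - start);
            return result;
        } catch (Exception e) {
            failureCounts
                    .computeIfAbsent(e.getClass().getSimpleName(), k -> new LongAdder())
                    .increment();
            throw e; // rethrow so the container's error handling still applies
        }
    }

    long successCount() {
        return successCount.sum();
    }

    long failureCount(String exceptionName) {
        LongAdder adder = failureCounts.get(exceptionName);
        return adder == null ? 0 : adder.sum();
    }
}
```

A listener method would wrap its processing in something like `timer.time(() -> process(record))`, where `process` is a hypothetical application handler. The drawback, as noted above, is duplicating what the framework's timers already measure.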