Decide what to do when async event queue is full. #262
Comments
IMO a third option is to make this configurable, i.e., users can statically configure whether to (1) drop events or (2) keep all events but block the thread when the buffer is full.
We may need to provide both options in the same binary, e.g. a dropping version for metrics/traces and a blocking version for logging.
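To make the two options above concrete, here is a minimal sketch of a bounded queue whose overflow behavior is chosen at construction time. The names `OverflowPolicy` and `BoundedEventQueue` are hypothetical and are not part of any OpenCensus API; the sketch uses a plain `ArrayBlockingQueue` rather than the ring buffer the actual implementations use.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical policy names for illustration; not an OpenCensus API. */
enum OverflowPolicy { DROP, BLOCK }

/** Bounded event queue whose behavior on overflow is fixed at construction time. */
final class BoundedEventQueue<E> {
  private final BlockingQueue<E> queue;
  private final OverflowPolicy policy;

  BoundedEventQueue(int capacity, OverflowPolicy policy) {
    this.queue = new ArrayBlockingQueue<>(capacity);
    this.policy = policy;
  }

  /** Returns true if the event was enqueued, false if it was dropped. */
  boolean enqueue(E event) {
    if (policy == OverflowPolicy.BLOCK) {
      try {
        queue.put(event); // waits until the consumer frees a slot
        return true;
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return queue.offer(event); // returns false immediately when the queue is full
  }

  /** Removes and returns the oldest event, or null if the queue is empty. */
  E poll() {
    return queue.poll();
  }

  int size() {
    return queue.size();
  }
}
```

With something like this, a single binary could create one queue with `OverflowPolicy.DROP` for span/stats events and another with `OverflowPolicy.BLOCK` for logs.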
Hi, I am the author of the PR (census-instrumentation/opencensus-java#1837) mentioned by @songy23, and CTO of a microservice-based tech company that is considering using OpenCensus/Stackdriver Trace. Thank you for discussing this issue. I think it is feasible to provide an option to choose whether to drop metrics and trace events or to block the application thread to keep them. It also seems natural to provide a separate option to control logging. Are there any other concerns about this issue?
Personally I think we can do a lot of things here:
Any opinion on this?
Also, I think it would be good to count all these dropped events.
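Counting drops could look roughly like the sketch below. `DropCountingQueue` is a hypothetical name used only for illustration, and how the count would be exported (for example as a metric) is left open here.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical non-blocking queue that records how many events it had to drop. */
final class DropCountingQueue<E> {
  private final BlockingQueue<E> queue;
  // LongAdder keeps increments cheap when many producer threads drop at once.
  private final LongAdder dropped = new LongAdder();

  DropCountingQueue(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Never blocks; returns false and bumps the drop counter when the queue is full. */
  boolean offer(E event) {
    boolean accepted = queue.offer(event);
    if (!accepted) {
      dropped.increment();
    }
    return accepted;
  }

  /** Removes and returns the oldest event, or null if the queue is empty. */
  E poll() {
    return queue.poll();
  }

  /** Total number of events dropped so far. */
  long droppedCount() {
    return dropped.sum();
  }
}
```

A counter like this would at least let operators see that data was lost and size the buffer accordingly.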
My opinion:
Regarding 3, we can borrow ideas from how QoS is done in the networking layer.
Stepping back, a question on the queue itself: depending on the scenario we might want a FIFO or LIFO queue, or even a priority queue (or no queue at all). For example, during live site troubleshooting, people might want the latest info instead of having to wait until all traces/logs are flushed from the queue. In that case, if we have to drop some data, dropping old data (the less important data) might make more sense.
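As a rough illustration of the "drop old data" variant, the sketch below evicts the oldest entry when a new event arrives at a full buffer, and lets the consumer read from either end. `DropOldestBuffer` is a hypothetical name, and the coarse `synchronized` locking is chosen for brevity rather than throughput.

```java
import java.util.ArrayDeque;

/** Hypothetical bounded buffer that evicts the oldest event to make room for a new one. */
final class DropOldestBuffer<E> {
  private final ArrayDeque<E> deque;
  private final int capacity;

  DropOldestBuffer(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
    this.deque = new ArrayDeque<>(capacity);
  }

  /** Adds a non-null event; returns the evicted oldest event, or null if nothing was evicted. */
  synchronized E add(E event) {
    E evicted = null;
    if (deque.size() == capacity) {
      evicted = deque.pollFirst(); // oldest entry sits at the head
    }
    deque.addLast(event);
    return evicted;
  }

  /** FIFO consumption: removes the oldest retained event, or returns null if empty. */
  synchronized E pollOldest() {
    return deque.pollFirst();
  }

  /** LIFO consumption: removes the newest event, or returns null if empty. */
  synchronized E pollNewest() {
    return deque.pollLast();
  }

  synchronized int size() {
    return deque.size();
  }
}
```

Producers never block in this variant; the cost is that the data lost under overload is the oldest rather than the newest.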
I like the idea of having priorities in the queue if we were to drop events. For example in Java
From a user's point of view, I strongly agree with the design principle that @reyang mentioned: "monitoring SDK should minimize the impact on the target performance characteristics". This is important for expanding usage of the monitoring tool. To achieve the principle, dropping all events seems a more certain way than disabling some features (z-page / running span list). For example, under overwhelming traffic (such as described in census-instrumentation/opencensus-java#1837 (comment)), not only the running span list but also tracing itself causes blocking. That said, when queue overflow occurs, automatically disabling the z-page / running span list seems a good idea to prevent a memory leak: once the queue is full, running spans in the list will never be closed, and disabling the z-page and clearing the running span list solves that. In addition, the z-page / running span list is not very important for operating such a large system under overwhelming traffic, IMHO.
+1 for making the default behavior to drop spans and reduce functionality rather than slow down the host app
In a few OpenCensus implementations, span events and stats recording events are treated as entries in a ring buffer. Currently, if the buffer is full, producers are blocked. In large-scale systems that generate a lot of events, blocking may be a problem. As an example, see census-instrumentation/opencensus-java#1837 (comment).
Alternatively, we can consider dropping events when the ring buffer is full (which is implemented in census-instrumentation/opencensus-java#1837). However, although it may be fine to drop span/stats events, once we introduce a logging API to OpenCensus it may not be acceptable to drop logs.
This question originates from census-instrumentation/opencensus-java#1809 in Java, but we also had a similar issue in Go (brought up by Istio in istio/istio#9964).
/cc @dinooliva @rghetia @bogdandrutu @c24t