Decide what to do when async event queue is full. #262

songy23 · 2019-04-23T22:18:06Z

In a few OpenCensus implementations, span events and stats recording events are treated as entries in a ring buffer. Currently, if the buffer is full, producers are blocked. In large-scale systems that generate a lot of events, blocking may be a problem. As an example, see census-instrumentation/opencensus-java#1837 (comment).

Alternatively, we can consider dropping events when the ring buffer is full (which is implemented in census-instrumentation/opencensus-java#1837). However, although it may be fine to drop span/stats events, once we introduce a logging API to OpenCensus, it may not be good to drop logs.

This question originates from census-instrumentation/opencensus-java#1809 in Java, but we also had a similar issue in Go (brought up by Istio in istio/istio#9964).

/cc @dinooliva @rghetia @bogdandrutu @c24t

songy23 · 2019-04-23T22:25:17Z

IMO a third option is to make this configurable, i.e., users can statically configure whether to (1) drop events or (2) keep all events but block the thread when the buffer is full.
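A minimal sketch of what such a static configuration could look like, written in Python for brevity. The names `OverflowPolicy` and `EventQueue` are hypothetical and do not correspond to any existing OpenCensus API; the sketch only relies on the standard library `queue.Queue`.

```python
import queue
from enum import Enum


class OverflowPolicy(Enum):
    """Hypothetical setting chosen once, when the queue is created."""
    DROP = "drop"
    BLOCK = "block"


class EventQueue:
    """Bounded event queue with a statically configured overflow policy."""

    def __init__(self, capacity, policy):
        self._queue = queue.Queue(maxsize=capacity)
        self._policy = policy

    def enqueue(self, event):
        """Return True if the event was queued, False if it was dropped."""
        if self._policy is OverflowPolicy.BLOCK:
            # Blocks the producer thread until a consumer frees a slot.
            self._queue.put(event)
            return True
        try:
            # Never blocks: raises queue.Full when there is no free slot.
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False

    def dequeue(self):
        """Remove and return the oldest queued event (raises queue.Empty if none)."""
        return self._queue.get_nowait()
```

With `OverflowPolicy.DROP`, a full queue makes `enqueue` return `False` immediately; with `OverflowPolicy.BLOCK`, the same call waits for the consumer, which matches the current behavior described above.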

dinooliva · 2019-04-23T22:33:59Z

We may need to provide both options in the same binary, e.g., a dropping version for metrics/traces and a blocking version for logging.

saiya · 2019-04-25T02:47:41Z

Hi, I am the author of the PR (census-instrumentation/opencensus-java#1837) mentioned by @songy23. I am the CTO of a microservice-based tech company, and we are considering using OpenCensus/Stackdriver Trace. Thank you for discussing this issue.

I think it is feasible to provide an option to choose whether to drop metrics and trace events or to block the application thread in order to keep them. It also seems natural to provide a separate option to control logging.

Are there any other concerns about this issue?

bogdandrutu · 2019-04-30T00:50:19Z

Personally I think we can do a lot of things here:

Disable RunningSpanStore if z-pages are not enabled;
Default to blocking (allow disabling this via a config); if blocking is disabled, then there is no z-pages support (easier than what the current PR tries to do).

Any opinion on this?

bogdandrutu · 2019-04-30T00:50:36Z

Also I think it will be good to count all these dropped events.

reyang · 2019-05-03T00:49:44Z

My opinion:

1. Default non-blocking - Design Principle: monitoring SDK should minimize the impact on the target performance characteristics.
2. Need to have a way to know (refer to the ETW design):
   2.1 When we started to drop data.
   2.2 When we have recovered.
3. Optionally provide the metrics on the queue size, flush latency (time taken for an item to be sent out since it entered the queue) and data drop counter.

Regarding 3, we can borrow ideas from how QoS is done in the networking layer.
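To make the drop counter and the "started dropping" / "recovered" signals concrete, here is a rough Python sketch. `MonitoredQueue` and its callbacks are hypothetical names, not an existing API. The sketch assumes a single producer thread (the counter and the state flag are not synchronized) and leaves out flush latency.

```python
import queue


class MonitoredQueue:
    """Non-blocking bounded queue that reports drop transitions and counts drops."""

    def __init__(self, capacity, on_drop_start=None, on_recovered=None):
        self._queue = queue.Queue(maxsize=capacity)
        # Hypothetical hooks; a real SDK might log or record a metric here.
        self._on_drop_start = on_drop_start or (lambda: None)
        self._on_recovered = on_recovered or (lambda: None)
        self._dropping = False
        self.dropped_count = 0

    def enqueue(self, event):
        """Return True if queued, False if dropped because the queue was full."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.dropped_count += 1
            if not self._dropping:
                # First drop after a healthy period: signal once.
                self._dropping = True
                self._on_drop_start()
            return False
        if self._dropping:
            # First successful enqueue after dropping: signal recovery once.
            self._dropping = False
            self._on_recovered()
        return True

    def dequeue(self):
        return self._queue.get_nowait()

    def size(self):
        """Approximate number of queued events, usable as a queue-size metric."""
        return self._queue.qsize()
```

Signaling only on the transitions, rather than on every dropped event, keeps the notification path cheap even while the queue stays saturated.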

reyang · 2019-05-03T21:55:31Z

Stepping back, a question about the queue itself: depending on the scenario, we might want a FIFO or LIFO queue, or even a priority queue (or no queue at all).

For example, during live site troubleshooting, people might want to get the latest info instead of having to wait until all traces/logs are flushed from the queue. In this case, if we have to drop some data, dropping old data (the less important data) might make more sense.
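As an illustration of the "drop old data first" variant, a small Python sketch using `collections.deque` with `maxlen`, which evicts from the opposite end when a new item is appended to a full deque. `DropOldestBuffer` is a hypothetical name, and the sketch is not thread-safe as written.

```python
from collections import deque


class DropOldestBuffer:
    """Bounded buffer that evicts the oldest event to make room for the newest."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)
        self.dropped_count = 0

    def enqueue(self, event):
        if len(self._events) == self._events.maxlen:
            # The append below silently evicts the oldest event, so count it here.
            self.dropped_count += 1
        self._events.append(event)

    def drain(self):
        """Return all buffered events, oldest first, and empty the buffer."""
        events = list(self._events)
        self._events.clear()
        return events
```

The producer never blocks here; under overload the exporter simply sees the most recent `capacity` events.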

songy23 · 2019-05-03T22:05:41Z

I like the idea of having priorities in the queue if we were to drop events. For example, in Java, RegisterSpan and UnregisterSpan are both queue events (https://github.com/census-instrumentation/opencensus-java/blob/0e1c7d0a406ea9784d6016dcd92386504a1e206a/impl_core/src/main/java/io/opencensus/implcore/trace/export/InProcessSampledSpanStoreImpl.java#L302-L344). IMO UnregisterSpan events should not be dropped; otherwise there could be memory leaks. On the other hand, when the event queue becomes full, we probably don't want to add more spans to the in-process store, so it's OK to drop RegisterSpan events.
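A rough sketch of that policy in Python (not the Java implementation linked above): RegisterSpan events are dropped once the queue reaches capacity, while UnregisterSpan events are always accepted. All names are hypothetical. Two assumptions are worth calling out: unregister events may push the queue past its nominal capacity, and the consumer must tolerate an unregister for a span whose register event was dropped.

```python
from collections import deque

REGISTER = "register"
UNREGISTER = "unregister"


class SpanEventQueue:
    """Queue that sheds register events under pressure but never unregister events."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._events = deque()
        self.dropped_registers = 0

    def enqueue(self, kind, span_id):
        """Return True if the event was queued, False if it was dropped."""
        if kind == REGISTER and len(self._events) >= self._capacity:
            # Full: do not add more spans to the in-process store.
            self.dropped_registers += 1
            return False
        # Unregister events are always kept so that stored spans get released.
        self._events.append((kind, span_id))
        return True

    def drain_into(self, running_spans):
        """Apply queued events, in order, to a set of running span ids."""
        while self._events:
            kind, span_id = self._events.popleft()
            if kind == REGISTER:
                running_spans.add(span_id)
            else:
                # discard() is a no-op when the matching register was dropped.
                running_spans.discard(span_id)
```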

saiya · 2019-05-08T07:02:26Z

From a user's point of view, I strongly agree with the Design Principle that @reyang mentioned: "monitoring SDK should minimize the impact on the target performance characteristics". This is important for expanding the usage of monitoring tools.

To achieve this principle, dropping all events seems to be a more reliable way than disabling some features (z-pages / running span list). For example, under overwhelming traffic (such as described in census-instrumentation/opencensus-java#1837 (comment)), not only the running span list but also tracing itself causes blocking.

However, when a queue overflow occurs, automatically disabling z-pages / the running span list seems to be a good idea to prevent memory leaks. When the queue gets full, running spans in the list will never be closed. Disabling z-pages and clearing the running span list solves this issue. In addition, z-pages / the running span list are not that important for operating such a large system under overwhelming traffic, IMHO.

mtwo · 2019-06-26T18:40:24Z

+1 for making the default behavior drop spans and reduce functionality rather than slow down the host app

songy23 added the discussion label Apr 23, 2019

songy23 mentioned this issue Apr 23, 2019

Prevent blocking on queue overflow (#1809) census-instrumentation/opencensus-java#1837

Closed

songy23 mentioned this issue Apr 30, 2019

0.21.0 release census-instrumentation/opencensus-java#1857

Closed

3 tasks

songy23 mentioned this issue May 2, 2019

Refactor async exporter census-instrumentation/opencensus-python#642

Merged

dmichel1 mentioned this issue May 3, 2019

Memory settings census-instrumentation/opencensus-java#1767

Closed

lmuhlha mentioned this issue May 13, 2019

Fix tracing memory leak (Heroic-api) spotify/heroic#469

Closed

This was referenced Jun 12, 2019

Add recommendations about blocking / queuing / resource consumption for language libraries open-telemetry/opentelemetry-specification#94

Closed

Add Performance and Blocking specification open-telemetry/opentelemetry-specification#130

Merged
