Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Performance and Blocking specification #130

Merged
merged 13 commits into from
Jul 31, 2019
9 changes: 8 additions & 1 deletion specification/library-guidelines.md
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,13 @@ Note that mocking is also possible by using SDK and a Mock `Exporter` without ne

The mocking approach chosen will depend on the testing goals and at which point exactly it is desirable to intercept the telemetry data path during the test.

## Performance and Blocking

See the [Performance and Blocking](performance.md) specification for
guidelines on the performance expectations that API implementations should meet, strategies for meeting these expectations, and a description of how implementations should document their behavior under load.

## Concurrency and Thread-Safety

See [Concurrency and Thread-Safety](concurrency.md) specification for guidelines on what concurrency safeties should API implementations provide and how they should be documented.
See the [Concurrency and Thread-Safety](concurrency.md) specification for
guidelines on what concurrency safeties should API implementations provide
and how they should be documented.
44 changes: 44 additions & 0 deletions specification/performance.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# Performance and Blocking of OpenTelemetry API

This document defines common principles that will help designers create language libraries that are safe to use.

## Key principles

Here are the key principles:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider putting "Instrumentation cannot be a failure modality".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I agree with the policy itself: "Instrumentation cannot be a failure modality". But this file/PR is focusing on performance / blocking matter.

Could you make it as separate GitHub issue or something?

The topic relates with about error handling, recovery, retry, logging, handling information loss etc. Seems not to be so simple.


- **Library should not block end-user application by default.**
- **Library should not consume unbounded memory resource.**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there libraries for which this is not true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For blocking, opencensus-java blocks end-user application when a queue gets full: census-instrumentation/opencensus-java#1837 (this is why I get concerned about those matters).

An easy solution to the blocking matter is to use an unbounded queue to avoid blocking. In compensation, it consumes memory. I don't know the monitoring library which uses an unbounded queue, but I think clarifying it is meaningful.

Also, unbounded memory usage matter is related to the log volume matter described in "End-user application should aware of the size of logs" section: https://github.com/open-telemetry/opentelemetry-specification/pull/130/files#diff-44dc82a7e6286380ed89736215beda74R33 .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we single out memory only? CPU and latency impact are often more important.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true that computation overhead (CPU usage, latency) is also a possible cause of unwelcome blocking as discussed in #130 (comment) .

Could you read the above comment?

If we need to deep dive into CPU and latency impact matters, it is better to create separate PR/issue, I feel.


Although there are inevitable overhead to achieve monitoring, API should not degrade the end-user application as possible. So that it should not block the end-user application nor consume too much memory resource.

See also [Concurrency and Thread-Safety](concurrency.md) if the implementation supports concurrency.

### Tradeoff between non-blocking and memory consumption

Incomplete asynchronous I/O tasks or background tasks may consume memory to preserve their state. In such a case, there is a tradeoff between dropping some tasks to prevent memory starvation and keeping all tasks to prevent information loss.

If there is such tradeoff in language library, it should provide the following options to end-user:

- **Prevent information loss**: Preserve all information but possible to consume many resources
- **Prevent blocking**: Dropping some information under overwhelming load and show warning log to inform when information loss starts and when recovered
- Should provide option to change threshold of the dropping
- Better to provide metric that represents effective sampling ratio
- Language library might provide this option for Logging

### End-user application should be aware of the size of logs

Logging could consume much memory by default if the end-user application emits too many logs. This default behavior is intended to preserve logs rather than dropping it. To make resource usage bounded, the end-user should consider reducing logs that are passed to the exporters.

Therefore, the language library should provide a way to filter logs to capture by OpenTelemetry. End-user applications may want to log so much into log file or stdout (or somewhere else) but not want to send all of the logs to OpenTelemetry exporters.

In a documentation of the language library, it is a good idea to point out that too many logs consume many resources by default then guide how to filter logs.

### Shutdown and explicit flushing could block

The language library could block the end-user application when it shut down. On shutdown, it has to flush data to prevent information loss. The language library should support user-configurable timeout if it blocks on shut down.

If the language library supports an explicit flush operation, it could block also. But should support a configurable timeout.

## Documentation

If language specific implementation has special characteristics that are not described in this document, such characteristics should be documented.