Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Performance and Blocking specification #130

Merged
merged 13 commits into from
Jul 31, 2019
7 changes: 6 additions & 1 deletion specification/library-guidelines.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,8 +74,13 @@ Note that mocking is also possible by using the SDK and a Mock Exporter without

The mocking approach chosen will depend on the testing goals and at which point exactly it is desirable to intercept the telemetry data path during the test.

## Performance and Blocking

See the [Performance and Blocking](performance.md) specification for
guidelines on the performance expectations that API implementations should meet, strategies for meeting these expectations, and a description of how implementations should document their behavior under load.

## Concurrency and Thread-Safety

See [Concurrency and Thread-Safety](concurrency.md) specification for
See the [Concurrency and Thread-Safety](concurrency.md) specification for
guidelines on what concurrency safeties should API implementations provide
and how they should be documented.
49 changes: 49 additions & 0 deletions specification/performance.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# Performance and Blocking of OpenTelemetry API

This document defines common principles that will help designers create language libraries that are safe to use.

## Key principles

Here are the key principles:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider putting "Instrumentation cannot be a failure modality".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I agree with the policy itself: "Instrumentation cannot be a failure modality". But this file/PR is focusing on performance / blocking matter.

Could you make it as separate GitHub issue or something?

The topic relates with about error handling, recovery, retry, logging, handling information loss etc. Seems not to be so simple.


- **Library should not block end-user application by default.**
- **Library should not consume unbounded memory resource.**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there libraries for which this is not true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For blocking, opencensus-java blocks end-user application when a queue gets full: census-instrumentation/opencensus-java#1837 (this is why I get concerned about those matters).

An easy solution to the blocking matter is to use an unbounded queue to avoid blocking. In compensation, it consumes memory. I don't know the monitoring library which uses an unbounded queue, but I think clarifying it is meaningful.

Also, unbounded memory usage matter is related to the log volume matter described in "End-user application should aware of the size of logs" section: https://github.com/open-telemetry/opentelemetry-specification/pull/130/files#diff-44dc82a7e6286380ed89736215beda74R33 .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we single out memory only? CPU and latency impact are often more important.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true that computation overhead (CPU usage, latency) is also a possible cause of unwelcome blocking as discussed in #130 (comment) .

Could you read the above comment?

If we need to deep dive into CPU and latency impact matters, it is better to create separate PR/issue, I feel.


Although there are inevitable overhead to achieve monitoring, API should not degrade the end-user application as possible. So that it should not block the end-user application nor consume too much memory resource.

Especially, most telemetry exporters need to call API of servers to export traces. API call operations should be performed in asynchronous I/O or background thread to prevent blocking end-user applications.
saiya marked this conversation as resolved.
Show resolved Hide resolved

See also [Concurrency and Thread-Safety](concurrency.md) if the implementation supports concurrency.

### Tradeoff between non-blocking and memory consumption

Incomplete asynchronous I/O tasks or background tasks may consume memory to preserve their state. In such a case, there is a tradeoff between dropping some tasks to prevent memory starvation and keeping all tasks to prevent information loss.

If there is such tradeoff in language library, it should provide the following options to end-user:

- **Prevent blocking**: Dropping some information under overwhelming load and show warning log to inform when information loss starts and when recovered
- Should be a default option other than Logging
- Should provide option to change threshold of the dropping
- Better to provide metric that represents effective sampling ratio
- Language library might provide this option for Logging
- **Prevent information loss**: Preserve all information but possible to consume many resources
- Should be a default option for Logging
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why logging is different? Should be an option for every signal and probably the default for all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for approval!

Why logging is different?

About metrics/tracing, most existing tracing libraries support sampling & asynchronous processing. They drop some data in a fixed (or dynamic) rate and also drops data sometimes due to an internal buffer overflow. Such instrumentations are generally expected to sample information. But not expected to degrade end user's application to capture complete data.
About logging, as far as I know, most of the existing logging libraries do not drop logs but may block application. I feel dropping logs by default is a surprising behavior for most users.

Related materials: Logging v. instrumentation, Question: Can zipkin be used to replace logging?

Related conversation in this PR: #130 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was mostly about why not make everything blocking as default. We should support non blocking for sure but I feel that blocking can be default for all the implementations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saiya just a data point: https://github.com/uber-go/zap by default samples (throttles) logs

Copy link
Contributor Author

@saiya saiya Jul 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should support non blocking for sure but I feel that blocking can be default for all the implementations.

In my opinion, the blocking strategy is a little bit surprising for users who use instrumentation (tracing, metrics). But the non-blocking strategy can be an opt-in option as you mentioned. Both are fine for me.

Should we apply the blocking strategy for all components (tracing, metrics, and logging)?
If so, I will update this PR.

Copy link
Contributor

@z1c0 z1c0 Jul 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the key distinction to make here is dropping (sampling) vs throttling when it comes to logs:

  • Dropping logs arbitrarily will definitely come as a surprise for most users and may lose valuable information.
  • Throttling however can be really useful. If an application starts flooding the log for whatever reason, intelligent throttling along the lines of "log message "...." appeared X times in the last minute" does not lose information and can prevent severe performance degradations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping logs arbitrarily will definitely come as a surprise for most users and may lose valuable information.

Yes. This is what I meant and why I distinguished logs from other instruments (tracing, metrics).

Throttling however can be really useful. If an application starts flooding the log for whatever reason, intelligent throttling along the lines of "log message "...." appeared X times in the last minute" does not lose information and can prevent severe performance degradations.

Such intelligent seems to be awesome for me.

In my understanding, OpenTelemetly's specification of logging is not yet defined. Thus such technique should be discussed in logging API definition, IMHO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised why do you care that much about the default value and why do you treat metrics different than logs. In big companies (at least some that I know) metrics are more important than logs - for example monitoring/alerting is done using metrics not logs. Also we do use blocking for all the signals and we don't run into issues.

So I would suggest to treat all the signals the same. So probably go with a default blocking or not-blocking but for all the signals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood prescribing a default strategy (blocking or information loss) is a difficult topic. At the beginning, I assumed most users prefer non-blocking (information loss) for metrics/tracing. But now I understood there are various needs.

So that I deleted sentences such as Should be a default option other than Logging and Should be a default option for Logging in 34380b3.

In this PR, I want to clarify principle (avoid degrading user application as possible) and describe desired options to deal with a tradeoff. If we need to prescribe default, it can be done in separate PR, I feel.

- Language library might provide this option also for Tracing and Metrics

### End-user application should aware of the size of logs
saiya marked this conversation as resolved.
Show resolved Hide resolved

Logging could consume much memory by default if the end-user application emits too many logs. This default behavior is intended to preserve logs rather than dropping it. To make resource usage bounded, the end-user should consider reducing logs that are passed to the exporters.

Therefore, the language library should provide a way to filter logs to capture by OpenTelemetry. End-user applications may want to log so much into log file or stdout (or somewhere else) but not want to send all of the logs to OpenTelemetry exporters.

In a documentation of the language library, it is a good idea to point out that too many logs consume many resources by default then guide how to filter logs.

### Shutdown and explicit flushing could block

The language library could block the end-user application when it shut down. On shutdown, it has to flush data to prevent information loss. The language library should support user-configurable timeout if it blocks on shut down.

If the language library supports an explicit flush operation, it could block also.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably you want to also support a timeout for this operation and a negative or zero timeout means blocking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I also looked #186 (it describes timeout about shutdown operation).

Added sentence But should support a configurable timeout.. More detailed interface (e.g. use nagative integer to specify infinite blocking) can be described in #186's document if needed.


## Documentation

If language specific implementation has special characteristics that are not described in this document, such characteristics should be documented.