Current handling of OTEL_EXPORTER_OTLP_TIMEOUT is not in line with spec #4044
Comments
Good catch! This affects all `OTEL_EXPORTER_OTLP*_TIMEOUT` envs, by the way.
This seems like a difficult issue to address without greatly breaking a majority of users. As of now, there isn't really an elegant way of fixing this other than a major version bump of the otlp exporters. For maintainers, this also implies a change in release process, since the otlp exporters won't be released with the same versions anymore. Open to other suggestions as well.
Not sure, I cannot really judge that. Before thinking about how to communicate a change, I'd try to clarify the actual technical consequences for users.
Please also note: unlike other language client implementations, the timeout has no influence on the total time an export may take (see #4043). Currently the timeout is applied to each export call, and the exponential backoff retry comes on top of that. This is another related aspect which is totally unaligned with the spec and the other language clients I had a look at.
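The per-attempt behaviour described above can be sketched as follows; `export_once` and the retry parameters here are hypothetical stand-ins for illustration, not the exporter's actual internals:

```python
import time

def export_with_retry(export_once, timeout_s=10, max_retries=5, base_delay_s=1):
    # Sketch of the behaviour described above: the configured timeout bounds
    # each individual export attempt, while the exponential backoff sleeps
    # come on top, so total wall time can far exceed timeout_s.
    delay = base_delay_s
    for _ in range(max_retries):
        if export_once(timeout_s):  # each call may take up to timeout_s
            return True
        time.sleep(delay)           # backoff is added on top of the timeout
        delay *= 2                  # exponential growth
    return False
```

With five attempts and a 10 s per-call timeout, the worst case is well over a minute even though the user configured "10".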
As discussed in the SIG, a plan could be to add a Python-specific environment variable that delegates to distributions the choice between adhering to the spec (ms) and keeping the current behaviour (s). This way we buy some time upstream, but downstream can act fast, and advanced users can do the conversion on their own. What do you think?
Other than handling the conversion from the env var, we need to handle the fact that the exporters take the timeout in seconds as an int, so in order to have a fine-grained value we need to either move to seconds as float or keep int but take milliseconds.
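A rough sketch of the proposed opt-in; the variable name `OTEL_PYTHON_EXPORTER_OTLP_TIMEOUT_UNIT` is made up here for illustration and is not an existing setting:

```python
import os

def read_otlp_timeout_seconds(environ=os.environ):
    # Hypothetical opt-in: distributions or users choose whether the
    # existing variable is read as seconds (current behaviour) or as
    # milliseconds (spec behaviour). Returning a float keeps sub-second
    # granularity when the value is in milliseconds.
    raw = environ.get("OTEL_EXPORTER_OTLP_TIMEOUT", "10")
    unit = environ.get("OTEL_PYTHON_EXPORTER_OTLP_TIMEOUT_UNIT", "s")  # made-up name
    value = float(raw)
    return value if unit == "s" else value / 1000.0
```

A distro could then ship the spec-compliant unit by default while upstream keeps the legacy default.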
Discussed this issue in the SIG meeting today. Here is what I propose. We are currently in 1.25.0. We release 1.26.0 normally (this should be happening in the next few days). After we release 1.26.0, we add a warning that gets triggered when someone uses the
Then we release 1.27.0, which will not break anything and will show users this first warning above. After releasing 1.27.0, we change the warning above so that it will mention:
Then we release 1.28.0, which will include the breaking change and will show users this second warning above. After releasing 1.28.0 we remove the warning. Then we release 1.29.0, which will show users no warning related to this issue.
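A minimal sketch of what the first warning stage could look like; the wording and the function name are assumptions, not the actual implementation:

```python
import os
import warnings

def warn_on_timeout_env(environ=os.environ):
    # Stage 1 of the proposed rollout: detect use of the variable and tell
    # users that its interpretation will change from seconds to milliseconds.
    if "OTEL_EXPORTER_OTLP_TIMEOUT" in environ:
        warnings.warn(
            "OTEL_EXPORTER_OTLP_TIMEOUT is currently interpreted as seconds; "
            "a future release will interpret it as milliseconds, as the "
            "OpenTelemetry specification requires.",
            DeprecationWarning,
            stacklevel=2,
        )
```

The second stage would only change the message text to state the release in which the behaviour actually flips.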
Something else: right now we use an integer to store the timeout value. It may be useful for someone if we stored that value as a float, so that one could define a timeout smaller than a second (now) or a millisecond (after we fix this issue). We have decided to keep using an integer for the time being. After we fix this issue we may discuss again whether we want to use a float instead.
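The granularity point can be shown in two lines of plain arithmetic (this is not exporter code):

```python
# With an integer number of seconds, a 500 ms timeout is unrepresentable:
timeout_ms = 500
as_int_seconds = timeout_ms // 1000   # truncates to 0 -- no usable timeout
as_float_seconds = timeout_ms / 1000  # 0.5 -- keeps sub-second precision
```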
@LarsMichelsen thank you for the summary in #4044 (comment), I agree with your conclusions. In particular:
I'm concerned that it will be too disruptive for us to make the change proposed by @ocelotl and fix this in a minor version. Users might ignore or otherwise not see the warnings. Some other thoughts
And their application will break. That will teach them a lesson. 😎
What I meant yesterday is that, to reduce the breakage from going from seconds to ms in the Exporter classes, we may keep seconds in the
I'm fine with adding a warning, but to ease the conversion for distros I would add a configuration option to choose the timeout unit explicitly, e.g.
Bumping the major version and avoiding more code to handle this is fine for me too.
I also stumbled upon this while digging through the shutdown timeout issues. There are several inconsistencies which should not exist without good reasons:
I agree that this needs to be thought through carefully, and maybe we can come up with better approaches by thinking it through further. However, I think it's important for a project to be able to make changes of this kind in some way. Otherwise historic issues will pile up over time, making the project unusable at some point.
This would enable users to change their configuration in a minor release and prepare for the major bump, which would then change the behavior. I understand this as the main point of the proposal, right? Without an additional warning about the future incompatibility, I doubt many users would use this option. In the end, most people read changelogs retrospectively, when they experience an issue. Also, even if they notice this, changing deployments to introduce a temporary environment variable seems to be work that people would simply not do. I think most users would just do the major version bump, and in the best case they read about the change in advance and change their configuration together with the update. In many cases they won't do this in the same step: because they did not read the changelog, they run into problems, then look at the changelog, and only then change their configuration.
Not really, this would let upstream keep the same behavior for a bit more time, but users (including downstream distros) willing to switch will have the ability to do so.
I'm proposing a way to configure the behavior, I'm not proposing to change the behavior at the same time.
Ah, thanks for the clarification. I guess I am just missing the broader picture. Is there a place where I can read more about it, like the downstream distros you mentioned?
We are trying to define the broader picture, as there are different opinions :) Several of us participating in this project work for companies building their own distributions (see https://opentelemetry.io/docs/concepts/distributions/).
Yes, I agree. But in the case of a minor version release, I think users will rightfully come here complaining that we broke things. For a major version, it should be more obvious, and it also signals which package might have introduced the breaking change. Question for maintainers: is there any downside to a major bump beyond that we potentially need to backport security fixes for some finite window?
Follow-up to the SIG discussion: @xrmx's concern is fixing this in a timely manner. @open-telemetry/python-maintainers, is there any objection or downside to doing a major version bump? The only thing I can think of is that it will take us out of lockstep.
Yes, the only downside is the maintenance overhead. I think this is a good time to come up with an automated process for handling packages that are not released in lockstep (especially now that sdk-extension-aws is also being added to the release cycle: open-telemetry/opentelemetry-python-contrib#1899). Currently there is no clean way to do this, and it may require a bit of complicated version management in our scripts.
This may be the source of the confusion where the timeouts are expressed in seconds:
Describe your environment
No response
What happened?
In `OTLPExporterMixin.__init__` the environment variable `OTEL_EXPORTER_OTLP_TIMEOUT` looks like it is handled as seconds, while the SDK documentation says it has to be in milliseconds (https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/#otel_exporter_otlp_timeout) and the compliance matrix says the Python implementation is compliant.

The description in `opentelemetry-sdk/src/opentelemetry/sdk/environment_variables.py` does not mention the exact unit.

Since the timeout is currently handed over as the timeout to `grpc.experimental.unary_unary`, which is documented like this:

To my understanding this only has an effect in case a connection is made but no answer to the RPC is sent back, which might not happen very often.
Steps to Reproduce
Start a `server.py` which accepts the client but does not respond:

Use this test `client.py`:

Preparation: Start the server
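Since the original `server.py`/`client.py` are not reproduced here, the following self-contained stand-in uses plain sockets instead of the actual gRPC exporter to show the same effect: the server accepts but never answers, and the client waits for the full value of `OTEL_EXPORTER_OTLP_TIMEOUT` interpreted as seconds:

```python
import os
import socket
import threading
import time

def silent_server(srv):
    # Accept the connection but never send a response.
    conn, _ = srv.accept()
    time.sleep(10)
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen()
port = srv.getsockname()[1]
threading.Thread(target=silent_server, args=(srv,), daemon=True).start()

os.environ["OTEL_EXPORTER_OTLP_TIMEOUT"] = "1"
timeout_s = int(os.environ["OTEL_EXPORTER_OTLP_TIMEOUT"])  # read as seconds today

client = socket.create_connection(("127.0.0.1", port))
client.settimeout(timeout_s)
start = time.monotonic()
try:
    client.recv(1)  # blocks until the timeout fires
except socket.timeout:
    pass
elapsed = time.monotonic() - start
print(f"waited ~{elapsed:.1f}s")  # ~1 s; per the spec, "1" should mean 1 ms
```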
Test 1: Start the client without environment variable. It uses the default timeout of 10 seconds per try.
Test 2: Start the client with the environment variable. The time waiting for the export call is now 5 seconds instead of 10 seconds. But according to the spec it should be 5 ms with this value.
Expected Result
The Python implementation is in line with the spec and the documentation is clear about the unit.
Actual Result
Setting `OTEL_EXPORTER_OTLP_TIMEOUT=1000` sets a timeout of 1000 seconds instead of 1 second.

Additional context
No response
Would you like to implement a fix?
None