Clarify usage of Distributed Tracing Extension by OpenTelemetry #912
Conversation
@joaopgrassi thanks a ton for doing this! Hopefully this will address some of the concerns that have come up in the past.
IMO, it would be fine to do this rename if that's what the group wants.
We also use the Distributed Tracing Extension in Azure SDKs, and it would be great to see clarification on how it should be used in the CloudEvents spec. I have a tiny concern about renaming it, though. E.g. in Azure SDKs, we support legacy tracing libraries along with OpenTelemetry, and if someone ever wanted to plug in another tracing solution, they would still be able to do so and leverage the Distributed Tracing Extension on CloudEvents.
Yes, I agree. I also have mixed feelings about renaming it. I think keeping it as-is is fine, as long as this document is clear on how it is used and people coming into this can find clear guidance. It can still be used by other means, but that's the user's choice, so not much we can do at that point, I'm afraid.
Given a single hop event transmission (from sink to source directly), the Distributed Tracing Extension,
if used, MUST carry the same trace information contained in protocol specific tracing headers.

Given a multi hop event transmission, the Distributed Tracing Extension, if used, MUST
carry the trace information of the starting trace of the transmission.
In other words, it MUST NOT carry trace information of each individual hop, since this information is usually
carried using protocol specific headers, understood by tools like [OpenTelemetry](https://opentelemetry.io/).
I'm reading this part and I don't think the goals of this extension and OTEL are the same.
I think the goal of this extension is to do Event Lineage.
In lineage, we care only about event processing hops, while OTEL traces every remote call, even middlewares that don't impact the original event.
I think we are changing the meaning of the extension more than clarifying it.
Hi @pierDipi, thanks for the input.
The semantic conventions we are trying to define in OTel open-telemetry/opentelemetry-specification#1978 are not defining tracing of the transport layer (middlewares, HTTP transport, etc.). We are rather focused on tracing the "application layer", meaning what happened after I sent this event. This seems to align well with the link you mentioned:
An Event lineage is a path in the event processing network that enables the tracing of the consequences of an event, or finding the reasons for a certain action.
The way we are planning to achieve this: every time an event is created, a Span is created. Then, we take this Span's context and add it to the event's Distributed Tracing Extension.
Then, at every place this event arrives, a Processing span is created. This processing span is then connected to the Create span by adding the context from the Distributed Tracing Extension to it as a Link.
With this, you would be able to see all operations that happened because of the initial create operation.
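Here is a minimal sketch of that flow using the OpenTelemetry Go API and the CloudEvents Go SDK. It is illustrative only, not the SDKs' built-in instrumentation; the span names, event attributes, and the choice to store the W3C `traceparent` value in the extension are assumptions for the example.

```go
package example

import (
	"context"

	cloudevents "github.com/cloudevents/sdk-go/v2"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

var (
	tracer     = otel.Tracer("cloudevents-example")
	propagator = propagation.TraceContext{}
)

// Producer side: start a "Create" span and inject its context into the
// event's Distributed Tracing Extension (the traceparent attribute).
func createEvent(ctx context.Context) cloudevents.Event {
	ctx, span := tracer.Start(ctx, "example.event create",
		trace.WithSpanKind(trace.SpanKindProducer))
	defer span.End()

	e := cloudevents.NewEvent()
	e.SetType("com.example.something.happened")
	e.SetSource("example/producer")

	carrier := propagation.MapCarrier{}
	propagator.Inject(ctx, carrier)
	e.SetExtension("traceparent", carrier.Get("traceparent"))
	return e
}

// Consumer side: wherever the event arrives, start a "Process" span that is
// linked back to the Create span whose context travels inside the event.
func processEvent(ctx context.Context, e cloudevents.Event) {
	carrier := propagation.MapCarrier{}
	if tp, ok := e.Extensions()["traceparent"].(string); ok {
		carrier.Set("traceparent", tp)
	}
	createCtx := propagator.Extract(context.Background(), carrier)

	_, span := tracer.Start(ctx, "example.event process",
		trace.WithSpanKind(trace.SpanKindConsumer),
		trace.WithLinks(trace.Link{SpanContext: trace.SpanContextFromContext(createCtx)}))
	defer span.End()
	// ... handle the event ...
}
```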
Regarding the old text in this spec:
Given a single hop event transmission (from sink to source directly), the Distributed Tracing Extension,
if used, MUST carry the same trace information contained in protocol specific tracing headers.
This can't be true, because if we create a span to track the "event creation" we'll have context A. Then, when sending the event, say via HTTP, if the app is auto-instrumented, there will be another span created for the physical send operation, so context B. The physical send operation is "outside" of the app's domain and we can't really modify the event at that point to inject the same context (unless the auto-instrumentation understands CloudEvents and does this).
Given a multi hop event transmission, the Distributed Tracing Extension, if used, MUST
carry the trace information of the starting trace of the transmission.
"starting trace of the transmission" is what the OpenTelemetry conventions is modeling with the "Create Span"
Once the trace context is set on the event, it MUST not be modified.
Here is a use-case with an app:
And this is what a trace will look like in Jaeger:
Note the consumer link. If you click that, you go to the Create span. Links are not yet well supported by many back-ends, but we hope they will catch up and present this in a more linear way. @lmolkova had a similar example using Azure Monitor where links are shown below the create, as you would expect.
## Using the Distributed Tracing Extension

The
[OpenTelemetry Semantic Conventions for CloudEvents](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.9.0/specification/trace/semantic_conventions/cloudevents.md)
I get a 404 from this link - is this a placeholder for a new doc that's in the works?
Yes, it's a placeholder for when open-telemetry/opentelemetry-specification#1978 is merged.
On the 12/02 call we approved the overall direction of this PR but we'll hold off merging it until the referenced doc is ready. Then we'll revisit it to make sure that between this PR and that doc everything looks good from an end-user's perspective.
@duglin thanks for this update! One thing that I should bring up is that even if the PR in OpenTelemetry gets merged, it will still be marked experimental. The reason is that we are also working on the more general messaging conventions and finding out things "as we go". Since CloudEvents and Messaging overlap in some cases, it's likely that we'll need to make changes to the CloudEvents semantic conventions again as well. That could then also affect this document in the CloudEvents spec, but it's hard to say for now. So we might consider only merging this once the CloudEvents semantic conventions in OpenTelemetry are marked as stable. What do you think?
@joaopgrassi it's ok with me if you'd prefer to wait - just let us know when things are ready. The only reason I can think of for merging this sooner rather than later is if the intermediate version would be a significant improvement over what's there today. Totally up to you...
Great, will wait for the OT PR. I hope we include the HTTP header names in the CloudEvents spec here for clarity.
Whether it adds improvement to what we have today is hard to say 😅, but it will definitely have to be adapted as we work on the OTel side. So let's leave it open for now, and I'll keep working on it. I just found out that we actually have code that produces this.
Just to make it official... we're putting this PR "on hold" until the OT side of things is ready.
Wanted to share my view: in the last ~6-7 years I've seen a fair share of clients and projects going for stronger choreography, often misusing Kafka in particular as a message queue, expecting "reliable processing" (whatever that means) and experiencing incredible difficulty tracking down anything leaving the happy path. Band-aids are usually hacked together and stuffed on top to somehow gain visibility & "control", often falling back to classic message queue metaphors (self-implemented dead-letter queues with manual copy-back jobs because processing was not implemented in a properly idempotent way, etc.). An enormous amount of effort & time goes down the drain for this, and it all starts with visibility. Out of that I want to say: this is incredibly useful, please embrace it / improve on it. Writing down a clear convention for how to transport (CloudEvents Extension) and how to link (OpenTelemetry) things to the original event-creating span is excellent and the right way to do it. There's a range of use-cases that are just not properly solved otherwise.
The only solution I've seen doing it kind-of right (and this here is similar, with links, but of course tackling it at the right level) has been Azure Event Hubs with the corresponding libraries. They also link to the original span and then show it in Azure Monitor. However, it's incredibly badly documented (the only good explanation I found back then is in this medium article) and inconsistent between e.g. the C#, Java and node.js libraries... (or at least it was). So, once all is finalised and clarified (and judging from the last fine-adjustments I saw from @lmolkova I have a positive feeling), I'd actually love to (optionally) see this support in the CloudEvents libraries, or a good hook that the OpenTelemetry libs can hook into.
And yes, in the simplest scenario (direct HTTP call, no batching, no magic) you'd also see the chain in the normal W3C HTTP trace headers and auto-instrumentation. But I think the additional link wouldn't hurt in this scenario either... just my (very long) 5 cents...
@joaopgrassi what's the status of things? It appears the OT PR might be merged.
Hi @duglin
@joaopgrassi cool - thanks for the update!
This Pull Request is stale because it has been open for 30 days with no activity.
Not stale, we are still working on bringing messaging conventions in OTel to stable.
/remove-lifecycle stale
This Pull Request is stale because it has been open for 30 days with no activity.
/remove-lifecycle stale
@joaopgrassi just a periodic ping to see if anything has changed
Hi @duglin! So something did happen actually! We have been working on an OTEP that re-defines the general messaging systems semantic conventions, and that was merged last week. Now the work is to start porting the text from the OTEP proposal into the OTel specification and hopefully mark it stable. Since CloudEvents has a lot of overlap with it, we have a tracking issue open-telemetry/semantic-conventions#654 to investigate how the new conventions overlap/conflict and how we should proceed. So, some movement, but nothing concrete for this yet. I will keep you all posted on the progress.
@joaopgrassi great - thanks for the update
This Pull Request is stale because it has been open for 30 days with no activity.
/remove-lifecycle stale
I didn't forget about this. We are moving forward with the stabilization of messaging conventions in OTel, so we are getting closer to digging into this :)
@joaopgrassi thanks for the update!
Hi all, I'm back at this after a long time. We in the OTel messaging working group are getting close to stabilizing the messaging conventions.

Now that we have a clear idea of what the messaging conventions will be, we believe that applications producing CloudEvents can simply follow the OTel conventions for either HTTP or Messaging, depending on the protocol used to send the events. For peer-to-peer "simple" communications, the existing HTTP semantic conventions and the existing trace context propagation via HTTP headers should be enough. For more complex, asynchronous cases, when events are sent via message brokers for example (where batching can also happen), the messaging conventions are a better fit. Those conventions also define and use the context propagation mechanism appropriate to the messaging system.

All this is to say, we don't really see a reason for the OTel conventions to rely on the Distributed Tracing Extension from CloudEvents. But of course, if the messaging system does not have a defined mechanism for context propagation, the CloudEvents instrumentations can still use the Distributed Tracing Extension for it. This won't change the overall trace conventions for CloudEvents.

Thus I now have a draft PR on the OTel side that I'd like folks here to take a look at and leave their opinions/questions on: open-telemetry/semantic-conventions#1182. If you prefer a live discussion, we also have a community meeting every Thursday at 8:00 PT. You can join via this zoom link and find the agenda docs here.
I believe CloudEvents are used in many scenarios, not just messaging or peer-to-peer. For example, with Azure Event Grid, someone could send a batch of CloudEvents to the service and the service would dispatch them independently of each other. HTTP context propagation is not sufficient in this case - there could be multiple hops between the publisher and the consumer of a cloud event. So I do see huge value in having the distributed context extension on CloudEvents regardless of the OTel messaging conventions. CloudEvents have a broader scope and allow propagating context when the underlying protocol does not support it.
Yes, I also agree. My goal is to point out that, when possible, instrumentations should use the already existing context propagation mechanism of the protocol in use, and when that doesn't exist, they MAY use the distributed context extension from CloudEvents. No matter which way is chosen, though, the existing span conventions should be used in combination, be it the HTTP or Messaging trace conventions/structure. I think we are aligned on that vision, we just need to iron out the wording :)
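To make that fallback idea concrete, here is a rough sketch assuming an HTTP transport. The header and extension names follow W3C Trace Context; the helper function name and the rest of the structure are illustrative, not part of any spec or SDK.

```go
package example

import (
	"context"
	"net/http"

	cloudevents "github.com/cloudevents/sdk-go/v2"
	"go.opentelemetry.io/otel/propagation"
)

// extractContext (hypothetical helper) prefers the protocol's own context
// propagation and only falls back to the Distributed Tracing Extension
// when no protocol-level trace context is present.
func extractContext(r *http.Request, e cloudevents.Event) context.Context {
	propagator := propagation.TraceContext{}

	// 1. Protocol-level propagation: the standard traceparent/tracestate
	//    HTTP headers set by the sender's instrumentation.
	if r.Header.Get("traceparent") != "" {
		return propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	}

	// 2. Fallback: the traceparent value carried inside the event itself
	//    via the Distributed Tracing Extension.
	carrier := propagation.MapCarrier{}
	if tp, ok := e.Extensions()["traceparent"].(string); ok {
		carrier.Set("traceparent", tp)
	}
	return propagator.Extract(r.Context(), carrier)
}
```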
"For per-to-per "simple" communications, the existing HTTP semantic conventions should be enough and the existing trace context propagation via HTTP headers should also be enough." Well, seems like the context setting for messaging in the OTel messaging group got lost. We've had quite a few discussions around this as the messaging group in OTel got started. No, it's not enough. Messaging flows generally have at least two legs. A producer leg and a consumer leg. A direct webhook call from the producer straight to the consumer is an total outlier in messaging scenarios. A CloudEvent flows from a producer to zero or more consumers. There's an application-level context that flows from that producer to those consumers, which is "blind" to the infrastructure that deals with getting the message routed. The message may be routed over multiple hops, starting with a push-style delivery via HTTP to an event broker, then an AMQP replication from that message broker to another, and an MQTT pull-style delivery to a consumer. All those transfers are separate operations. They are sometimes initiated by a pulling party that does not know the event it will be getting it starts pulling and sometimes initiated by the party that has the event in hands. In all cases, I would expect these transfer hops of the message flow to be able to hook up to the context of the event. All those contexts are derivatives of the context set by the producer. So, no, the peer-to-peer contexts for the transport hop are not sufficient. |
Hi @clemensv, sure, that is all there, nothing got lost; I've been part of the group since the start :). Like I said, when CEs are used in the context of messaging, then of course context propagation via HTTP headers is not enough, and that's what the PR is aiming to say - there, context is propagated via the message, just like the messaging conventions specify. It seems I'm having trouble getting my thoughts across.

For producer/consumer/messaging scenarios we have it all covered to behave like you wrote above. My point is about those that are NOT using messaging systems and are simply doing plain HTTP "peer-to-peer", client/server communication, for example talking to a REST API. I assume nothing prevents people from using CEs in those synchronous scenarios, where I simply use CloudEvents to encode my data. That's the point I'm trying to get across - you don't need any different context propagation, nor the extension, if there are no hops and no multiple consumers like you have in async scenarios. The existing HTTP trace context propagation covers that. That's the goal of the PR in the semconv for CloudEvents.

But if you say we shouldn't care about those synchronous scenarios and we just talk about messaging/producer/consumers, then we can also do that. I just feel we would leave the rest of the users without much clue on what should be done.
Are you assuming that the receiver always knows that it's the one and only hop for the incoming data? I can't help but think that it would be nice if they didn't know, or care, and could get the info of interest the same way regardless of the number of hops.
Hi all, we have now merged the semantic conventions PR for CloudEvents: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/cloudevents/cloudevents-spans.md. This addresses, I believe, all the concerns we discussed here in the last days. This PR is very old and I had a lot of conflicts when trying to rebase it. Please read the description with my rationale for the new changes. Thanks!
Background
We have a working group in the OpenTelemetry project that is making progress towards defining conventions around tracing for messaging systems.
As part of this effort, the group is defining how to trace applications that use CloudEvents. There is a PR open in the OpenTelemetry specification (Introduce new semantic conventions for CloudEvents) where you can see what the proposal is.
The PR addresses when to create spans, how the context will "flow" between event "hops", and what such spans should look like (name, attributes, type, etc.). The goal of this is also to define a standard for span structure. In theory, people looking at a trace for a "CloudEvents based system" should have the same experience everywhere.
In a nutshell, OpenTelemetry is going to heavily rely on the Distributed Tracing Extension defined in this spec. Via this extension, CloudEvents libraries (SDKs, etc.) that support OpenTelemetry can inject the span context into the event, and later extract this context to continue the trace.
Proposed Changes
During one of the meetings, @clemensv suggested that it would make sense for the Distributed Tracing Extension specification to "link" to the OpenTelemetry semantic conventions. The idea is that people (SDK authors or users) coming to read this spec will see the link to OpenTelemetry and will be able to read there and get a full picture of why this is here and how it should be used.
As we all know, this was a cause of confusion in the past (Remove distributed tracing extension), as its purpose, and how it relates to tracing frameworks such as OpenTelemetry, was not clear. Now that OpenTelemetry is defining its usage, CloudEvents users will benefit from it and hopefully no more confusion will arise.
Summary of changes

- Added a Using the Distributed Tracing Extension section

Questions for this group

- Should we rename the extension to OpenTelemetry Distributed Tracing Extension? Would this be a breaking change?
- Should the SDKs offer built-in support for this? As an example, the Go SDK offers such a capability: OpenTelemetry ObservabilityService. Users only have to opt in to use this "service" and all their CloudEvents operations (Send, Receive) are traced automatically. It might not be possible to offer the same behavior in all SDKs, but at least we should aim to have something that makes users' lives easier, using the language constructs available (middlewares, wrappers, handlers, etc.). A rough sketch of what such an opt-in wrapper could look like follows below.
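Purely as an illustration of that idea, here is a minimal sketch of an opt-in tracing wrapper in Go. It is not the Go SDK's actual OpenTelemetry ObservabilityService; the type and function names below are hypothetical, and only the core CloudEvents client and OpenTelemetry APIs are assumed.

```go
package example

import (
	"context"

	cloudevents "github.com/cloudevents/sdk-go/v2"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

// TracedClient (hypothetical) wraps a CloudEvents client so callers opt in
// once and get a producer span plus the traceparent extension on every Send.
type TracedClient struct {
	inner  cloudevents.Client
	tracer trace.Tracer
}

func NewTracedClient(inner cloudevents.Client) *TracedClient {
	return &TracedClient{inner: inner, tracer: otel.Tracer("cloudevents-traced-client")}
}

func (c *TracedClient) Send(ctx context.Context, e cloudevents.Event) error {
	ctx, span := c.tracer.Start(ctx, "cloudevents send",
		trace.WithSpanKind(trace.SpanKindProducer))
	defer span.End()

	// Inject the current span context into the event's traceparent extension.
	carrier := propagation.MapCarrier{}
	propagation.TraceContext{}.Inject(ctx, carrier)
	e.SetExtension("traceparent", carrier.Get("traceparent"))

	// Delegate the actual send to the wrapped client.
	result := c.inner.Send(ctx, e)
	if cloudevents.IsUndelivered(result) {
		return result
	}
	return nil
}
```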
Release Note
Note: Please take a moment to read through the PR open in the OpenTelemetry spec. If you have questions, concerns, or don't agree with something there, leave a comment and consider joining the public meeting we have every Thursday (Instrumentation: Messaging).