Clarify usage of Distributed Tracing Extension by OpenTelemetry #912

Closed

Conversation

@joaopgrassi (Contributor)

Background

We have a working group in the OpenTelemetry project that is making progress towards defining conventions around tracing for messaging systems.

As part of this effort, the group is defining how to trace applications that use CloudEvents. There is a PR open in the OpenTelemetry specification (Introduce new semantic conventions for CloudEvents) where you can see what the proposal is.

The PR addresses when to create spans, how the context will "flow" between event "hops", and what such spans should look like (name, attributes, type, etc.). The goal is also to define a standard span structure: in theory, people looking at a trace for a "CloudEvents-based system" should have the same experience everywhere.

In a nutshell, OpenTelemetry is going to rely heavily on the Distributed Tracing Extension defined in this spec. Via this extension, CloudEvents libraries (SDKs, etc.) that support OpenTelemetry can inject the span context into the event and later extract it to continue the trace.
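
For illustration only, here is a minimal sketch of what that injection could look like in Go, using the OpenTelemetry W3C Trace Context propagator together with the CloudEvents sdk-go v2 `SetExtension` API. The helper name is mine, not part of any SDK:

```go
package tracingext

import (
	"context"

	cloudevents "github.com/cloudevents/sdk-go/v2"
	"go.opentelemetry.io/otel/propagation"
)

// injectTraceContext copies the active span context from ctx into the event's
// distributed tracing extension attributes ("traceparent" and "tracestate").
func injectTraceContext(ctx context.Context, e *cloudevents.Event) {
	carrier := propagation.MapCarrier{}
	// TraceContext implements the W3C Trace Context propagation format.
	propagation.TraceContext{}.Inject(ctx, carrier)
	for k, v := range carrier {
		e.SetExtension(k, v)
	}
}
```

The consumer side works the same way in reverse: read the extension attributes into a carrier and call `Extract` to continue the trace.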

Proposed Changes

During one of the meetings, @clemensv suggested that it would make sense for the Distributed Tracing Extension specification to link to the OpenTelemetry semantic conventions. The idea is that people (SDK authors or users) reading this spec will see the link to OpenTelemetry and be able to get the full picture of why the extension exists and how it should be used.

As we all know, this extension was a cause of confusion in the past (Remove distributed tracing extension), since its purpose, and how it relates to tracing frameworks such as OpenTelemetry, was not clear. Now that OpenTelemetry is defining its usage, CloudEvents users will benefit from it and hopefully no further confusion will arise.

Summary of changes

  • Make it clear that this extension is what OpenTelemetry will use to trace CloudEvents applications
  • Remove instructions on how/why this is used, as that is already defined by the OpenTelemetry specification
  • Add a direct link to the relevant OpenTelemetry specification in the Using the Distributed Tracing Extension section

Questions for this group

  • Should we rename this to OpenTelemetry Distributed Tracing Extension? Would this be a breaking change?
  • The OpenTelemetry semantic conventions for CloudEvents mention that SDKs/libraries should offer ways to automatically "manage" span creation for users (auto-instrumentation). But since it doesn't make much sense to define the behavior of CloudEvents SDKs in the OTel spec, I think we should consider adding this to the CloudEvents spec somehow.

As an example, the Go SDK offers such a capability: OpenTelemetry ObservabilityService. Users only have to opt in to this "service" and all their CloudEvents operations (Send, Receive) are traced automatically. It might not be possible to offer the same behavior in all SDKs, but at least we should aim for something that makes users' lives easier, using the language constructs available (middlewares, wrappers, handlers, etc.); see the sketch below.
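
To make the idea concrete, here is a rough, hypothetical sketch (not the actual Go SDK ObservabilityService API) of the kind of opt-in wrapper an SDK could offer: it starts a producer span for the event creation, injects that span's context into the event's distributed tracing extension, and then delegates to the wrapped client. Only Send is shown; a real wrapper would also cover Request and StartReceiver:

```go
package tracingclient

import (
	"context"

	cloudevents "github.com/cloudevents/sdk-go/v2"
	"github.com/cloudevents/sdk-go/v2/protocol"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

// tracingClient is a hypothetical decorator around a CloudEvents client.
type tracingClient struct {
	inner  cloudevents.Client
	tracer trace.Tracer
}

// Send records a producer span for the event creation, injects its context
// into the event's tracing extension, and forwards the event to the inner client.
func (c tracingClient) Send(ctx context.Context, e cloudevents.Event) protocol.Result {
	ctx, span := c.tracer.Start(ctx, "CloudEvents Create "+e.Type(),
		trace.WithSpanKind(trace.SpanKindProducer))
	defer span.End()

	carrier := propagation.MapCarrier{}
	propagation.TraceContext{}.Inject(ctx, carrier)
	for k, v := range carrier {
		e.SetExtension(k, v)
	}
	return c.inner.Send(ctx, e)
}
```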

Release Note

Clarify usage of Distributed Tracing Extension by OpenTelemetry

Note: Please take a moment to read through the PR open in the OpenTelemetry spec. If you have questions, concerns or don't agree with something there, leave a comment and consider joining the public meeting we have every Thursday (Instrumentation: Messaging)

@duglin (Collaborator) commented Nov 19, 2021

@joaopgrassi thanks a ton for doing this! Hopefully this will address some of the concerns that have come up in the past.

Should we rename this to OpenTelemetry Distributed Tracing Extension? Would this be a breaking change?

IMO it would be fine to do this rename if that's what the group wants.

@lmolkova commented Nov 22, 2021

We also use the Distributed Tracing Extension in the Azure SDKs, and it would be great to see clarification in the CloudEvents spec on how it should be used.

I have a tiny concern about renaming it to OpenTelemetry Distributed Tracing Extension. It currently relies on the W3C Trace-Context standard only and there is nothing OpenTelemetry-specific about it.

E.g. in the Azure SDKs we support legacy tracing libraries along with OpenTelemetry, and if someone ever wanted to plug in another tracing solution, they would still be able to do so and leverage the Distributed Tracing Extension on CloudEvents.

@joaopgrassi (Contributor, Author)

I have a tiny concern about renaming it to OpenTelemetry Distributed Tracing Extension. It currently relies on the W3C Trace-Context standard only and there is nothing OpenTelemetry-specific about it.

Yes, I agree. I also have mixed feelings about renaming it. I think keeping it as-is is fine, as long as this document is clear on how it is used and people coming to it can find clear guidance. It can still be used by other means, but that's the user's choice, so there's not much we can do at that point, I'm afraid.

Comment on lines -33 to -39
Given a single hop event transmission (from sink to source directly), the Distributed Tracing Extension,
if used, MUST carry the same trace information contained in protocol specific tracing headers.

Given a multi hop event transmission, the Distributed Tracing Extension, if used, MUST
carry the trace information of the starting trace of the transmission.
In other words, it MUST NOT carry trace information of each individual hop, since this information is usually
carried using protocol specific headers, understood by tools like [OpenTelemetry](https://opentelemetry.io/).
@pierDipi (Member)

I'm reading this part and I don't think the goal of this extension and OTEL are the same.

I think the goal of this extension is to do event lineage.
In lineage, we care only about event processing hops, while OTel traces every remote call, even middlewares that don't impact the original event.

I think we are changing the meaning of the extension more than clarifying it.

@joaopgrassi (Contributor, Author) commented Nov 24, 2021

Hi @pierDipi thanks for the input.

The semantic conventions we are trying to define in OTel (open-telemetry/opentelemetry-specification#1978) do not define tracing of the transport layer (middlewares, HTTP transport, etc.). We are rather focused on tracing the "application layer", meaning what happened after I sent this event. This seems to align well with the link you mentioned:

An Event lineage is a path in the event processing network that enables the tracing of the consequences of an event, or finding the reasons for a certain action.

The way we are planning to achieve this is: every time an event is created, a Span is created. Then we take this Span's context and add it to the event's Distributed Tracing Extension.

Then, at every place this event arrives, a Processing span is created. This Processing span is connected to the Create span by adding the context from the Distributed Tracing Extension to it as a Link.

With this, you would be able to see all operations that happened because of the initial create operation.
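
A minimal consumer-side sketch of that linking, under the same assumptions as before (Go, OTel trace API; the helper name is made up): read the creation context from the event's extension attributes and start a processing span that links back to it.

```go
package tracingext

import (
	"context"

	cloudevents "github.com/cloudevents/sdk-go/v2"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

// startProcessingSpan starts a "Process" span whose Link points at the span
// context carried in the event's distributed tracing extension.
func startProcessingSpan(ctx context.Context, tracer trace.Tracer, e cloudevents.Event) (context.Context, trace.Span) {
	carrier := propagation.MapCarrier{}
	if tp, ok := e.Extensions()["traceparent"].(string); ok {
		carrier.Set("traceparent", tp)
	}
	if ts, ok := e.Extensions()["tracestate"].(string); ok {
		carrier.Set("tracestate", ts)
	}
	// Context recorded when the "Create" span was added to the event.
	createCtx := propagation.TraceContext{}.Extract(context.Background(), carrier)
	link := trace.Link{SpanContext: trace.SpanContextFromContext(createCtx)}

	return tracer.Start(ctx, "CloudEvents Process "+e.Type(),
		trace.WithSpanKind(trace.SpanKindConsumer),
		trace.WithLinks(link))
}
```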

Regarding the old text in this spec:

Given a single hop event transmission (from sink to source directly), the Distributed Tracing Extension,
if used, MUST carry the same trace information contained in protocol specific tracing headers.

This can't be true, because if we create a span to track the "event creation" we'll have context A. Then, when sending the event, say via HTTP, if the app is auto-instrumented there will be another span created for the physical send operation, so context B. The physical send operation is "outside" of the app's domain and we can't really modify the event at that point to inject the same context (unless the auto-instrumentation understands CloudEvents and does this).

Given a multi hop event transmission, the Distributed Tracing Extension, if used, MUST
carry the trace information of the starting trace of the transmission.

"starting trace of the transmission" is what the OpenTelemetry conventions is modeling with the "Create Span"

Once the trace context is set on the event, it MUST not be modified.

@joaopgrassi (Contributor, Author)

Here is a use-case with an app:
[image]

And this is what a trace looks like in Jaeger:

Producer
[image]

Consumer
[image]

Note the consumer link: if you click it, you go to the Create span. Links are not yet well supported by many back-ends, but we hope they will catch up and present this in a more linear way. @lmolkova had a similar example using Azure Monitor, where links are shown below the Create span, as you would expect.

## Using the Distributed Tracing Extension

The
[OpenTelemetry Semantic Conventions for CloudEvents](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.9.0/specification/trace/semantic_conventions/cloudevents.md)
(Collaborator)

I get a 404 from this link - is this a placeholder for a new doc that's in the works?

@joaopgrassi (Contributor, Author)

Yes, it's a placeholder for when open-telemetry/opentelemetry-specification#1978 is merged.

@duglin (Collaborator) commented Dec 6, 2021

On the 12/02 call we approved the overall direction of this PR, but we'll hold off on merging it until the referenced doc is ready. Then we'll revisit it to make sure that, between this PR and that doc, everything looks good from an end user's perspective.

@joaopgrassi (Contributor, Author) commented Dec 7, 2021

@duglin thanks for this update!

One thing I should bring up is that even if the PR in OpenTelemetry gets merged, it will still be marked experimental. The reason is that we are also working on the more general messaging conventions and finding things out "as we go". Since CloudEvents and messaging overlap in some cases, it's likely that we'll need to make changes to the CloudEvents semantic conventions again as well. That could then also affect this document in the CloudEvents spec, but it's hard to say for now.

So we might consider merging this only once the CloudEvents semantic conventions in OpenTelemetry are marked as stable. What do you think?

@duglin (Collaborator) commented Dec 8, 2021

@joaopgrassi it's ok with me if you'd prefer to wait - just let us know when things are ready. The only reason I can think of for merging this sooner rather than later is if the intermediate version would be a significant improvement over what's there today. Totally up to you...

@joaopgrassi joaopgrassi changed the title Clarify usage of Distributed Tracing Extension by OpenTelemetry (#1) Clarify usage of Distributed Tracing Extension by OpenTelemetry Dec 9, 2021
@grant (Member) commented Dec 9, 2021

Great, will wait for the OT PR:
open-telemetry/opentelemetry-specification#1978
Then look at this PR.

I hope we include the HTTP header names with the CloudEvents spec here for clarity.

@joaopgrassi (Contributor, Author)

@duglin

it's ok with me if you'd prefer to wait - just let us know when things are ready. The only reason I can think of for merging this sooner rather than later is if the intermediate version would be a significant improvement over what's there today. Totally up to you...

Whether it adds an improvement over what we have today is hard to say 😅 but it will definitely have to be adapted as we work on the OTel side. So let's leave it open for now, and I'll keep working on it.

I just found out that we actually have code that produces this ce-traceparent, so I also have to look into that. It is actually interesting, because it just came up today during our OTel messaging meeting that we'll need some sort of header to propagate the initial creation context for messaging in general, and this seems like a good candidate for CloudEvents. Such a header avoids having to unpack the event just to read the value of the extension, which could be costly.
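
To illustrate (all values below are made up), in the HTTP binary content mode the transport-level W3C header and the extension attribute travel as two separate headers, so an intermediary can read the event's creation context without unpacking the body. The first traceparent belongs to the current transport hop; ce-traceparent carries the Distributed Tracing Extension set when the event was created:

```
POST /events HTTP/1.1
content-type: application/json
ce-specversion: 1.0
ce-id: 1234-5678
ce-type: com.example.sampletype
ce-source: /example/source
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
ce-traceparent: 00-d4cda95b652f4a1592b449d5929fda1b-6e0c63257de34c92-01
```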

@duglin (Collaborator) commented Jan 13, 2022

Just to make it official... we're putting this PR "on hold" until the OT side of things is ready.
People should still feel free to comment on it, though.

@LeDominik commented Jun 26, 2022

I wanted to share my view: in the last ~6-7 years I've seen a fair share of clients and projects going for stronger choreography, often misusing Kafka in particular as a message queue, expecting "reliable processing" (whatever that means) and experiencing incredible difficulty tracking down anything that leaves the happy path. Band-aids are usually hacked together and stuffed on top to somehow regain visibility and "control", often falling back to classic message-queue metaphors (self-implemented dead-letter queues with manual copy-back jobs because processing was not implemented in a properly idempotent way, etc.). An enormous amount of effort and time goes down the drain on this, and it all starts with visibility.

Out of that, I want to say: this is incredibly useful, so please embrace it and improve on it. Writing down a clear convention for how to transport (CloudEvents extension) and how to link (OpenTelemetry) things to the span that originally created the event is excellent and the right way to do it.

There's a range of use-cases that are just not properly solved otherwise:

  • (most common) suboptimal auto-instrumentation - and this can happen easily. In many situations it can be a good idea to have an outbox pattern: you don't want to send out an event proclaiming a new state of the world and then fail internally. Here the event is logically created in the flow, within an RDBMS transaction, and then sent out by some work queue or "smart" in-house "best-practice" framework... so the technical tracing context, threads, etc. are suddenly different, and in most cases the context is not transported correctly
  • Intermediaries / proxies / converters -> this is used as the headline example by @joaopgrassi; just check out Knative Eventing and Dapr. I think it's safe to assume this will grow to funnel data between clouds, products, etc.
  • Batch send, which is even supported by the CloudEvents HTTP binding and employed by many of those bespoke in-house "best-practice" frameworks trying to be "smart"
  • Batch consume (obvious) - so you cannot correlate back at this low level

The only solution I've seen doing this kind-of right (and it is similar to links here, but of course tackles it at the right level) has been Azure Event Hubs with the corresponding libraries. They also link to the original span and then show it in Azure Monitor. However, it's incredibly badly documented (the only good explanation I found back then is in this Medium article) and inconsistent between e.g. the C#, Java and Node.js libraries... (or at least it was).

So once everything is finalised and clarified (and judging from the last fine adjustments I saw from @lmolkova, I have a positive feeling), I'd actually love to (optionally) see this support in the CloudEvents libraries, or a good hook that the OpenTelemetry libs can hook into, because:

  • encoding an object into the CloudEvents format is actually the moment to create the CloudEvents Create <event_type> span and get that trace context into the CloudEvent
  • decoding a CloudEvent is actually the moment at which processing starts, so create a CloudEvents Process <event_type> span and link your processing to the original CloudEvents Create span

And yes, in the simplest scenario (a direct HTTP call, no batching, no magic) you'd also see the chain in the normal W3C HTTP trace headers and via auto-instrumentation. But I think the additional link wouldn't hurt in that scenario either...

just my (very long) 5cents...

@duglin (Collaborator) commented Oct 7, 2022

@joaopgrassi what's the status of things? It appears the OT PR might be merged.

@joaopgrassi (Contributor, Author)

Hi @duglin,
Yes, the initial PR adding the CloudEvents semantic conventions to OTel was merged. The current status: there is a group working on messaging in general, trying to bring the messaging conventions to a stable 1.0 release. Because a lot of the things related to this overlap with CloudEvents, I expect some things might change, especially around the trace structure. I'm part of the group, so I'm keeping tabs on it.

@duglin (Collaborator) commented Oct 13, 2022

@joaopgrassi cool - thanks for the update!

@github-actions bot commented Mar 2, 2023

This Pull Request is stale because it has been open for 30 days with
no activity. Mark it as fresh by updating it, e.g. by adding the comment /remove-lifecycle stale.

@joaopgrassi (Contributor, Author)

Not stale, we are still working on bringing messaging conventions in OTel to stable.

@duglin (Collaborator) commented Mar 10, 2023

/remove-lifecycle stale

@github-actions bot commented May 7, 2023

This Pull Request is stale because it has been open for 30 days with
no activity. Mark it as fresh by updating it, e.g. by adding the comment /remove-lifecycle stale.

@joaopgrassi (Contributor, Author)

/remove-lifecycle stale

@duglin (Collaborator) commented Jul 5, 2023

@joaopgrassi just a periodic ping to see if anything has changed

@joaopgrassi (Contributor, Author) commented Jul 5, 2023

Hi @duglin! So something did happen, actually! We have been working on an OTEP that redefines the general messaging-systems semantic conventions, and it was merged last week. Now the work is to start porting the text from the OTEP proposal into the OTel specification and hopefully mark it stable.

Since CloudEvents has a lot of overlap with it, we have a tracking issue (open-telemetry/semantic-conventions#654) to investigate how the new conventions overlap/conflict with it and how we should proceed.

So, some movement, but nothing concrete for this yet. I will keep you/all posted on the progress.

@duglin (Collaborator) commented Jul 6, 2023

@joaopgrassi great - thanks for the update

@github-actions bot

This Pull Request is stale because it has been open for 30 days with
no activity. Mark it as fresh by updating it, e.g. by adding the comment /remove-lifecycle stale.

@joaopgrassi (Contributor, Author)

/remove-lifecycle stale

I didn't forget about this. We are moving forward with the stabilization of the messaging conventions in OTel, so we're getting closer to digging into this :)

@duglin (Collaborator) commented Sep 25, 2023

@joaopgrassi thanks for the update!

@joaopgrassi (Contributor, Author) commented Jun 27, 2024

Hi all,

I'm back at this after a long time. We in the OTel messaging working group are getting close to stabilizing the messaging conventions.

Now that we have a clear idea of what the messaging conventions will look like, we believe applications producing CloudEvents can simply follow the OTel conventions for either HTTP or messaging, depending on the protocol used to send the events.

For "simple" peer-to-peer communication, the existing HTTP semantic conventions should be enough, and the existing trace context propagation via HTTP headers should also be enough. For more complex, asynchronous cases, where events are sent via message brokers for example (and where batching can happen), the messaging conventions are a better fit. Those conventions also define and use the context propagation mechanism appropriate to the messaging system.

All this is to say: we don't really see a reason for the OTel conventions to rely on the distributed tracing extension from CloudEvents. But of course, if the messaging system does not have a defined mechanism for context propagation, the CloudEvents instrumentations can still use the distributed tracing extension for it. This won't change the overall trace conventions for CloudEvents.

Thus I now have a draft PR on the OTel side that I'd like folks here to take a look at and leave their opinions/questions on: open-telemetry/semantic-conventions#1182.

If you prefer a live discussion, we also have a community meeting every Thursday at 8:00 PT. You can join via this zoom link and find the agenda docs here.

@lmolkova

I believe CloudEvents is used in many scenarios, not just messaging or peer-to-peer.
Even in messaging scenarios, CloudEvents could be used as a message payload.

For example, with Azure Event Grid, someone could send a batch of CloudEvents to the service and the service would dispatch them independently of each other.
Some protocols don't allow passing metadata (at least some MQTT versions), and there are services, like Azure Storage Queues, that don't support propagating metadata along with the payload.

HTTP context propagation is not sufficient in this case - there could be multiple hops between the publisher and the consumer of a cloud event.

So I do see huge value in having the distributed context extension on CloudEvents regardless of the OTel messaging conventions. CloudEvents has a broader scope and allows context to be propagated when the underlying protocol does not support it.

@joaopgrassi (Contributor, Author) commented Jun 28, 2024

So I do see huge value in having the distributed context extension on CloudEvents regardless of the OTel messaging conventions. CloudEvents has a broader scope and allows context to be propagated when the underlying protocol does not support it.

Yes, I also agree. My goal is to point out that, when possible, instrumentations should use the already existing context propagation mechanism of the protocol used, and when that doesn't exist, they MAY use the distributed context extension from CloudEvents. No matter which way is chosen, though, the existing span conventions should be used in combination, be it the HTTP or the Messaging trace conventions/structure.

I think we are aligned with that vision, just need to iron the wording out :)

@clemensv (Contributor) commented Jul 1, 2024

"For per-to-per "simple" communications, the existing HTTP semantic conventions should be enough and the existing trace context propagation via HTTP headers should also be enough."

Well, it seems like the context that was set for messaging in the OTel messaging group got lost. We had quite a few discussions around this as the messaging group in OTel got started.

No, it's not enough.

Messaging flows generally have at least two legs: a producer leg and a consumer leg. A direct webhook call from the producer straight to the consumer is a total outlier in messaging scenarios.

A CloudEvent flows from a producer to zero or more consumers. There's an application-level context that flows from that producer to those consumers, which is "blind" to the infrastructure that deals with getting the message routed.

The message may be routed over multiple hops, starting with a push-style delivery via HTTP to an event broker, then an AMQP replication from that broker to another, and finally an MQTT pull-style delivery to a consumer. All those transfers are separate operations. They are sometimes initiated by a pulling party that does not know which event it will be getting when it starts pulling, and sometimes by the party that has the event in hand. In all cases, I would expect these transfer hops of the message flow to be able to hook up to the context of the event. All those contexts are derivatives of the context set by the producer.

So, no, the peer-to-peer contexts for the transport hop are not sufficient.

@joaopgrassi (Contributor, Author) commented Jul 1, 2024

Hi @clemensv, sure, that is all there, nothing got lost - I've been part of the group since the start :).

Like I said, when CEs are used in the context of messaging, then of course context propagation via HTTP headers is not enough, and that's what the PR aims to say - in those cases the context is propagated via the message, just like the messaging conventions specify.

It seems I'm having trouble getting my thoughts across. For producer/consumer/messaging scenarios we have it all covered to behave like you wrote above. My point is about those who are NOT using messaging systems and are simply doing "peer-to-peer" HTTP client/server communication, for example talking to a REST API. I assume nothing prevents people from using CEs in those synchronous scenarios, where I simply use CloudEvents to encode my data. That's the point I'm trying to get across - you don't need any different context propagation, nor the extension, if there are no hops and no multiple consumers like you have in async scenarios. The existing traceparent header is enough.

That's the goal of the PR in the semconv for CloudEvents. But if you say we shouldn't care about those synchronous scenarios and should just talk about messaging/producers/consumers, then we can also do that. I just feel we'd leave the rest of the users without much of a clue about what should be done.

@duglin (Collaborator) commented Jul 1, 2024

Are you assuming that the receiver always knows it's the one and only hop for the incoming data? I can't help but think it would be nice if it didn't know, or care, and could get the info of interest the same way regardless of the number of hops.

@joaopgrassi (Contributor, Author)

Hi all,

We have now merged the semantic conventions PR for CloudEvents: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/cloudevents/cloudevents-spans.md. I believe this addresses all the concerns we discussed here over the last few days.

This PR is very old and I had a lot of conflicts when trying to rebase it onto main, so I opened a new one (sorry, I know it sucks): #1304

Please read the description with my rationale of the new changes. Thanks!
