[exporter][batcher] MergedContext implemented with SpanLink #12318

Open · wants to merge 8 commits into main from sili-fix-new-batching-context
Conversation

@sfc-gh-sili (Contributor) commented Feb 7, 2025

Description

This PR implements a component called mergedContext whose span is linked to every incoming context's span via a SpanLink.

For example: say req1, req2, and req3 (carrying contexts ctx1, ctx2, and ctx3) are batched into the same request. Then:

  • batch = { requests: [req1, req2, req3], mergedContext: ctx1 }
  • ctx1.span is linked to ctx2.span and ctx3.span via SpanLinks (a rough illustration follows)
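
A rough illustration of that linking step, assuming the go.opentelemetry.io/otel/trace API (the review below confirms the PR calls AddLink). The helper name is hypothetical, not the PR's actual code:

package batcher

import (
    "context"

    "go.opentelemetry.io/otel/trace"
)

// linkMergedSpan links the merged context's span (ctx1.span in the example
// above) to the span carried by each other incoming context, one SpanLink
// each. Hypothetical helper for illustration only.
func linkMergedSpan(ctx1 context.Context, others ...context.Context) {
    span := trace.SpanFromContext(ctx1)
    for _, c := range others {
        if sc := trace.SpanContextFromContext(c); sc.IsValid() {
            span.AddLink(trace.Link{SpanContext: sc})
        }
    }
}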

Link to tracking issues

#12212
#8122

return mc
}

func (mc mergedContext) Deadline() (time.Time, bool) {
Member: This may be called multiple times, so I think we should pre-calculate this.
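
One way to follow this suggestion, sketched under the assumption that mergedContext gains cached deadline fields (the field and constructor names here are not from the PR):

package batcher

import (
    "context"
    "time"
)

// deadline/hasDeadline are hypothetical fields for this sketch; the PR's
// struct stores the merged contexts themselves.
type mergedContext struct {
    ctx         context.Context
    ctxArr      []context.Context
    deadline    time.Time
    hasDeadline bool
}

func newMergedContext(ctx context.Context, ctxs ...context.Context) mergedContext {
    mc := mergedContext{ctx: ctx, ctxArr: ctxs}
    for _, c := range ctxs {
        // Pre-compute the max deadline once, instead of on every Deadline() call.
        if d, ok := c.Deadline(); ok && (!mc.hasDeadline || d.After(mc.deadline)) {
            mc.deadline, mc.hasDeadline = d, true
        }
    }
    return mc
}

// Deadline is now a cached-field read, safe to call any number of times.
func (mc mergedContext) Deadline() (time.Time, bool) {
    return mc.deadline, mc.hasDeadline
}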

Comment on lines 52 to 53
func (mc mergedContext) Done() <-chan struct{} {
    return mc.ctx.Done()
}
Member: We don't need to propagate this.

Comment on lines 56 to 63
func (mc mergedContext) Err() error {
    var mergedErr error
    for _, c := range mc.ctxArr {
        if err := c.Err(); err != nil {
            mergedErr = multierr.Append(mergedErr, err)
        }
    }
    return mergedErr
}
Member: Ditto.

"go.uber.org/multierr"
)

type mergedContext struct {
Member: The scope is:

  1. Deadline is max of all.
  2. Keep a list of all SpanContext.

We just need these 2 things, not the whole contexts.
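
If I read this right, the struct could shrink to roughly the following (a sketch; the names are illustrative, not from the PR):

package batcher

import (
    "time"

    "go.opentelemetry.io/otel/trace"
)

// mergedState keeps only the two pieces of state named above: the merged
// (max) deadline and the SpanContexts of the batched requests.
type mergedState struct {
    deadline    time.Time
    hasDeadline bool
    spanCtxs    []trace.SpanContext
}

// add folds one request's deadline and SpanContext into the merged state.
func (m *mergedState) add(d time.Time, ok bool, sc trace.SpanContext) {
    if ok && (!m.hasDeadline || d.After(m.deadline)) {
        m.deadline, m.hasDeadline = d, true
    }
    if sc.IsValid() {
        m.spanCtxs = append(m.spanCtxs, sc)
    }
}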

@sfc-gh-sili force-pushed the sili-fix-new-batching-context branch 4 times, most recently from b7080a2 to 3f6082e on February 12, 2025 00:10
@sfc-gh-sili changed the title from "An attempt to implement merged context" to "[exporter][batcher] MergedContext implemented with SpanLink" on Feb 12, 2025
@sfc-gh-sili marked this pull request as ready for review on February 12, 2025 00:57
@sfc-gh-sili requested review from dmitryax and a team as code owners on February 12, 2025 00:57
@sfc-gh-sili marked this pull request as draft on February 12, 2025 00:58
@sfc-gh-sili force-pushed the sili-fix-new-batching-context branch 2 times, most recently from c2082a8 to 6a425b5 on February 12, 2025 01:41
codecov bot commented Feb 12, 2025

Codecov Report

Attention: Patch coverage is 88.46154% with 3 lines in your changes missing coverage. Please review.

Project coverage is 91.54%. Comparing base (2dc95de) to head (1765a6d).
Report is 14 commits behind head on main.

Files with missing lines                                  Patch %   Lines
...r/exporterhelper/internal/batcher/batch_context.go     76.92%    2 Missing, 1 Partial ⚠️

❌ Your patch check has failed because the patch coverage (88.46%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #12318   +/-   ##
=======================================
  Coverage   91.54%   91.54%           
=======================================
  Files         467      468    +1     
  Lines       25623    25677   +54     
=======================================
+ Hits        23456    23507   +51     
- Misses       1768     1770    +2     
- Partials      399      400    +1     


@sfc-gh-sili force-pushed the sili-fix-new-batching-context branch from 87f5b85 to ce96200 on February 12, 2025 04:41
@sfc-gh-sili marked this pull request as ready for review on February 12, 2025 04:41
@jade-guiton-dd (Contributor) left a comment

I don't think the logic in this PR will work as intended.

  • The pipeline in the exporterhelper is queue → batcher → obsReportSender, so exporterhelper only creates its own span AFTER queuing and batching. This means the spans you are creating links between are the spans of the parent component (for example the span emitted by a batchprocessor), which will be hard to interpret for users.
  • Let's say that four dequeue operations D1, D2, D3, and D4 are performed, with associated parent spans S1, S2, S3, and S4. Let's say that the data from D1, D2, and part of D3 are put into a batch B1, and the rest of D3 + D4 are put into a second batch B2. Your code will attempt to add links from S1 to S2 and S3, then from S3 to S4. This will also be difficult to interpret.
  • Most importantly, by the time S2 and S3 are added to B1, span S1 will likely already have ended, and will likely already have been exported, without any span links. I tried your code by enabling the batcher in the OTLP exporter, and I suspect this is the reason why I don't see any span links being exported in my tests, even though AddLink is called with appropriate parameters.

I think the proper solution to adding span links across the queue and batcher would involve:

  • Creating a span for the enqueue operation in the queueSender, to avoid relying on or adding links to spans from parent components
  • Adding our span links to the span created by the obsReportSender at span creation time, to ensure they are all exported. This would involve not adding the links in the batcher, but only recording them and passing them to the obsReportSender, presumably through a Context key (a rough sketch of that mechanism follows).
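
A rough sketch of that mechanism, reusing the batchSpanLinksKey and contextWithMergedLinks names that appear in the later diff snippets (the exact shapes here are assumptions):

package batcher

import (
    "context"

    "go.opentelemetry.io/otel/trace"
)

type batchSpanLinksKeyType struct{}

var batchSpanLinksKey batchSpanLinksKeyType

// contextWithMergedLinks records the SpanContexts of all batched requests as
// trace.Links on a fresh background context, cutting the trace at the batch
// boundary instead of picking one request's span as the parent.
func contextWithMergedLinks(ctxs ...context.Context) context.Context {
    links := make([]trace.Link, 0, len(ctxs))
    for _, c := range ctxs {
        if sc := trace.SpanContextFromContext(c); sc.IsValid() {
            links = append(links, trace.Link{SpanContext: sc})
        }
    }
    return context.WithValue(context.Background(), batchSpanLinksKey, links)
}

The obsReportSender would then read the links back when it creates its span, e.g. tracer.Start(ctx, name, trace.WithLinks(links...)), so they are attached before the span ends and is exported.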

@sfc-gh-sili (Contributor, Author) commented Feb 12, 2025

@jade-guiton-dd Thanks for the feedback! I updated the implementation to link to batch spans from obsReportSender. I agree it's cleaner that way.

Regarding spans created from the queue: you proposed creating a span for the enqueue operation, but enqueue and dequeue use different contexts if the queue is a persistent storage queue. Did you mean we want to create a new span for every dequeue operation?

@jade-guiton-dd (Contributor) commented Feb 13, 2025

No, I did mean "enqueue". The ultimate goal of the span links is to link the "export" trace with the "input" trace that pushed the data into the exporter. Since only the enqueue operation is a synchronous part of the input trace, I think we will need to create a span for it. To make the link work across the persistent queue, we will need to persist the SpanContext across it, but that should hopefully be doable.

However for now, it's probably simpler to add a span for the dequeue operation and link to that across the batching operations. We can leave linking the enqueue and dequeue operations across the queue as future work.

@sfc-gh-sili force-pushed the sili-fix-new-batching-context branch from a17c1eb to 7f04365 on February 14, 2025 06:58
@sfc-gh-sili force-pushed the sili-fix-new-batching-context branch from 7f04365 to 5998f4e on February 14, 2025 06:59
@sfc-gh-sili (Contributor, Author) commented

@jade-guiton-dd Thanks! Just updated this PR with a new iteration that propagates span context with a thinner layer. Now trace links are stored in a dummy background context for obs_report_sender to read from.

Regarding timeout: it's not hard to implement. One way is to keep track of the deadline through the batching process and create a context with that deadline when the batch flushes (a rough sketch below). I removed it from this PR because I am not sure how useful the timeout would be; it feels awkward to have an additional timeout when we already have timeouts in batching and in timeout_sender.
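
For reference, the flush-time deadline described above could look roughly like this (a hedged sketch; the PR dropped this, and the names are mine):

package batcher

import (
    "context"
    "time"
)

// flushContext applies the deadline tracked during batching only when the
// batch actually flushes; the caller must invoke the returned cancel func.
func flushContext(deadline time.Time, hasDeadline bool) (context.Context, context.CancelFunc) {
    if hasDeadline {
        return context.WithDeadline(context.Background(), deadline)
    }
    return context.WithCancel(context.Background())
}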

Regarding start of span: this PR now starts a new span whenever an item is read from the queue, but I think the proper way is to:

  1. Like you mentioned, move it to enqueue
  2. and delegate the operation to obs_report_sender (or something similar)
obs_report_sender -> queue_sender -> [ queue ] -> batch_sender -> obs_report_sender -> retry_sender -> timeout_sender

Hopefully this PR is enough for the purpose of linking span context in the batcher. Let me know what you think.

@jade-guiton-dd (Contributor) left a comment

Thank you for all your work! Just one more fix needed, and a nitpick. (Although I'm not an approver, so a second review will be necessary anyway)

exporter/exporterhelper/internal/queue_sender.go (outdated; resolved)
if links, ok := ctx.Value(batchSpanLinksKey).([]trace.Link); ok {
    return links
}
return []trace.Link{trace.LinkFromContext(ctx)}
Contributor:

When the batcher is disabled (or when it is enabled but only one request ends up in the batch), contextWithMergedLinks is never called, and we pass the parent context through directly to the obsReportSender. I think this is fine, but because the latter calls LinksFromContext, the parent span ends up as both the parent AND a link.

It's not a big deal, but I think it would be better to create links only when we cut the trace, which is to say, when calling contextWithMergedLinks (sketched below).
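
A sketch of the suggested change (assumed shape, building on the snippet above): return links only when they were explicitly recorded, so an unbatched parent stays just the parent:

package batcher

import (
    "context"

    "go.opentelemetry.io/otel/trace"
)

type batchSpanLinksKeyType struct{}

// batchSpanLinksKey mirrors the key used in the diff above (shape assumed).
var batchSpanLinksKey batchSpanLinksKeyType

// LinksFromContext returns only links recorded by contextWithMergedLinks.
// Without the key (batcher disabled, or a single-request batch), it returns
// nil, so normal parent/child propagation applies with no self-link.
func LinksFromContext(ctx context.Context) []trace.Link {
    if links, ok := ctx.Value(batchSpanLinksKey).([]trace.Link); ok {
        return links
    }
    return nil
}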

@sfc-gh-sili force-pushed the sili-fix-new-batching-context branch from df83318 to cfdfd23 on February 15, 2025 07:43