Segment from Lambda triggered by SQS is never sampled #570

Open
DanielSidhion opened this issue Feb 10, 2023 · 3 comments

@DanielSidhion

DanielSidhion commented Feb 10, 2023

Hey folks, I've been scratching my head over this one for a while, so I created this issue hoping that someone can help figure out what's going on.

I have a Lambda function with an alias, and the alias can be triggered by SQS. Events are coming in from EventBridge and routed to SQS. The function has Active tracing enabled.

When the function gets triggered by SQS, I see the following in my logs (values from one specific run are shown below, but this is consistent behaviour on every attempt):

  • The SQS message itself contains the following in its attributes:
    "AWSTraceHeader": "Root=1-63e54f60-54c0dec03aa822cf0eb80f6b;Parent=7026c82d12761ec6;Sampled=0"
    
  • The X-Ray SDK ends up apparently ignoring that though, and uses this facade segment:
    Lambda trace data found: Root=1-63e54f76-613d90840ef5611617d19f78;Parent=5b3f323a471e441f;Sampled=0
    Segment started: 
    {
        "root": "1-63e54f76-613d90840ef5611617d19f78",
        "parent": "5b3f323a471e441f",
        "sampled": "0",
        "data": {}
    }
    
  • At the end of the function, the SDK ends up deciding not to flush any subsegments, because (as I showed), the segment is not sampled:
    Ignoring flush on subsegment 3b13dabf96025041. Associated segment is marked as not sampled.
    

Note that neither of those segments is sampled (the one from SQS and the one Lambda creates). I've tried recreating the whole infrastructure (SQS queues, EventBridge triggers, Lambda function, IAM roles) from scratch, and I tried disabling Active tracing and re-enabling it, but this behaviour keeps happening. The sampling rule I'm using is the default one, and the Lambda was invoked only to test this, so it was definitely within the default rule's 1 request per second.
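
To compare the two headers above side by side, here is a minimal Python sketch that splits an X-Ray trace header into its fields and decodes the epoch-seconds part of the trace ID. The helper names `parse_trace_header` and `trace_start_time` are hypothetical and not part of the X-Ray SDK; the header values are copied from the logs above.

```python
from datetime import datetime, timezone


def parse_trace_header(header: str) -> dict:
    """Split 'Root=...;Parent=...;Sampled=...' into a dict of its fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip fragments that have no '='
            fields[key] = value
    return fields


def trace_start_time(root: str) -> datetime:
    """Decode the hex epoch-seconds field of a trace ID like '1-63e54f60-...'."""
    _version, epoch_hex, _unique = root.split("-")
    return datetime.fromtimestamp(int(epoch_hex, 16), tz=timezone.utc)


# Header carried in the SQS message attributes (from the logs above).
sqs_header = parse_trace_header(
    "Root=1-63e54f60-54c0dec03aa822cf0eb80f6b;Parent=7026c82d12761ec6;Sampled=0"
)
# Header the SDK reported as "Lambda trace data found" (from the logs above).
lambda_header = parse_trace_header(
    "Root=1-63e54f76-613d90840ef5611617d19f78;Parent=5b3f323a471e441f;Sampled=0"
)

gap = trace_start_time(lambda_header["Root"]) - trace_start_time(sqs_header["Root"])

print("Same trace:", sqs_header["Root"] == lambda_header["Root"])
print("Sampled flags:", sqs_header["Sampled"], lambda_header["Sampled"])
print("Seconds between trace IDs:", gap.total_seconds())
```

For these values the two `Root` IDs differ and their embedded timestamps are 22 seconds apart, which is consistent with the facade segment belonging to a different trace than the one in `AWSTraceHeader`.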

However, if I manually invoke the Lambda function (with the same message it gets from SQS, no changes at all, including the unsampled AWSTraceHeader), then it creates a sampled segment (as expected!). Note that this is a manual invocation of the Lambda, it isn't being triggered by SQS this time:

Lambda trace data found: Root=1-63e5ac97-07c68b9b2d943eb56a158f57;Parent=e42fbe5a5a09d152;Sampled=1
Segment started: 
{
    "root": "1-63e5ac97-07c68b9b2d943eb56a158f57",
    "parent": "e42fbe5a5a09d152",
    "sampled": "1",
    "data": {}
}

And it properly flushes the subsegments, and I can see them in the X-Ray console:

Subsegment sent: {"trace_id:"1-63e5ac97-07c68b9b2d943eb56a158f57","id":"b2f07bfe66ccf783"}
UDP message sent: 
{
    "id": "b2f07bfe66ccf783",
    "name": "<redacted>",
    "start_time": 1675996312.821,
    "namespace": "remote",
    "http": {
        <redacted>
    },
    "end_time": 1675996313.035,
    "type": "subsegment",
    "parent_id": "e42fbe5a5a09d152",
    "trace_id": "1-63e5ac97-07c68b9b2d943eb56a158f57"
}

Does anyone know why the Lambda function is apparently never sampled when invoked through the SQS trigger? I noticed you folks have recently worked on the SQS->Lambda and SNS->SQS->Lambda trace continuation features (congrats btw! I'm definitely going to make use of them), so I'm wondering if something along the way changed and broke the scenario I'm describing?

@willarmiros
Contributor

Hi @DanielSidhion,

> Events are coming in from EventBridge and routed to SQS. The function has Active tracing enabled.

Can you be more specific about how events are generated in EB? The sampling decision is made at the root of your request, whatever that is, as long as that root has active tracing enabled. So I'm wondering whether anything upstream of EB may be injecting the unsampled decision. EB itself does not have active tracing, so it should not be making a sampling decision (and neither should SQS).

@DanielSidhion
Author

DanielSidhion commented Feb 10, 2023

Hi @willarmiros,

Thank you for the reply. In this particular case, I have the following request path:
<external service> -> API Gateway -> Lambda 1 -> EventBridge -> SQS -> Lambda 2 (the one I was investigating)

Neither API Gateway nor Lambda 1 has X-Ray/Active tracing enabled. I just experimented with turning on Active tracing on Lambda 1 and that didn't work either, but enabling X-Ray on API Gateway AND enabling Active tracing on Lambda 1 finally makes Lambda 2 get a sampled segment.

This was a bit unexpected, since it looks like I'll have to enable X-Ray along the whole chain just so I can get insights on Lambda 2 (which is the thing I'm most interested in). Is there no way to work around this? I can enable X-Ray on everything, but I'm curious whether this is forcing a specific workflow or if there's a way to work around it using the X-Ray SDK.

An interesting behaviour is that the ADOT layer seems to basically ignore the unsampled segment and just generates its own in that case. I haven't tested the ADOT layer with the sampled segment yet, but I'll experiment to see how it goes.

@willarmiros
Contributor

willarmiros commented Feb 16, 2023

> I just experimented with turning on Active tracing on Lambda 1 and that didn't work either, but enabling X-Ray on API Gateway AND enabling Active tracing on Lambda 1 finally makes Lambda 2 get a sampled segment.

It is surprising that enabling it on Lambda 1 did not start to introduce sampled segments. For context, it is a known issue that Lambda functions will generate Sampled=0 if they do not have active tracing enabled. This is a bug, as they should not be adding any sampling decision, and we are working with Lambda to get it fixed. So getting Sampled=0 with active tracing disabled on Lambda 1 is expected, but if you enable active tracing on it I would expect that to go away. The approach with minimal tracing would be to enable active tracing on only APIGW and Lambda 2; the other services should then just pass the trace context (with Sampled=1) through. Can you share what region you're operating in?
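
The propagation described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not how AWS services are implemented: the `propagate` function, the hop names, and the four rules in its docstring are hypothetical simplifications of the behaviour described in this thread, and a hop with active tracing is assumed to always decide Sampled=1.

```python
from typing import List, Optional, Tuple


def propagate(hops: List[Tuple[str, bool]]) -> Optional[int]:
    """Return the Sampled flag seen by the last hop, or None if nobody decided.

    Each hop is (kind, active_tracing). Simplified, hypothetical rules:
      * an existing decision is passed through unchanged;
      * with no decision yet, a hop with active tracing makes one
        (assumed here to be Sampled=1, i.e. within the sampling rule);
      * with no decision yet, a Lambda hop without active tracing injects
        Sampled=0 (the known issue described in the thread);
      * any other hop passes the missing decision along.
    """
    decision: Optional[int] = None
    for kind, active in hops:
        if decision is not None:
            continue  # decision already made upstream, just pass it through
        if active:
            decision = 1
        elif kind == "lambda":
            decision = 0
    return decision


# Request path from the thread: only Lambda 2 has active tracing.
original_chain = [
    ("apigw", False), ("lambda", False), ("eventbridge", False),
    ("sqs", False), ("lambda", True),
]
# Suggested minimal setup: active tracing on APIGW and Lambda 2 only.
minimal_chain = [
    ("apigw", True), ("lambda", False), ("eventbridge", False),
    ("sqs", False), ("lambda", True),
]

print("original:", propagate(original_chain))
print("minimal:", propagate(minimal_chain))
```

Under these rules, enabling active tracing on Lambda 1 alone would also produce Sampled=1, which does not match what was reported in the thread; a hop further upstream injecting Sampled=0 would be one possible explanation.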

> An interesting behaviour is that the ADOT layer seems to basically ignore the unsampled segment and just generates its own in that case. I haven't tested the ADOT layer with the sampled segment yet, but I'll experiment to see how it goes.

Can you clarify what you mean here? Are you saying that if there is Sampled=0 in the header, you still see a segment from the ADOT Lambda layer, but no segments from the Lambda function itself? Are you using the Node.js Lambda layer?

To summarize, it seems like there are 2 possible issues:

  1. APIGW possibly injecting a Sampled=0 when tracing is disabled
  2. ADOT lambda layer (for JS?) is not respecting the sampling decision when generating segments
