OTEL Python does not always flush metrics to awsemf #851

sarwaan001 · 2023-08-15T21:33:39Z

Describe the bug
OTEL Python Layer does not always flush metrics at the end of lambda invocation.

Steps to reproduce

Deploy a lambda with the following python code:
handler.py

"""Sample Lambda for testing"""
from opentelemetry.metrics import get_meter
from opentelemetry import trace

trace.get_tracer_provider()
tracer = trace.get_tracer(__name__)

meter = get_meter(__name__)

counter = meter.create_counter(name="invocation_counter", description="A counter metric", unit="invocations")


def lambda_handler(event, _):
    """Sample Lambda for testing"""
    counter.add(1)
    return {"status_code": 200}

config.yaml

#config.yaml in the root directory
#Set an environment variable 'OPENTELEMETRY_COLLECTOR_CONFIG_FILE' to '/var/task/config.yaml'

receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  logging:
    verbosity: detailed
  awsxray:
  awsemf:
    namespace: ${env:OTEL_NAMESPACE}
    dimension_rollup_option: 1
    resource_to_telemetry_conversion:
      enabled: false
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      exporters: [logging,awsemf]

Ensure that the following configuration for the lambda is set:

Environment
-- AWS_LAMBDA_EXEC_WRAPPER: /opt/otel-instrument
-- OPENTELEMETRY_COLLECTOR_CONFIG_FILE: /var/task/config.yaml
-- OTEL_INSTRUMENTATION_AWS_LAMBDA_FLUSH_TIMEOUT: 900
-- OTEL_NAMESPACE: SampleNamespace
-- OTEL_PROPAGATORS: xray
-- OTEL_PYTHON_ID_GENERATOR: xray
Runtime - 3.9
Architecture - x86_64
handler: handler.lambda_handler
layers: arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-18-0:1

Ensure the Lambda has the following permissions:

xray:PutTelemetryRecords
xray:PutTraceSegments
cloudwatch:GetMetricData
cloudwatch:GetMetricStatistics
cloudwatch:GetMetricStream
cloudwatch:PutMetricData
cloudwatch:PutMetricStream
cloudwatch:StartMetricStreams
logs:CreateLogGroup
logs:CreateLogStream
logs:PutLogEvents

Obtain the Lambda ARN.
Ensure that you are logged in to the AWS CLI.
Create the following pytest test and replace the Lambda ARN placeholder with the ARN of the Lambda that was just created.
test.py

"""
    Tests the following Lambda by invoking the lambda 100 times and expecting the counter to return 100.
"""
import json
import time
from datetime import datetime, timezone

import boto3


def test_sample_lambda():
    lambda_arn = "<insert lambda arn>"

    lambda_client = boto3.client('lambda')
    event = json.dumps({})

    # Timezone-aware UTC timestamps keep the query window correct
    # even when the local clock is not set to UTC.
    start_time = datetime.now(timezone.utc)

    # Fire 100 asynchronous invocations; each should add 1 to the counter.
    for _ in range(100):
        response = lambda_client.invoke(
            FunctionName=lambda_arn,
            InvocationType='Event',
            LogType='None',
            Payload=event
        )
        assert response['StatusCode'] == 202

    # Wait 2 minutes for metrics to propagate + wait for last lambda
    time.sleep(2*60 + 2)

    cloudwatch_client = boto3.client('cloudwatch')

    metric_data = cloudwatch_client.get_metric_data(
        MetricDataQueries = [
            {
                'Id': 'integration_test',
                'MetricStat': {
                    'Metric': {
                        'Namespace': "SampleNamespace",
                        'MetricName': "invocation_counter",
                        'Dimensions': [{'Name': 'OTelLib', 'Value': 'handler'}]
                    },
                    'Period': 300,
                    'Stat': "Sum",
                }
            }
        ],
        StartTime=start_time,
        EndTime=datetime.now(timezone.utc),
    )

    otel_values = sum(metric_data['MetricDataResults'][0]['Values'])

    assert otel_values == 100

Ensure you have boto3 installed.

Run pytest.
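
With a fixed sleep, slow CloudWatch ingestion is indistinguishable from a dropped metric. As an alternative sketch, the test could poll until the expected sum shows up or a deadline passes. `wait_for_sum` and its parameters are hypothetical names, and `fetch_sum` stands for any zero-argument callable that runs the `get_metric_data` query from test.py and returns the summed values:

```python
import time


def wait_for_sum(fetch_sum, expected, timeout_s=300, interval_s=15, sleep=time.sleep):
    """Poll fetch_sum() until it reaches expected or timeout_s elapses.

    Returns the last observed value so the caller can assert on it.
    """
    deadline = time.monotonic() + timeout_s
    observed = fetch_sum()
    while observed < expected and time.monotonic() < deadline:
        sleep(interval_s)
        observed = fetch_sum()
    return observed
```

The final `assert otel_values == 100` would stay the same; only the waiting strategy changes, so a failure after the full timeout points more clearly at missing data.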

What did you expect to see?
The invocation_counter sum in CloudWatch should be 100, and pytest should pass.

What did you see instead?
Fewer than 100 values are sent to CloudWatch. Sometimes, on warm Lambdas, all 100 arrive and the test passes.

What version of collector/language SDK version did you use?
arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-18-0:1

What language layer did you use?
Python

Additional context
I believe that sometimes the lambda layer does not flush emf metrics before the lambda freezes.
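
If the freeze is the cause, one workaround to try is flushing explicitly before the handler returns. A minimal sketch, assuming the globally registered meter provider is the SDK `MeterProvider`, which exposes `force_flush(timeout_millis)`. The helper name `flush_metrics` and the 5000 ms default are illustrative and not part of the layer:

```python
def flush_metrics(provider, timeout_millis=5000):
    """Ask the meter provider to export pending metrics now.

    Returns False when the provider has no force_flush method
    (for example, a no-op provider), otherwise the boolean
    result of the flush.
    """
    force_flush = getattr(provider, "force_flush", None)
    if force_flush is None:
        return False
    return bool(force_flush(timeout_millis))
```

In handler.py this would be called as `flush_metrics(get_meter_provider())` right after `counter.add(1)`, with `get_meter_provider` imported from `opentelemetry.metrics`. Note that this only pushes data from the SDK to the collector; whether the collector then exports to awsemf before the freeze is a separate question.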

stevemao · 2024-02-03T06:44:46Z

I do not see anything going to awsemf at all. I am able to see logs when using logging exporter with the same code.

serkan-ozal · 2024-09-02T17:12:00Z

Hi @sarwaan001, I see that you set the flush timeout to 900 ms, and I think this might not be enough on cold start (especially for functions with a small memory limit), because the total flush timeout is shared: traces are flushed first, then metrics.

Are you still seeing missing metrics with higher flush timeout configs?
Also, I couldn't see your collector layer ARN. Are you exporting to an external collector?
Additionally, you should use the decouple processor (https://github.com/open-telemetry/opentelemetry-lambda/blob/main/collector/processor/decoupleprocessor/README.md) in the collector so that it is aligned with the Lambda lifecycle. Otherwise, because of the container freeze, some metrics might be missing or delayed.
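
To make the shared-budget point concrete, here is a toy calculation. It assumes the trace flush runs first and the metrics flush only gets whatever is left of the configured timeout. The function name and the trace flush durations are made up for illustration, not measured values:

```python
def remaining_metric_budget_ms(total_timeout_ms, trace_flush_ms):
    """Return the time left for the metrics flush once traces have flushed."""
    return max(0, total_timeout_ms - trace_flush_ms)


# Hypothetical trace flush durations against the 900 ms timeout from the issue.
for trace_flush_ms in (100, 600, 900):
    left = remaining_metric_budget_ms(900, trace_flush_ms)
    print(f"trace flush {trace_flush_ms} ms -> {left} ms left for metrics")
```

Under this model, a slow trace flush on cold start could leave little or no time for metrics, which would match the intermittent loss described above.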

sarwaan001 added the bug (Something isn't working) label on Aug 15, 2023
