
All shutdown events are reported with a failure outcome #411

Closed
julianolf opened this issue Sep 27, 2023 · 8 comments · Fixed by #412
Labels
aws-λ-extension AWS Lambda Extension bug Something isn't working community Issues and PRs created by the community triage Issues and PRs that need to be triaged

Comments


julianolf commented Sep 27, 2023

APM AWS Lambda extension version: elastic-apm-extension-ver-1-5-0-x86_64

Describe the bug
When the AWS Lambda service sends a shutdown event to a function, the extension captures it and, regardless of the reason for the shutdown, sets the event outcome to failure.

To Reproduce
Steps to reproduce the behavior:

  1. Spin up multiple instances of a Lambda function (two or more).
  2. Wait a few minutes for the Lambda service to start scaling down (instances shutting down due to spin down).

Observed behavior
Elastic receives transactions where event.outcome is failure and transaction.result is spindown.

Provide logs (if relevant): https://gist.github.com/julianolf/33dcd000be2363300879e434080b4c7f


Assumption
I believe previous versions of the extension weren't actually handling the shutdown event; this started happening after upgrading to version 1.5, and it seems to be related to work in progress on issue #118.

According to the AWS documentation there are three possible reasons for a shutdown (spindown, timeout, and failure); spindown is just a normal shutdown.

As a temporary fix I've patched this single line.

```diff
diff --git a/accumulator/invocation.go b/accumulator/invocation.go
index 2895046..40e2419 100644
--- a/accumulator/invocation.go
+++ b/accumulator/invocation.go
@@ -79,7 +79,7 @@ func (inc *Invocation) MaybeCreateProxyTxn(status string, time time.Time) ([]byt
 	if err != nil {
 		return nil, err
 	}
-	if status != "success" {
+	if status != "success" && status != "spindown" {
 		txn, err = sjson.SetBytes(txn, "transaction.outcome", "failure")
 		if err != nil {
 			return nil, err
```
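For reference, the idea behind that one-line patch can be sketched as a standalone mapping from shutdown status to outcome. This is a hypothetical sketch (the function name and structure are not the extension's actual API), assuming the three AWS shutdown reasons of spindown, timeout, and failure:

```go
package main

import "fmt"

// outcomeForShutdown maps a Lambda shutdown status to a transaction outcome.
// Per the AWS Extensions API there are three shutdown reasons: "spindown"
// (normal scale-down), "timeout", and "failure". Only the latter two should
// yield a failure outcome. Hypothetical sketch, not the extension's code.
func outcomeForShutdown(status string) string {
	if status == "success" || status == "spindown" {
		return "success"
	}
	return "failure"
}

func main() {
	for _, s := range []string{"spindown", "timeout", "failure"} {
		fmt.Println(s, "->", outcomeForShutdown(s))
	}
}
```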

PS.

If my steps to reproduce are too vague, I can provide a small example project using AWS SAM to help reproduce what I've described.

Also, I could help fix it if that makes sense and you're willing to give some guidance on how to proceed; just let me know.

@julianolf julianolf added the bug Something isn't working label Sep 27, 2023
@github-actions github-actions bot added aws-λ-extension AWS Lambda Extension community Issues and PRs created by the community triage Issues and PRs that need to be triaged labels Sep 27, 2023
@lahsivjar
Contributor

@julianolf Thanks for opening the issue and the investigation.

When AWS Lambda service sends a shutdown event to a function the extension captures it and no matter the reason for the shutdown the event outcome is set as failure.

This is unexpected. However, I am not sure that marking all pending transactions as non-failure is optimal when we receive the spindown event. A pending transaction at shutdown would mean that the agent failed to successfully flush it. I think we need some more investigation before finalizing the correct solution. Can you share the following information:

  1. Which agent are you using? (language + version)
  2. Agent logs with debug mode enabled (example for python agent)

@lahsivjar
Contributor

I found a probable bug that might explain this case. Currently, the extension processes the shutdown event without flushing available data (ref). Due to this, it is possible that the transactions in the buffer are marked as failed (as the flushed event from the agent would be waiting in the buffer before shutdown is processed). I am working on a reproduction and a possible fix.
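The ordering problem described above can be sketched roughly as follows. All types and names here are hypothetical, not the extension's actual code; the point is only that draining the buffer before acting on the shutdown event means already-flushed transactions get forwarded instead of being mistaken for pending ones:

```go
package main

import "fmt"

// extension is a hypothetical stand-in for the Lambda extension's state:
// data the agent already flushed, waiting to be forwarded to APM Server.
type extension struct {
	buffered []string
}

// forwardBuffered drains and forwards everything the agent already flushed.
func (e *extension) forwardBuffered() []string {
	sent := e.buffered
	e.buffered = nil
	return sent
}

// onShutdown drains the buffer *before* processing the shutdown event.
// Anything still pending afterwards would be a genuinely unflushed
// transaction; here the buffer is empty, so no false failures remain.
func (e *extension) onShutdown() (forwarded, pending int) {
	forwarded = len(e.forwardBuffered())
	pending = len(e.buffered)
	return
}

func main() {
	e := &extension{buffered: []string{"txn-1", "txn-2"}}
	fwd, pend := e.onShutdown()
	fmt.Println(fwd, pend) // both buffered transactions forwarded, none pending
}
```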

@julianolf
Author

This is unexpected. However, I am not sure that marking all pending transactions as non-failure is optimal when we receive the spindown event. A pending transaction at shutdown would mean that the agent failed to successfully flush it. I think we need some more investigation before finalizing the correct solution.

I completely agree.

Can you share the following information:

  1. Which agent are you using? (language + version)

The Python version from the public Lambda Layer: arn:aws:lambda:us-east-1:267093732750:layer:elastic-apm-python-ver-6-18-0:1

  2. Agent logs with debug mode enabled (example for python agent)

I can see you are already working on a solution, but if the logs would still help I can link them here; just let me know.

@julianolf
Author

I found a probable bug that might explain this case. Currently, the extension processes the shutdown event without flushing available data (ref). Due to this, it is possible that the transactions in the buffer are marked as failed (as the flushed event from the agent would be waiting in the buffer before shutdown is processed). I am working on a reproduction and a possible fix.

Great! I will keep following along. Let me know if I can be of any help.

@lahsivjar
Contributor

@julianolf I have a fix in progress: https://github.com/elastic/apm-aws-lambda/pull/412/files. It seems to fix the case I mentioned earlier. Would it be possible for you to test the branch to check whether it fixes the issue you are facing too? You can follow the steps below:

  1. Clone the branch in https://github.com/elastic/apm-aws-lambda/pull/412/files
  2. Run make build && make zip; this produces a zip file at ./dist/*-linux-amd64.zip.
  3. Upload the zip as a new layer in AWS console and update your function to use the layer.

@julianolf
Author

@julianolf I have a fix in progress: https://github.com/elastic/apm-aws-lambda/pull/412/files. It seems to fix the case I mentioned earlier. Would it be possible for you to test the branch to check whether it fixes the issue you are facing too? You can follow the steps below:

  1. Clone the branch in https://github.com/elastic/apm-aws-lambda/pull/412/files
  2. Run make build && make zip; this produces a zip file at ./dist/*-linux-amd64.zip.
  3. Upload the zip as a new layer in AWS console and update your function to use the layer.

Of course, I'm doing it right away.

@julianolf
Author

@lahsivjar it seems it's fixed!

It took me quite some time to verify because the behavior is now a little different: the spindown event is no longer sent to Elastic, though it still appears in the CloudWatch logs. I wanted to make sure that timeout and failure events were still being tracked, and they are. So everything looks good.

I've published my test project here on GitHub in case you want to check it out yourself.

Thanks for looking at this in such a short time!

@lahsivjar
Contributor

Thanks for testing.

The spindown event is not being sent to Elastic anymore, it appears on CloudWatch logs though

Do you mean that transactions with transaction.result set to spindown are no longer sent to Elastic? If so, that is expected: it implies those transactions were in fact correctly reported by the python-agent all along, and it was the shutdown handling that caused them to be processed as proxy transactions.
