This is a sample Step Functions CDK construct for ECS RunTask, which automatically retries the RunTask step when ECS or your application returns a retryable error.
In practice, ECS RunTask sometimes fails due to temporary errors such as ResourceInitializationError or CannotPullContainerError (see Stopped tasks error codes).
Because these errors are not returned synchronously when you call the RunTask API, you must check the task result after starting a task. If any retryable errors are found, you might want to run the task again.
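As a rough illustration of what checking the task result involves, the following Python sketch pulls the stop code, stopped reason, and per-container exit codes out of a task description. It assumes the dictionary shape returned by the ECS DescribeTasks API through boto3 (stopCode, stoppedReason, containers[].exitCode). The function name summarize_stopped_task is hypothetical and is not part of this sample.

```python
from typing import Any, Dict


def summarize_stopped_task(task: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the fields that tell you why a stopped ECS task ended.

    `task` is assumed to be one element of the `tasks` list in a
    DescribeTasks response (camelCase keys, as returned by boto3).
    """
    containers = task.get("containers", [])
    return {
        "stop_code": task.get("stopCode"),
        "stopped_reason": task.get("stoppedReason"),
        # exitCode is absent when a container never started (e.g. image pull failure)
        "exit_codes": {c.get("name"): c.get("exitCode") for c in containers},
    }


# Hypothetical example of a task that failed before its container started
example_task = {
    "stopCode": "TaskFailedToStart",
    "stoppedReason": "CannotPullContainerError: pull image manifest failed",
    "containers": [{"name": "app"}],
}
summary = summarize_stopped_task(example_task)
```

A missing exit code together with a stopped reason usually means the failure happened on the ECS side rather than in your application, which is why both values are worth inspecting.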
This sample shows how to automate the above process with Step Functions. (See the How it works section.)
The following sections first cover how to deploy and use the sample.
NOTE: This sample's architecture is for demonstration purposes only. For your own workloads, you might need to investigate other errors or stopped reasons and revise the error handler code accordingly.
Before deploying this sample, you must install the AWS Cloud Development Kit prerequisites. Please refer to this document for detailed instructions. Make sure you've successfully completed the cdk bootstrap step.
After that, clone this repository and cd to its root directory.
First, install the Node.js dependencies for the CDK code with the following command:
npm ci
Now you can deploy this sample's stack with the following command:
npx cdk deploy --require-approval never
The initial deployment usually takes about 5 minutes.
After a successful deployment, you can check the ARN of the Step Functions state machine.
To see how this Step Functions state machine works, you can execute it with the following AWS CLI command:
aws stepfunctions start-execution --state-machine-arn STATE_MACHINE_ARN
Replace STATE_MACHINE_ARN with the actual ARN shown when you deploy the CDK stack.
ECS errors don't happen very frequently, which makes it difficult to test the behavior.
For testing purposes, we also regard exit code 2 from the application as a retryable error, so you can test the error handling behavior by modifying the application code. (See app/main.py.)
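To make the exit code convention concrete, here is a minimal sketch of an application entry point that reports a retryable failure with exit code 2. It is not the contents of app/main.py. The TransientError class, the do_work function, and the SIMULATE_TRANSIENT_ERROR environment variable are all hypothetical names introduced for illustration.

```python
import os

RETRYABLE_EXIT_CODE = 2  # the exit code this sample treats as retryable


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def do_work() -> None:
    # Hypothetical switch: set SIMULATE_TRANSIENT_ERROR=1 to force a retryable failure.
    if os.environ.get("SIMULATE_TRANSIENT_ERROR") == "1":
        raise TransientError("simulated transient failure")


def main() -> int:
    """Run the job and translate its outcome into a process exit code."""
    try:
        do_work()
    except TransientError:
        return RETRYABLE_EXIT_CODE  # ask the state machine to retry
    except Exception:
        return 1  # any other failure is treated as fatal
    return 0
```

In a real container entry point you would finish with sys.exit(main()) so that the return value becomes the container exit code that the error handler inspects.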
The image below shows a diagram of our Step Functions state machine.
First, it executes ECS RunTask using the Run a Job integration pattern (see Service Integration Patterns). With this pattern, Step Functions automatically waits until the task finishes, and then returns the task result, including stopped reasons and container exit codes.
If there are no errors, it simply publishes a success notification to Amazon SNS. If an error occurs, it invokes a Lambda function to handle the error. The function does the following:
- Determine whether the error is retryable. It checks the stopped reasons and the application exit code.
- Calculate the current retry count and how long to wait before the next retry, using an exponential backoff algorithm.
If the error is retryable and the current retry count is below the limit, the state machine executes ECS RunTask again. Otherwise, it regards the error as fatal and sends a notification to Amazon SNS.
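The decision logic described above can be sketched in Python as follows. This is an illustrative sketch, not the sample's actual Lambda handler. The retryable markers are the two ECS errors named earlier in this README plus exit code 2; the function names, the 10-second base delay, and the 300-second cap are assumptions chosen for the example.

```python
from typing import Any, Dict, Optional

# Stopped-reason prefixes treated as retryable (taken from the errors named in this README)
RETRYABLE_REASON_MARKERS = ("ResourceInitializationError", "CannotPullContainerError")
RETRYABLE_EXIT_CODE = 2  # application exit code treated as retryable


def is_retryable(stopped_reason: Optional[str], exit_code: Optional[int]) -> bool:
    """Return True if the stopped reason or the exit code indicates a retryable error."""
    if exit_code == RETRYABLE_EXIT_CODE:
        return True
    reason = stopped_reason or ""
    return any(marker in reason for marker in RETRYABLE_REASON_MARKERS)


def backoff_seconds(retry_count: int, base: int = 10, cap: int = 300) -> int:
    """Exponential backoff: base * 2^retry_count, capped (base and cap are assumed values)."""
    return min(cap, base * 2 ** retry_count)


def decide(stopped_reason: Optional[str], exit_code: Optional[int],
           retry_count: int, max_retries: int) -> Dict[str, Any]:
    """Decide whether to run the task again and how long to wait first."""
    if is_retryable(stopped_reason, exit_code) and retry_count < max_retries:
        return {
            "retry": True,
            "retry_count": retry_count + 1,
            "wait_seconds": backoff_seconds(retry_count),
        }
    # Not retryable, or the retry limit has been reached: treat as fatal
    return {"retry": False, "retry_count": retry_count, "wait_seconds": 0}
```

With these assumed defaults, the wait doubles on each attempt (10, 20, 40 seconds, and so on) until it reaches the cap. A production handler might also add random jitter so that many failing tasks do not retry at the same moment.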
You can specify the maximum retry count in RetryableRunTaskProps.
To avoid incurring future charges, clean up the resources you created.
You can remove all the AWS resources deployed by this sample by running the following command:
npx cdk destroy --force
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.