This guidance shows how to use the AWS Resource Groups Tagging API to find resources that carry a specific tag, pull additional information about those resources from the respective service APIs, and generate a JSON configuration file from which a CloudWatch dashboard with _reasonable_ metrics and alarms is built. Optionally, users can also deploy a central alarm dashboard to monitor alarms across their AWS Organization, an AWS Organization OU, or an arbitrary number of AWS accounts.
Figure 1: High level Deployment automation process for the Guidance
- A group of AWS Cloud resources continuously store related metrics in the Amazon CloudWatch data store.
- The user initiates the Guidance Resource Collector script that uses the config file.
- The Guidance Resource Collector fetches resources matching the config file from the AWS Resource Groups Tagging API.
- The Guidance Resource Collector saves resource data in a JSON file.
- The user initiates the AWS Cloud Development Kit (AWS CDK) to synthesize an AWS CloudFormation template. The template follows AWS monitoring best practices.
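The collection step above can be sketched in Python as follows. This is a minimal illustration and not the guidance's actual `resource_collector.py`: the function name `collect_tagged_resources` and the shape of the returned records are assumptions. The client is passed in so the function can be exercised without AWS credentials; with boto3 it would be created as `boto3.client("resourcegroupstaggingapi", region_name=...)`.

```python
from typing import Any, Dict, List


def collect_tagged_resources(
    client: Any, tag_key: str, tag_values: List[str]
) -> List[Dict[str, Any]]:
    """Return the ARN and tags of every resource whose tag_key has one of tag_values.

    `client` is expected to behave like a boto3 "resourcegroupstaggingapi" client.
    The output record layout ("arn", "tags") is a hypothetical choice for this sketch.
    """
    paginator = client.get_paginator("get_resources")
    tag_filters = [{"Key": tag_key, "Values": tag_values}]
    resources: List[Dict[str, Any]] = []
    # get_resources is paginated, so walk every page of results.
    for page in paginator.paginate(TagFilters=tag_filters):
        for mapping in page.get("ResourceTagMappingList", []):
            resources.append(
                {
                    "arn": mapping["ResourceARN"],
                    "tags": {t["Key"]: t["Value"] for t in mapping.get("Tags", [])},
                }
            )
    return resources
```

The returned list can then be written to the resource file with `json.dump`, once per Region of interest, because the Tagging API client is bound to a single Region.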
Figure 2: Deployment automation to generate and deploy the "Event Forwarder Stack" required for configuring the AWS accounts where the resources being monitored reside
- The user runs the `cdk deploy` command to generate the CloudFormation template and deploy the infrastructure within the designated “monitoring” account.
- The user records the output of the deployment, which contains the Amazon Resource Names (ARNs) of the central custom Amazon EventBridge event bus and the AWS Lambda function execution role.
- The user provides the ARNs obtained from the previous step to generate the CloudFormation template for the Event Forwarder Stack, which is required for configuring the source accounts.
- The user deploys the CloudFormation template for the Event Forwarder Stack to the intended source accounts, either individually or across multiple accounts and Regions, using CloudFormation StackSets.
Figure 3: Flow of events when a CloudWatch alert is triggered and processed by AWS Lambda functions
- An AWS Cloud resource sends a metric that breaches a threshold defined in a CloudWatch alarm.
- When the alarm is triggered, CloudWatch emits a “CloudWatch Alarm State Change” event on the EventBridge default bus within the respective account.
- An EventBridge Rule on the default bus forwards the event to the central custom EventBridge event bus.
- An EventBridge Rule defined within the central event bus dispatches the event to the “Event Handler” Lambda function, which analyzes the event.
- The “Event Handler” Lambda function assumes an AWS Identity and Access Management (IAM) role that has been deployed by the “Event Forwarder” CloudFormation stack set in the source account. It then queries the monitored resource and the CloudWatch alarm for additional details.
- The “Event Handler” Lambda function consolidates the additional details with the event and stores the combined information in an Amazon DynamoDB alarms table.
- The CloudWatch dashboard, which includes custom CloudWatch widgets, triggers the execution of two Lambda functions, “View” and “List”, on each dashboard refresh.
- The “View” and “List” Lambda functions retrieve and filter the alarm data, then generate HTML code for rendering within the respective CloudWatch custom widgets.
- The “View” and “List” Lambda functions return the HTML code to the CloudWatch widgets, which then render it, including the relevant metrics, on the CloudWatch user interface.
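As a rough illustration of the consolidation step in this flow, the sketch below merges a "CloudWatch Alarm State Change" event with alarm tags that were looked up separately. The event fields read here (`account`, `region`, `resources`, `detail.alarmName`, `detail.state`) are part of the standard EventBridge event for alarm state changes; the function name, the record attribute names, and the `"unset"` fallback are assumptions made for this example and do not describe the guidance's actual DynamoDB schema.

```python
from typing import Any, Dict


def build_alarm_item(event: Dict[str, Any], alarm_tags: Dict[str, str]) -> Dict[str, Any]:
    """Merge an alarm state change event with looked-up alarm tags into one record."""
    if event.get("detail-type") != "CloudWatch Alarm State Change":
        raise ValueError("not a CloudWatch alarm state change event")
    detail = event["detail"]
    return {
        # Hypothetical key: account, Region and alarm name identify the alarm.
        "alarmKey": f'{event["account"]}#{event["region"]}#{detail["alarmName"]}',
        "alarmArn": event["resources"][0],
        "stateValue": detail["state"]["value"],
        "stateTimestamp": detail["state"]["timestamp"],
        # Priority comes from the alarm's "priority" tag; "unset" is an assumed fallback.
        "priority": alarm_tags.get("priority", "").lower() or "unset",
    }
```

In a real handler, the tags would come from a call made with the assumed cross-account role, and the resulting record would be written to the alarms table.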
This guidance focuses on automating the monitoring of network services, although the code also supports the other AWS services listed above.
- Event-driven for scalability and speed
- Supports arbitrary source accounts within an AWS Organization (different teams can have their own dashboards)
- Supports automatic source account configuration through CloudFormation StackSets
- Supports visualization and sorting of alarm priority (CRITICAL, MEDIUM, LOW) through alarm tags in source accounts. Simply add a tag with the key `priority` and a value of `critical`, `medium`, or `low`.
- Supports tag data for EC2 instances in source accounts
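The priority ordering described in the list above could be implemented along these lines. This is a sketch only: the record layout (a dictionary with a `priority` field) and the choice to place untagged alarms last are assumptions, not the guidance's documented behavior.

```python
from typing import Any, Dict, List

# Lower rank sorts first: critical, then medium, then low.
PRIORITY_ORDER = {"critical": 0, "medium": 1, "low": 2}


def sort_alarms_by_priority(alarms: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return alarms ordered critical, medium, low; unknown or missing priority last."""

    def rank(alarm: Dict[str, Any]) -> int:
        value = str(alarm.get("priority", "")).lower()
        return PRIORITY_ORDER.get(value, len(PRIORITY_ORDER))

    # sorted() is stable, so alarms with equal priority keep their input order.
    return sorted(alarms, key=rank)
```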
The guidance consists of two distinct features: Metric Dashboards and the Alarm Dashboard.
Cost is driven mostly by the number of CloudWatch dashboards in the account (the first three dashboards are free) and by the number of alarms.
The guidance code tries to respect design best practices, convenience of use, and the hard limits of CloudWatch (no more than 500 widgets per dashboard), and creates additional dashboards to place widgets on when needed. Some configuration parameters, such as `GroupingTagKey` or `Compact` mode, may cause more dashboards to be created.
You can learn the estimated cost of the metric dashboard deployment by running `cdk synth`. The code constructs the CloudFormation template and, without deploying it, estimates the cost based on the number of dashboards and alarms generated. The estimated cost is printed on the screen.
These are the only cost drivers. The number of metrics or tagged resources does not affect the cost directly.
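To make the two cost drivers concrete, here is a small back-of-the-envelope sketch. It is not the estimator that `cdk synth` runs. The 500-widget limit and the three free dashboards come from the text above, and the $3.00 per charged dashboard comes from the sample cost table in this document. The per-alarm price is deliberately a parameter, because this document does not state it, and the sketch assumes the account has no other dashboards.

```python
import math

WIDGET_LIMIT = 500          # CloudWatch hard limit per dashboard (from the text)
FREE_DASHBOARDS = 3         # first three dashboards are free (from the text)
DASHBOARD_PRICE_USD = 3.00  # per charged dashboard (from the sample cost table)


def dashboards_needed(widget_count: int, max_widgets_per_dashboard: int = WIDGET_LIMIT) -> int:
    """Number of dashboards required to place widget_count widgets."""
    per_dashboard = min(max_widgets_per_dashboard, WIDGET_LIMIT)
    return math.ceil(widget_count / per_dashboard)


def monthly_cost_usd(dashboard_count: int, alarm_count: int, alarm_price_usd: float) -> float:
    """Estimated monthly cost; alarm_price_usd must be taken from current AWS pricing."""
    charged_dashboards = max(0, dashboard_count - FREE_DASHBOARDS)
    return charged_dashboards * DASHBOARD_PRICE_USD + alarm_count * alarm_price_usd


# Example: 1,200 widgets capped at 200 per dashboard need 6 dashboards, 3 of them charged.
print(dashboards_needed(1200, 200))
```

Lowering the widget cap improves dashboard usability but, as the example shows, increases the number of charged dashboards.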
The Alarm Dashboard is deployed as a serverless, event-driven architecture with an on-demand cost model. There are two main cost drivers:
- Alarms changing state: When an alarm changes state, an event is emitted and a workflow is triggered. The workflow forwards the event to a central monitoring account and runs a Lambda function that looks up additional context about the monitored resource and about the alarm itself, such as alarm tags, which make it possible to visualize the "priority" of an alarm. The function then stores the combined object in a DynamoDB table. The cost drivers here are the number of existing alarms and the frequency at which they change state.
- Alarm Dashboard refreshing: Depending on the "refresh" setting of the Alarm Dashboard, the dashboard invokes the two Lambda functions behind the two CloudWatch custom widgets. The functions fetch objects from the DynamoDB table and render HTML to display the alarms currently in alarm state and a list of alarms. The cost drivers here are the refresh setting, which drives the number of Lambda function invocations, and the object size, which drives the number of DynamoDB read request units. For the calculation below, the most pessimistic (expensive) refresh setting (10s) was used.
You are responsible for the cost of the AWS services used while running this guidance. As of April 2024, the cost for running this guidance with the default settings in the US East (N. Virginia) Region is approximately $1 per month, assuming 3,000 transactions.
This guidance uses Serverless services, which use a pay-for-value billing model. Costs are incurred with usage of the deployed resources. Refer to the Sample cost table for a service-by-service cost breakdown.
We recommend creating a budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this guidance.
The following table provides a sample cost breakdown for deploying this guidance with the default parameters in the `us-east-1` (US East - N. Virginia) Region for one month, assuming "non-production" level metrics volume.
AWS service | Dimensions | Cost [USD] |
---|---|---|
Amazon CloudWatch | 1 Charged dashboard | $ 3.00 |
Amazon DynamoDB | 1 GB data storage, Standard table class, on-demand capacity, 1 million writes/month, 2 million reads/month | $ 3.00 |
AWS Lambda | 618 400 requests per month with 3000 ms avg duration, 256 MB memory, 512 MB ephemeral storage | $ 7.85 |
Amazon EventBridge | 1 million custom events per month and 1 million cross region events | $ 2.00 |
Total estimated cost per month: | | $ 15.85 |
A sample cost breakdown for a production-scale load (10 000 alarms, each triggering 10 times a month) can be found in this AWS Pricing Calculator estimate and comes to around $15.85 USD/month.
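The Lambda request count in the table can be reproduced with simple arithmetic. The calculation below is one plausible reading and not a derivation stated by the authors: it assumes a 30-day month, a dashboard that stays open and refreshes every 10 seconds for the whole month, one invocation of each of the two widget functions per refresh, and one "Event Handler" invocation per alarm state change. The function names are hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumed 30-day month


def widget_invocations(refresh_seconds: int, widget_functions: int = 2) -> int:
    """Invocations of the dashboard widget functions over one month."""
    return widget_functions * (SECONDS_PER_MONTH // refresh_seconds)


def event_handler_invocations(alarm_count: int, state_changes_per_alarm: int) -> int:
    """One handler invocation per alarm state change event."""
    return alarm_count * state_changes_per_alarm


total = widget_invocations(10) + event_handler_invocations(10_000, 10)
print(total)  # 618400
```

Under these assumptions, dashboard refreshes account for 518 400 of the requests, so increasing the refresh interval is the most direct way to reduce the Lambda portion of the bill.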
Any operating system (Windows, Linux, macOS) that supports running Python 3 and NodeJS 18+.
These deployment instructions are optimized to work best on Amazon Linux 2023. Deployment on another OS may require additional steps.
For the easiest and fastest deployment experience, we recommend running the deployment from AWS CloudShell.
- Python 3 with boto3
- NodeJS 18+
- CDK V2 (`npm -g install aws-cdk@latest`)
This Guidance uses aws-cdk. If you are using aws-cdk for the first time, bootstrap your environment first:
cdk bootstrap
See more here: https://docs.aws.amazon.com/cdk/v2/guide/bootstrapping.html.
If you don't want to bootstrap, read Deploying without bootstrapping CDK.
Please refer to FULL IMPLEMENTATION GUIDE for detailed deployment instructions and options.
- The Alarm Dashboard is event driven. This means that cross-account and cross-Region alarms are not visible until they change state and thus "auto-register". This is a convenience and scalability feature. As a side effect, no alarms are visible initially; they appear as they are created or triggered.
- The Alarm Dashboard is a serverless, cross-account feature that uses IAM roles to allow least-privilege access. During operation, a component of the Alarm Dashboard collects additional information (describe-, list-tags-) about the affected resource to provide additional context. Use IAM Access Analyzer to refine permissions if higher confidentiality is required.
- Use the tag key "priority" with the values "critical", "medium", and "low" to set the prioritization of alarms when using the Alarm Dashboard.
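Setting the priority tag on an existing alarm could look like the sketch below. The helper name `set_alarm_priority` is hypothetical. The client is passed in so that the function can be tested without credentials; with boto3 it would be `boto3.client("cloudwatch")`, whose `tag_resource` call takes a `ResourceARN` and a list of `Tags`.

```python
from typing import Any

VALID_PRIORITIES = ("critical", "medium", "low")


def set_alarm_priority(cloudwatch_client: Any, alarm_arn: str, priority: str) -> str:
    """Tag a CloudWatch alarm with priority=<critical|medium|low> and return the value set."""
    value = priority.lower()
    if value not in VALID_PRIORITIES:
        raise ValueError(f"priority must be one of {VALID_PRIORITIES}, got {priority!r}")
    cloudwatch_client.tag_resource(
        ResourceARN=alarm_arn,
        Tags=[{"Key": "priority", "Value": value}],
    )
    return value
```

Because the Alarm Dashboard is event driven, a newly added tag would presumably only show up after the alarm's next state change.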
`BaseName` (String:required) - Base name of your dashboards. This will be the prefix of the dashboard names.
`ResourceFile` (String:required) - The path of the file where resources are stored. Used by `resource_collector.py` when generating the resource config and by the CDK when generating the CloudFormation template.
`TagKey` (String:required) - The tag key that selects the resources to be included.
`TagValues` (Array:required) - List of values of `TagKey` to include.
`Regions` (Array:required) - List of Regions from which resources are displayed.
`GroupingTagKey` (String:optional) - If set, separate Lambda and EC2 dashboards are created for every value of that tag, each grouping the resources that carry that value.
`CustomEC2TagKeys` (Array:optional) - If set, the tag info is shown in the EC2 header widget in the format Key:Value. Useful for adding auxiliary information to the header.
`CustomNamepsaceFile` (String:required) - Detected custom namespaces. Not yet used.
`Compact` (boolean (true/false):required) - When set to true, multiple Lambda functions are put in a single widget set. Useful when there are many Lambda functions.
`CompactMaxResourcesPerWidget` (Integer:required) - When `Compact` is set to true, determines how many Lambda functions are in each widget set.
`AlarmTopic` (String:optional) - When `AlarmTopic` contains the ARN of an SNS topic, all alarms are created with an action that sends a notification to that SNS topic.
`MaxWidgetsPerDashboard` (Integer:optional) - When `MaxWidgetsPerDashboard` is defined, it limits the number of widgets that are placed on a dashboard. Multiple dashboards may be created depending on the setting. This can be used to improve the performance and usability of the dashboards. (Currently only supports Network dashboards)
`AlarmDashboard.enabled` (boolean (true/false):optional) - When set to true, deploys the alarm dashboard in the account.
`AlarmDashboard.organizationId` (String: required when `AlarmDashboard.enabled` is true) - Required in order to set the resource policy on the custom event bus to allow PutEvents from the AWS Organization.
`MetricDashboards.enabled` (boolean (true/false):optional) - If not defined or set to true, metric dashboards are deployed. Setting it to false is recommended if only the alarm dashboard is being deployed.
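The parameters above can be combined into a configuration like the one built in this sketch, which also checks the "required" rules listed in the descriptions. All values are placeholders, the `validate_config` helper is hypothetical, and the nesting of `AlarmDashboard` and `MetricDashboards` as objects is an assumption inferred from the dotted parameter names. Parameter names are spelled exactly as they appear in the list.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = (
    "BaseName", "ResourceFile", "TagKey", "TagValues", "Regions",
    "CustomNamepsaceFile", "Compact", "CompactMaxResourcesPerWidget",
)


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in config; an empty list means it looks valid."""
    problems = [f"missing required key: {key}" for key in REQUIRED_KEYS if key not in config]
    alarm_dashboard = config.get("AlarmDashboard", {})
    if alarm_dashboard.get("enabled") and not alarm_dashboard.get("organizationId"):
        problems.append(
            "AlarmDashboard.organizationId is required when AlarmDashboard.enabled is true"
        )
    return problems


# Placeholder values for illustration only.
example_config = {
    "BaseName": "Application",
    "ResourceFile": "resources.json",
    "TagKey": "team",
    "TagValues": ["network"],
    "Regions": ["us-east-1"],
    "CustomNamepsaceFile": "custom_namespaces.json",
    "Compact": True,
    "CompactMaxResourcesPerWidget": 5,
    "MaxWidgetsPerDashboard": 200,
    "AlarmDashboard": {"enabled": True, "organizationId": "o-exampleorgid"},
    "MetricDashboards": {"enabled": True},
}

print(json.dumps(example_config, indent=2))
print(validate_config(example_config))
```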