mit-dci/terraform-aws-opencbdc-tctl

Introduction

This repository provides Infrastructure as Code (IaC) to replicate the environment used to produce the results of the research paper, and it can serve as a starting point if you want to do the same. All the necessary resources are created in Amazon Web Services (AWS) via Terraform. The Terraform configuration is wrapped into a single module that leverages a number of sub-modules. The root module primarily deploys the OpenCBDC test controller along with numerous supporting resources. You can follow the steps of this README in order to deploy the test controller. If you are new to Terraform, when you reach Provision, it is recommended that you use the pre-created configuration linked there as the entrypoint for your deployment.

Architecture

This module deploys the test controller via an AWS ECS task. The ECS service can be configured to use either EC2 instances or Fargate. The main function of the test controller is to schedule agent processes across one to three regions for testing Project Hamilton's architectures. Agent processes are scheduled on AWS EC2 instances and provisioned via EC2 launch templates. The test controller is configured to provision in the us-east-1 region. A subset of resources is replicated in the us-east-2 and us-west-2 regions in order to schedule multi-regional test runs. A VPC is provisioned in each of these three regions, along with VPC peering connections and VPC endpoints for internal communication between resources.

A pipeline is set up via AWS Codepipeline which clones the test controller's source code, then builds and pushes several artifacts: a container image for the test controller, a container image used to seed the environment with data for test runs, and the binary used to schedule agents during test runs. Both container images are pushed to AWS ECR registries, and the agent binary is pushed to an S3 bucket.

Seeding initial outputs is handled via an AWS Batch job that, when necessary, is scheduled by the test controller before a test run. An AWS Batch compute environment, job definition, and job queue are all provisioned by default to support this. Upon being scheduled, agent instances pull the agent binary from S3, then execute it to communicate with the test controller and receive instructions. This process is defined in the agents' EC2 launch template.

Two AWS Network Load Balancers are deployed by the module. One forwards traffic to the test controller's UI; the other supports communication between agents and the test controller. Upon completion of a test run, results are sent to AWS OpenSearch via an Amazon Kinesis Firehose delivery stream.

A bastion host is provided for troubleshooting the environment as well as pulling down raw test data if you wish to gather your own insights. To access the bastion host, you can either use SSH, which is configured by this module, or AWS Session Manager.

Diagram

Required Software

The module requires that you have Terraform installed. Version requirements are listed under Requirements. The AWS CLI is also useful, though not strictly necessary. If you have other Terraform projects with different version requirements, you can manage them with tfenv. This project is pre-configured to pull the proper Terraform version via tfenv; simply run tfenv install. Docker must be installed and running on your local machine. You won't need to run any Docker commands; just be sure that it's running. If you're unfamiliar with Docker and curious, you can take a look at their getting started page.

Pre-Provision

Generate and Add an SSH Key

This module requires you to provide an SSH public key, which will be used to generate an Amazon EC2 key pair. AWS can use either ED25519 or 2048-bit SSH-2 RSA keys. There are a number of third-party tools that can be used to generate an appropriate key pair. One way is via the ssh-keygen command provided by OpenSSH.

$ ssh-keygen -t RSA -f /path/to/key/file/id_rsa

Installation for OpenSSH will depend on the OS of your machine.

  • On macOS, OpenSSH should be installed by default.
  • On Windows, you may need to follow additional steps.
  • On Ubuntu/Debian/Linux Mint:
$ sudo apt-get install openssh-client
  • On RHEL/CentOS/Fedora:
$ sudo yum -y install openssh-clients

After doing so, provide the contents of the public key file (id_rsa.pub) to the module's ec2_public_key var. The SSH private key should remain private.

Register a Domain

New Domain - Currently, the test controller requires that you own a domain with a registrar and a hosted zone configured in Route53. The name of the hosted zone should be set as the base_domain var, and the necessary DNS records will be created by this Terraform module. If you don't currently own a domain, you can purchase one via the Route53 registrar; doing so creates a hosted zone in Route53 automatically. This is our recommended approach.

BYO Domain - If you already own a domain that you wish to use, you can do so; however, you'll still need to create a hosted zone in Route53. The module output route53_endpoints.name_servers will provide a list of name servers associated with the hosted zone. Use these to delegate DNS resolution for the domain to Route53. Usually this is done by creating an NS record wherever the base domain is hosted. For BYO domains, we recommend using a sub-domain (test.foo.com) as base_domain rather than the apex domain (foo.com), and delegating name server resolution to Route53 for that sub-domain only. This module will create several certificates in AWS Certificate Manager which use DNS for validation. Be sure that the delegation for your base domain is in place before you run terraform apply, or else the certificates will fail to validate.
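
The name server list mentioned above can be surfaced from your own root configuration. This is a minimal sketch, assuming the module call is labeled opencbdc_tctl (a hypothetical label; substitute whatever label your configuration uses):

```hcl
# Hypothetical module label "opencbdc_tctl"; adjust to match your module call.
output "name_servers" {
  description = "Name servers to use in the NS record where the base domain is hosted"
  value       = module.opencbdc_tctl.route53_endpoints.name_servers
}
```

After an apply, running terraform output name_servers should print the list to copy into the NS record.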


Note - Depending on where your domain is registered, the certbot lambda may fail because of the response from your authoritative DNS server. Specifically, if a query for CAA records returns SERVFAIL instead of NOERROR, the lambda will exit with an error, as Let's Encrypt does not accept this response. If you see an error message in the lambda logs along the lines of Detail: DNS problem: SERVFAIL looking up CAA for <domain-name> - the domain's nameservers may be malfunctioning, this is likely the case. To fix this, you can add a CAA record under the sub-domain in Route53. The record should be formatted like <sub-domain-name> CAA 0 issue "letsencrypt.org".

BYO Network

This module includes all the necessary networking resources for the test controller to communicate with agents across three regions. It also supports the ability to integrate with an existing network topology if you happen to have one. To use your own, set the flag create_networking=false in your call to the module. You will then be required to set inputs for the network resources that you wish to connect.
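
As an illustration of the BYO network inputs, the fragment below sketches a module call with create_networking disabled. The input names come from the Inputs table; the module label, the source address, and every ID shown are placeholders, and the exact value format expected by s3_interface_endpoint_use1 is not stated in this README, so treat that line as an assumption to verify.

```hcl
# Sketch only: module label, source address, and all IDs are placeholders.
module "opencbdc_tctl" {
  source = "github.com/mit-dci/terraform-aws-opencbdc-tctl" # assumed source address

  create_networking = false

  # us-east-1 network; repeat the same pattern with the _use2 and _usw2 inputs.
  vpc_id_use1                = "vpc-0123456789abcdef0"
  vpc_azs_use1               = ["us-east-1a", "us-east-1b"]
  public_subnets_use1        = ["subnet-0aaaaaaaaaaaaaaa1", "subnet-0aaaaaaaaaaaaaaa2"]
  private_subnets_use1       = ["subnet-0bbbbbbbbbbbbbbb1", "subnet-0bbbbbbbbbbbbbbb2"]
  route_tables_use1          = ["rtb-0ccccccccccccccc1"]
  s3_interface_endpoint_use1 = "<s3-interface-endpoint>" # expected format not documented here

  # Remaining required inputs (environment, lets_encrypt_email,
  # test_controller_github_access_token) and provider mappings omitted for brevity.
}
```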

Generate and Add GitHub Access Tokens

To properly deploy the test controller, access must be granted to several repos, which can be managed via personal access tokens. One token is required for the input var test_controller_github_access_token, which is used to clone the test controller repo you choose to use. Once deployed, this module will create a pipeline in AWS Codepipeline which builds and pushes several container images related to the test controller. To do this, Codepipeline clones the test controller codebase, so it must be connected to a GitHub account that can clone the repo. Additionally, you may need to set transaction_processor_github_access_token, depending on the permissions of the transaction processor repo you wish to clone. If that repo is not publicly available, the token needs read permissions for it. Depending on the repos you use, these two inputs may take the same token or require different ones.
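
One way to keep the tokens out of version control is to declare them as sensitive variables in your root configuration and pass them through to the module. This is a sketch rather than the repository's own configuration; the variable names simply mirror the module inputs, and the module label is hypothetical.

```hcl
# Root-level variables mirroring the module inputs; marked sensitive so
# Terraform redacts them in plan and apply output.
variable "test_controller_github_access_token" {
  type      = string
  sensitive = true
}

variable "transaction_processor_github_access_token" {
  type      = string
  default   = "" # only needed when the transaction processor repo is private
  sensitive = true
}

module "opencbdc_tctl" { # hypothetical label
  source = "github.com/mit-dci/terraform-aws-opencbdc-tctl" # assumed source address

  test_controller_github_access_token       = var.test_controller_github_access_token
  transaction_processor_github_access_token = var.transaction_processor_github_access_token

  # Other inputs omitted for brevity.
}
```

The values can then be supplied through the environment variables TF_VAR_test_controller_github_access_token and TF_VAR_transaction_processor_github_access_token instead of being written to a file.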

Configure IAM Permissions

Terraform will require permission to access multiple services in AWS. Permissions in AWS are managed via the IAM service. Generally speaking, you want to provide the smallest set of permissions possible to a role. This is known as the Principle of Least Privilege. Since Terraform will be interacting with such a wide array of services to deploy the test controller, for simplicity you can grant Administrator Access. This can be attached to an IAM user that Terraform authenticates as. If you'd like to restrict Terraform's access with a fine-toothed comb, however, you certainly can.

Provision

This repo contains Terraform configuration mirroring that of the research paper here. This is intended to serve as your main entrypoint for your deployment. Deployment instructions are located here. If you want to configure the environment for your own tests this module provides a number of inputs for doing so.
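
If you write your own root configuration instead of using the pre-created one, it might look roughly like the sketch below. The version constraints, provider aliases, and input names are taken from the Requirements, Providers, and Inputs sections; the module label, source address, region-to-alias mapping, and all values are assumptions for illustration.

```hcl
terraform {
  required_version = "= 1.4.6"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.4"
    }
  }
}

# Assumed mapping: default provider in us-east-1, aliases for the other two regions.
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "use2"
  region = "us-east-2"
}

provider "aws" {
  alias  = "usw2"
  region = "us-west-2"
}

variable "test_controller_github_access_token" {
  type      = string
  sensitive = true
}

module "opencbdc_tctl" { # hypothetical label
  source = "github.com/mit-dci/terraform-aws-opencbdc-tctl" # assumed source address

  providers = {
    aws      = aws
    aws.use2 = aws.use2
    aws.usw2 = aws.usw2
  }

  # Placeholder values; replace with your own.
  environment                         = "my-tctl-env"
  base_domain                         = "test.foo.com"
  lets_encrypt_email                  = "admin@foo.com"
  ec2_public_key                      = file(pathexpand("~/.ssh/id_rsa.pub"))
  test_controller_github_access_token = var.test_controller_github_access_token
}
```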

Post-Provision

Invoke the Certbot Lambda

The test controller requires an SSL certificate to allow for client connections via HTTPS. This module will provision a Lambda capable of generating an appropriate cert issued via Let's Encrypt. The lambda is configured to fire every twelve hours to check that the cert has not expired. If you wish to run tests in your environment immediately after provisioning, you will need to invoke the lambda yourself. You can do this via the AWS CLI. Using the credentials you configured for your environment, run:

$ aws lambda invoke --region us-east-1 --function-name test-controller-certbot-lambda /dev/stdout


Note - The lambda usually takes a few minutes to complete its execution.
Note - The lambda will create a certificate in AWS Certificate Manager. This is not tied to the Terraform automation, so you will need to delete it manually after running terraform destroy. You should delete it only after you've destroyed everything else. To do so, select the certificate with the test controller domain name test-controller.<base_domain> and hit "delete".

Monitor Codepipeline

The test controller pipeline should run automatically. All pipeline phases must succeed before you can run any tests. You can verify this by checking the most recent execution status of test-controller-pipeline in the AWS Codepipeline service.

Diagram

Codepipeline will poll for the latest changes to the test controller repo. This way you will receive updates automatically without any manual intervention. Occasionally, Codepipeline may fail during the deployment process. These are usually transient errors which will resolve by simply running the pipeline again. Using the credentials you configured for your environment, run:

$ aws codepipeline start-pipeline-execution --region us-east-1 --name test-controller-pipeline

Monitor Health Checks

Both the test controller's UI and API exist inside of a single ECS task. The task must be running and healthy before you can schedule test runs in your environment. Three sets of target groups are configured against the task, one as an entrypoint for agents, one for authentication, and one for the test controller's UI. The task will be scheduled under the test-controller service, which belongs to a cluster with the same name as whatever the Terraform var environment is set to. It's easiest to verify these in the AWS console. When the environment is healthy, these services should look like the following:

Running ECS Task

Healthy Target Group

Access the Test Controller

The module will generate some DNS records in AWS Route53 for you. A CNAME record is created in Route53 which points to the UI load balancer. The format of this will be test-controller.<base_domain>. The environment and base_domain values will be set to whatever you configured for the corresponding Terraform vars. Assuming your environment is up and configured properly, you should be able to access the test controller by entering the URL into any browser. In a fresh environment, you will need to add a client certificate into the environment in order to authenticate with the test controller. The process for this is documented in the test controller's README.
Note - This module configures port 8443 to route to the auth endpoint via the network load balancer. This means the port must be specified in the URL you enter into the browser: https://test-controller.<base_domain>:8443/auth. The appropriate record is also provided as the output route53_endpoints.ui_endpoint.

Configure OpenSearch Permissions

In order for Amazon Kinesis Firehose to push to OpenSearch, you will need to configure permissions for it inside of the OpenSearch cluster. You can do so via the following steps:

  1. Log in to the OpenSearch dashboard. You can find the URL under General information of your cluster.
  2. From the navigation pane, choose Security.
  3. Choose Roles.
  4. Search for the all_access role.
  5. Choose the Mapped users tab.
  6. On the Mapped users dialog page, choose Manage mapping.
  7. Under Backend roles, enter the role ARN created for Amazon Kinesis Firehose.
  8. Choose Map. Your Amazon Kinesis Firehose should now be able to forward data to your OpenSearch Service domain.

Additionally, you may want to give the all_access permission to your admin user and to the IAM user you use to access the AWS console. This makes cluster attributes visible from the console.

Request Limit Increases (Optional)

Some plots shown in the paper require a great deal of compute power to reproduce. The default quotas for EC2 instances set on AWS accounts will likely be insufficient in some cases. The test controller will schedule instances using available vCPUs based on the service quota API, meaning it will run what it can instead of reporting errors. To reproduce entire plots, you will need to submit limit increase requests for several EC2 service quotas. Specifically:

| Quota Name | us-east-1 | us-east-2 | us-west-2 |
| --- | --- | --- | --- |
| All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests | 32,000 | 32,000 | 32,000 |
| Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances | 32,000 | 32,000 | 32,000 |
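
If you would rather request these increases through Terraform than through the console, the AWS provider's aws_servicequotas_service_quota resource can be used. The sketch below covers us-east-1 only and is not part of this module. The quota codes shown are assumptions that should be checked against the Service Quotas console for your account before applying.

```hcl
# Sketch only, not part of this module. Quota codes are assumed; verify them
# in the Service Quotas console before applying.
resource "aws_servicequotas_service_quota" "spot_standard_use1" {
  service_code = "ec2"
  quota_code   = "L-34B43A08" # assumed: All Standard Spot Instance Requests
  value        = 32000
}

resource "aws_servicequotas_service_quota" "on_demand_standard_use1" {
  service_code = "ec2"
  quota_code   = "L-1216C47A" # assumed: Running On-Demand Standard instances
  value        = 32000
}

# For us-east-2 and us-west-2, repeat the resources with a provider argument
# pointing at an aliased AWS provider for that region.
```

Increase requests are reviewed by AWS, so the new limits may not take effect immediately after the apply completes.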

Requirements

| Name | Version |
| --- | --- |
| terraform | = 1.4.6 |
| aws | >= 4.4 |

Providers

| Name | Version |
| --- | --- |
| aws | >= 4.4 |
| aws.use2 | >= 4.4 |
| aws.usw2 | >= 4.4 |

Modules

| Name | Source | Version |
| --- | --- | --- |
| bastion | ./modules/bastion | n/a |
| ec2_profile | terraform-aws-modules/ecs/aws//modules/ecs-instance-profile | 3.0.0 |
| ecs | terraform-aws-modules/ecs/aws | 3.5.0 |
| ecs_cluster_asg | terraform-aws-modules/autoscaling/aws | 3.9.0 |
| ecs_cluster_security_group | terraform-aws-modules/security-group/aws | 3.1.0 |
| opensearch | ./modules/opensearch | n/a |
| route53_dns | ./modules/route53_dns | n/a |
| test_controller_agent_use1 | ./modules/test-controller-agent | n/a |
| test_controller_agent_use2 | ./modules/test-controller-agent | n/a |
| test_controller_agent_usw2 | ./modules/test-controller-agent | n/a |
| test_controller_deploy | ./modules/test-controller-deploy | n/a |
| test_controller_service | ./modules/test-controller | n/a |
| uhs_seed_generator | ./modules/uhs-seed-generator | n/a |
| vpc | terraform-aws-modules/vpc/aws | 5.0.0 |
| vpc_endpoints_use1 | ./modules/vpc-endpoints | n/a |
| vpc_endpoints_use2 | ./modules/vpc-endpoints | n/a |
| vpc_endpoints_usw2 | ./modules/vpc-endpoints | n/a |
| vpc_peering_connection_use1_use2 | ./modules/vpc-peering-connection | n/a |
| vpc_peering_connection_use1_usw2 | ./modules/vpc-peering-connection | n/a |
| vpc_peering_connection_use2_usw2 | ./modules/vpc-peering-connection | n/a |
| vpc_use2 | terraform-aws-modules/vpc/aws | 5.0.0 |
| vpc_usw2 | terraform-aws-modules/vpc/aws | 5.0.0 |

Resources

| Name | Type |
| --- | --- |
| aws_cloudwatch_log_group.agents_use1 | resource |
| aws_cloudwatch_log_group.agents_use2 | resource |
| aws_cloudwatch_log_group.agents_usw2 | resource |
| aws_iam_service_linked_role.ecs | resource |
| aws_s3_bucket.agent_outputs | resource |
| aws_s3_bucket.binaries | resource |
| aws_s3_bucket_server_side_encryption_configuration.agent_outputs | resource |
| aws_s3_bucket_server_side_encryption_configuration.binaries | resource |
| aws_s3_bucket_versioning.binaries | resource |
| aws_availability_zones.use1 | data source |
| aws_availability_zones.use2 | data source |
| aws_availability_zones.usw2 | data source |
| aws_caller_identity.current | data source |
| aws_region.current | data source |
| aws_ssm_parameter.ecs_optimized_ami | data source |

Inputs

| Name | Description | Type | Default | Required |
| --- | --- | --- | --- | --- |
| agent_instance_types | The instance types used in agent launch templates. | list(string) | ["c5n.large", "c5n.2xlarge", "c5n.9xlarge", "c5n.metal"] | no |
| base_domain | Base domain to use for ACM Cert and Route53 record management. | string | "" | no |
| cert_arn | A custom ACM cert arn to use; only valid when create_networking is false. | string | "" | no |
| cluster_instance_type | If test controller launch type is EC2, the instance size to use. | string | "c5ad.12xlarge" | no |
| create_certbot_lambda | Boolean to create the certbot lambda to update the Let's Encrypt cert for the test controller. | bool | true | no |
| create_networking | Flag to create VPCs and related resources | string | true | no |
| create_opensearch | Boolean to create OpenSearch domain and related resources | bool | false | no |
| create_uhs_seed_generator | Determines whether or not to create uhs seed generator resources | bool | true | no |
| ec2_public_key | SSH public key to use in EC2 instances. | string | "" | no |
| environment | AWS tag to indicate environment name of each infrastructure object. | string | n/a | yes |
| fire_hose_buffering_interval | Interval time between sending Firehose buffer data to OpenSearch | number | 60 | no |
| fire_hose_index_rotation_period | The Elasticsearch index rotation period. Index rotation appends a timestamp to the IndexName to facilitate expiration of old data. | string | "OneDay" | no |
| hosted_zone_id | Id of hosted zone in Route53 | string | null | no |
| lambda_build_in_docker | Determines whether or not to build certbot lambda function in docker. | bool | true | no |
| lets_encrypt_email | Email to associate with Let's Encrypt certificate | string | n/a | yes |
| opensearch_ebs_volume_size | Size of EBS volume to back OpenSearch domain | string | "10" | no |
| opensearch_ebs_volume_type | Type of EBS volume to back OpenSearch domain | string | "gp2" | no |
| opensearch_engine_version | The engine version to use for the OpenSearch domain | string | "OpenSearch_1.3" | no |
| opensearch_instance_count | Number of instances to include in OpenSearch domain | string | "1" | no |
| opensearch_instance_type | Instance type used for OpenSearch cluster | string | "r6g.large.search" | no |
| opensearch_master_user_name | Master username of OpenSearch user | string | "admin" | no |
| opensearch_master_user_password | Master password of OpenSearch user | string | "" | no |
| opensearch_route53_record_ttl | TTL for CNAME record of OpenSearch domain | string | "600" | no |
| private_subnet_tags | Tags associated with private subnets | map(string) | {} | no |
| private_subnets_use1 | Private subnets in VPC us-east-1 (required if create_networking==false) | list(string) | null | no |
| private_subnets_use2 | Private subnets in VPC us-east-2 (required if create_networking==false) | list(string) | null | no |
| private_subnets_usw2 | Private subnets in VPC us-west-2 (required if create_networking==false) | list(string) | null | no |
| public_subnet_tags | Tags associated with public subnets | map(string) | {} | no |
| public_subnets_use1 | Public subnets in VPC us-east-1 (required if create_networking==false) | list(string) | null | no |
| public_subnets_use2 | Public subnets in VPC us-east-2 (required if create_networking==false) | list(string) | null | no |
| public_subnets_usw2 | Public subnets in VPC us-west-2 (required if create_networking==false) | list(string) | null | no |
| resource_tags | Tags to set for all resources | map(string) | {} | no |
| route_tables_use1 | Route tables in VPC us-east-1 (required if create_networking==false) | list(string) | null | no |
| route_tables_use2 | Route tables in VPC us-east-2 (required if create_networking==false) | list(string) | null | no |
| route_tables_usw2 | Route tables in VPC us-west-2 (required if create_networking==false) | list(string) | null | no |
| s3_interface_endpoint_use1 | S3 endpoint for VPC in us-east-1 (required if create_networking==false) | string | null | no |
| s3_interface_endpoint_use2 | S3 endpoint for VPC in us-east-2 (required if create_networking==false) | string | null | no |
| s3_interface_endpoint_usw2 | S3 endpoint for VPC in us-west-2 (required if create_networking==false) | string | null | no |
| subnet_prefix_extension | CIDR block bits extension to calculate CIDR blocks of each subnetwork. | number | 4 | no |
| test_controller_app_container_base_image | An optional custom container base image for the test controller and related services | string | "ubuntu:20.04" | no |
| test_controller_cpu | The ECS task CPU | string | "4096" | no |
| test_controller_github_access_token | Access token for cloning test controller repo | string | n/a | yes |
| test_controller_github_repo | The Github repo base name | string | "opencbdc-tctl" | no |
| test_controller_github_repo_branch | The repo branch to use for the Test Controller deployment pipeline. | string | "trunk" | no |
| test_controller_github_repo_owner | The Github repo owner | string | "mit-dci" | no |
| test_controller_golang_container_build_image | An optional custom container build image for test controller Golang dependencies | string | "golang:1.16" | no |
| test_controller_health_check_grace_period_seconds | The ECS service health check grace period in seconds | number | 300 | no |
| test_controller_launch_type | The ECS task launch type to run the test controller. | string | "FARGATE" | no |
| test_controller_memory | The ECS task memory | string | "30720" | no |
| test_controller_node_container_build_image | An optional custom container build image for test controller Nodejs dependencies | string | "node:14" | no |
| transaction_processor_github_access_token | Access token for the transaction repo if permissions are required | string | "" | no |
| transaction_processor_main_branch | Main branch of transaction repo | string | "trunk" | no |
| transaction_processor_repo_url | Transaction repo cloned by the test controller for load generation logic | string | "https://github.com/mit-dci/opencbdc-tx.git" | no |
| uhs_seed_generator_batch_job_timeout | Timeout for a seed generator batch job | string | 1209600 | no |
| uhs_seed_generator_job_memory | Memory required for a seed generator batch job | string | "8192" | no |
| uhs_seed_generator_job_vcpu | Vcpus required for a seed generator batch job | string | "4" | no |
| uhs_seed_generator_max_vcpus | Max vcpus allocatable to the seed generator environment | string | "50" | no |
| use1_main_network_block | Base CIDR block to be used in us-east-1. | string | "10.0.0.0/16" | no |
| use2_main_network_block | Base CIDR block to be used in us-east-2. | string | "10.10.0.0/16" | no |
| usw2_main_network_block | Base CIDR block to be used in us-west-2. | string | "10.20.0.0/16" | no |
| vpc_azs_use1 | AZs of VPC in us-east-1 (required if create_networking==false) | list(string) | null | no |
| vpc_azs_use2 | AZs of VPC in us-east-2 (required if create_networking==false) | list(string) | null | no |
| vpc_azs_usw2 | AZs of VPC in us-west-2 (required if create_networking==false) | list(string) | null | no |
| vpc_id_use1 | ID of VPC in us-east-1 (required if create_networking==false) | string | null | no |
| vpc_id_use2 | ID of VPC in us-east-2 (required if create_networking==false) | string | null | no |
| vpc_id_usw2 | ID of VPC in us-west-2 (required if create_networking==false) | string | null | no |
| zone_offset | CIDR block bits extension offset to calculate Public subnets, avoiding collisions with Private subnets. | number | 8 | no |

Outputs

| Name | Description |
| --- | --- |
| azs_use1 | Availability zones used by VPC located in us-east-1 region |
| azs_use2 | Availability zones used by VPC located in us-east-2 region |
| azs_usw2 | Availability zones used by VPC located in us-west-2 region |
| ecs_cluster_id | ECS cluster id |
| ecs_cluster_name | ECS cluster name |
| private_subnets_use1 | Private subnet Ids associated with VPC in us-east-1 region |
| private_subnets_use2 | Private subnet Ids associated with VPC in us-east-2 region |
| private_subnets_usw2 | Private subnet Ids associated with VPC in us-west-2 region |
| public_subnets_use1 | Public subnet Ids associated with VPC in us-east-1 region |
| public_subnets_use2 | Public subnet Ids associated with VPC in us-east-2 region |
| public_subnets_usw2 | Public subnet Ids associated with VPC in us-west-2 region |
| route53_endpoints | Route53 endpoints generated by test controller services |
| s3_vpc_interface_endpoint_use1 | S3 service interface endpoint associated with VPC in us-east-1 region |
| s3_vpc_interface_endpoint_use2 | S3 service interface endpoint associated with VPC in us-east-2 region |
| s3_vpc_interface_endpoint_usw2 | S3 service interface endpoint associated with VPC in us-west-2 region |
| vpc_id_use1 | Id of VPC in us-east-1 region |
| vpc_id_use2 | Id of VPC in us-east-2 region |
| vpc_id_usw2 | Id of VPC in us-west-2 region |