Optimize AWS costs, focused on lower-traffic operation #2117

jpd236 · 2024-05-03T06:53:35Z

The current CloudFormation scheme scales nicely for higher-traffic events, but is overkill for the low traffic we see throughout the rest of the year. I'd like to reduce the costs to make keeping it running more palatable, even if that comes at the cost of some overall reliability.

I think the main win here would be to support running on a single EC2 instance so the NLB could be dropped, as that is the most expensive component. If we assume coturn is necessary for full voice chat reliability, I think we have no choice but to terminate TLS within the EC2 instance and manage our own certificate, since AWS Certificate Manager (ACM) can't be used directly within EC2. (If it were just HTTP(S) traffic, I think we could use CloudFront, which has a pretty large always-free tier.)

So for single instance, I believe we need to:

Provide a way to pass in a certificate, e.g. from Let's Encrypt. We could potentially try to get fancy and request+renew the certificate within JR, with automated configuration for the linked domain, and to support cycling the cert in the running server, but as a starting point I think I'm fine with letting it be managed externally and bouncing the server every 2-3 months on renewal. Our secure storage (credstash or otherwise) can provide the cert.
Update haproxy to use the cert when serving traffic on port 443.
Request an elastic IP and use it for the EC2 instance.
Update the DNS to point directly to the elastic IP; remove NLB and the scaling group.

As I understand it, this would eliminate a ~$16.20/month cost for the NLB. Data transfer costs look higher directly from EC2, but only after the first 100 GB, which is free.

From there, two other ideas I have for optimization:

Replace credstash with AWS Systems Manager Parameter Store. The KMS key that credstash depends on is $1/month. While Parameter Store uses KMS under the covers, I believe that the only charges for it are the per-request ones, which should be negligible here.
Reduce the number of metrics tracked in CloudWatch. I haven't actually enabled CloudWatch yet, so I'm less sure how this works, but I believe that every metric beyond the 10th is $0.30/month. IIUC, with a single instance, we'd have just 13 metrics, so $0.90/month, but we could probably drop the 3 least essential ones to make it free.
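
As a quick check of the CloudWatch arithmetic above, here is a minimal sketch. The free allowance of 10 metrics and the $0.30/month rate are the figures quoted in this comment, not verified pricing, and `monthly_metric_cost_cents` is a hypothetical helper name.

```python
# Assumptions taken from the comment above, not from a pricing page.
FREE_METRICS = 10             # metrics covered by the free tier
CENTS_PER_EXTRA_METRIC = 30   # $0.30/month for each metric beyond the free tier


def monthly_metric_cost_cents(metric_count: int) -> int:
    """Return the estimated monthly cost, in cents, for metric_count metrics."""
    billable = max(0, metric_count - FREE_METRICS)
    return billable * CENTS_PER_EXTRA_METRIC


if __name__ == "__main__":
    for count in (13, 10):
        cost = monthly_metric_cost_cents(count) / 100
        print(f"{count} metrics: ${cost:.2f}/month")
```

Working in integer cents avoids floating-point rounding when the figures are summed with other line items.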

Overall, I think that puts baseline costs at $3.60 for the IP + $0.50 for the Route 53 config + the EC2 costs ($8.35 for t2.micro) = $12.45/month. Higher than a simple Heroku instance, but much more reasonable than the ~$30 I think it would run to right now.
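
The estimate can be reproduced with a short sketch. All amounts are the ones quoted in this comment (assumptions, not verified pricing), and the dictionary keys are hypothetical labels chosen for illustration.

```python
# Monthly amounts in cents, as quoted in the comment above.
PROPOSED_BASELINE = {
    "elastic_ip": 360,
    "route53_config": 50,
    "ec2_t2_micro": 835,
}

# Items the proposal would remove, also as quoted above.
REMOVED_ITEMS = {
    "nlb": 1620,
    "kms_key_for_credstash": 100,
    "cloudwatch_extra_metrics": 90,
}


def total_cents(items: dict[str, int]) -> int:
    """Sum a mapping of line items expressed in cents."""
    return sum(items.values())


if __name__ == "__main__":
    proposed = total_cents(PROPOSED_BASELINE)
    current = proposed + total_cents(REMOVED_ITEMS)
    print(f"proposed: ${proposed / 100:.2f}/month")
    print(f"with removed items added back: ${current / 100:.2f}/month")
```

Adding the removed items back gives $30.55/month, which is in line with the ~$30 figure, though that reconstruction of the current bill is an inference from the numbers in this comment rather than something stated in it.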

I'm not super familiar with AWS so the above may have some flaws/misunderstandings, and/or there may be better ways of optimizing here. Thoughts welcome :)

ebroder · 2024-05-05T02:53:11Z

Overall, I'm broadly on board with this. I hit a mental block at "we can't use ACM anymore", so I'm glad you're pushing this forward.

A few specific notes:

I'm not wild about requiring manual cert uploads + kicks. That just sounds like a good way for things to break every 3 months. I'd much rather solve this with a startup script + a crontab - especially because we can assume in "low-cost mode" that we don't have to coordinate across multiple instances.
In a somewhat similar vein, I want to make sure that we don't lose the ASG integration, so that the infrastructure is mostly self-healing. That (somewhat annoyingly) means that we need to make sure DNS gets updated when the ASG replaces an instance for some reason. It's possible to wire up ASG events to an AWS Lambda, so we can use that to trigger a Route 53 update.
I kind of hate CloudWatch and I wouldn't be sad to dump it. A long time ago, we were using Datadog, which has a much more generous and useful free tier, but the Datadog agent had a tendency to cause us to run out of RAM on our instances, which motivated the switch. In general I find it useful to be able to see some CPU/memory stats for instance-sizing and load-management purposes, but I would be thrilled to find a better way to get it.
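
The ASG-to-Route 53 idea above could look roughly like the following sketch. It is not taken from the repository: the function names, the 60-second TTL, and the choice to pass the clients in as arguments are all assumptions made for illustration. The `ec2` and `route53` arguments are expected to behave like `boto3.client("ec2")` and `boto3.client("route53")`, and the event shape assumed is the "EC2 Instance Launch Successful" notification, which carries the instance ID in `detail.EC2InstanceId`.

```python
from typing import Any


def build_upsert_change_batch(record_name: str, ip_address: str, ttl: int = 60) -> dict[str, Any]:
    """Build a Route 53 ChangeBatch that points an A record at ip_address."""
    return {
        "Comment": "Updated after ASG instance launch",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,  # assumption: a short TTL so replacements propagate quickly
                    "ResourceRecords": [{"Value": ip_address}],
                },
            }
        ],
    }


def handle_launch_event(event: dict[str, Any], ec2: Any, route53: Any,
                        hosted_zone_id: str, record_name: str) -> str:
    """Look up the new instance's public IP and upsert the DNS record to match."""
    instance_id = event["detail"]["EC2InstanceId"]
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    ip_address = reservations[0]["Instances"][0]["PublicIpAddress"]
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=build_upsert_change_batch(record_name, ip_address),
    )
    return ip_address
```

Injecting the clients keeps the handler testable without AWS credentials. A real Lambda entry point would create the clients once at module load and read the hosted zone ID and record name from its environment.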

In "minimal" mode, only memory and swap usage are captured. At two metrics per instance, this usage should remain in the free tier (10 custom metrics total). Total CPU usage is available as a built-in EC2 metric, so this should give a reasonable picture of CPU/memory usage for basic monitoring. The additional metrics around detailed CPU, disk, and network remain available in the "detailed" mode (the default). This also fixes an issue with the indentation in the YAML when CloudWatch is enabled. We also have to be careful when creating the CloudWatch JSON config via string substitution, which unfortunately makes for a somewhat kludgier YAML file. See deathandmayhem#2117
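
For illustration, a two-metric CloudWatch agent config along these lines could be generated as below. This is a sketch, not the config from the change itself: the specific measurements (`mem_used_percent`, `swap_used_percent`) and the helper name `minimal_agent_config` are assumptions. It also uses `json.dumps` instead of the YAML string substitution described above, which is a different technique that sidesteps manual quoting.

```python
import json


def minimal_agent_config() -> str:
    """Return a CloudWatch agent config that collects one memory and one swap metric."""
    config = {
        "metrics": {
            "metrics_collected": {
                # Assumption: one measurement each, for two custom metrics per instance.
                "mem": {"measurement": ["mem_used_percent"]},
                "swap": {"measurement": ["swap_used_percent"]},
            }
        }
    }
    # Serializing a dict guarantees well-formed JSON, unlike hand-built strings.
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(minimal_agent_config())
```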

jpd236 mentioned this issue May 4, 2024

Move from credstash to AWS Systems Manager Parameter Store. #2126

Merged

jpd236 mentioned this issue May 6, 2024

Add the ability to disable load balancing to CloudFormation config. #2132

Merged

jpd236 mentioned this issue Jun 29, 2024

Enable a "minimal" mode for CloudWatch monitoring. #2197

Merged
