Optimize AWS costs, focused on lower-traffic operation #2117

Open
jpd236 opened this issue May 3, 2024 · 1 comment

Comments

@jpd236
Contributor

jpd236 commented May 3, 2024

The current CloudFormation scheme scales nicely for higher-traffic events, but is overkill for running with low traffic throughout the rest of the year. I'd like to reduce the costs to make keeping it running more palatable, even if that comes at the cost of some overall reliability.

I think the main win here would be to support running on a single EC2 instance so the NLB could be dropped, as that is the most expensive component. If we assume coturn is necessary for full voice chat reliability, I think we have no choice but to terminate TLS within the EC2 instance and manage our own certificate, since certificates from AWS Certificate Manager (ACM) can't be used directly within EC2. (If it were just HTTP(S) traffic, I think we could use CloudFront, which has a pretty large always-free tier.)

So for single instance, I believe we need to:

  • Provide a way to pass in a certificate, e.g. from Let's Encrypt. We could potentially try to get fancy and request+renew the certificate within JR, with automated configuration for the linked domain, and to support cycling the cert in the running server, but as a starting point I think I'm fine with letting it be managed externally and bouncing the server every 2-3 months on renewal. Our secure storage (credstash or otherwise) can provide the cert.
  • Update haproxy to use the cert when serving traffic on port 443.
  • Request an elastic IP and use it for the EC2 instance.
  • Update the DNS to point directly to the elastic IP; remove NLB and the scaling group.

As I understand it, this would eliminate a ~$16.20/month cost for the NLB. Data transfer costs look higher from EC2, but only after the first 100 GB, which is free.

From there, two other ideas I have for optimization:

  • Replace credstash with AWS Systems Manager Parameter Store. The KMS key that credstash depends on is $1/month. While the parameter store uses KMS under the covers, I believe that the only charges for it are the per-request ones which should be negligible here.

  • Reduce the number of metrics tracked in CloudWatch. I haven't actually enabled CloudWatch yet, so I'm less sure how this works, but I believe that every metric beyond the 10th is $0.30/month. IIUC, with a single instance, we'd just have 13 metrics, so $0.90/month, but we could probably drop the 3 least essential ones to make it free.

Overall, I think that puts baseline costs at $3.60 for the IP + $0.50 for the Route 53 config + the EC2 costs ($8.35 for t2.micro) = $12.45/month. Higher than a simple Heroku instance, but much more reasonable than the ~$30 I think it would run to right now.
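To make the arithmetic above easy to re-check, here is a rough sketch of the estimate. Every price is simply the figure quoted in this comment (not authoritative AWS pricing), and the constant and function names are made up for illustration. Amounts are kept in integer cents to avoid floating-point rounding.

```python
# Prices in cents per month, copied from the figures quoted in this issue.
# They are assumptions taken from the comment, not verified AWS pricing.
ELASTIC_IP_CENTS = 360
ROUTE53_CONFIG_CENTS = 50
EC2_T2_MICRO_CENTS = 835
NLB_CENTS = 1620

FREE_CUSTOM_METRICS = 10
EXTRA_METRIC_CENTS = 30


def cloudwatch_metric_cents(metric_count: int) -> int:
    """Cost of custom metrics, charging only for those beyond the free 10."""
    billable = max(0, metric_count - FREE_CUSTOM_METRICS)
    return billable * EXTRA_METRIC_CENTS


def baseline_cents(metric_count: int = FREE_CUSTOM_METRICS) -> int:
    """Single-instance baseline: elastic IP + Route 53 config + EC2 + metrics."""
    return (
        ELASTIC_IP_CENTS
        + ROUTE53_CONFIG_CENTS
        + EC2_T2_MICRO_CENTS
        + cloudwatch_metric_cents(metric_count)
    )


def dollars(cents: int) -> str:
    """Format an integer number of cents as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"


if __name__ == "__main__":
    print("baseline with 10 metrics:", dollars(baseline_cents(10)))
    print("baseline with 13 metrics:", dollars(baseline_cents(13)))
```

With 10 or fewer metrics this reproduces the $12.45/month total; keeping all 13 metrics would add the $0.90 mentioned above.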

I'm not super familiar with AWS so the above may have some flaws/misunderstandings, and/or there may be better ways of optimizing here. Thoughts welcome :)

@ebroder
Member

ebroder commented May 5, 2024

Overall, I'm broadly on board with this. I hit a mental block at "we can't use ACM anymore", so I'm glad you're pushing this forward.

A few specific notes:

  • I'm not wild about requiring manual cert uploads + kicks. That just sounds like a good way for things to break every 3 months. I'd much rather solve this with a startup script + a crontab - especially because we can assume in "low-cost mode" that we don't have to coordinate across multiple instances.
  • In a somewhat similar vein, I want to make sure that we don't lose the ASG integration, so that the infrastructure is mostly self-healing. That (somewhat annoyingly) means that we need to make sure DNS gets updated when the ASG replaces an instance for some reason. It's possible to wire up ASG events to an AWS Lambda, so we can use that to trigger a Route53 update.
  • I kind of hate CloudWatch and I wouldn't be sad to dump it. A long time ago, we were using Datadog, which has a much more generous and useful free tier, but the Datadog agent had a tendency to cause us to run out of RAM on our instances, which motivated the switch. In general I find it useful to be able to see some CPU/memory stats for instance-sizing and load-management purposes, but I would be thrilled to find a better way to get it.
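As a rough illustration of the ASG-to-Route 53 idea in the second note, a Lambda handler might look something like the sketch below. It assumes the function is triggered by the ASG's "EC2 Instance Launch Successful" event (which carries the instance ID in `detail.EC2InstanceId`), and that `ec2` and `route53` are boto3 clients passed in by the caller. The function names, the hosted zone ID, the record name, and the 60-second TTL are all placeholders, not anything taken from the existing CloudFormation template.

```python
def build_upsert(record_name: str, ip_address: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that points an A record at ip_address."""
    return {
        "Comment": "Point DNS at the instance most recently launched by the ASG",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": ip_address}],
                },
            }
        ],
    }


def handle_launch_event(event, ec2, route53, hosted_zone_id, record_name):
    """Look up the new instance's public IP and upsert the DNS record.

    `ec2` and `route53` are expected to behave like boto3 clients; they are
    injected so the logic can be exercised without AWS credentials.
    """
    instance_id = event["detail"]["EC2InstanceId"]
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    ip_address = reservations[0]["Instances"][0]["PublicIpAddress"]
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=build_upsert(record_name, ip_address),
    )
    return ip_address
```

In a real Lambda, a thin wrapper would create the clients with `boto3.client("ec2")` and `boto3.client("route53")` and read the zone ID and record name from environment variables. A short TTL keeps the window of stale DNS small after a replacement, at the cost of more lookups.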

jpd236 added a commit to jpd236/jolly-roger that referenced this issue Jun 29, 2024
In this mode, only memory and swap usage is captured. At two metrics per
instance, this usage should remain in the free tier (10 custom metrics
total). Total CPU usage is available as a built-in metric in EC2, so
this should give a reasonable picture of CPU/memory usage for basic
monitoring.

The additional metrics around detailed CPU, disk, and network are
available in the "detailed" mode (the default).

This also fixes an issue with the indentation in the YAML when
CloudWatch is enabled. We also have to be careful when creating the
CloudWatch JSON config via string substitution, which unfortunately
makes for a somewhat kludgier YAML file.

See deathandmayhem#2117