[WIP] More resilient setup #4276

Open
marcoacierno opened this issue Dec 25, 2024 · 0 comments

As we now rely on AWS for both the frontend and the backend, both hosted on a single server in a single region, we are more vulnerable to AWS outages, our own mistakes, and similar failures.

It might make sense to look into what we can do to reduce the risks.

A few notes / past experiences:

  1. Large numbers of requests exhausting the DB connections. This happened a few times when someone spammed us with requests, but it also happened because Lambda scaled up to handle the load. With our EC2 setup the server will likely give up sooner. It might be good to do some load testing. (A connection-pooling sketch follows after this list.)
  2. What are the components that are essential during the conference?
    1. Schedule: people need to be able to see the schedule. In a scenario where our server is offline, we could set up a static version of the schedule site that updates every X minutes/hours, hosted on an S3 bucket + CloudFront under a domain like schedule.static.pycon.it. We could also consider other hosting (if we want to survive the unlikely full AWS outage). (A sketch follows after this list.)
    2. Tickets: mostly for t-shirts and social events. Not much we can do here other than recreating the server in a new AZ.
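
For point 1, one mitigation sketch: assuming the DB is RDS Postgres (an assumption on my side, as are all resource and variable names below), an RDS Proxy between the app and the database pools and caps connections, so a traffic spike or a Lambda scale-up queues at the proxy instead of exhausting the DB:

```hcl
# Hypothetical sketch: RDS Proxy pooling connections in front of the DB.
# aws_iam_role.proxy, aws_secretsmanager_secret.db, etc. are assumed names.
resource "aws_db_proxy" "main" {
  name           = "pycon-db-proxy"
  engine_family  = "POSTGRESQL"
  role_arn       = aws_iam_role.proxy.arn # role allowed to read the DB secret
  vpc_subnet_ids = var.private_subnet_ids

  auth {
    auth_scheme = "SECRETS"
    secret_arn  = aws_secretsmanager_secret.db.arn # DB credentials
  }
}

resource "aws_db_proxy_default_target_group" "main" {
  db_proxy_name = aws_db_proxy.main.name

  connection_pool_config {
    max_connections_percent      = 90 # never hand out more than 90% of DB connections
    max_idle_connections_percent = 10
  }
}

resource "aws_db_proxy_target" "main" {
  db_proxy_name          = aws_db_proxy.main.name
  target_group_name      = aws_db_proxy_default_target_group.main.name
  db_instance_identifier = aws_db_instance.main.identifier
}
```

The app would then connect to the proxy endpoint instead of the DB host; excess connections wait at the proxy rather than erroring.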
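
For the static schedule in point 2.1, a minimal sketch of the S3 + CloudFront side (bucket name, certificate variable, and the job that regenerates the HTML are all assumptions):

```hcl
resource "aws_s3_bucket" "static_schedule" {
  bucket = "schedule-static-pycon-it" # assumed bucket name
}

resource "aws_cloudfront_distribution" "static_schedule" {
  enabled             = true
  default_root_object = "index.html"
  aliases             = ["schedule.static.pycon.it"]

  origin {
    # Bucket access setup (public policy or OAC) omitted from this sketch.
    domain_name = aws_s3_bucket.static_schedule.bucket_regional_domain_name
    origin_id   = "s3-static-schedule"
  }

  default_cache_behavior {
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
    target_origin_id       = "s3-static-schedule"
    viewer_protocol_policy = "redirect-to-https"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    acm_certificate_arn = var.schedule_cert_arn # assumed: ACM cert in us-east-1
    ssl_support_method  = "sni-only"
  }
}
```

A scheduled job (EventBridge, cron, or even a GitHub Action) would render the schedule to HTML and sync it into the bucket every X minutes.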

We should investigate:

  • Migrating from a static server to an auto scaling group. Our current setup is an EC2 instance created via Terraform and attached to the ECS cluster. This is very limiting: we can't easily replace the EC2 instance if it fails, and we can't automate replacement. (See the ASG sketch after this list.)
  • Setting up our cluster and task definition so we can "scale up" our tasks using Fargate capacity. This would let us scale up when needed via Fargate rather than manually provisioning EC2 instances, and it should also be much faster. (See the capacity provider sketch below.)
  • Seeing if we can reduce the Traefik single point of failure. To save costs we don't run an ALB in front and instead rely on Traefik to do a similar job, but this means that updating Traefik causes downtime, and if an update crashes it we are down. All because we can't have two services listening on port 80, so we can't run multiple Traefik instances at once. (A DNS-based sketch follows.)
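
A sketch of the auto scaling group idea, assuming the ECS-optimized Amazon Linux 2 AMI and the existing cluster resource (all names illustrative):

```hcl
data "aws_ami" "ecs_optimized" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-ecs-hvm-*-x86_64-ebs"]
  }
}

resource "aws_launch_template" "ecs" {
  name_prefix   = "pycon-ecs-"
  image_id      = data.aws_ami.ecs_optimized.id
  instance_type = "t3.small" # assumed size

  # Register the instance with the existing ECS cluster on boot.
  user_data = base64encode(<<-EOF
    #!/bin/bash
    echo "ECS_CLUSTER=${aws_ecs_cluster.main.name}" >> /etc/ecs/ecs.config
  EOF
  )
}

resource "aws_autoscaling_group" "ecs" {
  name                = "pycon-ecs"
  min_size            = 1
  max_size            = 3
  desired_capacity    = 1
  vpc_zone_identifier = var.subnet_ids # spread across AZs

  launch_template {
    id      = aws_launch_template.ecs.id
    version = "$Latest"
  }
}
```

If the instance dies, the ASG replaces it and the replacement registers itself with the cluster via the user data, which is exactly what we can't automate today.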
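
Building on that, a sketch that wires the ASG and Fargate into the cluster as capacity providers, so the first task stays on EC2 and overflow lands on Fargate:

```hcl
resource "aws_ecs_capacity_provider" "asg" {
  name = "pycon-asg"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.ecs.arn

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 100 # let ECS grow the ASG to fit pending tasks
    }
  }
}

resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name       = aws_ecs_cluster.main.name
  capacity_providers = [aws_ecs_capacity_provider.asg.name, "FARGATE"]

  # Keep the first task on EC2, overflow onto Fargate.
  default_capacity_provider_strategy {
    base              = 1
    weight            = 1
    capacity_provider = aws_ecs_capacity_provider.asg.name
  }

  default_capacity_provider_strategy {
    weight            = 1
    capacity_provider = "FARGATE"
  }
}
```

Note that tasks would need awsvpc networking and a FARGATE-compatible task definition for the Fargate half to work.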
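
For the Traefik single point of failure, one ALB-free option: run two Traefik instances on separate hosts and spread traffic with Route 53 multivalue answer records plus health checks, so a crashed or mid-update instance drops out of DNS. A rough sketch for one of the two records (the /ping path assumes Traefik's ping endpoint is enabled, and the EIP name is made up):

```hcl
resource "aws_route53_health_check" "traefik_a" {
  ip_address        = aws_eip.traefik_a.public_ip # assumed EIP of instance A
  port              = 80
  type              = "HTTP"
  resource_path     = "/ping" # assumes Traefik's ping endpoint is exposed
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "traefik_a" {
  zone_id                          = var.zone_id
  name                             = "pycon.it"
  type                             = "A"
  ttl                              = 60
  set_identifier                   = "traefik-a"
  multivalue_answer_routing_policy = true
  health_check_id                  = aws_route53_health_check.traefik_a.id
  records                          = [aws_eip.traefik_a.public_ip]

  # A second, identical record ("traefik-b") would point at the other instance.
}
```

DNS failover is slower than an ALB at removing a dead backend (clients cache answers for the TTL), but it sidesteps the "only one process can bind port 80" constraint entirely.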