[WIP] More resilient setup #4276

Open
marcoacierno opened this issue Dec 25, 2024 · 0 comments

As we now rely on AWS for both the frontend and the backend, both hosted on a single server in a single region, we are more vulnerable to AWS outages, our own mistakes, and similar failures.

It might make sense to look into what we can do to reduce the risks.

A few notes / past experiences:

  1. Large numbers of requests exhausting the DB connections. This happened a few times when someone spammed us with requests, but it also happened because Lambda scaled up to handle the load. With our EC2 setup the server will likely give up sooner. It might be good to do some load testing. (A connection-pooling sketch follows after this list.)
  2. What are the components that are essential during the conference?
    1. Schedule: people need to be able to see the schedule. In a scenario where our server is offline, we could set up a static version of the schedule site that updates every X minutes/hours, hosted on an S3 bucket + CloudFront under a domain like schedule.static.pycon.it. We could also consider other hosting (if we want to survive the unlikely full AWS outage). (A sketch follows after this list.)
    2. Tickets: mostly for t-shirts and social events. Not much we can do here other than recreating the server in a new AZ.
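
For point 1, one mitigation sketch: assuming the DB is RDS Postgres (an assumption on my side, as are all resource and variable names below), an RDS Proxy between the app and the database pools and caps connections, so a traffic spike or a Lambda scale-up queues at the proxy instead of exhausting the DB:

```hcl
# Hypothetical sketch: RDS Proxy pooling connections in front of the DB.
# aws_iam_role.proxy, aws_secretsmanager_secret.db, etc. are assumed names.
resource "aws_db_proxy" "main" {
  name           = "pycon-db-proxy"
  engine_family  = "POSTGRESQL"
  role_arn       = aws_iam_role.proxy.arn # role allowed to read the DB secret
  vpc_subnet_ids = var.private_subnet_ids

  auth {
    auth_scheme = "SECRETS"
    secret_arn  = aws_secretsmanager_secret.db.arn # DB credentials
  }
}

resource "aws_db_proxy_default_target_group" "main" {
  db_proxy_name = aws_db_proxy.main.name

  connection_pool_config {
    max_connections_percent      = 90 # never hand out more than 90% of DB connections
    max_idle_connections_percent = 10
  }
}

resource "aws_db_proxy_target" "main" {
  db_proxy_name          = aws_db_proxy.main.name
  target_group_name      = aws_db_proxy_default_target_group.main.name
  db_instance_identifier = aws_db_instance.main.identifier
}
```

The app would then connect to the proxy endpoint instead of the DB host; excess connections wait at the proxy rather than erroring.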
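
For the static schedule in point 2.1, a minimal sketch of the S3 + CloudFront side (bucket name, certificate variable, and the job that regenerates the HTML are all assumptions):

```hcl
resource "aws_s3_bucket" "static_schedule" {
  bucket = "schedule-static-pycon-it" # assumed bucket name
}

resource "aws_cloudfront_distribution" "static_schedule" {
  enabled             = true
  default_root_object = "index.html"
  aliases             = ["schedule.static.pycon.it"]

  origin {
    # Bucket access setup (public policy or OAC) omitted from this sketch.
    domain_name = aws_s3_bucket.static_schedule.bucket_regional_domain_name
    origin_id   = "s3-static-schedule"
  }

  default_cache_behavior {
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
    target_origin_id       = "s3-static-schedule"
    viewer_protocol_policy = "redirect-to-https"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    acm_certificate_arn = var.schedule_cert_arn # assumed: ACM cert in us-east-1
    ssl_support_method  = "sni-only"
  }
}
```

A scheduled job (EventBridge, cron, or even a GitHub Action) would render the schedule to HTML and sync it into the bucket every X minutes.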

We should investigate:

  • Migrating from a static server to an auto scaling group. Our current setup is an EC2 instance created via Terraform and attached to the ECS cluster. This is very limiting: we can't easily replace the EC2 instance if it fails, and we can't automate replacement. (See the ASG sketch after this list.)
  • Setting up our cluster and task definition so we can "scale up" our tasks using Fargate capacity. This would let us scale up when needed via Fargate rather than manually provisioning EC2 instances, and it should also be much faster. (See the capacity provider sketch below.)
  • Seeing if we can reduce the Traefik single point of failure. To save costs we don't run an ALB in front and instead rely on Traefik to do a similar job, but this means that updating Traefik causes downtime, and if an update crashes it we are down. All because we can't have two services listening on port 80, so we can't run multiple Traefik instances at once. (A DNS-based sketch follows.)
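
A sketch of the auto scaling group idea, assuming the ECS-optimized Amazon Linux 2 AMI and the existing cluster resource (all names illustrative):

```hcl
data "aws_ami" "ecs_optimized" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-ecs-hvm-*-x86_64-ebs"]
  }
}

resource "aws_launch_template" "ecs" {
  name_prefix   = "pycon-ecs-"
  image_id      = data.aws_ami.ecs_optimized.id
  instance_type = "t3.small" # assumed size

  # Register the instance with the existing ECS cluster on boot.
  user_data = base64encode(<<-EOF
    #!/bin/bash
    echo "ECS_CLUSTER=${aws_ecs_cluster.main.name}" >> /etc/ecs/ecs.config
  EOF
  )
}

resource "aws_autoscaling_group" "ecs" {
  name                = "pycon-ecs"
  min_size            = 1
  max_size            = 3
  desired_capacity    = 1
  vpc_zone_identifier = var.subnet_ids # spread across AZs

  launch_template {
    id      = aws_launch_template.ecs.id
    version = "$Latest"
  }
}
```

If the instance dies, the ASG replaces it and the replacement registers itself with the cluster via the user data, which is exactly what we can't automate today.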
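
Building on that, a sketch that wires the ASG and Fargate into the cluster as capacity providers, so the first task stays on EC2 and overflow lands on Fargate:

```hcl
resource "aws_ecs_capacity_provider" "asg" {
  name = "pycon-asg"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.ecs.arn

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 100 # let ECS grow the ASG to fit pending tasks
    }
  }
}

resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name       = aws_ecs_cluster.main.name
  capacity_providers = [aws_ecs_capacity_provider.asg.name, "FARGATE"]

  # Keep the first task on EC2, overflow onto Fargate.
  default_capacity_provider_strategy {
    base              = 1
    weight            = 1
    capacity_provider = aws_ecs_capacity_provider.asg.name
  }

  default_capacity_provider_strategy {
    weight            = 1
    capacity_provider = "FARGATE"
  }
}
```

Note that tasks would need awsvpc networking and a FARGATE-compatible task definition for the Fargate half to work.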
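
For the Traefik single point of failure, one ALB-free option: run two Traefik instances on separate hosts and spread traffic with Route 53 multivalue answer records plus health checks, so a crashed or mid-update instance drops out of DNS. A rough sketch for one of the two records (the /ping path assumes Traefik's ping endpoint is enabled, and the EIP name is made up):

```hcl
resource "aws_route53_health_check" "traefik_a" {
  ip_address        = aws_eip.traefik_a.public_ip # assumed EIP of instance A
  port              = 80
  type              = "HTTP"
  resource_path     = "/ping" # assumes Traefik's ping endpoint is exposed
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "traefik_a" {
  zone_id                          = var.zone_id
  name                             = "pycon.it"
  type                             = "A"
  ttl                              = 60
  set_identifier                   = "traefik-a"
  multivalue_answer_routing_policy = true
  health_check_id                  = aws_route53_health_check.traefik_a.id
  records                          = [aws_eip.traefik_a.public_ip]

  # A second, identical record ("traefik-b") would point at the other instance.
}
```

DNS failover is slower than an ALB at removing a dead backend (clients cache answers for the TTL), but it sidesteps the "only one process can bind port 80" constraint entirely.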