As we now rely on AWS for both FE and BE, hosted on a single server in a single region, we are more vulnerable to AWS outages, our own mistakes and similar issues.
It might make sense to look into what we can do to reduce the risks.
A few notes / past experiences:
A large number of requests can exhaust the DB connections. This happened a few times when someone spammed us with requests, but it also happened because Lambda scaled out to handle the number of requests. With our EC2 setup, the server will likely give up before that point. It might be good to do some load testing.
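As a starting point for the load testing mentioned above, here is a minimal sketch using only the Python standard library. It fires a fixed number of concurrent GET requests and counts the status codes that come back. The function names, the defaults and the target URL are all assumptions for illustration; a dedicated load testing tool would give better numbers, and this should only ever be pointed at a staging environment.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code of a GET request, or "error" on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as exceptions; we still want the code.
        return exc.code
    except Exception:
        # Timeouts, refused connections, DNS failures and similar.
        return "error"


def run_load_test(url, total_requests=200, concurrency=20, fetch=fetch_status):
    """Send total_requests GETs using `concurrency` worker threads.

    Returns a Counter of status codes and the elapsed time in seconds.
    `fetch` is injectable so the logic can be checked without a network.
    """
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        statuses = list(pool.map(fetch, [url] * total_requests))
    elapsed = time.monotonic() - started
    return Counter(statuses), elapsed
```

Calling `run_load_test("https://<staging-host>/graphql", total_requests=500, concurrency=50)` (the path is hypothetical) and watching for a growing share of 5xx or "error" results while raising `concurrency` would give a rough idea of where the server or the DB connection pool gives up.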
What are the components that are essential during the conference?
Schedule: People need to be able to see the schedule. For a scenario where our server is offline, we could set up a static version of the schedule website that updates every X minutes/hours, host it on an S3 bucket + CloudFront, and serve it under a domain like schedule.static.pycon.it. We could consider some other hosting (if we want to withstand the unlikely AWS outages).
Tickets: Mostly for T-shirts and social events. There is not much we can do other than creating the server in a new AZ.
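The static schedule idea above could be sketched roughly as follows. This is only an illustration: the shape of the talk data (`start`, `room`, `title`), the function names, the object key and the cache lifetime are assumptions, not something taken from our codebase. The upload step uses boto3's `put_object` and would run from a cron job or scheduled task every X minutes.

```python
import html


def render_schedule(talks, generated_at):
    """Render a list of talk dicts into a small standalone HTML page.

    Each talk is assumed to look like:
    {"start": "2024-05-23T09:30", "room": "Room A", "title": "Keynote"}
    """
    rows = []
    # ISO 8601 strings in the same format sort chronologically as plain text.
    for talk in sorted(talks, key=lambda t: t["start"]):
        rows.append(
            "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                html.escape(talk["start"]),
                html.escape(talk["room"]),
                html.escape(talk["title"]),
            )
        )
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        "<title>Schedule</title></head><body>"
        "<p>Last updated: {}</p>"
        "<table><tr><th>Start</th><th>Room</th><th>Title</th></tr>{}</table>"
        "</body></html>"
    ).format(html.escape(generated_at), "".join(rows))


def upload_schedule(page, bucket, key="index.html"):
    """Upload the rendered page to S3 (bucket name is supplied by the caller)."""
    import boto3  # imported lazily so rendering works without boto3 installed

    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=page.encode("utf-8"),
        ContentType="text/html; charset=utf-8",
        CacheControl="max-age=300",  # assumed 5 minute cache lifetime
    )
```

Keeping the page free of external assets means it stays usable even if everything else we host is down, and the "Last updated" line tells attendees how stale the copy might be.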
We should investigate:
Migrate from a static server to an Auto Scaling group. Our current setup is an EC2 instance created via Terraform that is attached to the ECS cluster. This solution is very limiting, as we can't easily replace the EC2 instance if it fails and we can't automate the replacement.
Set up our cluster and task definition so we can "scale up" our tasks using Fargate capacity. This would allow us to scale up if needed by using Fargate instead of manually setting up EC2 instances. It should also be much faster.
See if we can reduce the Traefik single point of failure. To save costs we are not using an ALB in front and are instead relying on Traefik to do a similar job, but this means that if we need to update Traefik we will have downtime, and if we are updating it and it crashes we will be down. All of this is because we can't have 2 services listening on port 80, so we can't start Traefik multiple times.
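For the Fargate item in the list above, an emergency "scale up" could look something like the sketch below, using boto3's ECS `update_service` call. It rests on several assumptions that we would need to verify first: the cluster has the `FARGATE` capacity provider associated, the task definition is Fargate-compatible (for example `awsvpc` network mode), and the cluster and service names are placeholders supplied by the caller. The helper names are hypothetical.

```python
def fargate_scale_up_request(cluster, service, desired_count):
    """Build the keyword arguments for ecs.update_service.

    Kept separate from the API call so the request can be inspected
    (or reviewed) before anything is changed in AWS.
    """
    return {
        "cluster": cluster,
        "service": service,
        "desiredCount": desired_count,
        # Assumption: run every task of this service on Fargate capacity.
        "capacityProviderStrategy": [
            {"capacityProvider": "FARGATE", "weight": 1},
        ],
        # Changing the capacity provider strategy needs a new deployment.
        "forceNewDeployment": True,
    }


def scale_up_on_fargate(cluster, service, desired_count):
    """Apply the request built above to a live ECS service."""
    import boto3  # imported lazily so the builder works without boto3

    ecs = boto3.client("ecs")
    return ecs.update_service(
        **fargate_scale_up_request(cluster, service, desired_count)
    )
```

Whether we switch the whole service to Fargate or only add Fargate as overflow capacity next to the EC2 instance is a design choice this sketch does not settle; it only shows that the switch can be a single scripted call rather than a manual EC2 setup.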