
😈 zero-downtime deployment (EN)


Implementing Zero-Downtime Deployment for Coding Duo

Motivation for Adopting Zero-Downtime Deployment

Every time we deploy, there is a period of downtime during which the system is unavailable, which degrades the user experience. For this reason, we adopted zero-downtime deployment so that new versions can be released without interrupting the service.

Criteria for Choosing a Zero-Downtime Deployment Method

There are several methods for zero-downtime deployment, with the most common being:

  • Rolling
  • Blue/Green
  • Canary

Hereโ€™s a brief explanation of each deployment method:

Rolling deployment is a method that gradually shifts traffic from the old version to the new version.

Blue/Green deployment is a method that shifts traffic all at once from the old version to the new version.

Canary deployment is similar to rolling deployment in that it gradually shifts traffic from the old version to the new version. However, it operates the new version for a limited user base over a certain period to ensure there are no issues before shifting all traffic to the new version.

To state the conclusion first, Coding Duo chose Blue/Green deployment implemented with Nginx and Docker.

I will explain the process of deciding to implement Blue/Green deployment by comparing it with our previous deployment methods.

Existing Infrastructure Structure

image

The above image illustrates the approximate operational infrastructure of our service.

We run two web servers in total, one on each of two EC2 instances, Instance A and Instance B.

Deployment proceeds sequentially: Instance A is deployed first, and Instance B follows once A has finished.

Even while Instance A is down for its deployment, Instance B remains operational, so this setup can be considered a Rolling style of zero-downtime deployment.

Existing Coding Duo Deployment Method

Let's take a look at how EC2 instances are managed in our existing CI/CD process.

First, a self-hosted runner in GitHub Actions replaces the old web server Docker container on the first EC2 instance (production A) with the new version.

image

After the deployment on production A is complete, the CD script for the second instance (production B) is executed. Production B waits until the web server of production A passes a health check before beginning its own deployment to minimize downtime.

image

The following image shows that production B has received a health check response from A, and the deployment of production B has been completed.

image
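As a rough illustration of that waiting step (the health endpoint URL, timeout, and polling interval below are assumptions for the sketch, not the actual CD script), production B's script might poll production A like this before starting its own deployment:

```python
import time
import urllib.request

# Hypothetical endpoint and limits used only for illustration.
PRODUCTION_A_HEALTH_URL = "http://production-a.internal:8080/health"
POLL_INTERVAL = 5    # seconds between checks
TIMEOUT = 300        # give up after 5 minutes


def wait_for_production_a() -> bool:
    """Block until production A's web server passes its health check, or time out."""
    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(PRODUCTION_A_HEALTH_URL, timeout=3) as res:
                if res.status == 200:
                    return True
        except OSError:
            pass  # A is still deploying; keep waiting.
        time.sleep(POLL_INTERVAL)
    return False


if __name__ == "__main__":
    if not wait_for_production_a():
        raise SystemExit("Production A never became healthy; aborting B's deployment.")
    print("Production A is healthy; starting the deployment on production B.")
```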

However, as seen in the following deployment process, there are some issues with the Rolling deployment.

image

Because the deployment of Instance B depends on Instance A deploying successfully, there is a window in which Instance A is already serving the new version while Instance B is still serving the old one, which can lead to compatibility issues between the two versions.

Additionally, while one instance is being deployed, all traffic concentrates on the other instance.

We decided to hold off on Canary deployment as it can also lead to compatibility issues, similar to Rolling deployment.

Thus, we adopted Blue/Green zero-downtime deployment.

The environment currently in operation is referred to as blue, while the environment for the new deployment is referred to as green. When the new deployment environment is ready, all traffic previously routed to blue is directed to green.

The blue environment is then removed so that it can serve as the green environment for the next deployment.

By directing all traffic to the new green environment at once, we avoid compatibility issues with the previous version and also solve the problem of traffic concentrating on a single instance.

image

Blue/Green Deployment through Docker and Nginx

To implement Blue/Green zero-downtime deployment, Coding Duo has set up Nginx as a reverse proxy server on each instance.

image

When deployment begins, the old version and new version of the web server run on different ports (8080, 8081) on each instance.

Each instance performs health checks on the new version web server every 5 seconds to ensure it is running correctly.

Once the health check is successful, HTTP traffic coming to Nginx is routed from the old version to the new version web server.

As a result, unlike Rolling deployment, the two instances are deployed in parallel.

image

Another difference from Rolling deployment is that the old version web server is terminated after the deployment of the new version web server is complete.

image
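As a rough, hypothetical sketch of this per-instance flow (the container names, image tag, health endpoint, and the service-url.inc include file are assumptions for illustration, not the project's actual scripts), the sequence of starting the new container, health checking it every 5 seconds, switching Nginx, and then terminating the old container could look like this:

```python
import subprocess
import time
import urllib.request

# Hypothetical names and values used only for illustration.
OLD_PORT, NEW_PORT = 8080, 8081
IMAGE = "coding-duo/web-server:latest"
HEALTH_CHECK_INTERVAL = 5  # seconds, as described above
NGINX_INCLUDE = "/etc/nginx/conf.d/service-url.inc"


def is_healthy(port: int) -> bool:
    """Return True if the web server on the given local port answers its health endpoint."""
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=3) as res:
            return res.status == 200
    except OSError:
        return False


def switch_nginx(port: int) -> None:
    """Point Nginx at the given port and reload it (the Nginx side is sketched further below)."""
    with open(NGINX_INCLUDE, "w") as f:
        f.write(f"set $service_url http://127.0.0.1:{port};\n")
    subprocess.run(["sudo", "nginx", "-s", "reload"], check=True)


def deploy_new_version() -> None:
    # 1. Start the new version on the idle port while the old version keeps serving traffic.
    subprocess.run(
        ["docker", "run", "-d", "--name", "web-new",
         "-p", f"{NEW_PORT}:8080", IMAGE],
        check=True,
    )

    # 2. Health check the new container every 5 seconds until it responds correctly
    #    (a timeout and rollback path would be added in practice).
    while not is_healthy(NEW_PORT):
        time.sleep(HEALTH_CHECK_INTERVAL)

    # 3. Route the HTTP traffic arriving at Nginx to the new version.
    switch_nginx(NEW_PORT)

    # 4. Only after traffic has moved is the old version's container terminated.
    subprocess.run(["docker", "stop", "web-old"], check=True)
    subprocess.run(["docker", "rm", "web-old"], check=True)


if __name__ == "__main__":
    deploy_new_version()
```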


The reasons for structuring Coding Duo's zero-downtime deployment this way are as follows:

A conventional Blue/Green setup requires twice as many instances, since the Blue and Green environments must both be maintained.

Adding more instances increases costs, which is a downside.

We were able to address this issue using Nginx.

During the deployment process, the web servers on ports 8080 and 8081 are each managed as Docker containers.

Using Nginx, we can route requests coming to port 80 to either 8080 or 8081, effectively establishing a Blue/Green zero-downtime infrastructure.
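For the Nginx side, one common pattern (shown here as a Python provisioning helper so the server block is visible; the paths, the $service_url variable, and the include file are assumptions, not our exact configuration) is to proxy port 80 to whichever of 8080/8081 is recorded in a small include file, so that the Blue/Green switch is just a one-line rewrite plus a reload:

```python
from pathlib import Path

# Hypothetical paths; an actual setup may place these files elsewhere.
SITE_CONF = Path("/etc/nginx/conf.d/coding-duo.conf")
SERVICE_URL_INC = Path("/etc/nginx/conf.d/service-url.inc")

# Static server block: every request on port 80 is proxied to $service_url,
# which the include file sets to either http://127.0.0.1:8080 or :8081.
SERVER_BLOCK = """\
server {
    listen 80;

    include /etc/nginx/conf.d/service-url.inc;

    location / {
        proxy_pass $service_url;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
"""


def provision(initial_port: int = 8080) -> None:
    """Write the static server block once and point traffic at the initial (Blue) port."""
    SITE_CONF.write_text(SERVER_BLOCK)
    SERVICE_URL_INC.write_text(f"set $service_url http://127.0.0.1:{initial_port};\n")


if __name__ == "__main__":
    provision()
    # During a deployment, the switch_nginx() helper from the previous sketch rewrites
    # service-url.inc to point at the other port and runs `nginx -s reload` to move traffic.
```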

The final state of the zero-downtime deployment is shown below.

image

Areas for Future Improvement

The currently implemented zero-downtime deployment has two issues.

The first is that if Instance A fails to deploy for any reason while Instance B deploys successfully, Instance A will operate in the Blue environment, and Instance B in the Green environment. If a rollback is not implemented, compatibility issues will arise.

The second issue is that the Nginx port forwarding within Instances A and B does not occur simultaneously.

Nginx on each instance switches its port forwarding only after the local health check on localhost passes, and because those moments do not necessarily align across instances, compatibility issues can arise here as well.

To resolve these two issues, an instance should not switch from the Blue environment to the Green environment until the health checks on the other instances have also completed.

In other words, after each instance finishes preparing its Green environment, it should health check the other instances and only proceed with the switch to the Green environment once they are all ready.

Of course, even with this approach we cannot guarantee that every instance's Nginx finishes the port switch at exactly the same moment, but we can at least align the moment at which the switch is initiated.

Currently there are only two instances, so they can health check each other directly, but if we scale out, mutual checking may become inefficient, so we should also consider setting up a dedicated health check server.
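A minimal sketch of that coordination, assuming each instance exposes its Green container's health endpoint to its peers (the host names, port, and endpoint below are hypothetical):

```python
import time
import urllib.request

# Hypothetical peer endpoints; in practice these would come from configuration or discovery.
PEER_GREEN_HEALTH_URLS = [
    "http://instance-a.internal:8081/health",
    "http://instance-b.internal:8081/health",
]
POLL_INTERVAL = 5    # seconds
TIMEOUT = 300        # seconds


def green_ready(url: str) -> bool:
    """Return True if the Green environment behind the given URL answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=3) as res:
            return res.status == 200
    except OSError:
        return False


def wait_for_all_green_environments() -> bool:
    """Block until every instance's Green environment is healthy, or give up after TIMEOUT."""
    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        if all(green_ready(url) for url in PEER_GREEN_HEALTH_URLS):
            return True
        time.sleep(POLL_INTERVAL)
    return False


if __name__ == "__main__":
    # Each instance prepares its own Green container first, then waits for the others,
    # so that all instances initiate the Nginx switch at roughly the same moment.
    if wait_for_all_green_environments():
        print("All Green environments are healthy; initiating the Nginx switch.")
    else:
        raise SystemExit("Not every Green environment became healthy; rolling back instead.")
```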

๐Ÿ€ ์ฝ”๋”ฉํ•ด๋“€์˜ค

์ „์ฒด

์ •๋ณด

BE

FE

๊ธฐ์ˆ 

์ปจ๋ฒค์…˜

ํ…Œ์ŠคํŠธ

์„ฑ๋Šฅ ๊ฐœ์„  & ์ ‘๊ทผ์„ฑ ๊ฐœ์„  ๋ฆฌํฌํŠธ

์ธํ”„๋ผ

Clone this wiki locally