Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA-399260: Keep both new and old certs during the switchover #6223

Conversation

changlei-li
Copy link
Contributor

In host-refresh-server-certificate, host generates and applies new pool certificate after stunnel restart. There is 5s that other hosts don't trust the new certificate. Then operations related with xapi:pool SNI tls connection will fail. Both the old and new certificates shall be trusted during the switchover to avoid this. At the end of the refresh procedure, remove_stale_cert will rename the new pem to pem and update ca bundle.

@changlei-li changlei-li marked this pull request as draft January 14, 2025 08:00
scripts/update-ca-bundle.sh Fixed Show fixed Hide fixed
scripts/update-ca-bundle.sh Fixed Show fixed Hide fixed
scripts/update-ca-bundle.sh Fixed Show fixed Hide fixed
scripts/update-ca-bundle.sh Fixed Show fixed Hide fixed
scripts/update-ca-bundle.sh Fixed Show fixed Hide fixed
scripts/update-ca-bundle.sh Fixed Show fixed Hide fixed
In host-refresh-server-certificate, host generates and applies
new pool certificate after stunnel restart. There is 5s that
other hosts don't trust the new certificate. Then operations
related with xapi:pool SNI tls connection will fail. Both
the old and new certificates shall be trusted during the
switchover to avoid this. At the end of the refresh procedure,
remove_stale_cert will rename the new pem to pem and update
ca bundle.

Signed-off-by: Changlei Li <[email protected]>
@changlei-li changlei-li force-pushed the private/changleli/CA-399260 branch from 039c0e0 to dd3c2de Compare January 14, 2025 08:08

rm -f "$BUNDLE.tmp"
touch "$BUNDLE.tmp"
for TMP_CERT in $TMP_CERTS; do
if cat "$TMP_CERT" >> "$BUNDLE.tmp"; then
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use if to ignore cat xxx.new.pem error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is for dealing with error conditions, it needs to be more explicit. When is the then branch taken here and when not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if statement checks whether the cat command succeeds. If the cat command successfully appends the contents of the temporary certificate file to the bundle file, the script then appends an empty line to the bundle file using the echo "" >> "$BUNDLE.tmp" command.
If the cat command fails, it won't stop and go on the next command even if set -e.

As I know, we can ignore the command error in shell script in two ways

  1. cat "$TMP_CERT" >> "$BUNDLE.tmp" || true
  2. if cat "$TMP_CERT" >> "$BUNDLE.tmp"; then

If we don't ignore the cat error, the script will fail and stop, return error to xapi. Then we may get error "message: Subprocess exited with unexpected code 1; stdout = [ ]; stderr = [ cat: 21a92da1-c874-48e2-be70-e49e8ab15af1.new.pem: No such file or directory" in xapi. That is what happend in CA-362358.

Copy link
Contributor

@lindig lindig Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the .. || true idiom better. But what causes the failure in the first place and should we check those conditions? I assume $TMP_CERT did not exist because >> will succeed if the resulting file can be written.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I am asking for help in the channel. There is a race condition in CA-362358, new.pem is found but it is renamed by other thread when cat. The convention is to add lock, but it seems no one wanted to add lock for this trifle. Then the new.pem was filtered out in #4893. I want to ignore the shell script error here following this thought.

Copy link
Member

@psafont psafont Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why these certificates are handled in a special way if they are meant to be included in the bundle in any case.

This issue was introduced by #4893

This PR was trying to fix a race condition:

Script to refresh CA bundle fails because a pem is renamed
This is a race condition between the code that refreshes directory of certificates and the generation of new certificates.
What happens is that the latter generates a temporary file ending in "new", the former reads the directory, which contains this file, the latter removes the temporary file and when the former tries to read the file it fails.

The PR removed the inclusion of .new.pem certificates, which we see that is wrong, but we need better coordination between the code that generates new certificates and this refresh of the bundle.

I believe we need to revert the change done to this script isntead of adding more code; and then fix the race between the two actions: we don't want to allow new pool certificates being generated while the pool is already in the process of changing certificates

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Certificates are refreshed every hour during testing but not in production - making this problem more visible
  • The call is "xe host-refresh-server-certificate host=$HOST"
  • Is this the collision between two instances of the same API call or a collision between unrelated API calls?

@lindig
Copy link
Contributor

lindig commented Jan 17, 2025

I suggest to close this PR and analyze the situation on the ticket. Xapi implements a lock that ensures only one instance of the bundling script is running. So we need to understand how this race happens. The script could be also easily improved to protect against using names that collide when multiple instances are running.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants