-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CA-399260: Keep both new and old certs during the switchover #6223
CA-399260: Keep both new and old certs during the switchover #6223
Conversation
In host-refresh-server-certificate, host generates and applies new pool certificate after stunnel restart. There is 5s that other hosts don't trust the new certificate. Then operations related with xapi:pool SNI tls connection will fail. Both the old and new certificates shall be trusted during the switchover to avoid this. At the end of the refresh procedure, remove_stale_cert will rename the new pem to pem and update ca bundle. Signed-off-by: Changlei Li <[email protected]>
039c0e0
to
dd3c2de
Compare
|
||
rm -f "$BUNDLE.tmp" | ||
touch "$BUNDLE.tmp" | ||
for TMP_CERT in $TMP_CERTS; do | ||
if cat "$TMP_CERT" >> "$BUNDLE.tmp"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use if
to ignore cat xxx.new.pem error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is for dealing with error conditions, it needs to be more explicit. When is the then
branch taken here and when not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if statement checks whether the cat command succeeds. If the cat command successfully appends the contents of the temporary certificate file to the bundle file, the script then appends an empty line to the bundle file using the echo "" >> "$BUNDLE.tmp"
command.
If the cat command fails, it won't stop and go on the next command even if set -e
.
As I know, we can ignore the command error in shell script in two ways
- cat "$TMP_CERT" >> "$BUNDLE.tmp" || true
- if cat "$TMP_CERT" >> "$BUNDLE.tmp"; then
If we don't ignore the cat error, the script will fail and stop, return error to xapi. Then we may get error "message: Subprocess exited with unexpected code 1; stdout = [ ]; stderr = [ cat: 21a92da1-c874-48e2-be70-e49e8ab15af1.new.pem: No such file or directory" in xapi. That is what happend in CA-362358.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the .. || true
idiom better. But what causes the failure in the first place and should we check those conditions? I assume $TMP_CERT
did not exist because >>
will succeed if the resulting file can be written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what I am asking for help in the channel. There is a race condition in CA-362358, new.pem is found but it is renamed by other thread when cat. The convention is to add lock, but it seems no one wanted to add lock for this trifle. Then the new.pem was filtered out in #4893. I want to ignore the shell script error here following this thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why these certificates are handled in a special way if they are meant to be included in the bundle in any case.
This issue was introduced by #4893
This PR was trying to fix a race condition:
Script to refresh CA bundle fails because a pem is renamed
This is a race condition between the code that refreshes directory of certificates and the generation of new certificates.
What happens is that the latter generates a temporary file ending in "new", the former reads the directory, which contains this file, the latter removes the temporary file and when the former tries to read the file it fails.
The PR removed the inclusion of .new.pem certificates, which we see that is wrong, but we need better coordination between the code that generates new certificates and this refresh of the bundle.
I believe we need to revert the change done to this script isntead of adding more code; and then fix the race between the two actions: we don't want to allow new pool certificates being generated while the pool is already in the process of changing certificates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Certificates are refreshed every hour during testing but not in production - making this problem more visible
- The call is "xe host-refresh-server-certificate host=$HOST"
- Is this the collision between two instances of the same API call or a collision between unrelated API calls?
I suggest to close this PR and analyze the situation on the ticket. Xapi implements a lock that ensures only one instance of the bundling script is running. So we need to understand how this race happens. The script could be also easily improved to protect against using names that collide when multiple instances are running. |
In host-refresh-server-certificate, host generates and applies new pool certificate after stunnel restart. There is 5s that other hosts don't trust the new certificate. Then operations related with xapi:pool SNI tls connection will fail. Both the old and new certificates shall be trusted during the switchover to avoid this. At the end of the refresh procedure, remove_stale_cert will rename the new pem to pem and update ca bundle.